
Hugging Face Transformers in Action: Learning How to Use AI for NLP

Natural Language Processing (NLP) has changed the way we interact with technology.

Remember when chatbots first appeared and sounded like robots? Thankfully, those days are behind us!

Transformer models waved their magic wand and reshaped NLP practice. But before you close this post thinking "geez, transformers are too dense to read about", bear with me. This won't be another technical article trying to teach you the math behind this amazing technology; instead, we'll learn by doing and see what it can do for us.

With the Transformers pipeline from Hugging Face, NLP tasks are easier than ever.

Let's check it out!

The Only Transformer Explanation You'll Need

Think of transformer models as the top of the NLP world.

Transformers excel because of their ability to focus on different parts of the input sequence using a technique called "self-attention", which lets the model decide which parts of a sentence are most important to focus on at any given time.

Ever heard of BERT, GPT, or RoBERTa? That's them! BERT (Bidirectional Encoder Representations from Transformers) is a language model released by Google AI in 2018 that understands text context by reading words from left to right and right to left simultaneously.
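You can see this bidirectional reading at work with a quick fill-mask experiment (a minimal sketch; the model weights download automatically on first use):

```python
from transformers import pipeline

# BERT uses the context on BOTH sides of [MASK] to rank candidate words
unmasker = pipeline("fill-mask", model="bert-base-uncased")
predictions = unmasker("The capital of France is [MASK].")

# Show the top three candidates and their probabilities
for pred in predictions[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```

Because the model reads the whole sentence, not just the words before the blank, it can use "France" to pick a fitting word for the mask.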

Enough talking, let's dive into the transformers package [1].

Introduction to the Transformers Pipeline

The Transformers library provides a complete toolkit for training and using pre-trained models. The Pipeline class, which is our main topic, provides an easy-to-use interface for a variety of tasks, e.g.:

  • Text generation
  • Image segmentation
  • Speech recognition
  • Document QA.

Preparation

Before starting, let's run through the basics and gather our tools. We'll need Python, the transformers library, and either PyTorch or TensorFlow. Installation is business as usual: pip install transformers.

Distributions like Anaconda or platforms like Google Colab already ship these by default. No problem there.
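Either way, a quick import is enough to confirm the setup:

```python
import transformers

# Prints the installed library version, e.g. "4.x.y"
print(transformers.__version__)
```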

The Pipeline class allows you to perform many machine learning tasks using any model available in the Hugging Face Hub. It's as easy as plug and play.

Although every task comes with a pre-configured default model and processor, you can easily customize this by passing the model parameter to switch to a different model of your choice.
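For example, here is a sentiment task with the model chosen explicitly instead of relying on the default (a short sketch; any compatible checkpoint from the Hub would do):

```python
from transformers import pipeline

# Pin an explicit model instead of relying on the task's default
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)
result = classifier("Pinning a model makes runs reproducible.")
print(result)
```

Pinning the model also silences the "no model was supplied" warning you'll see below and makes your results reproducible across library releases.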

The code

Let's start with transformers 101 and see how they work before we go deeper. Our first task is to analyze the sentiment of a news headline.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("Instagram wants to limit hashtag spam.")

The output is the following:

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f.
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use cpu
[{'label': 'NEGATIVE', 'score': 0.988932728767395}]

Since we did not provide a model parameter, the pipeline went with the default option. The result shows that the sentiment of this headline is NEGATIVE with about 99% confidence. Additionally, we can pass a list of sentences to classify, not just one.
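Passing a list instead of a single string looks like this (same default model; each sentence gets its own prediction):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
headlines = [
    "Instagram wants to limit hashtag spam.",
    "The new release fixes several long-standing bugs.",
]
# Passing a list returns one prediction dict per sentence
results = classifier(headlines)
for headline, res in zip(headlines, results):
    print(f"{res['label']} ({res['score']:.2f}) - {headline}")
```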

Pretty simple, isn't it? But that's not all. Let's explore some other tasks.

Zero-Shot Classification

Zero-shot classification means labeling text the model was never explicitly trained to label, so we have no labeled examples to learn from. All we have to do is pass a few candidate classes and let the model pick the best fit. This can be very useful when creating datasets for machine learning training.

This time, we feed the pipeline a model argument and a list of sentences to classify.

classifier = pipeline("zero-shot-classification", model = 'facebook/bart-large-mnli')
classifier(
    ["Inter Miami wins the MLS", "Match tonight between Chiefs vs. Patriots", "Michael Jordan plans to sell Charlotte Hornets"],
    candidate_labels=["soccer", "football", "basketball"]
    )
[{'sequence': 'Inter Miami wins the MLS',
  'labels': ['soccer', 'football', 'basketball'],
  'scores': [0.9162040948867798, 0.07244189083576202, 0.011354007758200169]},
 {'sequence': 'Match tonight between Chiefs vs. Patriots',
  'labels': ['football', 'basketball', 'soccer'],
  'scores': [0.9281435608863831, 0.0391676239669323, 0.032688744366168976]},
 {'sequence': 'Michael Jordan plans to sell Charlotte Hornets',
  'labels': ['basketball', 'football', 'soccer'],
  'scores': [0.9859175682067871, 0.009983371943235397, 0.004099058918654919]}]

Looks like the model did a great job labeling these sentences!

Text Generation

The package can also generate text. This is a great way to build a cute little generator for our kids' bedtime stories. We increase the temperature parameter to make the model more creative.

generator = pipeline("text-generation", temperature=0.8)
generator("Once upon a time, in a land where the King Pineapple was")
[{'generated_text': 
"Once upon a time, in a land where the King Pineapple was a common
 crop, the Queen of the North had lived in a small village. The Queen had always 
lived in a small village, and her daughter, who was also the daughter of the Queen,
 had lived in a larger village. The royal family would come to the Queen's village,
 and then the Queen would return to her castle and live there with her daughters. 
In the middle of the night, she would lay down on the royal bed and kiss the princess
 at least once, and then she would return to her castle to live there with her men. 
In the daytime, however, the Queen would be gone forever, and her mother would be alone.
The reason for this disappearance, in the form of the Great Northern Passage 
and the Great Northern Passage, was the royal family had always wanted to take 
the place of the Queen. In the end, they took the place of the Queen, and went 
with their daughter to meet the King. At that time, the King was the only person 
on the island who had ever heard of the Great Northern Passage, and his return was
 in the past.
After Queen Elizabeth's death, the royal family went to the 
Great Northern Passage, to seek out the Princess of England and put her there. 
The Princess of England had been in"}]

Named Entity Recognition

This task can identify a person (PER), place (LOC), or organization (ORG) in a document. That's great for quickly building a marketing list of leads, for example.

ner = pipeline("ner", grouped_entities=True)
ner("The man landed on the moon in 1969. Neil Armstrong was the first man to step on the Moon's surface. He was a NASA Astronaut.")
[{'entity_group': 'PER', 'score': np.float32(0.99960065),'word': 'Neil Armstrong',
  'start': 36,  'end': 50},

 {'entity_group': 'LOC',  'score': np.float32(0.82190216),  'word': 'Moon',
  'start': 84,  'end': 88},

 {'entity_group': 'ORG',  'score': np.float32(0.9842771),  'word': 'NASA',
  'start': 109,  'end': 113},

 {'entity_group': 'MISC',  'score': np.float32(0.8394754),  'word': 'As',
  'start': 114,  'end': 116}]

Summarization

Perhaps one of the most used tasks, summarization lets us shorten a text while keeping its essence and most important parts. Let's summarize this Wikipedia passage about transformers.

summarizer = pipeline("summarization")
summarizer("""
In deep learning, the transformer is an artificial neural network architecture based
on the multi-head attention mechanism, in which text is converted to numerical
 representations called tokens, and each token is converted into a vector via lookup
 from a word embedding table.[1] At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished.

Transformers have the advantage of having no recurrent units, therefore requiring 
less training time than earlier recurrent neural architectures (RNNs) such as long 
short-term memory (LSTM).[2] Later variations have been widely adopted for training
 large language models (LLMs) on large (language) datasets.[3]
""")
[{'summary_text': 
' In deep learning, the transformer is an artificial neural network architecture 
based on the multi-head attention mechanism . Transformerers have the advantage of
 having no recurrent units, therefore requiring less training time than earlier 
recurrent neural architectures (RNNs) such as long short-term memory (LSTM)'}]

Very nice!
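If you want to control the output length, the summarization pipeline also accepts max_length and min_length arguments (a brief sketch on a shortened version of the passage above):

```python
from transformers import pipeline

summarizer = pipeline("summarization")
text = (
    "In deep learning, the transformer is an artificial neural network "
    "architecture based on the multi-head attention mechanism. Transformers "
    "have no recurrent units and therefore require less training time than "
    "earlier recurrent architectures such as LSTMs."
)
# Constrain the summary to between 10 and 30 generated tokens
summary = summarizer(text, max_length=30, min_length=10)
print(summary[0]["summary_text"])
```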

Image Recognition

There are some more complex tasks, such as image classification. And it's as easy to use as any other.

image_classifier = pipeline(
    task="image-classification", model="google/vit-base-patch16-224"
)
result = image_classifier(
    "..."  # the image URL was omitted in the original post; use your own image path or URL
)
print(result)
Photo by Vitalii Khodzinskyi on Unsplash
[{'label': 'Yorkshire terrier', 'score': 0.9792122840881348}, 
{'label': 'Australian terrier', 'score': 0.00648861238732934}, 
{'label': 'silky terrier, Sydney silky', 'score': 0.00571345305070281}, 
{'label': 'Norfolk terrier', 'score': 0.0013639888493344188}, 
{'label': 'Norwich terrier', 'score': 0.0010306559270247817}]

So, with these few examples, you can see how simple it is to use the Transformers library to perform different tasks with very little code.

Wrapping up

What if we wrapped up our knowledge by applying it to a practical, small task?

Let's create a simple Streamlit app that reads a resume, returns its sentiment, and categorizes the text tone as one of ["Senior", "Junior", "Trainee", "Blue-collar", "White-collar", "Self-employed"].

In the following code:

  • Import packages
  • Create a Title and Subtitle for the page
  • Add a text input field
  • Tokenize the text and split it into chunks for the transformer pipeline. See the list of models [4].
import streamlit as st
from transformers import pipeline
from transformers import AutoTokenizer

st.title("Resumé Sentiment Analysis")
st.caption("Checking the sentiment and language tone of your resume")

# Add input text area
text = st.text_area("Enter your resume text here")

# 1. Load your desired tokenizer
model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# 2. Tokenize the text without special tokens, padding, or truncation
#    so we can slice the token ids manually
tokens = tokenizer(text, add_special_tokens=False)["input_ids"]

# 3. Split into chunks of 500 tokens with an overlap of 100 tokens so that
#    context is not lost (BERT-style models accept at most 512 tokens)
chunk_size, overlap = 500, 100
step = chunk_size - overlap
chunks = [tokens[i:i + chunk_size] for i in range(0, max(len(tokens), 1), step)]

# 4. Convert each chunk back to a string the pipelines can consume
decoded_chunks = [tokenizer.decode(chunk) for chunk in chunks]

st.write(f"Created {len(decoded_chunks)} chunks.")

Next, we'll run the transformer pipeline to:

  • Run sentiment analysis and return the confidence %.
  • Classify the text tone and return the confidence %.
# Initialize sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis")

# Perform sentiment analysis    
if st.button("Analyze"):
    col1, col2 = st.columns(2)

    with col1:  
        # Sentiment analysis
        sentiment = sentiment_pipeline(decoded_chunks)[0]
        st.write(f"Sentiment: {sentiment['label']}")
        st.write(f"Confidence: {100*sentiment['score']:.1f}%")
    
    with col2:
        # Categorize tone (candidate_labels is a call-time argument,
        # not a pipeline constructor argument)
        tone_pipeline = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
        tone = tone_pipeline(
            decoded_chunks,
            candidate_labels=["Senior", "Junior", "Trainee", "Blue-collar", "White-collar", "Self-employed"],
        )[0]
        
        st.write(f"Tone: {tone['labels'][0]}")
        st.write(f"Confidence: {100*tone['scores'][0]:.1f}%")

Here is a screenshot.

Sentiment analysis and language tone. Image by the author.

Before You Go

Hugging Face (HF) Transformers pipelines are truly a game changer for data workers. They provide an incredibly streamlined way to tackle complex machine learning tasks, such as text generation or image classification, using just a few lines of code.

HF has already done the heavy lifting by wrapping complex models in simple, intuitive methods.

This takes the focus off low-level coding and allows us to focus on what's really important: using our creativity to build impactful, real-world applications.

If you liked this content, find out more about me on my website.

GitHub Repository

References

[1. Transformers package]

[2. Transformers Pipelines]

[3. Pipelines Examples]

[4. HF Models] huggingface.co/models
