Transformer Pipeline Steps
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier(
['I have been waiting to start Hugging Face course',
'That is such a bad news!'])
[{'label': 'POSITIVE', 'score': 0.9987032413482666},
{'label': 'NEGATIVE', 'score': 0.9998062252998352}]
When we ran the sentiment analysis task above on a few examples, we got the results shown. Now, we will look behind the scenes at how this pipeline works.
The pipeline groups together three steps:
- Preprocessing the raw input text
- Passing the preprocessed input to the model
- Postprocessing the model's output
Let’s dive into each of them.
Preprocessing with a tokenizer
The tokenizer is responsible for:
- Splitting the input into words, subwords, or symbols (like punctuation) called tokens
- Mapping each token to an integer
To do this, we use the AutoTokenizer class and its from_pretrained() method. We use the distilbert-base-uncased-finetuned-sst-2-english checkpoint, as it is the default checkpoint for the sentiment-analysis pipeline.
from transformers import AutoTokenizer
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
The tokenizer gives us a dictionary that is ready to feed to our model. We only need to convert the lists of input IDs to tensors, as Transformer models accept only tensors as input.
We can specify the type of tensors we want using the return_tensors argument. If no type is passed, we will get a list of lists as a result.
sentences = [
'I have been waiting to start Hugging Face course',
'That is such a bad news!']
inputs = tokenizer(
sentences,
padding=True,
truncation=True,
return_tensors='pt')
print(inputs)
{'input_ids': tensor([
[ 101, 1045, 2031, 2042, 3403, 2000, 2707, 17662, 2227, 2607, 102],
[ 101, 2008, 2003, 2107, 1037, 2919, 2739, 999, 102, 0, 0]]),
'attention_mask': tensor([
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}
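To see what these IDs correspond to, we can map them back to tokens with the tokenizer's convert_ids_to_tokens() method. This is just an optional sanity check, not part of the pipeline itself:
# Optional check: map the IDs back to tokens. The first and last IDs are
# the special [CLS]/[SEP] tokens, and the trailing zeros in the second
# sentence are [PAD] tokens added because of padding=True.
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist()))
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][1].tolist()))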
Passing the input to the model
Now we download the pretrained model, just as we did with the tokenizer, using the AutoModel class and its from_pretrained() method.
from transformers import AutoModel
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModel.from_pretrained(checkpoint)
The model output usually has three dimensions, specifying the following:
- Batch size: Number of samples processed at a time
- Sequence length: Length of the numerical representation of each sample
- Hidden size: Vector dimension of each token's representation
  - for smaller models, the hidden size can be 768
  - for larger models, it can reach 3072 or more
outputs = model(**inputs)
We can see the shape of the output once we feed the processed input to the model:
print(outputs.last_hidden_state.shape)
torch.Size([2, 11, 768])
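To connect this shape back to the three dimensions listed above, we can unpack it explicitly (a small illustrative check, nothing more):
# 2 sentences in the batch, 11 tokens per padded sequence, and a
# 768-dimensional vector for each token (DistilBERT's hidden size).
batch_size, sequence_length, hidden_size = outputs.last_hidden_state.shape
print(batch_size, sequence_length, hidden_size)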
There are many architectures available in Transformers, each designed to tackle a specific task.
For example, in the above case, if we want a model with a sequence classification head (i.e., a model that classifies the input as positive or negative), we can use the AutoModelForSequenceClassification class instead of AutoModel.
from transformers import AutoModelForSequenceClassification
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
Now, let's look at the output shape generated by this model.
print(outputs.logits.shape)
torch.Size([2, 2])
The output shape is 2 x 2, as we have just two sentences and two labels.
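The number of labels is stored in the model configuration, so we can confirm where that second dimension comes from (num_labels is a standard attribute of Transformers model configs):
# The classification head has one output per label defined in the config.
print(model.config.num_labels)
# 2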
Postprocessing
Let's print the output we got from our model:
print(outputs.logits)
tensor([[-3.2289, 3.4177],
[ 4.7612, -3.7872]], grad_fn=<AddmmBackward0>)
All Transformer models output logits, the raw, unnormalized scores. To convert those numbers into probabilities, we need to pass them through a softmax layer.
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=1)
print(predictions)
tensor([[1.2968e-03, 9.9870e-01],
[9.9981e-01, 1.9383e-04]], grad_fn=<SoftmaxBackward0>)
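As a quick sanity check that these are valid probabilities, each row should sum to 1:
# Each row is a probability distribution over the two labels.
print(predictions.sum(dim=1))
# both entries are (approximately) 1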
We can explore the model config's id2label attribute to get the label corresponding to each position.
print(model.config.id2label)
{0: 'NEGATIVE', 1: 'POSITIVE'}
The model predicted the following:
- First sentence: NEGATIVE: 0.0012968, POSITIVE: 0.9987
- Second sentence: NEGATIVE: 0.99981, POSITIVE: 0.00019383
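To recover the pipeline-style output from these probabilities, we can take the argmax of each row and look up its label name, which is essentially what the pipeline's postprocessing does. This sketch reuses the sentences, predictions, and model objects defined above:
# Pick the highest-probability label for each sentence and pair it with
# its score, mirroring the pipeline output shown at the top of the post.
predicted_ids = predictions.argmax(dim=1)
for sentence, pred_id, probs in zip(sentences, predicted_ids, predictions):
    label = model.config.id2label[pred_id.item()]
    print(f'{label}: {probs[pred_id].item():.4f} ({sentence})')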
In this post, we have explored the three steps involved in the pipeline (preprocessing, passing inputs to the model, and postprocessing), with examples for each.