Transformer Pipeline Steps
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier(
['I have been waiting to start Hugging Face course',
'That is such a bad news!'])
[{'label': 'POSITIVE', 'score': 0.9987032413482666},
{'label': 'NEGATIVE', 'score': 0.9998062252998352}]
When we ran the sentiment analysis task above on a few examples, we got the results shown. Now, we will look behind the scenes at how this pipeline works.
The pipeline groups together three steps:
- Preprocessing the raw input text
- Passing the preprocessed input to the model
- Postprocessing the model's output
Let’s dive into each of them.
Preprocessing with a tokenizer
The tokenizer is responsible for:
- Splitting the input into words, subwords, or symbols (like punctuation) called tokens
- Mapping each token to an integer
To do this, we use the AutoTokenizer class and its from_pretrained() method. We use the distilbert-base-uncased-finetuned-sst-2-english checkpoint, as it is the default checkpoint for the sentiment-analysis pipeline.
from transformers import AutoTokenizer
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
The tokenizer gives us a dictionary that is ready to feed to our model. We only need to convert the lists of input IDs to tensors, as Transformer models accept only tensors as input.
We can specify the type of tensors we want using the return_tensors argument. If no type is passed, we will get a list of lists as a result.
sentences = [
'I have been waiting to start Hugging Face course',
'That is such a bad news!']
inputs = tokenizer(
sentences,
padding=True,
truncation=True,
return_tensors='pt')
print(inputs)
{'input_ids': tensor([
[ 101, 1045, 2031, 2042, 3403, 2000, 2707, 17662, 2227, 2607, 102],
[ 101, 2008, 2003, 2107, 1037, 2919, 2739, 999, 102, 0, 0]]),
'attention_mask': tensor([
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}
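To see what these IDs correspond to, we can map them back to tokens with the tokenizer's convert_ids_to_tokens() method. This is just an optional sanity check, not part of the pipeline itself:
# Optional check: map the IDs back to tokens. The first and last IDs are
# the special [CLS]/[SEP] tokens, and the trailing zeros in the second
# sentence are [PAD] tokens added because of padding=True.
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist()))
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][1].tolist()))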
Passing the input to the model
Now we download the pretrained model, just as we did with the tokenizer, using the AutoModel class and its from_pretrained() method.
from transformers import AutoModel
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModel.from_pretrained(checkpoint)
The model output usually has three dimensions, specifying the following:
- Batch size: Number of samples processed at a time
- Sequence length: Length of the numerical representation of each sample
- Hidden size: Vector dimension of each token's representation
  - for smaller models, the hidden size can be 768
  - for larger models, it can reach 3072 or more
outputs = model(**inputs)
We can see the shape of the output once we feed the processed input to the model:
print(outputs.last_hidden_state.shape)
torch.Size([2, 11, 768])
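To connect this shape back to the three dimensions listed above, we can unpack it explicitly (a small illustrative check, nothing more):
# 2 sentences in the batch, 11 tokens per padded sequence, and a
# 768-dimensional vector for each token (DistilBERT's hidden size).
batch_size, sequence_length, hidden_size = outputs.last_hidden_state.shape
print(batch_size, sequence_length, hidden_size)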
There are many architectures available in Transformers, each designed to tackle a specific task.
For example, in the above case, if we want a model with a sequence classification head (i.e., a model that classifies the input as positive or negative), we can use the AutoModelForSequenceClassification class instead of AutoModel.
from transformers import AutoModelForSequenceClassification
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
Now, let's look at the output shape generated by this model.
print(outputs.logits.shape)
torch.Size([2, 2])
The output shape is 2 x 2, as we have just two sentences and two labels.
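The number of labels is stored in the model configuration, so we can confirm where that second dimension comes from (num_labels is a standard attribute of Transformers model configs):
# The classification head has one output per label defined in the config.
print(model.config.num_labels)
# 2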
Postprocessing
Let's print the output we got from our model:
print(outputs.logits)
tensor([[-3.2289, 3.4177],
[ 4.7612, -3.7872]], grad_fn=<AddmmBackward0>)
All Transformer models output logits, the raw, unnormalized scores. To convert those numbers into probabilities, we need to pass them through a softmax layer.
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=1)
print(predictions)
tensor([[1.2968e-03, 9.9870e-01],
[9.9981e-01, 1.9383e-04]], grad_fn=<SoftmaxBackward0>)
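As a quick sanity check that these are valid probabilities, each row should sum to 1:
# Each row is a probability distribution over the two labels.
print(predictions.sum(dim=1))
# both entries are (approximately) 1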
We can explore the model config's id2label attribute to get the label corresponding to each position.
print(model.config.id2label)
{0: 'NEGATIVE', 1: 'POSITIVE'}
The model predicted the following:
- First sentence: NEGATIVE: 0.0012968, POSITIVE: 0.9987
- Second sentence: NEGATIVE: 0.99981, POSITIVE: 0.00019383
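To recover the pipeline-style output from these probabilities, we can take the argmax of each row and look up its label name, which is essentially what the pipeline's postprocessing does. This sketch reuses the sentences, predictions, and model objects defined above:
# Pick the highest-probability label for each sentence and pair it with
# its score, mirroring the pipeline output shown at the top of the post.
predicted_ids = predictions.argmax(dim=1)
for sentence, pred_id, probs in zip(sentences, predicted_ids, predictions):
    label = model.config.id2label[pred_id.item()]
    print(f'{label}: {probs[pred_id].item():.4f} ({sentence})')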
In this post, we have explored the three steps involved in the pipeline (preprocessing, passing inputs to the model, and postprocessing), with examples for each.