Handling Multiple Sequences

Create a batch of inputs and send it to the model

First, let’s tokenize a sample sequence, convert the resulting list of token IDs into a tensor, and send it to the model.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence = "I am currently learning machine learning."

# create tokens
tokens = tokenizer.tokenize(sequence)
print(tokens)
['i', 'am', 'currently', 'learning', 'machine', 'learning', '.']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)
[1045, 2572, 2747, 4083, 3698, 4083, 1012]
ids = torch.tensor(input_ids)
print(ids)
tensor([1045, 2572, 2747, 4083, 3698, 4083, 1012])
model(ids)
---------------------------------------------------------------------------

Traceback (most recent call last)

<ipython-input-8-527d0145ab42> in <cell line: 1>()
----> 1 model(ids)

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
    1530             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
    1531         else:
-> 1532             return self._call_impl(*args, **kwargs)
    1533 
    1534     def _call_impl(self, *args, **kwargs):

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
    1539                 or _global_backward_pre_hooks or _global_backward_hooks
    1540                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541             return forward_call(*args, **kwargs)
    1542 
    1543         try:

/usr/local/lib/python3.10/dist-packages/transformers/models/distilbert/modeling_distilbert.py in forward(self, input_ids, attention_mask, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
    988         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    989 
--> 990         distilbert_output = self.distilbert(
    991             input_ids=input_ids,
    992             attention_mask=attention_mask,

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
    1530             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
    1531         else:
-> 1532             return self._call_impl(*args, **kwargs)
    1533 
    1534     def _call_impl(self, *args, **kwargs):

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
    1539                 or _global_backward_pre_hooks or _global_backward_hooks
    1540                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541             return forward_call(*args, **kwargs)
    1542 
    1543         try:

/usr/local/lib/python3.10/dist-packages/transformers/models/distilbert/modeling_distilbert.py in forward(self, input_ids, attention_mask, head_mask, inputs_embeds, output_attentions, output_hidden_states, return_dict)
    788             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
    789         elif input_ids is not None:
--> 790             self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
    791             input_shape = input_ids.size()
    792         elif inputs_embeds is not None:

/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py in warn_if_padding_and_no_attention_mask(self, input_ids, attention_mask)
    4539 
    4540         # Check only the first and last input IDs to reduce overhead.
-> 4541         if self.config.pad_token_id in input_ids[:, [-1, 0]]:
    4542             warn_string = (
    4543                 "We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See "

IndexError: too many indices for tensor of dimension 1

Why is the model failing?

We sent a single sequence to the model, but Transformers models expect a batch of sequences (a 2D tensor) by default. Let’s see what the tokenizer produces when we ask it to return PyTorch tensors.

tokenized_sequence = tokenizer(sequence, return_tensors = 'pt')
print(tokenized_sequence['input_ids'])
tensor([[ 101, 1045, 2572, 2747, 4083, 3698, 4083, 1012,  102]])

From the tokenizer’s output, we can clearly see that it didn’t just convert the input IDs into a tensor: it also added a batch dimension on top, along with the special [CLS] and [SEP] tokens (IDs 101 and 102).
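To make the extra batch dimension explicit, we can check the shape of the returned tensor (a quick sanity check, reusing tokenized_sequence from above):

# first dimension is the batch size (1 sequence), second is the sequence length
print(tokenized_sequence['input_ids'].shape)
torch.Size([1, 9])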

Let’s try again by adding a dimension to the list of input IDs.

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
# create tokens
tokens = tokenizer.tokenize(sequence)
print(tokens)
['i', 'am', 'currently', 'learning', 'machine', 'learning', '.']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)
[1045, 2572, 2747, 4083, 3698, 4083, 1012]
ids = torch.tensor([input_ids])
print(ids)
tensor([[1045, 2572, 2747, 4083, 3698, 4083, 1012]])
output = model(ids)
print(output.logits)
tensor([[-0.6310,  0.7762]], grad_fn=<AddmmBackward0>)

Batching

Batching is the process of sending multiple sequences through the model at once. So far, we have only built a batch containing a single sequence.
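As a minimal sketch, here is a batch built from two copies of the same sequence, reusing the input_ids list from above; since the rows are identical, the logits should simply repeat:

# stack the same list of IDs twice to form a batch of two sequences
batched_ids = [input_ids, input_ids]
output = model(torch.tensor(batched_ids))
print(output.logits)  # expected: the logits row from above, repeated once per sequence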

Padding the sequences

Padding makes sure all our sequences have the same length by adding a special token, called the padding token, to the shorter sequences, so that the batch forms a rectangular tensor.

The padding token ID can be found in tokenizer.pad_token_id.
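For the checkpoint used here, we can quickly check the padding token and its ID (these are the 0s we will see in the padded outputs below):

print(tokenizer.pad_token, tokenizer.pad_token_id)
[PAD] 0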

For example, suppose we have a batch of two sequences whose input IDs have different lengths; this nested list cannot be converted into a tensor as it stands:

batch_ids = [
    [100, 100, 100],
    [100, 100]
    ]

Now, let’s send the sequences to the model individually and then together as a padded batch.

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
seq1_ids = [[100, 100, 100]]
seq2_ids = [[100, 100]]

batch_ids = [
  [100, 100, 100],
  [100, 100, tokenizer.pad_token_id]
]
print(model(torch.tensor(seq1_ids)).logits)
print(model(torch.tensor(seq2_ids)).logits)
print(model(torch.tensor(batch_ids)).logits)
tensor([[ 1.4738, -1.3271]], grad_fn=<AddmmBackward0>)
tensor([[ 1.2205, -1.1099]], grad_fn=<AddmmBackward0>)
tensor([[ 1.4738, -1.3271],
        [ 1.7130, -1.4950]], grad_fn=<AddmmBackward0>)

From the output above, we can see that the logits for the second sequence in the batch are different from the logits we got when passing it on its own. This is because the attention layers attend to every token in a sequence, including the padding tokens, which changes the sequence’s representation.

To get the same result whether the second sequence is passed individually or inside a batch, we need to explicitly tell the attention layers to ignore the padding tokens. This is done with an attention mask, covered below.

First, let’s look at an example with multiple sequences padded using different padding strategies:

sequences = [
    "I am learning machine learning",
    "I am excited to lern new frameworks in ML"
    ]

# pad sequences up to maximum sequence length
inputs = tokenizer(sequences, padding='longest')
print(inputs)
{
    'input_ids': [
            [101, 1045, 2572, 4083, 3698, 4083, 102, 0, 0, 0, 0, 0, 0], 
            [101, 1045, 2572, 7568, 2000, 3393, 6826, 2047, 7705, 2015, 1999, 19875, 102]
            ], 
    'attention_mask': [
            [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
}
# pad sequences up to model maximum length (512 for BERT)
inputs = tokenizer(sequences, padding='max_length')
print(inputs)
{'input_ids': [[101, 1045, 2572, 4083, 3698, 4083, 102, 0, 0, ..., 0], [101, 1045, 2572, 7568, 2000, 3393, 6826, 2047, 7705, 2015, 1999, 19875, 102, 0, 0, ..., 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, ..., 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ..., 0]]}
(output shortened: both sequences are padded with 0s up to the model maximum of 512 tokens)
# pad sequences up to specified maximum length
inputs = tokenizer(sequences, padding='max_length', max_length=8)
print(inputs)
{'input_ids': [[101, 1045, 2572, 4083, 3698, 4083, 102, 0], [101, 1045, 2572, 7568, 2000, 3393, 6826, 2047, 7705, 2015, 1999, 19875, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

We can also truncate sequences:

# truncate sequences that are longer than the model max length
inputs = tokenizer(
    sequences,
    truncation=True
    )
print(inputs)
{'input_ids': [[101, 1045, 2572, 4083, 3698, 4083, 102], [101, 1045, 2572, 7568, 2000, 3393, 6826, 2047, 7705, 2015, 1999, 19875, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
# truncate sequences that are longer than the specified max length
inputs = tokenizer(
    sequences,
    max_length=8,
    truncation=True
)
print(inputs)
{'input_ids': [[101, 1045, 2572, 4083, 3698, 4083, 102], [101, 1045, 2572, 7568, 2000, 3393, 6826, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}

Moreover, we can ask the tokenizer to return tensors for different frameworks via the return_tensors argument:

  • pt returns PyTorch tensors,
  • tf returns TensorFlow tensors, and
  • np returns NumPy arrays.
# return PyTorch tensors
inputs = tokenizer(sequences, padding=True, return_tensors='pt')
print(inputs)
{'input_ids': tensor([
    [  101,  1045,  2572,  4083,  3698,  4083,   102,     0,     0,     0,     0,     0,     0],
    [  101,  1045,  2572,  7568,  2000,  3393,  6826,  2047,  7705,  2015,    1999, 19875,   102]]), 
'attention_mask': tensor([
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
# return TensorFlow tensors
inputs = tokenizer(sequences, padding=True, return_tensors='tf')
print(inputs)
{'input_ids': <tf.Tensor: shape=(2, 13), dtype=int32, 
            numpy=array([[  101,  1045,  2572,  4083,  3698,  4083,   102,     0,     0,
                            0,     0,     0,     0],
                        [  101,  1045,  2572,  7568,  2000,  3393,  6826,  2047,  7705,
                            2015,  1999, 19875,   102]], dtype=int32)>, 
'attention_mask': <tf.Tensor: shape=(2, 13), dtype=int32, 
numpy=array([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}
# return NumPy arrays
inputs = tokenizer(sequences, padding=True, return_tensors='np')
print(inputs)
{'input_ids': array([
    [  101,  1045,  2572,  4083,  3698,  4083,   102,     0,     0,    0,     0,     0,     0],
    [  101,  1045,  2572,  7568,  2000,  3393,  6826,  2047,  7705,    2015,  1999, 19875,   102]]), 
'attention_mask': array([
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Attention masks

Attention masks are tensors of 1s and 0s with the same shape as the input IDs tensor, where:

  • 1 indicates the corresponding token should be attended to, and
  • 0 indicates the corresponding token should be ignored (for example, a padding token).

Let’s apply an attention mask to the batch from the earlier example:

batch_ids = [
  [100, 100, 100],
  [100, 100, tokenizer.pad_token_id]
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0]
]
outputs = model(
    torch.tensor(batch_ids),
    attention_mask = torch.tensor(attention_mask)
    )

print(outputs.logits)
tensor([[ 1.4738, -1.3271],
        [ 1.2205, -1.1099]], grad_fn=<AddmmBackward0>)

Now we get the same logits for the second sequence as when we passed it to the model on its own.
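In practice, you rarely build the padding and the attention mask by hand: the tokenizer produces both, and its output can be unpacked straight into the model. A minimal sketch, reusing the sequences list and model from above:

# the tokenizer pads, truncates, and builds the attention mask in one call
inputs = tokenizer(sequences, padding=True, truncation=True, return_tensors='pt')
# unpack input_ids and attention_mask directly into the model call
outputs = model(**inputs)
print(outputs.logits)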

Longer sequences

Most Transformer models can handle sequences of up to 512 or 1024 tokens and will fail when asked to process longer inputs.

There are two solutions:

  • Use a model that supports a longer sequence length, or
  • Truncate your sequences to the model’s maximum length (see the sketch below).
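As a sketch of the second option, assuming a hypothetical long_text string that exceeds the model limit, we can truncate to the tokenizer’s reported maximum length:

# long_text is a hypothetical input longer than the model's 512-token limit
long_text = "I am learning machine learning. " * 200
inputs = tokenizer(long_text, truncation=True, max_length=tokenizer.model_max_length, return_tensors='pt')
print(inputs['input_ids'].shape)  # expected: torch.Size([1, 512])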