Transformers Models and Tokenizers
Table of contents
- Models
  - Creating a Transformer
- Tokenizers
- Encoding
- Decoding
The first step in initializing a BERT model is loading a configuration object:
from transformers import BertConfig, BertModel
config = BertConfig()
print(config)
BertConfig {
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.42.4",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
The configuration contains many attributes, as we can see from the above output.
# building model from configuration
model = BertModel(config)
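Configuration attributes can also be overridden to define a custom architecture. As a small sketch (the 6-layer depth below is an arbitrary illustrative choice, not a value from any checkpoint):
# hypothetical smaller BERT: 6 hidden layers instead of the default 12
small_config = BertConfig(num_hidden_layers=6)
small_model = BertModel(small_config)  # weights are still randomly initialized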
Model loading methods
Building a model from the default configuration initializes it with random values:
config = BertConfig()
model = BertModel(config)
Here, the model is randomly initialized. To load a pretrained model instead, we can use the from_pretrained() method in Transformers:
model = BertModel.from_pretrained('bert-base-cased')
In the above code, we didn’t use the BertConfig class; instead, we loaded the model using the bert-base-cased identifier. This model is now initialized with all the weights of the checkpoint.
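If we prefer not to hard-code the architecture class, the checkpoint-agnostic AutoModel class can load the same weights; a minimal equivalent sketch:
from transformers import AutoModel
# AutoModel infers the correct architecture from the checkpoint's config
model = AutoModel.from_pretrained('bert-base-cased')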
Saving the model
We can use the save_pretrained() method:
model.save_pretrained('model_weights')
The above code saves two files in the specified directory:
ls model_weights
config.json model.safetensors
- config.json: contains metadata, like where the checkpoint originated and what Transformers version you were using when you last saved the checkpoint.
- model.safetensors: contains the model’s weights.
cat model_weights/config.json
{
"_name_or_path": "bert-base-cased",
"architectures": [
"BertModel"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.42.4",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 28996
}
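We can later reload the saved model from this local directory, just as we would from the Hub; a minimal sketch:
from transformers import BertModel
# reads config.json and model.safetensors from the local directory
reloaded_model = BertModel.from_pretrained('model_weights')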
Using a Transformer Model for Inference
For example, suppose we have a few sequences:
sequences = ["Hello", "Well done", "thank you!"]
Now, the tokenizer converts these sequences to vocabulary indices (i.e., input IDs):
from transformers import AutoTokenizer
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoded_sequences = tokenizer(
    sequences,
    padding=True,
    truncation=True
)
input_ids = encoded_sequences["input_ids"]
input_ids
[[101, 7592, 102, 0, 0],
[101, 2092, 2589, 102, 0],
[101, 4067, 2017, 999, 102]]
Here, input_ids is a list of encoded sequences. We will now convert it into a tensor:
import torch
inputs = torch.tensor(input_ids)
Using tensors as inputs to the model
from transformers import AutoModel
# the model must match the tokenizer's checkpoint, so we load it from the same one
model = AutoModel.from_pretrained(checkpoint)
output = model(inputs)
print(output.last_hidden_state.shape)
torch.Size([3, 5, 768])
As we can see from the above output, we have:
- 3 sequences,
- a sequence length of 5, and
- a hidden size of 768.
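Since the shorter sequences were padded, it is good practice to also pass the attention mask returned by the tokenizer so the model ignores the padding tokens; a sketch reusing encoded_sequences from above:
# mask out the pad positions (the trailing 0s in input_ids)
attention_mask = torch.tensor(encoded_sequences["attention_mask"])
output = model(inputs, attention_mask=attention_mask)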
Tokenizers
Word-based
text = "It's his favorite sport!"
# splitting text on spaces
tokenized_text = text.split()
print(tokenized_text)
["It's", 'his', 'favorite', 'sport!']
Each word gets assigned an ID, starting from zero and going up to the size of the vocabulary. The model utilizes these IDs to recognize each word.
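As a toy illustration (not a real tokenizer), such a word-to-ID vocabulary could be built like this:
# hypothetical toy vocabulary built from the words seen above
vocab = {word: idx for idx, word in enumerate(sorted(set(tokenized_text)))}
print(vocab)
{"It's": 0, 'favorite': 1, 'his': 2, 'sport!': 3}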
Character-based
Character-based tokenization splits text into characters instead of words (a short sketch follows the lists below). It has two major benefits:
- Now, vocabulary is much smaller.
- There are fewer out-of-vocabulary tokens, as every word can be built from characters.
Limitations:
- It produces a very large number of tokens to process.
- It is less meaningful, as a single character doesn’t carry much meaning on its own.
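A minimal character-level sketch using the same text:
# reusing text from the word-based example
char_tokens = list(text)
print(char_tokens[:8])
['I', 't', "'", 's', ' ', 'h', 'i', 's']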
Subword Tokenization
Subword tokenization is based on the principle that frequently used words should not be split into smaller subwords, while rare words should be decomposed into meaningful subwords.
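For example, a WordPiece tokenizer such as BERT’s typically keeps common words whole and splits rarer words into known subwords. A hedged sketch (the exact split depends on the checkpoint’s vocabulary):
from transformers import AutoTokenizer
subword_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# a rare word is split into subwords; for this checkpoint the split is
# expected to be ['token', '##ization']
print(subword_tokenizer.tokenize('tokenization'))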
Loading and Saving Tokenizers
Loading the BERT tokenizer using the BertTokenizer class:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
We can also load the tokenizer using the AutoTokenizer class:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
Now, we will use the tokenizer defined above on a sequence:
tokenizer('Today, I learned techniques for natural language processing.')
{'input_ids': [101, 3570, 117, 146, 3560, 4884, 1111, 2379, 1846, 6165, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
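The 101 and 102 at the boundaries of input_ids are BERT’s special tokens; we can confirm this with the convert_ids_to_tokens() method:
print(tokenizer.convert_ids_to_tokens([101, 102]))
['[CLS]', '[SEP]']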
Saving a tokenizer
tokenizer.save_pretrained('tokenizer-dir')
('tokenizer-dir/tokenizer_config.json',
'tokenizer-dir/special_tokens_map.json',
'tokenizer-dir/vocab.txt',
'tokenizer-dir/added_tokens.json',
'tokenizer-dir/tokenizer.json')
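The saved tokenizer can then be reloaded from the same directory; a minimal sketch:
from transformers import AutoTokenizer
# loads the vocabulary and configuration files saved above
tokenizer = AutoTokenizer.from_pretrained('tokenizer-dir')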
Encoding
Encoding is the process of translating text into numbers. It is a two-step process:
- Tokenization, and
- Conversion to input IDs
Tokenization
We will use the tokenize() method to convert a sequence into tokens:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
sequence = "It's his favorite sport!"
tokens = tokenizer.tokenize(sequence)
print(tokens)
['It', "'", 's', 'his', 'favorite', 'sport', '!']
Converting Tokens into input IDs
Here, we will use the convert_tokens_to_ids() method:
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)
[1135, 112, 188, 1117, 5095, 4799, 106]
Decoding
Decoding is the process of converting IDs back to a string. Let’s do it using the decode() method:
decoded_sequence = tokenizer.decode(input_ids)
print(decoded_sequence)
It's his favorite sport!
The decode() method not only converts the input IDs back to a sequence but also merges tokens that were part of the same word.
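When decoding IDs produced by the full tokenizer() call, the output also includes the special tokens; passing skip_special_tokens=True drops them, as in this sketch:
ids = tokenizer('Today, I learned techniques for natural language processing.')['input_ids']
print(tokenizer.decode(ids, skip_special_tokens=True))
Today, I learned techniques for natural language processing.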