A Beginner’s Guide to Tokens, Vectors, and Embeddings in NLP
If you’re working with text data, you may have come across the terms “tokens,” “vectors,” and “embeddings.” These concepts are important in natural language processing (NLP) and are used to represent and analyze text in various ways.
In this post, we’ll dive into what tokens, vectors, and embeddings are and explain how to create them. By the end, you’ll have a better understanding of how these concepts are used in NLP and how you can use them in your next project.
How can machines read and understand texts?
For machines and NLP models like BERT or GPT to understand language, we need to represent written words as numbers (because computers only understand numbers).
Tokenization is the first step in natural language processing (NLP) projects. It involves dividing a text into individual units, known as tokens. Tokens can be words or punctuation marks. These tokens are then transformed into vectors, which are numerical representations of these words.
To give these tokens meaning, a deep learning model, often a transformer model, is trained on these vectors. This allows the model to understand the meaning of words and how they relate to each other.
The goal of this process is to enable NLP models to understand the meaning and semantics of different words and their context within a sentence or text.
Adrian Colyer explained this concept very nicely in his blog post “The amazing power of word vectors.”
An example of word vectors and vector composition. Source: Adrian Colyer.
The vectors for the words “King” and “Man” may be similar, as might the vectors for “Queen” and “Woman.” These vectors also have certain properties that can be useful for training language models.
For example, you can subtract the vector for “Man” from the vector for “King” and add the vector for “Woman” to get the vector for “Queen.” These properties allow the model to understand the different meanings of words.
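To make this concrete, here is a minimal sketch with hand-made, three-dimensional toy vectors (the values are purely illustrative; real word vectors are learned by a model and have far more dimensions):

import numpy as np

# Toy 3-dimensional word vectors with made-up values.
# The dimensions can loosely be read as: royalty, masculinity, femininity.
word_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# "King" - "Man" + "Woman" lands very close to "Queen".
result = word_vectors["king"] - word_vectors["man"] + word_vectors["woman"]
print(cosine_similarity(result, word_vectors["queen"]))
# close to 1.0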
But how does an NLP model determine which words are more similar to each other? The key is that NLP models do not represent a word with a single number. Instead, they use a long list of numbers, often several hundred or even more than 1,000, to represent a single word.
This is why it is called a “word vector.” In mathematical terms, a single number is called a scalar, and a list of numbers is called a vector.
Example of different word vectors and their different dimensions. Source: Adrian Colyer.
A word vector can be thought of as a point in a multi-dimensional space, where each dimension represents a particular aspect or characteristic of the word.
For example, a word vector for the word “queen” might have high values for dimensions representing “femininity” and “royalty” and low values for dimensions representing “masculinity.”
Combining all these dimensions makes up the vector for “queen,” which allows our model to understand the word’s meaning and how it relates to other words.
The number of dimensions in a word vector is often called “dimensionality.” A high-dimensional word vector would have many dimensions, allowing it to capture a wide range of characteristics and nuances of the word.
However, this also means that it would require more data and computation to train and use. On the other hand, a low-dimensional word vector would have fewer dimensions, making it simpler and more efficient to work with but potentially sacrificing some of the richness and detail of the word’s meaning.
An analogy for this concept might be trying to describe an object using different characteristics. For example, you might describe a ball as round, bouncy, and small.
Each of these characteristics could be considered a dimension in a word vector for the word “ball.” A high-dimensional word vector for “ball” might include dimensions for color, texture, shape, and function, while a low-dimensional word vector might only include dimensions for size and material.
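To make the analogy concrete, here is a purely hypothetical, hand-crafted sketch. Keep in mind that real word vectors are learned by a model, and their dimensions do not carry human-readable labels like these:

# A hand-crafted, hypothetical "word vector" for the word "ball".
# Each entry is one dimension describing a characteristic of the word.
low_dim_ball = {"size": 0.2, "material": 0.7}
high_dim_ball = {"size": 0.2, "material": 0.7, "color": 0.5,
                 "texture": 0.3, "shape": 0.9, "function": 0.8}

print(len(low_dim_ball), len(high_dim_ball))
# 2 6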
Tokens in Natural Language Processing: What they are, what they do, and how you can use them
Tokens are created by dividing the text into smaller units. For example, the sentence “It is sunny outside” might be tokenized as [‘It’, ‘is’, ‘sunny’, ‘outside’]. The process of dividing a text into tokens is called tokenization.
There are various methods for creating tokens, and which method you should use depends on your specific use case. For example, dividing text by whitespace is one common approach, but there are many others.
The best tokenization method for a given dataset and task is not always clear, and different methods have their own strengths and weaknesses.
It is important to experiment with different tokenization methods to determine which one is most effective for your needs. There is no one-size-fits-all approach to tokenization, and the best method may vary depending on the characteristics of your data and the task at hand.
The different types of tokens: Subwords, Words, and Sentences
As previously mentioned, there are various methods for tokenizing text. These methods can be broadly grouped into two categories: sentence tokenization and word tokenization.
Sentence tokenization involves dividing the text into individual sentences, while word tokenization involves dividing the text into individual words or even subwords.
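As a quick illustration, the NLTK library provides ready-made functions for both. A minimal sketch (depending on your NLTK version, you may need to download additional tokenizer resources first):

import nltk
nltk.download("punkt")  # tokenizer models used by NLTK

from nltk.tokenize import sent_tokenize, word_tokenize

text = "It is sunny outside. Let's go for a walk."
print(sent_tokenize(text))
# ['It is sunny outside.', "Let's go for a walk."]
print(word_tokenize(text))
# ['It', 'is', 'sunny', 'outside', '.', 'Let', "'s", 'go', 'for', 'a', 'walk', '.']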
In the next section, we will focus on word tokenization and its different forms: character, word, and subword tokenization.
Character Tokenization
Character tokenization is the most basic tokenization method, which treats each character as a separate token. In Python, this can be easily done with a single line of code.
text = "Hello World"
tokenized_text = list(text)
print(tokenized_text)
# ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']
Character-level tokenization does not consider any text structure and treats the entire string as a sequence of individual characters. While this approach can be useful for handling misspellings and uncommon words, the main drawback is that it requires a lot of computing power, memory, and data to learn words character by character. As a result, character tokenization is not commonly used in practice.
Word Tokenization
An alternative to character tokenization is word tokenization, which involves dividing the text into individual words. This simplifies the training process by eliminating the need for the model to learn words character by character.
text = "Hello World"
tokenized_text = text.split(" ")
print(tokenized_text)
# ['Hello', 'World']
However, word tokenization can result in a very large vocabulary, especially if the corpus contains many rare words, such as a text corpus of medical terms.
A big vocabulary can be a problem for neural networks because it requires more parameters. In their book “Natural Language Processing with Transformers,” the authors Lewis Tunstall, Leandro von Werra, and Thomas Wolf illustrated this problem nicely.
Suppose we use all the words in the English language as input for a neural network. There are approximately 1 million unique English words, and let’s say each word vector has 1,000 dimensions. This would result in a weight matrix for the input layer with 1 million x 1,000 = 1 billion weights.
This is almost as many weights as the largest GPT-2 model, which has around 1.5 billion parameters. Models with such a high number of input parameters can be expensive to maintain and may be difficult to train effectively.
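To make these numbers tangible, here is a quick sketch. The vocabulary size and dimensionality are the illustrative figures from above, not values taken from a real model; the smaller layer uses BERT-like dimensions for comparison:

import torch

# The back-of-the-envelope calculation from above:
vocab_size = 1_000_000   # approximate number of unique English words
embedding_dim = 1_000    # dimensions per word vector
print(f"{vocab_size * embedding_dim:,} weights")
# 1,000,000,000 weights

# The same relationship holds for a real-sized (much smaller) embedding layer:
bert_like = torch.nn.Embedding(num_embeddings=30_522, embedding_dim=768)
print(f"{bert_like.weight.numel():,} weights")
# 23,440,896 weights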
One way to address this issue is to limit the vocabulary by considering only the most common words in the corpus, such as the 100,000 most common words. Words that are not part of the vocabulary are classified as “unknown” and mapped to a shared unknown token.
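A minimal sketch of this idea, using a toy corpus and Python's collections.Counter (the "[UNK]" marker used here is just a common convention, not a fixed standard):

from collections import Counter

corpus = ["the patient received the prescribed medication",
          "the doctor prescribed rest"]
tokens = [token for text in corpus for token in text.split()]

# Keep only the N most frequent words; everything else maps to a shared unknown token.
N = 2
vocab = {word for word, _ in Counter(tokens).most_common(N)}  # {'the', 'prescribed'}

def map_to_vocab(token):
    return token if token in vocab else "[UNK]"

print([map_to_vocab(t) for t in "the doctor prescribed rest".split()])
# ['the', '[UNK]', 'prescribed', '[UNK]']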
However, this approach sacrifices some potentially important information in the process. Wouldn’t it be great if there was a way to preserve some of the information and structure without losing all the information?
Subword Tokenization
Subword tokenization combines the benefits of character and word tokenization by breaking down rare words into smaller units while keeping frequent words as unique entities.
This allows the model to handle complex words and misspellings while keeping the length of the inputs manageable.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer_output = tokenizer.tokenize("This is an example of the bert tokenizer")
print(tokenizer_output)
# ['this', 'is', 'an', 'example', 'of', 'the', 'bert', 'token', '##izer']
For example, in the output above, the word “tokenizer” was split into the known tokens “token” and “##izer,” where “##” indicates that the token should be attached to the previous one.
By using subword tokenization, we can take advantage of the benefits of word tokenization while keeping the vocabulary size reasonable. For example, the bert-base-uncased tokenizer used in the example above has a vocabulary of only 30,522 tokens.
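You can check this directly:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.vocab_size)
# 30522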
Corpus & Vocabulary
The purpose of tokenization is to create a vocabulary from a corpus. A corpus is a collection of texts (such as the dataset used to train an NLP model), and a vocabulary is the set of unique tokens found within the corpus.
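A minimal sketch of the relationship between the two, using simple whitespace tokenization on a toy corpus:

corpus = ["It is sunny outside", "It is raining outside"]

# The vocabulary is the set of unique tokens found in the corpus.
vocabulary = {token for text in corpus for token in text.split()}
print(vocabulary)
# {'It', 'is', 'sunny', 'raining', 'outside'}  (set order may vary)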
For a corpus to be effective for training an NLP model, it should be large and contain high-quality data. While a large corpus is beneficial for training, data quality matters even more: small errors in the training data can lead to significant errors in the final model, so it is important to use data that is as accurate as possible.
Challenges of tokenization
There are many factors to consider when selecting or creating a tokenizer, such as how it handles punctuation, whether it uses singular forms, etc. These choices can have a significant impact on the vocabulary, training, and final performance of the NLP model.
One common problem with tokenization is handling spelling mistakes. For example, if the corpus includes the word “hepl” instead of “help,” the model may treat it as an out-of-vocabulary (OOV) word. This can significantly decrease model performance.
The paper “User Generated Data: Achilles’ Heel of BERT” examined this problem in detail. It shows the following reduction in model performance, based on spelling error rate (the percentage of words in a given input string that are misspelled):
| Spelling Error Rate | Classification Score |
|---------------------|----------------------|
| 0%                  | 0.89                 |
| 5%                  | 0.78                 |
| 10%                 | 0.60                 |
| 15%                 | 0.45                 |
| 20%                 | 0.35                 |
As you can see, even a 5% spelling error rate leads to a significant drop in model performance, 11 percentage points in this case. This happens because when the tokenizer encounters a misspelled word like “hepl,” it creates tokens like [‘he’, ‘##pl’] rather than using the correct token, “help.”
This can cause problems with abbreviations as well. For example, if the word “cmd” is used instead of “command,” it would be split into two tokens. To address these issues, it is important to have high-quality data or a large enough dataset for the model to learn the correct meanings of words.
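You can reproduce this effect with the bert-base-uncased tokenizer from the earlier example (the exact subword split of the misspelled word depends on the tokenizer's vocabulary):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("please help me"))
# ['please', 'help', 'me']
print(tokenizer.tokenize("please hepl me"))
# the misspelled word is broken into subword pieces instead of the token 'help'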
Now that we have a better understanding of tokens and vectors, there is one more thing we need to learn: embeddings.
What are Embeddings in NLP?
As we learned already, we create vectors from tokens. These vectors are numerical representations of the meaning of a token and how it relates to other tokens in our vocabulary.
However, the model must first learn these representations. Initially, the vectors are randomly initialized, but through training, they are adapted to incorporate some meaning.
During training, these numerical representations, similar to weights in a neural network, are adjusted by the model to more accurately represent the meaning of a token and its relationship to other tokens in our vocabulary.
In natural language processing, “embedding” refers to the process of mapping non-vectorized data, such as tokens, into a vector space that has meaning for a machine learning model or neural network. By doing this, the model can learn the relationships between words and their meanings automatically rather than having to specify them manually.
When you visualize word embeddings, you can think of the result as a map of the vocabulary that shows how one token is related to other tokens in terms of meaning. Each token is positioned near other tokens with similar meanings.
On projector.tensorflow.org, you can see a visual representation of such a map of words. Each grey point represents a word embedding, projected down to three dimensions so it can be displayed alongside all the other words in that vocabulary.
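If you want to build a similar view yourself, a common approach is to project the high-dimensional vectors down to two or three dimensions, for example with PCA. A minimal sketch, assuming scikit-learn is installed and using random vectors as stand-ins for real embeddings:

import numpy as np
from sklearn.decomposition import PCA

# Stand-in 300-dimensional "embeddings" for a handful of words.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 300))
words = ["king", "queen", "man", "woman", "ball"]

# Project down to 3 dimensions for plotting, similar to what the embedding projector does.
points_3d = PCA(n_components=3).fit_transform(embeddings)
print(points_3d.shape)
# (5, 3)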
The Difference Between a Token, a Vector, and an Embedding
To get to a point where your model can understand text, you first have to tokenize it, vectorize it and create embeddings from these vectors.
- Tokenization: This is the process of dividing the original text into individual pieces called tokens. Each token is assigned a unique id to represent it as a number.
- Vectorization: The unique ids are then assigned to randomly initialized n-dimensional vectors.
- Embedding: To give tokens meaning, the model must be trained on them. This allows the model to learn the meanings of words and how they relate to other words. To achieve this, the word vectors are “embedded” into an embedding space. As a result, similar words should have similar vectors after training.
The process of getting from a word to a word vector.
From Text to Vectors: A How-To Guide
As previously mentioned, the first step of an NLP project is to tokenize our dataset. But before training a custom tokenizer from scratch, it is worth considering whether this is the most efficient approach. Creating a high-quality corpus and training a tokenizer can be time-consuming and may not always produce the best results.
One alternative is to use pre-trained tokenizers, which have already been trained on a large dataset and can understand the meanings of words and sentences without additional training. This can save time and resources and may even provide better results.
If you are interested in training your own tokenizer and transformer model, you can refer to Chapter 10 of the Natural Language Processing with Transformers book (transformersbook.com). If you decide to use a pre-trained tokenizer, this is how you can do it with the Hugging Face transformers library:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("This is an example of the bert tokenizer")
print(tokens)
# ['this', 'is', 'an', 'example', 'of', 'the', 'bert', 'token', '##izer']
In this example, we use the pre-trained bert-base-uncased tokenizer to tokenize a sample sentence. Then, we use the tokenizer's convert_tokens_to_ids function to convert the tokens to their corresponding numerical values.
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)
# [2023, 2003, 2019, 2742, 1997, 1996, 14324, 19204, 17629]
The encode function is similar to convert_tokens_to_ids, but it also includes special tokens such as [CLS] (beginning of the sequence) and [SEP] (end of the sequence). These special tokens help the model understand where a sequence starts and ends.
token_ids = tokenizer.encode("This is an example of the bert tokenizer")
print(token_ids)
# [101, 2023, 2003, 2019, 2742, 1997, 1996, 14324, 19204, 17629, 102]
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(tokens)
# ['[CLS]', 'this', 'is', 'an', 'example', 'of', 'the', 'bert', 'token', '##izer', '[SEP]']
The **encode** function translates the token with id 101 to the special **[CLS]** token, which marks the beginning of the sequence. The token with id 102 is the end-of-sequence token, **[SEP]**.
These tokens are mapped to dense vectors in the embedding space, which encode the meanings of the words and their relationships to other words.
To obtain an embedding for a token, you first need to create a model. You can do this by importing **BertModel** and using the **from_pretrained** method. This will download the pre-trained **bert-base-uncased** model along with its weights and embeddings. Then, you can pass a token id to the model's **embeddings.word_embeddings** layer to get the corresponding word vector.
import torch
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-uncased")
# get the embedding vector for the word "example"
example_token_id = tokenizer.convert_tokens_to_ids(["example"])[0]
example_embedding = model.embeddings.word_embeddings(torch.tensor([example_token_id]))
print(example_embedding.shape)
# torch.Size([1, 768])
The returned word vector has 768 dimensions, which matches the hidden size of the bert-base model. We can compare such vectors to each other using PyTorch's cosine similarity function.
Cosine similarity is a way to measure how similar two things are. It’s often used in natural language processing to compare the content of two texts.
To calculate the cosine similarity, we look at the angle between two vectors. If the vectors point in the same direction, they are more similar, and if they point in opposite directions, they are less similar.
The result is a number between -1 and 1, where 1 means the vectors point in the same direction, 0 means they are unrelated (orthogonal), and -1 means they point in opposite directions.
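In formula terms, the cosine similarity of two vectors is their dot product divided by the product of their lengths. A minimal sketch with toy vectors:

import torch

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    return torch.dot(a, b) / (torch.norm(a) * torch.norm(b))

a = torch.tensor([1.0, 0.0])
b = torch.tensor([0.0, 1.0])
print(cosine_similarity(a, a))   # tensor(1.)  -> same direction
print(cosine_similarity(a, b))   # tensor(0.)  -> unrelated (orthogonal)
print(cosine_similarity(a, -a))  # tensor(-1.) -> opposite direction

With BERT's learned word embeddings, the same comparison looks like this: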
king_token_id = tokenizer.convert_tokens_to_ids(["king"])[0]
king_embedding = model.embeddings.word_embeddings(torch.tensor([king_token_id]))
queen_token_id = tokenizer.convert_tokens_to_ids(["queen"])[0]
queen_embedding = model.embeddings.word_embeddings(torch.tensor([queen_token_id]))
cos = torch.nn.CosineSimilarity(dim=1)
similarity = cos(king_embedding, queen_embedding)
print(similarity[0])
# 0.6469
We can see that the vectors for king and queen have a similarity score of 0.6469.
similarity = cos(example_embedding, queen_embedding)
print(similarity[0])
# 0.2392
The queen and example vectors have a similarity of 0.2392. This means that the king and queen vectors are more similar in our vector space than the example and queen vectors. This shows that our model has learned the “meaning” of these words and can differentiate between them.
Conclusion
In this blog post, we explored the concepts of tokens, embeddings, and vectors in natural language processing (NLP). We discussed what they are and how they are used in training neural networks. We also learned how to use them in an NLP project.
I hope this blog post was helpful and provided a better understanding of these concepts. If you have any questions or feedback, please don’t hesitate to let me know.
References:
- Natural Language Processing with Transformers Book: https://transformersbook.com
- What is NLP (Natural Language Processing) Tokenization? — tokenex: https://www.tokenex.com/blog/ab-what-is-nlp-natural-language-processing-tokenization/
- What is Tokenization | Tokenization In NLP: https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/
- CORPUS: https://hypersense.subex.com/aiglossary/corpus
- Word embedding: https://en.wikipedia.org/wiki/Word_embedding
- Text similarity search in Elasticsearch using vector fields | Elastic Blog: https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch
- Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space: http://arxiv.org/pdf/1504.06654v1
- Comparison Between BagofWords and Word2Vec — PyImageSearch: https://pyimagesearch.com/2022/07/18/comparison-between-bagofwords-and-word2vec/
- Word, Subword, and Character-Based Tokenization: Know the Difference: https://towardsdatascience.com/word-subword-and-character-based-tokenization-know-the-difference-ea0976b64e17
- Word embeddings: https://www.tensorflow.org/text/guide/word_embeddings
- The amazing power of word vectors: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
- Understanding NLP Word Embeddings — Text Vectorization: https://towardsdatascience.com/understanding-nlp-word-embeddings-text-vectorization-1a23744f7223
- How to use [HuggingFace’s] Transformers Pre-Trained tokenizers?: https://nlpiation.medium.com/how-to-use-huggingfaces-transformers-pre-trained-tokenizers-e029e8d6d1fa