THE TRANSFORMERS: A Secret Code AI Uses To Understand Us All!

AI Club


Written By: Muhammad Awwab Khan

The Transformer Architecture: A Simplified Guide

Transformers are one of the most exciting recent advances in machine learning, introduced in the paper “Attention Is All You Need.” They can write stories, essays, and poems, answer questions, translate languages, chat with people, and even ace tough exams. Surprisingly, their architecture isn’t overly complicated; it’s just a combination of a few very handy parts, each with its own role. In this article, we’ll walk through each of these components.

So, what exactly does a transformer do? Imagine you’re typing a text message on your phone. After each word, your phone suggests the next few words, like “you” or “your” after “Hello, how are.” But if you keep picking those suggestions, the sentence quickly becomes nonsense, because your phone’s model doesn’t understand the full context; it just guesses the next word from the last few. Transformers, by contrast, keep track of the entire context, which keeps their text coherent and meaningful. So why do transformers still generate text one word at a time? Because it works incredibly well: by carrying the full context forward, they can make sure each next word fits.

Gmail Smart Compose uses a transformer model to autocomplete text based on what has been typed so far.

Transformers are trained on a vast amount of data from the internet. When you give a transformer a prompt like “Hello, how are,” it predicts “you” as the next word based on all the text it has learned from. For a prompt like “Write a story,” it might start with “Once,” then add “upon,” and so on, creating coherent stories word by word.
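To make the word-by-word loop concrete, here is a toy next-word sampler. Note the hedge: this uses a tiny hand-written bigram table (more like the phone model from the earlier example than a transformer, which would condition on the whole context), and the table entries are made up purely for illustration.

```python
import random

# Toy "model": a bigram table mapping each word to candidate next words
# with probabilities. A real transformer learns its predictions from
# vast amounts of text and uses the full context, not just one word.
BIGRAMS = {
    "Write": [("a", 1.0)],
    "a":     [("story", 1.0)],
    "story": [(".", 1.0)],
    "Once":  [("upon", 1.0)],
    "upon":  [("a", 1.0)],
}

def next_word(word):
    """Sample the next word from the toy distribution."""
    candidates = BIGRAMS.get(word)
    if candidates is None:
        return None
    words, probs = zip(*candidates)
    return random.choices(words, weights=probs)[0]

def generate(prompt, max_words=5):
    """Repeatedly append the predicted next word, one word at a time."""
    words = prompt.split()
    for _ in range(max_words):
        nxt = next_word(words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)

print(generate("Write"))  # → "Write a story ."
```

The loop structure (predict, append, repeat) is the same one a transformer uses; only the prediction step is vastly more sophisticated.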

Now that we know what transformers do, let’s dive into their architecture. At first glance it might seem daunting, but let me make it simple for you. Transformers consist of five main components:

1. Tokenization

2. Embedding

3. Positional encoding

4. Transformer blocks (multiple)

5. Softmax

The transformer block is the trickiest part, as it contains many layers. Each block includes two key elements: attention and feedforward components. These blocks are stacked to create the full model.

The Key Steps

Tokenization

Tokenization is the first and simplest step. It breaks the text down into tokens (words, prefixes, suffixes, and punctuation marks) and maps each one to an entry in a predefined vocabulary.
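Here is a minimal sketch of this step. It splits only on words and punctuation; real tokenizers also break rare words into subword pieces, and the vocabulary below is invented for the example.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens.
    A toy tokenizer; real ones also split words into subword pieces."""
    return re.findall(r"\w+|[^\w\s]", text)

# Map each token to an ID in a small, made-up vocabulary
vocab = {"hello": 0, ",": 1, "how": 2, "are": 3, "you": 4, "?": 5}
tokens = tokenize("Hello, how are you?")
ids = [vocab[t.lower()] for t in tokens]
print(tokens)  # ['Hello', ',', 'how', 'are', 'you', '?']
print(ids)     # [0, 1, 2, 3, 4, 5]
```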

Embedding

After tokenization, the next step is embedding, where each token is converted into a numerical vector, i.e., a list of numbers. Embeddings are built so that similar pieces of text get similar vectors, while unrelated texts get distinct vectors.
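The sketch below illustrates the "similar text, similar vector" idea using cosine similarity. The three-dimensional vectors are invented for the example; real models learn embeddings with hundreds or thousands of dimensions during training.

```python
import math

# Toy 3-D embeddings; the numbers are made up for illustration.
embeddings = {
    "cat":    [0.90, 0.80, 0.10],
    "kitten": [0.85, 0.75, 0.15],
    "car":    [0.10, 0.20, 0.90],
}

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

sim_close = cosine_similarity(embeddings["cat"], embeddings["kitten"])
sim_far = cosine_similarity(embeddings["cat"], embeddings["car"])
print(sim_close)  # close to 1: related words, similar vectors
print(sim_far)    # much lower: unrelated words, distinct vectors
```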

Positional Encoding

Once we have a vector for each token, we need the model to account for word order. If we simply combined the token vectors, say by adding them componentwise (adding [1,2] and [3,4] gives [4,6]), the result would be the same regardless of order, because addition is commutative; sentences that use the same words in different orders would look identical to the model. To fix this, we use positional encoding: a predefined vector for each position is added to the embedding of the token at that position. In “Write a story.”, “Write” receives the position-1 vector, “a” the position-2 vector, “story” the position-3 vector, and “.” the position-4 vector, so the same words in a different order produce different vectors.
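As a sketch, here is the sinusoidal positional encoding used in “Attention Is All You Need”, simplified to four dimensions (real models use the model’s full embedding width):

```python
import math

def positional_encoding(position, dim=4):
    """Sinusoidal positional encoding: a fixed vector per position,
    built from sine/cosine waves of different frequencies."""
    pe = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

def encode(token_embeddings):
    """Add the position-dependent vector to each token embedding."""
    return [
        [e + p for e, p in zip(emb, positional_encoding(pos, len(emb)))]
        for pos, emb in enumerate(token_embeddings)
    ]

# The same two embeddings in different orders now yield different vectors
a, b = [1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]
print(encode([a, b]) != encode([b, a]))  # True: order now matters
```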

Transformer Block

Let’s summarize what we’ve covered so far. First, words are turned into tokens (tokenization), then these tokens are converted into numbers (embeddings), and their order is accounted for with positional encoding. This gives us a vector for each token that is fed into the model. Next, the model predicts the next word in the sentence using a large neural network trained specifically for this task.

To improve the model further, we incorporate the attention component. Introduced in the influential paper “Attention Is All You Need,” attention is essential to the performance of transformer models. We’ll look at how it works shortly; in essence, it provides context to each word in the text.

The attention component is added to every block of the feedforward network. Imagine a large feedforward neural network designed to predict the next word, composed of several blocks of smaller neural networks. Each of these blocks includes an attention component. Thus, each unit of the transformer, called a transformer block, consists of two main components:

1. The attention component.

2. The feedforward component.

Attention Mechanism

The next important step is attention. The attention mechanism addresses a critical issue: the problem of context. Words can have different meanings in different contexts, which can confuse language models because embeddings usually assign words to vectors without distinguishing between meanings.

Attention is a valuable technique that enables language models to grasp context better. To understand how it works, consider these two sentences:

  1. She plays the bass guitar.
  2. He caught a big bass at the lake.

In these sentences, the word “bass” has different meanings: in sentence 1, it refers to a musical instrument, while in sentence 2, it denotes a type of fish. Computers lack the ability to discern these meanings, so we need a mechanism to provide this understanding. We can rely on surrounding words of context. For instance, in sentence 1, “guitar” clarifies that “bass” refers to the musical instrument. In sentence 2, the phrase “caught a big” indicates that “bass” refers to the fish.

In essence, attention captures the contextual relationships between words within a sentence or text segment in the embedding space. In “He caught a big bass at the lake,” the vector for “bass” moves closer to “fish”; in “She plays the bass guitar,” it moves closer to “guitar.” This adjustment lets the word “bass” in each sentence absorb contextual information from nearby words, thereby enriching its meaning.

In practice, transformers use a more powerful variant called multi-head attention. In multi-head attention, several different learned projections of the embeddings are used to modify the vectors and add context to them, with each “head” capturing a different kind of relationship. Multi-head attention has helped language models become far more effective at processing and generating text.
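Here is a heavily simplified sketch of the core operation, scaled dot-product self-attention. Caveats: it uses a single head, toy two-dimensional embeddings, and omits the learned query/key/value projection matrices that a real transformer applies before this step.

```python
import numpy as np

def attention(X):
    """Scaled dot-product self-attention over token embeddings X
    (single head; real transformers also apply learned query/key/value
    projections before this step)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # token-to-token similarity
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ X                               # context-enriched vectors

# Toy 2-D embeddings for three tokens, e.g. "bass", "guitar", "lake"
X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
out = attention(X)
# Each output row is a weighted average of all input rows, so every
# token's vector now mixes in information from its neighbours.
print(out)
```

Because each output vector is a weighted mix of all input vectors, tokens with similar embeddings pull each other closer, which is exactly the “bass moves toward guitar” effect described above.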

The Softmax Layer

Now that you understand a transformer as composed of multiple layers of transformer blocks, each containing an attention and feedforward layer, think of it as a large neural network that predicts the next word in a sentence. The transformer generates scores for all words, assigning higher scores to those most likely to come next in the sentence.

The final stage of the transformer involves a softmax layer, which converts these scores into probabilities that sum to 1. The words with the highest scores are associated with the highest probabilities. From these probabilities, we can sample the next word. For instance, if the transformer assigns a probability of 0.5 to “Once”, and probabilities of 0.3 and 0.2 to “Somewhere” and “There” respectively, sampling would likely select “Once” as the output.

What happens next? Suppose you wrote “Once upon a time, ”. The model would most likely predict “there” as the next word, then “was”, and so on, appending one word at a time.
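The softmax-and-sample step can be sketched in a few lines. The raw scores below are invented so that the resulting probabilities match the 0.5/0.3/0.2 example above:

```python
import math
import random

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1;
    higher scores get higher probabilities."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores the model assigned to three candidate next words
words = ["Once", "Somewhere", "There"]
scores = [2.0, 1.5, 1.1]
probs = softmax(scores)
print([round(p, 2) for p in probs])  # [0.5, 0.3, 0.2]

# Sample the next word; "Once" is the most likely pick
next_word = random.choices(words, weights=probs)[0]
```

Sampling (rather than always taking the top word) is what lets the same prompt produce varied but still sensible continuations.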

Training and Post-Training

Now that you understand how transformers work, one piece remains: training and fine-tuning, also called post-training. Transformers are first trained on vast datasets from the internet, learning to predict the next word based on context. Post-training then fine-tunes the model for specific tasks, such as question answering or conversation.

Conclusion

To sum up, transformers are a big step forward in how machines understand and use language. They use smart techniques like attention and specialized training to do things like translating languages and answering questions.


AI Club

The AI Club was founded by the students of NEDUET with the primary motive of providing opportunities and a networking medium for students, in the domain of AI.