Explained: Transformers for Everyone

The underlying architecture of modern LLMs

XQ
The Research Nest


First introduced in the famous paper ‘Attention Is All You Need,’ the transformer architecture has become essential to most successful modern AI models.

Yet, many users and even creators of AI-based products might not understand what a transformer is or how it works.

Admittedly, reading research papers and understanding everything in them is pretty difficult.

This article aims to explain the fundamental concepts of transformers to everyone, whether you’re a high school student or an experienced technologist. We will focus on the basic intuition and logical blocks that make up a transformer.

We’ll start with simple explanations and gradually delve into more complex aspects of AI. No prior knowledge is required.

Note: Unless otherwise specified, all images used in the article are created using DALLE 3 or Imgflip meme generator.

What we want to build

A computer that can think and respond like a human.

Humans think and communicate with language, which contains all the knowledge about the world. Teaching and learning also happen through language. If computers can process and understand language, we could create artificial intelligence that behaves like humans. On the internet, language is mainly in the form of text.

Understanding all these texts can naturally give a computer the ability to understand questions, instructions, and prompts.

Researchers have developed various algorithms and neural networks to achieve this, like CNNs, RNNs, and LSTMs (don’t worry about what they are). The patterns in data, reflecting the underlying logic of language and human thought, are too complex to figure out manually. Neural networks are designed to detect these patterns automatically.

Remarkably, it turns out that language understanding starts with the simple yet sophisticated ability to predict the next word in a given sentence.

One effective architecture for this task is the transformer. It was initially developed for language translation but has proven useful for many tasks.

The goal is to create an AI model that can receive input and produce relevant, sensible output. When operated at a large scale, these models have shown abilities that resemble human thought and expression.

Almost all AI models are built on top of a few fundamentals:

  • Computers understand numbers; hence, the data (like text) is converted into numbers using various methods (a tiny sketch of this follows the list).
  • These numbers have hidden statistical or logical patterns within them that can be figured out mathematically. It’s not random gibberish and is based on some fundamental rules of language and human interactions.
  • Finding these patterns manually is nearly impossible. So, we try to create a neural network architecture to figure these things out.
  • Once that’s done, you have your “AI model” to generate new responses based on everything learned from the training data.
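
To make the first point above concrete, here is a minimal Python sketch (with a made-up toy sentence) of how text can be turned into numbers a model can work with:

```python
# A minimal sketch of turning text into numbers (toy example, not a real tokenizer).
text = "the cat sat on the mat"

# Build a tiny vocabulary: every unique word gets an integer id.
vocab = {word: idx for idx, word in enumerate(sorted(set(text.split())))}

# Convert the sentence into a list of ids the model can work with.
ids = [vocab[word] for word in text.split()]

print(vocab)  # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(ids)    # [4, 0, 3, 2, 4, 1]
```

Real systems use more sophisticated “tokenizers” that split text into subword pieces, but the core idea is the same: text in, numbers out.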

How to create an AI model

For a moment, let’s take a small diversion and forget about the Transformer.

Here’s how to make an AI model in simple (yet complex) steps (a toy end-to-end sketch follows the list):

  1. Choose what you want the AI to do — like generating text.
  2. Collect the right kind of data for training. In this case, it would be all the text on the internet.
  3. Turn all the data into numbers that a computer can understand. Figure out how to feed these numbers into a neural network.
  4. A neural network is basically a fancy calculator that uses math to produce a result based on the data you give it.
  5. Train this network (basically, adjust the variables in the math used) with various methods so it learns to give the correct result for the data it receives. The results are in numbers, which we then change back into the original format, like words.
  6. Finally, the trained AI will be tested with new data to see if it gives the right outputs.
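
Before connecting these steps to the Transformer, here is a toy sketch of steps 1 to 6 on a much simpler problem: “training” a model with just two adjustable variables to discover a hidden rule in some numbers. It is deliberately trivial, but the shape of the process is the same:

```python
import numpy as np

# Steps 1-2: the "task" is to learn the hidden rule in some data.
# Here the toy data secretly follows y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# Steps 3-4: the "neural network" is just math with adjustable variables (w and b).
w, b = 0.0, 0.0

# Step 5: training = nudging w and b until predictions match the data (gradient descent).
learning_rate = 0.05
for _ in range(2000):
    prediction = w * x + b
    error = prediction - y
    w -= learning_rate * (error * x).mean()
    b -= learning_rate * error.mean()

# Step 6: test on data the model has never seen.
print(round(w, 2), round(b, 2))    # close to 2.0 and 1.0
print(round(w * 10 + b, 2))        # close to 21.0
```

A transformer follows the same recipe, except the data is all the text on the internet and the adjustable variables number in the billions.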

Let’s explore how each of these steps happens in the context of a Transformer.

As you read on, remember that every one of these steps is a place where innovation can push the boundaries further.

I’ll introduce a series of technical topics to build a base and then connect the dots to get the complete picture.

Sequence Modeling

A sentence is essentially a sequence of words. Building a system that can predict the next word in this sequence comes under “sequence modeling.”

Our goal is to create a great sequence model for human languages.
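
As a toy illustration of the idea (and nothing like how a transformer actually does it), here is a tiny next-word predictor that simply counts which word tends to follow which in some sample text:

```python
from collections import Counter, defaultdict

# Toy corpus; a real model would train on vastly more text.
corpus = "the cat sat on the mat . the cat ran to the door .".split()

# Count, for every word, which words follow it and how often.
follows = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequent follower of `word` in the corpus."""
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' (follows 'the' twice, vs. 'mat'/'door' once each)
```

A transformer does something far more powerful, but the task is the same: given the words so far, predict what comes next.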

Transduction (in the context of AI)

Imagine you have a bunch of questions (some with answers and some without) for a quiz tomorrow. Transduction would be like studying just enough to answer those specific questions rather than learning all the material that could possibly appear on the quiz.

Basically, if you ever prepared for your cycle tests by simply studying questions from previous years' papers instead of reading the textbook, you are doing transduction :P

This method is especially handy when you have many questions but only a few answers. By looking at all the questions (even the ones without answers), we try to guess the answers based on patterns or similarities among them.

Transductive learning is commonly used in scenarios where a model is trained on a limited amount of labeled data alongside a larger set of unlabeled data. The goal is to map the inputs to the outputs without trying to derive a general rule that applies beyond them.

Recurrent Models

Imagine you’re telling a story, and each word you say depends on the words you’ve said before. Recurrent models work like this: they keep track of the story so far, so when it’s time to say the next word, they remember everything you’ve said and pick the word that fits best.

In language, this means they can help write texts by guessing the next word based on the words before it. If you teach it lots of stories or conversations, it learns patterns like how sentences are usually structured and what words often come together.

This is fine for short sequences but becomes a problem for long ones. The longer the sequence, the more memory they need and the longer they take to train, and because they process words strictly one at a time, information from early in the sequence tends to fade by the time they reach the end.

Due to these issues, researchers wanted to create something better than recurrent models.

Attention (in the context of AI)

As the name suggests, it's all about paying attention to what’s important.

Let’s say you have watched all the Marvel Cinematic Universe movies until Avengers: Endgame. Your task is to write a short summary that captures the most important events in just ten crucial points. As a human, you can understand what’s important and what’s not. You would probably include a point about Thanos doing the snap and skip some other stuff, like when Iron Man and Hulk fought in Age of Ultron.

How do you know this? How can you make computers know this? We need to find a way to give more importance to certain parts of the input. That’s exactly what attention does.

In technical terms, attention mechanisms calculate weights, determining how much focus is put on each part of the input data. This process enables the model to prioritize information from the parts of the input most relevant to the task, improving its performance on tasks like translation, text summarization, and question-answering.

Models like RNNs process data sequentially and sometimes struggle to remember or give importance to information from the early parts of the sentence when they’re far into it. Attention solves this by giving the model the ability to “look back” at all parts of the input sentence, no matter where it currently is, and “pay attention” to the bits that matter most for what it’s trying to predict or generate next.

The combination of attention mechanisms with RNNs leverages the strengths of both: RNNs’ ability to process sequences with their inherent structure and attention mechanisms’ ability to highlight relevant information throughout the sequence.

Reminder: Everything we discuss here is inherently “mathematics” applied to our system to compute an output. When I say “attention,” I mean introducing a mathematical step before computing the output.
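
To ground that in a tiny bit of actual math, here is a small numpy sketch of the core trick: turn “relevance scores” into weights that sum to 1 (using the softmax function) and let the important parts dominate. The sentence and the scores below are made up purely for illustration:

```python
import numpy as np

# Made-up relevance scores of each input word for the current prediction.
words = ["Thanos", "collected", "the", "stones", "and", "snapped"]
scores = np.array([3.0, 1.0, 0.1, 2.5, 0.1, 3.2])

# Softmax turns raw scores into attention weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

for word, weight in zip(words, weights):
    print(f"{word:>10}: {weight:.2f}")
# Important words ("Thanos", "snapped", "stones") end up with most of the weight.
```

Where those scores come from, and how they are learned, is exactly what the different attention mechanisms define.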

Still, people were unhappy with recurrent models because of these memory and speed issues. So, they thought about it. Do we really need the clumsy recurrent network? What if we completely remove it from the equation?

At that time, no one knew if it would actually work. Once the experiments were done and the models trained, researchers concluded that maybe “Attention is all you need!” :P.

Self Attention

Attention can be implemented in many different ways. The ultimate goal is mathematically giving more weightage to what’s important. Self-attention is one technique where we look at all the parts of the sentence and understand how important each part is compared to every other part.

Imagine you’re in a room full of people talking about different topics. You’re trying to follow a conversation about your favorite movie. To do this, you focus more on the people talking about that movie, even though other conversations are happening at the same time. You might also think about what you know about the movie to understand what they’re saying. This way, you can build a great understanding of which conversation is the most important to you in the context of your favorite movie. This is similar to what self-attention does in AI models.
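
For the curious, here is a minimal numpy sketch of one common way this is computed: scaled dot-product attention, the flavor used in the original paper. The tiny “embedding” matrix is random, just to show the shapes and the flow of the computation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8            # 4 "words", each represented by 8 numbers
x = rng.normal(size=(seq_len, d_model))

# Learned projection matrices (random here; training would tune these).
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Each word scores every other word, the scores become weights,
# and each word's new representation is a weighted mix of all the words.
attn_weights = softmax(Q @ K.T / np.sqrt(d_model))   # shape (4, 4), rows sum to 1
output = attn_weights @ V                            # shape (4, 8)

print(attn_weights.round(2))
print(output.shape)  # (4, 8): same shape as the input, but now context-aware
```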

When you read between the lines, you notice that all of this is general human intuition about how we behave, converted into a more logical and technical format.

Encoders and Decoders

Most modern AI models include components called Encoders and Decoders.

As the names suggest, an encoder encodes data — converts input (say sentences) into a specific format (generally numbers that are given a fancy name called “embeddings” — more on that later).

The encoder tries to convert the input sentences into a complex numeric form that contains a lot of context, understanding, and relationships between the words in the given sentence.

A decoder then uses this representation to generate an output, which could be the next word in that sentence.

This works in an “auto-regressive” manner, which means the output created is combined with the input for the next step, and the network can continue like that to keep generating new outputs.
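
In code, the auto-regressive loop looks roughly like this. The `predict_next_word` function here is a hypothetical stand-in for the whole encoder-decoder machinery, included only to show how outputs get fed back in as inputs:

```python
def predict_next_word(words):
    # Hypothetical stand-in for the real encoder-decoder computation:
    # a toy lookup that always continues the same way.
    canned = {"The": "cat", "cat": "sat", "sat": "down", "down": "."}
    return canned.get(words[-1], ".")

# Auto-regressive generation: each predicted word is appended
# and becomes part of the input for the next prediction.
generated = ["The"]
while generated[-1] != "." and len(generated) < 10:
    generated.append(predict_next_word(generated))

print(" ".join(generated))  # The cat sat down .
```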

The Transformer

Finally, we are ready to understand the Transformer!

Image taken from the official research paper where transformers were first introduced.

Do not worry about this complicated block diagram. I just put it out there to give a basic visual idea of what a transformer architecture looks like. Each block you see is basically like a mathematical function.

Here’s an important intuition to build at this stage. Many steps in AI models deal with manipulating data and changing it from one form to another to present it most efficiently to get the best results. It transforms data (and hence is called the transformer). A lot of what we explore below falls under this type of function.

The Encoder in the Transformer

Encoders have layers of different stuff in them. When I say “layer,” think of it as a mathematical step performed on the input data. The encoder in a transformer has six identical layers, meaning the input data goes through six steps, all performing similar operations on the data.

Each layer is made of two sub-layers:

  1. Multi-head self-attention: It’s about splitting the attention process into “multiple heads,” where each head performs its own attention computation. These heads operate in parallel, and each computation can be different. This helps us understand different aspects of the input sequence. Each head produces outputs, which are combined together to get the final output from this layer.
  2. Fully connected feed-forward network: This takes in the normalized output from the self-attention process and performs more mathematical operations to get a more transformed output. This helps the model understand the inputs better.

To summarize it all, the input to an encoder goes on the following journey (a rough code sketch follows the list):

  1. It enters the first layer and is transformed via the multi-head self-attention process. This tells the model the most important thing to consider in the input.
  2. The result then goes through an “add and normalization” step, where the original input is added back (a residual connection) and the values are normalized to keep them in a stable range.
  3. Then, they go to the feed-forward network that does more operations on them to get a more transformed output, which goes through the “add and normalization” step again.
  4. Finally, we get the output of the first layer, which retains the same shape as the input, but all the data points are heavily transformed due to all the math we applied to them.
  5. This output then goes to the second layer, where the exact same steps are followed. The encoder has six identical layers, as mentioned. So, the output of the second layer goes to the third layer, where it is transformed yet again, and so on, until it comes out the end of the sixth layer.
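
Here is the rough sketch promised above: one simplified encoder layer (single attention head, random parameters, simplified normalization) repeated six times in numpy. It is only meant to show the structure of the computation, not to be a faithful implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, num_layers = 4, 8, 16, 6

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + 1e-6)

def self_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(d_model)) @ V

# Six layers with the same structure but their own parameters (random here;
# "training" is the process of finding good values for all of these).
layers = [{
    "W_q": rng.normal(size=(d_model, d_model)),
    "W_k": rng.normal(size=(d_model, d_model)),
    "W_v": rng.normal(size=(d_model, d_model)),
    "W_1": rng.normal(size=(d_model, d_ff)),
    "W_2": rng.normal(size=(d_ff, d_model)),
} for _ in range(num_layers)]

x = rng.normal(size=(seq_len, d_model))   # stand-in for the embedded input sentence

for p in layers:
    # Sub-layer 1: self-attention, then add (residual) & normalize.
    x = layer_norm(x + self_attention(x, p["W_q"], p["W_k"], p["W_v"]))
    # Sub-layer 2: feed-forward network, then add & normalize again.
    x = layer_norm(x + np.maximum(0, x @ p["W_1"]) @ p["W_2"])

print(x.shape)  # (4, 8): same shape as the input, now heavily transformed
```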

Why are we doing the same thing six times? What’s the difference?

It helps the model understand the patterns and context of the input better. While the mathematical equations for all six layers are the same, there are some variables in these equations. Researchers call them “parameters,” which are different in each layer. The goal would be to find the best possible parameter values for each layer so that the model gets the best representation of the inputs after all this processing. When someone says, “training an AI,” this is what they are trying to do.

This process helps the model build a deep understanding of how words are related to each other and what actually makes sense.

After 6 layers of encoder transformations, the final output is a “vector representation” of the given input. If you visually examine that representation, it will basically look like a big matrix of numbers. These numbers hold the meaning of the input words and sentences and how groups of words are contextually related.

This representation now becomes the input of the next building block, the decoder.

The Decoder in the Transformer

It’s similar to the encoder, once again having 6 identical layers. Each layer, however, has an extra sublayer that performs attention over the output of the encoder (this is often called cross-attention).

The vector representation from the encoder, a dense matrix brimming with contextual clues, arrives at the first layer of the decoder. Here’s what happens step by step (a rough code sketch of the two attention flavors follows below):

  1. Self-attention with a twist: Initially, the decoder layer applies a modified version of the multi-head self-attention mechanism called masked attention to the output generated so far (some starter output). The mask prevents each position from looking at the words that come after it, so the model can only rely on what has already been generated.
  2. After self-attention, the decoder also follows with an “add and normalization” step, just like the encoder. This step helps stabilize the learning by normalizing the data, making sure the transformations don’t push the values too far off the scale.
  3. The cross-attention layer is the extra sublayer in each decoder layer. Here, the decoder pays attention to the encoder’s output. It uses the encoder's context to help generate the appropriate next element in the sequence.
  4. Post-cross-attention, the data goes through another feed-forward network where further transformations are applied. These are designed to refine the decoder’s understanding of the input and the context in which it operates.
  5. Like previous steps, this process undergoes another “add and normalization” to ensure consistency and stability in the transformations.

This completes the journey through one decoder layer. The output, now slightly more shaped towards the final result, moves to the next layer till the end of the 6 layers.
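
Here is the promised sketch of the two attention flavors that make the decoder different: masked self-attention (the mask stops each position from looking at future words) and cross-attention (queries come from the decoder, keys and values come from the encoder’s output). All matrices are random placeholders, and the learned projection matrices are omitted to keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 8
enc_len, dec_len = 5, 3   # 5 encoded input words, 3 words generated so far

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

encoder_output = rng.normal(size=(enc_len, d_model))   # from the encoder stack
decoder_state = rng.normal(size=(dec_len, d_model))    # embeddings of the output so far

# 1. Masked self-attention: each generated word may only attend to itself
#    and to earlier words, never to future ones.
scores = decoder_state @ decoder_state.T / np.sqrt(d_model)
mask = np.triu(np.ones((dec_len, dec_len)), k=1).astype(bool)
scores[mask] = -1e9                      # future positions get ~zero weight
decoder_state = softmax(scores) @ decoder_state

# 2. Cross-attention: the decoder (queries) attends to the encoder output (keys/values).
cross_scores = decoder_state @ encoder_output.T / np.sqrt(d_model)
decoder_state = softmax(cross_scores) @ encoder_output

print(decoder_state.shape)  # (3, 8): one context-aware vector per generated word
```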

The repetition across these layers is not redundant but rather deepens the model’s cognition. Each layer, with its distinct set of parameters, adds another layer of complexity, understanding not just the raw information but the nuances of language patterns, context, and meaning.

Training the model is essentially the process of fine-tuning these parameters across all layers to achieve the best translation of input to output, representing the nuances captured through both encoding and decoding phases.

After traversing all six layers of the decoder, the culmination of this painstaking process is a sequence that mirrors the intended task, whether it be a translated sentence, the continuation of a text, or another form of sequenced output. This is then further processed into the right format before it’s ready to be shown as the final output you see.

Let’s look at the architecture diagram once more. Perhaps you will now understand it much better than before.

Image taken from the official research paper where transformers were first introduced.

Notice that the left part is the encoder, and the right part is the decoder. The encoder's output joins the decoder's input at the second sublayer. Before that, the decoder has something called the output embeddings that are processed in a masked attention layer. After the decoder computes the output, two additional steps happen — Linear and softmax. Let’s try to fix these gaps.

  1. Each input word is converted into an array of numbers, which we call an input embedding, capturing its contextual meaning.
  2. Positional encodings are added to these embeddings to incorporate sequence order information (a small code sketch of this follows the list).
  3. The model processes these through its encoder-decoder layers to understand the sentence, learn relationships, and perform tasks like translation or answering questions.
  4. In the decoder (for generation tasks), output embeddings represent the translated or generated sequence so far (this could be simple starter sentences for the final answer to be generated), and positional encodings ensure the model generates the sequence in the correct order.
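
To illustrate points 1 and 2, here is a small sketch of sinusoidal positional encodings (the scheme used in the original paper) being added to some made-up word embeddings:

```python
import numpy as np

seq_len, d_model = 6, 8
rng = np.random.default_rng(0)

# Point 1: each word becomes an array of numbers (random stand-ins for learned embeddings).
embeddings = rng.normal(size=(seq_len, d_model))

# Point 2: sinusoidal positional encodings, as in the original paper:
# even dimensions use sine, odd dimensions use cosine, at different frequencies.
positions = np.arange(seq_len)[:, None]                    # 0, 1, 2, ...
dims = np.arange(0, d_model, 2)[None, :]                   # 0, 2, 4, ...
angles = positions / np.power(10000, dims / d_model)

pos_encoding = np.zeros((seq_len, d_model))
pos_encoding[:, 0::2] = np.sin(angles)
pos_encoding[:, 1::2] = np.cos(angles)

# The model's actual input is the sum: word meaning + position information.
model_input = embeddings + pos_encoding
print(model_input.shape)  # (6, 8)
```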

Aahh… it’s getting a bit too technical. Let me simplify.

  • You have a sentence. You break it down into words.
  • You convert those words into numbers.
  • As you do it, you keep track of the positions of these words in the original sentence to ensure they have the correct sequence.
  • You pass everything you have to this point to six layers of mathematical operations that transform your numbers into more complicated numbers with special understanding. This is your encoder.
  • While the encoder is busy with the input sentence, the decoder prepares to generate an output. It starts with just the beginning of a sentence, like saying “The.” It then passes this through another set of six layers, which tries to generate further, one word at a time, while utilizing all the understanding contained in the special numbers created by the encoder.
  • Finally, after all these mathematical operations, the final special numbers are converted back into text: the one new word that is most likely to occur next in the sentence.
  • You are the Transformer!

We say AI is a black box because we don’t understand why and how these numbers are “special,” but a computer seems to get it very well. It’s remarkable that this process even works, and just by predicting the next word again and again, we observe emergent behavior that feels intelligent.

If you want to learn more about “tokens” and “embeddings” (the special numbers, as I refer to them), check out my other article below.

A Philosophical Conclusion?

I intentionally did not touch upon the exact mathematical concepts or equations used in each layer to avoid unnecessary complications. I also did not touch upon how a transformer is actually trained and what exactly all those “parameters” and “weights” are.

Perhaps I will cover them more in future articles that will be hyperlinked below as and when they are ready.

  1. Explained: Regularization in Deep Learning
  2. Explained: Optimizers in Deep Learning
  3. How do you actually train a Transformer?

After reading this, you might wonder how complex the process is and how someone even figured this out. Did the researchers wake up one day and decide, “Let’s build an awesome AI,” and somehow they understood that they should build these 6 layers of encoders and decoders and do random mathematics to create these numbers so that everything works magically?

It can feel very overwhelming. How can you even make head or tail of this process? How do you even know where to start?

Here’s the thing. Research is a naturally incremental process that stands on the shoulders of everything that came before, going back decades. If you want to explore, your starting point should not be the Transformer. You would have to backtrack through the hundreds of papers that came before the Transformer paper, and as you slowly connect the dots, it becomes clearer how people actually arrived at this point.

And then, you can push things forward, too. Every step is a mathematical or logical equation that can be changed, leading to new discoveries. You can add, remove, or replace one or two steps in the process, which could make all the difference.

That’s the research side of things. What about others? How do you apply this understanding and knowledge as a product builder or a business leader?

This helps you identify what is worth betting on amid all the noise. When transformers were introduced, OpenAI engineers immediately jumped into experimenting with them. They went all in on scaling these models because they had the intuition that, among the many architectures of that time, this one could really work. Naturally, they built a long lead and were the first to market with consumer products built on this technology.

When you understand how something works, you may build a strong sense of intuition to help you identify what’s good to pick up among the new developments very early and quickly.

Sometimes, a small change in the existing architecture can make a big difference. Perhaps your experience and awareness can help you spot such a change.

Fin.

In case of any errata or follow-up questions, feel free to discuss them in the article responses section. I will update the content accordingly.

Loved the content and want me to write such in-depth articles for your startup website, blog, or documentation? Feel free to hit me up with a proposal at adityavivek.xq@gmail.com.
