
In this blog, we’ll be diving into the details of creating a GPT-like LLM (Large Language Model) from scratch. We’ll focus on coding the architecture, and in the next blog, we’ll move on to training it. This journey will be both enlightening and challenging, but by the end, you’ll have a solid understanding of how these models function.

1 Coding an LLM Architecture

Earlier, we discussed models like GPT and Llama, which generate text word by word, using the decoder part of the original transformer architecture. These models, often called "decoder-like" LLMs, are significantly larger than conventional deep learning models, mainly because they have a massive number of parameters, not because the code itself is lengthy. As we move forward, you’ll notice a lot of repeated elements in the architecture of these LLMs.

In the previous blogs, we used small embedding dimensions for token inputs and outputs to keep things simple and easy to illustrate. Now, we’re stepping it up a notch. We’re going to work with embedding and model sizes similar to a small GPT-2 model, which boasts 124 million parameters. This size is detailed in Radford et al.’s "Language Models are Unsupervised Multitask Learners." The model was initially reported as having 117 million parameters, but the count was later corrected in the model weight repository.

[Figure: BERT and GPT architecture]

For our 124 million parameter GPT-2 model, the configuration looks like this:
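
One way to write this configuration down is as a short Python dictionary (the name GPT_CONFIG_124M is just a label used in this post). The vocabulary size, embedding dimension, dropout rate, and qkv_bias setting come straight from the rundown below; the context length of 1,024 and the 12 heads and 12 layers are the standard GPT-2 "small" settings, filled in here as assumptions.

```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # vocabulary size of the BPE tokenizer
    "context_length": 1024,  # maximum number of input tokens (standard GPT-2 small value)
    "emb_dim": 768,          # embedding dimension for each token
    "n_heads": 12,           # number of attention heads (standard GPT-2 small value)
    "n_layers": 12,          # number of transformer blocks (standard GPT-2 small value)
    "drop_rate": 0.1,        # dropout rate
    "qkv_bias": False,       # no bias in the query/key/value projections
}
```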

Here’s a quick rundown of what each parameter means:

  • vocab_size: A vocabulary of 50,257 tokens, as used by the BPE tokenizer.
  • context_length: This is the maximum number of input tokens the model can handle, made possible by positional embeddings.
  • emb_dim: The size of the embedding for token inputs, converting each token into a 768-dimensional vector.
  • n_heads: The number of attention heads in the multi-head attention mechanism.
  • n_layers: The number of transformer blocks within the model, which we’ll implement shortly.
  • drop_rate: The intensity of the dropout mechanism (0.1 means 10% of hidden units are dropped during training to prevent overfitting).
  • qkv_bias: Determines whether the linear layers in the multi-head attention mechanism should include a bias vector when computing query (Q), key (K), and value (V) tensors. We’ll disable this for now, as it’s a common practice in modern LLMs.
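
As a concrete sketch, such a skeleton might look as follows, using placeholder (identity) transformer blocks and PyTorch's built-in nn.LayerNorm as stand-ins for the versions we implement in the next sections. The class name SimpleGPTModel is just a label for this sketch.

```python
import torch
import torch.nn as nn


class SimpleGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # Token and absolute position embeddings
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        # Placeholder transformer blocks (identity mappings for now)
        self.trf_blocks = nn.Sequential(
            *[nn.Identity() for _ in range(cfg["n_layers"])]
        )
        # Placeholder for the final layer normalization
        self.final_norm = nn.LayerNorm(cfg["emb_dim"])
        # Project the final hidden states back to vocabulary-sized logits
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        seq_len = in_idx.shape[1]
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = self.drop_emb(tok_embeds + pos_embeds)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        return self.out_head(x)
```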

This code initializes a simple GPT model with token and position embeddings, a sequence of transformer blocks, and a final normalization layer before producing the output logits.

[Figure: Big-picture overview showing how the input data is tokenized, embedded, and fed to the GPT model]

2 Normalizing Activations with Layer Normalization

Layer normalization, or LayerNorm, stabilizes the training process by centering the activations of a neural network layer around a mean of 0 and normalizing their variance to 1. This helps in faster convergence and more effective learning.

Here’s a quick demonstration of how layer normalization works:
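
A minimal demonstration along these lines, reusing the imports from the snippet above (the layer sizes and the random seed are illustrative):

```python
torch.manual_seed(123)

# A toy batch: 2 examples with 5 features each, passed through a small layer
batch_example = torch.randn(2, 5)
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)

mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print("Mean before normalization:\n", mean)
print("Variance before normalization:\n", var)

# Normalize: subtract the mean and divide by the standard deviation
out_norm = (out - mean) / torch.sqrt(var)
print("Mean after normalization:\n", out_norm.mean(dim=-1, keepdim=True))
print("Variance after normalization:\n", out_norm.var(dim=-1, keepdim=True))
```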

The printed output shows that, after normalization, each row has a mean of (approximately) 0 and a variance of 1. To avoid division-by-zero errors when the variance is zero, we add a small constant (eps) to the variance before computing its square root. This is crucial when training LLMs, where the embedding dimension is large.

Now, let’s implement a LayerNorm class:
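
A sketch of such a class; the eps value of 1e-5 is a typical choice:

```python
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        # Trainable parameters: scale (gamma) and shift (beta)
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift
```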

This class includes trainable scale and shift parameters that the model can adjust during training to optimize performance.

3 Implementing a Feed Forward Network with GELU Activations


We now create a small neural network submodule that uses the GELU (Gaussian Error Linear Unit) activation, which is commonly used in LLMs because it tends to perform better than the traditional ReLU.

Here’s a simple implementation of GELU:
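
A sketch using the tanh-based approximation of GELU (the same approximation used in the original GPT-2 code); PyTorch's built-in nn.GELU would work just as well:

```python
class GELU(nn.Module):
    def forward(self, x):
        # 0.5 * x * (1 + tanh( sqrt(2/pi) * (x + 0.044715 * x^3) ))
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))
```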

To visualize the difference between GELU and ReLU:
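
A small plotting sketch, assuming matplotlib is installed:

```python
import matplotlib.pyplot as plt

gelu, relu = GELU(), nn.ReLU()
x = torch.linspace(-3, 3, 100)

plt.figure(figsize=(8, 3))
for i, (y, label) in enumerate(zip([gelu(x), relu(x)], ["GELU", "ReLU"]), start=1):
    plt.subplot(1, 2, i)
    plt.plot(x, y)
    plt.title(f"{label} activation function")
    plt.xlabel("x")
    plt.ylabel(f"{label}(x)")
    plt.grid(True)
plt.tight_layout()
plt.show()
```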

[Figure: GELU and ReLU activations]
  • As we can see, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero.
  • GELU is a smooth, non-linear function that approximates ReLU but has a non-zero gradient for almost all negative values (its gradient is zero only near x ≈ -0.75).

Next, let’s implement the FeedForward module:
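
A sketch of the module; the 4x expansion of the hidden layer is the usual GPT-2 choice and an assumption here:

```python
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            # Expand the embedding dimension by a factor of 4, apply GELU, project back
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)
```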

This module will be part of our transformer block later.

[Figure: Visual overview of the connections between the layers of the feed-forward neural network]

4 Adding Shortcut Connections

Shortcut connections, also known as skip or residual connections, help mitigate vanishing gradient problems by creating alternative paths for the gradient to flow through the network.

Here’s a small example network demonstrating shortcut connections:
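
Below is a sketch of such a toy network (reusing the GELU class from above); the class and helper names are just labels for this example. We build the same five-layer network twice, once without and once with shortcut connections, and print the mean absolute gradient of each layer's weights:

```python
class ExampleDeepNN(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Linear(layer_sizes[i], layer_sizes[i + 1]), GELU())
            for i in range(len(layer_sizes) - 1)
        ])

    def forward(self, x):
        for layer in self.layers:
            layer_output = layer(x)
            # Add the layer input back in only when the shapes match
            if self.use_shortcut and x.shape == layer_output.shape:
                x = x + layer_output
            else:
                x = layer_output
        return x


def print_gradients(model, x):
    # Forward pass and a dummy loss against a zero target
    output = model(x)
    loss = nn.MSELoss()(output, torch.zeros_like(output))
    loss.backward()
    for name, param in model.named_parameters():
        if "weight" in name:
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")


layer_sizes = [3, 3, 3, 3, 3, 1]
sample_input = torch.tensor([[1.0, 0.0, -1.0]])

torch.manual_seed(123)
print("Without shortcut connections:")
print_gradients(ExampleDeepNN(layer_sizes, use_shortcut=False), sample_input)

torch.manual_seed(123)
print("\nWith shortcut connections:")
print_gradients(ExampleDeepNN(layer_sizes, use_shortcut=True), sample_input)
```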

Comparing the printed gradient means, the early layers of the network without shortcuts receive much smaller gradients than the later ones, while the version with shortcuts keeps them at a healthy magnitude. Shortcut connections let the gradient bypass individual layers and flow directly from the output back toward the earlier layers during backpropagation, facilitating more effective training.

5 Coding a Transformer Block

A transformer block consists of multiple components: layer normalization, multi-head self-attention, and feed-forward neural networks, all combined with shortcut connections. Here’s the implementation of a transformer block:
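
The sketch below assumes the MultiHeadAttention class from the earlier post on attention mechanisms; the constructor arguments (d_in, d_out, context_length, num_heads, dropout, qkv_bias) reflect that assumed interface, so adjust them to match your own implementation:

```python
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # Assumed interface of the MultiHeadAttention class from the earlier post
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"],
        )
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Attention sub-block: pre-layer-norm, attention, dropout, shortcut
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)
        x = self.drop_shortcut(x)
        x = x + shortcut

        # Feed-forward sub-block: pre-layer-norm, feed forward, dropout, shortcut
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut
        return x
```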

This code initializes a transformer block that processes the input through layer normalization, multi-head self-attention, and a feed-forward neural network, while incorporating dropout for regularization.

[Figure: The transformer architecture]

6 Finalizing Our GPT Model

Now, let’s put all the components together into our final GPT model:
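
The sketch below mirrors the skeleton from section 1, now using the TransformerBlock and LayerNorm classes we just built:

```python
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        # A stack of n_layers transformer blocks
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        seq_len = in_idx.shape[1]
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = self.drop_emb(tok_embeds + pos_embeds)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        return self.out_head(x)  # logits of shape (batch, seq_len, vocab_size)
```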

In this final model, we embed the tokens and positions, apply dropout, pass the result through a series of transformer blocks, and finally normalize the output before projecting it back to the vocabulary size.

In the next post, we will focus on training this model, discussing the challenges and strategies for effectively training large language models.