In this blog, we’ll be diving into the details of creating a GPT-like LLM (Large Language Model) from scratch. We’ll focus on coding the architecture, and in the next blog, we’ll move on to training it. This journey will be both enlightening and challenging, but by the end, you’ll have a solid understanding of how these models function.
1 Coding an LLM Architecture
Earlier, we discussed models like GPT and Llama, which generate text word by word using the decoder part of the original transformer architecture. These models, often called "decoder-only" LLMs, are significantly larger than conventional deep learning models, mainly because they have a massive number of parameters, not because the code itself is lengthy. As we move forward, you'll notice a lot of repeated elements in the architecture of these LLMs.
In the previous blogs, we used small embedding dimensions for token inputs and outputs to keep things simple and easy to illustrate. Now, we're stepping it up a notch: we'll work with embedding and model sizes matching the smallest GPT-2 model, which has 124 million parameters, as described in Radford et al.'s "Language Models are Unsupervised Multitask Learners." (The parameter count was originally reported as 117 million and later corrected in the model weight repository.)
For our 124 million parameter GPT-2 model, the configuration looks like this:
GPT_CONFIG_124M = {
"vocab_size": 50257, # Vocabulary size
"context_length": 1024, # Context length
"emb_dim": 768, # Embedding dimension
"n_heads": 12, # Number of attention heads
"n_layers": 12, # Number of layers
"drop_rate": 0.1, # Dropout rate
"qkv_bias": False # Query-Key-Value bias
}
Here’s a quick rundown of what each parameter means:
- vocab_size: A vocabulary of 50,257 tokens, as used by the BPE tokenizer.
- context_length: The maximum number of input tokens the model can handle; it is set by the size of the positional embedding layer.
- emb_dim: The size of the embedding for token inputs, converting each token into a 768-dimensional vector.
- n_heads: The number of attention heads in the multi-head attention mechanism.
- n_layers: The number of transformer blocks within the model, which we’ll implement shortly.
- drop_rate: The intensity of the dropout mechanism (0.1 means 10% of hidden units are dropped during training to prevent overfitting).
- qkv_bias: Determines whether the linear layers in the multi-head attention mechanism should include a bias vector when computing query (Q), key (K), and value (V) tensors. We’ll disable this for now, as it’s a common practice in modern LLMs.
import torch
import torch.nn as nn
class DummyGPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
self.drop_emb = nn.Dropout(cfg["drop_rate"])
# Use a placeholder for TransformerBlock
self.trf_blocks = nn.Sequential(
*[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])])
# Use a placeholder for LayerNorm
self.final_norm = DummyLayerNorm(cfg["emb_dim"])
self.out_head = nn.Linear(
cfg["emb_dim"], cfg["vocab_size"], bias=False
)
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape
tok_embeds = self.tok_emb(in_idx)
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
x = tok_embeds + pos_embeds
x = self.drop_emb(x)
x = self.trf_blocks(x)
x = self.final_norm(x)
logits = self.out_head(x)
return logits
This code initializes a simple GPT model with token and position embeddings, a sequence of transformer blocks, and a final normalization layer before producing the output logits.
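The DummyTransformerBlock and DummyLayerNorm placeholders referenced above aren't defined yet; the real versions follow in the next sections. If you want to run this skeleton in the meantime, one minimal option is to define them as identity modules that simply pass their input through:

class DummyTransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()

    def forward(self, x):
        # Placeholder: returns the input unchanged
        return x


class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()

    def forward(self, x):
        # Placeholder: mimics the LayerNorm interface but does nothing yet
        return x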
2 Normalizing Activations with Layer Normalization
Layer normalization, or LayerNorm, stabilizes the training process by centering the activations of a neural network layer around a mean of 0 and normalizing their variance to 1. This helps in faster convergence and more effective learning.
Here’s a quick demonstration of how layer normalization works:
torch.manual_seed(123)
# Create 2 training examples with 5 dimensions (features) each
batch_example = torch.randn(2, 5)
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)
# Compute mean and variance for each input
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)
# Normalize the features
out_norm = (out - mean) / torch.sqrt(var)
print("Normalized layer outputs:\n", out_norm)
Output:
tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
[0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
grad_fn=<ReluBackward0>)
Mean:
tensor([[0.1324],
[0.2170]], grad_fn=<MeanBackward1>)
Variance:
tensor([[0.0231],
[0.0398]], grad_fn=<VarBackward0>)
Normalized layer outputs:
tensor([[ 0.6159, 1.4126, -0.8719, 0.5872, -0.8719, -0.8719],
[-0.0189, 0.1121, -1.0876, 1.5173, 0.5647, -1.0876]],
grad_fn=<DivBackward0>)
After normalization, each row is centered around a mean of 0 and has a variance of 1. To avoid division-by-zero errors when the variance is zero, we add a small constant (eps) to the variance before taking the square root; the quick demonstration above omits this, but the LayerNorm class below includes it. This detail matters when training LLMs, where the embedding dimension is large.
Now, let’s implement a LayerNorm class:
class LayerNorm(nn.Module):
def __init__(self, emb_dim):
super().__init__()
self.eps = 1e-5
self.scale = nn.Parameter(torch.ones(emb_dim))
self.shift = nn.Parameter(torch.zeros(emb_dim))
def forward(self, x):
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
norm_x = (x - mean) / torch.sqrt(var + self.eps)
return self.scale * norm_x + self.shift
This class includes trainable scale and shift parameters that the model can adjust during training to optimize performance.
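To confirm the class works as intended, we can apply it to the batch_example from the demonstration above and check that each row again has a mean of approximately 0 and a variance of approximately 1:

ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
# Both should be ~0 and ~1, respectively (up to floating-point error)
print("Mean:\n", out_ln.mean(dim=-1, keepdim=True))
print("Variance:\n", out_ln.var(dim=-1, unbiased=False, keepdim=True))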
3 Implementing a Feed Forward Network with GELU Activations
We now create a small neural network submodule that uses GELU (Gaussian Error Linear Unit) activations, which are commonly used in LLMs and tend to perform better than the traditional ReLU.
Here's a simple implementation of GELU, using the common tanh-based approximation:
class GELU(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return 0.5 * x * (1 + torch.tanh(
torch.sqrt(torch.tensor(2.0 / torch.pi)) *
(x + 0.044715 * torch.pow(x, 3))
))
To visualize the difference between GELU and ReLU:
import matplotlib.pyplot as plt
gelu, relu = GELU(), nn.ReLU()
# Some sample data
x = torch.linspace(-3, 3, 100)
y_gelu, y_relu = gelu(x), relu(x)
plt.figure(figsize=(8, 3))
for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), 1):
plt.subplot(1, 2, i)
plt.plot(x, y)
plt.title(f"{label} activation function")
plt.xlabel("x")
plt.ylabel(f"{label}(x)")
plt.grid(True)
plt.tight_layout()
plt.show()
- As we can see, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero
- GELU is a smooth, non-linear function that approximates ReLU but with a non-zero gradient for negative values (except at approximately -0.75)
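As a quick sanity check, and assuming a reasonably recent PyTorch (1.12 or later, which added the approximate="tanh" option), we can verify that our hand-written approximation agrees with PyTorch's built-in GELU:

x = torch.linspace(-3, 3, 100)
diff = (GELU()(x) - torch.nn.functional.gelu(x, approximate="tanh")).abs().max()
print(diff)  # essentially zero -- only floating-point rounding separates the two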
Next, let’s implement the FeedForward module:
class FeedForward(nn.Module):
def __init__(self, cfg):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
GELU(),
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
)
def forward(self, x):
return self.layers(x)
This module will be part of our transformer block later.
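A quick shape check (reusing GPT_CONFIG_124M from above) confirms that the module expands to 4 * emb_dim internally but returns outputs with the same shape as its inputs, which is what lets us stack it inside the transformer block:

ffn = FeedForward(GPT_CONFIG_124M)
x = torch.rand(2, 3, 768)   # 2 sequences, 3 tokens each, 768-dimensional embeddings
out = ffn(x)
print(out.shape)            # torch.Size([2, 3, 768]) -- the input shape is preserved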
4 Adding Shortcut Connections
Shortcut connections, also known as skip or residual connections, help mitigate vanishing gradient problems by creating alternative paths for the gradient to flow through the network.
Here’s a small example network demonstrating shortcut connections:
class ExampleDeepNeuralNetwork(nn.Module):
def __init__(self, layer_sizes, use_shortcut):
super().__init__()
self.use_shortcut = use_shortcut
self.layers = nn.ModuleList([
nn.Sequential(nn.Linear(layer_sizes[i], layer_sizes[i+1]), GELU())
for i in range(len(layer_sizes) - 1)
])
def forward(self, x):
for i, layer in enumerate(self.layers):
out = layer(x)
            # Apply the shortcut from the second layer onward
            # (a more general check would compare x.shape == out.shape)
            if self.use_shortcut and i > 0:
x = x + out
else:
x = out
return x
cfg = {
    "emb_dim": 5,
    "batch_size": 3,
    "num_inputs": 5   # Input feature size; matches the first layer of the network
}
example_input = torch.randn(cfg["batch_size"], cfg["num_inputs"])
layer_sizes = [cfg["num_inputs"]] + [cfg["emb_dim"]] * 5
deep_net = ExampleDeepNeuralNetwork(layer_sizes, use_shortcut=False)
deep_net_with_shortcuts = ExampleDeepNeuralNetwork(layer_sizes, use_shortcut=True)
output_no_shortcut = deep_net(example_input)
output_with_shortcut = deep_net_with_shortcuts(example_input)
print("Output without shortcut:", output_no_shortcut)
print("Output with shortcut:", output_with_shortcut)
Output:
Output without shortcut: tensor([[ 0.0659, 0.1414, 0.2212, 0.2431, -0.0685],
[ 0.0605, 0.1520, 0.2194, 0.2419, -0.0649],
[ 0.0572, 0.1572, 0.2164, 0.2449, -0.0585]], grad_fn=<MulBackward0>)
Output with shortcut: tensor([[ 0.2874, 0.2718, 0.1951, -0.4061, 0.1001],
[ 0.1418, -0.3371, -0.4265, -0.3000, -0.0655],
[ 0.6260, 0.1804, -0.3444, 0.2998, 0.1010]], grad_fn=<AddBackward0>)
Using shortcut connections allows the gradient to flow directly from the input to the output, facilitating more effective training.
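The printed outputs alone don't show why shortcuts help. One way to see the effect (a small sketch, reusing the two networks above) is to run a backward pass with a dummy loss and compare gradient magnitudes: without shortcuts, the gradients in the earliest layers tend to be noticeably smaller.

def print_gradients(model, x):
    # Forward pass, dummy MSE loss against a zero target, then backward pass
    output = model(x)
    loss = nn.MSELoss()(output, torch.zeros_like(output))
    loss.backward()
    # Report the mean absolute weight gradient of each Linear layer
    for name, param in model.named_parameters():
        if "weight" in name:
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")

print_gradients(deep_net, example_input)
print_gradients(deep_net_with_shortcuts, example_input)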
5 Coding a Transformer Block
A transformer block consists of multiple components: layer normalization, multi-head self-attention, and feed-forward neural networks, all combined with shortcut connections. Here’s the implementation of a transformer block:
class TransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
self.ln1 = LayerNorm(cfg["emb_dim"])
self.ln2 = LayerNorm(cfg["emb_dim"])
self.attn = MultiHeadAttention(cfg)
self.ffn = FeedForward(cfg)
self.drop = nn.Dropout(cfg["drop_rate"])
def forward(self, x):
attn_out = x + self.drop(self.attn(self.ln1(x)))
ffn_out = attn_out + self.drop(self.ffn(self.ln2(attn_out)))
return ffn_out
This code initializes a transformer block that processes the input through layer normalization, multi-head self-attention, and a feed-forward neural network, while incorporating dropout for regularization.
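The block assumes a MultiHeadAttention module that accepts the cfg dict and maps an input of shape (batch, tokens, emb_dim) to an output of the same shape. If you don't already have an implementation from an earlier post (whose interface may differ, for example taking explicit d_in, context_length, and num_heads arguments), here is a minimal, self-contained sketch of causal multi-head self-attention that matches the call above:

class MultiHeadAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        assert cfg["emb_dim"] % cfg["n_heads"] == 0, "emb_dim must be divisible by n_heads"
        self.n_heads = cfg["n_heads"]
        self.head_dim = cfg["emb_dim"] // cfg["n_heads"]
        self.W_query = nn.Linear(cfg["emb_dim"], cfg["emb_dim"], bias=cfg["qkv_bias"])
        self.W_key = nn.Linear(cfg["emb_dim"], cfg["emb_dim"], bias=cfg["qkv_bias"])
        self.W_value = nn.Linear(cfg["emb_dim"], cfg["emb_dim"], bias=cfg["qkv_bias"])
        self.out_proj = nn.Linear(cfg["emb_dim"], cfg["emb_dim"])
        self.dropout = nn.Dropout(cfg["drop_rate"])
        # Causal mask: True above the diagonal marks future positions to hide
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(cfg["context_length"], cfg["context_length"]), diagonal=1).bool()
        )

    def forward(self, x):
        b, t, d = x.shape
        # Project inputs and split into heads: (b, n_heads, t, head_dim)
        q = self.W_query(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention with the causal mask applied
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        # Recombine the heads and project back to the embedding dimension
        context = (weights @ v).transpose(1, 2).contiguous().view(b, t, d)
        return self.out_proj(context)

The causal mask ensures that each position can only attend to itself and earlier positions, which is what makes the model suitable for left-to-right text generation.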
6 Finalizing Our GPT Model
Now, let’s put all the components together into our final GPT model:
class GPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
self.drop_emb = nn.Dropout(cfg["drop_rate"])
self.blocks = nn.Sequential(*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
self.ln_f = LayerNorm(cfg["emb_dim"])
self.head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
def forward(self, idx):
b, t = idx.size()
token_embeddings = self.tok_emb(idx)
position_embeddings = self.pos_emb(torch.arange(t, device=idx.device))
x = self.drop_emb(token_embeddings + position_embeddings)
x = self.blocks(x)
x = self.ln_f(x)
logits = self.head(x)
return logits
In this final model, we embed the tokens and positions, apply dropout, pass the result through a series of transformer blocks, and finally normalize the output before projecting it back to the vocabulary size.
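As a quick sanity check (assuming a working MultiHeadAttention, such as the sketch above), we can instantiate the model with GPT_CONFIG_124M and confirm the output shape and parameter count:

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
# A batch of 2 sequences with 4 token IDs each (random IDs are enough for a shape check)
batch = torch.randint(0, GPT_CONFIG_124M["vocab_size"], (2, 4))
logits = model(batch)
print(logits.shape)  # torch.Size([2, 4, 50257]): one logit per vocabulary entry, per position
# Roughly 163 million parameters here, because the output head is a separate weight
# matrix; the original GPT-2 ties it to the token embedding, giving the reported 124 million
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")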
In the next post, we will focus on training this model, discussing the challenges and strategies for effectively training large language models.