
In this blog post, we will dive into the process of pretraining a Large Language Model (LLM) using unlabeled data. We will implement the training loop and basic model evaluation code necessary to train our model from scratch. Additionally, we’ll explore how to leverage openly available pretrained weights from OpenAI to enhance our model’s performance. So let’s get started!

The topics covered in this blog post are shown below:

Figure 1: Shows the topics that we will cover in this blog post

To begin, it’s essential to ensure that we have the right packages installed and up-to-date. Here’s a quick check of the versions of the key libraries we will use:
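
One way to do this, sketched below, is via importlib.metadata. The exact package list (torch, tiktoken, matplotlib, numpy, tensorflow) is an assumption based on the libraries used later in this post, so adjust it to your environment:

```python
# Print the installed versions of the key libraries.
# The package list here is an assumption -- adjust it to your own environment.
from importlib.metadata import version

packages = ["torch", "tiktoken", "matplotlib", "numpy", "tensorflow"]
for pkg in packages:
    print(f"{pkg} version: {version(pkg)}")
```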

OUTPUT

These versions ensure compatibility with our code and facilitate smooth execution of the training and evaluation processes.

1 Evaluating Generative Text Models

In this section, we’ll begin by revisiting how to initialize a GPT model, using the code we covered in the previous blog post. We’ll then explore basic evaluation metrics for Large Language Models (LLMs) and apply these metrics to both training and validation datasets.

1.1 Using GPT to Generate Text

Let’s start by initializing a GPT model using the configuration from the previous blog:
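
A minimal sketch of this initialization is shown below. The configuration keys and the GPTModel class are assumed to match the implementation from the previous blog post; if your definitions differ, adapt the names accordingly:

```python
import torch
from previous_blog import GPTModel  # hypothetical import path; use your own GPTModel definition

# Configuration for a GPT-2 (124M)-sized model with a shortened context length
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # size of the GPT-2 BPE vocabulary
    "context_length": 256,   # shortened from 1024 tokens to save compute
    "emb_dim": 768,          # embedding dimension
    "n_heads": 12,           # number of attention heads
    "n_layers": 12,          # number of transformer blocks
    "drop_rate": 0.1,        # dropout rate
    "qkv_bias": False        # no bias in the query/key/value projections
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()  # disable dropout for deterministic inference
```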

We use a dropout rate of 0.1 here, though it’s increasingly common to train LLMs without dropout. Additionally, modern LLMs typically do not use bias vectors in the nn.Linear layers for the query, key, and value matrices, which is achieved by setting "qkv_bias": False.

To reduce computational resource requirements, we’ve set the context length to 256 tokens, compared to the original 124 million parameter GPT-2 model, which used 1024 tokens. This setup makes it easier for readers to execute the code on a standard laptop. However, you can increase the context_length to 1024 tokens without changing any other code.

Next, let’s generate text using the generate_text_simple function from the previous blog. We also define two utility functions, text_to_token_ids and token_ids_to_text, to convert between text and token representations:

Figure 2: Shows the text_to_token_ids and token_ids_to_text functions for converting between token and text representations that we use throughout this blog.
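
Here is a sketch of the two helpers together with a first generation call. The generate_text_simple signature (model, idx, max_new_tokens, context_size) is assumed from the previous blog post, and the prompt "Every effort moves you" is simply the running example used throughout this post:

```python
import torch
import tiktoken
from previous_blog import generate_text_simple  # hypothetical import path

def text_to_token_ids(text, tokenizer):
    # Encode the text and add a batch dimension
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    return torch.tensor(encoded).unsqueeze(0)

def token_ids_to_text(token_ids, tokenizer):
    # Remove the batch dimension and decode back into a string
    return tokenizer.decode(token_ids.squeeze(0).tolist())

tokenizer = tiktoken.get_encoding("gpt2")
start_context = "Every effort moves you"

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
```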

OUTPUT

As seen above, the model does not produce coherent text since it hasn’t been trained yet. To evaluate and track the training progress, we need to measure how “good” the generated text is in numerical terms. The next subsection introduces a way to calculate a loss value for the generated outputs.

1.2 Calculating the Text Generation Loss: Cross-Entropy and Perplexity

Consider the following tensors, representing token IDs for two training examples:
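
The exact token IDs from the original example aren’t reproduced here; the placeholder values below simply illustrate the shape of the data, with the targets being the inputs shifted one position to the right:

```python
import torch

# Placeholder token IDs for two training examples (batch_size=2, num_tokens=3).
# The targets are the inputs shifted by one position, as usual in language modeling.
inputs = torch.tensor([[16833, 3626, 6100],
                       [   40, 1107,  588]])

targets = torch.tensor([[3626, 6100,   345],
                        [1107,  588, 11311]])
```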

Feeding the inputs into the model, we get the logits vector for these input examples:
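
Something along these lines, continuing from the tensors above (model is the untrained GPTModel instance from Section 1.1):

```python
with torch.no_grad():  # not training yet, so gradients are unnecessary
    logits = model(inputs)

# Turn the logits into a probability distribution over the vocabulary
probas = torch.softmax(logits, dim=-1)
print(probas.shape)  # (batch_size, num_tokens, vocab_size)
```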

OUTPUT

As discussed earlier, applying the argmax function converts the probability scores into predicted token IDs:

Figure 3: Outlines how we convert the probability scores back into text
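
Continuing from the snippet above, a sketch of that conversion looks like this:

```python
# Pick the most likely token at each position
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)

# Decode targets and predictions of the first example back to text for comparison
print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")
```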

OUTPUT

Comparing these predictions to the target tokens shows a significant mismatch, as the model hasn’t been trained yet. To train the model, we must calculate how far off it is from the correct predictions.

Next, we compute the average log probability:
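
A sketch of that computation: we pick out the probabilities the model assigns to the correct target tokens, take their logarithm, and average:

```python
# Probabilities assigned to the correct target tokens of each example
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]

text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]

# Take the logarithm of all target probabilities and average them
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)
```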

OUTPUT

In deep learning, instead of maximizing the average log-probability, it’s standard to minimize the negative average log-probability value. This value is also called cross-entropy loss:
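
In practice we don’t compute this by hand; PyTorch’s cross_entropy function does the same thing once the batch and sequence dimensions are flattened:

```python
# PyTorch's cross_entropy expects (N, C) logits and (N,) targets, so flatten first
logits_flat = logits.flatten(0, 1)   # (batch_size * num_tokens, vocab_size)
targets_flat = targets.flatten()     # (batch_size * num_tokens,)

loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)  # matches the negative average log-probability computed above
```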

OUTPUT

A related concept is perplexity, which is simply the exponential of the cross-entropy loss:
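
In code, this is a one-liner:

```python
# Perplexity is the exponential of the cross-entropy loss
perplexity = torch.exp(loss)
print(perplexity)
```

Intuitively, perplexity can be read as the effective number of vocabulary entries the model is still undecided between at each step.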

OUTPUT

A lower perplexity indicates that the model predictions are closer to the actual distribution, making perplexity a useful metric for evaluating model quality.

1.3 Calculating the Training and Validation Set Losses

We will use a small dataset to train the LLM, specifically a short public domain text. This allows you to run the examples quickly on a standard laptop without the need for extensive computational resources.

First, let’s load the dataset:
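
A sketch of the loading step; the filename below is a placeholder, so point it at whichever public domain text file you are using:

```python
# Read the raw text; "short_story.txt" is a placeholder filename
file_path = "short_story.txt"
with open(file_path, "r", encoding="utf-8") as f:
    text_data = f.read()
```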

We check that the text loaded correctly by printing the first and last 100 words:
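
For example, comparing the first and last 100 words:

```python
words = text_data.split()
print(" ".join(words[:100]))   # first 100 words
print(" ".join(words[-100:]))  # last 100 words
```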

OUTPUT

Next, let’s divide the dataset into training and validation sets, then use data loaders to prepare batches for LLM training:
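
The sketch below uses a 90/10 split and a batch size of 2; both are illustrative choices, and create_dataloader_v1 is assumed to be the data loader helper from the previous blog post:

```python
from previous_blog import create_dataloader_v1  # hypothetical import path

# Use 90% of the text for training and the remaining 10% for validation
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)
val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)
```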

Next, we implement utility functions to calculate the cross-entropy loss for a given batch and for a specified number of batches in a data loader:
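
One possible implementation of these two helpers (the names calc_loss_batch and calc_loss_loader are chosen here for illustration):

```python
def calc_loss_batch(input_batch, target_batch, model, device):
    # Move the batch to the target device and compute the cross-entropy loss
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss

def calc_loss_loader(data_loader, model, device, num_batches=None):
    # Average the loss over a given number of batches (or the whole loader)
    total_loss = 0.0
    if len(data_loader) == 0:
        return float("nan")
    if num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches
```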

If you have a CUDA-supported GPU, the LLM will train on the GPU without requiring any code changes. Here’s how to calculate the training and validation losses:
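
For example, using torch.cuda.is_available() to pick the device:

```python
# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

with torch.no_grad():  # no gradients needed for evaluation
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)
```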

OUTPUT

The values are initially high since the model hasn’t been trained yet. After training, these loss values should decrease significantly.

Figure 4: Shows that we have dealt with the initial three necessary steps; next, we will work on the training function

2 Training an LLM

In this section, we finally implement the code for training the LLM. We’ll focus on a simple training function.

Figure 5: Shows the step-by-step process of the training function of our LLM

2.1 Simple Training Function

Let’s start by defining a simple training function:
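
Below is a sketch of what such a training function can look like. It relies on calc_loss_batch from Section 1.3 and on the evaluate_model and generate_and_print_sample helpers defined in the next subsection:

```python
def train_model_simple(model, train_loader, val_loader, optimizer, device,
                       num_epochs, eval_freq, eval_iter, start_context, tokenizer):
    # Track losses and the number of tokens seen during training
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    for epoch in range(num_epochs):
        model.train()
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()                # reset gradients from the previous step
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()                      # compute gradients
            optimizer.step()                     # update model weights
            tokens_seen += input_batch.numel()
            global_step += 1

            # Periodically evaluate on the training and validation sets
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample of generated text after each epoch
        generate_and_print_sample(model, tokenizer, device, start_context)

    return train_losses, val_losses, track_tokens_seen
```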

This function is simple yet effective for educational purposes. It tracks the training and validation losses, evaluates the model at regular intervals, and generates a sample text after each epoch.

2.2 Evaluation and Sample Generation

Next, let’s define the helper functions to evaluate the model and generate sample text:
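
A possible implementation is sketched below; note that generate_and_print_sample reads the context size from model.pos_emb, which assumes the GPTModel implementation from the previous blog post:

```python
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    # Switch to eval mode (disables dropout) and compute losses without gradients
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss

def generate_and_print_sample(model, tokenizer, device, start_context):
    # Generate a short text sample from the current model state
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text_simple(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size)
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # print on a single line
    model.train()
```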

These functions support model evaluation during training and help visualize progress by generating text samples after each epoch.

2.3 Training the Model

Now, let’s train the LLM using the training function defined above:
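
For example, with AdamW and ten epochs (the learning rate, weight decay, and evaluation frequency below are illustrative choices, not tuned values):

```python
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)
```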

OUTPUT

As the model trains, you can see the training and validation losses decrease. The generated text becomes more coherent, though overfitting is evident due to the small dataset and extensive training.

2.4 Visualizing Training Progress

Finally, let’s visualize the training and validation losses:
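
A small matplotlib helper along these lines produces the plot in Figure 6, with epochs on the lower x-axis and the number of tokens seen on the upper one:

```python
import matplotlib.pyplot as plt

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))

    # Plot losses against epochs on the primary x-axis
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")

    # Add a secondary x-axis showing the number of tokens seen
    ax2 = ax1.twiny()
    ax2.plot(tokens_seen, train_losses, alpha=0)  # invisible plot to align the axes
    ax2.set_xlabel("Tokens seen")

    fig.tight_layout()
    plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
```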

Figure 6: This plot shows the training and validation losses over time. You can observe that while the training loss decreases consistently, the validation loss starts increasing, indicating overfitting.
Figure 7: Shows the steps we have covered so far; now let’s move on to the next step

3 Decoding Strategies to Control Randomness

Inference with a relatively small LLM, like the GPT model we trained above, is computationally inexpensive. Even if you used a GPU for training, inference can be comfortably performed on a CPU. Using the generate_text_simple function from the previous blog, we can generate new text one word (or token) at a time.

As explained in Section 1.2, the next generated token is the one corresponding to the highest probability score among all tokens in the vocabulary. Let’s demonstrate this by moving our model to the CPU and generating text:
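
For example, reusing generate_text_simple and the conversion helpers from Section 1.1:

```python
model.to("cpu")
model.eval()  # disable dropout during inference

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
```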

OUTPUT

Even if we execute the generate_text_simple function multiple times, the LLM will always generate the same outputs. This is because the function is deterministic, always selecting the token with the highest probability. To introduce variability and control the randomness of the generated text, we can employ two decoding strategies: temperature scaling and top-k sampling.

3.1 Temperature Scaling

Previously, we always selected the token with the highest probability using torch.argmax. To add variety, we can sample the next token using torch.multinomial(probs, num_samples=1), which samples from the probability distribution provided by the softmax function. In this context, each index’s chance of being picked corresponds to its probability in the input tensor.

Let’s recap how to generate the next token, assuming a very small vocabulary for illustration purposes:
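
The toy vocabulary and logit values below are made up for illustration; what matters is that "forward" receives the highest logit:

```python
# A toy vocabulary for illustration; the words and logit values are made up
vocab = {"closer": 0, "every": 1, "effort": 2, "forward": 3,
         "inches": 4, "moves": 5, "pizza": 6, "toward": 7, "you": 8}
inverse_vocab = {v: k for k, v in vocab.items()}

# Hypothetical logits the model might produce for the next token
next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)

probas = torch.softmax(next_token_logits, dim=0)
next_token_id = torch.argmax(probas).item()
print(inverse_vocab[next_token_id])  # "forward" -- the highest-probability token
```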

Instead of determining the most likely token via torch.argmax, we can use torch.multinomial(probas, num_samples=1) to determine the next token by sampling from the softmax distribution. Here’s how this approach works when we sample the next token 1,000 times using the original softmax probabilities:
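
Continuing with the toy example, a small helper can tally how often each token is drawn:

```python
# Sample the next token 1,000 times from the softmax distribution and count the results
def print_sampled_tokens(probas):
    torch.manual_seed(123)
    sample = [torch.multinomial(probas, num_samples=1).item() for _ in range(1_000)]
    sampled_ids = torch.bincount(torch.tensor(sample), minlength=len(probas))
    for i, freq in enumerate(sampled_ids):
        print(f"{freq} x {inverse_vocab[i]}")

print_sampled_tokens(probas)
```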

OUTPUT

The next token distribution favors “forward” as expected, but there’s still some variability. We can further control this distribution using temperature scaling.

Temperature scaling adjusts the logits by dividing them by a number greater than 0, called the temperature. Here’s what happens:

  • Temperature > 1: The resulting probabilities are more uniformly distributed, leading to more random and diverse output.
  • Temperature < 1: The resulting probabilities become more peaked, making the model more confident in its choices, reducing diversity.

Let’s see this in action:
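
A sketch of temperature scaling applied to the toy example, plotting the distributions for temperatures of 1, 0.1, and 5:

```python
import matplotlib.pyplot as plt

def softmax_with_temperature(logits, temperature):
    # Divide the logits by the temperature before applying the softmax
    return torch.softmax(logits / temperature, dim=0)

# Compare the original distribution (T=1) with a sharper (T=0.1) and a flatter (T=5) one
temperatures = [1, 0.1, 5]
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]

# Plot the three distributions side by side
x = torch.arange(len(vocab))
bar_width = 0.15
fig, ax = plt.subplots(figsize=(5, 3))
for i, T in enumerate(temperatures):
    ax.bar(x + i * bar_width, scaled_probas[i], bar_width, label=f"Temperature = {T}")
ax.set_ylabel("Probability")
ax.set_xticks(x)
ax.set_xticklabels(vocab.keys(), rotation=90)
ax.legend()
plt.tight_layout()
plt.show()
```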

Figure 8: This plot demonstrates how temperature scaling affects the probability distribution. A temperature of 0.1 sharpens the distribution, making “forward” almost always the selected token:

OUTPUT

A temperature of 5 results in a more uniform distribution:

OUTPUT

This approach can lead to nonsensical outputs, such as “every effort moves you pizza” 3.2% of the time (32 out of 1000 times). To balance diversity and coherence, we can combine temperature scaling with top-k sampling.

3.2 Top-k Sampling

Top-k sampling restricts the model to only sample from the top k most likely tokens, reducing the probability of nonsensical outputs while allowing for diverse text generation. Here’s how to implement it:
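
Using the toy logits from above with k=3, the idea looks like this: keep the three largest logits, set everything else to negative infinity, and re-apply the softmax:

```python
top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)

# Set every logit below the k-th largest one to -inf so softmax assigns it zero probability
new_logits = torch.where(
    next_token_logits < top_logits[-1],
    torch.tensor(float("-inf")),
    next_token_logits,
)

topk_probas = torch.softmax(new_logits, dim=0)
print(topk_probas)  # all non-top-k positions now have probability 0
```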

This method ensures the model only considers the top 3 tokens, further controlling randomness.

3.3 Modifying the Text Generation Function

Let’s combine temperature scaling and top-k sampling to modify the generate_text_simple function used to generate text earlier:
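
A sketch of such a combined generate function is shown below; it follows the same loop as generate_text_simple but adds optional top-k filtering, temperature-scaled sampling, and an optional end-of-sequence token (eos_id) for early stopping:

```python
def generate(model, idx, max_new_tokens, context_size,
             temperature=0.0, top_k=None, eos_id=None):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]     # crop to the supported context size
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]             # focus on the last time step

        if top_k is not None:
            # Mask out everything below the k-th largest logit
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1:]
            logits = torch.where(
                logits < min_val,
                torch.tensor(float("-inf"), device=logits.device),
                logits,
            )

        if temperature > 0.0:
            # Sample from the temperature-scaled distribution
            logits = logits / temperature
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
        else:
            # Fall back to greedy (argmax) decoding
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)

        if eos_id is not None and (idx_next == eos_id).all():
            break  # stop early if an end-of-sequence token was generated

        idx = torch.cat((idx, idx_next), dim=1)

    return idx
```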

Let’s generate text with the modified function:
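
For instance (the top_k and temperature values are illustrative):

```python
torch.manual_seed(123)
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=15,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=25,
    temperature=1.4
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
```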

OUTPUT

By tweaking the temperature and top-k parameters, you can effectively control the randomness and diversity of the generated text, balancing between coherent and diverse outputs.

Figure 9: We are done with the text generation strategies as well; let’s jump to the next step now

4 Loading and Saving Model Weights in PyTorch

Training Large Language Models (LLMs) is computationally intensive, making it essential to save and load model weights efficiently. In PyTorch, the recommended way to save the model weights is by using the torch.save function in conjunction with the .state_dict() method, which captures the model’s parameters.

Here’s how you can save the model weights:
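
With our trained model, that looks like this (the filename is just an example):

```python
# Save only the learned parameters, not the full model object
torch.save(model.state_dict(), "model.pth")
```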

To load the saved model weights into a new instance of the GPTModel, follow these steps:
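
The key point is that the architecture must be recreated first, and the state dict is then loaded into it:

```python
# Recreate the architecture, then load the saved parameters into it
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth", map_location=device))
model.eval()  # switch to inference mode (disables dropout)
```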

When training LLMs, it’s common to use adaptive optimizers like Adam or AdamW instead of standard SGD. These optimizers store additional parameters for each model weight, so it’s wise to save the optimizer state along with the model weights if you plan to resume training later:
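
For example, both state dicts can be bundled into a single checkpoint file:

```python
# Bundle the model weights and the optimizer state into one checkpoint file
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    },
    "model_and_optimizer.pth"
)
```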

To load both the model and optimizer states:
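
And to restore both from the checkpoint (using the same illustrative optimizer settings as before):

```python
checkpoint = torch.load("model_and_optimizer.pth", map_location=device)

model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])

# Recreate the optimizer with the same settings, then restore its state
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train()  # ready to resume training
```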

5 Loading Pretrained Weights from OpenAI

Pretraining LLMs from scratch can be prohibitively expensive, but fortunately, OpenAI provides pretrained weights for their models. This allows users to leverage the power of pretrained LLMs without the associated computational costs.

To load the pretrained weights provided by OpenAI, some boilerplate code is necessary. Since OpenAI originally used TensorFlow, you’ll need to install TensorFlow along with the tqdm progress bar library:

To download the model weights for the 124 million parameter GPT-2 model:
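
The helper used below is an assumption: it stands in for a small download utility (here called download_and_load_gpt2 in a hypothetical gpt_download module) that fetches the TensorFlow checkpoint files and returns the settings and parameters as Python dictionaries. Substitute whatever utility your setup provides:

```python
from gpt_download import download_and_load_gpt2  # hypothetical helper module

settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")
print("Settings:", settings)
print("Parameter dictionary keys:", params.keys())
```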

After downloading the weights, initialize a new GPTModel instance. Note that to correctly load the weights, the model configuration must match the original model, including setting the qkv_bias to True and using a 1024 token context length:
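
A sketch of that, reusing the earlier GPT_CONFIG_124M:

```python
# Start from our earlier config and adjust it to match the original GPT-2 (124M) model
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update({"context_length": 1024, "qkv_bias": True})

gpt = GPTModel(NEW_CONFIG)
gpt.eval()
```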

Next, map the OpenAI weights to the corresponding tensors in your GPTModel instance:
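
The full mapping touches every transformer block, so only the idea is sketched here. The attribute names (tok_emb, pos_emb, out_head, ...) and the parameter keys ("wte", "wpe") are assumptions based on the GPTModel implementation from the previous blog post and the structure of the downloaded parameter dictionary; adjust both to your own code:

```python
def assign(left, right):
    # Replace a parameter tensor with the pretrained values, checking shapes first
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))

# Token and positional embeddings (keys and attribute names are assumptions, see above)
gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params["wpe"])
gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params["wte"])

# ... the attention, feed-forward, and layer-norm parameters of every transformer
# block are copied over in the same fashion, followed by the final layer norm ...

# GPT-2 ties the output head to the token embedding matrix
gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])

gpt.to(device)
```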

Finally, you can generate new text using the loaded model:
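
For example, using the generate function from Section 3.3 (the top_k and temperature values are again illustrative):

```python
torch.manual_seed(123)
token_ids = generate(
    model=gpt,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=NEW_CONFIG["context_length"],
    top_k=50,
    temperature=1.5
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
```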

OUTPUT

If the weights were loaded correctly, the model should generate coherent text, confirming the successful weight transfer.

That is it for today…