
We’ve explored the world of tokenization, transforming text into a format understandable by Large Language Models (LLMs). Now, we embark on a critical step: generating input-target pairs, the fuel for LLM training.

Understanding Input-Target Pairs

Recall from Episode 1 that LLMs excel at predicting the next word in a sequence. Input-target pairs provide them with bite-sized examples to learn from. Each pair consists of two parts:

  • Input (x): This represents a sequence of tokens, like a mini-sentence (e.g., "the quick brown fox").
  • Target (y): This is the next word that follows the input sequence (e.g., "jumps").

By creating numerous input-target pairs using a sliding window approach, we provide the LLM with diverse contexts to learn from. Imagine an LLM being shown flashcards with sentence snippets and their missing next words. These pairs progressively equip the LLM to grasp the flow and logic of language.

Looking Ahead

With these input-target pairs in hand, the next chapter will introduce embeddings – a powerful tool for bridging the gap between discrete tokens and the LLM’s internal understanding of language. Embeddings will transform tokens into numerical vectors, allowing the LLM to process and analyze language in a way more akin to the human brain. Stay tuned for this exciting revelation!

Figure 1: Extract input blocks from a text sample and use them as subsamples to feed the LLM. The LLM's training job is to predict the word that comes after the input block. All words past the target are masked out during training. Note that the text displayed in this image would first need to be tokenized before the LLM could handle it.

The foundation has been established: tokenization and the generation of input-target pairs, two crucial components of LLM training, have now been discussed. The next phase puts the intriguing idea of embeddings to use.

In this section, we design a data loader that uses a sliding window technique to extract the input-target pairs shown in Figure 1 from the training dataset.

After applying the BPE tokenizer to the training text, we obtain 5145 tokens in total. For demonstration purposes, we then remove the first 50 tokens from the dataset, because doing so produces a slightly more interesting text passage in the subsequent steps:
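The sketch below shows one way this step could look. It assumes the training text lives in a local file (the file name the-verdict.txt is only an illustrative placeholder) and uses the tiktoken library's GPT-2 encoding as the BPE tokenizer:

```python
import tiktoken

# Load the raw training text (file name is illustrative)
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# BPE tokenizer (GPT-2 vocabulary) used throughout this section
tokenizer = tiktoken.get_encoding("gpt2")

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))  # total number of tokens in the training set

# Remove the first 50 tokens for a slightly more interesting passage
enc_sample = enc_text[50:]
```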

One of the simplest and most natural ways to generate the input-target pairs for the next-word prediction task is to create two variables, x and y, where x holds the input tokens and y holds the targets, which are the inputs shifted by one position:
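A minimal sketch, assuming enc_sample holds the token IDs from the previous step and using a context size of 4:

```python
context_size = 4  # number of tokens the model sees in one input

x = enc_sample[:context_size]        # input tokens
y = enc_sample[1:context_size + 1]   # targets: the inputs shifted by one position

print(f"x: {x}")
print(f"y:      {y}")
```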

When the code above is run, it prints the token IDs in x followed by those in y, and you can see that y is simply x shifted forward by one position in the text.

The next-word prediction tasks shown before in Figure 1 can then be created by processing the inputs along with the targets, which are the inputs shifted by one position, as follows:
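One way to spell out these prediction tasks is the loop sketched below (again assuming enc_sample and context_size from above); each iteration prints an input context together with the target token ID that follows it:

```python
for i in range(1, context_size + 1):
    context = enc_sample[:i]   # everything the LLM gets to see
    desired = enc_sample[i]    # the token ID the LLM should predict next
    print(context, "---->", desired)
```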

In the resulting output, everything to the left of the arrow (---->) is the input the LLM receives, and the token ID on the right side of the arrow is the target the LLM is meant to predict. To make this more tangible, let's repeat the previous code with the token IDs converted back to text:
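A sketch of the same loop with the token IDs decoded back into text (tokenizer is the BPE tokenizer from the earlier sketch):

```python
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    # decode() expects a list of token IDs, hence the [desired]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))
```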

In text format, each printed line now shows the input words on the left of the arrow and the next word the LLM should predict on the right.

As we indicated at the beginning of this chapter, there is just one more thing to do before we can convert the tokens into embeddings: create an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors. Specifically, we want it to return two tensors, as shown in Figure 2: an input tensor with the text that the LLM sees and a target tensor with the tokens that the LLM needs to predict.

Figure 2: To create efficient data loaders, we gather the inputs into a tensor called x, where each row denotes a single input context. A second tensor, y, contains the corresponding prediction targets (next words), which are produced by shifting the inputs by one position.

For illustration purposes, Figure 2 displays the tokens in string format; however, since the BPE tokenizer's encode method performs tokenization and conversion into token IDs in a single step, the code implementation will work directly with token IDs. We will use the built-in Dataset and DataLoader classes from PyTorch for an efficient data loader implementation.
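A sketch of such a dataset class is shown below; the class name GPTDatasetV1 and the names input_chunk and target_chunk come from the description that follows, while the remaining details are illustrative:

```python
import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text once
        token_ids = tokenizer.encode(txt)

        # Slide a window of max_length tokens over the text, advancing by `stride`
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
```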

Based on the PyTorch Dataset class, the GPTDatasetV1 class in the code above defines how individual rows are fetched from the dataset. Each row consists of a number of token IDs (determined by max_length) assigned to an input_chunk tensor, and the corresponding targets are contained in the target_chunk tensor. I suggest reading on to see what the data returned by this dataset looks like when it is combined with a PyTorch DataLoader; this will add further understanding and insight.

The following code uses GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader:
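A sketch of that code, reusing the imports and the GPTDatasetV1 class from the previous sketch; the function name create_dataloader appears in the text, while the default argument values are illustrative choices:

```python
def create_dataloader(txt, batch_size=4, max_length=256, stride=128,
                      shuffle=True, drop_last=True):
    # The tokenizer is created here so the caller only has to pass raw text
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    return DataLoader(dataset, batch_size=batch_size,
                      shuffle=shuffle, drop_last=drop_last)
```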

To gain an understanding of the interaction between the GPTDatasetV1 class and the create_dataloader function, let’s test the dataloader with a batch size of 1 for an LLM with a context size of 4:
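A sketch of that test, reusing raw_text from the earlier tokenization step and setting stride=1 so that consecutive batches shift by a single token:

```python
dataloader = create_dataloader(raw_text, batch_size=1, max_length=4,
                               stride=1, shuffle=False)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)
```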

Executing the previous code prints the contents of the first_batch variable.

Two tensors are present in the first_batch variable: the input token IDs are stored in the first tensor, and the target token IDs are stored in the second tensor. Each of the two tensors includes four token IDs because the max_length is set to four. Keep in mind that 4 is a pretty tiny input size and was simply selected for demonstration. LLMs are frequently trained with input sizes of at least 256. Let’s retrieve another batch from this dataset to demonstrate what stride=1 means:
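Continuing the sketch from above:

```python
second_batch = next(data_iter)  # the next sliding-window position
print(second_batch)
```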

Printing second_batch shows how the stride setting comes into play.

The second batch’s token IDs are displaced by one position in comparison to the first batch, as can be seen if we compare the two (for instance, the second ID in the first batch’s input is 367, which is the first ID of the second batch’s input). As shown in Figure 3, the stride setting determines how many positions the inputs shift between batches to simulate a sliding window method.

Figure 3: We drag an input window across the text to create numerous batches from the input dataset. When constructing the next batch, the input window is shifted by 1 if the stride is set to 1. Overlaps between the batches can be avoided if we set the stride equal to the input window size.

Experimenting with Data Loaders (Optional)

This section is for those curious to delve deeper into data loaders. We’ve been using a batch size of 1 for illustration, but training often utilizes larger batches. While requiring more memory, larger batches can lead to more efficient model updates.

Exploring Different Settings

If you’re familiar with deep learning, you know that hyperparameters like batch size significantly impact training. Here’s an opportunity to experiment with the data loader using different settings:
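For example, the calls sketched below (parameter values are only suggestions) create loaders with different context lengths and window overlaps:

```python
# Longer context with non-overlapping windows
dataloader = create_dataloader(raw_text, batch_size=4, max_length=8,
                               stride=8, shuffle=False)

# Short context with heavily overlapping windows
dataloader = create_dataloader(raw_text, batch_size=4, max_length=2,
                               stride=1, shuffle=False)
```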

These variations will change how the data loader extracts input-target pairs, potentially influencing the LLM’s learning process.

Batch Sizes and Overfitting

The code below demonstrates using a batch size greater than 1. Note that the stride is increased to 4 (matching max_length), which ensures we utilize the entire dataset without skipping any words while also avoiding overlap between batches. Overlap could lead to increased overfitting, where the model memorizes specific training examples instead of generalizing well:
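A sketch of such a configuration; the batch size of 8 is an illustrative choice, and any value greater than 1 demonstrates the same effect:

```python
dataloader = create_dataloader(raw_text, batch_size=8, max_length=4,
                               stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
```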

Running this code prints a batch of input rows and a matching batch of target rows, where each target row is its input row shifted forward by one position in the text.

Looking Ahead

The upcoming sections examine the intriguing idea of embeddings. These embeddings will turn our discrete token IDs into continuous vector representations, the language the LLM actually works with. Stay tuned to see how embeddings enable LLMs to comprehend and analyze human language!