Master Transformer Self Attention Mechanism

Introduction

Since its introduction via the original transformer paper (Attention Is All You Need), self-attention has become a cornerstone of many state-of-the-art deep learning models, particularly in the field of Natural Language Processing (NLP). In this article, we will delve into how self-attention works and implement it step-by-step from scratch.

Self-Attention

The concept of „attention“ in deep learning originated to enhance Recurrent Neural Networks (RNNs) for handling longer sequences or sentences. For example, consider translating a sentence from one language to another. Translating word-by-word isn’t effective, as context plays a crucial role in understanding the meaning.

To address this, attention mechanisms were introduced, giving access to all sequence elements at each time step. The key is to be selective and determine which words are most important in a specific context. In 2017, the transformer architecture introduced a standalone self-attention mechanism, eliminating the need for RNNs altogether.

Self-attention enhances the information content of an input embedding by including information about the input’s context. This mechanism enables the model to weigh the importance of different elements in an input sequence and dynamically adjust their influence on the output.

In this article, we focus on the original scaled-dot product attention mechanism (referred to as self-attention), which remains the most popular and widely used attention mechanism in practice.

The Self-Attention Mechanism: A Building Block of LLMs

The self-attention mechanism, essential to Large Language Models (LLMs), assigns attention weights to input elements based on their relevance, enabling the model to capture context and dependencies effectively. In a sentence like „The cat sat on the mat,“ the mechanism gives higher weights to related words such as „cat“ and „mat,“ creating context vectors that highlight important relationships.

For each word, self-attention calculates query, key, and value vectors using learned weights. The attention score is derived from the dot product of query and key vectors, scaled and processed through softmax to obtain weights. Each word’s context vector is then a weighted sum of the value vectors based on these weights.

Multi-Head Attention

Multi-head attention enhances the self-attention mechanism by using multiple attention heads to process data simultaneously, each capturing different traits and associations. This allows the model to understand the input material comprehensively, enabling accurate and efficient execution of complex linguistic tasks.

Instead of a single set of attention weights, multi-head attention employs multiple sets, each with its own trainable weight matrices. This enables the model to focus on various aspects of the input sequence concurrently, capturing a diverse and comprehensive collection of features and associations.

For instance, in the sentence „The quick brown fox jumps over the lazy dog,“ different attention heads might focus on distinct relationships: one on the subject-verb relationship between „fox“ and „jumps,“ and another on the descriptive relationship among „quick,“ „brown,“ and „fox.“ The model combines these outputs to produce a detailed representation of the input sequence.

Multi-head attention works by concatenating the outputs of each attention head and performing a final linear transformation, creating a cohesive representation from the diverse features captured by each head. This mechanism enhances the model’s ability to handle longer sequences and complex linguistic structures.

The accompanying image illustrates the multi-head attention mechanism, showing how several sets of attention heads, each with its own trainable weight matrix, function in parallel. This approach captures varied features and associations, producing a comprehensive representation and improving the model’s capacity to manage complex linguistic tasks.

Coding and Embedding an Input Sentence

Let’s consider an input sentence, „Life is short, eat dessert first,“ and put it through the self-attention mechanism. We create a sentence embedding first.

Step 1: Tokenize the Sentence

sentence = 'Life is short, eat dessert first'
dc = {s: i for i, s in enumerate(sorted(sentence.replace(',', '').split()))}
print(dc)

Output:

{'Life': 0, 'dessert': 1, 'eat': 2, 'first': 3, 'is': 4, 'short': 5}

Step 2: Convert Tokens to Integers

import torch

sentence_int = torch.tensor([dc[s] for s in sentence.replace(',', '').split()])
print(sentence_int)

Output:

tensor([0, 4, 5, 2, 1, 3])

Step 3: Create Embeddings

We use an embedding layer to encode the inputs into a real-vector embedding.

torch.manual_seed(123)
embed = torch.nn.Embedding(6, 16)
embedded_sentence = embed(sentence_int).detach()

print(embedded_sentence)
print(embedded_sentence.shape)

Output:

tensor([[ 0.3374, -0.1778, -0.3035, -0.5880,  0.3486,  0.6603, -0.2196, -0.3792,          0.7671, -1.1925,  0.6984, -1.4097,  0.1794,  1.8951,  0.4954,  0.2692],
        [ 0.5146,  0.9938, -0.2587, -1.0826, -0.0444,  1.6236, -2.3229,  1.0878,          0.6716,  0.6933, -0.9487, -0.0765, -0.1526,  0.1167,  0.4403, -1.4465],
        [ 0.2553, -0.5496,  1.0042,  0.8272, -0.3948,  0.4892, -0.2168, -1.7472,         -1.6025, -1.0764,  0.9031, -0.7218, -0.5951, -0.7112,  0.6230, -1.3729],
        [-1.3250,  0.1784, -2.1338,  1.0524, -0.3885, -0.9343, -0.4991, -1.0867,          0.8805,  1.5542,  0.6266, -0.1755,  0.0983, -0.0935,  0.2662, -0.5850],
        [-0.0770, -1.0205, -0.1690,  0.9178,  1.5810,  1.3010,  1.2753, -0.2010,          0.4965, -1.5723,  0.9666, -1.1481, -1.1589,  0.3255, -0.6315, -2.8400],
        [ 0.8768,  1.6221, -1.4779,  1.1331, -1.2203,  1.3139,  1.0533,  0.1388,          2.2473, -0.8036, -0.2808,  0.7697, -0.6596, -0.7979,  0.1838,  0.2293]])
torch.Size([6, 16])

Defining the Weight Matrices

Self-attention utilizes three weight matrices: Wq, Wk, and Wv, which are adjusted as model parameters during training. These matrices project the inputs into query, key, and value components of the sequence, respectively.

Step 1: Initialize Weight Matrices

torch.manual_seed(123)

d = embedded_sentence.shape[1]
d_q, d_k, d_v = 24, 24, 28

W_query = torch.nn.Parameter(torch.rand(d_q, d))
W_key = torch.nn.Parameter(torch.rand(d_k, d))
W_value = torch.nn.Parameter(torch.rand(d_v, d))

Step 2: Compute the Unnormalized Attention Weights

Let’s calculate the attention vector for the second input element, which acts as the query here:

x_2 = embedded_sentence[1]
query_2 = W_query.matmul(x_2)
key_2 = W_key.matmul(x_2)
value_2 = W_value.matmul(x_2)

print(query_2.shape)
print(key_2.shape)
print(value_2.shape)

Output:

torch.Size([24])
torch.Size([24])
torch.Size([28])

Generalize this to calculate the remaining key and value elements for all inputs:

keys = W_key.matmul(embedded_sentence.T).T
values = W_value.matmul(embedded_sentence.T).T

print("keys.shape:", keys.shape)
print("values.shape:", values.shape)

Output:

keys.shape: torch.Size([6, 24])
values.shape: torch.Size([6, 28])

Compute the unnormalized attention weights:

omega_24 = query_2.dot(keys[4])
print(omega_24)

Output:

tensor(11.1466)

Compute the ω values for all input tokens:

omega_2 = query_2.matmul(keys.T)
print(omega_2)

Output:

tensor([ 8.5808, -7.6597,  3.2558,  1.0395, 11.1466, -0.4800])

Computing the Attention Scores

Normalize the unnormalized attention weights to obtain the normalized attention weights by applying the softmax function. Additionally, scale ω by 1/dk−−√ before normalizing it.

import torch.nn.functional as F

attention_weights_2 = F.softmax(omega_2 / d_k**0.5, dim=0)
print(attention_weights_2)

Output:

tensor([0.2912, 0.0106, 0.0982, 0.0625, 0.4917, 0.0458])

Compute the context vector, which is an attention-weighted version of our original query input, including all the other input elements as its context via the attention weights:

context_vector_2 = attention_weights_2.matmul(values)

print(context_vector_2.shape)
print(context_vector_2)

Output:

torch.Size([28])
tensor([-1.5993,  0.0156,  1.2670,  0.0032, -0.6460, -1.1407, -0.4908, -1.4632,
         0.4747,  1.1926,  0.4506, -0.7110,  0.0602,  0.7125, -0.1628, -2.0184,
         0.3838, -2.1188, -0.8136, -1.5694,  0.7934, -0.2911, -1.3640, -0.2366,
        -0.9564, -2.1807,  0.7064, -0.6914])

Repeat the same steps for all six input elements to calculate their attention-weighted context vectors.

def compute_attention(query_idx):
    query = W_query.matmul(embedded_sentence[query_idx])
    keys = W_key.matmul(embedded_sentence.T).T
    values = W_value.matmul(embedded_sentence.T).T
    omega = query.matmul(keys.T)
    attention_weights = F.softmax(omega / d_k**0.5, dim=0)
    context_vector = attention_weights.matmul(values)
    return context_vector

context_vectors = torch.stack([compute_attention(i) for i in range(len(sentence_int))])
print(context_vectors.shape)
print(context_vectors)

Output:

torch.Size([6, 28])
tensor([[ 1.6553e+00,  1.3356e+00,  2.3568e+00,  1.8870e+00,  1.2365e+00,
          1.4977e+00,  1.0891e+00,  2.1984e+00, -5.8724e-01,  2.3954e+00,
          1.1139e+00,  2.2103e+00,  2.1913e+00,  9.1274e-01,  2.1987e+00,
          1.8836e+00,  5.7325e-01,  6.8721e-01,  7.2828e-01,  2.2573e+00,
          2.8605e+00,  2.1737e+00,  1.2913e+00,  1.6612e+00,  1.4178e+00,
          3.1113e+00,  2.0411e+00,  3.5763e+00],
        [-1.5993e+00,  1.5594e-02,  1.2670e+00,  3.1612e-03, -6.4600e-01,
         -1.1407e+00, -4.9081e-01, -1.4632e+00,  4.7471e-01,  1.1926e+00,
          4.5059e-01, -7.1097e-01,  6.0172e-02,  7.1248e-01, -1.6280e-01,
         -2.0184e+00,  3.8376e-01, -2.1188e+00, -8.1358e-01, -1.5694e+00,
          7.9338e-01, -2.9112e-01, -1.3640e+00, -2.3665e-01, -9.5643e-01,
         -5.2651e-01,  6.2443e-02,  1.7084e+00],
        [-4.1774e+00, -1.6440e+00, -1.9643e+00, -1.6642e+00, -1.0216e+00,
         -5.0441e+00, -1.4350e+00, -3.0582e+00, -1.3735e+00, -1.0167e+00,
         -9.3965e-01, -2.5408e+00, -2.1351e+00, -1.8701e+00, -1.9994e+00,
         -3.7609e+00, -3.8755e+00, -3.1365e+00, -2.1639e+00, -3.0949e+00,
         -3.7118e+00, -1.8682e+00, -1.8869e+00, -1.7023e+00, -1.4043e+00,
         -4.1602e+00, -3.5326e+00, -1.8202e+00],
        [-4.1765e+00, -1.6436e+00, -1.9632e+00, -1.6638e+00, -1.0214e+00,
         -5.0427e+00, -1.4345e+00, -3.0579e+00, -1.3729e+00, -1.0158e+00,
         -9.3947e-01, -2.5402e+00, -2.1344e+00, -1.8692e+00, -1.9991e+00,
         -3.7605e+00, -3.8735e+00, -3.1365e+00, -2.1636e+00, -3.0947e+00,
         -3.7100e+00, -1.8679e+00, -1.8868e+00, -1.7020e+00, -1.4043e+00,
         -4.1588e+00, -3.5308e+00, -1.8186e+00],
        [-4.1551e+00, -1.6412e+00, -1.9663e+00, -1.6580e+00, -1.0151e+00,
         -5.0340e+00, -1.4330e+00, -3.0481e+00, -1.3810e+00, -1.0148e+00,
         -9.4537e-01, -2.5280e+00, -2.1248e+00, -1.8742e+00, -1.9999e+00,
         -3.7398e+00, -3.8600e+00, -3.1291e+00, -2.1658e+00, -3.0865e+00,
         -3.6977e+00, -1.8617e+00, -1.8866e+00, -1.7083e+00, -1.4011e+00,
         -4.1460e+00, -3.5131e+00, -1.8161e+00],
        [ 2.3501e+00,  1.2960e+00,  2.2324e+00,  2.1957e+00,  2.3762e+00,
          1.8197e+00,  2.2329e+00,  3.4829e+00, -1.9674e+00,  3.0705e+00,
          6.7280e-01,  3.1772e+00,  2.7996e+00,  7.7589e-01,  2.7167e+00,
          2.6194e+00,  1.0987e-01,  9.6178e-01,  1.1149e+00,  3.5639e+00,
          3.5327e+00,  2.4810e+00,  2.8085e+00,  2.3073e+00,  2.6020e+00,
          4.4131e+00,  3.1466e+00,  5.2343e+00]], grad_fn=<StackBackward0>)

Conclusion

Self-attention is a powerful mechanism enabling models to dynamically focus on different parts of an input sequence. By implementing self-attention from scratch, we gain a deeper understanding of its inner workings and appreciate its significance in modern deep learning architectures. In future posts, we’ll explore more advanced topics related to transformers and their applications.

Mastering Self-Attention: Coding From Scratch

Introduction

Self-Attention

The Self-Attention Mechanism: A Building Block of LLMs

Multi-Head Attention

Coding and Embedding an Input Sentence

Defining the Weight Matrices

Computing the Attention Scores

Conclusion

Building a GPT Model from Scratch to Generate Text

Be intelligent - start your journey with us!

Rechtliches

Kontakt

0221-95490614

[email protected]

Kostenlose Beratung!

Mastering Self-Attention: Coding From Scratch

Introduction

Self-Attention

The Self-Attention Mechanism: A Building Block of LLMs

Multi-Head Attention

Coding and Embedding an Input Sentence

Defining the Weight Matrices

Computing the Attention Scores

Conclusion

Building a GPT Model from Scratch to Generate Text

Related Posts

Mastering Chain-of-Thought Prompting: A Step-by-Step Guide

Deep Reinforcement Learning in der KI

Understanding Prompt Engineering: A Deep Dive

Be intelligent - start your journey with us!

Rechtliches

Kontakt

0221-95490614

[email protected]

Kostenlose Beratung!