llms-from-scratch: the attention mechanism explained through code and worked calculations


1.1 The problem with modeling long sequences

Because the source and target languages differ in grammatical structure, translating text word by word is not feasible.

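Before the transformer was introduced, encoder-decoder RNN models were commonly used for machine translation.
In this setup, the encoder processes a sequence of tokens from the source language, using a hidden state (a kind of intermediate layer within the neural network) to produce a condensed representation of the entire input sequence.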

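However, RNNs have two serious drawbacks:

- When the sequence is long, the encoder easily loses information from early in the sequence; attention mechanisms were later introduced to address this.
- An RNN is a sequential model, so during training it must compute step by step in order and cannot be parallelized across time steps.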


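1.2 Capturing data dependencies with attention mechanisms

1.3 Computing attention scores

Assume the embeddings of all the words are as follows. Batch size is not considered here; we only discuss the computation itself.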

import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

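Step 1: compute the unnormalized attention scores.
The query is the second word of inputs. This is effectively Q multiplied by the transpose of K: it measures how relevant each token in K is to the query.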

query = inputs[1]  # 2nd input token is the query
print(query)
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query) # dot product (transpose not necessary here since they are 1-dim vectors)
print(attn_scores_2)
# tensor([0.5500, 0.8700, 0.6600])
# tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
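The loop above handles one query at a time. As a minimal sketch (the variable names `scores_loop` and `attn_scores` are chosen here for illustration), the same scores for every token acting as the query can be obtained with a single matrix multiplication, which is the "Q times K transposed" form mentioned above:

```python
import torch

# Same toy embeddings as in this section (6 tokens, 3 dimensions each).
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

# Loop version, for the second token as the query only.
query = inputs[1]
scores_loop = torch.stack([torch.dot(x_i, query) for x_i in inputs])

# Matrix version: entry (i, j) is the dot product of token i and token j.
attn_scores = inputs @ inputs.T  # shape (6, 6)

print(attn_scores.shape)
# Row 1 of the matrix holds the scores for the second token as the query.
print(torch.allclose(attn_scores[1], scores_loop))
```

Each row of `attn_scores` corresponds to one query token, so the loop result is just one row of this matrix.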

Step 2: normalize the unnormalized attention scores (the "omegas") so that they sum to 1.
The attention scores are passed through softmax, after which they become weights.

attn_scores_2 = attn_scores_2/3
print(attn_scores_2)
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())
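To make the softmax step concrete, here is a small sketch that compares a hand-written softmax with `torch.softmax`. The helper `softmax_naive` is a hypothetical name introduced for illustration, and the scores are the unscaled values printed in step 1:

```python
import torch

def softmax_naive(x: torch.Tensor) -> torch.Tensor:
    # Exponentiate every score, then divide by the sum of the exponentials.
    exp_x = torch.exp(x)
    return exp_x / exp_x.sum(dim=0)

# Unnormalized scores for the second token, as printed in step 1.
scores = torch.tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])

weights_naive = softmax_naive(scores)
weights_torch = torch.softmax(scores, dim=0)

print(weights_naive)
print(torch.allclose(weights_naive, weights_torch))
print(weights_torch.sum())
```

In practice `torch.softmax` is preferable because it is implemented to stay numerically stable for very large or very small scores, which the naive version is not.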

Step 3: compute the context vector by multiplying each embedded input token by its attention weight and summing the resulting vectors.

query = inputs[1] # 2nd input token is the query

context_vec_2 = torch.zeros(query.shape)
for i,x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i]*x_i

print(context_vec_2)

Summary: in this simplified version, Q, K, and V are all copies of the same input embeddings.
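The three steps can be condensed into three lines that produce the context vectors for all tokens at once. This is a sketch under one assumption: it applies softmax directly to the raw scores and leaves out the extra division by 3 used in step 2.

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

# Here queries, keys, and values are all the same tensor: `inputs`.
attn_scores = inputs @ inputs.T                    # (6, 6) pairwise dot products
attn_weights = torch.softmax(attn_scores, dim=-1)  # each row sums to 1
context_vecs = attn_weights @ inputs               # (6, 3) one context vector per token

print(attn_weights.sum(dim=-1))
print(context_vecs.shape)
```

Row `i` of `context_vecs` is the weighted sum of all input embeddings, with the weights taken from row `i` of `attn_weights`.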

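1.4 Computing self-attention with trainable weights

Nothing above involves learnable parameter matrices; the scaled dot-product attention described in the transformer paper does have trainable weights.

Compared with the attention mechanism above, the most notable difference is the introduction of weight matrices that are updated during model training. These trainable weight matrices are crucial: they let the model (specifically, the attention module inside it) learn to produce "good" context vectors.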

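To implement the self-attention mechanism step by step, we start by introducing three trainable weight matrices, $W_q$, $W_k$, and $W_v$.
These three matrices project the embedded input tokens into query, key, and value vectors via matrix multiplication:
- Query vector: $q^{(i)} = x^{(i)}\,W_q$
- Key vector: $k^{(i)} = x^{(i)}\,W_k$
- Value vector: $v^{(i)} = x^{(i)}\,W_v$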

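Whether the input embeddings and the query vectors have the same dimension depends on the model design and the specific implementation. In GPT models, the input and output dimensions are usually the same.

We again use the embedding of a single input as the example. The difference from section 1.3 is the three additional parameter matrices $W_q$, $W_k$, and $W_v$. The example changes the output dimension (mapping from 3 dimensions to 2), whereas in an actual GPT the output dimension matches the input dimension.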

x_2 = inputs[1] # second input element
d_in = inputs.shape[1] # the input embedding size, d=3
d_out = 2 # the output embedding size, d=2

Step 1: initialize the weight matrices and compute the query, key, and value vectors.

torch.manual_seed(123)

W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False) # set requires_grad=True if the weights should be learned
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False) 
query_2 = x_2 @ W_query # _2 because it's with respect to the 2nd input element
key_2 = x_2 @ W_key 
value_2 = x_2 @ W_value
print(query_2)

Below, all of the inputs are projected from 3 dimensions to 2.

keys = inputs @ W_key        # (6 x 3) @ (3 x 2) -> 6 x 2
values = inputs @ W_value    # (6 x 3) @ (3 x 2) -> 6 x 2

print("keys.shape:", keys.shape)
print("values.shape:", values.shape)
# keys.shape: torch.Size([6, 2])
# values.shape: torch.Size([6, 2])

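Step 2: compute the unnormalized attention scores as the dot product between the query and each key vector.
The three matrices $W_q$, $W_k$, and $W_v$ are randomly initialized, so they differ from one another.

# (1 x 2) @ (6 x 2) transposed -> 1 x 6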
attn_scores_2 = query_2 @ keys.T # All attention scores for given query 
print(attn_scores_2)

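Step 3: divide by the square root of the key dimension, $\sqrt{d_k}$, then apply the softmax function to obtain the normalized weights (which sum to 1).

d_k = keys.shape[1]  # 2 here; it would normally be 3, since the weight matrices usually keep the embedding size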
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)

Step 4: compute the context vector.

context_vec_2 = attn_weights_2 @ values # 1X 6  6 X 2  -> 1 X 2
print(context_vec_2)
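The four steps can be bundled into a small module. The following is a minimal sketch rather than a reference implementation: the class name `SelfAttention` is chosen here for illustration, and unlike the snippets above it leaves `requires_grad` at its default of `True` so the weights could actually be trained.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention with trainable W_q, W_k, W_v (illustrative sketch)."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        queries = x @ self.W_query   # (num_tokens, d_out)
        keys = x @ self.W_key        # (num_tokens, d_out)
        values = x @ self.W_value    # (num_tokens, d_out)
        attn_scores = queries @ keys.T  # (num_tokens, num_tokens)
        # Scale by sqrt(d_k) before softmax, as in step 3.
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        return attn_weights @ values  # (num_tokens, d_out)

torch.manual_seed(123)
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)
sa = SelfAttention(d_in=3, d_out=2)
context_vecs = sa(inputs)
print(context_vecs.shape)
```

Calling the module on all six tokens returns one 2-dimensional context vector per token instead of only the one for the second token.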

[Figure: the self-attention computation process]

1.5 Causal attention: hiding future words (the causal self-attention mechanism)

1.5.1 Applying a causal attention mask

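In causal attention, the attention weights above the diagonal are masked out, which ensures that for any given input the LLM cannot use future tokens when computing the context vectors from the attention weights.
Causal self-attention ensures that the model's prediction at a given position in the sequence depends only on the known outputs at earlier positions, not on later positions.
Put simply, each next-word prediction should depend only on the preceding words.
To achieve this, for each given token we mask out the future tokens.

Instead of zeroing out the attention weights above the diagonal and renormalizing the result, we can mask the unnormalized attention scores above the diagonal with negative infinity before they enter the softmax function.
In other words, the corresponding positions are replaced with negative infinity after the attention scores are computed and before the attention weights are.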

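# Assumed setup (not shown in the original notes): scores for all queries,
# reusing inputs, W_query, and keys from section 1.4
queries = inputs @ W_query
attn_scores = queries @ keys.T
context_length = attn_scores.shape[0]

mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)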
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)

Once the mask has been applied, normalize to obtain the attention weights.

attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)
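The claim that masking with negative infinity before softmax matches "zero out and renormalize" can be checked numerically. The sketch below uses random placeholder scores (an assumption; they are not the scores from these notes) and omits the $\sqrt{d_k}$ scaling in both branches so the comparison stays focused on the masking:

```python
import torch

torch.manual_seed(0)
context_length = 4
attn_scores = torch.rand(context_length, context_length)  # placeholder scores

# Approach A: softmax first, zero out future positions, renormalize each row.
weights = torch.softmax(attn_scores, dim=-1)
keep = torch.tril(torch.ones(context_length, context_length))
masked_a = weights * keep
masked_a = masked_a / masked_a.sum(dim=-1, keepdim=True)

# Approach B: fill future positions with -inf, then apply softmax once.
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked_b = torch.softmax(attn_scores.masked_fill(mask.bool(), -torch.inf), dim=-1)

print(masked_b)
print(torch.allclose(masked_a, masked_b))
```

Both approaches agree because `exp(-inf)` is 0, so the masked positions contribute nothing to the softmax denominator; approach B simply needs one normalization instead of two.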

1.5.2 Masking additional attention weights with dropout

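Dropout can be applied in several places:

- after computing the attention weights;
- or after multiplying the attention weights with the value vectors.

The first option is more common: apply dropout right after computing the attention weights.
If we apply dropout = 0.2, the values that are not dropped are scaled up by a factor of 1/0.8 = 1.25.
The scaling factor is computed as 1 / (1 - dropout_rate).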

[Figure: dropout]

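During training, dropout generates a binary mask for each input element independently, drawn from a Bernoulli distribution. Each element is zeroed with probability p (0.5 in the code here), but which elements are zeroed is entirely random, and a new mask is generated on every forward pass.

One important consequence: if, say, 10 elements go through dropout with dropout=0.5, the number of dropped weights is not necessarily 5; it can be more or fewer. This is because dropout draws an independent Bernoulli mask for each element, so with a small sample the observed fraction can deviate from the nominal probability.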

dropout = torch.nn.Dropout(0.5) # dropout rate of 50%
print(dropout(attn_weights))

Note that dropout is only applied during training, not during inference
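The scaling rule and the train/inference distinction can both be observed on a vector of ones. This is a small sketch with an arbitrary seed; the number of zeroed entries depends on the seed and is deliberately not stated here:

```python
import torch

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.2)  # dropout rate of 20%
x = torch.ones(10)

# Training mode: each element is zeroed independently with probability 0.2,
# and the survivors are scaled by 1 / (1 - 0.2) = 1.25.
dropout.train()
out_train = dropout(x)
print(out_train)
print("dropped:", int((out_train == 0).sum()))

# Evaluation mode: dropout is a no-op and the input passes through unchanged.
dropout.eval()
out_eval = dropout(x)
print(torch.equal(out_eval, x))
```

Rerunning the training-mode call produces a different mask each time, which is why the count of dropped elements fluctuates around, rather than equals, 20% of the inputs.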

1.6 Multi-head attention

There are two ways to implement multi-head attention: stacking and splitting.

1.6.1 Stacking multiple single-head attention layers


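The main idea behind multi-head attention is to run the attention mechanism multiple times (in parallel) with different, learnable linear projections. This allows the model to jointly attend to information from different representation subspaces at different positions.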
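The `MultiHeadAttentionWrapper` code in this section calls `CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)`, a class these notes do not show. The following is a minimal sketch of what such a class could look like, assuming it simply combines the causal mask from 1.5.1 with the dropout from 1.5.2 and takes batched input; the random `batch` tensor is a placeholder for illustration only.

```python
import torch
import torch.nn as nn

class CausalAttention(nn.Module):
    """Single-head causal self-attention (illustrative sketch)."""

    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular mask: 1 marks a "future" position to hide.
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1),
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)        # (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)  # (b, num_tokens, num_tokens)
        attn_scores = attn_scores.masked_fill(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf
        )
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        return attn_weights @ values  # (b, num_tokens, d_out)

torch.manual_seed(123)
batch = torch.rand(2, 6, 3)  # placeholder batch: 2 sequences, 6 tokens, 3 dims
ca = CausalAttention(d_in=3, d_out=2, context_length=6, dropout=0.0)
out = ca(batch)
print(out.shape)
```

Registering the mask as a buffer keeps it out of the trainable parameters while still moving it to the right device together with the module.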

import torch
import torch.nn as nn

class MultiHeadAttentionWrapper(nn.Module):

    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
             for _ in range(num_heads)] # with num_heads=2, the single-head attention is computed twice
        )

    def forward(self, x):
        # The last dimension of the result is d_out * num_heads, because the head outputs are concatenated
        return torch.cat([head(x) for head in self.heads], dim=-1)


torch.manual_seed(123)

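# Assumption: `batch` is not defined in these notes; here it is two copies of `inputs`, shape (2, 6, 3)
batch = torch.stack((inputs, inputs), dim=0)
context_length = batch.shape[1] # This is the number of tokens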
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(
    d_in, d_out, context_length, 0.0, num_heads=2
)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

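With stacked attention layers, the dimension of the resulting context vectors grows by a factor equal to the number of heads.

1.6.2 Implementing multi-head attention with weight splits

This method derives the size of each head from d_out and the number of heads (head_dim = d_out // num_heads). The projections produced by the three weight matrices $W_q$, $W_k$, and $W_v$ are reshaped to add a head dimension, and the heads are then computed in parallel. The final context vectors have dimension d_out, which in GPT-style models is set equal to the original embedding dimension.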

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # As in `CausalAttention`, for inputs where `num_tokens` exceeds `context_length`,
        # this will result in errors in the mask creation further below.
        # In practice, this is not a problem since the LLM (chapters 4-7) ensures that inputs
        # do not exceed `context_length` before reaching this forward method.

        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) 
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)
        
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2) 
        
        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec

torch.manual_seed(123)

batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

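Looking at what the code actually does, the outputs of the three weight matrices $W_q$, $W_k$, and $W_v$ are reshaped into one slice per head, that is, from Batch x seq_length x d_out to Batch x head x seq_length x head_dim, where d_out = head x head_dim.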
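The reshape can be traced on a tiny tensor. In this sketch the sizes and the `arange` values are arbitrary choices for illustration; the point is only how `view` and `transpose` rearrange the last dimension into heads and back:

```python
import torch

b, num_tokens, num_heads, head_dim = 1, 3, 2, 2
d_out = num_heads * head_dim

# A stand-in for projected queries/keys/values: (b, num_tokens, d_out).
x = torch.arange(b * num_tokens * d_out, dtype=torch.float32).view(b, num_tokens, d_out)

# Split: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
#        -> (b, num_heads, num_tokens, head_dim)
heads = x.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
print(heads.shape)

# Head 0 holds the first head_dim columns of every token, head 1 the next ones.
print(torch.equal(heads[0, 0], x[0, :, :head_dim]))
print(torch.equal(heads[0, 1], x[0, :, head_dim:]))

# Merge: undo the transpose, then flatten the head dimensions back into d_out.
merged = heads.transpose(1, 2).contiguous().view(b, num_tokens, d_out)
print(torch.equal(merged, x))
```

The `.contiguous()` call is needed before the final `view` because `transpose` only changes strides and leaves the underlying memory layout untouched.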


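Note also that we added a linear projection layer (self.out_proj) to the MultiHeadAttention class above. It is a linear transformation that does not change the dimensionality. Using such a projection layer is a standard convention in LLM implementations, but it is not strictly necessary (recent research suggests it can be removed without affecting modeling performance).

How data flows through multi-head attention
Next, the data is split across the attention heads so that each head can process it independently.
Note, however, that this split is only logical. Query, Key, and Value are not physically split into a separate matrix per attention head. Each of Query, Key, and Value remains a single large matrix covering all heads, and different parts of that matrix are logically assigned to each attention head. Likewise, there is no separate linear layer per attention head: all attention heads share the same linear layers.
The weights of the linear layers are logically partitioned by attention head.
This logical split is achieved by dividing the input data and the linear-layer weights evenly among the attention heads.
The matrix splitting and merging described below are actually carried out through tensor reshaping, that is, by adding and removing dimensions.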
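The "logical split" can be verified directly: projecting with one shared `nn.Linear` and reshaping gives the same per-head queries as slicing that layer's weight matrix by head. This is a sketch with arbitrary sizes and a random input chosen for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
d_in, d_out, num_heads = 3, 4, 2
head_dim = d_out // num_heads

W_query = nn.Linear(d_in, d_out, bias=False)  # one shared layer for all heads
x = torch.rand(1, 6, d_in)                    # placeholder input: (b, num_tokens, d_in)

# Shared projection followed by the reshape used in MultiHeadAttention.
queries = W_query(x).view(1, 6, num_heads, head_dim).transpose(1, 2)

# nn.Linear stores its weight as (d_out, d_in), so rows
# h*head_dim:(h+1)*head_dim are the slice that belongs to head h.
matches = []
for h in range(num_heads):
    W_h = W_query.weight[h * head_dim:(h + 1) * head_dim]  # (head_dim, d_in)
    per_head = x @ W_h.T                                   # (1, 6, head_dim)
    matches.append(torch.allclose(queries[:, h], per_head, atol=1e-5))

print(matches)
```

A single large matrix multiplication followed by a reshape is therefore equivalent to running a smaller projection per head, while being cheaper to launch than several separate ones.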