1. BERT

BERT is built mainly from a stack of Transformer encoders. Its main components are the embedding layer and the encoder layers.

1.1 Embedding

BERT uses three kinds of embeddings:

Token Embedding (word encoding),
Position Embedding (positional encoding),
Segment Embedding
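
The three embeddings all produce tensors of the same shape and are summed element-wise before being passed to the encoder. The following is a minimal sketch of that idea, not the actual BERT implementation: the sizes and the names token_table, position_table and segment_table are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only to keep the example small.
vocab_size, max_positions, type_vocab_size, d_model = 100, 16, 2, 8
batch, seq_len = 2, 5

token_table = nn.Embedding(vocab_size, d_model)
position_table = nn.Embedding(max_positions, d_model)
segment_table = nn.Embedding(type_vocab_size, d_model)

input_ids = torch.randint(0, vocab_size, (batch, seq_len))               # [batch, seq_len]
position_ids = torch.arange(seq_len).unsqueeze(0).expand_as(input_ids)   # [batch, seq_len]
segment_ids = torch.zeros_like(input_ids)                                # single-sentence case

# Each lookup returns [batch, seq_len, d_model], so the three can be added directly.
summed = token_table(input_ids) + position_table(position_ids) + segment_table(segment_ids)
print(summed.shape)  # torch.Size([2, 5, 8])
```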

1.1.1 Token Embedding

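Token Embedding maps each token index to a vector. The raw input has shape [batch,seq_len]; after Token Embedding the shape is [batch,seq_len,d_model].
Conceptually, Token Embedding in BERT initializes a 2-D weight array of shape [vocab_size,d_model], one-hot encodes the input into shape [batch,seq_len,vocab_size], and multiplies the two tensors. This can be verified as follows: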

First, encode the tokens with torch's built-in embedding function:

import torch
import torch.nn.functional as F

# Verify Token Embedding
input = torch.tensor([[1,4,2,3,4],[4,2,3,1,5]],dtype = torch.long)
init_weight = torch.rand(6,3) # 6 is the vocabulary size, 3 is d_model
print(init_weight)
# Value of init_weight in one run (random, so it differs between runs):
# tensor([[0.2741, 0.7190, 0.5863],
#         [0.9283, 0.3595, 0.8193],
#         [0.6051, 0.4441, 0.6545],
#         [0.8852, 0.9930, 0.6367],
#         [0.0421, 0.1417, 0.6370],
#         [0.3956, 0.5442, 0.4503]])
out = F.embedding(input,init_weight)
print(out)
# Value of out:
# tensor([[[0.9283, 0.3595, 0.8193],
#          [0.0421, 0.1417, 0.6370],
#          [0.6051, 0.4441, 0.6545],
#          [0.8852, 0.9930, 0.6367],
#          [0.0421, 0.1417, 0.6370]],
# 
#         [[0.0421, 0.1417, 0.6370],
#          [0.6051, 0.4441, 0.6545],
#          [0.8852, 0.9930, 0.6367],
#          [0.9283, 0.3595, 0.8193],
#          [0.3956, 0.5442, 0.4503]]])

Next, one-hot encode the indices and verify the result with a matrix multiplication, using the same init_weight as above:

import numpy as np
input2 = np.array([
    [[0,1,0,0,0,0],[0,0,0,0,1,0],[0,0,1,0,0,0],[0,0,0,1,0,0],[0,0,0,0,1,0]],
    [[0,0,0,0,1,0],[0,0,1,0,0,0],[0,0,0,1,0,0],[0,1,0,0,0,0],[0,0,0,0,0,1]]
])
init_weight = np.array([[0.2741, 0.7190, 0.5863],
        [0.9283, 0.3595, 0.8193],
        [0.6051, 0.4441, 0.6545],
        [0.8852, 0.9930, 0.6367],
        [0.0421, 0.1417, 0.6370],
        [0.3956, 0.5442, 0.4503]])
for i in range(len(input2)):
    out = np.dot(input2[i],init_weight)
    print(out)

# [[0.9283 0.3595 0.8193]
#  [0.0421 0.1417 0.637 ]
#  [0.6051 0.4441 0.6545]
#  [0.8852 0.993  0.6367]
#  [0.0421 0.1417 0.637 ]]


# [[0.0421 0.1417 0.637 ]
#  [0.6051 0.4441 0.6545]
#  [0.8852 0.993  0.6367]
#  [0.9283 0.3595 0.8193]
#  [0.3956 0.5442 0.4503]]

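The two results are identical, so an embedding is equivalent to converting each token index in the sentence to a one-hot vector and then multiplying by a weight matrix. In practice the lookup simply selects rows of the weight matrix by index, which gives the same result without building the one-hot tensor. The weights are randomly initialized and then learned during training. The output shape is [batch,seq_len,d_model].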
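
The same check can be written entirely in PyTorch, so the one-hot tensor does not have to be typed by hand. This is a small sketch using F.one_hot; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 6, 3
token_ids = torch.tensor([[1, 4, 2, 3, 4], [4, 2, 3, 1, 5]])  # [batch=2, seq_len=5]
weight = torch.rand(vocab_size, d_model)                       # [vocab_size, d_model]

# Path 1: direct row lookup.
lookup = F.embedding(token_ids, weight)                        # [2, 5, 3]

# Path 2: one-hot encode, then matrix-multiply with the same weights.
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float() # [2, 5, 6]
product = one_hot @ weight                                     # [2, 5, 3]

print(lookup.shape, one_hot.shape)
print(torch.allclose(lookup, product))  # True
```

The lookup path avoids allocating a [batch,seq_len,vocab_size] tensor, which matters when the vocabulary has tens of thousands of entries.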

1.1.2 Position Embedding

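Position Embedding in BERT differs from the original Transformer. The Transformer computes the value of each dimension directly from a fixed formula, whereas in BERT the position embeddings are learned. For example, if the sequence length is 512, each sentence gets the 1-D array of position indices [0,1,2,...511], which is repeated batch times, so the actual input has shape [batch,seq_len]. These indices are then embedded in the same way as in Token Embedding, using a weight table of shape [max_position_embeddings,d_model]. The final output has shape [batch,seq_len,d_model], the same as the Token Embedding output.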
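
To make the contrast concrete, here is a rough sketch of both approaches: a learned position table looked up by position index, and the fixed sine/cosine table of the original Transformer. The sizes and the helper name sinusoidal_positions are assumptions made for this example, and d_model is assumed to be even.

```python
import math
import torch
import torch.nn as nn

batch, seq_len, d_model, max_positions = 2, 5, 8, 512  # hypothetical sizes

# Learned (BERT-style): a trainable table indexed by position.
position_table = nn.Embedding(max_positions, d_model)
position_ids = torch.arange(seq_len).unsqueeze(0).expand(batch, seq_len)  # [batch, seq_len]
learned = position_table(position_ids)                                     # [batch, seq_len, d_model]

# Fixed (original Transformer): sine on even dimensions, cosine on odd dimensions.
def sinusoidal_positions(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # [seq_len, 1]
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                      # [d_model / 2]
    table = torch.zeros(seq_len, d_model)
    table[:, 0::2] = torch.sin(pos * div)
    table[:, 1::2] = torch.cos(pos * div)
    return table

fixed = sinusoidal_positions(seq_len, d_model)                             # [seq_len, d_model]

print(learned.shape, learned.requires_grad)  # trainable
print(fixed.shape, fixed.requires_grad)      # not trainable
```

One consequence of the learned table is that the model cannot handle positions beyond max_positions, because there is no row to look up for them.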

1.1.3 Segment Embedding

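BERT can handle classification tasks on sentence pairs, such as deciding whether two texts are semantically similar. The two sentences of a pair are simply concatenated and fed into the model. So how does BERT tell the two sentences of a pair apart? The answer is segment embeddings.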


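For single-sentence inputs the segment embedding carries no information, because every token gets segment ID 0 (the default in the code below); it only matters for sentence pairs. Its output shape is also [batch,seq_len,d_model].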
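
As an illustration, the segment IDs for a sentence pair could be built like this. The sketch assumes the usual BERT input layout of [CLS] sentence A [SEP] sentence B [SEP], which is not spelled out in the text above, and build_segment_ids is a hypothetical helper written for this example.

```python
import torch
import torch.nn as nn

def build_segment_ids(len_a, len_b):
    """Segment 0 covers [CLS] + sentence A + [SEP]; segment 1 covers sentence B + [SEP]."""
    return [0] * (len_a + 2) + [1] * (len_b + 1)

# Sentence A has 3 tokens, sentence B has 2 tokens (hypothetical lengths).
segment_ids = torch.tensor([build_segment_ids(3, 2)])  # [batch=1, seq_len=8]
print(segment_ids.tolist())  # [[0, 0, 0, 0, 0, 1, 1, 1]]

# Only two rows are needed: one vector per segment.
segment_table = nn.Embedding(2, 4)                     # [type_vocab_size=2, d_model=4]
segment_emb = segment_table(segment_ids)               # [1, 8, 4]
print(segment_emb.shape)
```

Every token in the same segment receives the same vector, so this embedding only tells the model which sentence a token belongs to.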

The embedding code from a PyTorch implementation of the pretrained BERT model is as follows:

import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings.
    """
    def __init__(self, config):
        super(BertEmbeddings, self).__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12) # layer normalization normalizes over the last dimension
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids, token_type_ids=None):
        seq_length = input_ids.size(1) # sentence length; input_ids usually has shape [batch_size, seq_length]
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids) # expand to the same shape as input_ids
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        words_embeddings = self.word_embeddings(input_ids) # input_ids is fed directly into the embedding lookup
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = words_embeddings + position_embeddings + token_type_embeddings # the sum of the three is the encoder input
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

class BertLayerNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-12):
        """Construct a layernorm module in the TF style (epsilon inside the square root).
        """
        super(BertLayerNorm, self).__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.variance_epsilon = eps

    def forward(self, x):
        u = x.mean(-1, keepdim=True) # mean over the last dimension
        s = (x - u).pow(2).mean(-1, keepdim=True) # biased variance over the last dimension
        x = (x - u) / torch.sqrt(s + self.variance_epsilon)
        return self.weight * x + self.bias
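
The hand-written layer normalization above should agree with PyTorch's built-in nn.LayerNorm when the same eps is used and the weight and bias are left at their initial values of one and zero. The following sketch checks this on random data; the tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size, eps = 8, 1e-12
x = torch.randn(2, 5, hidden_size)  # [batch, seq_len, hidden_size]

# Manual version: normalize over the last dimension with the biased variance.
mean = x.mean(-1, keepdim=True)
var = (x - mean).pow(2).mean(-1, keepdim=True)
manual = (x - mean) / torch.sqrt(var + eps)

# Built-in version with the default affine parameters (weight = 1, bias = 0).
builtin = nn.LayerNorm(hidden_size, eps=eps)(x)

print(torch.allclose(manual, builtin, atol=1e-5))  # True
```

After normalization, each position's hidden vector has approximately zero mean and unit variance, independent of the other positions and of the other samples in the batch.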

BERT Structure and Principles (1): Embedding