The first half of this article explains how to extract embeddings from a pre-trained BERT model; the second half explains how to fine-tune BERT for downstream tasks.
1. Pre-trained BERT Models
Training a BERT-base model (roughly 110 million parameters) from scratch on 16GB of data is computationally expensive, so Google has released BERT models in various configurations that can be fine-tuned directly for downstream tasks. L denotes the number of Transformer encoder layers, and H the hidden size.
Figure: BERT model configurations
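As a quick illustration (not part of the original text), the configuration of a released checkpoint can be inspected with the Transformers library; bert-base-uncased is assumed here, and the parameter count (approximately 110M) is computed directly from the loaded weights:
from transformers import BertConfig, BertModel

# Inspect L (number of encoder layers) and H (hidden size) of a released checkpoint
config = BertConfig.from_pretrained('bert-base-uncased')
print(config.num_hidden_layers, config.hidden_size)  # 12 768

# Count the parameters of the corresponding model (roughly 110M for BERT-base)
model = BertModel.from_pretrained('bert-base-uncased')
print(sum(p.numel() for p in model.parameters()))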
2. Extracting Embeddings from a Pre-trained BERT Model
- Token-level (word-level) features
- Sentence-level features. In practice, averaging or pooling the features of all tokens often works better than relying solely on the [CLS] token's feature (see the mean-pooling sketch below)
Figure: features
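Below is a minimal sketch of sentence-level mean pooling, assuming a last_hidden_state of shape [batch_size, seq_len, hidden_size] and the matching attention_mask (both are produced by the code later in this article):
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Zero out [PAD] positions, then average over the real tokens only
    mask = attention_mask.unsqueeze(-1).float()       # [batch, seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)    # [batch, hidden_size]
    counts = mask.sum(dim=1).clamp(min=1e-9)          # number of real tokens
    return summed / counts                            # [batch, hidden_size]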
2.1 Installing the Hugging Face Transformers Library
pip install transformers
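To confirm the installation (and that PyTorch, which the examples below rely on, is available), you can print the library versions:
import torch
import transformers

print(transformers.__version__)
print(torch.__version__)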
2.2 Generating BERT Embeddings
- Preprocess the sentence
- Run the model to obtain the embeddings
2.2.1 Preprocessing Sentences
2.2.1.1 Loading the Model and Tokenizer
import torch
from transformers import BertModel, BertTokenizer

# Load the pre-trained model
model = BertModel.from_pretrained('bert-base-uncased')
# Load the matching tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
2.2.1.2 Manually Preprocessing a Sentence
Steps:
- Tokenize the sentence: tokenizer.tokenize(sentence)
- Add the [CLS] and [SEP] tokens
- Pad with [PAD]
- Build the attention mask: positions holding [PAD] get 0 so the model does not attend to them
- Convert tokens to token ids: tokenizer.convert_tokens_to_ids(tokens) (token ids can be decoded back with tokenizer.decode(input_ids))
- Convert the token ids and the attention mask to tensors
sentence = 'I love China'
print('Sentence: {}'.format(sentence))
# Sentence: I love China
tokens = tokenizer.tokenize(sentence)
print('Tokens: {}'.format(tokens))
# Tokens: ['i', 'love', 'china']
tokens = ['[CLS]'] + tokens + ['[SEP]']
print('With [CLS] and [SEP] added: {}'.format(tokens))
# With [CLS] and [SEP] added: ['[CLS]', 'i', 'love', 'china', '[SEP]']
tokens = tokens + ['[PAD]'] + ['[PAD]']
print('Padded with [PAD]: {}'.format(tokens))
# Padded with [PAD]: ['[CLS]', 'i', 'love', 'china', '[SEP]', '[PAD]', '[PAD]']
attention_mask = [1 if i != '[PAD]' else 0 for i in tokens]
print('Attention mask: {}'.format(attention_mask))
# Attention mask: [1, 1, 1, 1, 1, 0, 0]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print('Token ids: {}'.format(input_ids))
# Token ids: [101, 1045, 2293, 2859, 102, 0, 0]
decode_ids = tokenizer.decode(input_ids)
print('Decoded token ids: {}'.format(decode_ids))
# Decoded token ids: [CLS] i love china [SEP] [PAD] [PAD]
attention_mask = torch.tensor(attention_mask).unsqueeze(0)
input_ids = torch.tensor(input_ids).unsqueeze(0)
print('Attention mask tensor: {}'.format(attention_mask))
# Attention mask tensor: tensor([[1, 1, 1, 1, 1, 0, 0]])
print('Token id tensor: {}'.format(input_ids))
# Token id tensor: tensor([[ 101, 1045, 2293, 2859, 102, 0, 0]])
2.2.1.3 Encoding a Sentence
tokenizer(sentence)
sentence = 'I love China'
inputs = tokenizer(sentence)
print('Sentence: {}'.format(sentence))
# Sentence: I love China
print('input_ids: {}'.format(inputs['input_ids']))
# input_ids: [101, 1045, 2293, 2859, 102]
print('attention_mask: {}'.format(inputs['attention_mask']))
# attention_mask: [1, 1, 1, 1, 1]
print('token_type_ids: {}'.format(inputs['token_type_ids']))
# token_type_ids: [0, 0, 0, 0, 0]
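One note beyond the original example: the tokenizer can return PyTorch tensors directly via return_tensors='pt', which avoids the manual torch.tensor(...).unsqueeze(0) conversion used earlier:
inputs = tokenizer(sentence, return_tensors='pt')
print(inputs['input_ids'])
# tensor([[ 101, 1045, 2293, 2859,  102]])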
2.2.1.4 Encoding Two Sentences with Padding
tokenizer([sentence_a, sentence_b], padding=True)
sentence_a = 'This is a short sentence.'
sentence_b = 'This is a rather long sequence. It is at least longer than the sequence A.'
print('Sentence A: {}'.format(sentence_a))
print('Sentence B: {}'.format(sentence_b))
outputs = tokenizer([sentence_a, sentence_b], padding=True)
print('input_ids: {}'.format(outputs['input_ids']))
# input_ids: [[101, 2023, 2003, 1037, 2460, 6251, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 2023, 2003, 1037, 2738, 2146, 5537, 1012, 2009, 2003, 2012, 2560, 2936, 2084, 1996, 5537, 1037, 1012, 102]]
print('attention_mask: {}'.format(outputs['attention_mask']))
# attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
print('token_type_ids: {}'.format(outputs['token_type_ids']))
# token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
2.2.1.5 Encoding a Concatenated Sentence Pair
tokenizer(sentence_a, sentence_b)
sentence_a = 'This is a short sentence.'
sentence_b = 'This is a rather long sequence. It is at least longer than the sequence A.'
print('Sentence A: {}'.format(sentence_a))
print('Sentence B: {}'.format(sentence_b))
inputs = tokenizer(sentence_a, sentence_b)
print('input_ids: {}'.format(inputs['input_ids']))
# input_ids: [101, 2023, 2003, 1037, 2460, 6251, 1012, 102, 2023, 2003, 1037, 2738, 2146, 5537, 1012, 2009, 2003, 2012, 2560, 2936, 2084, 1996, 5537, 1037, 1012, 102]
print('attention_mask: {}'.format(inputs['attention_mask']))
# attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print('token_type_ids: {}'.format(inputs['token_type_ids']))
# token_type_ids: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
2.2.2 Running the Model to Obtain Embeddings
Calling the model:
model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
Outputs:
- outputs['pooler_output']: the sentence-level embedding, which the NSP head uses for prediction. It is obtained by passing the [CLS] token's representation through a feed-forward layer with a tanh activation
- outputs['last_hidden_state']: the token-level embeddings, which the MLM head uses for prediction
sentence_a = 'This is a short sentence.'
sentence_b = 'This is a rather long sequence. It is at least longer than the sequence A.'
inputs = tokenizer(sentence_a, sentence_b)
input_ids = torch.tensor(inputs['input_ids']).unsqueeze(0)
attention_mask = torch.tensor(inputs['attention_mask']).unsqueeze(0)
token_type_ids = torch.tensor(inputs['token_type_ids']).unsqueeze(0)
outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
pooler_output = outputs['pooler_output']
last_hidden_state = outputs['last_hidden_state']
print('pooler_output shape: {}'.format(pooler_output.shape)) # [batch_size, hidden_size]
# pooler_output shape: torch.Size([1, 768])
print('last_hidden_state shape: {}'.format(last_hidden_state.shape)) # [batch_size, sequence_length, hidden_size]
# last_hidden_state shape: torch.Size([1, 26, 768])
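As a sanity check (a sketch reusing model and outputs from the code above, and assuming the standard BertModel pooler layout in recent versions of the library), pooler_output can be reproduced by passing the [CLS] position of last_hidden_state through the model's pooler, i.e. a dense layer followed by tanh:
import torch

with torch.no_grad():
    # [CLS] representation, shape [batch_size, hidden_size]
    cls_vec = outputs['last_hidden_state'][:, 0]
    # Assumes BertModel exposes its pooler as model.pooler.dense followed by tanh
    recomputed = torch.tanh(model.pooler.dense(cls_vec))
    print(torch.allclose(recomputed, outputs['pooler_output'], atol=1e-6))  # True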
3. Extracting Embeddings from All Encoder Layers of BERT
Why: the features of a single hidden layer are not always the best choice for a downstream task. The BERT authors ran an experiment on named entity recognition, comparing the F1 scores obtained with features from different layers; concatenating the last four hidden layers gave the best F1 score.
Figure: F1 scores for embeddings from different layers
3.1 How to Extract Them
- BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True): set output_hidden_states=True to obtain the features of every layer
Outputs:
- outputs['hidden_states']: a tuple of 13 tensors, each of shape [batch_size, sequence_length, hidden_size]
- hidden_states[0]: the output of the input embedding layer
- hidden_states[-1]: the output of the last encoder layer
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence_a = 'This is a short sentence.'
sentence_b = 'This is a rather long sequence. It is at least longer than the sequence A.'
print('Sentence A: {}'.format(sentence_a))
print('Sentence B: {}'.format(sentence_b))
outputs = tokenizer(sentence_a, sentence_b)
input_ids = torch.tensor(outputs['input_ids']).unsqueeze(0)
attention_mask = torch.tensor(outputs['attention_mask']).unsqueeze(0)
token_type_ids = torch.tensor(outputs['token_type_ids']).unsqueeze(0)
outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
pooler_output = outputs['pooler_output']
last_hidden_state = outputs['last_hidden_state']
hidden_states = outputs['hidden_states']
print('pooler_output shape: {}'.format(pooler_output.shape)) # [batch_size, hidden_size]
# pooler_output shape: torch.Size([1, 768])
print('last_hidden_state shape: {}'.format(last_hidden_state.shape)) # [batch_size, sequence_length, hidden_size]
# last_hidden_state shape: torch.Size([1, 26, 768])
print('hidden_states length: {}'.format(len(hidden_states))) # 13
# hidden_states length: 13
print('hidden_states[0] shape: {}'.format(hidden_states[0].shape)) # [batch_size, sequence_length, hidden_size]
# hidden_states[0] shape: torch.Size([1, 26, 768])
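As a follow-up sketch (reusing hidden_states from the code above), the last four encoder layers can be concatenated along the hidden dimension, the combination that scored best in the NER experiment:
import torch

# Concatenate the last four encoder layers along the hidden dimension
concat_last4 = torch.cat(hidden_states[-4:], dim=-1)
print(concat_last4.shape)  # torch.Size([1, 26, 3072]), i.e. [batch_size, sequence_length, 4 * hidden_size]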