The first half of this article explains how to extract embeddings from a pre-trained BERT model; the second half explains how to fine-tune BERT for downstream tasks.
1. Pre-trained BERT Models
Training a BERT-base model (roughly 110 million parameters) from scratch on 16GB of data is computationally expensive, so Google has released BERT models in various configurations that can be fine-tuned directly for downstream tasks. L denotes the number of Transformer encoder layers, and H the hidden size.
Figure: BERT model configurations
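As a quick illustration (not part of the original text), the configuration of a released checkpoint can be inspected with the Transformers library; bert-base-uncased is assumed here, and the parameter count (approximately 110M) is computed directly from the loaded weights:
from transformers import BertConfig, BertModel

# Inspect L (number of encoder layers) and H (hidden size) of a released checkpoint
config = BertConfig.from_pretrained('bert-base-uncased')
print(config.num_hidden_layers, config.hidden_size)  # 12 768

# Count the parameters of the corresponding model (roughly 110M for BERT-base)
model = BertModel.from_pretrained('bert-base-uncased')
print(sum(p.numel() for p in model.parameters()))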
2. Extracting Embeddings from a Pre-trained BERT Model
- Token-level (word-level) features
- Sentence-level features. In practice, averaging or pooling the features of all tokens often works better than relying solely on the [CLS] token's feature (see the mean-pooling sketch below)
Figure: features
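Below is a minimal sketch of sentence-level mean pooling, assuming a last_hidden_state of shape [batch_size, seq_len, hidden_size] and the matching attention_mask (both are produced by the code later in this article):
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Zero out [PAD] positions, then average over the real tokens only
    mask = attention_mask.unsqueeze(-1).float()       # [batch, seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)    # [batch, hidden_size]
    counts = mask.sum(dim=1).clamp(min=1e-9)          # number of real tokens
    return summed / counts                            # [batch, hidden_size]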
2.1 Installing the Hugging Face Transformers Library
pip install transformers
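To confirm the installation (and that PyTorch, which the examples below rely on, is available), you can print the library versions:
import torch
import transformers

print(transformers.__version__)
print(torch.__version__)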
2.2 Generating BERT Embeddings
- Preprocess the sentence
- Run the model to obtain the embeddings
2.2.1 Preprocessing Sentences
2.2.1.1 Loading the Model and Tokenizer
import torch
from transformers import BertModel, BertTokenizer

# Load the pre-trained model
model = BertModel.from_pretrained('bert-base-uncased')
# Load the matching tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
2.2.1.2 Manually Preprocessing a Sentence
Steps:
- Tokenize the sentence: tokenizer.tokenize(sentence)
- Add the [CLS] and [SEP] tokens
- Pad with [PAD]
- Build the attention mask: positions holding [PAD] get 0 so the model does not attend to them
- Convert tokens to token ids: tokenizer.convert_tokens_to_ids(tokens) (token ids can be decoded back with tokenizer.decode(input_ids))
- Convert the token ids and the attention mask to tensors
sentence = 'I love China'
print('Sentence: {}'.format(sentence))
# Sentence: I love China
tokens = tokenizer.tokenize(sentence)
print('Tokens: {}'.format(tokens))
# Tokens: ['i', 'love', 'china']
tokens = ['[CLS]'] + tokens + ['[SEP]']
print('With [CLS] and [SEP] added: {}'.format(tokens))
# With [CLS] and [SEP] added: ['[CLS]', 'i', 'love', 'china', '[SEP]']
tokens = tokens + ['[PAD]'] + ['[PAD]']
print('Padded with [PAD]: {}'.format(tokens))
# Padded with [PAD]: ['[CLS]', 'i', 'love', 'china', '[SEP]', '[PAD]', '[PAD]']
attention_mask = [1 if i != '[PAD]' else 0 for i in tokens]
print('Attention mask: {}'.format(attention_mask))
# Attention mask: [1, 1, 1, 1, 1, 0, 0]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print('Token ids: {}'.format(input_ids))
# Token ids: [101, 1045, 2293, 2859, 102, 0, 0]
decode_ids = tokenizer.decode(input_ids)
print('Decoded token ids: {}'.format(decode_ids))
# Decoded token ids: [CLS] i love china [SEP] [PAD] [PAD]
attention_mask = torch.tensor(attention_mask).unsqueeze(0)
input_ids = torch.tensor(input_ids).unsqueeze(0)
print('Attention mask tensor: {}'.format(attention_mask))
# Attention mask tensor: tensor([[1, 1, 1, 1, 1, 0, 0]])
print('Token id tensor: {}'.format(input_ids))
# Token id tensor: tensor([[ 101, 1045, 2293, 2859, 102, 0, 0]])
2.2.1.3 Encoding a Sentence
tokenizer(sentence)
sentence = 'I love China'
inputs = tokenizer(sentence)
print('Sentence: {}'.format(sentence))
# Sentence: I love China
print('input_ids: {}'.format(inputs['input_ids']))
# input_ids: [101, 1045, 2293, 2859, 102]
print('attention_mask: {}'.format(inputs['attention_mask']))
# attention_mask: [1, 1, 1, 1, 1]
print('token_type_ids: {}'.format(inputs['token_type_ids']))
# token_type_ids: [0, 0, 0, 0, 0]
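One note beyond the original example: the tokenizer can return PyTorch tensors directly via return_tensors='pt', which avoids the manual torch.tensor(...).unsqueeze(0) conversion used earlier:
inputs = tokenizer(sentence, return_tensors='pt')
print(inputs['input_ids'])
# tensor([[ 101, 1045, 2293, 2859,  102]])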
2.2.1.4 Encoding Two Sentences with Padding
tokenizer([sentence_a, sentence_b], padding=True)
sentence_a = 'This is a short sentence.'
sentence_b = 'This is a rather long sequence. It is at least longer than the sequence A.'
print('Sentence A: {}'.format(sentence_a))
print('Sentence B: {}'.format(sentence_b))
outputs = tokenizer([sentence_a, sentence_b], padding=True)
print('input_ids: {}'.format(outputs['input_ids']))
# input_ids: [[101, 2023, 2003, 1037, 2460, 6251, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 2023, 2003, 1037, 2738, 2146, 5537, 1012, 2009, 2003, 2012, 2560, 2936, 2084, 1996, 5537, 1037, 1012, 102]]
print('attention_mask: {}'.format(outputs['attention_mask']))
# attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
print('token_type_ids: {}'.format(outputs['token_type_ids']))
# token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
2.2.1.5 Encoding a Concatenated Sentence Pair
tokenizer(sentence_a, sentence_b)
sentence_a = 'This is a short sentence.'
sentence_b = 'This is a rather long sequence. It is at least longer than the sequence A.'
print('Sentence A: {}'.format(sentence_a))
print('Sentence B: {}'.format(sentence_b))
inputs = tokenizer(sentence_a, sentence_b)
print('input_ids: {}'.format(inputs['input_ids']))
# input_ids: [101, 2023, 2003, 1037, 2460, 6251, 1012, 102, 2023, 2003, 1037, 2738, 2146, 5537, 1012, 2009, 2003, 2012, 2560, 2936, 2084, 1996, 5537, 1037, 1012, 102]
print('attention_mask: {}'.format(inputs['attention_mask']))
# attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print('token_type_ids: {}'.format(inputs['token_type_ids']))
# token_type_ids: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
2.2.2 Running the Model to Obtain Embeddings
Calling the model:
model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
Outputs:
- outputs['pooler_output']: the sentence-level embedding, which the NSP head uses for prediction. It is obtained by passing the [CLS] token's representation through a feed-forward layer with a tanh activation
- outputs['last_hidden_state']: the token-level embeddings, which the MLM head uses for prediction
sentence_a = 'This is a short sentence.'
sentence_b = 'This is a rather long sequence. It is at least longer than the sequence A.'
inputs = tokenizer(sentence_a, sentence_b)
input_ids = torch.tensor(inputs['input_ids']).unsqueeze(0)
attention_mask = torch.tensor(inputs['attention_mask']).unsqueeze(0)
token_type_ids = torch.tensor(inputs['token_type_ids']).unsqueeze(0)
outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
pooler_output = outputs['pooler_output']
last_hidden_state = outputs['last_hidden_state']
print('pooler_output shape: {}'.format(pooler_output.shape)) # [batch_size, hidden_size]
# pooler_output shape: torch.Size([1, 768])
print('last_hidden_state shape: {}'.format(last_hidden_state.shape)) # [batch_size, sequence_length, hidden_size]
# last_hidden_state shape: torch.Size([1, 26, 768])
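As a sanity check (a sketch reusing model and outputs from the code above, and assuming the standard BertModel pooler layout in recent versions of the library), pooler_output can be reproduced by passing the [CLS] position of last_hidden_state through the model's pooler, i.e. a dense layer followed by tanh:
import torch

with torch.no_grad():
    # [CLS] representation, shape [batch_size, hidden_size]
    cls_vec = outputs['last_hidden_state'][:, 0]
    # Assumes BertModel exposes its pooler as model.pooler.dense followed by tanh
    recomputed = torch.tanh(model.pooler.dense(cls_vec))
    print(torch.allclose(recomputed, outputs['pooler_output'], atol=1e-6))  # True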
3. Extracting Embeddings from All Encoder Layers of BERT
Why: the features of a single hidden layer are not always the best choice for a downstream task. The BERT authors ran an experiment on named entity recognition, comparing the F1 scores obtained with features from different layers; concatenating the last four hidden layers gave the best F1 score.
Figure: F1 scores for embeddings from different layers
3.1 How to Extract Them
- BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True): set output_hidden_states=True to obtain the features of every layer
Outputs:
- outputs['hidden_states']: a tuple of 13 tensors, each of shape [batch_size, sequence_length, hidden_size]
- hidden_states[0]: the output of the input embedding layer
- hidden_states[-1]: the output of the last encoder layer
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence_a = 'This is a short sentence.'
sentence_b = 'This is a rather long sequence. It is at least longer than the sequence A.'
print('Sentence A: {}'.format(sentence_a))
print('Sentence B: {}'.format(sentence_b))
outputs = tokenizer(sentence_a, sentence_b)
input_ids = torch.tensor(outputs['input_ids']).unsqueeze(0)
attention_mask = torch.tensor(outputs['attention_mask']).unsqueeze(0)
token_type_ids = torch.tensor(outputs['token_type_ids']).unsqueeze(0)
outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
pooler_output = outputs['pooler_output']
last_hidden_state = outputs['last_hidden_state']
hidden_states = outputs['hidden_states']
print('pooler_output shape: {}'.format(pooler_output.shape)) # [batch_size, hidden_size]
# pooler_output shape: torch.Size([1, 768])
print('last_hidden_state shape: {}'.format(last_hidden_state.shape)) # [batch_size, sequence_length, hidden_size]
# last_hidden_state shape: torch.Size([1, 26, 768])
print('hidden_states length: {}'.format(len(hidden_states))) # 13
# hidden_states length: 13
print('hidden_states[0] shape: {}'.format(hidden_states[0].shape)) # [batch_size, sequence_length, hidden_size]
# hidden_states[0] shape: torch.Size([1, 26, 768])
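As a follow-up sketch (reusing hidden_states from the code above), the last four encoder layers can be concatenated along the hidden dimension, the combination that scored best in the NER experiment:
import torch

# Concatenate the last four encoder layers along the hidden dimension
concat_last4 = torch.cat(hidden_states[-4:], dim=-1)
print(concat_last4.shape)  # torch.Size([1, 26, 3072]), i.e. [batch_size, sequence_length, 4 * hidden_size]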