The first half covered how to extract embeddings from the BERT model; this second half covers how to fine-tune BERT for downstream tasks, split into four tasks.
Downstream tasks:
Fine-tuning approaches:
- Update the classifier layer's weights together with the BERT model's weights (the usual setup, and it generally performs better)
- Update only the classifier layer's weights and keep the BERT weights frozen, so that BERT acts purely as a feature extractor (a minimal sketch follows below)
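For the second approach, here is a minimal sketch of "BERT as a feature extractor" (assumptions of mine, not from the original text: the distilbert-base-uncased checkpoint used later in this post, and freezing the backbone via the model's base_model attribute):

from transformers import AutoModelForQuestionAnswering

# Load a checkpoint with a task head on top (same checkpoint as the QA code below).
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

# Freeze every parameter of the backbone; only the task head keeps requires_grad=True,
# so training updates the head while BERT acts purely as a feature extractor.
for param in model.base_model.parameters():
    param.requires_grad = False

# Confirm which parameters would still be updated during training.
print([name for name, p in model.named_parameters() if p.requires_grad])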
1 Question Answering
1.1 Task Description
Question answering comes in two forms:
- Extractive: the answer is extracted directly from the given context.
- Abstractive: a correct answer to the question is generated from the given context.
This post covers the extractive form: the input is a question together with a paragraph that contains the answer, and the answer is then extracted from that paragraph. In essence, the task is to return the span of text that contains the answer; what the model actually predicts are the indices of the answer's start and end positions within the paragraph.
Steps:
- Introduce a start vector S and an end vector E.
- Feed the question and the answer-bearing paragraph into BERT to obtain a representation for each token.
- Take the dot product of each paragraph token's representation with S and with E, then apply a softmax over the paragraph tokens to obtain the probability of each token being the start or end position:
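Following the formulation in the original BERT paper (where $T_i$ denotes BERT's output representation for paragraph token $i$):

$$P_i^{\text{start}} = \frac{\exp(S \cdot T_i)}{\sum_j \exp(S \cdot T_j)}, \qquad P_i^{\text{end}} = \frac{\exp(E \cdot T_i)}{\sum_j \exp(E \cdot T_j)}$$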
1.2 Code
1.2.1 QA task code
- Hugging Face QA task guide: https://huggingface.co/docs/transformers/tasks/question_answering
- SQuAD dataset: https://huggingface.co/datasets/squad
- Tokenizer parameter reference, Huggingface Transformers library overview (Tokenizer, Pipeline): https://blog.csdn.net/weixin_42475060/article/details/128105633
- Hugging Face Hub usage guide: https://www.jianshu.com/p/5337a01f1cae?v=1683977910521
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, Trainer, TrainingArguments
from transformers import DefaultDataCollator
from datasets import load_dataset  # the old `nlp` package was renamed to `datasets`
from huggingface_hub import notebook_login
from transformers import pipeline
notebook_login()
def train():
    # squad: https://huggingface.co/datasets/squad
    squad = load_dataset("squad", split="train[:5000]")
    print("squad: {}".format(squad))
    # Split off 20% of the data for evaluation: test_size=0.2
    dataset = squad.train_test_split(test_size=0.2)
    train_set = dataset["train"]
    test_set = dataset["test"]
    print("train_set[0]: {}".format(train_set[0]))
    print("test_set[0]: {}".format(test_set[0]))
    print("train_set: {}".format(train_set))
    print("test_set: {}".format(test_set))
    # Load the model: distilbert
    model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
    # Tokenizer
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    # What the model predicts: after the question and context are concatenated and tokenized,
    # the start and end token indices of the answer span.
    def preprocess_function(examples):
        questions = [q.strip() for q in examples["question"]]
        # Huggingface Transformers library overview (Tokenizer, Pipeline): https://blog.csdn.net/weixin_42475060/article/details/128105633
        inputs = tokenizer(
            questions,
            examples["context"],
            max_length=384,  # cap the total sequence length
            truncation="only_second",  # truncate only the second input, i.e. the context
            return_offsets_mapping=True,  # return each token's character offsets in the input string
            padding="max_length",  # pad to the maximum length
        )
        # Remove offset_mapping from inputs up front; the model does not take it as an input
        offset_mapping = inputs.pop("offset_mapping")
        answers = examples["answers"]
        start_positions = []
        end_positions = []
        for i, offset in enumerate(offset_mapping):
            answer = answers[i]
            start_char = answer["answer_start"][0]  # character index where the answer starts in the context
            end_char = answer["answer_start"][0] + len(answer["text"][0])  # character index where the answer ends in the context
            sequence_ids = inputs.sequence_ids(i)
            # Find the start and end of the context
            idx = 0
            while sequence_ids[idx] != 1:
                idx += 1
            context_start = idx
            while sequence_ids[idx] == 1:
                idx += 1
            context_end = idx - 1
            # If the answer is not fully inside the context, label it (0, 0)
            if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
                start_positions.append(0)
                end_positions.append(0)
            else:
                # Otherwise it's the start and end token positions
                idx = context_start
                while idx <= context_end and offset[idx][0] <= start_char:
                    idx += 1
                start_positions.append(idx - 1)
                idx = context_end
                while idx >= context_start and offset[idx][1] >= end_char:
                    idx -= 1
                end_positions.append(idx + 1)
        inputs["start_positions"] = start_positions
        inputs["end_positions"] = end_positions
        return inputs

    # Drop the raw columns that are no longer needed: remove_columns=squad.column_names
    train_set = train_set.map(preprocess_function, batched=True, remove_columns=squad.column_names)
    test_set = test_set.map(preprocess_function, batched=True, remove_columns=squad.column_names)
    # Use the DefaultDataCollator to turn the inputs into PyTorch tensors
    data_collator = DefaultDataCollator()
    training_args = TrainingArguments(
        output_dir="~/Documents/huggingface_local_hub/llm/task_qa_distilbert",
        hub_model_id="smile367/task_qa_distilbert",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
        push_to_hub=True,
    )
    # The QA task is trained with cross-entropy loss over the start and end positions
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_set,
        eval_dataset=test_set,
        tokenizer=tokenizer,
        data_collator=data_collator,
    )
    trainer.train()
    # Pushing to the Hub requires git-lfs: https://git-lfs.com/
    trainer.push_to_hub()
    print("ok")
def inference():
    test_data = load_dataset("squad", split="validation[:1]")
    print("question: {}".format(test_data["question"]))
    print("context: {}".format(test_data["context"]))
    print("answers: {}".format(test_data["answers"]))
    # Load the fine-tuned checkpoint from the Hub and run it through the QA pipeline
    question_answerer = pipeline("question-answering", model="smile367/task_qa_distilbert")
    result = question_answerer(question=test_data["question"], context=test_data["context"])
    print("result: {}".format(result))
    print("ok")
1.2.2 Walkthrough of the SQuAD preprocessing
- You can step through this script in a debugger to observe what each stage of the preprocessing does.
from transformers import AutoTokenizer
from datasets import load_dataset
if __name__ == '__main__':
    dataset = load_dataset("squad", split="train[:2]")
    question = dataset["question"]
    context = dataset["context"]
    answers = dataset["answers"]
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    inputs = tokenizer(question, context, max_length=384, truncation="only_second", return_offsets_mapping=True, padding="max_length")
    start_positions = []
    end_positions = []
    offset_mapping = inputs.pop("offset_mapping")
    for i, offset in enumerate(offset_mapping):
        print()
        print("question: {}".format(question[i]))
        print("context: {}".format(context[i]))
        print("answers: {}".format(answers[i]["text"]))
        print("inputs.tokens: {}".format(inputs.tokens(i)))
        print("inputs.sequence_ids: {}".format(inputs.sequence_ids(i)))
        print("offset: {}".format(offset))
        # Start and end character indices of the answer within the context string (not token indices)
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        print("start_char: {}, end_char: {}, context[start_char: end_char]: {}".format(start_char, end_char, context[i][start_char: end_char]))
        # Find the start and end token indices of the context within the tokenized sequence
        sequence_ids = inputs.sequence_ids(i)
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1
        print("context_start: {}, context_end: {}, tokens[context_start: context_end]: {}".format(context_start, context_end, inputs.tokens(i)[context_start: context_end]))
        # If the answer is not fully inside the context, label it (0, 0)
        # i.e. the answer's character span does not fall within the context's token span
        print("offset[context_start][0]: {}, offset[context_end][1]: {}".format(offset[context_start][0], offset[context_end][1]))
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start  # scan from the first context token towards the last to find the token covering start_char
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end  # scan from the last context token back towards the first to find the token covering end_char
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
        print("start_position: {}, end_position: {}, inputs.tokens[start_position: end_position + 1]: {}".format(start_positions[i], end_positions[i], inputs.tokens(i)[start_positions[i]: end_positions[i] + 1]))
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    print("inputs: {}".format(inputs))
    print("ok")
References
[1]. BERT基础教程:Transformer大模型实战
[2]. Hugging Face QA task guide: https://huggingface.co/docs/transformers/tasks/question_answering
[3]. SQuAD dataset documentation: https://huggingface.co/datasets/squad
[4]. Hugging Face Hub usage guide: https://www.jianshu.com/p/5337a01f1cae?v=1683977910521
[5]. Huggingface Transformers library overview (Tokenizer, Pipeline): https://blog.csdn.net/weixin_42475060/article/details/128105633