【唐宇迪】Deep Learning Paper Deep-Dive Series: The BERT Model
Bilibili link: https://www.bilibili.com/video/BV1vg4y1i7zX?p=2
论文:BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Lesson 1: Overview of the Paper Walkthrough
BERT: Bidirectional Encoder Representations from Transformers
Lesson 2: Overview of the BERT Abstract
0 Abstract
BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks. >> Pre-training only needs raw text, no labels; conceptually it is like doing a cloze (fill-in-the-blank) exercise from context. The pre-trained model can then be reused almost directly: only an extra output layer has to be added.
BERT is conceptually simple and empirically powerful.
Lesson 3: How Well the Model Performs on NLP Tasks
1 Introduction
The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer.
Lesson 4: What Pre-trained Models Are For
2 Related Work
3 BERT
We introduce BERT and its detailed implementation in this section. There are two steps in our framework: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. >> The pre-trained parameters can be reused as-is; it is the fine-tuning step that requires labeled data.
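A minimal sketch of this two-step split, assuming the Hugging Face `transformers` library and PyTorch (the paper itself does not prescribe this API; model name, data, and hyperparameters below are illustrative only):

```python
# Sketch only: pre-trained weights are loaded as-is, and only the
# fine-tuning step consumes labeled downstream data.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Step 1 equivalent: load parameters obtained from unsupervised pre-training.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Step 2: fine-tune *all* parameters with labeled downstream data.
texts = ["a great movie", "a terrible movie"]   # hypothetical labeled examples
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)         # loss comes from the added output layer
outputs.loss.backward()
optimizer.step()
```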
Lesson 5: Special Tokens in the Input Encoding
A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture. >> The same architecture is used across applications; pre-training and the downstream task differ only slightly.
We denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. We primarily report results on two model sizes: BERT_BASE (L=12, H=768, A=12, Total Parameters=110M) and BERT_LARGE (L=24, H=1024, A=16, Total Parameters=340M). >> In BERT_BASE the 768-dimensional hidden vector is produced by 12 attention heads whose outputs are concatenated; for most applications the base model is already sufficient.
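For concreteness, the two sizes can be written out as configurations; the sketch below assumes the Hugging Face `BertConfig` class (not something the paper uses), with the standard feed-forward size of 4H:

```python
from transformers import BertConfig

# BERT_BASE: L=12 layers, H=768 hidden size, A=12 attention heads (~110M params)
base_config = BertConfig(num_hidden_layers=12, hidden_size=768,
                         num_attention_heads=12, intermediate_size=3072)

# BERT_LARGE: L=24 layers, H=1024 hidden size, A=16 attention heads (~340M params)
large_config = BertConfig(num_hidden_layers=24, hidden_size=1024,
                          num_attention_heads=16, intermediate_size=4096)
```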
To make BERT handle a variety of downstream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. >> For question answering, the input therefore packs both the question and the passage into a single sequence.
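A plain-Python sketch of how such a pair is packed into one token sequence with the [CLS] and [SEP] special tokens plus segment ids (the whitespace tokenizer is a stand-in for illustration only; BERT itself uses WordPiece subwords):

```python
def pack_pair(sentence_a, sentence_b):
    """Pack two sentences into one BERT-style input sequence."""
    tokens_a = sentence_a.split()
    tokens_b = sentence_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 up to and including the first [SEP], 1 afterwards.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segments = pack_pair("who wrote hamlet", "hamlet was written by shakespeare")
# tokens   -> ['[CLS]', 'who', 'wrote', 'hamlet', '[SEP]', 'hamlet', ..., '[SEP]']
# segments -> [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```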
Lesson 6: How Features Are Encoded into Vectors
For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2. >> Which sentence a token belongs to and where it sits in the sequence are both encoded (the same word can mean different things at different positions; see the segment and position rows of Figure 2); domain-specific information could be encoded in the same way. The three embeddings are added together element-wise.
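A minimal PyTorch sketch of this summation; the vocabulary size, maximum length, and token ids below are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768    # BERT-base-like sizes (illustrative)

token_emb    = nn.Embedding(vocab_size, hidden)  # one vector per WordPiece token
segment_emb  = nn.Embedding(2, hidden)           # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)     # learned absolute positions

input_ids   = torch.tensor([[101, 2054, 2003, 102]])   # hypothetical token ids
segment_ids = torch.tensor([[0, 0, 0, 0]])
positions   = torch.arange(input_ids.size(1)).unsqueeze(0)

# The input representation is the element-wise sum of the three embeddings.
x = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)
print(x.shape)   # torch.Size([1, 4, 768])
```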
Lesson 7: BERT's Pre-training Strategy
We pre-train BERT using two unsupervised tasks, described in this section. This step is presented in the left part of Figure 1.
Task #1: Masked LM. Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. >> This is the cloze-style fill-in-the-blank from context mentioned earlier.
In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a "masked LM" (MLM). >> 15% of the tokens are masked at random. Replacing masked words with a special [MASK] token creates a mismatch, because [MASK] never appears during fine-tuning; the paper mitigates this by not always using [MASK] (the chosen token is sometimes replaced with a random word or left unchanged).
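A minimal sketch of the masking procedure, assuming the 80%/10%/10% replacement split described in the paper; the toy vocabulary and sentence are made up for illustration:

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]   # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Return (masked tokens, labels); labels hold the original word at masked
    positions and None elsewhere, mirroring the MLM prediction targets."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:           # select ~15% of positions
            labels[i] = tok                       # the model must recover this word
            r = random.random()
            if r < 0.8:
                masked[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(VOCAB)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return masked, labels

print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"]))
```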
Task #2: Next Sentence Prediction (NSP), i.e., predicting whether two sentences actually follow each other. Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences.
Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext). >> So B really follows A half of the time; the [CLS] representation is fed to a binary classifier to make the IsNext/NotNext decision.
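A minimal sketch of how such training pairs could be assembled; the toy corpus and the helper name `make_nsp_example` are made up for illustration:

```python
import random

corpus = [
    "the cat sat on the mat", "it fell asleep in the sun",
    "stocks rose sharply today", "investors welcomed the news",
]

def make_nsp_example(index):
    """Build one (sentence_a, sentence_b, label) NSP training example."""
    sentence_a = corpus[index]
    if random.random() < 0.5 and index + 1 < len(corpus):
        return sentence_a, corpus[index + 1], "IsNext"     # the real next sentence
    # A real implementation would avoid accidentally sampling the true next sentence.
    return sentence_a, random.choice(corpus), "NotNext"    # a random sentence

print(make_nsp_example(0))
```

In the model itself, each pair is packed with [CLS]/[SEP] as above, and the final [CLS] vector goes through a two-way classifier for the IsNext/NotNext label.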
Lesson 8: Summary and Analysis (Fine-tuning BERT)
For each task, we simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end. >> Everything is trained end-to-end, straight from input to output.
At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis. >> For named entity recognition (deciding whether each word is a place name, a person name, etc.) a prediction is needed for every token, so every token representation is passed to the output layer; for classification only the [CLS] representation is used. Different applications therefore select different representations at the output layer.
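A minimal PyTorch sketch of the two kinds of output heads; the sizes and the random tensor standing in for BERT's final hidden states are illustrative assumptions:

```python
import torch
import torch.nn as nn

hidden, num_tags, num_classes = 768, 9, 2        # illustrative sizes
sequence_output = torch.randn(1, 11, hidden)     # stand-in for BERT's final layer
                                                 # (batch=1, 11 tokens incl. [CLS])

# Token-level head (e.g., NER / sequence tagging): one prediction per token.
tagging_head = nn.Linear(hidden, num_tags)
tag_logits = tagging_head(sequence_output)       # shape (1, 11, num_tags)

# Sentence-level head (e.g., sentiment): only the [CLS] vector is used.
cls_vector = sequence_output[:, 0]               # [CLS] is the first token
classifier = nn.Linear(hidden, num_classes)
cls_logits = classifier(cls_vector)              # shape (1, num_classes)
```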
All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. >> Fine-tuning from the pre-trained model takes only a few hours at most.