BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)
1. Overview
- BERT (Bidirectional Encoder Representations from Transformers) is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
2. Two-step Framework
- Pre-training: The model is trained on unlabeled data over different pre-training tasks.
- Fine-tuning: The BERT model is initialized with the pre-trained parameters and then fine-tuned using labeled data from the downstream tasks; each downstream task gets its own fine-tuned model (see the sketch below).
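A minimal sketch of the two-step idea, assuming the Hugging Face `transformers` library (an external tool, not part of the paper): the same pre-trained checkpoint initializes the encoder for each downstream task, and a task-specific head is fine-tuned separately.

```python
# Sketch of the fine-tuning step, assuming the Hugging Face `transformers`
# library (not part of the paper). The pre-trained checkpoint initializes the
# encoder; a fresh classification head is added for the downstream task.
from transformers import BertForSequenceClassification

sentiment_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
# ... train `sentiment_model` on the task's labeled data; each downstream
# task keeps its own fine-tuned copy of the model.
```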
3. Input/Output Representations
- Although the downstream tasks differ, the input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence.
- [CLS]: Always the first token of every sequence. In classification tasks, its final hidden state is used as the aggregate representation of the sequence.
- [SEP]: Separates sentence pairs. In addition, a learned segment embedding is added to each token to indicate which sentence it belongs to.
- For a given token, its input representation is constructed by summing the corresponding token, segment and position embeddings.
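A minimal PyTorch sketch of this sum of embeddings; the vocabulary, token ids, and dimensions below are toy values rather than the paper's actual configuration.

```python
# Toy PyTorch sketch: input representation = token + segment + position embeddings.
# Vocabulary size, hidden size, and the token ids below are illustrative only.
import torch
import torch.nn as nn

vocab_size, hidden_size, max_len = 100, 16, 32
token_emb = nn.Embedding(vocab_size, hidden_size)
segment_emb = nn.Embedding(2, hidden_size)         # 0 = sentence A, 1 = sentence B
position_emb = nn.Embedding(max_len, hidden_size)  # learned absolute positions

# "[CLS] w1 w2 [SEP] w3 w4 [SEP]" mapped to toy ids
token_ids = torch.tensor([[1, 10, 11, 2, 20, 21, 2]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

input_repr = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(input_repr.shape)  # torch.Size([1, 7, 16])
```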
4. Pre-training
- Masked LM: Mask some percentage of input tokens at random and then predict these masked tokens.
- Mask 15% of all tokens in each sequence at random.
- However, this creates a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning.
- To mitigate this problem, if the i-th token is chosen, BERT replaces it with:
- the [MASK] token 80% of the time
- a random token 10% of the time
- the unchanged i-th token 10% of the time
- Merits: Since the model does not know which input tokens have been replaced, it is forced to keep a distributional contextual representation of every input token (a minimal masking sketch follows this list).
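A minimal sketch of the 15% selection with the 80/10/10 replacement rule; token ids, vocabulary size, and special-token ids are toy assumptions.

```python
# Sketch of the masked-LM corruption rule described above.
# Token ids, vocab size, and special-token ids are toy values for illustration.
import random

VOCAB_SIZE = 1000
MASK_ID = 4              # assumed id of the [MASK] token in this toy vocab
SPECIAL_IDS = {0, 1, 4}  # e.g. [CLS], [SEP], [MASK]: never selected for prediction

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (masked_input, labels); labels are -1 for unselected positions."""
    masked, labels = list(token_ids), [-1] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if tok in SPECIAL_IDS or random.random() >= mask_prob:
            continue
        labels[i] = tok                      # the model must predict the original token
        r = random.random()
        if r < 0.8:
            masked[i] = MASK_ID              # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = random.randrange(VOCAB_SIZE)  # 10%: a random token
        # else: 10% keep the token unchanged
    return masked, labels
```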
- Next sentence prediction (NSP) is a binarized task that trains the model to understand sentence relationships; a minimal pair-generation sketch follows this list.
- The training samples can be generated from any monolingual corpus.
- Label IsNext: in 50% of samples, sentence B is the actual sentence that follows sentence A.
- Label NotNext: in the other 50%, sentence B is a sentence randomly selected from the corpus.
- The output representation of the special [CLS] token is used for NSP classification, as shown in Fig. 2.
- Pre-training data is a document-level corpus rather than a shuffled sentence-level corpus.
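A minimal sketch of NSP pair generation from a document-level corpus, represented here as a list of documents where each document is a list of sentences; purely illustrative.

```python
# Sketch of NSP pair generation from a document-level corpus.
# Assumes the corpus has at least two documents.
import random

def make_nsp_pairs(documents):
    """Yield (sentence_a, sentence_b, is_next) training examples."""
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            sentence_a = doc[i]
            if random.random() < 0.5:
                # IsNext: the actual next sentence in the same document.
                yield sentence_a, doc[i + 1], True
            else:
                # NotNext: a random sentence from a different document.
                other = random.choice([j for j in range(len(documents)) if j != doc_idx])
                yield sentence_a, random.choice(documents[other]), False
```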
5. Fine-tuning BERT
- BERT encodes a concatenated text pair with self-attention, which effectively includes bidirectional cross attention between the two sentences.
- Input: the above-mentioned sentence A and sentence B are analogous to
- sentence pairs in paraphrasing,
- hypothesis-premise pairs in entailment,
- question-passage pairs in question answering,
- a degenerate text-∅ pair in text classification or sequence tagging.
- Output:
- The token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering.
- The [CLS] representation is fed into an output layer for classification tasks, such as entailment or sentiment analysis (see the sketch below).
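A minimal PyTorch sketch of the two kinds of task-specific output layers; `hidden_states` stands in for the final-layer output of a pre-trained BERT encoder, and the label counts are illustrative assumptions.

```python
# Sketch of the two output-layer styles used during fine-tuning.
# `hidden_states` is a stand-in for BERT's final encoder output.
import torch
import torch.nn as nn

hidden_size, num_labels, num_tags = 768, 2, 9
hidden_states = torch.randn(1, 11, hidden_size)  # (batch, seq_len, hidden)

# Sequence-level task (entailment, sentiment): classify the [CLS] representation.
cls_head = nn.Linear(hidden_size, num_labels)
sequence_logits = cls_head(hidden_states[:, 0])   # shape: (1, num_labels)

# Token-level task (sequence tagging, QA span scoring): classify every token.
token_head = nn.Linear(hidden_size, num_tags)
token_logits = token_head(hidden_states)          # shape: (1, 11, num_tags)
```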
6. BERT vs. GPT vs. ELMo
- BERT uses a bidirectional Transformer. OpenAI GPT (Radford et al., 2018) uses a left-to-right Transformer. ELMo (Peters et al., 2018) uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks.
- BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.
References
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.