Preface
This post mainly introduces the Transformer model.
Paper references:
Attention is All You Need
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Reference articles:
Attention: principles and source-code walkthrough
Transformer explained
Language models and transfer learning
Google BERT.
Project references:
Transformer in Pytorch
RNN + Attention
Recall
Another formulation of Attention
Attention = A(Q, K, V) = softmax(sim(Q, K)) · V
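As a minimal sketch of this general form (assuming PyTorch tensors, with dot-product similarity as the default sim; the function name is illustrative):

```python
import torch.nn.functional as F

def attention(Q, K, V, sim=None):
    """General attention A(Q, K, V) = softmax(sim(Q, K)) · V.

    Q: (n_q, d), K: (n_k, d), V: (n_k, d_v); sim defaults to the dot product.
    """
    scores = Q @ K.t() if sim is None else sim(Q, K)  # similarity of every query with every key
    weights = F.softmax(scores, dim=-1)               # normalize over the keys
    return weights @ V                                # weighted sum of the values
```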
Advantages & Disadvantages
Advantages
- Takes positional information into account
Disadvantages
- No parallel computation: the recurrent encoder and decoder must be unrolled sequentially
- Only decoder-encoder attention; no attention within the encoder itself or the decoder itself
Transformer
Attention
In the Transformer model, attention is represented in the following way:
Scaled Dot-Product Attention
When d_k becomes large, the dot products grow large in magnitude, so softmax(qK^T) saturates toward 0 or 1 and its gradients become very small; scaling the dot products by 1/sqrt(d_k) counteracts this
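A minimal PyTorch sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; the optional mask argument is an assumption added here so the same function can be reused for the masked decoder self-attention described below:

```python
import math
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)          # scale so softmax does not saturate
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))  # block attention to masked positions
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights
```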
Multi-head Attention
It is beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively
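A hedged sketch of multi-head attention under the paper's base setting (h = 8 heads, d_k = d_v = d_model / h); it reuses the scaled_dot_product_attention sketch above, and the class name is illustrative:

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        # learned linear projections for queries, keys, values and the final output
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch = query.size(0)
        # project, then split into h heads: (batch, h, seq_len, d_k)
        Q = self.w_q(query).view(batch, -1, self.h, self.d_k).transpose(1, 2)
        K = self.w_k(key).view(batch, -1, self.h, self.d_k).transpose(1, 2)
        V = self.w_v(value).view(batch, -1, self.h, self.d_k).transpose(1, 2)
        x, _ = scaled_dot_product_attention(Q, K, V, mask=mask)
        # concatenate the heads back together and apply the output projection
        x = x.transpose(1, 2).contiguous().view(batch, -1, self.h * self.d_k)
        return self.w_o(x)
```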
Self Attention
Besides encoder-decoder attention, self-attention also appears within the encoder itself and within the decoder itself.
Encoder self-attention: Q = K = V = output of the previous layer
Decoder self-attention: Q = K = V = output of the previous layer, with everything to the right masked out, so the current position may only attend to previous positions, never to future ones
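A small sketch of the mask used by decoder self-attention: a lower-triangular matrix whose entry (i, j) is 1 when position i may attend to position j (j <= i), following the convention of the scaled_dot_product_attention sketch above, where 0 means "blocked":

```python
import torch

def subsequent_mask(size):
    """(1, size, size) lower-triangular mask; 1 = may attend, 0 = future position, blocked."""
    return torch.tril(torch.ones(1, size, size, dtype=torch.uint8))

# For a length-4 sequence, position 2 may attend to positions 0-2 but not to 3.
print(subsequent_mask(4))
```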
Encoder
Embedding-Layer: token embedding & positional embedding (described later)
SubLayer_1: Multi-Head Attention: encoder self-attention
SubLayer_2: FeedForward Networks: a simple, position-wise fully connected feed-forward network
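Putting the two encoder sub-layers together, a minimal sketch (it reuses the MultiHeadAttention sketch above; the residual-plus-LayerNorm wrapping follows the LayerNorm(x + SubLayer(x)) rule given in the decoder section below, and d_ff = 2048 is the paper's base setting):

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)
        self.feed_forward = nn.Sequential(                  # position-wise FFN
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, src_mask=None):
        # SubLayer_1: encoder self-attention, Q = K = V = x
        x = self.norm1(x + self.self_attn(x, x, x, mask=src_mask))
        # SubLayer_2: position-wise feed-forward network
        x = self.norm2(x + self.feed_forward(x))
        return x
```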
Decoder
Embedding-Layer: token embedding & positional embedding
SubLayer_1: Masked Multi-Head Attention: decoder masked self-attention
SubLayer_2: Multi-Head Attention (encoder-decoder attention):
Q: output of the previous decoder layer
K, V: output of the encoder
SubLayer_3: FeedForward Networks: a simple, position-wise fully connected feed-forward network
Linear & Softmax: a final linear projection followed by softmax to produce the output distribution
Input of each sub-layer: x
Output of each sub-layer: LayerNorm(x + SubLayer(x))
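The three decoder sub-layers, each wrapped as LayerNorm(x + SubLayer(x)), can be sketched in the same spirit (again reusing the MultiHeadAttention sketch; memory denotes the encoder output):

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)   # masked decoder self-attention
        self.src_attn = MultiHeadAttention(d_model, h)    # encoder-decoder attention
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, memory, src_mask=None, tgt_mask=None):
        # every sub-layer follows output = LayerNorm(x + SubLayer(x))
        x = self.norm1(x + self.self_attn(x, x, x, mask=tgt_mask))           # masked self-attention
        x = self.norm2(x + self.src_attn(x, memory, memory, mask=src_mask))  # Q from decoder, K/V from encoder
        x = self.norm3(x + self.feed_forward(x))                             # position-wise FFN
        return x
```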
Positional-Encoding
Since there is no recurrence and no convolution, positional encoding is used to make use of the order of the sequence.
d_model: the same size as the token embedding
pos: the position of the token in the sequence
i: the dimension index
That is, each dimension of the positional encoding corresponds to a sinusoid
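A sketch of the sinusoidal encoding from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); max_len is an illustrative cap on sequence length, and the result is simply added to the token embeddings:

```python
import math
import torch

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same argument)."""
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)      # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))                   # 1 / 10000^(2i/d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                                           # (max_len, d_model)
```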
Experiments
DataSet
WMT'16 Multimodal Translation: Multi30k (de-en)
PreProcess
Train
- Elapsed time per epoch (on an NVIDIA Titan X)
Training set: 0.888 minutes
Validation set: 0.011 minutes
Evaluate
BERT(Bidirectional Encoder Representations from Transformers)
Pre-train Model
- ELMo: Shallow Bi-directional, like a traditional Language Model
- OpenAI GPT: left-to-right, like a decoder
- BERT: Deep Bi-directional, like an encoder
Input Embedding
- Token Embeddings are the word vectors; the first token is the [CLS] marker, which can later be used for classification tasks (CLS for classification)
- Segment Embeddings distinguish the two sentences, because pre-training involves not only the LM task but also a classification task that takes a sentence pair as input
- Position Embeddings differ from the Transformer described above: they are learned rather than sinusoidal
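A minimal sketch of how the three BERT input embeddings are combined; in BERT they are simply summed element-wise (the vocabulary size, d_model and class name below are illustrative assumptions):

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, d_model)     # word-piece vectors, [CLS] first
        self.segment = nn.Embedding(2, d_model)            # sentence A vs. sentence B
        self.position = nn.Embedding(max_len, d_model)     # learned positions, not sinusoidal

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))                # broadcasts over the batch dimension
```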