Attention Is All You Need [1]
1. Encoder-Decoder
- The encoder maps an input sequence of symbol representations $(x_1, \dots, x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, \dots, z_n)$.
- Given $\mathbf{z}$, the decoder then generates an output sequence $(y_1, \dots, y_m)$ of symbols one element at a time.
- At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
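A minimal sketch of this auto-regressive loop, assuming hypothetical `encode` / `decode_step` functions and `BOS_ID` / `EOS_ID` token ids that are not part of the paper:

```python
# Hypothetical sketch of auto-regressive generation with an encoder-decoder model.
# `encode` and `decode_step` stand in for the Transformer encoder and decoder stacks.

BOS_ID, EOS_ID, MAX_LEN = 1, 2, 50  # assumed special tokens and length limit

def greedy_decode(src_tokens, encode, decode_step):
    """Generate output symbols one element at a time.

    encode(src_tokens)            -> continuous representations z
    decode_step(z, prefix_tokens) -> array of scores over the vocabulary for the next symbol
    """
    z = encode(src_tokens)                 # z = (z_1, ..., z_n)
    output = [BOS_ID]                      # previously generated symbols
    for _ in range(MAX_LEN):
        scores = decode_step(z, output)    # consumes the generated prefix as additional input
        next_token = int(scores.argmax())  # greedy choice of the next symbol
        output.append(next_token)
        if next_token == EOS_ID:
            break
    return output
```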
Overview of Transformer
- Encoder
  - It has 6 identical layers.
  - Each layer has a multi-head self-attention sub-layer followed by a position-wise fully connected feed-forward sub-layer.
  - Each sub-layer uses a residual connection and is followed by layer normalization.
  - The residual connection takes $x + F(x)$ as its result, where $F$ is the function implemented by the sub-layer (see the sketch after this overview).
- Decoder
  - It has 6 identical layers.
  - Each layer has three sub-layers:
    - A masked multi-head attention sub-layer, which ensures that the prediction at time $t$ can only depend on the known outputs at positions less than $t$.
    - A multi-head attention sub-layer, which further attends over the output of the encoder stack.
    - A position-wise fully connected feed-forward sub-layer.
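A rough sketch of the sub-layer wiring described above ($x + F(x)$ followed by layer normalization), with a simplified `layer_norm` (no learned gain/bias) and placeholder `self_attention` / `feed_forward` callables:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last dimension (simplified LayerNorm without learned parameters).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, F):
    # Residual connection x + F(x), followed by layer normalization.
    return layer_norm(x + F(x))

def encoder_layer(x, self_attention, feed_forward):
    # Sub-layer 1: multi-head self-attention; sub-layer 2: position-wise feed-forward.
    x = sublayer(x, self_attention)
    x = sublayer(x, feed_forward)
    return x
```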
2. Attention
- Maps a query and a set of key-value pairs to an output, computed as a weighted sum of the values.
- Scaled Dot-product Attention
  - Given a query $Q$, keys $K$ and values $V$:
    - $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{\top}}{\sqrt{d_k}}\right)V$
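A minimal NumPy sketch of this formula (single head, no batching or masking):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output: (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # compatibility of each query with each key
    weights = softmax(scores, axis=-1)     # attention weights over the keys
    return weights @ V                     # weighted sum of the values
```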
- Multi-head Attention
  - Jointly collects information from different representation subspaces at different positions.
  - Given queries $Q$, keys $K$ and values $V$:
    - $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O$, where $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$, $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$.
  - $d_k$ and $d_v$ have the same dimension, $d_{\text{model}}/h$.
  - First, it linearly projects the queries, keys and values $h$ times to learn $h$ different projections $W_i^Q$, $W_i^K$ and $W_i^V$.
  - Next, it concatenates all of the resulting output values.
  - At last, it projects the concatenated vector back to a $d_{\text{model}}$-dimensional vector.
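A sketch of the three steps above, reusing the `scaled_dot_product_attention` helper from the previous sketch; the randomly initialized projection matrices are for illustration only:

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: lists of h per-head projection matrices; W_o: (h*d_v, d_model)."""
    heads = []
    for W_qi, W_ki, W_vi in zip(W_q, W_k, W_v):
        # 1. Linearly project queries, keys and values for this head,
        #    then run scaled dot-product attention on the projected vectors.
        heads.append(scaled_dot_product_attention(Q @ W_qi, K @ W_ki, V @ W_vi))
    # 2. Concatenate all head outputs; 3. project back to d_model dimensions.
    return np.concatenate(heads, axis=-1) @ W_o

# Example shapes from the base model: d_model = 512, h = 8, d_k = d_v = d_model // h = 64.
d_model, h = 512, 8
d_k = d_v = d_model // h
rng = np.random.default_rng(0)
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
W_o = rng.normal(size=(h * d_v, d_model))
x = rng.normal(size=(10, d_model))          # 10 positions; self-attention uses Q = K = V = x
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)   # shape (10, d_model)
```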
Example [2]
- (example figures from [2] omitted)
Attention in Transformer
- Encoder's multi-head self-attention:
  - $Q$ = $K$ = $V$ = the output of the previous layer.
  - Each position in the encoder can attend to all positions in the previous layer of the encoder.
- Decoder's masked multi-head:
  - $Q$ = $K$ = $V$ = the masked output of the previous layer.
  - For example, if we predict the $i$-th output token, all tokens after timestamp $i$ have to be masked.
  - This prevents leftward information flow in the decoder in order to preserve the auto-regressive property.
  - It masks out (by setting to $-\infty$) all values in the input of the softmax which correspond to illegal connections during the scaled dot-product attention (see the masking sketch after this list).
- Decoder's multi-head (encoder-decoder attention):
  - $Q$ = the output of the previous decoder layer, $K$ = $V$ = the encoder stack's output.
  - This allows every position in the decoder to attend over all positions in the input sequence.
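A small sketch of the masking step, reusing the `softmax` helper from the scaled dot-product sketch: illegal positions get $-\infty$ before the softmax, so their attention weights become zero (per-head projections are omitted for brevity):

```python
import numpy as np

def causal_mask(n):
    # mask[t, s] = 0 for s <= t (legal connection), -inf for s > t (illegal connection).
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_self_attention(X):
    """X: (n, d_model); each position may only attend to itself and earlier positions."""
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k) + causal_mask(X.shape[0])
    return softmax(scores, axis=-1) @ X
```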
3. Position-wise Feed-forward Networks
- Uses the ReLU activation function: $\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$.
- It's applied to each position separately and identically.
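A direct NumPy transcription of the formula; the sizes in the comment follow the paper's base model ($d_{\text{model}} = 512$, $d_{ff} = 2048$):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """x: (n_positions, d_model); the same W1, b1, W2, b2 are applied at every position."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU(x W1 + b1), inner dimension d_ff
    return hidden @ W2 + b2                 # project back to d_model

# Base-model sizes: W1: (512, 2048), W2: (2048, 512).
```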
4. Positional Encoding
- To make use of the order of the sequence.
- Added to the input embeddings at the bottoms of the encoder and decoder stacks.
- Has the same dimension $d_{\text{model}}$ as the embeddings.
- $PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$, $PE_{(pos, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$.
- $pos$ is the index of the position: it ranges over the input length $n$ in the encoder, while over the output length $m$ in the decoder.
- $i$ is the index of the dimension, with $0 \le 2i, 2i+1 < d_{\text{model}}$.
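A sketch of the sinusoidal encoding table (assumes an even $d_{\text{model}}$):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table: PE[pos, 2i] = sin(...), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                 # position index, one row per position
    i = np.arange(d_model // 2)[None, :]              # dimension-pair index
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

# The table is added to the input embeddings (which share the dimension d_model)
# at the bottom of both the encoder and the decoder stacks.
```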
Reference
[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
[2] 口仆. Transformer 原理解析 (An analysis of how the Transformer works). https://zhuanlan.zhihu.com/p/135873679