Attention Is All You Need [1]
1. Encoder-Decoder
- The encoder maps an input sequence of symbol representations $(x_1, \dots, x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, \dots, z_n)$.
- Given $\mathbf{z}$, the decoder then generates an output sequence $(y_1, \dots, y_m)$ of symbols one element at a time.
- At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
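A minimal sketch of this auto-regressive loop, assuming hypothetical `encode` / `decode_step` functions and `BOS_ID` / `EOS_ID` token ids that are not part of the paper:

```python
# Hypothetical sketch of auto-regressive generation with an encoder-decoder model.
# `encode` and `decode_step` stand in for the Transformer encoder and decoder stacks.

BOS_ID, EOS_ID, MAX_LEN = 1, 2, 50  # assumed special tokens and length limit

def greedy_decode(src_tokens, encode, decode_step):
    """Generate output symbols one element at a time.

    encode(src_tokens)            -> continuous representations z
    decode_step(z, prefix_tokens) -> array of scores over the vocabulary for the next symbol
    """
    z = encode(src_tokens)                 # z = (z_1, ..., z_n)
    output = [BOS_ID]                      # previously generated symbols
    for _ in range(MAX_LEN):
        scores = decode_step(z, output)    # consumes the generated prefix as additional input
        next_token = int(scores.argmax())  # greedy choice of the next symbol
        output.append(next_token)
        if next_token == EOS_ID:
            break
    return output
```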
Overview of Transformer
- Encoder
  - It has 6 identical layers.
  - Each layer has a multi-head self-attention sub-layer followed by a position-wise fully connected feed-forward sub-layer.
  - Each sub-layer uses a residual connection and is followed by layer normalization.
  - The residual connection takes $x + F(x)$ as its result, where $F$ is the function implemented by the sub-layer (see the sketch after this overview).
- Decoder
  - It has 6 identical layers.
  - Each layer has three sub-layers:
    - A masked multi-head attention sub-layer, which ensures that the prediction at time $t$ can only depend on the known outputs at positions less than $t$.
    - A multi-head attention sub-layer, which further attends over the output of the encoder stack.
    - A position-wise fully connected feed-forward sub-layer.
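A rough sketch of the sub-layer wiring described above ($x + F(x)$ followed by layer normalization), with a simplified `layer_norm` (no learned gain/bias) and placeholder `self_attention` / `feed_forward` callables:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last dimension (simplified LayerNorm without learned parameters).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, F):
    # Residual connection x + F(x), followed by layer normalization.
    return layer_norm(x + F(x))

def encoder_layer(x, self_attention, feed_forward):
    # Sub-layer 1: multi-head self-attention; sub-layer 2: position-wise feed-forward.
    x = sublayer(x, self_attention)
    x = sublayer(x, feed_forward)
    return x
```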
2. Attention
- Maps a query and a set of key-value pairs to an output, computed as a weighted sum of the values.
- Scaled Dot-product Attention
  - Given a query $Q$, keys $K$ and values $V$:
    - $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{\top}}{\sqrt{d_k}}\right)V$
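A minimal NumPy sketch of this formula (single head, no batching or masking):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output: (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # compatibility of each query with each key
    weights = softmax(scores, axis=-1)     # attention weights over the keys
    return weights @ V                     # weighted sum of the values
```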
- Multi-head Attention
  - Jointly collects information from different representation subspaces at different positions.
  - Given queries $Q$, keys $K$ and values $V$:
    - $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O$, where $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$, $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$.
  - $d_k$ and $d_v$ have the same dimension, $d_{\text{model}}/h$.
  - First, it linearly projects the queries, keys and values $h$ times to learn $h$ different projections $W_i^Q$, $W_i^K$ and $W_i^V$.
  - Next, it concatenates all of the resulting output values.
  - At last, it projects the concatenated vector back to a $d_{\text{model}}$-dimensional vector.
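A sketch of the three steps above, reusing the `scaled_dot_product_attention` helper from the previous sketch; the randomly initialized projection matrices are for illustration only:

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: lists of h per-head projection matrices; W_o: (h*d_v, d_model)."""
    heads = []
    for W_qi, W_ki, W_vi in zip(W_q, W_k, W_v):
        # 1. Linearly project queries, keys and values for this head,
        #    then run scaled dot-product attention on the projected vectors.
        heads.append(scaled_dot_product_attention(Q @ W_qi, K @ W_ki, V @ W_vi))
    # 2. Concatenate all head outputs; 3. project back to d_model dimensions.
    return np.concatenate(heads, axis=-1) @ W_o

# Example shapes from the base model: d_model = 512, h = 8, d_k = d_v = d_model // h = 64.
d_model, h = 512, 8
d_k = d_v = d_model // h
rng = np.random.default_rng(0)
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
W_o = rng.normal(size=(h * d_v, d_model))
x = rng.normal(size=(10, d_model))          # 10 positions; self-attention uses Q = K = V = x
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)   # shape (10, d_model)
```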
Example [2]
- (example figures from [2] omitted)
Attention in Transformer
- Encoder's multi-head self-attention:
  - $Q$ = $K$ = $V$ = the output of the previous layer.
  - Each position in the encoder can attend to all positions in the previous layer of the encoder.
- Decoder's masked multi-head:
  - $Q$ = $K$ = $V$ = the masked output of the previous layer.
  - For example, if we predict the $i$-th output token, all tokens after timestamp $i$ have to be masked.
  - This prevents leftward information flow in the decoder in order to preserve the auto-regressive property.
  - It masks out (by setting to $-\infty$) all values in the input of the softmax which correspond to illegal connections during the scaled dot-product attention (see the masking sketch after this list).
- Decoder's multi-head (encoder-decoder attention):
  - $Q$ = the output of the previous decoder layer, $K$ = $V$ = the encoder stack's output.
  - This allows every position in the decoder to attend over all positions in the input sequence.
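A small sketch of the masking step, reusing the `softmax` helper from the scaled dot-product sketch: illegal positions get $-\infty$ before the softmax, so their attention weights become zero (per-head projections are omitted for brevity):

```python
import numpy as np

def causal_mask(n):
    # mask[t, s] = 0 for s <= t (legal connection), -inf for s > t (illegal connection).
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_self_attention(X):
    """X: (n, d_model); each position may only attend to itself and earlier positions."""
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k) + causal_mask(X.shape[0])
    return softmax(scores, axis=-1) @ X
```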
3. Position-wise Feed-forward Networks
- Uses the ReLU activation function: $\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$.
- It's applied to each position separately and identically.
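A direct NumPy transcription of the formula; the sizes in the comment follow the paper's base model ($d_{\text{model}} = 512$, $d_{ff} = 2048$):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """x: (n_positions, d_model); the same W1, b1, W2, b2 are applied at every position."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU(x W1 + b1), inner dimension d_ff
    return hidden @ W2 + b2                 # project back to d_model

# Base-model sizes: W1: (512, 2048), W2: (2048, 512).
```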
4. Positional Encoding
- To make use of the order of the sequence.
- Added to the input embeddings at the bottoms of the encoder and decoder stacks.
- Has the same dimension $d_{\text{model}}$ as the embeddings.
- $PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$, $PE_{(pos, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$.
- $pos$ is the index of the position: it ranges over the input length $n$ in the encoder, while over the output length $m$ in the decoder.
- $i$ is the index of the dimension, with $0 \le 2i, 2i+1 < d_{\text{model}}$.
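A sketch of the sinusoidal encoding table (assumes an even $d_{\text{model}}$):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table: PE[pos, 2i] = sin(...), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                 # position index, one row per position
    i = np.arange(d_model // 2)[None, :]              # dimension-pair index
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

# The table is added to the input embeddings (which share the dimension d_model)
# at the bottom of both the encoder and the decoder stacks.
```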
Reference
[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
[2] 口仆. Transformer 原理解析 (An analysis of how the Transformer works). https://zhuanlan.zhihu.com/p/135873679