Note 1: Transformer

Attention Is All You Need [1]


1. Encoder-Decoder

  • The encoder maps an input sequence of symbol representations (x_1,\ldots, x_n) to a sequence of continuous representations z = (z_1, \ldots, z_n).
  • Given z, the decoder then generates an output sequence (y_1, \ldots, y_m) of symbols one element at a time.
  • At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next (a schematic decoding loop follows this list).
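A minimal sketch of this auto-regressive generation loop, assuming hypothetical `encode` and `decode_step` callables that stand in for the encoder stack and a single decoder forward pass (neither name comes from the paper):

```python
# Greedy auto-regressive decoding loop (schematic).
def greedy_decode(encode, decode_step, src_tokens, bos_id, eos_id, max_len=50):
    z = encode(src_tokens)            # continuous representations z = (z_1, ..., z_n)
    ys = [bos_id]                     # previously generated symbols
    for _ in range(max_len):
        next_id = decode_step(z, ys)  # predict the next symbol from z and y_<t
        ys.append(next_id)
        if next_id == eos_id:
            break
    return ys[1:]

# Toy usage with dummy callables, just to exercise the loop.
toks = greedy_decode(encode=lambda s: s,
                     decode_step=lambda z, ys: 2 if len(ys) >= 3 else 1,
                     src_tokens=[5, 6, 7], bos_id=0, eos_id=2)
print(toks)  # [1, 1, 2]
```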
Overview of the Transformer
  • Encoder
    • It has 6 identical layers.
    • Each layer contains, in order, a multi-head self-attention sub-layer and a position-wise fully connected feed-forward sub-layer.
    • Each sub-layer is wrapped in a residual connection and followed by layer normalization.
    • The residual connection adds the sub-layer's input to its output, so each sub-layer produces LayerNorm(x + F(x)), where F is the sub-layer function (see the Add & Norm sketch after the decoder list below).
  • Decoder
    • It has 6 identical layers.
    • Each layer has three sub-layers:
      • A masked multi-head self-attention sub-layer ensures that the prediction at time t can only depend on the known outputs at positions less than t.
      • A multi-head attention sub-layer over the encoder stack's output, which brings the source-side information into the decoder.
      • A position-wise fully connected feed-forward sub-layer.
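A minimal NumPy sketch of the Add & Norm pattern shared by all of these sub-layers; the helper names `layer_norm` and `add_and_norm` are illustrative, and the learnable gain/bias of layer normalization is omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last dimension to zero mean / unit variance
    (learnable gain and bias omitted in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer F, followed by layer norm:
    LayerNorm(x + F(x))."""
    return layer_norm(x + sublayer(x))

# Toy usage: the "sub-layer" here is just a linear map that keeps d_model.
d_model = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(5, d_model))       # 5 positions, d_model features each
W = rng.normal(size=(d_model, d_model))
out = add_and_norm(x, lambda h: h @ W)  # shape preserved: (5, d_model)
print(out.shape)
```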

2. Attention

  • An attention function maps a query and a set of key-value pairs to an output, where the output is a weighted sum of the values.
  • Scaled Dot-product Attention
    • Given a query Q \in R^{1 \times d_k}, a set of m keys K \in R^{m \times d_k} and m values V \in R^{m \times d_v}:
      Attention(Q, K, V)=softmax(\frac{QK^T}{\sqrt{d_k}})V
  • Multi-head Attention
    • Jointly attend to information from different representation subspaces at different positions.
    • Given n queries Q \in R^{n \times d_{model}}, keys K and values V:
      MultiHead(Q, K, V)=Concat({head}_1, \cdots, {head}_h)W^O
      where \ {head}_i=Attention(QW_i^Q, KW_i^K, VW_i^V)
      • W_i^Q \in R^{d_{model} \times d_k}, W_i^K \in R^{d_{model} \times d_k}, W_i^V \in R^{d_{model} \times d_v}, {head}_i \in R^{n \times d_v}, Concat(\cdot) \in R^{n \times hd_v}, W^O \in R^{hd_v \times d_{model}}.
      • The Q and MultiHead(Q,K,V) have the same dimension R^{n \times d_{model}}.
      • First, it linearly projects the queries, keys and values h times with different learned projections, to d_k, d_k and d_v dimensions respectively.
      • Next, it applies scaled dot-product attention to each of the h projected sets in parallel and concatenates the resulting outputs.
      • At last, it projects the concatenated hd_v-dimensional vector back to d_{model} dimensions with W^O (a NumPy sketch follows at the end of this section).
        A worked numerical example is given in [2].
  • Attention in Transformer
    • Encoder's multi-head:
      • Q=K=V=the output of the previous encoder layer.
      • Each position in the encoder can attend to all positions in the previous layer of the encoder.
    • Decoder's masked multi-head:
      • Q=K=V=the output of the previous decoder layer, with a causal mask applied inside the attention.
      • For example, when predicting the t-th output token, all tokens after timestamp t have to be masked.
      • This prevents leftward information flow in the decoder in order to preserve the auto-regressive property.
      • It masks out (by setting them to -\infty) the values in the input of the softmax that correspond to illegal connections inside the scaled dot-product attention.
    • Decoder's multi-head:
      • Q=the output of the previous decoder layer, K=V=the encoder stack's output.
      • This allows every position in the decoder to attend over all positions in the input sequence.
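Putting the formulas above together, here is a minimal NumPy sketch of scaled dot-product attention with an optional mask, plus the multi-head wrapper. The function names and the list-of-matrices weight layout are illustrative choices, not the paper's API; masked positions are set to -\infty before the softmax, as described above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q: (n, d_k), K: (m, d_k), V: (m, d_v); mask: (n, m) boolean,
    True where attention is NOT allowed (those scores become -inf)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, m)
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores, axis=-1) @ V        # (n, d_v)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, mask=None):
    """W_q, W_k: lists of h (d_model, d_k) matrices; W_v: list of h (d_model, d_v);
    W_o: (h*d_v, d_model). Returns an (n, d_model) output."""
    heads = [scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv, mask)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy usage: decoder-style masked self-attention over 4 positions.
rng = np.random.default_rng(0)
n, d_model, h, d_k = 4, 8, 2, 4
x = rng.normal(size=(n, d_model))
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # block positions > t
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o, mask=causal_mask)
print(out.shape)  # (4, 8): same d_model as the input, as noted above
```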

3. Position-wise Feed-forward Networks

FFN(x)=max(0, xW_1+b_1)W_2+b_2

  • ReLU activation function: max(0, x).
  • It's applied to each position separately and identically; the parameters are shared across positions but differ from layer to layer (see the sketch below).
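A minimal NumPy sketch of this position-wise feed-forward network; the function name and the toy sizes are illustrative (the paper uses d_{model}=512 with an inner dimension of 2048):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at every position.
    x: (n, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Toy usage with reduced sizes.
rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 8, 32
x = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (4, 8)
```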

4. Positional Encoding

  • To make use of the order of the sequence.
  • Added to the input embeddings at the bottoms of the encoder and decoder stacks.
  • It has the same dimension d_{model} as the embeddings, so the two can be summed.
  • PE(pos, 2i)=sin(pos/10000^{2i / d_{model}})
    PE(pos, 2i+1)=cos(pos/10000^{2i / d_{model}})
    • pos is the position index: 1 \leq pos \leq n in the encoder and 1 \leq pos \leq m in the decoder.
    • i indexes the dimension pairs, so 2i and 2i+1 together cover all d_{model} dimensions (with 0-based dimensions, 0 \leq i < d_{model}/2); see the sketch below.
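A minimal NumPy sketch of these sinusoidal encodings, assuming 0-based indexing and an even d_{model}; the function name is illustrative:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encodings, shape (n_positions, d_model).
    Even dimensions (2i) use sin, odd dimensions (2i+1) use cos."""
    pos = np.arange(n_positions)[:, None]               # (n, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # (n, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 8)
print(pe.shape)  # (50, 8)
# The encoding is simply added to the token embeddings of the same shape.
```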

References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
[2] 口仆. Transformer 原理解析 (An analysis of how the Transformer works). https://zhuanlan.zhihu.com/p/135873679
