Classic Seq2Seq Papers

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

Sequence to Sequence Learning with Neural Networks

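Unlike the paper above, here the vector C that encodes the source sentence is used only to initialize the state of the decoder RNN, rather than being fed into the RNN cell at every decoding step. In addition, during training the decoder's input at each step is the ground-truth target token (teacher forcing), not the decoder's own output from the previous step. At inference time no target is available, so the previous prediction is fed back instead.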
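The two ways of conditioning the decoder on C, and the effect of teacher forcing, can be sketched with a toy decoder. This is only an illustration under stated assumptions: `rnn_step`, its fixed scalar weights, the scalar "tokens", and the zero initial state in the context-at-every-step variant are all hypothetical simplifications, not the architecture of either paper.

```python
import math

def rnn_step(state, inputs, w_state=0.5, w_in=0.3):
    """Toy RNN cell with hypothetical fixed scalar weights (not a trained model)."""
    total = sum(inputs)
    return [math.tanh(w_state * s + w_in * total) for s in state]

def decode(c, targets, context_every_step, teacher_forcing=True):
    """Run a toy decoder for len(targets) steps and return one output per step.

    context_every_step=False: C only initializes the decoder state.
    context_every_step=True:  C is also fed to the cell at every step.
    """
    state = [0.0] * len(c) if context_every_step else list(c)
    prev = 0.0  # stand-in for a start-of-sequence token
    outputs = []
    for y in targets:
        inputs = [prev] + (list(c) if context_every_step else [])
        state = rnn_step(state, inputs)
        out = sum(state) / len(state)  # toy scalar "prediction"
        outputs.append(out)
        # Teacher forcing feeds the ground-truth target into the next step;
        # otherwise the decoder consumes its own previous output.
        prev = y if teacher_forcing else out
    return outputs

c = [0.2, -0.4]             # hypothetical encoded source vector C
targets = [1.0, 0.5, -1.0]  # hypothetical target sequence

init_only = decode(c, targets, context_every_step=False)
every_step = decode(c, targets, context_every_step=True)
free_running = decode(c, targets, context_every_step=False, teacher_forcing=False)
```

With teacher forcing, the first step is identical to the free-running case (both start from the same start token); the sequences diverge from the second step on, once the fed-back input differs.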

Neural machine translation by jointly learning to align and translate

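Proposes additive attention (the score is computed by a single-hidden-layer feed-forward network that adds the projected decoder state and the projected encoder state); the encoder is a bidirectional RNN.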

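The first author is Dzmitry Bahdanau. This attention is implemented in TensorFlow 1.x as tf.contrib.seq2seq.BahdanauAttention (the tf.contrib module was removed in TensorFlow 2.x).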
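A minimal sketch of additive attention, written with the standard library only. The score for each encoder state h is taken as v · tanh(W_s s + W_h h), the scores are normalized with a softmax, and the context vector is the weighted sum of the encoder states. The names `W_s`, `W_h`, `v` and all numeric values are hypothetical, chosen just to make the example run; this is not the TensorFlow implementation.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_attention(s, H, W_s, W_h, v):
    """s: decoder state, H: list of encoder states. Returns (weights, context)."""
    proj_s = matvec(W_s, s)
    scores = []
    for h in H:
        proj_h = matvec(W_h, h)
        # Single hidden layer: add the two projections, apply tanh, project with v.
        hidden = [math.tanh(a + b) for a, b in zip(proj_s, proj_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(H[0])
    context = [sum(w * h[k] for w, h in zip(weights, H)) for k in range(dim)]
    return weights, context

# Hypothetical toy values: 2-d states, 3 source positions, 2 hidden units.
s = [0.5, -1.0]
H = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_s = [[0.1, 0.2], [0.3, -0.1]]
W_h = [[0.4, -0.2], [0.0, 0.5]]
v = [1.0, -1.0]

weights, context = additive_attention(s, H, W_s, W_h, v)
```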

On using very large target vocabulary for neural machine translation

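Multiplicative attention (the score is computed by multiplication: a dot product or a bilinear form between the decoder state and the encoder state) is also implemented in TensorFlow 1.x, as tf.contrib.seq2seq.LuongAttention. Note that it comes from Luong et al., "Effective Approaches to Attention-based Neural Machine Translation" (the next entry), not from this paper, whose own contribution is a sampling-based approximation of the softmax that keeps training tractable with a very large target vocabulary.

The context vector is still a weighted sum of the encoder states. The difference from additive attention lies in how the score, and hence the alignment weight a, is computed: a multiplication replaces the single-hidden-layer feed-forward network.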
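For comparison, a minimal sketch of multiplicative scoring using only the standard library. It shows two common variants: a plain dot product between the decoder state and each encoder state, and a bilinear "general" form with a weight matrix W in between. The function names and the toy values are hypothetical, and this is not the TensorFlow implementation.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def multiplicative_attention(h_t, H, W=None):
    """h_t: decoder state, H: list of encoder states.

    W=None uses the dot score h_t . h_s; otherwise the bilinear score h_t . (W h_s).
    Returns (weights, context).
    """
    if W is None:
        scores = [dot(h_t, h_s) for h_s in H]
    else:
        scores = [dot(h_t, matvec(W, h_s)) for h_s in H]
    weights = softmax(scores)
    dim = len(H[0])
    context = [sum(w * h[k] for w, h in zip(weights, H)) for k in range(dim)]
    return weights, context

# Hypothetical toy values.
h_t = [1.0, 0.0]
H = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

dot_weights, dot_context = multiplicative_attention(h_t, H)
gen_weights, gen_context = multiplicative_attention(h_t, H, W=identity)
```

With W set to the identity matrix the bilinear score reduces to the dot score, so both calls give the same weights. Compared with the additive form, the score needs no tanh layer and amounts to a matrix product when batched.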

Effective Approaches to Attention-based Neural Machine Translation

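Distinguishes global attention (attending to all source positions) from local attention (attending to a window of source positions).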

Attention is also divided into soft attention and hard attention.

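Soft attention assigns a probability distribution over the source positions, whereas hard attention assigns alignment probabilities that are either 0 or 1. Local attention is a hybrid of the soft and hard approaches: it first predicts an aligned position, then computes a soft-attention-style probability distribution within a window that extends D positions to either side of it.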
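The three weighting schemes can be contrasted in a short sketch that takes precomputed scores as input. Several details are assumptions made for illustration: hard attention is shown as a deterministic argmax (stochastic variants sample a position instead), the aligned position `p` is passed in rather than predicted by a network and is assumed to fall inside the source sentence, and the local weights follow the Gaussian-weighted form with sigma = D / 2, which is not renormalized and so does not sum to 1.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def soft_attention(scores):
    """A full probability distribution over all source positions."""
    return softmax(scores)

def hard_attention(scores):
    """One-hot weights: 1 at the best-scoring position, 0 elsewhere (argmax variant)."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

def local_attention(scores, p, D):
    """Soft weights inside the window [p - D, p + D], zero outside it."""
    n = len(scores)
    lo = max(0, math.ceil(p - D))
    hi = min(n - 1, math.floor(p + D))
    window = softmax(scores[lo:hi + 1])
    sigma = D / 2.0
    weights = [0.0] * n
    for offset, w in enumerate(window):
        s = lo + offset
        # Favor positions close to the predicted alignment point p.
        weights[s] = w * math.exp(-((s - p) ** 2) / (2 * sigma ** 2))
    return weights

# Hypothetical scores for six source positions.
scores = [0.1, 2.0, 0.3, 1.5, -0.5, 0.0]

soft = soft_attention(scores)
hard = hard_attention(scores)
local = local_attention(scores, p=3, D=1)
```

Here `soft` is nonzero everywhere, `hard` selects position 1 only, and `local` is nonzero only at positions 2 to 4, peaking at the predicted position 3.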