Sequence to Sequence Learning with Neural Networks
Abstract: The paper proposes a general end-to-end approach to sequence-to-sequence learning in which both the encoder and the decoder are multilayer LSTMs; the model achieves very strong results on machine translation.
Introduction / The Model: To handle variable-length inputs and variable-length outputs, LSTMs are used as the encoder and the decoder, yielding very good results.
Experiment: Two different LSTMs serve as encoder and decoder; the encoder compresses the source sentence into a fixed-length vector, and the decoder generates the target-language words one by one.
Related Work: Work related to the paper.
Conclusion: A summary of the paper and an outlook on future work.
I. Evaluation Metrics
1. Human evaluation: humans subjectively score the translations.
Pros: accurate.
Cons: slow and expensive.
2. Automatic evaluation: translations are scored automatically against a defined metric.
Pros: reasonably accurate, fast, and free.
Cons: may deviate somewhat from human judgments.
The BLEU metric
The problem with using only 1-grams: translating each word in isolation already yields a high score, ignoring the fluency of the sentence entirely.
Solution: combine multiple n-gram orders; BLEU uses 1- to 4-grams.
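To make the 1- to 4-gram combination concrete, here is a minimal single-reference BLEU sketch in Python. The function names, the epsilon used to avoid log(0), and the toy sentences are illustrative assumptions; real BLEU implementations additionally handle multiple references, smoothing, and corpus-level aggregation.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Illustrative single-reference BLEU: geometric mean of clipped
    1..4-gram precisions, multiplied by a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clipped counts: a candidate n-gram is credited at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Tiny epsilon avoids log(0) when an n-gram order has no match.
        log_precisions.append(math.log(max(overlap / total, 1e-9)))
    # Brevity penalty punishes candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(sum(log_precisions) / max_n)

# Translating every word in isolation scores well on 1-grams,
# but is punished once higher-order n-grams are included.
ref = "the cat sat on the mat".split()
print(bleu("the cat sat on the mat".split(), ref))  # close to 1.0
print(bleu("cat the mat on sat the".split(), ref))  # much lower
```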
Task setting: take a sequence as input and produce a sequence as output.
Basic idea: an encoder compresses the input sequence into a fixed-length vector, and a decoder generates the output from that vector.
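As a rough illustration of this encoder-decoder idea, the following PyTorch sketch compresses the source into the encoder LSTM's final state and conditions a decoder LSTM on it. The vocabulary sizes, embedding and hidden dimensions, and two-layer depth are made-up values; the paper's actual model is deeper and far larger and decodes with beam search.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: two separate LSTMs. The source sequence is
    compressed into the encoder's final (hidden, cell) state, which then
    conditions the decoder. Sizes below are illustrative, not the paper's."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden=512, layers=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, layers, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder: read the (possibly reversed) source, keep only its final state.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decoder: a conditional language model initialized with that state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab) logits

# Toy usage with made-up token ids.
model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (8, 15))   # batch of 8 source sentences, length 15
tgt = torch.randint(0, 1200, (8, 12))   # teacher-forced target inputs
logits = model(src, tgt)
print(logits.shape)  # torch.Size([8, 12, 1200])
```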
II. Paper Walkthrough
Abstract
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. Approach: In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Results: Our main result is that on an English to French translation task from the WMT’14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM’s BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous best result on this task. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier. (Reversed input shortens dependencies and makes optimization easier.)
1. DNNs achieve excellent results on many tasks, but they cannot handle the sequence-to-sequence setting.
2. We use multilayer LSTMs as the encoder and decoder and reach a BLEU score of 34.8 on WMT'14 English-to-French translation.
3. The LSTM also handles long sentences well; using the deep NMT model to rerank the outputs of a statistical machine translation system raises BLEU from 33.3 to 36.5.
4. The LSTM learns both local and global features well; finally, we find that feeding the source sentence in reverse order greatly improves translation quality, because it shortens the dependency distance between some source words and their target words.
1 Introduction
Deep Neural Networks (DNNs) are extremely powerful machine learning models that achieve excellent performance on difficult problems such as speech recognition [13, 7] and visual object recognition [19, 6, 21, 20]. DNNs are powerful because they can perform arbitrary parallel computation for a modest number of steps. A surprising example of the power of DNNs is their ability to sort N N-bit numbers using only 2 hidden layers of quadratic size [27]. So, while neural networks are related to conventional statistical models, they learn an intricate computation. Furthermore, large DNNs can be trained with supervised backpropagation whenever the labeled training set has enough information to specify the network’s parameters. Thus, if there exists a parameter setting of a large DNN that achieves good results (for example, because humans can solve the task very rapidly), supervised backpropagation will find these parameters and solve the problem.
Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality. It is a significant limitation, since many important problems are best expressed with sequences whose lengths are not known a-priori. For example, speech recognition and machine translation are sequential problems. Likewise, question answering can also be seen as mapping a sequence of words representing the question to a sequence of words representing the answer. It is therefore clear that a domain-independent method that learns to map sequences to sequences would be useful.
Sequences pose a challenge for DNNs because they require that the dimensionality of the inputs and outputs is known and fixed. In this paper, we show that a straightforward application of the Long Short-Term Memory (LSTM) architecture [16] can solve general sequence to sequence problems. The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain a large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector (fig. 1). The second LSTM is essentially a recurrent neural network language model [28, 23, 30] except that it is conditioned on the input sequence. The LSTM’s ability to successfully learn on data with long range temporal dependencies makes it a natural choice for this application due to the considerable time lag between the inputs and their corresponding outputs (fig. 1).
Related work: There have been a number of related attempts to address the general sequence to sequence learning problem with neural networks. Our approach is closely related to Kalchbrenner and Blunsom [18] who were the first to map the entire input sentence to vector, and is related to Cho et al. [5] although the latter was used only for rescoring hypotheses produced by a phrase-based system. Graves [10] introduced a novel differentiable attention mechanism that allows neural networks to focus on different parts of their input, and an elegant variant of this idea was successfully applied to machine translation by Bahdanau et al. [2]. The Connectionist Sequence Classification is another popular technique for mapping sequences to sequences with neural networks, but it assumes a monotonic alignment between the inputs and the outputs [11].
The main result of this work is the following. On the WMT’14 English to French translation task, we obtained a BLEU score of 34.81 by directly extracting translations from an ensemble of 5 deep LSTMs (with 384M parameters and 8,000 dimensional state each) using a simple left-to-right beam-search decoder. This is by far the best result achieved by direct translation with large neural networks. For comparison, the BLEU score of an SMT baseline on this dataset is 33.30 [29]. The 34.81 BLEU score was achieved by an LSTM with a vocabulary of 80k words, so the score was penalized whenever the reference translation contained a word not covered by these 80k. This result shows that a relatively unoptimized small-vocabulary neural network architecture which has much room for improvement outperforms a phrase-based SMT system.
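The "simple left-to-right beam-search decoder" can be sketched schematically as below. This is not the authors' implementation: `step_log_probs`, the BOS/EOS token ids, and the default beam width are assumptions for illustration, and the sketch omits refinements such as length normalization.

```python
def beam_search(step_log_probs, bos_id, eos_id, beam_size=12, max_len=50):
    """Schematic left-to-right beam search.

    step_log_probs(prefix) is assumed to return a dict {token_id: log_prob}
    giving next-token log-probabilities for the partial hypothesis `prefix`
    (e.g. from the decoder LSTM's softmax).
    """
    # Each hypothesis is (token_list, cumulative_log_prob).
    beam = [([bos_id], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            for tok, lp in step_log_probs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        # Keep only the best `beam_size` expansions.
        candidates.sort(key=lambda h: h[1], reverse=True)
        beam = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos_id:
                finished.append((prefix, score))   # hypothesis is complete
            else:
                beam.append((prefix, score))       # keep expanding it
        if not beam:
            break
    finished.extend(beam)
    # Return the token sequence of the highest-scoring hypothesis.
    return max(finished, key=lambda h: h[1])[0]
```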
Finally, we used the LSTM to rescore the publicly available 1000-best lists of the SMT baseline on the same task [29]. By doing so, we obtained a BLEU score of 36.5, which improves the baseline by 3.2 BLEU points and is close to the previous best published result on this task (which is 37.0 [9]).
Surprisingly, the LSTM did not suffer on very long sentences, despite the recent experience of other researchers with related architectures [26]. We were able to do well on long sentences because we reversed the order of words in the source sentence but not the target sentences in the training and test set. By doing so, we introduced many short term dependencies that made the optimization problem much simpler (see sec. 2 and 3.3). As a result, SGD could learn LSTMs that had no trouble with long sentences. The simple trick of reversing the words in the source sentence is one of the key technical contributions of this work.
A useful property of the LSTM is that it learns to map an input sentence of variable length into a fixed-dimensional vector representation. Given that translations tend to be paraphrases of the source sentences, the translation objective encourages the LSTM to find sentence representations that capture their meaning, as sentences with similar meanings are close to each other while different sentence meanings will be far. A qualitative evaluation supports this claim, showing that our model is aware of word order and is fairly invariant to the active and passive voice.
Deep neural networks are highly successful but struggle with sequence-to-sequence problems.
This paper uses a new Seq2Seq model structure to solve sequence-to-sequence problems, with LSTMs as both the encoder and the decoder.
Prior work has already addressed this problem in several ways, including Seq2Seq models and attention mechanisms.
The deep Seq2Seq model in this paper achieves very strong results on machine translation.
Model
Tricks:
1. Use different LSTMs for the encoder and the decoder.
2. Deep LSTMs work better than shallow LSTMs.
3. Feeding the source sentence in reverse order greatly improves translation quality (see the sketch after this list).
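A minimal sketch of trick 3 as a preprocessing step, assuming already-tokenized sentence pairs; only the source side is reversed, the target is left untouched.

```python
def make_training_pair(src_tokens, tgt_tokens):
    """Reverse only the source side, as in trick 3: 'a b c -> x y z'
    becomes 'c b a -> x y z', shortening the distance between the first
    source words and the first target words they correspond to."""
    return list(reversed(src_tokens)), list(tgt_tokens)

src, tgt = make_training_pair("I am a student".split(),
                              "je suis étudiant".split())
print(src)  # ['student', 'a', 'am', 'I']
print(tgt)  # ['je', 'suis', 'étudiant']
```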
Experimental results and analysis:
Rescoring with the help of a statistical machine translation system: the LSTM scores and reranks the hypotheses the SMT system produces.
We initialized all of the LSTM’s parameters with the uniform distribution between -0.08 and 0.08.
We used stochastic gradient descent without momentum, with a fixed learning rate of 0.7. After 5 epochs, we began halving the learning rate every half epoch. We trained our models for a total of 7.5 epochs.
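A hedged PyTorch sketch of the initialization and learning-rate schedule described in the two sentences above. Only the [-0.08, 0.08] range, the momentum-free SGD at learning rate 0.7, the halving after epoch 5, and the 7.5-epoch budget come from the paper; the stand-in model and the exact half-epoch at which each halving takes effect are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be the Seq2Seq module sketched earlier.
model = nn.LSTM(256, 512, 2)

# Uniform initialization in [-0.08, 0.08] for every parameter.
for p in model.parameters():
    nn.init.uniform_(p, -0.08, 0.08)

# Plain SGD (no momentum) starting at lr = 0.7.
optimizer = torch.optim.SGD(model.parameters(), lr=0.7)

def lr_at(half_epochs_done, base_lr=0.7):
    """Learning rate after a given number of completed half-epochs:
    constant for the first 5 epochs (10 half-epochs), then halved
    every additional half epoch."""
    halvings = max(0, half_epochs_done - 10)
    return base_lr * (0.5 ** halvings)

for half_epoch in range(1, 16):              # 15 half-epochs = 7.5 epochs
    for group in optimizer.param_groups:
        group["lr"] = lr_at(half_epoch)
    # ... run the training steps for this half-epoch ...
```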
We used batches of 128 sequences for the gradient and divided it by the size of the batch (namely, 128).
Although LSTMs tend to not suffer from the vanishing gradient problem, they can have exploding gradients. Thus we enforced a hard constraint on the norm of the gradient [10, 25] by scaling it when its norm exceeded a threshold. For each training batch, we compute s = ||g||2, where g is the gradient divided by 128. If s > 5, we set g = 5g/s.
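The clipping rule (s = ||g||2, and g := 5g/s whenever s > 5) can be written as the sketch below. The helper name and the assumption that the per-batch loss was summed (hence the division by 128) are illustrative; in PyTorch, `torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)` performs the same rescaling on an already-averaged gradient.

```python
import torch

def clip_gradient_norm(parameters, max_norm=5.0, batch_size=128):
    """Rescale the batch-averaged gradient when its L2 norm exceeds max_norm.
    Call after loss.backward()."""
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return
    # g is the summed gradient divided by the batch size (128).
    for g in grads:
        g.div_(batch_size)
    # s = ||g||_2 over all parameters.
    s = sum((g ** 2).sum() for g in grads).sqrt()
    if s > max_norm:
        # g = max_norm * g / s, i.e. g = 5g/s for the paper's threshold.
        for g in grads:
            g.mul_(max_norm / s)
```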
Different sentences have different lengths. Most sentences are short (e.g., length 20-30) but some sentences are long (e.g., length > 100), so a minibatch of 128 randomly chosen training sentences will have many short sentences and few long sentences, and as a result, much of the computation in the minibatch is wasted. To address this problem, we made sure that all sentences in a minibatch are roughly of the same length, yielding a 2x speedup. (In practice, sentences are sorted by length so that every minibatch contains sentences of similar lengths.)
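One common way to realize this bucketing, shown as a sketch (the paper's exact batching procedure may differ): sort the sentence pairs by length, cut them into batches of 128, and shuffle the batches.

```python
import random

def length_bucketed_batches(pairs, batch_size=128):
    """Group sentence pairs so each minibatch contains sentences of roughly
    the same length, reducing wasted padding computation.
    `pairs` is a list of (src_tokens, tgt_tokens)."""
    pairs = sorted(pairs, key=lambda p: len(p[0]))  # sort by source length
    batches = [pairs[i:i + batch_size]
               for i in range(0, len(pairs), batch_size)]
    random.shuffle(batches)  # keep some randomness across minibatches
    return batches
```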
Key points
• Verified the effectiveness of the Seq2Seq model for sequence-to-sequence tasks.
• Identified, through experiments, many tricks that improve translation quality.
• The deep NMT model.
Innovations
• Proposed a new neural machine translation model: the deep NMT model.
• Proposed several tricks that improve neural machine translation, such as multilayer LSTMs and reversed source input.
• Achieved very strong results on WMT'14 English-to-French translation.
Takeaways
• The Seq2Seq model uses one LSTM to extract features from the input sequence, reading one word per time step to produce a fixed-dimensional sentence representation; the decoder then uses another LSTM to generate the output sequence from that vector.
• Our experiments also support this conclusion: the sentence representations produced by the model capture word-order information and can recognize active and passive voice expressing the same meaning.