I. Two Key Papers
(1) Efficient Estimation of Word Representations in Vector Space
This paper mainly explains the underlying ideas, so it is the one this post walks through.
(2) Distributed Representations of Words and Phrases and their Compositionality
This paper mainly covers the mathematical details.
II. Structure of the Paper
0. Abstract: proposes two new, efficient architectures for computing word vectors and evaluates them on a word similarity task
1. Introduction: background on word vectors; goals of the paper; previous work
2. Model Architectures: LSA/LDA; feedforward neural networks; recurrent neural networks; parallel training of neural networks
3. New Log-linear Models: introduces the two new architectures, CBOW and Skip-gram
4. Results: description of the evaluation task; maximizing accuracy; comparison of model architectures; parallel training on large data; Microsoft Research Sentence Completion Challenge
5. Examples of the Learned Relationships: examples of the word-to-word relationships the vectors capture
6. Conclusion: high-quality word vectors; efficient training; used as pre-trained vectors they improve other NLP tasks
7. Follow-Up Work: single-machine C++ code release
III. Walkthrough of the Paper
0. Abstract
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
Summary:
1. Two novel model architectures are proposed for computing word vectors.
2. A word similarity task (covering both syntax and semantics) is used to evaluate and compare the quality of the resulting vectors.
3. Computational cost is reduced dramatically while the quality of the word vectors improves.
4. The vectors achieve state-of-the-art results on the syntactic and semantic test set.
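To make the two architectures concrete before diving into the paper, here is a minimal sketch using the gensim re-implementation of word2vec (not the authors' original tool); the toy corpus and hyperparameter values are illustrative assumptions, and the sg flag switches between CBOW and Skip-gram.

```python
# Minimal sketch: training the two architectures with gensim (assumes gensim >= 4).
# The tiny corpus and parameter values below are illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "walks", "in", "paris"],
    ["a", "woman", "walks", "in", "france"],
]

# CBOW: predict the current word from its context (sg=0)
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=20)

# Skip-gram: predict the context from the current word (sg=1)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=20)

print(cbow.wv["king"][:5])                      # first few dimensions of the learned vector
print(skipgram.wv.similarity("king", "queen"))  # cosine similarity between two words
```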
1. Introduction
Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity between words, as these are represented as indices in a vocabulary. This choice has several good reasons - simplicity, robustness and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data. An example is the popular N-gram model used for statistical language modeling - today, it is possible to train N-grams on virtually all available data (trillions of words [3]).
However, the simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited - the performance is usually dominated by the size of high quality transcribed speech data (often just millions of words). In machine translation, the existing corpora for many languages contain only a few billions of words or less. Thus, there are situations where simple scaling up of the basic techniques will not result in any significant progress, and we have to focus on more advanced techniques.
With progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data set, and they typically outperform the simple models. Probably the most successful concept is to use distributed representations of words [10]. For example, neural network based language models significantly outperform N-gram models [1, 27, 17].
1.1 Goals of the Paper
The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. As far as we know, none of the previously proposed architectures has been successfully trained on more than a few hundred of millions of words, with a modest dimensionality of the word vectors between 50 - 100.
We use recently proposed techniques for measuring the quality of the resulting vector representations, with the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity [20]. This has been observed earlier in the context of inflectional languages - for example, nouns can have multiple word endings, and if we search for similar words in a subspace of the original vector space, it is possible to find words that have similar endings [13, 14].
Somewhat surprisingly, it was found that similarity of word representations goes beyond simple syntactic regularities. Using a word offset technique where simple algebraic operations are performed on the word vectors, it was shown for example that vector(”King”) - vector(”Man”) + vector(”Woman”) results in a vector that is closest to the vector representation of the word Queen [20].
In this paper, we try to maximize accuracy of these vector operations by developing new model architectures that preserve the linear regularities among words. We design a new comprehensive test set for measuring both syntactic and semantic regularities, and show that many such regularities can be learned with high accuracy. Moreover, we discuss how training time and accuracy depends on the dimensionality of the word vectors and on the amount of the training data.
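The word-offset trick above (vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen")) is just vector arithmetic followed by a nearest-neighbour search under cosine similarity. The sketch below shows the mechanics; the tiny hand-built 3-dimensional "embeddings" are an illustrative assumption standing in for real learned vectors.

```python
# Self-contained sketch of the word-offset technique with toy vectors.
import numpy as np

emb = {                                # toy axes: [royalty, gender, plural]
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "kings": np.array([1.0,  1.0, 1.0]),
}

def analogy(a, b, c, emb):
    """Return the word whose vector is closest (cosine) to emb[b] - emb[a] + emb[c]."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):             # exclude the query words themselves
            continue
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target) + 1e-9)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("man", "king", "woman", emb))   # -> "queen"
```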
1.2 Previous Work
Representation of words as continuous vectors has a long history [10, 26, 8]. A very popular model architecture for estimating neural network language model (NNLM) was proposed in [1], where a feedforward neural network with a linear projection layer and a non-linear hidden layer was used to learn jointly the word vector representation and a statistical language model. This work has been followed by many others.
Another interesting architecture of NNLM was presented in [13, 14], where the word vectors are first learned using neural network with a single hidden layer. The word vectors are then used to train the NNLM. Thus, the word vectors are learned even without constructing the full NNLM. In this work, we directly extend this architecture, and focus just on the first step where the word vectors are learned using a simple model.
It was later shown that the word vectors can be used to significantly improve and simplify many NLP applications [4, 5, 29]. Estimation of the word vectors itself was performed using different model architectures and trained on various corpora [4, 29, 23, 19, 9], and some of the resulting word vectors were made available for future research and comparison. However, as far as we know, these architectures were significantly more computationally expensive for training than the one proposed in [13], with the exception of certain version of log-bilinear model where diagonal weight matrices are used [23].
2 Model Architectures
Many different types of models were proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). In this paper, we focus on distributed representations of words learned by neural networks, as it was previously shown that they perform significantly better than LSA for preserving linear regularities among words [20, 31]; LDA moreover becomes computationally very expensive on large data sets.
Similar to [18], to compare different model architectures we define first the computational complexity of a model as the number of parameters that need to be accessed to fully train the model. Next, we will try to maximize the accuracy, while minimizing the computational complexity.
For all the following models, the training complexity is proportional to
O = E × T × Q, (1)
where E is number of the training epochs, T is the number of the words in the training set and Q is defined further for each model architecture. Common choice is E = 3 − 50 and T up to one billion. All models are trained using stochastic gradient descent and backpropagation [26].
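As a quick sanity check on Eq. (1), the snippet below multiplies out representative values; E and T are taken from the ranges the paper states, while the placeholder value of Q is an assumption that the next subsections make concrete.

```python
# Eq. (1): total training cost is proportional to O = E * T * Q.
E = 3                 # training epochs (paper: typically 3-50)
T = 1_000_000_000     # words in the training set (paper: up to one billion)
Q = 500_000           # per-example cost; placeholder, defined per architecture below
O = E * T * Q
print(f"O = E * T * Q = {O:.2e} parameter accesses")
```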
2.1 Feedforward Neural Net Language Model (NNLM)
The probabilistic feedforward neural network language model has been proposed in [1]. It consists of input, projection, hidden and output layers. At the input layer, N previous words are encoded using 1-of-V coding, where V is size of the vocabulary. The input layer is then projected to a projection layer P that has dimensionality N × D, using a shared projection matrix. As only N inputs are active at any given time, composition of the projection layer is a relatively cheap operation.
The NNLM architecture becomes complex for computation between the projection and the hidden layer, as values in the projection layer are dense. For a common choice of N = 10, the size of the projection layer (P) might be 500 to 2000, while the hidden layer size H is typically 500 to 1000 units. Moreover, the hidden layer is used to compute probability distribution over all the words in the vocabulary, resulting in an output layer with dimensionality V. Thus, the computational complexity per each training example is
Q = N × D + N × D × H + H × V, (2)
where the dominating term is H × V. However, several practical solutions were proposed for avoiding it; either using hierarchical versions of the softmax [25, 23, 18], or avoiding normalized models completely by using models that are not normalized during training [4, 9]. With binary tree representations of the vocabulary, the number of output units that need to be evaluated can go down to around log2(V). Thus, most of the complexity is caused by the term N × D × H.
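Plugging the typical sizes quoted above into Eq. (2) shows why H × V dominates and what a binary-tree output layer buys; the concrete values of N, D, H and V below are assumptions chosen within the paper's stated ranges.

```python
# Eq. (2) with N = 10, a projection of size N*D in the 500-2000 range (D = 100 here),
# H = 500, and a vocabulary of V = 1,000,000 words. Values are illustrative assumptions.
import math

N, D, H, V = 10, 100, 500, 1_000_000

q_full = N * D + N * D * H + H * V               # full softmax: H*V dominates
q_hier = N * D + N * D * H + H * math.log2(V)    # tree-structured output: H*V -> ~H*log2(V)

print(f"full softmax:         Q = {q_full:,.0f}")   # ~5.0e8, dominated by H*V
print(f"hierarchical softmax: Q = {q_hier:,.0f}")   # ~5.1e5, dominated by N*D*H
```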
In our models, we use hierarchical softmax where the vocabulary is represented as a Huffman binary tree. This follows previous observations that the frequency of words works well for obtaining classes in neural net language models [16]. Huffman trees assign short binary codes to frequent words, and this further reduces the number of output units that need to be evaluated: while balanced binary tree would require log2(V) outputs to be evaluated, the Huffman tree based hierarchical softmax requires only about log2(Unigram perplexity(V)). For example when the vocabulary size is one million words, this results in about two times speedup in evaluation. While this is not crucial speedup for neural network LMs as the computational bottleneck is in the N × D × H term, we will later propose architectures that do not have hidden layers and thus depend heavily on the efficiency of the softmax normalization.
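The Huffman claim can be checked directly: the sketch below builds Huffman codes over a toy Zipf-like word-frequency distribution (an assumption, not data from the paper) and compares the expected code length, i.e. the expected number of output nodes evaluated per word, against the depth of a balanced tree.

```python
# Huffman codes over toy Zipf-like frequencies: frequent words get short codes,
# so the expected code length falls below log2(V).
import heapq, math

freqs = {f"w{i}": 1.0 / (i + 1) for i in range(1024)}   # toy Zipfian counts, V = 1024
total = sum(freqs.values())

# Standard Huffman construction: repeatedly merge the two lightest subtrees and
# prepend one bit to every code in each merged subtree.
heap = [(f, i, {w: ""}) for i, (w, f) in enumerate(freqs.items())]
heapq.heapify(heap)
next_id = len(heap)
while len(heap) > 1:
    f1, _, c1 = heapq.heappop(heap)
    f2, _, c2 = heapq.heappop(heap)
    merged = {w: "0" + code for w, code in c1.items()}
    merged.update({w: "1" + code for w, code in c2.items()})
    heapq.heappush(heap, (f1 + f2, next_id, merged))
    next_id += 1
codes = heap[0][2]

expected_len = sum(freqs[w] / total * len(code) for w, code in codes.items())
print(f"balanced tree depth  : {math.log2(len(freqs)):.2f}")   # 10.00 for V = 1024
print(f"expected Huffman code: {expected_len:.2f}")            # shorter: frequent words dominate
```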
2.2 Recurrent Neural Net Language Model (RNNLM)
Recurrent neural network based language model has been proposed to overcome certain limitations of the feedforward NNLM, such as the need to specify the context length (the order of the model N), and because theoretically RNNs can efficiently represent more complex patterns than the shallow neural networks [15, 2]. The RNN model does not have a projection layer; only input, hidden and output layer. What is special for this type of model is the recurrent matrix that connects hidden layer to itself, using time-delayed connections. This allows the recurrent model to form some kind of short term memory, as information from the past can be represented by the hidden layer state that gets updated based on the current input and the state of the hidden layer in the previous time step.
The complexity per training example of the RNN model is
Q = H × H + H × V, (3)
where the word representations D have the same dimensionality as the hidden layer H. Again, the term H × V can be efficiently reduced to H × log2(V) by using hierarchical softmax. Most of the complexity then comes from H × H.
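A minimal sketch of one RNNLM step helps tie the description above to Eq. (3): the recurrent matrix gives the H × H term and the (here unfactorized) output layer gives the H × V term. The sizes and the plain softmax are illustrative assumptions; the paper replaces the full softmax with a hierarchical one.

```python
# One step of a toy RNN language model. Sizes are illustrative assumptions.
import numpy as np

H, V = 500, 10_000                        # hidden size and (small) vocabulary size
U    = np.random.randn(H, V) * 0.01       # input weights: current word (1-of-V) -> hidden
W    = np.random.randn(H, H) * 0.01       # recurrent weights: the H*H term in Eq. (3)
Wout = np.random.randn(V, H) * 0.01       # output weights: the H*V term in Eq. (3)

def step(word_id, h_prev):
    """Consume one word and update the hidden state (the 'short term memory')."""
    x = np.zeros(V)
    x[word_id] = 1.0                      # 1-of-V encoding of the current word
    h = np.tanh(U @ x + W @ h_prev)       # time-delayed connection to the previous state
    p = np.exp(Wout @ h); p /= p.sum()    # full-softmax distribution over the next word
    return h, p

h = np.zeros(H)
for w_id in [3, 17, 42]:                  # toy word indices
    h, p = step(w_id, h)
print(p.shape, p.sum())                   # (10000,) 1.0
```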