Intro
The original Skip-gram model produces a vector for each individual word, but it cannot produce vectors for phrases.
Adding sub-sampling of frequent words speeds up training.
we present a simplified variant of Noise Contrastive Estimation (NCE) for training the Skip-gram model that results in faster training and better vector representations for frequent words, compared to more complex hierarchical softmax that was used in the prior work
Skip-gram model
Formula:
w_t is the center word and 2c is the window size (c words on each side).
The probability of each surrounding word given the center word is computed with a softmax.
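For reference, the Skip-gram training objective from the paper is the average log probability

$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$$

and the basic formulation defines $p(w_O \mid w_I)$ with a full softmax over the vocabulary of size $W$:

$$p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_w}^{\top} v_{w_I}\right)}$$

where $v_w$ and $v'_w$ are the "input" and "output" vector representations of $w$.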
Hierarchical Softmax
A binary tree with the vocabulary words as leaves is constructed to reduce computation: each prediction only needs to evaluate the nodes on one root-to-leaf path (about log_2(W) of them) instead of all W output vectors.
n(w, j) is the j-th node on the path from the root to w.
L(w) is the length of that path.
So n(w, 1) = root and n(w, L(w)) = w.
[[x]] is 1 if x is true and -1 otherwise.
\sigma is the sigmoid function, \sigma(x) = 1/(1+e^{-x}).
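Putting these definitions together, hierarchical softmax replaces the full softmax with

$$p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\!\Big( [[\, n(w, j{+}1) = \mathrm{ch}(n(w,j)) \,]] \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \Big)$$

where ch(n) is an arbitrary fixed child of node n, so the cost of one prediction is proportional to L(w_O) rather than W.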
Negative Sampling
Objective function
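Negative sampling (NEG) replaces every log p(w_O | w_I) term in the Skip-gram objective with

$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]$$

i.e. the task is to distinguish the observed word w_O from k words drawn from a noise distribution P_n(w) using logistic regression; the paper finds P_n(w) ∝ U(w)^{3/4} (the unigram distribution raised to the 3/4 power) works best.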
The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.
I don't fully understand this part yet.
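One concrete way to read "uses only samples": training only needs to be able to draw k negative words from P_n(w), never to evaluate its normalized probabilities. The sketch below builds such a sampler from token counts; the name make_noise_sampler and the interface are my own, not from the word2vec code.

```python
import random
from collections import Counter

def make_noise_sampler(tokens, power=0.75, rng=random):
    """Build a sampler for the noise distribution P_n(w) proportional to
    U(w)**0.75 (unigram counts raised to the 3/4 power, as in the paper).
    Illustrative sketch only; the real word2vec code precomputes a table."""
    counts = Counter(tokens)
    words = list(counts)
    weights = [counts[w] ** power for w in words]

    def draw_negatives(k):
        # k words sampled with probability proportional to count**0.75
        return rng.choices(words, weights=weights, k=k)

    return draw_negatives

# Usage: sampler = make_noise_sampler(corpus_tokens); negatives = sampler(5)
```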
Subsampling of frequent words
Words like "the" and "a" occur very frequently but carry little information; their vector representations do not change significantly after training on a few million extra examples, so spending updates on them is largely wasted.
Each word w_i in the training set is discarded with probability P(w_i).
f(w_i) is the word's frequency and t is a chosen threshold, typically around 10^-5.
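The discard probability is

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

so words with frequency below t are never discarded, and the chance of discarding grows with frequency.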
We chose this subsampling formula because it aggressively subsamples words whose frequency is greater than t while preserving the ranking of the frequencies
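As a small illustration of that formula (a sketch with made-up helper names, not the word2vec C implementation), the keep/discard decision can be written as:

```python
import random
from collections import Counter

def subsample(tokens, t=1e-5, rng=random.random):
    """Drop each occurrence of word w with probability
    P(w) = 1 - sqrt(t / f(w)), where f(w) is w's relative frequency.
    Illustrative sketch only."""
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total              # relative frequency of w
        p_discard = 1.0 - (t / f) ** 0.5   # <= 0 for words rarer than t
        if rng() >= p_discard:             # rare words are always kept
            kept.append(w)
    return kept
```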
Phrases made up of multiple words are found using the following score over word pairs:
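$$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$$

Bigrams whose score is above a chosen threshold are joined into a single token.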
Running 2-4 passes over the training data with a decreasing threshold allows phrases longer than two words to be formed.
The δ coefficient is a discounting term that prevents phrases from being formed out of two very infrequent words.
The score here is over bigrams (pairs of adjacent words).
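A minimal sketch of this bigram scoring (the function name bigram_scores and the delta default are my own choices for illustration, not from the paper or the released tool):

```python
from collections import Counter

def bigram_scores(tokens, delta=5.0):
    """score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj)).
    delta is the discounting coefficient that keeps two rare words
    from being merged into a phrase. Illustrative sketch only."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    return {pair: (c - delta) / (unigram[pair[0]] * unigram[pair[1]])
            for pair, c in bigram.items()}

# Pairs scoring above a threshold are merged into one token
# (e.g. "new york" -> "new_york"); repeating the pass 2-4 times
# lets longer phrases such as "new_york_times" form.
```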