1. Why Use word2vec?
2. How To Learn Word2Vec?
We are working on NLP project, it is interesting.
Adjacent words tend to be more similar to each other. For coreference (e.g. what "it" refers to), this may not hold.
3. CBOW Model (less commonly used: predict the center word from its surrounding words)
| We are working on NLP project, it is interesting |
| --- |
| We are _ on NLP project, it is interesting |
| We are working _ NLP project, it is interesting |
| We are working on _ project, it is interesting |
| We are working on NLP _, it is interesting |
| We are working on NLP project, _ is interesting |
4. Skip-Gram Model (commonly used: predict the surrounding words from the center word)
| Text: We are working on NLP project, it is interesting |
| --- |
| _ _ working _ _ project, it is interesting |
| We _ _ on _ _, it is interesting |
| We are _ _ NLP _, _ is interesting |
| We are working _ _ project, _ _ interesting |
| We are working on _ _, it _ _ |
Objective: in mathematical form, maximize the expression below.
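A standard way to write this objective (a sketch in the usual skip-gram notation, where θ collects all word vectors and context(w) is the window around w):

$$
\arg\max_{\theta}\ \prod_{w \in \text{Text}} \prod_{c \in \text{context}(w)} p(c \mid w;\theta)
\;=\;
\arg\max_{\theta}\ \sum_{w \in \text{Text}} \sum_{c \in \text{context}(w)} \log p(c \mid w;\theta)
$$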
4.1 Model Construction
Text = (today weather very nice today attend NLP course NLP is currently the most popular direction)
window_size = 1
w = center word
c = context word
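A minimal sketch of how the (w, c) training pairs are enumerated under these definitions (the Python below is illustrative; variable names such as `pairs` are not from the notes):

```python
# Enumerate (center word, context word) pairs for skip-gram training.
# window_size = 1 means only the immediate left/right neighbors count as context.
text = ("today weather very nice today attend NLP course "
        "NLP is currently the most popular direction").split()
window_size = 1

pairs = []
for i, w in enumerate(text):                      # w: center word
    for j in range(i - window_size, i + window_size + 1):
        if j != i and 0 <= j < len(text):         # skip the center word itself
            pairs.append((w, text[j]))            # (w, c): one training example

print(pairs[:4])
# [('today', 'weather'), ('weather', 'today'), ('weather', 'very'), ('very', 'weather')]
```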
The probability p(c | w) can be written with a softmax, as follows:
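A sketch of the standard form, where $v_w$ is the center-word vector of w, $u_c$ the context-word vector of c, and V the vocabulary:

$$
p(c \mid w;\theta) = \frac{\exp(u_c \cdot v_w)}{\sum_{c' \in V} \exp(u_{c'} \cdot v_w)}
$$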
4.2 Training Optimization: Negative Sampling (commonly used) and Hierarchical Softmax (less commonly used)
The parameters could now be fit with SGD, but the objective above is expensive: roughly len(Text) · window_size · O(|V|), because the softmax denominator sums over the entire vocabulary. A better training method is needed.
How can this probability be written down?
Answer: using the conditional-probability form of logistic regression (the sigmoid).
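A sketch of that form, where D = 1 means the pair (w, c) was observed in the corpus and D = 0 means it was sampled as a negative:

$$
p(D = 1 \mid w, c;\theta) = \sigma(u_c \cdot v_w) = \frac{1}{1 + \exp(-u_c \cdot v_w)},\qquad
p(D = 0 \mid w, c;\theta) = 1 - \sigma(u_c \cdot v_w)
$$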
4.3 Negative Sampling (sample only a subset of negative examples)
e.g.: I am a student
vocab = [I, am, a, student]
Example:
S = "I like NLP, it is interesting, but it is hard"
vocab = {I, like, NLP, it, is, interesting, but, hard}

| Positive samples | Negative samples |
| --- | --- |
| (NLP, like) | (NLP, I), (NLP, but) |
| (NLP, it) | (NLP, hard), (NLP, I) |
| (it, is) | (it, interesting), (it, hard) |
| (it, NLP) | (it, hard), (it, I) |
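A small sketch of how such positive/negative pairs could be generated in code (illustrative only: the variable names are made up, and negatives are drawn uniformly here, whereas word2vec samples from a smoothed unigram distribution):

```python
import random

vocab = ["I", "like", "NLP", "it", "is", "interesting", "but", "hard"]
positive_pairs = [("NLP", "like"), ("NLP", "it"), ("it", "is"), ("it", "NLP")]
num_negatives = 2   # negatives sampled per positive pair

random.seed(0)
training_examples = []
for w, c in positive_pairs:
    # Negatives: random vocabulary words that are not the center or true context word.
    # (word2vec actually samples from the unigram distribution raised to the 3/4 power.)
    negs = random.sample([v for v in vocab if v not in (w, c)], num_negatives)
    training_examples.append(((w, c), [(w, n) for n in negs]))

for pos, negs in training_examples:
    print("positive:", pos, "negatives:", negs)
```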
Gradient descent is then applied to the following objective:
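A sketch of the standard SGNS objective, where D₊ is the set of positive pairs and N(w) the sampled negatives for the center word w:

$$
\max_{\theta} \sum_{(w,c) \in D_{+}} \Big[ \log \sigma(u_c \cdot v_w) + \sum_{c' \in N(w)} \log \sigma(-u_{c'} \cdot v_w) \Big]
$$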
C++ code: https://github.com/dav/word2vec (worth reading).
| SG with Negative Sampling |
| --- |
| for each (w, c) in the positive sample set |
| negative set: for the center word w, sample negatives |
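A minimal NumPy sketch of one SGNS update for a positive pair (w, c) and its sampled negatives (the hyperparameters, initialization, and function names are assumptions for illustration, not the reference implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup: vocabulary size V, embedding dimension d (illustrative values).
V, d, lr = 8, 5, 0.025
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, d))   # v_w: center-word vectors
W_out = rng.normal(scale=0.1, size=(V, d))  # u_c: context-word vectors

def sgns_update(w_idx, c_idx, neg_idx):
    """One SGD ascent step on log sigma(u_c.v_w) + sum_n log sigma(-u_n.v_w)."""
    v_w = W_in[w_idx]
    grad_v = np.zeros(d)
    # The true context word gets label 1, each sampled negative gets label 0.
    for idx, label in [(c_idx, 1.0)] + [(n, 0.0) for n in neg_idx]:
        u = W_out[idx]
        g = label - sigmoid(u @ v_w)   # gradient of the log-likelihood wrt the score
        grad_v += g * u
        W_out[idx] += lr * g * v_w     # update the context-word vector
    W_in[w_idx] += lr * grad_v         # update the center-word vector

# Example indices: center word "NLP", true context "like", negatives "I" and "but".
sgns_update(2, 1, [0, 6])
```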
5. Word Vector Evaluation
- Visualization (e.g. t-SNE)
- Similarity
- Analogy (e.g. woman - man = girl - boy)
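For example, these checks could be run with gensim's Word2Vec (the toy corpus and hyperparameters are placeholders; a real corpus is needed for meaningful similarities and analogies):

```python
from gensim.models import Word2Vec

# Toy corpus: in practice this would be a large tokenized text collection.
sentences = [["we", "are", "working", "on", "nlp", "project"],
             ["nlp", "is", "interesting"],
             ["the", "woman", "and", "the", "man"],
             ["the", "girl", "and", "the", "boy"]]

# sg=1 with negative > 0 trains skip-gram with negative sampling.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, negative=5, epochs=50, seed=1)

# Similarity: cosine similarity between two word vectors.
print(model.wv.similarity("nlp", "project"))

# Analogy: woman - man + boy should be close to girl (needs a real corpus to work).
print(model.wv.most_similar(positive=["woman", "boy"], negative=["man"], topn=3))
```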
6. What Are the Drawbacks of Skip-Gram in word2vec?
- Limited window size (remedy: language models (RNN/LSTM))
- Does not capture global information (remedy: global models such as matrix factorization (MF))
- Cannot effectively learn vectors for low-frequency words (remedy: subword embeddings)
- Out-of-vocabulary (OOV) words (remedy: subword embeddings)
- Ignores context (remedy: context-aware embeddings (ELMo/BERT))
- No real notion of word order (remedy: language models (RNN/LSTM/Transformer))
- Limited interpretability (remedy: non-Euclidean embedding spaces)
7. Word Representation Taxonomy Diagram