Abstract: Existing word embedding methods are based on linear (window-based) contexts. This paper generalizes the skip-gram model with negative sampling introduced by Mikolov et al. to include arbitrary contexts.
The dependency-based embeddings are less topical and exhibit more functional similarity than the original skip-gram embeddings.
Introduction
Earlier approaches: representing words as distinct, atomic symbols suffers from poor generalization.
This paper seeks a representation that captures semantic and syntactic similarities between words.
Previous work builds on the distributional hypothesis (Harris, 1954). On one end of the spectrum, words are grouped into clusters based on their contexts (Brown et al., 1992; Uszkoreit and Brants, 2008); on the other end, words are represented as high-dimensional but sparse vectors. For some tasks the dimensionality and sparsity of these vectors are reduced, e.g. with SVD or LDA.
More recently, word representations are learned with neural network language models; these representations are referred to as "neural embeddings" or "word embeddings".
The state-of-the-art word embedding method is the skip-gram model with negative sampling (SKIPGRAM), implemented in the word2vec software.
In this paper, the skip-gram model is generalized: the linear contexts are replaced with arbitrary word contexts.
We experiment with syntactic contexts that are derived from automatically produced dependency parse-trees.
The Skip-Gram Model:
Definitions:
Negative sampling: the negative-sampling objective assumes a dataset D of observed pairs (w, c) of words w and contexts c.
For a word-context pair (w, c), consider whether the pair came from D: p(D=1|w,c) is the probability that (w,c) came from the data, and p(D=0|w,c) = 1 - p(D=1|w,c) is the probability that it did not. The distribution is modeled as p(D=1|w,c) = 1 / (1 + e^{-v_w · v_c}).
v_w and v_c are the parameters the model learns. The model maximizes the log-probability of the observed pairs: argmax Σ_{(w,c)∈D} log p(D=1|w,c).
This objective has a trivial solution: p(D=1|w,c) = 1 can be achieved for every pair by setting v_c = v_w and v_c · v_w = K, where K is a sufficiently large number. To prevent this, the objective is extended with (w,c) pairs for which p(D=1|w,c) must be low, i.e. pairs that do not appear in D: a dataset D' of (presumably incorrect) (w,c) pairs is constructed. The negative-sampling training objective becomes: argmax Σ_{(w,c)∈D} log p(D=1|w,c) + Σ_{(w,c)∈D'} log p(D=0|w,c).
The negative samples D' can be constructed in different ways. Mikolov et al.'s approach: for each (w,c) ∈ D, construct n samples (w,c1), ..., (w,cn), where n is a hyperparameter and each cj is drawn according to its unigram distribution raised to the 3/4 power.
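A minimal sketch of this sampling scheme (helper names are illustrative, not the word2vec implementation):

```python
import random
from collections import Counter

def make_negative_sampler(all_contexts, power=0.75):
    """Sampler drawing contexts with probability proportional to count(c) ** power."""
    counts = Counter(all_contexts)
    contexts = list(counts)
    weights = [counts[c] ** power for c in contexts]
    return lambda n: random.choices(contexts, weights=weights, k=n)

# For each observed pair (w, c) in D, draw n negative pairs (w, c_1), ..., (w, c_n).
sample_negatives = make_negative_sampler(["dog", "cat", "dog", "the", "the", "the"])
w = "barks"
negative_pairs = [(w, c_neg) for c_neg in sample_negatives(5)]
```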
Optimizing this objective makes observed word-context pairs have similar embeddings, while scattering unobserved pairs. Intuitively, words that appear in similar contexts should have similar embeddings, though the paper notes there is as yet no formal proof that SKIPGRAM does indeed maximize the dot product of similar words.
Embedding with Arbitrary Contexts:
In the SKIPGRAM embedding algorithm, the contexts of a word w are the words surrounding it, so the context vocabulary C is identical to the word vocabulary W. However, contexts need not correspond to words, and the number of context types can be substantially larger than the number of words. SKIPGRAM is generalized by replacing the bag-of-words contexts with arbitrary contexts.
This paper experiments with dependency-based syntactic contexts.
1. Linear Bag-of-Words Contexts: with a window of size k around the target word w, 2k contexts are produced; for k=2, the contexts of w are w-2, w-1, w+1, w+2. Such linear contexts may miss important information. A window size of 5 is commonly used to capture broad topical content, whereas smaller windows contain more focused information about the target word. (A small extraction sketch covering both context types follows after item 2.)
2. Dependency-Based Contexts: each sentence is first parsed; word contexts are then derived as follows: for a target word w with modifiers m1, ..., mk and a head h, the contexts of w are (m1, lbl1), ..., (mk, lblk), (h, lbl_h^{-1}).
lbl is the type of the dependency relation between the head and the modifier (e.g. nsubj, dobj, prep_with, amod), and lbl^{-1} is used to mark the inverse relation.
Relations that include a preposition are "collapsed" prior to context extraction, by directly connecting the head and the object of the preposition.
Syntactic dependencies capture relations between words that are far apart in the sentence, and filter out "coincidental" contexts that fall within the window but are not directly related to the target word.
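To make the two schemes concrete, here is a minimal sketch of both extraction procedures on the paper's example sentence; the parse is hand-specified (a real pipeline would take it from a dependency parser) and the helper names are illustrative:

```python
# Hand-specified parse of the example sentence
# "australian scientist discovers star with telescope".
# The preposition is already collapsed: "with" is folded into the prep_with relation.
tokens = ["australian", "scientist", "discovers", "star", "with", "telescope"]
arcs = [(0, 1, "amod"),        # australian <-amod- scientist
        (1, 2, "nsubj"),       # scientist <-nsubj- discovers
        (3, 2, "dobj"),        # star <-dobj- discovers
        (5, 2, "prep_with")]   # telescope <-prep_with- discovers (collapsed)

def linear_contexts(tokens, i, k=2):
    """Bag-of-words contexts: the k tokens on each side of position i."""
    lo, hi = max(0, i - k), min(len(tokens), i + k + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def dependency_contexts(tokens, arcs, i):
    """Dependency contexts of token i: modifiers as word/lbl, the head as word/lbl-1."""
    ctxs = []
    for mod, head, lbl in arcs:
        if head == i:
            ctxs.append(f"{tokens[mod]}/{lbl}")
        elif mod == i:
            ctxs.append(f"{tokens[head]}/{lbl}-1")
    return ctxs

i = tokens.index("discovers")
print(linear_contexts(tokens, i))           # ['australian', 'scientist', 'star', 'with']
print(dependency_contexts(tokens, arcs, i)) # ['scientist/nsubj', 'star/dobj', 'telescope/prep_with']
print(dependency_contexts(tokens, arcs, tokens.index("telescope")))  # ['discovers/prep_with-1']
```

Note how the k=2 window picks up the uninformative "with" but misses "telescope", while the dependency contexts recover the telescope relation directly.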
Experiments:
Three context settings are compared: bag-of-words contexts with k=5 (BoW5), bag-of-words contexts with k=2 (BoW2), and dependency-based syntactic contexts (DEPS).
The authors modified word2vec to support arbitrary contexts, and to output the context embeddings in addition to the word embeddings.
The negative-sampling parameter (how many negative contexts to sample for every correct one) was 15.
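This is not the authors' modified word2vec code, but a minimal numpy sketch of the SGNS update over arbitrary (word, context) pairs (toy data; uniform negative sampling is used here for brevity instead of the smoothed unigram distribution):

```python
import numpy as np

# Toy (word, context) pairs, e.g. produced by the dependency extraction sketch above.
observed_pairs = [("discovers", "scientist/nsubj"),
                  ("discovers", "star/dobj"),
                  ("telescope", "discovers/prep_with-1")]
word_vocab = sorted({w for w, _ in observed_pairs})
context_vocab = sorted({c for _, c in observed_pairs})

rng = np.random.default_rng(0)
dim, n_neg = 50, 15                                      # 15 negatives per observed pair
W = {w: rng.normal(0, 0.1, dim) for w in word_vocab}     # word embeddings
C = {c: rng.normal(0, 0.1, dim) for c in context_vocab}  # context embeddings

def sgns_step(w_vec, c_vec, label, lr=0.025):
    """One SGNS update; label is 1 for an observed pair, 0 for a sampled negative."""
    score = 1.0 / (1.0 + np.exp(-np.dot(w_vec, c_vec)))  # sigma(w . c)
    g = lr * (label - score)
    w_grad = g * c_vec          # gradient for the word, using the old context vector
    c_vec += g * w_vec          # update the context vector in place
    w_vec += w_grad             # then update the word vector in place

for epoch in range(10):
    for w, c in observed_pairs:
        sgns_step(W[w], C[c], label=1)
        for c_neg in rng.choice(context_vocab, size=n_neg):  # uniform here; word2vec uses unigram^0.75
            sgns_step(W[w], C[c_neg], label=0)
```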
!!!For DEPS, the corpus was tagged with parts-of-speech using the Stanford tagger (Toutanova et al., 2003) and parsed into labeled Stanford dependencies (de Marneffe and Manning, 2008) using an implementation of the parser described in Goldberg and Nivre (2012). All tokens were converted to lowercase, and words and contexts that appeared less than 100 times were filtered.
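A small sketch of the frequency filtering step (function and variable names are illustrative, not the actual preprocessing code):

```python
from collections import Counter

MIN_COUNT = 100  # words and contexts appearing fewer than 100 times are discarded

def filter_rare(pairs, min_count=MIN_COUNT):
    """Drop (word, context) pairs whose word or context is too rare."""
    word_counts = Counter(w for w, _ in pairs)
    ctx_counts = Counter(c for _, c in pairs)
    return [(w, c) for w, c in pairs
            if word_counts[w] >= min_count and ctx_counts[c] >= min_count]
```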
Qualitative Evaluation: for each target word, the 5 most similar words (by cosine similarity of the embeddings) are listed and compared across the three settings.
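A minimal sketch of this nearest-neighbour lookup (the embedding dictionary and function name are illustrative):

```python
import numpy as np

def most_similar(word, emb, k=5):
    """The k nearest neighbours of `word` by cosine similarity.
    `emb` maps words to 1-D numpy vectors, e.g. the W dictionary from the training sketch."""
    v = emb[word] / np.linalg.norm(emb[word])
    scores = {w: float(np.dot(v, u) / np.linalg.norm(u))
              for w, u in emb.items() if w != word}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```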
Quantitative Evaluation: the WordSim353 dataset and the Chiarello et al. dataset; word pairs are ranked by the cosine similarity of their embeddings.
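A sketch of this ranking-style protocol, assuming the pairs carry similar/related labels as described in the paper (names are illustrative):

```python
import numpy as np

def rank_pairs(pairs, emb):
    """Rank labelled word pairs by cosine similarity of their embeddings.
    `pairs` is a list of (w1, w2, label) with label in {"similar", "related"};
    an embedding tuned for functional similarity should push "similar" pairs to the top."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(cos(emb[w1], emb[w2]), w1, w2, label)
              for w1, w2, label in pairs if w1 in emb and w2 in emb]
    return sorted(scored, reverse=True)
```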