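1 Word2vec (gensim)
1.1 Introduction to Word2vec
word2vec is a tool that converts words into vectors. It reduces the processing of text content to vector operations in a vector space, so that similarity computed between vectors can stand in for semantic similarity between texts.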
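As a minimal sketch of what "similarity in vector space" means, the snippet below computes cosine similarity, the measure gensim's similarity queries use. The three-dimensional vectors and the helper name `cosine_similarity` are made up for illustration; real word2vec vectors usually have tens to hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "word vectors", chosen by hand for illustration.
king = [0.9, 0.1, 0.3]
queen = [0.8, 0.2, 0.35]
apple = [-0.2, 0.9, -0.5]

print(cosine_similarity(king, queen))  # close to 1: the vectors point the same way
print(cosine_similarity(king, apple))  # negative: the vectors point apart
```

A value near 1 means the two vectors point in almost the same direction, which is how a trained model signals that two words appear in similar contexts.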
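1.2 Installing and using gensim (word2vec)
1.2.1 Installing gensim
The gensim toolkit has the following requirements (these are the minimums for the older gensim release this walkthrough was written against; current releases require newer versions):
Python >= 2.6
NumPy >= 1.3
SciPy >= 0.7
Open Anaconda Prompt and enter: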
pip install gensim
The installation has succeeded if pip finishes with a message beginning "Successfully installed".
1.2.2 Using gensim word2vec
The Word2Vec class in gensim:
Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False)
This is the signature from gensim releases before 4.0. In gensim 4.0 and later, size was renamed to vector_size and iter was renamed to epochs.
Word2Vec parameters:
sg defines the training algorithm. By default (sg=0), CBOW is used. Otherwise (sg=1), skip-gram is employed.
size is the dimensionality of the feature vectors.
window is the maximum distance between the current and predicted word within a sentence.
alpha is the initial learning rate (will linearly drop to min_alpha as training progresses).
seed = for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization.)
min_count = ignore all words with total frequency lower than this.
max_vocab_size = limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit (default).
sample = threshold for configuring which higher-frequency words are randomly downsampled; default is 1e-3, useful range is (0, 1e-5).
workers = use this many worker threads to train the model (=faster training with multicore machines).
hs = if 1, hierarchical softmax will be used for model training. If set to 0 (default), and negative is non-zero, negative sampling will be used.
negative = if > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn (usually between 5-20). Default is 5. If set to 0, no negative sampling is used.
cbow_mean = if 0, use the sum of the context word vectors. If 1 (default), use the mean. Only applies when CBOW is used.
hashfxn = hash function to use to randomly initialize weights, for increased training reproducibility. Default is Python's rudimentary built-in hash function.
iter = number of iterations (epochs) over the corpus. Default is 5.
trim_rule = vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
sorted_vocab = if 1 (default), sort the vocabulary by descending frequency before assigning word indexes.
batch_words = target size (in words) for batches of examples passed to worker threads (and thus cython routines). Default is 10000. (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
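To make the effect of min_count concrete, here is a minimal plain-Python sketch of frequency-based vocabulary pruning. It is not gensim's implementation; the helper name `prune_vocab` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def prune_vocab(sentences, min_count=5):
    """Count words across all sentences and keep those seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# Hypothetical toy corpus: each sentence is a list of tokens.
toy_corpus = [
    ["the", "term", "is", "used"],
    ["the", "word", "is", "derived"],
    ["the", "term"],
]

print(prune_vocab(toy_corpus, min_count=2))  # keeps "the", "term" and "is"
print(prune_vocab(toy_corpus, min_count=1))  # keeps every word
```

On a very small corpus the default min_count=5 would discard most of the vocabulary, which is one reason to lower it to 1 when experimenting with short texts.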
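Preparing the corpus:
Either Chinese or English text will do. The text usually needs preprocessing before use: segment it into words, separated by spaces or tabs.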
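The expected file layout can be sketched as follows: one sentence per line, tokens separated by whitespace. The token lists below are hypothetical stand-ins for the output of a tokenizer or a Chinese word-segmentation tool, and the file is written to a temporary directory rather than to data/Corpus.txt.

```python
import os
import tempfile

# Hypothetical pre-tokenized sentences; a real pipeline would produce these
# with a tokenizer or a Chinese word-segmentation tool.
tokenized = [
    ["经典", "教程", "转载"],
    ["anarchism", "originated", "as", "a", "term"],
]

corpus_path = os.path.join(tempfile.mkdtemp(), "Corpus.txt")

# One sentence per line, tokens separated by a single space.
with open(corpus_path, "w", encoding="utf-8") as f:
    for tokens in tokenized:
        f.write(" ".join(tokens) + "\n")

# Reading it back: splitting each line on whitespace recovers the tokens.
with open(corpus_path, encoding="utf-8") as f:
    recovered = [line.split() for line in f]

print(recovered == tokenized)  # True
```

Because the reader splits on any whitespace, spaces and tabs are interchangeable as separators.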
Import the packages:
import gensim.models as g
from gensim.models.word2vec import LineSentence
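'''The input to Word2Vec is a LineSentence iterator: the raw training corpus must be turned into an iterator over sentences, where each iteration returns one sentence as a list of words (UTF-8 strings). This iterator is then passed in to construct gensim's built-in word2vec model object.
'''
# data/Corpus.txt is the input file (size= is the pre-4.0 keyword; gensim 4.0+ uses vector_size=)
model = g.Word2Vec(LineSentence('data/Corpus.txt'), size=100, window=1, min_count=1)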
This completes the training of a word2vec model. The other parameters can be adjusted as needed.
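LineSentence covers the one-sentence-per-line case. When extra preprocessing is needed, the same idea can be sketched with a small class that can be iterated repeatedly; this matters because Word2Vec reads the corpus more than once (a vocabulary pass, then the training epochs), so a one-shot generator is not enough. The class name `CorpusSentences` and the demo file are assumptions made for illustration.

```python
import os
import tempfile

class CorpusSentences:
    """Yield one list of tokens per non-empty line; can be iterated repeatedly."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so each pass starts from the top.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                if tokens:
                    yield tokens

# Hypothetical two-line corpus written to a temporary file for the demo.
demo_path = os.path.join(tempfile.mkdtemp(), "Corpus.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("经典 教程 转载\n数据 准备\n")

sentences = CorpusSentences(demo_path)
first_pass = list(sentences)
second_pass = list(sentences)
print(first_pass)
print(first_pass == second_pass)  # True: the iterator restarts cleanly
```

An instance of such a class can be passed wherever the examples here pass LineSentence(...).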
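Saving the training results:
# Save the trained model to data/vectors.bin. model.save() writes gensim's own binary format, which can be reloaded later with Word2Vec.load() for further work.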
model.save('data/vectors.bin')
# To make the trained word vectors easier to inspect, they can also be written to the text file data/vectors.txt.
model.wv.save_word2vec_format('data/vectors.txt', binary=False)
1.3 Word2vec usage examples
1.3.1 Training Chinese word vectors
Chinese corpus (only a small excerpt from the beginning is shown):
经典 教程 转载 教程 目录 简介 数据 式 数据准备 关联规则 购物篮分析 分类 回归 聚类分析 简介
Experiment code:
import gensim.models as g
from gensim.models.word2vec import LineSentence
model=g.Word2Vec(LineSentence('data/1.txt'),size=50,min_count=1)
model.save('data/v.bin')
model.wv.save_word2vec_format('data/v.txt', binary=False)
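data/1.txt is the input corpus, data/v.bin is the binary file produced by training, and data/v.txt is the text file of word vectors. An excerpt of v.txt follows. The first line gives the vocabulary size (710 words) and the vector dimensionality (50); each subsequent line holds a word followed by its vector.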
710 50
属性 -0.009596 -0.001876 -0.009559 0.006456 -0.001698 0.003129 0.003461 -0.008876 0.007711 -0.007966 0.008706 0.008594 -0.000639 0.006059 -0.001408 0.004246 0.000866 0.005963 0.006523 -0.001072 -0.004322 -0.005270 -0.004433 -0.007570 0.006196 0.005732 0.003178 -0.001564 0.008695 -0.004273 -0.000454 0.006022 0.003671 -0.002460 -0.005034 -0.008246 0.008214 0.005232 0.008977 0.009046 -0.009300 0.003446 -0.003139 -0.008507 0.005131 -0.003137 0.001671 -0.000145 0.002956 0.008733
weka -0.001554 -0.002667 0.005671 -0.003087 0.005874 -0.000982 -0.007489 -0.003619 -0.001746 -0.002489 -0.007203 -0.006696 -0.004924 -0.005163 -0.004303 0.007519 -0.009520 0.000178 0.008966 0.003525 -0.003593 -0.009662 -0.001394 0.002259 -0.006288 -0.007043 0.002655 0.006285 -0.007610 -0.007114 -0.005075 0.007908 0.001376 0.006226 0.009289 0.004669 -0.002740 -0.005563 0.001656 -0.006386 0.001319 -0.005669 0.001278 0.001255 0.009341 0.005373 -0.005182 0.004410 0.005824 0.005403
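The layout of this text file is simple enough to parse by hand. Below is a minimal sketch of such a reader; the function name `read_word2vec_text` is made up for illustration, and the sample reuses the two words above with their vectors truncated to the first three values for brevity.

```python
import io

def read_word2vec_text(lines):
    """Parse the word2vec text format into (vocab_size, dim, {word: [floats]})."""
    rows = iter(lines)
    # Header line: "<vocabulary size> <vector dimensionality>".
    vocab_size, dim = (int(x) for x in next(rows).split())
    vectors = {}
    for row in rows:
        parts = row.split()
        if not parts:
            continue  # skip blank lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError("%r has %d values, expected %d" % (word, len(values), dim))
        vectors[word] = values
    return vocab_size, dim, vectors

# Sample in the same layout, truncated to three dimensions for brevity.
sample = io.StringIO(
    "2 3\n"
    "属性 -0.009596 -0.001876 -0.009559\n"
    "weka -0.001554 -0.002667 0.005671\n"
)
vocab_size, dim, vectors = read_word2vec_text(sample)
print(vocab_size, dim)
print(vectors["weka"])
```

For real files, gensim's KeyedVectors.load_word2vec_format is the ready-made loader; the sketch only shows what the file contains.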
Look up the word vector for '经典':
s = model.wv['经典']
print(s)
[-0.00151591 0.00092584 -0.009939 -0.00224788 0.00265429 -0.00093409 -0.00179082 -0.00541331 0.00329962 -0.00698855 -0.00517856 -0.00500181 0.00651171 -0.00661191 0.00882049 0.0098754 0.00071282 -0.00142486 0.00129473 -0.00415983 0.00480736 -0.00090799 0.00340422 0.00832723 -0.00304851 0.00366337 -0.00927676 0.0067507 0.00159891 0.00384319 -0.00919439 -0.00999665 0.00552959 0.00835639 0.00578091 -0.00271975 -0.00355495 0.00936656 0.00503161 -0.00182825 0.00873035 0.00328094 0.00860831 -0.00161888 -0.00698135 -0.00649323 0.00175485 -0.00052322 -0.00751577 0.00466034]
1.3.2 Training English word vectors
English corpus (only a small excerpt from the beginning is shown):
anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements
Experiment code:
import gensim.models as g
from gensim.models.word2vec import LineSentence
model=g.Word2Vec(LineSentence('data/test.txt'),size=100,min_count=1)
model.save('data/vectors.bin')
model.wv.save_word2vec_format('data/vectors.txt', binary=False)
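data/test.txt is the input corpus, data/vectors.bin is the binary file produced by training, and data/vectors.txt is the text file of word vectors. An excerpt of vectors.txt follows. The first line gives the vocabulary size (666 words) and the vector dimensionality (100).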
666 100
the 0.004054 -0.005728 0.001882 -0.007849 0.000501 -0.000245 0.002579 -0.006704 -0.000515 -0.006479 -0.002866 -0.000778 0.000011 0.002991 -0.006956 0.002837 -0.000320 -0.003594 -0.000749 -0.001940 -0.000699 0.004678 0.000189 0.005632 -0.011995 -0.008831 -0.004254 0.004729 -0.009354 0.012335 -0.002985 -0.001294 -0.000387 -0.000695 -0.008349 0.004057 0.012475 -0.001510 0.007925 -0.002098 -0.000324 -0.005771 -0.004947 0.000327 -0.001644 -0.007850 -0.004993 -0.006858 0.000746 0.008955 -0.007938 -0.003369 0.002979 0.002525 0.004577 -0.005645 -0.002922 -0.005588 0.010486 0.002849 0.004451 -0.004816 -0.005280 -0.007834 -0.001578 -0.003363 -0.010155 -0.000018 0.000580 -0.002440 -0.001560 0.009118 0.005289 -0.001354 -0.005925 -0.002601 -0.000712 -0.003121 -0.008938 -0.005457 0.000100 -0.002922 0.015099 0.005530 -0.010080 0.004722 0.006936 0.003801 -0.001417 0.003169 -0.007495 0.002904 0.001612 0.002964 -0.006149 0.002020 0.000339 0.007824 0.000346 0.002536
Look up the word vector for 'term':
s = model.wv['term']
print(s)
[ -8.02484981e-04 3.00095952e-03 -2.80341203e-03 -2.28437409e-03
-1.41002267e-04 3.17938073e-04 -1.92295073e-03 1.20879768e-03
2.65529496e-03 -1.28982833e-03 1.91517011e-03 -4.56867693e-03
2.18311977e-03 3.81058129e-03 -4.24355967e-03 -3.17155820e-04
1.09942793e-03 2.39409064e-03 -3.63637373e-04 -1.84015720e-03
4.41278913e-04 -3.52353952e-03 -3.73517699e-03 4.22701379e-03
-1.51773565e-03 -3.12223769e-04 -3.87281552e-03 4.57488419e-03
5.01494098e-04 -1.16992218e-03 -7.07793864e-04 7.98304332e-04
-6.94587361e-04 3.93078197e-03 -8.57832725e-04 -3.53127725e-05
-4.22595243e-04 -4.07684455e-03 1.00225047e-03 -1.50288991e-03
-3.13035818e-03 2.82595353e-03 8.76318838e-04 4.85123321e-03
4.31202492e-03 -2.23689433e-03 2.42896122e-03 1.09624270e-04
-3.44186695e-03 4.13992163e-03 -7.77615292e-04 -3.60144814e-03
-4.39681392e-03 -2.65590707e-03 -3.72421159e-03 1.81939476e-03
1.78643677e-03 2.86483858e-03 1.47811277e-03 9.28127265e-04
3.18731368e-03 -3.80100426e-03 2.40622307e-04 -2.19078665e-03
3.50835803e-03 2.78714317e-04 -9.21671162e-04 -2.44749500e-03
3.74052743e-03 3.42344493e-03 -7.17817107e-04 -1.34494551e-03
-1.16853847e-03 -2.11323774e-03 3.73977539e-03 1.91729330e-03
3.98231298e-03 4.98663634e-04 2.42953142e-03 -1.06209144e-03
-2.44620093e-03 1.36581645e-03 1.18581043e-03 -7.93479325e-04
2.43103225e-03 -4.14129347e-03 -2.47231149e-03 -1.35558052e-03
4.02195612e-03 -2.43257638e-03 -2.05650902e-03 -1.16446456e-04
3.31417285e-03 6.20363280e-04 4.15661745e-03 1.28834159e-03
-4.63809120e-03 -2.60737562e-03 -3.23505420e-03 1.68117651e-04]