Gensim Models (4): The FastText Model

The FastText Model

This tutorial introduces Gensim's fastText model and demonstrates how to use it on the Lee Corpus.

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Here, we'll learn to use the fastText library to train word embedding models, save and load them, and perform similarity operations and vector lookups analogous to Word2Vec.

When to use fastText?

The main principle behind fastText is that the morphological structure of a word carries important information about its meaning. Such structure is not taken into account by traditional word embedding models like Word2Vec, which train a unique word embedding for each individual word. This is especially significant for morphologically rich languages (German, Turkish), in which a single word can have a large number of morphological forms, each of which may occur rarely, making it hard for traditional models to train good word embeddings.

fastText attempts to solve this by treating each word as the aggregation of its subwords. For the sake of simplicity and language-independence, subwords are taken to be the character ngrams of the word. The vector for a word is simply taken to be the sum of the vectors of all its component char-ngrams (subwords).

According to a detailed comparison of Word2Vec and fastText, fastText does significantly better on syntactic tasks than the original Word2Vec, especially when the size of the training corpus is small. Word2Vec slightly outperforms fastText on semantic tasks, though. The differences grow smaller as the size of the training corpus increases.

fastText can even obtain vectors for out-of-vocabulary (OOV) words by summing up the vectors of their component char-ngrams (subwords), provided at least one of those char-ngrams was present in the training data.
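As a toy illustration of this idea (a sketch only, not gensim's actual implementation; the `ngram_vectors` table and its random values are invented for the example), the following extracts a word's boundary-marked character ngrams and builds an OOV vector by summing the per-ngram vectors:

```python
import numpy as np

def char_ngrams(word, min_n=3, max_n=6):
    """Character ngrams of a word, using fastText-style '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(marked) - n + 1)]

# Toy ngram vector table: in a real model these come from training.
rng = np.random.default_rng(0)
ngram_vectors = {g: rng.normal(size=4) for g in char_ngrams("night")}

def oov_vector(word):
    """Vector for an unseen word: sum over its char ngrams seen in training."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    return np.sum(known, axis=0) if known else np.zeros(4)

print(char_ngrams("night", 3, 3))  # ['<ni', 'nig', 'igh', 'ght', 'ht>']
print(oov_vector("nights"))        # non-zero: 'nights' shares ngrams with 'night'
```

Because "nights" shares ngrams such as "nig" and "ght" with "night", it receives a meaningful vector even though it never appeared as a full word.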

Training the model

For the following examples, we'll use the Lee Corpus to train our model.

from pprint import pprint as print
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath

# Set file names for train and test data
corpus_file = datapath('lee_background.cor')

model = FastText(vector_size=100)

# build the vocabulary
model.build_vocab(corpus_file=corpus_file)

# train the model
model.train(
    corpus_file=corpus_file, epochs=model.epochs,
    total_examples=model.corpus_count, total_words=model.corpus_total_words,
)

print(model)

Output:

<gensim.models.fasttext.FastText object at 0x7f9733391be0>

Training hyperparameters

The hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the following parameters from the original word2vec:

Parameter     Description
model         Training architecture. Allowed values: cbow (default), skipgram
vector_size   Dimensionality of the vector embeddings to be learned (default: 100)
alpha         Initial learning rate (default: 0.025)
window        Context window size (default: 5)
min_count     Ignore words with fewer occurrences than this (default: 5)
loss          Training objective. Allowed values: ns (negative sampling, default), hs (hierarchical softmax), softmax
sample        Threshold for downsampling higher-frequency words (default: 0.001)
negative      Number of negative words to sample; only used when loss is set to ns (default: 5)
epochs        Number of epochs (default: 5)
sorted_vocab  Sort vocabulary by descending frequency (default: 1)
threads       Number of threads to use (default: 12)
min_n         Minimum length of char ngrams (default: 3)
max_n         Maximum length of char ngrams (default: 6)
bucket        Number of buckets used for hashing ngrams (default: 2000000)

The parameters min_n and max_n control the lengths of the character ngrams each word is broken into, both for training and for looking up embeddings. If max_n is set to 0, or to a value smaller than min_n, no character ngrams are used and the model effectively reduces to Word2Vec.

To bound the memory requirements of the model being trained, a hashing function is used that maps ngrams to integers in 1 to K. For hashing these character sequences, the Fowler-Noll-Vo hashing function (FNV-1a variant) is employed.
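A minimal sketch of that hashing step follows. The constants are the standard 32-bit FNV-1a parameters, and the bucket count matches the default bucket value of 2,000,000 above; the exact mapping in gensim may differ in detail, so treat this as an illustration of the technique rather than its implementation:

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5                     # FNV offset basis
    for byte in data:
        h ^= byte                      # XOR in the byte first ...
        h = (h * 0x01000193) % 2**32   # ... then multiply by the FNV prime
    return h

def ngram_bucket(ngram: str, num_buckets: int = 2_000_000) -> int:
    """Map a char ngram to a fixed-size bucket index."""
    return fnv1a_32(ngram.encode("utf-8")) % num_buckets

print(ngram_bucket("<ni"))  # some index in [0, 2_000_000)
```

Hash collisions mean distinct ngrams can share a bucket (and hence a vector), which is the price paid for the bounded memory footprint.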

Note: You can continue training a model while using Gensim's native implementation of fastText.

Saving/loading models

Models can be saved and loaded via the load and save methods, just like any other model in Gensim.

# Save a model trained via Gensim's fastText implementation to temp.
import tempfile
import os
with tempfile.NamedTemporaryFile(prefix='saved_model_gensim-', delete=False) as tmp:
    model.save(tmp.name, separately=[])

# Load back the same model.
loaded_model = FastText.load(tmp.name)
print(loaded_model)

os.unlink(tmp.name)  # demonstration complete, don't need the temp file anymore

Output:

<gensim.models.fasttext.FastText object at 0x7f972fe265b0>

save_word2vec_format is also available for fastText models, but it will cause all vectors for ngrams to be lost. As a result, a model loaded in this way will behave like a regular word2vec model.

Word vector lookup

All the information needed to look up fastText words (including OOV words) is contained in its model.wv attribute.

If you don't need to continue training your model, you can export and save this .wv attribute and discard the model, to save space and RAM.

wv = model.wv
print(wv)

#
# FastText models support vector lookups for out-of-vocabulary words by summing up character ngrams belonging to the word.
#
print('night' in wv.key_to_index)

Output:

<gensim.models.fasttext.FastTextKeyedVectors object at 0x7f9733391280>
True
print('nights' in wv.key_to_index)

Output:

False
print(wv['night'])

Output:

array([ 0.12453239, -0.26018462, -0.04087191,  0.2563215 ,  0.31401935,
        0.16155584,  0.39527607,  0.27404118, -0.45236284,  0.06942682,
        0.36584955,  0.51162827, -0.51161295, -0.192019  , -0.5068029 ,
       -0.07426998, -0.6276584 ,  0.22271585,  0.19990133,  0.2582401 ,
        0.14329399, -0.01959469, -0.45576197, -0.06447829,  0.1493489 ,
        0.17261286, -0.13472046,  0.26546794, -0.34596932,  0.5626187 ,
       -0.7038802 ,  0.15603925, -0.03104019, -0.06228801, -0.13480644,
       -0.0684596 ,  0.24728075,  0.55081636,  0.07330963,  0.32814154,
        0.1574982 ,  0.56742406, -0.31233737,  0.14195296,  0.0540203 ,
        0.01718009,  0.05519052, -0.04002226,  0.16157456, -0.5134223 ,
       -0.01033936,  0.05745083, -0.39208183,  0.52553374, -1.0542839 ,
        0.2145304 , -0.15234643, -0.35197273, -0.6215585 ,  0.01796502,
        0.21242104,  0.30762967,  0.2787644 , -0.19908747,  0.7144409 ,
        0.45586124, -0.21344525,  0.26920903, -0.651759  , -0.37096855,
       -0.16243419, -0.3085725 , -0.70485127, -0.04926324, -0.80278563,
       -0.24352737,  0.6427129 , -0.3530421 , -0.29960123,  0.01466726,
       -0.18253349, -0.2489397 ,  0.00648343,  0.18057272, -0.11812428,
       -0.49044088,  0.1847386 , -0.27946883,  0.3941279 , -0.39211616,
        0.26847798,  0.41468227, -0.3953728 , -0.25371104,  0.3390468 ,
       -0.16447693, -0.18722224,  0.2782088 , -0.0696249 ,  0.4313547 ],
      dtype=float32)
print(wv['nights'])

Output:

array([ 0.10586783, -0.22489995, -0.03636307,  0.22263278,  0.27037606,
        0.1394871 ,  0.3411114 ,  0.2369042 , -0.38989475,  0.05935   ,
        0.31713557,  0.44301754, -0.44249156, -0.16652377, -0.4388366 ,
       -0.06266895, -0.5436303 ,  0.19294666,  0.17363031,  0.22459263,
        0.12532061, -0.01866964, -0.3936521 , -0.05507145,  0.12905194,
        0.14942174, -0.11657442,  0.22935589, -0.29934618,  0.4859668 ,
       -0.6073519 ,  0.13433163, -0.02491274, -0.05468523, -0.11884545,
       -0.06117092,  0.21444008,  0.4775469 ,  0.06227469,  0.28350767,
        0.13580805,  0.48993143, -0.27067345,  0.1252003 ,  0.04606731,
        0.01598426,  0.04640368, -0.03456376,  0.14138013, -0.44429192,
       -0.00865329,  0.05027836, -0.341311  ,  0.45402458, -0.91097856,
        0.1868968 , -0.13116683, -0.30361563, -0.5364188 ,  0.01603454,
        0.18146741,  0.26708448,  0.24074472, -0.17163375,  0.61906886,
        0.39530373, -0.18259627,  0.23319626, -0.5634787 , -0.31959867,
       -0.13945322, -0.269441  , -0.60941464, -0.0403638 , -0.69563633,
       -0.2098089 ,  0.5569868 , -0.30320194, -0.25840232,  0.01436759,
       -0.15632603, -0.21624804,  0.00434287,  0.15566474, -0.10228094,
       -0.4249678 ,  0.16197811, -0.24147548,  0.34205705, -0.3391568 ,
        0.23235887,  0.35860622, -0.34247142, -0.21777524,  0.29318404,
       -0.1407287 , -0.16115218,  0.24247572, -0.06217333,  0.37221798],
      dtype=float32)

Similarity operations

Similarity operations work the same way as in word2vec. Out-of-vocabulary words can also be used, provided they have at least one character ngram present in the training data.
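For reference, the similarity score returned below is just the cosine similarity between the two word vectors. A self-contained sketch with made-up stand-in vectors (the values are invented for illustration, not taken from the trained model):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, as used by wv.similarity."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for wv['night'] and wv['nights']; real fastText
# vectors for such words share most char ngrams, so they point in nearly
# the same direction and their cosine similarity is close to 1.
v_night = np.array([0.12, -0.26, -0.04, 0.26])
v_nights = np.array([0.11, -0.22, -0.04, 0.22])
print(round(cosine_similarity(v_night, v_nights), 4))
```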

print("nights" in wv.key_to_index)

Output:

False
print("night" in wv.key_to_index)

Output:

True
print(wv.similarity("night", "nights"))

Output:

0.999992

Syntactically similar words generally have high similarity in fastText models, since a large number of their char-ngrams will be the same. As a result, fastText generally does better at syntactic tasks than Word2Vec. A detailed comparison is provided here.

Other similarity operations

The example training corpus is a toy corpus, so results are not expected to be good; this is for proof-of-concept only.

print(wv.most_similar("nights"))

Output:

[('night', 0.9999929070472717),
 ('night.', 0.9999895095825195),
 ('flights', 0.999988853931427),
 ('rights', 0.9999886751174927),
 ('residents', 0.9999884366989136),
 ('overnight', 0.9999883770942688),
 ('commanders', 0.999988317489624),
 ('reached', 0.9999881386756897),
 ('commander', 0.9999880790710449),
 ('leading', 0.999987781047821)]
print(wv.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant']))

Output:

0.9999402
print(wv.doesnt_match("breakfast cereal dinner lunch".split()))

Output:

'lunch'
print(wv.most_similar(positive=['baghdad', 'england'], negative=['london']))

Output:

[('attempt', 0.999660074710846),
 ('biggest', 0.9996545314788818),
 ('again', 0.9996527433395386),
 ('against', 0.9996523857116699),
 ('doubles', 0.9996522068977356),
 ('Royal', 0.9996512532234192),
 ('Airlines', 0.9996494054794312),
 ('forced', 0.9996494054794312),
 ('arrest', 0.9996492266654968),
 ('follows', 0.999649167060852)]
print(wv.evaluate_word_analogies(datapath('questions-words.txt')))

Output:

(0.24489795918367346,
 [{'correct': [], 'incorrect': [], 'section': 'capital-common-countries'},
  {'correct': [], 'incorrect': [], 'section': 'capital-world'},
  {'correct': [], 'incorrect': [], 'section': 'currency'},
  {'correct': [], 'incorrect': [], 'section': 'city-in-state'},
  {'correct': [],
   'incorrect': [('HE', 'SHE', 'HIS', 'HER'), ('HIS', 'HER', 'HE', 'SHE')],
   'section': 'family'},
  {'correct': [], 'incorrect': [], 'section': 'gram1-adjective-to-adverb'},
  {'correct': [], 'incorrect': [], 'section': 'gram2-opposite'},
  {'correct': [('GOOD', 'BETTER', 'LOW', 'LOWER'),
               ('GREAT', 'GREATER', 'LOW', 'LOWER'),
               ('LONG', 'LONGER', 'LOW', 'LOWER')],
   'incorrect': [('GOOD', 'BETTER', 'GREAT', 'GREATER'),
                 ('GOOD', 'BETTER', 'LONG', 'LONGER'),
                 ('GREAT', 'GREATER', 'LONG', 'LONGER'),
                 ('GREAT', 'GREATER', 'GOOD', 'BETTER'),
                 ('LONG', 'LONGER', 'GOOD', 'BETTER'),
                 ('LONG', 'LONGER', 'GREAT', 'GREATER'),
                 ('LOW', 'LOWER', 'GOOD', 'BETTER'),
                 ('LOW', 'LOWER', 'GREAT', 'GREATER'),
                 ('LOW', 'LOWER', 'LONG', 'LONGER')],
   'section': 'gram3-comparative'},
  {'correct': [('BIG', 'BIGGEST', 'LARGE', 'LARGEST'),
               ('GOOD', 'BEST', 'LARGE', 'LARGEST'),
               ('GREAT', 'GREATEST', 'LARGE', 'LARGEST')],
   'incorrect': [('BIG', 'BIGGEST', 'GOOD', 'BEST'),
                 ('BIG', 'BIGGEST', 'GREAT', 'GREATEST'),
                 ('GOOD', 'BEST', 'GREAT', 'GREATEST'),
                 ('GOOD', 'BEST', 'BIG', 'BIGGEST'),
                 ('GREAT', 'GREATEST', 'BIG', 'BIGGEST'),
                 ('GREAT', 'GREATEST', 'GOOD', 'BEST'),
                 ('LARGE', 'LARGEST', 'BIG', 'BIGGEST'),
                 ('LARGE', 'LARGEST', 'GOOD', 'BEST'),
                 ('LARGE', 'LARGEST', 'GREAT', 'GREATEST')],
   'section': 'gram4-superlative'},
  {'correct': [('GO', 'GOING', 'SAY', 'SAYING'),
               ('LOOK', 'LOOKING', 'PLAY', 'PLAYING'),
               ('LOOK', 'LOOKING', 'SAY', 'SAYING'),
               ('LOOK', 'LOOKING', 'GO', 'GOING'),
               ('PLAY', 'PLAYING', 'SAY', 'SAYING'),
               ('PLAY', 'PLAYING', 'GO', 'GOING'),
               ('SAY', 'SAYING', 'GO', 'GOING')],
   'incorrect': [('GO', 'GOING', 'LOOK', 'LOOKING'),
                 ('GO', 'GOING', 'PLAY', 'PLAYING'),
                 ('GO', 'GOING', 'RUN', 'RUNNING'),
                 ('LOOK', 'LOOKING', 'RUN', 'RUNNING'),
                 ('PLAY', 'PLAYING', 'RUN', 'RUNNING'),
                 ('PLAY', 'PLAYING', 'LOOK', 'LOOKING'),
                 ('RUN', 'RUNNING', 'SAY', 'SAYING'),
                 ('RUN', 'RUNNING', 'GO', 'GOING'),
                 ('RUN', 'RUNNING', 'LOOK', 'LOOKING'),
                 ('RUN', 'RUNNING', 'PLAY', 'PLAYING'),
                 ('SAY', 'SAYING', 'LOOK', 'LOOKING'),
                 ('SAY', 'SAYING', 'PLAY', 'PLAYING'),
                 ('SAY', 'SAYING', 'RUN', 'RUNNING')],
   'section': 'gram5-present-participle'},
  {'correct': [('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),
               ('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'),
               ('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'),
               ('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'),
               ('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'),
               ('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN')],
   'incorrect': [('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),
                 ('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'),
                 ('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'),
                 ('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'),
                 ('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'),
                 ('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'),
                 ('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN'),
                 ('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'),
                 ('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'),
                 ('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'),
                 ('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'),
                 ('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'),
                 ('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'),
                 ('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI')],
   'section': 'gram6-nationality-adjective'},
  {'correct': [],
   'incorrect': [('GOING', 'WENT', 'PAYING', 'PAID'),
                 ('GOING', 'WENT', 'PLAYING', 'PLAYED'),
                 ('GOING', 'WENT', 'SAYING', 'SAID'),
                 ('GOING', 'WENT', 'TAKING', 'TOOK'),
                 ('PAYING', 'PAID', 'PLAYING', 'PLAYED'),
                 ('PAYING', 'PAID', 'SAYING', 'SAID'),
                 ('PAYING', 'PAID', 'TAKING', 'TOOK'),
                 ('PAYING', 'PAID', 'GOING', 'WENT'),
                 ('PLAYING', 'PLAYED', 'SAYING', 'SAID'),
                 ('PLAYING', 'PLAYED', 'TAKING', 'TOOK'),
                 ('PLAYING', 'PLAYED', 'GOING', 'WENT'),
                 ('PLAYING', 'PLAYED', 'PAYING', 'PAID'),
                 ('SAYING', 'SAID', 'TAKING', 'TOOK'),
                 ('SAYING', 'SAID', 'GOING', 'WENT'),
                 ('SAYING', 'SAID', 'PAYING', 'PAID'),
                 ('SAYING', 'SAID', 'PLAYING', 'PLAYED'),
                 ('TAKING', 'TOOK', 'GOING', 'WENT'),
                 ('TAKING', 'TOOK', 'PAYING', 'PAID'),
                 ('TAKING', 'TOOK', 'PLAYING', 'PLAYED'),
                 ('TAKING', 'TOOK', 'SAYING', 'SAID')],
   'section': 'gram7-past-tense'},
  {'correct': [('BUILDING', 'BUILDINGS', 'CAR', 'CARS'),
               ('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'),
               ('CAR', 'CARS', 'BUILDING', 'BUILDINGS'),
               ('CHILD', 'CHILDREN', 'CAR', 'CARS'),
               ('MAN', 'MEN', 'CAR', 'CARS')],
   'incorrect': [('BUILDING', 'BUILDINGS', 'MAN', 'MEN'),
                 ('CAR', 'CARS', 'CHILD', 'CHILDREN'),
                 ('CAR', 'CARS', 'MAN', 'MEN'),
                 ('CHILD', 'CHILDREN', 'MAN', 'MEN'),
                 ('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'),
                 ('MAN', 'MEN', 'BUILDING', 'BUILDINGS'),
                 ('MAN', 'MEN', 'CHILD', 'CHILDREN')],
   'section': 'gram8-plural'},
  {'correct': [], 'incorrect': [], 'section': 'gram9-plural-verbs'},
  {'correct': [('GOOD', 'BETTER', 'LOW', 'LOWER'),
               ('GREAT', 'GREATER', 'LOW', 'LOWER'),
               ('LONG', 'LONGER', 'LOW', 'LOWER'),
               ('BIG', 'BIGGEST', 'LARGE', 'LARGEST'),
               ('GOOD', 'BEST', 'LARGE', 'LARGEST'),
               ('GREAT', 'GREATEST', 'LARGE', 'LARGEST'),
               ('GO', 'GOING', 'SAY', 'SAYING'),
               ('LOOK', 'LOOKING', 'PLAY', 'PLAYING'),
               ('LOOK', 'LOOKING', 'SAY', 'SAYING'),
               ('LOOK', 'LOOKING', 'GO', 'GOING'),
               ('PLAY', 'PLAYING', 'SAY', 'SAYING'),
               ('PLAY', 'PLAYING', 'GO', 'GOING'),
               ('SAY', 'SAYING', 'GO', 'GOING'),
               ('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),
               ('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'),
               ('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'),
               ('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'),
               ('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'),
               ('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'),
               ('BUILDING', 'BUILDINGS', 'CAR', 'CARS'),
               ('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'),
               ('CAR', 'CARS', 'BUILDING', 'BUILDINGS'),
               ('CHILD', 'CHILDREN', 'CAR', 'CARS'),
               ('MAN', 'MEN', 'CAR', 'CARS')],
   'incorrect': [('HE', 'SHE', 'HIS', 'HER'),
                 ('HIS', 'HER', 'HE', 'SHE'),
                 ('GOOD', 'BETTER', 'GREAT', 'GREATER'),
                 ('GOOD', 'BETTER', 'LONG', 'LONGER'),
                 ('GREAT', 'GREATER', 'LONG', 'LONGER'),
                 ('GREAT', 'GREATER', 'GOOD', 'BETTER'),
                 ('LONG', 'LONGER', 'GOOD', 'BETTER'),
                 ('LONG', 'LONGER', 'GREAT', 'GREATER'),
                 ('LOW', 'LOWER', 'GOOD', 'BETTER'),
                 ('LOW', 'LOWER', 'GREAT', 'GREATER'),
                 ('LOW', 'LOWER', 'LONG', 'LONGER'),
                 ('BIG', 'BIGGEST', 'GOOD', 'BEST'),
                 ('BIG', 'BIGGEST', 'GREAT', 'GREATEST'),
                 ('GOOD', 'BEST', 'GREAT', 'GREATEST'),
                 ('GOOD', 'BEST', 'BIG', 'BIGGEST'),
                 ('GREAT', 'GREATEST', 'BIG', 'BIGGEST'),
                 ('GREAT', 'GREATEST', 'GOOD', 'BEST'),
                 ('LARGE', 'LARGEST', 'BIG', 'BIGGEST'),
                 ('LARGE', 'LARGEST', 'GOOD', 'BEST'),
                 ('LARGE', 'LARGEST', 'GREAT', 'GREATEST'),
                 ('GO', 'GOING', 'LOOK', 'LOOKING'),
                 ('GO', 'GOING', 'PLAY', 'PLAYING'),
                 ('GO', 'GOING', 'RUN', 'RUNNING'),
                 ('LOOK', 'LOOKING', 'RUN', 'RUNNING'),
                 ('PLAY', 'PLAYING', 'RUN', 'RUNNING'),
                 ('PLAY', 'PLAYING', 'LOOK', 'LOOKING'),
                 ('RUN', 'RUNNING', 'SAY', 'SAYING'),
                 ('RUN', 'RUNNING', 'GO', 'GOING'),
                 ('RUN', 'RUNNING', 'LOOK', 'LOOKING'),
                 ('RUN', 'RUNNING', 'PLAY', 'PLAYING'),
                 ('SAY', 'SAYING', 'LOOK', 'LOOKING'),
                 ('SAY', 'SAYING', 'PLAY', 'PLAYING'),
                 ('SAY', 'SAYING', 'RUN', 'RUNNING'),
                 ('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),
                 ('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'),
                 ('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'),
                 ('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'),
                 ('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'),
                 ('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'),
                 ('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN'),
                 ('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'),
                 ('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'),
                 ('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'),
                 ('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'),
                 ('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'),
                 ('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'),
                 ('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'),
                 ('GOING', 'WENT', 'PAYING', 'PAID'),
                 ('GOING', 'WENT', 'PLAYING', 'PLAYED'),
                 ('GOING', 'WENT', 'SAYING', 'SAID'),
                 ('GOING', 'WENT', 'TAKING', 'TOOK'),
                 ('PAYING', 'PAID', 'PLAYING', 'PLAYED'),
                 ('PAYING', 'PAID', 'SAYING', 'SAID'),
                 ('PAYING', 'PAID', 'TAKING', 'TOOK'),
                 ('PAYING', 'PAID', 'GOING', 'WENT'),
                 ('PLAYING', 'PLAYED', 'SAYING', 'SAID'),
                 ('PLAYING', 'PLAYED', 'TAKING', 'TOOK'),
                 ('PLAYING', 'PLAYED', 'GOING', 'WENT'),
                 ('PLAYING', 'PLAYED', 'PAYING', 'PAID'),
                 ('SAYING', 'SAID', 'TAKING', 'TOOK'),
                 ('SAYING', 'SAID', 'GOING', 'WENT'),
                 ('SAYING', 'SAID', 'PAYING', 'PAID'),
                 ('SAYING', 'SAID', 'PLAYING', 'PLAYED'),
                 ('TAKING', 'TOOK', 'GOING', 'WENT'),
                 ('TAKING', 'TOOK', 'PAYING', 'PAID'),
                 ('TAKING', 'TOOK', 'PLAYING', 'PLAYED'),
                 ('TAKING', 'TOOK', 'SAYING', 'SAID'),
                 ('BUILDING', 'BUILDINGS', 'MAN', 'MEN'),
                 ('CAR', 'CARS', 'CHILD', 'CHILDREN'),
                 ('CAR', 'CARS', 'MAN', 'MEN'),
                 ('CHILD', 'CHILDREN', 'MAN', 'MEN'),
                 ('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'),
                 ('MAN', 'MEN', 'BUILDING', 'BUILDINGS'),
                 ('MAN', 'MEN', 'CHILD', 'CHILDREN')],
   'section': 'Total accuracy'}])

Word Mover's Distance

This section requires the pyemd library: pip install pyemd.

Start with two sentences:

sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

Remove their stopwords:

from gensim.parsing.preprocessing import STOPWORDS
sentence_obama = [w for w in sentence_obama if w not in STOPWORDS]
sentence_president = [w for w in sentence_president if w not in STOPWORDS]

Compute the Word Mover's Distance between the two sentences:

distance = wv.wmdistance(sentence_obama, sentence_president)
print(f"Word Movers Distance is {distance} (lower means closer)")

Output:

'Word Movers Distance is 0.015923231075180694 (lower means closer)'