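1. Install the gensim library

https://www.lfd.uci.edu/~gohlke/pythonlibs/#gensim
Download the build that matches your setup (I only found it by pressing Ctrl+F on the page; there are so many entries that my eyes glazed over).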


At first I downloaded the newest build, 37, but the installation failed with an error.
(As before, it is a .whl file, placed in the Scripts folder.)


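I assumed I needed to upgrade pip, so I asked my husband, and he said it was a version problem.
So I downloaded build 36 instead, and that one installed successfully.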

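One way to avoid this kind of trial and error is to ask Python which build it expects. The sketch below assumes that the "37" and "36" above refer to the CPython tag in the wheel filename (cp37, cp36), which is my reading and not something the original states. The helper names and the example filename are hypothetical, and the filename check is only a rough string match.

```python
import sys


def cpython_wheel_tag(version_info=None):
    """Return the CPython tag (for example 'cp36') for a (major, minor, ...) tuple."""
    major, minor = (version_info or sys.version_info)[:2]
    return "cp%d%d" % (major, minor)


def wheel_matches(filename, version_info=None):
    """Rough check: does the wheel filename carry this interpreter's CPython tag?"""
    return ("-%s-" % cpython_wheel_tag(version_info)) in filename


# Hypothetical filename, used only to illustrate the naming pattern.
example = "example_pkg-1.0-cp36-cp36m-win_amd64.whl"

print("This interpreter expects wheels tagged:", cpython_wheel_tag())
print("Example wheel fits Python 3.6:", wheel_matches(example, (3, 6)))
print("Example wheel fits Python 3.7:", wheel_matches(example, (3, 7)))
```

If the tag printed on the first line does not appear in the filename of the wheel you downloaded, pip will most likely refuse to install it.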

2. Implementation

from gensim import corpora, models, similarities
import jieba


# Tokenizer: returns the list of tokens for a piece of text
def cut(sentence):
    return list(jieba.cut(sentence))


# Helper: read a whole UTF-8 text file and close it afterwards
def read_text(path):
    with open(path, encoding='UTF-8') as f:
        return f.read()


# Document collection and query text (raw strings keep the backslashes literal)
text1 = read_text(r'D:\时间简史.txt')
text2 = read_text(r'D:\我的帝王生涯.txt')

texts = [text1, text2]
keyword = read_text(r'D:\果壳中的宇宙.txt')
# 1. Tokenize each document in the collection
texts = [cut(text) for text in texts]
# 2. Build a dictionary from the collection and get its feature count
dictionary = corpora.Dictionary(texts)
feature_cnt = len(dictionary.token2id)
# 3. Use the dictionary to turn the token lists into sparse vectors (the corpus)
corpus = [dictionary.doc2bow(text) for text in texts]
# 4. Fit a TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus)
# 5. Likewise, turn the query text into a sparse vector with the same dictionary
kw_vector = dictionary.doc2bow(cut(keyword))
# 6. Build a similarity index over the TF-IDF-weighted corpus
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_cnt)
# 7. Compute the similarities
sim = index[tfidf[kw_vector]]
for i in range(len(sim)):
    print('Similarity between keyword and text%d: %.2f' % (i + 1, sim[i]))
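Steps 2, 3 and 5 are easier to picture with a toy example. The following is a plain-Python sketch of the bag-of-words idea behind the dictionary and the sparse vectors. It is not gensim's implementation. The function names are made up for illustration, and the order in which ids are assigned here is my own choice and need not match what gensim produces.

```python
from collections import Counter


def build_token_ids(tokenized_docs):
    """Assign an integer id to every distinct token, in first-seen order."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens not in the dictionary are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


docs = [["time", "space", "time"], ["emperor", "time"]]
token2id = build_token_ids(docs)
bows = [to_bow(doc, token2id) for doc in docs]
# "gravity" never appears in the documents, so it is dropped from the query vector.
query_bow = to_bow(["space", "gravity", "space"], token2id)

print(token2id)
print(bows)
print(query_bow)
```

The point to take away is that the query is encoded with the dictionary built from the document collection, so any word that appears only in the query contributes nothing to the comparison.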

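I picked the texts myself. The query text is The Universe in a Nutshell (果壳中的宇宙), and it is matched against A Brief History of Time (时间简史) and My Life as Emperor (我的帝王生涯). To make the contrast obvious, I chose two popular cosmology books by Hawking, The Universe in a Nutshell and A Brief History of Time, and a novel, My Life as Emperor.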

The output of the run shows a clear difference between the two similarity scores.
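To see roughly what steps 4, 6 and 7 compute, here is a small plain-Python sketch of TF-IDF weighting followed by cosine similarity. It assumes the weighting that I understand to be gensim's default (raw term count times log2 of the number of documents over the document frequency, then scaled to unit length). Treat it as an approximation for illustration, not as the library's code. All names and the toy tokens are made up.

```python
import math
from collections import Counter


def make_vectorizer(docs):
    """Build a function that maps a token list to a unit-length TF-IDF vector."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in docs for token in set(doc))
    idf = {token: math.log2(n_docs / count) for token, count in df.items()}

    def vectorize(tokens):
        # Count only tokens known from the corpus; drop zero-weight entries.
        counts = Counter(t for t in tokens if t in idf)
        weights = {t: c * idf[t] for t, c in counts.items() if idf[t] > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {t: w / norm for t, w in weights.items()} if norm else {}

    return vectorize


def cosine(a, b):
    # Both vectors already have unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


docs = [
    ["universe", "black", "hole", "time", "the"],
    ["emperor", "palace", "throne", "time", "the"],
]
query = ["universe", "black", "hole", "the", "gravity"]

vectorize = make_vectorizer(docs)
doc_vectors = [vectorize(doc) for doc in docs]
scores = [cosine(vectorize(query), v) for v in doc_vectors]
for i, score in enumerate(scores, start=1):
    print("query vs doc%d: %.2f" % (i, score))
```

Under this weighting, a word that appears in every document gets weight log2(1) = 0. With only two documents in the collection, the comparison is therefore driven entirely by words that occur in just one of them, which may be part of why the gap between the two scores is so large.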
