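I recently came across an interesting paper, Sentence Similarity Based on Semantic Nets and Corpus Statistics. I happened to have a similar need at the time, so I implemented the algorithm described in the paper.
My implementation is based on Python 3 and the Natural Language Toolkit (NLTK), because NLTK ships with WordNet and the Brown Corpus, both of which the algorithm needs. Here is the implementation: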
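
from math import e, log, sqrt

import nltk
from nltk.corpus import brown
from nltk.corpus import wordnet as wn

# All tokens of the Brown corpus, across every category
corpus = []
for i in brown.categories():
    corpus.extend(brown.words(categories=i))

word_buff = {}  # cache: word -> list of WordNet synsets
threshold = 0.25  # minimum word similarity that counts as a match
semantic_and_word_order_factor = 0.8  # weight of semantic similarity; word order gets the rest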

def get_min_path_distance_and_subsumer_between_two_words(word1, word2):
    """
    Return the shortest WordNet path length between two words and the
    depth of the lowest common subsumer of the closest synset pair.
    Returns (None, 0) when no path exists.
    """
    if word1 not in word_buff:
        word_buff[word1] = wn.synsets(word1)
    if word2 not in word_buff:
        word_buff[word2] = wn.synsets(word2)
    word1_synsets = word_buff[word1]
    word2_synsets = word_buff[word2]
    if not word1_synsets or not word2_synsets:
        return None, 0
    min_distance = None
    min_pairs = None
    for word1_synset in word1_synsets:
        for word2_synset in word2_synsets:
            distance = word1_synset.shortest_path_distance(word2_synset)
            # A distance of 0 (shared synset) is valid; only None means "no path"
            if distance is not None and (min_distance is None or distance < min_distance):
                min_distance = distance
                min_pairs = (word1_synset, word2_synset)
    subsumer_depth = 0
    if min_pairs:
        # Compare the two different synsets of the closest pair
        subsumers = min_pairs[0].lowest_common_hypernyms(min_pairs[1])
        if subsumers:
            # Several subsumers can tie; use the deepest one
            subsumer_depth = max(s.min_depth() for s in subsumers)
    return min_distance, subsumer_depth

def similarity_between_two_words(word1, word2, length_factor=0.2, depth_factor=0.45):
    # Word similarity: exp(-length_factor * length) * tanh(depth_factor * depth)
    length, subsumer_depth = get_min_path_distance_and_subsumer_between_two_words(word1, word2)
    if length is None:
        return 0
    function_length = e ** -(length_factor * length)
    temp1 = e ** (depth_factor * subsumer_depth)
    temp2 = e ** -(depth_factor * subsumer_depth)
    function_depth = (temp1 - temp2) / (temp1 + temp2)  # tanh(depth_factor * subsumer_depth)
    return function_length * function_depth
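
def get_information_content(word, corpus):
    # Information content of a word: 1 - log(n + 1) / log(N + 1)
    n = corpus.count(word)  # occurrences of the word in the corpus
    N = len(corpus)  # total number of tokens in the corpus
    I_w = 1 - (log(n + 1) / log(N + 1))
    return I_w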

def word_order_vector(word_vector, joint_words):
    # Word-order vector: for each joint word, the 1-based position in the
    # sentence of the same or most similar word, or 0 if nothing matches
    res = []
    for word in joint_words:
        if word in word_vector:
            res.append(word_vector.index(word) + 1)
        else:
            max_similarity_word = None
            max_similarity = -1
            for t_word in word_vector:
                current_similarity = similarity_between_two_words(word, t_word)
                if current_similarity > threshold and current_similarity > max_similarity:
                    max_similarity = current_similarity
                    max_similarity_word = t_word
            if max_similarity_word is not None:
                res.append(word_vector.index(max_similarity_word) + 1)
            else:
                res.append(0)
    return res

def semantic_vector(word_vector, joint_words):
    # Semantic vector: one entry per joint word, weighted by information content
    res = []
    for word in joint_words:
        i_w1 = get_information_content(word, corpus)
        if word in word_vector:
            res.append(i_w1 * i_w1)
        else:
            # Scan word_vector for the word most similar to `word`
            max_similarity_word = None
            max_similarity = -1
            for t1_word in word_vector:
                current_similarity = similarity_between_two_words(word, t1_word)
                if current_similarity > threshold and current_similarity > max_similarity:
                    max_similarity = current_similarity
                    max_similarity_word = t1_word
            if max_similarity != -1:
                i_w2 = get_information_content(max_similarity_word, corpus)
                res.append(max_similarity * i_w1 * i_w2)
            else:
                res.append(0)
    return res

def sentence_similarity(sentence1, sentence2):
    words_1 = nltk.word_tokenize(sentence1)
    words_2 = nltk.word_tokenize(sentence2)
    if not words_1 or not words_2:
        return 0
    joint_words = list(set(words_1 + words_2))
    semantic_vector1 = semantic_vector(words_1, joint_words)
    semantic_vector2 = semantic_vector(words_2, joint_words)
    word_order1 = word_order_vector(words_1, joint_words)
    word_order2 = word_order_vector(words_2, joint_words)
    # Semantic similarity: cosine of the two semantic vectors
    s_s = sum(a * b for a, b in zip(semantic_vector1, semantic_vector2)) / sqrt(
        sum(a ** 2 for a in semantic_vector1) * sum(b ** 2 for b in semantic_vector2))
    # Word-order similarity as defined in the paper: 1 - ||r1 - r2|| / ||r1 + r2||
    s_r = 1 - sqrt(sum((a - b) ** 2 for a, b in zip(word_order1, word_order2))) / sqrt(
        sum((a + b) ** 2 for a, b in zip(word_order1, word_order2)))
    # Use a name that does not shadow the function itself
    similarity = semantic_and_word_order_factor * s_s + (1 - semantic_and_word_order_factor) * s_r
    print(sentence1, '|', sentence2, ':', similarity)
    return similarity
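
The word-similarity formula in similarity_between_two_words can be checked in isolation, without any WordNet lookup. The sketch below is only an illustration: the helper name word_similarity and the sample (path length, depth) pairs are made up for this example, while the default factors 0.2 and 0.45 are the ones used in the code above. It relies on the fact that (e^x - e^-x) / (e^x + e^-x) is tanh(x).

```python
from math import exp, tanh


def word_similarity(path_length, subsumer_depth, length_factor=0.2, depth_factor=0.45):
    """exp(-length_factor * l) * tanh(depth_factor * h), the formula used in the post."""
    return exp(-length_factor * path_length) * tanh(depth_factor * subsumer_depth)


# Hypothetical (path length, subsumer depth) pairs, not taken from WordNet
samples = [(0, 8), (2, 8), (2, 2), (10, 2)]
for path_length, subsumer_depth in samples:
    score = word_similarity(path_length, subsumer_depth)
    print(path_length, subsumer_depth, round(score, 4))
```

The score falls as the path between the two synsets gets longer and rises as their common subsumer sits deeper in the hierarchy, so two words that only meet near the root of WordNet get a low score even when the path is short.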
Some test results:
What is the step by step guide to invest in share market in india? | What is the step by step guide to invest in share market? : 0.6834055667921426
What is the story of Kohinoor (Koh-i-Noor) Diamond? | What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? : 0.7238159709057276
How can I increase the speed of my internet connection while using a VPN? | How can Internet speed be increased by hacking through DNS? : 0.3474180327786902
Why am I mentally very lonely? How can I solve it? | Find the remainder when [math]23^{24}[/math] is divided by 24,23? : 0.24185376358110777
Which one dissolve in water quikly sugar, salt, methane and carbon di oxide? | Which fish would survive in salt water? : 0.5557426453712866
Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me? | I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me? : 0.5619685362853818
Should I buy tiago? | What keeps childern active and far from phone and video games? : 0.273650666926712
How can I be a good geologist? | What should I do to be a great geologist? : 0.7444940225200597
When do you use シ instead of し? | When do you use "&" instead of "and"? : 0.33368722311749527
Motorola (company): Can I hack my Charter Motorolla DCX3400? | How do I hack Motorola DCX3400 for free internet? : 0.679325702169737
Method to find separation of slits using fresnel biprism? | What are some of the things technicians can tell about the durability and reliability of Laptops and its components? : 0.42371839556731794
How do I read and find my YouTube comments? | How can I see all my Youtube comments? : 0.39666438912838764
What can make Physics easy to learn? | How can you make physics easy to learn? : 0.7470727852312119
What was your first sexual experience like? | What was your first sexual experience? : 0.7939444688772478
What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada? | What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan? : 0.7893963850595556
What would a Trump presidency mean for current international master’s students on an F1 visa? | How will a Trump presidency affect the students presently in US or planning to study in US? : 0.4490581992952136
What does manipulation mean? | What does manipulation means? : 0.8021629585217567
Why do girls want to be friends with the guy they reject? | How do guys feel after rejecting a girl? : 0.6173692627635123
Why are so many Quora users posting questions that are readily answered on Google? | Why do people ask Quora questions which can be answered easily by Google? : 0.6794045129534761
Which is the best digital marketing institution in banglore? | Which is the best digital marketing institute in Pune? : 0.5332225611879753
Why do rockets look white? | Why are rockets and boosters painted white? : 0.7624609655280314
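
Scoring many pairs like the ones above calls get_information_content once per word, and each call scans the whole corpus list with list.count. One possible speed-up, sketched here under assumptions, is to count every token once with collections.Counter and reuse the counts. The names build_information_content and toy_corpus are hypothetical, and the toy corpus stands in for the Brown corpus tokens; the formula is the same 1 - log(n + 1) / log(N + 1).

```python
from collections import Counter
from math import log


def build_information_content(tokens):
    """Return a function word -> 1 - log(n + 1) / log(N + 1) backed by precomputed counts.

    Assumes `tokens` is a non-empty list of words.
    """
    counts = Counter(tokens)  # a single pass over the corpus
    log_total = log(len(tokens) + 1)

    def information_content(word):
        # Counter returns 0 for words that never occur in the corpus
        return 1 - log(counts[word] + 1) / log_total

    return information_content


# Toy corpus standing in for the Brown corpus tokens
toy_corpus = "the cat sat on the mat and the dog sat too".split()
info = build_information_content(toy_corpus)
for word in ("the", "cat", "unicorn"):
    print(word, round(info(word), 4))
```

Frequent words such as "the" get a low information content, and a word that never appears in the corpus gets the maximum value of 1.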