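Treat each article as a vector in which every dimension records whether the corresponding word is present (0 or 1, not the number of occurrences). Retrieval then only requires Boolean operations on these vectors.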
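As a minimal sketch of this idea, each document below becomes a list of 0/1 flags over a shared vocabulary, and an AND query keeps the documents whose flags cover every query word. The three toy documents and all names here are made up for illustration.

```python
# Toy illustration of Boolean retrieval; the documents are invented for this sketch.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat",
    "d3": "a cat and a dog",
}

# Shared vocabulary: one dimension per distinct word, in sorted order.
vocab = sorted({word for text in docs.values() for word in text.split()})

def to_vector(words):
    """Return a 0/1 list: 1 if the vocabulary word is present, regardless of count."""
    present = set(words)
    return [1 if word in present else 0 for word in vocab]

doc_vectors = {doc_id: to_vector(text.split()) for doc_id, text in docs.items()}

def boolean_and_search(query):
    """Return ids of documents that contain every word in the query."""
    q = to_vector(query.split())
    return [
        doc_id
        for doc_id, d in doc_vectors.items()
        if all((qi & di) == qi for qi, di in zip(q, d))
    ]

print(boolean_and_search("cat sat"))  # ['d1']
print(boolean_and_search("dog"))      # ['d2', 'd3']
```

Note that "the" appears twice in d1 but still contributes a single 1, which is the difference between a presence vector and a count vector.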
0. Gather some documents to search
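1. Build the vocabulary:
The vocabulary handling here is deliberately simple: each word is mapped to a bit position in the binary vector.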
import json

# english_punctuations is assumed to be defined elsewhere as a collection of
# punctuation characters, e.g. [',', '.', '!', ...].

def build_word_to_index_dict():
    index = 0
    word_to_index_dict = {}
    punct_chars = ''.join(english_punctuations)
    with open('./data/bytecup.corpus.validation_set.txt', "r", encoding='utf-8') as f:
        for line in f:
            content_id_title = json.loads(line)
            content = content_id_title['content'].lower()
            # id = content_id_title['id']
            # title = content_id_title['title'].lower()
            for word in content.split():
                # Strip punctuation from both ends of the word
                word = word.strip(punct_chars)
                # Skip empty strings and words already in the vocabulary
                if not word or word in word_to_index_dict:
                    continue
                word_to_index_dict[word] = index
                index += 1
    # Save the vocabulary so that later steps can reload it
    with open('./dict.txt', "w", encoding='utf-8') as f:
        f.write(str(word_to_index_dict))
    return word_to_index_dict
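The tokenization step can be tried on its own with a small in-memory sketch. It assumes `string.punctuation` from the standard library as a stand-in for `english_punctuations`, whose definition is not shown in this post, and the helper names `normalize` and `build_vocab` are hypothetical.

```python
import string

def normalize(token):
    """Lower-case a token and strip punctuation from both ends."""
    # Assumption: string.punctuation stands in for english_punctuations.
    return token.lower().strip(string.punctuation)

def build_vocab(texts):
    """Assign each distinct normalized word the next free bit position."""
    word_to_index = {}
    for text in texts:
        for token in text.split():
            word = normalize(token)
            # Tokens made only of punctuation normalize to '' and are skipped.
            if word and word not in word_to_index:
                word_to_index[word] = len(word_to_index)
    return word_to_index

vocab = build_vocab(['Hello, world!', '"Hello" again -- world.'])
print(vocab)  # {'hello': 0, 'world': 1, 'again': 2}
```

Only leading and trailing punctuation is removed, so a word such as "don't" keeps its apostrophe.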
2. Create vectors from the vocabulary
This step uses the bitarray library.
from bitarray import bitarray

def creat_vector(words, word_to_index_dict):
    # One bit per vocabulary word, all initially 0
    num_words = len(word_to_index_dict)
    vector = bitarray(num_words)
    vector.setall(False)
    for word in words:
        # Words outside the vocabulary are ignored
        if word in word_to_index_dict:
            index = word_to_index_dict[word]
            vector[index] = True
    return vector
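For readers who want to experiment without installing bitarray, the same word-to-bit mapping can be sketched with a plain Python integer used as a bit set. This is a swapped-in technique for illustration only, not what the code above uses, and `make_mask` is a hypothetical name.

```python
def make_mask(words, word_to_index):
    """Set bit i of an integer for every known word whose index is i."""
    mask = 0
    for word in words:
        index = word_to_index.get(word)
        if index is not None:      # words outside the vocabulary are ignored
            mask |= 1 << index
    return mask

word_to_index = {"cat": 0, "dog": 1, "sat": 2}
mask = make_mask(["sat", "cat", "cat", "bird"], word_to_index)
print(bin(mask))  # 0b101
```

Repeated words set the same bit again, so the result records presence only, just like the bitarray version.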
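3. Map every document to be searched to a vector
Each article has an id. The document vectors are indexed by that id and collected into a dictionary.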
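def creat_id_to_vector_dict(word_to_index_dict):
    id_to_vector = {}
    punct_chars = ''.join(english_punctuations)
    with open('./data/bytecup.corpus.validation_set.txt', "r", encoding='utf-8') as f:
        for line in f:
            content_id_title = json.loads(line)
            content = content_id_title['content'].lower()
            doc_id = content_id_title['id']
            # title = content_id_title['title'].lower()
            # Strip punctuation the same way as when building the vocabulary,
            # so that a token such as "word," maps to the entry for "word"
            words = [word.strip(punct_chars) for word in content.split()]
            vector = creat_vector(words, word_to_index_dict)
            id_to_vector[doc_id] = vector
    # Save the mapping; each vector is written in its repr form, bitarray('0101...')
    with open('./id_to_vector_dict.txt', "w", encoding='utf-8') as f:
        f.write(str(id_to_vector))
    return id_to_vector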
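Writing `str(id_to_vector)` works, but reading it back requires `eval`. One possible alternative, sketched here under assumptions, is to store each vector as a string of '0' and '1' characters in a JSON file. The function names and file name are hypothetical, and plain strings stand in for the vectors so that the sketch runs without bitarray.

```python
import json
import os
import tempfile

def save_vectors(id_to_bits, path):
    """Write a {doc_id: '0101...'} mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(id_to_bits, f)

def load_vectors(path):
    """Read the mapping back; JSON object keys always come back as strings."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# With bitarray, each value would come from vector.to01() and be
# turned back into a vector with bitarray(bits).
id_to_bits = {"doc-1": "10110", "doc-2": "00101"}
path = os.path.join(tempfile.mkdtemp(), "id_to_vector.json")
save_vectors(id_to_bits, path)
restored = load_vectors(path)
print(restored == id_to_bits)  # True
```

If the document ids are integers rather than strings, they would need to be converted back after loading, because JSON stores object keys as strings.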
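4. Map document ids to document contents
The vector operations only yield document ids, so this dictionary is needed to look up the content for each matching id.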
def build_id_to_doc_dict():
    id_to_doc_dict = {}
    with open('./data/bytecup.corpus.validation_set.txt', "r", encoding='utf-8') as f:
        for line in f:
            content_id_title = json.loads(line)
            # The content is stored lower-cased, so results are printed in lower case
            content = content_id_title['content'].lower()
            doc_id = content_id_title['id']
            id_to_doc_dict[doc_id] = content
    return id_to_doc_dict
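5. Putting it all together
Preparation: build the vocabulary and the dictionary of document vectors.
main first builds the id_to_doc dictionary and loads the two dictionaries created above. It then waits for user input, splits the input into words, builds a query vector, ANDs it with every document vector, and prints each document for which the result equals the query vector.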
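import ast

def main():
    id_to_doc = build_id_to_doc_dict()
    with open('./dict.txt', "r", encoding='utf-8') as f:
        # The file holds a plain dict literal, so literal_eval is sufficient
        word_to_index_dict = ast.literal_eval(f.read())
    # creat_id_to_vector_dict(word_to_index_dict)
    with open('./id_to_vector_dict.txt', "r", encoding='utf-8') as f:
        # eval rebuilds the bitarray('...') values, so bitarray must be
        # imported and the file must come from a trusted source
        id_to_vector_dict = eval(f.read())
    zeros_vector = creat_vector([], word_to_index_dict)
    while True:
        i = 0
        search_str = input("please input the words you want to search for:")
        # search_str = "when"
        # Lower-case the query to match the lower-cased documents
        words = search_str.lower().split()
        search_vector = creat_vector(words, word_to_index_dict)
        # No query word is in the vocabulary: ask again
        if search_vector == zeros_vector:
            continue
        for doc_id, doc_vector in id_to_vector_dict.items():
            # Show at most 3 matching documents
            if i >= 3:
                break
            # A document matches when it contains every query word
            ret = search_vector & doc_vector
            if ret == search_vector:
                print(id_to_doc[doc_id])
                print('=' * 400)
                i += 1

if __name__ == '__main__':
    main()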
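The matching loop can also be pulled out into a small function, which makes it easier to test separately from the input prompt. The sketch below is one possible shape, not the code used above: `search` and its `limit` parameter are hypothetical names, and plain integers stand in for the bit vectors (bitarray objects support the same `&` and `==` operations when their lengths are equal).

```python
from itertools import islice

def search(query_vector, id_to_vector, limit=3):
    """Return up to `limit` ids whose vector contains every bit set in the query."""
    matches = (
        doc_id
        for doc_id, doc_vector in id_to_vector.items()
        if (query_vector & doc_vector) == query_vector
    )
    # islice stops scanning as soon as enough matches have been found
    return list(islice(matches, limit))

# Integers stand in for bit vectors here; bit i means "word i is present".
id_to_vector = {"a": 0b0111, "b": 0b0101, "c": 0b0010, "d": 0b1111, "e": 0b0111}
print(search(0b0101, id_to_vector))           # ['a', 'b', 'd']
print(search(0b0101, id_to_vector, limit=5))  # ['a', 'b', 'd', 'e']
print(search(0b1000, id_to_vector))           # ['d']
```

Results come back in dictionary order, not ranked by relevance, which matches the behaviour of the loop in main.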