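Objectives and Requirements

Objectives

1. Understand the challenges of Chinese word segmentation.

2. Become proficient in Chinese word segmentation methods and the strengths and weaknesses of each.

Requirements

Basic requirements

Implement forward and reverse maximum matching segmentation using the character-reduction matching method, which shortens the candidate string one character at a time.

Evaluate the efficiency of the segmentation algorithms.

Deliverables

The project files and the executable.

The lab report.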

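Principles

1. String-matching segmentation, also called mechanical segmentation, works as follows. Build a lexicon, usually a Chinese dictionary. Given a string of Chinese characters s to segment, take a substring of s according to a scanning rule (forward or reverse), then match that substring against the lexicon entries according to a matching rule. If the match succeeds, the substring is a word, and segmentation continues on the remainder of s until nothing is left. If it fails, the substring is not a word, so a shorter substring of s is taken and matched again.

2. Forward maximum matching: the goal is for the segmentation of each sentence to contain as few words as possible. If a substring were cut off as soon as any dictionary entry matched, the result would not necessarily meet the maximum-matching goal. Matching therefore starts from the longest candidate at the left end of the sentence and shortens it one character at a time until a dictionary entry matches.

3. Reverse maximum matching follows the same idea as forward maximum matching, except that scanning runs in the opposite direction: the sentence is segmented from right to left until the beginning of the sentence is reached.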
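
The two scanning directions described above can disagree on the same input. The following is a minimal sketch of both, not the implementation used later in this report: the function names `forward_max_match` and `reverse_max_match`, the five-word `toy_dictionary`, and the sample sentence are all assumptions made for illustration.

```python
def forward_max_match(sentence, dictionary, max_len):
    """Scan left to right, taking the longest dictionary word each time."""
    words = []
    while sentence:
        # Start from the longest candidate and drop one character at a time.
        for size in range(min(max_len, len(sentence)), 0, -1):
            candidate = sentence[:size]
            # A single character is accepted as a fallback when nothing matches.
            if candidate in dictionary or size == 1:
                words.append(candidate)
                sentence = sentence[size:]
                break
    return words


def reverse_max_match(sentence, dictionary, max_len):
    """Scan right to left, taking the longest dictionary word each time."""
    words = []
    while sentence:
        for size in range(min(max_len, len(sentence)), 0, -1):
            candidate = sentence[-size:]
            if candidate in dictionary or size == 1:
                words.append(candidate)
                sentence = sentence[:-size]
                break
    words.reverse()  # words were collected from the end of the sentence
    return words


# Hypothetical toy dictionary, chosen so that the two directions differ.
toy_dictionary = {"研究", "研究生", "生命", "起源", "的"}
max_len = max(len(word) for word in toy_dictionary)
text = "研究生命的起源"

print("/".join(forward_max_match(text, toy_dictionary, max_len)))  # 研究生/命/的/起源
print("/".join(reverse_max_match(text, toy_dictionary, max_len)))  # 研究/生命/的/起源
```

With this toy dictionary, the forward scan greedily takes "研究生" and leaves "命" stranded as a single character, while the reverse scan recovers "研究/生命". This is one concrete case where the two methods trade off differently.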

Environment (software used)

Python 3.8

Procedure (steps, records, data, analysis)

I. Forward maximum matching

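import time

# Chinese word segmentation with forward maximum matching

words_dic = set()  # a set keeps membership tests fast on a large dictionary

def init():
    # '词库' is a placeholder: replace it with the path to your own dictionary file.
    # Assumption: fields on each line are whitespace-separated and the word is
    # the second field.
    with open('词库', 'r', encoding='gb18030') as file:
        for line in file:
            fields = line.strip().split()
            if len(fields) > 1:
                words_dic.add(fields[1])

# Segment a sentence with forward maximum matching
def cut_words(raw_sentence, words_dic):
    # Length of the longest word in the dictionary
    max_length = max(len(word) for word in words_dic)
    sentence = raw_sentence.strip()
    # Number of characters still to be segmented
    words_length = len(sentence)
    cut_words_list = []
    while words_length > 0:  # keep cutting until nothing is left
        max_cut_length = min(max_length, words_length)
        subsentence = sentence[:max_cut_length]
        while max_cut_length > 0:
            if subsentence in words_dic:
                cut_words_list.append(subsentence)
                break
            elif max_cut_length == 1:
                # Nothing matched: emit the single character as a word
                cut_words_list.append(subsentence)
                break
            else:
                # Drop the last character and try again
                max_cut_length = max_cut_length - 1
                subsentence = subsentence[:max_cut_length]
        # Remove the word just cut from the front of the sentence
        sentence = sentence[max_cut_length:]
        words_length = words_length - max_cut_length
    words = "/".join(cut_words_list)
    return words

def main():
    init()
    while True:
        print("Enter the text to segment (empty line to quit):")
        input_str = input()
        if not input_str:
            break
        # Time only the segmentation call, not the printing
        start = time.perf_counter()
        result = cut_words(input_str, words_dic)
        end = time.perf_counter()
        print("Segmentation result:")
        print(result)
        print("Elapsed time: %.6f s" % (end - start))

if __name__ == "__main__":
    main()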

II. Reverse maximum matching

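import time

# Chinese word segmentation with reverse maximum matching

words_dic = set()  # a set keeps membership tests fast on a large dictionary

def init():
    # '词库' is a placeholder: replace it with the path to your own dictionary file.
    # Assumption: fields on each line are whitespace-separated and the word is
    # the second field.
    with open('词库', 'r', encoding='gb18030') as file:
        for line in file:
            fields = line.strip().split()
            if len(fields) > 1:
                words_dic.add(fields[1])

# Segment a sentence with reverse maximum matching
def cut_words(raw_sentence, words_dic):
    # Length of the longest word in the dictionary
    max_length = max(len(word) for word in words_dic)
    sentence = raw_sentence.strip()
    # Number of characters still to be segmented
    words_length = len(sentence)
    cut_words_list = []
    while words_length > 0:  # keep cutting until nothing is left
        max_cut_length = min(max_length, words_length)
        subsentence = sentence[-max_cut_length:]
        while max_cut_length > 0:
            if subsentence in words_dic:
                cut_words_list.append(subsentence)
                break
            elif max_cut_length == 1:
                # Nothing matched: emit the single character as a word
                cut_words_list.append(subsentence)
                break
            else:
                # Drop the first character and try again
                max_cut_length = max_cut_length - 1
                subsentence = subsentence[-max_cut_length:]
        # Remove the word just cut from the end of the sentence
        sentence = sentence[0:-max_cut_length]
        words_length = words_length - max_cut_length
    # Words were collected right to left, so restore reading order
    cut_words_list.reverse()
    words = "/".join(cut_words_list)
    return words

def main():
    init()
    while True:
        print("Enter the text to segment (empty line to quit):")
        input_str = input()
        if not input_str:
            break
        # Time only the segmentation call, not the printing
        start = time.perf_counter()
        result = cut_words(input_str, words_dic)
        end = time.perf_counter()
        print("Segmentation result:")
        print(result)
        print("Elapsed time: %.6f s" % (end - start))

if __name__ == "__main__":
    main()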

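Reflections

To evaluate the efficiency of the algorithms, I used the time module to measure how long each segmentation call takes.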
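
A single call on a short sentence may finish too quickly for one timestamp pair to say much, so one option is to average over many repetitions. The sketch below is an illustration under assumptions, not a measurement from this experiment: `average_seconds` is a hypothetical helper, the 50,000 entries are made-up placeholders rather than a real dictionary, and `time.perf_counter` is used for its higher resolution. It times the dictionary membership test (the `subsentence in words_dic` step) for a list and for a set.

```python
import time


def average_seconds(func, repeats=200):
    """Return the mean wall-clock time of func() over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) / repeats


# Made-up placeholder entries standing in for a real dictionary.
entries = [f"w{i}" for i in range(50_000)]
as_list = entries
as_set = set(entries)

# A string that is not in the dictionary: the list must compare every entry.
probe = "not-in-dictionary"

list_time = average_seconds(lambda: probe in as_list)
set_time = average_seconds(lambda: probe in as_set)
print(f"list lookup: {list_time:.2e} s, set lookup: {set_time:.2e} s")
```

The absolute numbers depend on the machine, so no expected output is given. The point of the sketch is only that the container used for the dictionary is worth measuring when judging segmentation speed.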

Because I did not know how to open and process word.dict directly, I converted it to a .txt file and processed that instead.

Original link: https://blog.csdn.net/weixin_45791919/article/details/133131871

Chinese Word Segmentation: Implementing Forward and Reverse Maximum Matching
