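Code source
From Professor Song Tian's MOOC at Beijing Institute of Technology: find the 15 characters who appear most often in Romance of the Three Kingdoms.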
#CalThreeKingdomsV2.py
import jieba

# Frequent words that are not character names
excludes = {"将军","却说","荆州","二人","不可","不能","如此","商议","如何","军士"}
with open("threekingdoms.txt", "r", encoding='utf-8') as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        # Skip single-character tokens (this also drops punctuation and spaces)
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1
for word in excludes:
    # pop with a default avoids a KeyError if a word never occurred
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))
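Approach
Tokenize the text with the jieba library. Punctuation and spaces come out as single-character tokens, which the len(word) == 1 check skips, so they need no separate cleanup.
Merge the different names used for one person into a single name.
Record how often each word appears in a dictionary of the form {word: count}.
Delete words that are obviously not names from the counts dictionary. (This takes several runs: run the code, add the non-name words that surface to the excludes set, run again, and keep expanding the set.)
Turn the counts dictionary into key-value pairs with .items(), then convert that into a list with list(). See below: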
>>> a={"2d":"哈",23:"s"}
>>> print(a)
{'2d': '哈', 23: 's'}
>>> c=a.items()
>>> print(c)
dict_items([('2d', '哈'), (23, 's')])
>>> list(c)
[('2d', '哈'), (23, 's')]
>>> print(c)
dict_items([('2d', '哈'), (23, 's')])
>>> print(list(c))
[('2d', '哈'), (23, 's')]
>>>
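Sort with the list's .sort method. The sort key is given as an anonymous lambda function. I'm still a bit fuzzy on this part, but it uses the second element x[1] of each (word, count) tuple, i.e. the count, as the sort key, and reverse=True orders from largest to smallest. The sort happens in place, so items itself ends up reordered.
The for loop runs 15 times and prints the top 15.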
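The steps above can be sketched without jieba on a tiny hand-made token list, which makes the alias merging and the lambda sort key easier to see. This is only an illustration under assumptions: the ALIASES table, the count_names helper, and the sample tokens are made up for the demo and are not part of the course code.

```python
from collections import Counter

# Hypothetical alias table: each alternate name maps to one canonical name.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}

def count_names(words, excludes=()):
    """Count tokens, merging aliases and skipping single-character tokens."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue
        counts[ALIASES.get(word, word)] += 1
    for word in excludes:
        counts.pop(word, None)  # no KeyError if the word never occurred
    return counts

# Stand-in for the output of jieba.lcut(txt)
words = ["玄德", "曰", "孔明", "诸葛亮", "将军", "玄德曰", "曹操", ",", "孔明"]
counts = count_names(words, excludes={"将军"})

# key receives one (word, count) tuple at a time and returns the count, x[1]
items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for word, count in items[:3]:
    print("{0:<10}{1:>5}".format(word, count))
```

The lambda is just a short function: for the tuple ("孔明", 3) it returns 3, and the sort compares those returned numbers instead of the tuples themselves.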
Also
The teacher also showed how to rank the words in Shakespeare's Hamlet by frequency.
#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")   # replace special characters in the text with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))
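For comparison, here is a compact variant of the same word count built on the standard library. It is a sketch rather than the course's method: top_words is a name chosen for this example, and string.punctuation is swapped in for the hand-typed symbol list (unlike that list, it also contains the apostrophe, so contractions get split).

```python
import string
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent lowercase words in text."""
    # Map every ASCII punctuation character to a space in a single pass.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return Counter(words).most_common(n)

sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

Counter.most_common does the sorting that items.sort(key=lambda x:x[1], reverse=True) does by hand.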
So
Could I count the characters in Hamlet and in Romance of the Three Kingdoms, excluding symbols and spaces?
def getText():
    txt = open("hamlet.txt", "r").read()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.strip(ch)   # meant to drop special characters, but strip() only trims the two ends of the string
    return txt

l=getText()
count=len(l)
print("Total letters in the text: {}".format(count))
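The teacher's Hamlet code loops over the same symbol string (although it calls replace, not strip), so I'll set this one aside for now.
But...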
#CalThreeKingdomsV2.py
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
for ch in '!"#$%&()*+,-。/:;<=>?@[\\]^_‘{|}~':
    txt = txt.strip(ch)   # meant to drop special characters, but strip() only trims the two ends of the string
count=len(txt)
print(count)
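Output: 602415
I added a space to that long string of symbols, expecting the count to drop, but it did not budge. Then I deleted a few symbols, expecting the count to rise, and again nothing changed. OK, something is wrong.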
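The most likely reason the count never moves is that str.strip(ch) only removes ch from the very beginning and end of the string, never from the middle. A small sketch shows the difference from str.replace; the helper names strip_each and replace_each and the PUNCT string are assumptions made for this demo.

```python
# Symbols to drop; a shortened, hypothetical version of the list used above.
PUNCT = '!"#$%&()*+,-./:;<=>?@[\\]^_{|}~ '

def strip_each(text, chars=PUNCT):
    """Mirror the loop above: strip() trims only the two ends of the string."""
    for ch in chars:
        text = text.strip(ch)
    return text

def replace_each(text, chars=PUNCT):
    """Delete every occurrence of each character, wherever it appears."""
    for ch in chars:
        text = text.replace(ch, "")
    return text

sample = "a, b, c"
print(len(sample), len(strip_each(sample)), len(replace_each(sample)))
```

Because the sample starts and ends with a letter, strip_each leaves it untouched no matter which symbols are in the list, which matches the behavior described above.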
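Try again.
#CalThreeKingdomsV2.py
import re
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
# Character class of symbols to delete.
# Note: inside [...] the sequence ,-。 is read as a range, not as three separate characters.
clear='[!"#$%&()*+,-。/:;<=>?@[\\] ^_‘{|}~]'
cleaned=re.sub(clear,"",txt)   # delete every character that matches the class
count=len(cleaned)
print(count)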
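Output: 555212
Anyway, this looks plausible. It uses the re library (which I don't understand at all yet). I also found that you have to be careful about how you call it: the pattern must be a string.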
>>> str=re.sub('[a]',"d",txt)
>>> print(str)
dddd
Without the quotes, it fails:
>>> str=re.sub([a],"d",txt)
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    str=re.sub([a],"d",txt)
  File "D:\download\python\lib\re.py", line 210, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "D:\download\python\lib\re.py", line 294, in _compile
    return _cache[type(pattern), pattern, flags]
TypeError: unhashable type: 'list'
Double quotes work too:
>>> txt="adda"
>>> str=re.sub("[a]","d",txt)
>>> print(str)
dddd
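One more thing worth checking in the symbol pattern used for the Three Kingdoms count: inside square brackets a hyphen between two characters defines a range, so ,-。 matches every character from ',' (U+002C) up to '。' (U+3002), including ASCII letters and digits. The sketch below compares that pattern with one built through re.escape, which makes every character literal. The names PUNCT, loose, and strict are assumptions for this demo, and neither pattern covers full-width symbols such as ',' unless they are added to the list.

```python
import re

# The symbols meant to be removed, written as a plain string.
PUNCT = '!"#$%&()*+,-。/:;<=>?@[\\]^_‘{|}~ '

# Pattern as written in the count above: ",-。" is parsed as a range.
loose = re.compile('[!"#$%&()*+,-。/:;<=>?@[\\] ^_‘{|}~]')

# re.escape backslash-escapes the special characters, so "-" stays literal.
strict = re.compile("[" + re.escape(PUNCT) + "]")

sample = "abc123,曹操。"
print(loose.sub("", sample))   # letters and digits fall inside the range
print(strict.sub("", sample))  # only the listed symbols are removed
```

Chinese characters sit above U+3002, so the loose pattern leaves them alone, but it also removes anything else in the range, so the total it gives may differ from a count that deletes only the listed symbols.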