The algorithm for extracting a news summary is as follows:
1. Score the importance of each word in the text, and keep as keywords those whose scores fall within the thresholds.
2. Score each sentence according to the importance of the words it contains.
3. Rank the sentences by their scores.
4. Take the top-k sentences as the summary.
Preparation:
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest
stopwords = set(stopwords.words('english') + list(punctuation))  # rebinds the module name to the combined stop set
max_cut = 0.9  # drop words whose normalized frequency is above this threshold
min_cut = 0.1  # drop words whose normalized frequency is below this threshold
A note on punctuation and nlargest:
punctuation is a string containing the common English punctuation characters and symbols.
nlargest() efficiently returns the n largest elements of a container; internally it uses a heap.
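For instance, given a dictionary of scores (a small standalone illustration, not part of the summarizer itself):
from heapq import nlargest
scores = {'a': 0.2, 'b': 0.9, 'c': 0.5}
# iterating a dict yields its keys; key=scores.get ranks them by their values
print(nlargest(2, scores, key=scores.get))  # ['b', 'c']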
Step 1:
Compute word importance:
def compute_frequencies(word_sent):
    """Compute a normalized frequency for every non-stopword in the tokenized sentences."""
    freq = defaultdict(int)
    for s in word_sent:
        for word in s:
            if word not in stopwords:
                freq[word] += 1
    # normalize by the highest count, then drop words that are too frequent or too rare
    m = float(max(freq.values()))
    for w in list(freq):  # iterate over a copy of the keys so entries can be deleted
        freq[w] /= m
        if freq[w] >= max_cut or freq[w] <= min_cut:
            del freq[w]
    return freq
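For example, on a small made-up text (a minimal sketch; the sentences are invented purely for illustration) only the mid-frequency words survive the cuts:
word_sent = [word_tokenize(s.lower()) for s in sent_tokenize(
    "The cat sat on the mat. The cat chased the dog. A dog barked at the cat.")]
print(compute_frequencies(word_sent))
# 'cat' (normalized score 1.0) is removed by max_cut; 'dog' keeps 0.67 and the
# remaining content words keep 0.33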
Step 2:
Compute sentence importance:
def summarize(text, n):
    """Return the n most important sentences of text as the summary."""
    sents = sent_tokenize(text)
    assert n <= len(sents)
    word_sent = [word_tokenize(s.lower()) for s in sents]
    freq = compute_frequencies(word_sent)
    # a sentence's score is the sum of the scores of the keywords it contains
    ranking = defaultdict(int)
    for i, words in enumerate(word_sent):
        for w in words:
            if w in freq:
                ranking[i] += freq[w]
    sents_idx = rank(ranking, n)
    return [sents[j] for j in sents_idx]
Step 3:
Ranking:
def rank(ranking, n):
    """Return the indices of the n highest-scoring sentences."""
    return nlargest(n, ranking, key=ranking.get)
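Putting the pieces together, an end-to-end run might look like the sketch below (the sample text and n=2 are invented for illustration; NLTK's punkt and stopwords data must have been downloaded beforehand with nltk.download):
text = ("The market fell sharply on Monday as banks sold off. "
        "Analysts said the banks were hit by rising interest rates. "
        "Several banks reported lower profits last quarter. "
        "The central bank said it was monitoring the market. "
        "Investors expect more volatility this week.")
for sent in summarize(text, 2):
    print(sent)  # prints the two highest-scoring sentences
Note that nlargest returns indices in score order, so the selected sentences may not appear in their original document order.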