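Preface
Note: the NLTK tokenizer and tagger used here are built for English and will not segment Chinese text correctly. Installing the NLTK package is not covered here. We use Python 3 and NLTK to tokenize English text and tag each word with its part of speech.
Tokenization
We start with a piece of sample text: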
text = "President Donald Trump used racist language on Sunday to attack progressive Democratic congresswomen," \
" falsely implying they weren't natural-born American citizens. Trump did not name who he was attacking " \
"in Sunday's tirade but earlier this week he referenced New York Rep. Alexandria Ocasio-Cortez when the " \
"President was defending House Speaker Nancy Pelosi."
Import NLTK, tokenize the text, and print the result:
import nltk
tokens = nltk.word_tokenize(text)  # split the text into word and punctuation tokens
print(tokens)
['President', 'Donald', 'Trump', 'used', 'racist', 'language', 'on', 'Sunday', 'to', 'attack', 'progressive', 'Democratic', 'congresswomen', ',', 'falsely', 'implying', 'they', 'were', "n't", 'natural-born', 'American', 'citizens', '.', 'Trump', 'did', 'not', 'name', 'who', 'he', 'was', 'attacking', 'in', 'Sunday', "'s", 'tirade', 'but', 'earlier', 'this', 'week', 'he', 'referenced', 'New', 'York', 'Rep.', 'Alexandria', 'Ocasio-Cortez', 'when', 'the', 'President', 'was', 'defending', 'House', 'Speaker', 'Nancy', 'Pelosi', '.']
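For comparison, here is a minimal sketch of what plain whitespace splitting does with part of the same sentence. It uses only the standard library; `snippet` and `naive_tokens` are names introduced here for illustration and are not part of the original code.

```python
# A fragment of the sample text, copied here so the example is self-contained.
snippet = ("attack progressive Democratic congresswomen, falsely implying "
           "they weren't natural-born American citizens.")

# str.split() with no arguments splits on runs of whitespace only.
naive_tokens = snippet.split()
print(naive_tokens)
```

With whitespace splitting, punctuation stays glued to the neighboring word ('congresswomen,' and 'citizens.') and "weren't" remains a single token. In the word_tokenize output above, the comma and period are separate tokens and the contraction is split into 'were' and "n't".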
Next, remove the punctuation and stop words:
from nltk.corpus import stopwords
english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%', '"', "'"]
no_punctuations = [w for w in tokens if w not in english_punctuations]  # drop punctuation tokens
stop_words = set(stopwords.words('english'))  # NLTK's English stop word list
filtered = [w for w in no_punctuations if w not in stop_words]  # drop stop words
print(filtered)
['President', 'Donald', 'Trump', 'used', 'racist', 'language', 'Sunday', 'attack', 'progressive', 'Democratic', 'congresswomen', 'falsely', 'implying', "n't", 'natural-born', 'American', 'citizens', 'Trump', 'name', 'attacking', 'Sunday', "'s", 'tirade', 'earlier', 'week', 'referenced', 'New', 'York', 'Rep.', 'Alexandria', 'Ocasio-Cortez', 'President', 'defending', 'House', 'Speaker', 'Nancy', 'Pelosi']
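The filter above compares tokens against the stop word list exactly as written, and NLTK's English list is lowercase, so a capitalized stop word such as "The" at the start of a sentence would not be removed. The sketch below shows one possible variant that lowercases each token before the comparison and takes its punctuation from the standard library's `string.punctuation`. `remove_noise`, `SAMPLE_STOP_WORDS`, and `sample` are hypothetical names for this illustration, and the small stop word set is a stand-in for `stopwords.words('english')` so that the example runs without NLTK data.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
SAMPLE_STOP_WORDS = {"the", "he", "was", "when", "in", "to", "on", "not", "did"}

def remove_noise(tokens, stop_words):
    """Drop punctuation-only tokens and stop words, ignoring case for stop words."""
    punctuation = set(string.punctuation)
    kept = []
    for tok in tokens:
        if all(ch in punctuation for ch in tok):
            continue  # token consists only of punctuation, e.g. ',' or '.'
        if tok.lower() in stop_words:
            continue  # 'The' and 'the' are both treated as stop words
        kept.append(tok)
    return kept

sample = ['The', 'President', 'was', 'defending', 'House', 'Speaker', 'Nancy', 'Pelosi', '.']
print(remove_noise(sample, SAMPLE_STOP_WORDS))
# → ['President', 'defending', 'House', 'Speaker', 'Nancy', 'Pelosi']
```

To use this with the real data, pass `tokens` and `set(stopwords.words('english'))` instead of the sample values. Tokens such as "n't" and "'s" still pass through, because they contain letters and are not in this sample stop word set.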
Tag each remaining word with its part of speech:
tagged = nltk.pos_tag(filtered)  # returns a list of (word, tag) pairs
print(tagged)
[('President', 'NNP'), ('Donald', 'NNP'), ('Trump', 'NNP'), ('used', 'VBD'), ('racist', 'JJ'), ('language', 'NN'), ('Sunday', 'NNP'), ('attack', 'RB'), ('progressive', 'JJ'), ('Democratic', 'JJ'), ('congresswomen', 'NNS'), ('falsely', 'RB'), ('implying', 'VBG'), ("n't", 'RB'), ('natural-born', 'JJ'), ('American', 'JJ'), ('citizens', 'NNS'), ('Trump', 'NNP'), ('name', 'NN'), ('attacking', 'VBG'), ('Sunday', 'NNP'), ("'s", 'POS'), ('tirade', 'NN'), ('earlier', 'RBR'), ('week', 'NN'), ('referenced', 'VBD'), ('New', 'NNP'), ('York', 'NNP'), ('Rep.', 'NNP'), ('Alexandria', 'NNP'), ('Ocasio-Cortez', 'NNP'), ('President', 'NNP'), ('defending', 'VBG'), ('House', 'NNP'), ('Speaker', 'NNP'), ('Nancy', 'NNP'), ('Pelosi', 'NNP')]
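The tags follow the Penn Treebank convention, for example NNP for a singular proper noun, VBD for a past-tense verb, and JJ for an adjective. Since the result is a plain list of (word, tag) pairs, it can be processed with ordinary Python. The sketch below groups words by tag; `group_by_tag` and `tagged_sample` are hypothetical names, and the pairs are a small subset copied from the output above so that the example runs without NLTK.

```python
from collections import defaultdict

# A few (word, tag) pairs copied from the tagger output shown above.
tagged_sample = [
    ('President', 'NNP'), ('Donald', 'NNP'), ('Trump', 'NNP'),
    ('used', 'VBD'), ('racist', 'JJ'), ('language', 'NN'),
    ('citizens', 'NNS'), ('referenced', 'VBD'),
]

def group_by_tag(tagged):
    """Map each tag to the list of words carrying it, in order of appearance."""
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[tag].append(word)
    return dict(groups)

groups = group_by_tag(tagged_sample)
print(groups['NNP'])  # → ['President', 'Donald', 'Trump']
print(groups['VBD'])  # → ['used', 'referenced']
```

Note that 'attack' is tagged RB (adverb) in the output above, although it is used as a verb in the original sentence. A likely cause is that stop words such as 'to' were removed before tagging, and the tagger relies on neighboring words for context. Tagging the full token list first and filtering afterwards may give more reliable tags.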