Preface

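Note: the NLTK tokenizer and tagger used here target English and will not work properly on Chinese text. Installing the NLTK package is not covered here. We use Python 3 and NLTK to tokenize English text and tag each word's part of speech.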

Tokenization

We start with a piece of text:

text = "President Donald Trump used racist language on Sunday to attack progressive Democratic congresswomen," \
       " falsely implying they weren't natural-born American citizens. Trump did not name who he was attacking " \
       "in Sunday's tirade but earlier this week he referenced New York Rep. Alexandria Ocasio-Cortez when the " \
       "President was defending House Speaker Nancy Pelosi."

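Import the NLTK package, tokenize the text, and print the result:

import nltk

tokens = nltk.word_tokenize(text)  # split the text into word and punctuation tokens
print(tokens)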

['President', 'Donald', 'Trump', 'used', 'racist', 'language', 'on', 'Sunday', 'to', 'attack', 'progressive', 'Democratic', 'congresswomen', ',', 'falsely', 'implying', 'they', 'were', "n't", 'natural-born', 'American', 'citizens', '.', 'Trump', 'did', 'not', 'name', 'who', 'he', 'was', 'attacking', 'in', 'Sunday', "'s", 'tirade', 'but', 'earlier', 'this', 'week', 'he', 'referenced', 'New', 'York', 'Rep.', 'Alexandria', 'Ocasio-Cortez', 'when', 'the', 'President', 'was', 'defending', 'House', 'Speaker', 'Nancy', 'Pelosi', '.']

Next, remove the stop words and punctuation:

from nltk.corpus import stopwords

english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%', '"', "'"]
no_punctuations = [w for w in tokens if w not in english_punctuations]  # remove punctuation tokens
stop_words = set(stopwords.words('english'))  # load the English stop-word list
filtered = [w for w in no_punctuations if w not in stop_words]  # remove stop words (case-sensitive match)
print(filtered)

['President', 'Donald', 'Trump', 'used', 'racist', 'language', 'Sunday', 'attack', 'progressive', 'Democratic', 'congresswomen', 'falsely', 'implying', "n't", 'natural-born', 'American', 'citizens', 'Trump', 'name', 'attacking', 'Sunday', "'s", 'tirade', 'earlier', 'week', 'referenced', 'New', 'York', 'Rep.', 'Alexandria', 'Ocasio-Cortez', 'President', 'defending', 'House', 'Speaker', 'Nancy', 'Pelosi']
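
The filtering step above compares tokens exactly as written, so a capitalized stop word such as 'The' would not be removed, and the hand-written punctuation list does not cover every symbol. The following is a minimal sketch of one possible variation, not part of the original walkthrough: it uses string.punctuation from the standard library and lower-cases each token before the stop-word check. The names filter_tokens and SAMPLE_STOP_WORDS are hypothetical, and the small set is only a stand-in for stopwords.words('english') so that the snippet runs without NLTK data.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')); the real list is much longer.
SAMPLE_STOP_WORDS = {"on", "to", "they", "were", "did", "not", "the", "he", "was"}


def filter_tokens(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Drop tokens that are pure punctuation or (case-insensitively) stop words."""
    kept = []
    for tok in tokens:
        # Skip tokens made up entirely of punctuation characters, e.g. ',' or '--'.
        if all(ch in string.punctuation for ch in tok):
            continue
        # Lower-case before the lookup so that 'The' matches the stop word 'the'.
        if tok.lower() in stop_words:
            continue
        kept.append(tok)
    return kept


sample = ['The', 'President', 'used', 'racist', 'language', 'on', 'Sunday',
          ',', 'they', 'were', "n't", '.']
print(filter_tokens(sample))
```

In a real run the stop_words argument would be the NLTK stop-word set. Note that the contraction fragment "n't" survives this filter just as it does in the output above, because it is neither pure punctuation nor in the stop-word list.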

Tag each word with its part of speech:

tagged = nltk.pos_tag(filtered)  # assign a part-of-speech tag to each token
print(tagged)

[('President', 'NNP'), ('Donald', 'NNP'), ('Trump', 'NNP'), ('used', 'VBD'), ('racist', 'JJ'), ('language', 'NN'), ('Sunday', 'NNP'), ('attack', 'RB'), ('progressive', 'JJ'), ('Democratic', 'JJ'), ('congresswomen', 'NNS'), ('falsely', 'RB'), ('implying', 'VBG'), ("n't", 'RB'), ('natural-born', 'JJ'), ('American', 'JJ'), ('citizens', 'NNS'), ('Trump', 'NNP'), ('name', 'NN'), ('attacking', 'VBG'), ('Sunday', 'NNP'), ("'s", 'POS'), ('tirade', 'NN'), ('earlier', 'RBR'), ('week', 'NN'), ('referenced', 'VBD'), ('New', 'NNP'), ('York', 'NNP'), ('Rep.', 'NNP'), ('Alexandria', 'NNP'), ('Ocasio-Cortez', 'NNP'), ('President', 'NNP'), ('defending', 'VBG'), ('House', 'NNP'), ('Speaker', 'NNP'), ('Nancy', 'NNP'), ('Pelosi', 'NNP')]
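
The tags in this output follow the Penn Treebank convention, for example NNP for a proper noun, VBD for a past-tense verb, JJ for an adjective, and NNS for a plural noun. Because the tagger here sees the filtered list rather than the original sentences, some tags may differ from what full-sentence context would produce ('attack' tagged as RB is a likely example). As a rough sketch of how such (word, tag) pairs can be used afterwards, the snippet below counts tags and collects words by tag prefix. It works on a short excerpt copied from the output above, and the helper names count_tags and words_with_tag are hypothetical.

```python
from collections import Counter

# A short excerpt of the tagged output shown above.
tagged_sample = [('President', 'NNP'), ('Donald', 'NNP'), ('Trump', 'NNP'),
                 ('used', 'VBD'), ('racist', 'JJ'), ('language', 'NN'),
                 ('citizens', 'NNS')]


def count_tags(tagged):
    """Count how often each part-of-speech tag occurs."""
    return Counter(tag for _, tag in tagged)


def words_with_tag(tagged, prefix):
    """Return the words whose tag starts with the given prefix, in order."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


print(count_tags(tagged_sample))
print(words_with_tag(tagged_sample, 'NNP'))  # proper nouns only
print(words_with_tag(tagged_sample, 'NN'))   # all noun tags: NN, NNS, NNP
```

Matching on a prefix is a convenience here: 'NN' picks up every noun tag, while 'NNP' narrows the result to proper nouns.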
