Overview
- NLTK is a natural language processing toolkit and the most commonly used Python library in the NLP field.
https://yiyibooks.cn/yiyi/nltk_python/index.html
NLTK Installation
Step 1: Install nltk with pip
pip install nltk
Step 2: Run the python command
Step 3: Enter import nltk
Step 4: Enter nltk.download()
A download dialog will then pop up; click Download to download everything.
-- Running nltk.download() may fail with an SSL error; executing the following code to disable global SSL certificate verification works around it:
>>> import ssl
>>> ssl._create_default_https_context = ssl._create_unverified_context
>>> nltk.download()
Getting Started with NLTK
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>
>>>
Searching Text
- The concordance() function shows every occurrence of a given word, together with some context.
>>> text1.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
>>>
- similar() finds other words that appear in similar contexts.
>>> text1.similar("monstrous")
mean part maddens doleful gamesome subtly uncommon careful untoward
exasperate loving passing mouldy christian few true mystifying
imperial modifies contemptible
>>> text2.similar("monstrous")
very heartily so exceedingly remarkably as vast a great amazingly
extremely good sweet
>>>
- The common_contexts() function allows us to examine the contexts that are shared by two or more words.
>>> text2.common_contexts(["monstrous", "very"])
a_pretty is_pretty am_glad be_glad a_lucky
>>>
Simple Statistics
- Frequency distributions
>>> fdist1 = FreqDist(text1)
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
>>> fdist1.most_common(50)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982),
("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124),
('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632),
('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280),
('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103),
('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005),
('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767),
('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680),
('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
>>> fdist1['whale']
906
>>> fdist1.plot(50, cumulative=True)
When FreqDist is first invoked, we pass the name of the text as an argument, and we can see the total number of words ("outcomes") counted in Moby Dick: 260,819. The expression most_common(50) gives the 50 most frequently occurring word types in the text.
- Words in the chat corpus (text5) that are longer than 7 characters and occur more than 7 times:
>>> fdist5 = FreqDist(text5)
>>> sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',
'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football',
'innocent', 'listening', 'remember', 'seriously', 'something', 'together',
'tomorrow', 'watching']
>>>
- Collocations and bigrams
A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not. A characteristic of collocations is that their words cannot be substituted with similar words.
To get collocations, we start off by extracting the word pairs in a text, also known as bigrams. This is easily done with the bigrams() function:
>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
Here we see that the word pair than-done is a bigram, written in Python as ('than', 'done'). Now, collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequencies of the individual words. The collocations() function does this for us.
>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
>>> text8.collocations()
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build
>>>
Functions Defined for NLTK's Frequency Distributions
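The FreqDist class defines a number of commonly used methods. A minimal sketch of the most frequent ones (using text1 from nltk.book purely for illustration):
from nltk import FreqDist
from nltk.book import text1

fdist = FreqDist(text1)          # build a frequency distribution from a list of tokens
print(fdist['whale'])            # count of a given sample
print(fdist.freq('whale'))       # relative frequency of a given sample
print(fdist.N())                 # total number of samples counted
print(fdist.max())               # the sample with the greatest count
print(fdist.most_common(10))     # the 10 most common samples and their counts
print(fdist.hapaxes()[:10])      # samples that occur only once
fdist.tabulate(10)               # tabulate the 10 most common samples
fdist.plot(10, cumulative=True)  # cumulative frequency plot of the 10 most common samples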
Handling Stopwords
- Some words occur extremely frequently, for example a, the and he in English, or 我, 它 and 个 in Chinese, and appear on almost every page. If a search engine indexed them as keywords, every website would be matched with no discriminating power, so such words are usually removed outright and are not treated as keywords.
from nltk.corpus import stopwords
words = stopwords.words('english')
print(words)
'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
import nltk
from nltk.corpus import stopwords

with open('news.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# filter out the stopwords
stopwords = stopwords.words('english')
tokens = []
result = data.split()
for res in result:
    if res not in stopwords:
        tokens.append(res)

freq = nltk.FreqDist(tokens)
print(freq.most_common(100))
freq.plot(10)
[('IPO', 10), ('Chinese', 9), ('said', 7), ('US', 7), ('UP', 6), ('Fintech', 6), ('said.', 5), ('market', 5), ('company', 4), ('investors', 4), ('brokerage', 4), ('trading', 4), ('"We', 4), ('percent', 4), ('American', 3), ('technology', 3), ('shares', 3), ('online', 3), ('The', 3), ('trade', 3), ('innovative', 3), ('global', 3), ('public', 3), ('McCooey', 3), ('recent', 3), ('easier', 2), ('stocks', 2), ('Beijing-based', 2), ('startup', 2), ('Nasdaq', 2), ('CEO', 2), ('Wu', 2), ('services', 2), ('Asia', 2), ('closed', 2), ('offering', 2), ('13', 2), ('user', 2), ('experience', 2), ('become', 2), ('one', 2), ('take', 2), ('slow', 2), ('rose', 2), ('year', 2), ('IPOs', 2), ('Futu', 2), ('activity', 2), ('government', 2), ('rest', 2), ('A', 1), ('seeks', 1), ('make', 1), ('buy', 1), ('rely', 1), ('latest', 1), ('so.', 1), ('Holding', 1), ('Ltd,', 1), ('whose', 1), ('solid', 1), ('debut', 1), ('Stock', 1), ('Market', 1), ('Wednesday,', 1), ('internet-based', 1), ('developed', 1), ('securities', 1), ('system', 1), ('investors.', 1), ('launched', 1), ('enable', 1), ('clients', 1), ('easily', 1), ('overseas', 1), ('foreign', 1), ('exchanges,', 1), ('especially', 1), ('US,', 1), ('according', 1), ('Tianhua', 1), ('Wu.', 1), ('also', 1), ('company,"', 1), ('"By', 1), ('introducing', 1), ('enhancing', 1), ('capabilities,', 1), ('platform', 1), ('makes', 1), ('enables', 1), ('users', 1), ('seamlessly', 1), ('connect', 1), ('markets."', 1), ('Shares', 1), ('Fintech,', 1), ('known', 1), ('Tiger', 1), ('Brokers,', 1)]
Accessing Text Corpora and Lexical Resources
The Gutenberg Corpus
NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books.
- Import: from nltk.corpus import gutenberg
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')
>>> len(emma)
192427
- Earlier we showed how to concordance a text such as text1 with text1.concordance(). However, that assumed you were working with one of the nine texts obtained by from nltk.book import *. Now that you have started examining data from nltk.corpus, as in the previous example, you have to employ the following pair of statements to perform concordancing and other tasks:
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprize")
- The raw() function gives us the contents of the file without any linguistic processing.
- The sents() function divides the text into sentences, where each sentence is a list of words.
>>> raw = gutenberg.raw("burgess-busterbrown.txt")
>>> raw[1:20]
'The Adventures of B'
>>> words = gutenberg.words("burgess-busterbrown.txt")
>>> words[1:20]
['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.',
'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster',
'Bear']
>>> sents = gutenberg.sents("burgess-busterbrown.txt")
>>> sents[1:20]
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as',
'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves', 'and', 'watched',
'the', 'first', 'early', 'morning', 'sunbeams', 'creeping', 'through', ...], ...]
Web and Chat Text
NLTK's collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, wine reviews, and more.
- Import: from nltk.corpus import webtext
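A small sketch of browsing this collection, printing each file id together with the start of its raw text:
from nltk.corpus import webtext

for fileid in webtext.fileids():
    # show the file id and the first 60 characters of the raw text
    print(fileid, webtext.raw(fileid)[:60], '...')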
The Brown Corpus
The Brown Corpus was the first million-word electronic corpus of English, created at Brown University in 1961. It contains text from 500 sources, categorized by genre, such as news, editorial, and so on.
- Import: from nltk.corpus import brown
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]
- The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics. Let's compare genres in their usage of modal verbs. The first step is to produce the counts for a particular genre. Remember to import nltk before running the following:
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist(w.lower() for w in news_text)
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
... print(m + ':', fdist[m], end=' ')
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
- Next, we obtain counts for each genre of interest. We use NLTK's support for conditional frequency distributions; ConditionalFreqDist() takes a list of pairs as input.
>>> cfd = nltk.ConditionalFreqDist(
... (genre, word)
... for genre in brown.categories()
... for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
can could may might must will
news 93 86 66 38 50 389
religion 82 59 78 12 54 71
hobbies 268 58 131 22 83 264
science_fiction 16 49 4 12 8 16
romance 74 193 11 51 45 43
humor 16 30 8 8 9 13
The Reuters Corpus
The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents are classified into 90 topics and grouped into two sets, "training" and "test": fileids beginning with test/ belong to the test set, and those beginning with training/ belong to the training set.
- Import: from nltk.corpus import reuters
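A small sketch of exploring the corpus; the fileid 'training/9865' is just one example document:
from nltk.corpus import reuters

print(len(reuters.fileids()))               # 10788 documents
print(reuters.categories()[:10])            # some of the 90 topics
print(reuters.fileids('barley')[:3])        # documents labelled with a given topic
print(reuters.categories('training/9865'))  # topics covered by a single document
print(reuters.words('training/9865')[:14])  # the words of that document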
The Inaugural Address Corpus
- Import: from nltk.corpus import inaugural
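A small sketch; note that the first four characters of each file id give the year of the address:
from nltk.corpus import inaugural

print(inaugural.fileids()[:3])                             # e.g. ['1789-Washington.txt', '1793-Washington.txt', ...]
print([fileid[:4] for fileid in inaugural.fileids()][:5])  # extract the years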
The Structure of Text Corpora
Basic corpus functionality defined in NLTK: more documentation can be found using help(nltk.corpus.reader) and by reading the online corpus HOWTOs at http://nltk.org/howto.
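A minimal sketch of the reader methods common to most NLTK corpora, using the Gutenberg corpus as an example:
from nltk.corpus import gutenberg

fileid = 'austen-emma.txt'
print(gutenberg.fileids()[:3])       # the files that make up the corpus
print(gutenberg.raw(fileid)[:50])    # the raw text, with no linguistic processing
print(gutenberg.words(fileid)[:8])   # the text tokenized into words
print(gutenberg.sents(fileid)[1])    # the text split into sentences (each a list of words)
print(gutenberg.abspath(fileid))     # location of the file on disk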
Loading Your Own Corpus
If you have your own collection of text files that you would like to access using the methods discussed above, you can easily load them with the help of NLTK's PlaintextCorpusReader. Check the location of your files on your file system; in the following example, we assume your files are in the /usr/share/dict directory. Whatever the location, set the value of the variable corpus_root to that directory. The second parameter of the PlaintextCorpusReader initializer can be a list of fileids, such as ['a.txt', 'test/b.txt'], or a pattern that matches all fileids, such as '[abc]/.*\.txt'.
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
Conditional Frequency Distributions
A conditional frequency distribution is a collection of frequency distributions, each one for a different "condition". FreqDist(mylist) counts the number of occurrences of each item in the list. Conditional frequency distributions are implemented with the ConditionalFreqDist data type: FreqDist() takes a simple list as input, while ConditionalFreqDist() takes a list of pairs.
Here we use the news and romance genres of the Brown Corpus. For each genre, we loop over every word in the genre, producing pairs consisting of the genre and the word:
>>> genre_word = [
... (genre, word)
... for genre in ['news', 'romance']
... for word in brown.words(categories=genre)]
>>> len(genre_word)
170576
- So, as we can see in the code below, the first few pairs of the list genre_word will be of the form ('news', word), while the last few will be of the form ('romance', word).
>>> genre_word[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')] # [_start-genre]
>>> genre_word[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')] # [_end-genre]
- We can now use this list of pairs to create a ConditionalFreqDist and save it in a variable cfd. As usual, we can type the name of the variable to inspect it and verify that it has two conditions:
>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['news', 'romance'] # [_conditions-cfd]
- Let's access the two conditions; each of them is just a frequency distribution:
>>> print(cfd['news'])
<FreqDist with 14394 samples and 100554 outcomes>
>>> print(cfd['romance'])
<FreqDist with 8452 samples and 70022 outcomes>
>>> cfd['romance'].most_common(20)
[(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502),
('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993),
('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690),
('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496)]
>>> cfd['romance']['could']
193
Conditional frequency distributions in NLTK: common methods and idioms for defining, accessing and visualizing a conditional frequency distribution of counts.
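A small sketch of these idioms; the genres and sample words here are chosen purely for illustration:
import nltk
from nltk.corpus import brown

# build a conditional frequency distribution from (condition, sample) pairs
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in ['news', 'romance']
    for word in brown.words(categories=genre))

print(cfd.conditions())                          # the list of conditions
print(cfd['news']['the'])                        # the count of one sample under one condition
cfd.tabulate(samples=['the', 'could', 'will'])   # tabulate counts for selected samples
cfd.plot(samples=['the', 'could', 'will'])       # plot the same counts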
WordNet
WordNet is a semantically oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets (synsets).
Synonyms
from nltk.corpus import wordnet as wn
dog_set = wn.synsets('dog')
print('dog的同义词集为:', dog_set)
print('dog的各同义词集包含的单词有:',[dog.lemma_names() for dog in dog_set])
print('dog的各同义词集的具体定义是:',[dog.definition() for dog in dog_set])
print('dog的各同义词集的例子是:',[dog.examples() for dog in dog_set])
dog的同义词集为: [Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
dog的各同义词集包含的单词有: [['dog', 'domestic_dog', 'Canis_familiaris'], ['frump', 'dog'], ['dog'], ['cad', 'bounder', 'blackguard', 'dog', 'hound', 'heel'], ['frank', 'frankfurter', 'hotdog', 'hot_dog', 'dog', 'wiener', 'wienerwurst', 'weenie'], ['pawl', 'detent', 'click', 'dog'], ['andiron', 'firedog', 'dog', 'dog-iron'], ['chase', 'chase_after', 'trail', 'tail', 'tag', 'give_chase', 'dog', 'go_after', 'track']]
dog的各同义词集的具体定义是: ['a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds', 'a dull unattractive unpleasant girl or woman', 'informal term for a man', 'someone who is morally reprehensible', 'a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll', 'a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward', 'metal supports for logs in a fireplace', 'go after with the intent to catch']
dog的各同义词集的例子是: [['the dog barked all night'], ['she got a reputation as a frump', "she's a real dog"], ['you lucky dog'], ['you dirty dog'], [], [], ['the andirons were too hot to touch'], ['The policeman chased the mugger down the alley', 'the dog chased the rabbit']]
Hypernyms and Hyponyms
A hypernym is a word that is conceptually more general. For example, "flower" is a hypernym of "fresh flower", "plant" is a hypernym of "flower", and "music" is a hypernym of "mp3". The reverse relation is the hyponym.
Hypernyms are accessed with hypernyms() and hypernym_paths(), and hyponyms with hyponyms().
Hyponyms
>>> motorcar = wn.synset('car.n.01')
>>> types_of_motorcar = motorcar.hyponyms()
>>> types_of_motorcar[0]
Synset('ambulance.n.01')
>>> sorted(lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas())
['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon',
'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible',
'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car',
'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap',
'horseless_carriage', 'hot-rod', 'hot_rod', 'jalopy', 'jeep', 'landrover',
'limo', 'limousine', 'loaner', 'minicar', 'minivan', 'pace_car', 'patrol_car',
'phaeton', 'police_car', 'police_cruiser', 'prowl_car', 'race_car', 'racer',
'racing_car', 'roadster', 'runabout', 'saloon', 'secondhand_car', 'sedan',
'sport_car', 'sport_utility', 'sport_utility_vehicle', 'sports_car', 'squad_car',
'station_waggon', 'station_wagon', 'stock_car', 'subcompact', 'subcompact_car',
'taxi', 'taxicab', 'tourer', 'touring_car', 'two-seater', 'used-car', 'waggon',
'wagon']
- Hypernyms
We can also navigate up the hierarchy by visiting hypernyms. Some words have multiple paths, because they can be classified in more than one way. There are two paths between car.n.01 and entity.n.01, because wheeled_vehicle.n.01 can be classified as both a vehicle and a container.
>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths = motorcar.hypernym_paths()
>>> len(paths)
2
>>> [synset.name() for synset in paths[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01',
'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01',
'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
>>> [synset.name() for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01',
'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01',
'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
- We can get the most general hypernyms (or root hypernyms) of a synset as follows:
>>> motorcar.root_hypernyms()
[Synset('entity.n.01')]
Antonyms
from nltk.corpus import wordnet

antonyms = []
for syn in wordnet.synsets("small"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(antonyms)  # ['large', 'big', 'big']
walks = wn.synsets('walk')
supplys = wn.synsets('supply')
hots = wn.synsets('hot')
print('walk的反义词包括: ', [walk_lemma.antonyms() for walk_lemmas in [walk.lemmas() for walk in walks] for walk_lemma in walk_lemmas if walk_lemma.antonyms()!=[]])
print('supply的反义词包括: ', [supply_lemma.antonyms() for supply_lemmas in [supply.lemmas() for supply in supplys] for supply_lemma in supply_lemmas if supply_lemma.antonyms()!=[]])
print('hot的反义词包括: ', [hot_lemma.antonyms() for hot_lemmas in [hot.lemmas() for hot in hots] for hot_lemma in hot_lemmas if hot_lemma.antonyms()!=[]])
walk的反义词包括: [[Lemma('ride.v.02.ride')]]
supply的反义词包括: [[Lemma('demand.n.02.demand')], [Lemma('recall.v.06.recall')]]
hot的反义词包括: [[Lemma('cold.a.01.cold')], [Lemma('cold.a.02.cold')]]
Semantic Similarity
Recall that each synset has one or more hypernym paths that link it to a root hypernym such as entity.n.01. Two synsets linked to the same root may have several hypernyms in common. If two synsets share a very specific hypernym, one that is low down in the hypernym hierarchy, they must be closely related.
>>> right = wn.synset('right_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')
>>> right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
>>> right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
>>> right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
>>> right.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]
The similarity measures defined over the collection of WordNet synsets incorporate this idea. For example, path_similarity assigns a score in the range 0 to 1 based on the shortest path that connects the concepts in the hypernym hierarchy (-1 is returned in cases where no path can be found). Comparing a synset with itself returns 1. Consider the following similarity scores, relating right whale to minke whale, orca, tortoise, and novel. The numbers don't mean much in themselves, but they decrease as we move from the semantic space of sea creatures to inanimate objects.
>>> right.path_similarity(minke)
0.25
>>> right.path_similarity(orca)
0.16666666666666666
>>> right.path_similarity(tortoise)
0.07692307692307693
>>> right.path_similarity(novel)
0.043478260869565216
The NLP Pipeline
The processing pipeline: open a URL, read its HTML content, remove the markup, and select a slice of the characters; then tokenize the text and optionally convert it into an nltk.Text object; we can also lowercase all the words and extract the vocabulary.
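A minimal sketch of this pipeline. The URL is a placeholder, and BeautifulSoup (the bs4 package) is assumed to be installed for stripping the HTML markup:
import nltk
from urllib import request
from bs4 import BeautifulSoup

url = "http://example.com/some_page.html"              # placeholder URL
html = request.urlopen(url).read().decode('utf8')      # read the HTML content

raw = BeautifulSoup(html, 'html.parser').get_text()    # remove the markup
raw = raw[100:2000]                                     # select a slice of the characters

tokens = nltk.word_tokenize(raw)                        # tokenize
text = nltk.Text(tokens)                                # optional: wrap as an nltk.Text object
vocab = sorted(set(w.lower() for w in tokens))          # lowercase the words and extract the vocabulary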
Tokenizing Text with Regular Expressions
>>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
... though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
... well without--Maybe it's always pepper that makes people hot-tempered,'..."""
>>> import re
>>> re.split(r'[ \t\n]+', raw)
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
The regular expression «[ \t\n]+» matches one or more spaces, tabs (\t) or newlines (\n).
>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t",
'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does',
'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that',
'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']
NLTK's Regular Expression Tokenizer
- The function nltk.regexp_tokenize() is similar to re.findall() (which we have been using for tokenization). However, nltk.regexp_tokenize() is more efficient for this task and avoids the need for special treatment of parentheses. For readability we break up the regular expression over several lines and add a comment to each line. The special (?x) "verbose flag" tells Python to strip out the embedded whitespace and comments.
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
When using the verbose flag, you can no longer use ' ' to match a space character; use \s instead. The regexp_tokenize() function has an optional gaps parameter. When set to True, the regular expression specifies the gaps between tokens, as with re.split().
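For example, a small sketch of gaps=True, where the pattern describes the separators rather than the tokens themselves:
import nltk

raw = "'When I'M a Duchess,' she said to herself"
# with gaps=True the pattern matches the material between tokens,
# so this behaves like re.split() on whitespace
print(nltk.regexp_tokenize(raw, r'\s+', gaps=True))
# ["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself']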
Sentence Segmentation
- Before tokenizing a text into words, we need to segment it into sentences. NLTK accomplishes this with the Punkt sentence segmenter.
>>> import pprint
>>> text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
>>> sents = nltk.sent_tokenize(text)
>>> pprint.pprint(sents[79:89])
['"Nonsense!"',
'said Gregory, who was very rational when anyone else\nattempted paradox.',
'"Why do all the clerks and navvies in the\n'
'railway trains look so sad and tired, so very sad and tired?',
'I will\ntell you.',
'It is because they know that the train is going right.',
'It\n'
'is because they know that whatever place they have taken a ticket\n'
'for that place they will reach.',
'It is because after they have\n'
'passed Sloane Square they know that the next station must be\n'
'Victoria, and nothing but Victoria.',
'Oh, their wild rapture!',
'oh,\n'
'their eyes like stars and their souls again in Eden, if the next\n'
'station were unaccountably Baker Street!"',
'"It is you who are unpoetical," replied the poet Syme.']
Categorizing and Tagging Words
- The process of classifying words according to their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset.
Using a Part-of-Speech Tagger
A part-of-speech tagger, or POS tagger, processes a sequence of words and attaches a part-of-speech tag to each word:
>>> from nltk import word_tokenize
>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
Here we see that and is CC, a coordinating conjunction; now and completely are RB, adverbs; for is IN, a preposition; something is NN, a noun; and different is JJ, an adjective.
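NLTK also provides documentation for each tag, which can be queried using the tag name or a regular expression, for example:
>>> nltk.help.upenn_tagset('RB')    # documentation for the adverb tag
>>> nltk.help.upenn_tagset('NN.*')  # documentation for all the noun tags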
- Let's look at another example, this time including some homonyms:
>>> text = word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
Notice that refuse and permit both appear as a present tense verb (VBP) and a noun (NN). For example, refUSE is a verb meaning "deny", while REFuse is a noun meaning "trash" (i.e. they are not homophones). Thus, we need to know which word is being used in order to pronounce the text correctly.
- Lexical categories like "noun" and part-of-speech tags like NN seem to have their uses, but the details will be obscure to many readers. You might wonder what justification there is for introducing this extra level of information. Many of these categories arise from superficial analysis of the distribution of words in text. Consider the following analysis involving woman (a noun), bought (a verb), over (a preposition), and the (a determiner). The text.similar() method takes a word w, finds all contexts w1 w w2, then finds all words w' that appear in the same contexts, i.e. w1 w' w2.
>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> text.similar('woman')
Building word-context index...
man day time year car moment world family house boy child country job
state girl place war way case question
>>> text.similar('bought')
made done put said found had seen given left heard been brought got
set was called felt in that told
>>> text.similar('over')
in on to of and for with from at by that into as up out down through
about all is
>>> text.similar('the')
a his this their its her an that our any all one these my in your no
some other and
Observe that searching for woman finds nouns; searching for bought mostly finds verbs; searching for over generally finds prepositions; and searching for the finds several determiners.
Tagged Corpora
Representing Tagged Tokens
- By convention in NLTK, a tagged token is represented using a tuple consisting of the token and the tag. We can create one of these special tuples from the standard string representation of a tagged token using the function str2tuple():
>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'
Reading Tagged Corpora
Several of the corpora included with NLTK have been tagged for their part of speech, in a form like this:
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method, so you don't have to worry about the different file formats. Note that the part-of-speech tags have been converted to uppercase; this has become standard practice since the Brown Corpus was published.
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
>>> nltk.corpus.brown.tagged_words(tagset='universal')
[('The', 'DET'), ('Fulton', 'NOUN'), ...]
>>> print(nltk.corpus.nps_chat.tagged_words())
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
>>> nltk.corpus.conll2000.tagged_words()
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]
>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
The Universal Part-of-Speech Tagset
Tagged corpora use many different conventions for tagging words. To help us get started, we will look at a simplified tagset.
>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.most_common()
[('NOUN', 30640), ('VERB', 14399), ('ADP', 12355), ('.', 11928), ('DET', 11389),
('ADJ', 6706), ('ADV', 3349), ('CONJ', 2717), ('PRON', 2535), ('PRT', 2264),
('NUM', 2166), ('X', 106)]
Exploring Tagged Corpora
- Looking at the words that follow often:
>>> brown_learned_text = brown.words(categories='learned')
>>> sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often'))
[',', '.', 'accomplished', 'analytically', 'appear', 'apt', 'associated', 'assuming',
'became', 'become', 'been', 'began', 'call', 'called', 'carefully', 'chose', ...]
- Using the tagged_words() method to look at the part-of-speech tags of the words that follow often:
>>> brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')
>>> tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
>>> fd = nltk.FreqDist(tags)
>>> fd.tabulate()
PRT ADV ADP . VERB ADJ
2 8 7 4 37 6
We can see that the most common part of speech following often is a verb. Nouns never appear in this position (in this particular corpus).
If the corpus is also segmented into sentences, it will have a tagged_sents() method that divides up the tagged words into sentences rather than presenting them as one big list. This will be useful when we come to developing automatic taggers, as they are trained and tested on lists of sentences, not words.
Next, let's look at some larger context and find words involving particular sequences of tags and words (in this case, <Verb> to <Verb>). In the example below, we consider each three-word window in the sentence and check whether it meets our criterion. If the tags match, we print the corresponding words.
>>> from nltk.corpus import brown
>>> def process(sentence):
...     for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
...         if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):
...             print(w1, w2, w3)
...
>>> for tagged_sent in brown.tagged_sents():
...     process(tagged_sent)
...
combined to achieve
continue to place
serve to protect
wanted to wait
allowed to place
expected to become
...
- brown.tagged_sents() operates on whole sentences, tagging every word within each sentence
- brown.tagged_words() operates on individual words
>>> tagged_sent = brown.tagged_sents()[0]
>>> tagged_sent
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')]
>>>
>>> tagged_word = brown.tagged_words()[0]
>>> tagged_word
('The', 'AT')
>>>
- nltk.trigrams groups every three consecutive elements into a tuple
- nltk.bigrams groups every two consecutive elements into a tuple
>>> import nltk
>>> from nltk.corpus import brown
>>>
>>> tagged_sent = brown.tagged_sents()[0]
>>> tagged_sent
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')]
>>>
>>> bigrams = list(nltk.bigrams(tagged_sent))
>>> bigrams
[(('The', 'AT'), ('Fulton', 'NP-TL')), (('Fulton', 'NP-TL'), ('County', 'NN-TL')), (('County', 'NN-TL'), ('Grand', 'JJ-TL')), (('Grand', 'JJ-TL'), ('Jury', 'NN-TL')), (('Jury', 'NN-TL'), ('said', 'VBD')), (('said', 'VBD'), ('Friday', 'NR')), (('Friday', 'NR'), ('an', 'AT')), (('an', 'AT'), ('investigation', 'NN')), (('investigation', 'NN'), ('of', 'IN')), (('of', 'IN'), ("Atlanta's", 'NP$')), (("Atlanta's", 'NP$'), ('recent', 'JJ')), (('recent', 'JJ'), ('primary', 'NN')), (('primary', 'NN'), ('election', 'NN')), (('election', 'NN'), ('produced', 'VBD')), (('produced', 'VBD'), ('``', '``')), (('``', '``'), ('no', 'AT')), (('no', 'AT'), ('evidence', 'NN')), (('evidence', 'NN'), ("''", "''")), (("''", "''"), ('that', 'CS')), (('that', 'CS'), ('any', 'DTI')), (('any', 'DTI'), ('irregularities', 'NNS')), (('irregularities', 'NNS'), ('took', 'VBD')), (('took', 'VBD'), ('place', 'NN')), (('place', 'NN'), ('.', '.'))]
>>>
>>> trigrams = list(nltk.trigrams(tagged_sent))
>>> trigrams
[(('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')), (('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL')), (('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL')), (('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD')), (('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR')), (('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT')), (('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN')), (('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')), (('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$')), (('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ')), (("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN')), (('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN')), (('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD')), (('election', 'NN'), ('produced', 'VBD'), ('``', '``')), (('produced', 'VBD'), ('``', '``'), ('no', 'AT')), (('``', '``'), ('no', 'AT'), ('evidence', 'NN')), (('no', 'AT'), ('evidence', 'NN'), ("''", "''")), (('evidence', 'NN'), ("''", "''"), ('that', 'CS')), (("''", "''"), ('that', 'CS'), ('any', 'DTI')), (('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS')), (('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD')), (('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN')), (('took', 'VBD'), ('place', 'NN'), ('.', '.'))]
>>>
Finally, let's look at words whose tags are highly ambiguous. Understanding why such words are tagged as they are in each context can help us clarify the distinctions between the tags.
>>> brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
... for (word, tag) in brown_news_tagged)
>>> for word in sorted(data.conditions()):
... if len(data[word]) > 3:
... tags = [tag for (tag, _) in data[word].most_common()]
... print(word, ' '.join(tags))
...
best ADJ ADV NP V
better ADJ ADV V DET
close ADV ADJ V N
cut V N VN VD
even ADV DET ADJ V
grant NP N V -
hit V VD VN N
lay ADJ V NP VD
left VD ADJ N VN
like CNJ V ADJ P -
near P ADV ADJ DET
open ADJ V N ADV
past N ADJ DET P
present ADJ ADV V N
read V VN VD NP
right ADJ N DET ADV
second NUM ADV DET N
set VN V VD N -
that CNJ V WH DET
Automatic Tagging
The Default Tagger
- The simplest possible tagger assigns the same tag to every token. This may seem to be a rather banal step, but it establishes an important baseline for tagger performance. In order to get the best result, we tag each word with the most likely tag. Let's find out which tag is most likely (now using the unsimplified tagset):
>>> tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
>>> nltk.FreqDist(tags).max()
'NN'
Now we can create a tagger that tags everything as NN.
>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
>>> tokens = word_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'),
('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'),
('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'),
('I', 'NN'), ('am', 'NN'), ('!', 'NN')]
Unsurprisingly, this method performs rather poorly. On a typical corpus it tags only about one in eight tokens correctly, as we see here:
>>> default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028
The default tagger assigns its tag to every single word, even words that have never been encountered before. As it happens, once we have processed several thousand words of English text, most new words will be nouns. As we will see, this means that default taggers can help to improve the robustness of a language processing system.
The Regular Expression Tagger
The regular expression tagger assigns tags to tokens on the basis of matching patterns. For instance, we might guess that any word ending in ed is the past participle of a verb, and any word ending with 's is a possessive noun. We can express these as a list of regular expressions:
>>> patterns = [
... (r'.*ing$', 'VBG'), # gerunds
... (r'.*ed$', 'VBD'), # simple past
... (r'.*es$', 'VBZ'), # 3rd singular present
... (r'.*ould$', 'MD'), # modals
... (r'.*\'s$', 'NN$'), # possessive nouns
... (r'.*s$', 'NNS'), # plural nouns
... (r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers
... (r'.*', 'NN') # nouns (default)
... ]
Note that these are processed in order, and the first one that matches is applied. Now we can set up a tagger and use it to tag a sentence; after this step it gets about a fifth of the tokens correct.
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> regexp_tagger.tag(brown_sents[3])
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'),
('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'),
("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'),
('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ...]
>>> regexp_tagger.evaluate(brown_tagged_sents)
0.20326391789486245
The Lookup Tagger
- A lot of high-frequency words do not have the NN tag. Let's find the hundred most frequent words and store their most likely tag. We can then use this information as the model for a "lookup tagger" (an NLTK UnigramTagger):
>>> fd = nltk.FreqDist(brown.words(categories='news'))
>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
>>> most_freq_words = fd.most_common(100)
>>> likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)
>>> baseline_tagger.evaluate(brown_tagged_sents)
0.45578495136941344
It should come as no surprise by now that simply knowing the tags for the 100 most frequent words enables us to tag a large fraction of tokens correctly (nearly half of them, in fact). Let's see what it does on some untagged input text:
>>> sent = brown.sents(categories='news')[3]
>>> baseline_tagger.tag(sent)
[('``', '``'), ('Only', None), ('a', 'AT'), ('relative', None),
('handful', None), ('of', 'IN'), ('such', None), ('reports', None),
('was', 'BEDZ'), ('received', None), ("''", "''"), (',', ','),
('the', 'AT'), ('jury', None), ('said', 'VBD'), (',', ','),
('``', '``'), ('considering', None), ('the', 'AT'), ('widespread', None),
('interest', None), ('in', 'IN'), ('the', 'AT'), ('election', None),
(',', ','), ('the', 'AT'), ('number', None), ('of', 'IN'),
('voters', None), ('and', 'CC'), ('the', 'AT'), ('size', None),
('of', 'IN'), ('this', 'DT'), ('city', None), ("''", "''"), ('.', '.')]
Many words have been assigned a tag of None, because they were not among the 100 most frequent words. In these cases we would like to assign the default tag of NN. In other words, we want to use the lookup table first, and if it is unable to assign a tag, then use the default tagger, a process known as backoff. We do this by specifying one tagger as a parameter to the other, as shown below. Now the lookup tagger will only store word-tag pairs for words other than nouns, and whenever it cannot assign a tag to a word it will invoke the default tagger.
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags,
... backoff=nltk.DefaultTagger('NN'))
Let's put all this together and write a program to create and evaluate lookup taggers of a range of sizes:
def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():
    import pylab
    word_freqs = nltk.FreqDist(brown.words(categories='news')).most_common()
    words_by_freq = [w for (w, _) in word_freqs]
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()
>>> display()
We can observe that performance initially increases rapidly as the model size grows, eventually reaching a plateau where large increases in model size yield little improvement in performance.
N-Gram Tagging
Unigram Tagging
Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token. For example, it will assign the tag JJ to any occurrence of the word frequent, since frequent is used as an adjective (e.g. a frequent word) more often than it is used as a verb (e.g. I frequent this cafe). A unigram tagger behaves just like a lookup tagger, except there is a more convenient technique for setting it up, called training. In the following code example, we train a unigram tagger, use it to tag a sentence, then evaluate it:
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
>>> unigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'),
('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'),
(',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'),
('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'),
('direct', 'JJ'), ('.', '.')]
>>> unigram_tagger.evaluate(brown_tagged_sents)
0.9349006503968017
Separating the Training and Testing Data
Now that we are training a tagger on some data, we must be careful not to test it on the same data, as we did in the example above. A tagger that simply memorized its training data and made no attempt to construct a general model would get a perfect score, but would be useless for tagging new text. Instead, we should split the data, using 90% for training and the remaining 10% for testing:
>>> size = int(len(brown_tagged_sents) * 0.9)
>>> size
4160
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.evaluate(test_sents)
0.811721...
General N-Gram Tagging
When we perform a language processing task based on unigrams, we are using one item of context. In the case of tagging, we only consider the current token, in isolation from any larger context. Given such a model, the best we can do is tag each word with its a priori most likely tag. This means we would tag a word such as wind with the same tag, regardless of whether it appears in the context the wind or to wind.
An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens. For example, with n=3 the tagger considers the tags of the two preceding words in addition to the current word. An n-gram tagger picks the tag that is most likely in the given context.
The NgramTagger class uses a tagged training corpus to determine which part-of-speech tag is most likely for each context. Here we see a special case of an n-gram tagger, namely a bigram tagger. First we train it, then use it to tag untagged sentences:
>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'),
('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'),
('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'),
('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'),
('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
>>> unseen_sent = brown_sents[4203]
>>> bigram_tagger.tag(unseen_sent)
[('The', 'AT'), ('population', 'NN'), ('of', 'IN'), ('the', 'AT'), ('Congo', 'NP'),
('is', 'BEZ'), ('13.5', None), ('million', None), (',', None), ('divided', None),
('into', None), ('at', None), ('least', None), ('seven', None), ('major', None),
('``', None), ('culture', None), ('clusters', None), ("''", None), ('and', None),
('innumerable', None), ('tribes', None), ('speaking', None), ('400', None),
('separate', None), ('dialects', None), ('.', None)]
Notice that the bigram tagger manages to tag every word in a sentence it saw during training, but does badly on an unseen sentence. As soon as it encounters a new word (i.e. 13.5), it is unable to assign a tag. It cannot tag the following word (i.e. million), even if it was seen during training, simply because it never saw that word preceded by a None-tagged word during training. Consequently, the tagger fails to tag the rest of the sentence, and its overall accuracy score is very low:
>>> bigram_tagger.evaluate(test_sents)
0.102063...
As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off in information retrieval).
Combining Taggers
One way to address the trade-off between accuracy and coverage is to use the more accurate algorithms when we can, but to fall back on algorithms with wider coverage when necessary. For example, we could combine the results of a bigram tagger, a unigram tagger, and a default tagger, as follows:
- Try tagging the token with the bigram tagger.
- If the bigram tagger is unable to find a tag for the token, try the unigram tagger.
- If the unigram tagger is also unable to find a tag, use the default tagger.
Most NLTK taggers permit a backoff tagger to be specified. The backoff tagger may itself have a backoff tagger:
>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
>>> t2.evaluate(test_sents)
0.844513...
Storing Taggers
Training a tagger on a large corpus may take a significant amount of time. Instead of training a tagger every time we need one, it is convenient to save a trained tagger to a file for later reuse. Let's save our tagger t2 to a file t2.pkl.
>>> from pickle import dump
>>> output = open('t2.pkl', 'wb')
>>> dump(t2, output, -1)
>>> output.close()
Now, in a separate Python process, we can load the saved tagger.
>>> from pickle import load
>>> input = open('t2.pkl', 'rb')
>>> tagger = load(input)
>>> input.close()
Now let's check that it can be used for tagging.
>>> text = """The board's action shows what free enterprise
... is up against in our complex maze of regulatory laws ."""
>>> tokens = text.split()
>>> tagger.tag(tokens)
[('The', 'AT'), ("board's", 'NN$'), ('action', 'NN'), ('shows', 'NNS'),
('what', 'WDT'), ('free', 'JJ'), ('enterprise', 'NN'), ('is', 'BEZ'),
('up', 'RP'), ('against', 'IN'), ('in', 'IN'), ('our', 'PP$'), ('complex', 'JJ'),
('maze', 'NN'), ('of', 'IN'), ('regulatory', 'NN'), ('laws', 'NNS'), ('.', '.')]