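Natural Language Processing | NLTK

Natural Language Processing (NLP)

Natural language processing (NLP) is an important area of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language, and it draws on linguistics, computer science, and mathematics.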

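Applications of NLP

  • Search engines, such as Google and Yahoo. A search engine can use NLP to infer, for example, that you are a tech enthusiast, and so return technology-related results.
  • Social network feeds, such as Facebook's news feed. The feed algorithm uses NLP to learn your interests and shows you relevant ads and posts rather than unrelated content.
  • Voice assistants, such as Apple's Siri.
  • Spam filters, such as Google's spam filter. This goes beyond traditional spam filtering: modern filters analyze the content of an email to decide whether it is spam.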

NLTK

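NLTK is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to more than 50 corpora and lexical resources such as WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. It is one of the best-known NLP libraries for Python and ships with its own corpora, part-of-speech tagging, classifiers, tokenizers, and more. NLTK has been called "a wonderful tool for teaching, and working in, computational linguistics using Python," and "an amazing library to play with natural language."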

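Installing the corpora

pip install nltk

Note that this installs only the framework; no corpus data is included yet.

# Start an IPython session and run:
import nltk
nltk.download()

In my experience, downloading the book and popular collections is enough.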


Feature overview

[Image missing from the source]

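With everything installed, let's have some fun.

Understanding tokenization

Tokenization splits a long sentence into "meaningful" smaller units. Here we use nltk.word_tokenize.

>>> import nltk
>>> sentence = "hello,,world"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['hello', ',', ',world']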
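To see why a tokenizer is more than `str.split`, here is a minimal sketch that uses only Python's `re` module. It is not NLTK's algorithm (`word_tokenize` applies Treebank-style rules), and the helper name `simple_tokenize` is made up for illustration. It treats every run of word characters and every single punctuation mark as a token, which is why it splits `hello,,world` differently from `word_tokenize`.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical helper for illustration only; NLTK's word_tokenize uses
    more elaborate rules and can return different tokens.
    """
    return re.findall(r"\w+|[^\w\s]", text)


print("hello,,world".split())           # whitespace splitting sees a single token
print(simple_tokenize("hello,,world"))  # words and punctuation are separated
```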

Tagging text

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)  # tag parts of speech
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]

Loading built-in corpora

[Image missing from the source]

Word tokenization (note: English only; Chinese text is not segmented)

>>> from nltk.tokenize import word_tokenize
>>> from nltk.text import Text
>>> input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow."
>>> tokens = word_tokenize(input_str)
>>> tokens[:5]
['Today', "'s", 'weather', 'is', 'good']
>>> tokens = [word.lower() for word in tokens]  # lowercase
>>> tokens[:5]
['today', "'s", 'weather', 'is', 'good']

Finding a word's count and position

>>> t = Text(tokens)
>>> t.count('good')
1
>>> t.index('good')
4

You can also plot the frequencies of the most common tokens:

t.plot(8)
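`Text.plot(8)` charts the eight most frequent tokens. NLTK's `FreqDist` is a subclass of `collections.Counter`, so the counts behind the chart can be approximated with the standard library alone. The sketch below hard-codes the lowercased tokens of the example sentence so that it runs without NLTK; the variable names are illustrative.

```python
from collections import Counter

# Lowercased tokens of the example sentence.
tokens = ['today', "'s", 'weather', 'is', 'good', ',', 'very', 'windy',
          'and', 'sunny', ',', 'we', 'have', 'no', 'classes', 'in', 'the',
          'afternoon', ',', 'we', 'have', 'to', 'play', 'basketball',
          'tomorrow', '.']

counts = Counter(tokens)
top = counts.most_common(8)  # the (token, count) pairs a frequency plot would show

for token, count in top:
    print(f"{token!r}: {count}")
```

The plain list also explains the earlier results: `tokens.count('good')` and `tokens.index('good')` give the same numbers as the `Text` methods.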


Stop words

from nltk.corpus import stopwords
stopwords.fileids()  # languages with a stop word list
# As expected, Chinese is not included
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian',  'spanish', 'swedish', 'turkish']

Let's look at the English stop words:

stopwords.raw('english').replace('\n',' ')  # the raw text has one word per line, so replace the newlines

"i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't "

Putting it to use

test_words = [word.lower() for word in tokens]  # tokens comes from the sentence above
test_words_set = set(test_words)  # unique tokens
test_words_set.intersection(set(stopwords.words('english')))
{'and', 'have', 'in', 'is', 'no', 'the', 'to', 'very', 'we'}

The sentence "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow." contains the following stop words:

'and', 'have', 'in', 'is', 'no', 'the', 'to', 'very', 'we'

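Filtering out stop words

english_stopwords = set(stopwords.words('english'))  # build the set once instead of once per word
filtered = [w for w in test_words_set if w not in english_stopwords]
filtered
['today', 'good', 'windy', 'sunny', 'afternoon', 'play', 'basketball', 'tomorrow', 'weather', 'classes', ',', '.', "'s"]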
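Filtering the set loses the original word order and keeps punctuation. A minimal sketch of an order-preserving variant follows. To stay runnable without the downloaded corpus, it hard-codes only the nine stop words found in the example sentence rather than NLTK's full English list, and it uses `str.isalpha` as a rough punctuation filter. Both choices, and the helper name `remove_stop_words`, are assumptions made for this illustration.

```python
# Stop words that occur in the example sentence (a small subset of NLTK's English list).
STOP_WORDS = {'and', 'have', 'in', 'is', 'no', 'the', 'to', 'very', 'we'}

# Lowercased tokens of the example sentence.
tokens = ['today', "'s", 'weather', 'is', 'good', ',', 'very', 'windy',
          'and', 'sunny', ',', 'we', 'have', 'no', 'classes', 'in', 'the',
          'afternoon', ',', 'we', 'have', 'to', 'play', 'basketball',
          'tomorrow', '.']


def remove_stop_words(words, stop_words):
    """Keep alphabetic tokens that are not stop words, preserving order."""
    return [w for w in words if w.isalpha() and w not in stop_words]


filtered = remove_stop_words(tokens, STOP_WORDS)
print(filtered)
```

With NLTK and its stop word corpus installed, `set(stopwords.words('english'))` can be passed as `stop_words` instead of the hard-coded set.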

Part-of-speech tagging

from nltk import pos_tag
tags = pos_tag(word_tokenize(input_str))  # tag the original-case tokens; tokens was lowercased above
tags
[('Today', 'NN'), ("'s", 'POS'), ('weather', 'NN'), ('is', 'VBZ'), ('good', 'JJ'), (',', ','), ('very', 'RB'), ('windy', 'JJ'), ('and', 'CC'), ('sunny', 'JJ'), (',', ','), ('we', 'PRP'), ('have', 'VBP'), ('no', 'DT'), ('classes', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('afternoon', 'NN'), (',', ','), ('We', 'PRP'), ('have', 'VBP'), ('to', 'TO'), ('play', 'VB'), ('basketball', 'NN'), ('tomorrow', 'NN'), ('.', '.')]
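A tagged sentence is just a list of `(word, tag)` pairs, so ordinary list operations are enough to work with it. As a small sketch, the following picks out the nouns, relying on the Penn Treebank convention that noun tags start with `NN`. The list is a shortened copy of the tagger output, and `words_with_tag_prefix` is a hypothetical helper.

```python
# Shortened copy of the tagger output for the example sentence.
tags = [('Today', 'NN'), ("'s", 'POS'), ('weather', 'NN'), ('is', 'VBZ'),
        ('good', 'JJ'), (',', ','), ('we', 'PRP'), ('have', 'VBP'),
        ('no', 'DT'), ('classes', 'NNS'), ('in', 'IN'), ('the', 'DT'),
        ('afternoon', 'NN'), ('.', '.')]


def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tags, 'NN')       # NN, NNS, NNP, ...
adjectives = words_with_tag_prefix(tags, 'JJ')  # JJ, JJR, JJS
print(nouns)
print(adjectives)
```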


Chunking

from nltk.chunk import RegexpParser
sentence = [('the','DT'),('little','JJ'),('yellow','JJ'),('dog','NN'),('died','VBD')]
grammar = "MY_NP: {<DT>?<JJ>*<NN>}"
cp = RegexpParser(grammar)  # build the chunker from the rule
result = cp.parse(sentence)  # chunk the tagged sentence
print(result)
result.draw()  # opens a Tkinter window that draws the tree
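The rule `{<DT>?<JJ>*<NN>}` reads like a regular expression over tags: an optional determiner, any number of adjectives, then a noun. The sketch below mimics that idea with Python's `re` module by matching against a string of tags. It is a simplified stand-in meant to show how the pattern is read, not NLTK's `RegexpParser`; the function `chunk_noun_phrases` and its tuple-based output are hypothetical.

```python
import re

# Same shape as the rule above, written as a plain regex over "<TAG>" strings.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")


def chunk_noun_phrases(tagged):
    """Group tokens matching DT? JJ* NN into ('MY_NP', [...]) entries."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    result, pos = [], 0  # pos = number of tokens already copied to result
    for match in NP_PATTERN.finditer(tag_string):
        start = tag_string[:match.start()].count("<")  # token index where the chunk begins
        end = start + match.group().count("<")         # token index just past the chunk
        result.extend(tagged[pos:start])
        result.append(("MY_NP", tagged[start:end]))
        pos = end
    result.extend(tagged[pos:])
    return result


sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'),
            ('dog', 'NN'), ('died', 'VBD')]
print(chunk_noun_phrases(sentence))
```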

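Named entity recognition

Named entity recognition (NER) is a basic NLP task: it identifies named mentions in text, laying the groundwork for tasks such as relation extraction. In the narrow sense, it identifies three kinds of named entities: person names, place names, and organization names (entity types with a regular structure, such as times and currency amounts, can be recognized with regular expressions and similar methods). In a specific domain, of course, domain-specific entity types are defined as needed.

from nltk import ne_chunk
sentence = "Edison went to Tsinghua University today."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
(S
  (PERSON Edison/NNP)
  went/VBD
  to/TO
  (ORGANIZATION Tsinghua/NNP University/NNP)
  today/NN
  ./.)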
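The printed tree marks each entity as a labelled bracket. As a rough sketch of pulling those entities out, the snippet below parses the printed text with a regular expression; with a live NLTK `Tree` object one would walk its subtrees instead. `extract_entities` is a hypothetical helper, and it assumes entity chunks are not nested.

```python
import re

# The tree printed by ne_chunk for the example sentence, as text.
tree_text = ("(S (PERSON Edison/NNP) went/VBD to/TO "
             "(ORGANIZATION Tsinghua/NNP University/NNP) today/NN ./.)")


def extract_entities(text):
    """Return (label, phrase) pairs for each innermost bracketed chunk."""
    entities = []
    for label, body in re.findall(r"\((\w+)\s+([^()]+)\)", text):
        # Each leaf looks like word/TAG; keep only the word.
        words = [token.rsplit("/", 1)[0] for token in body.split()]
        entities.append((label, " ".join(words)))
    return entities


print(extract_entities(tree_text))
```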
