In NLP tasks we often need to determine the part of speech of each word. The pos_tag function in the nltk library handles this well.
Here is an example:
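import nltk
# The tokenizer and tagger models must be downloaded once via nltk.download();
# the exact resource names depend on the NLTK version.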
line = 'i love this world which was beloved by all the people here'
tokens = nltk.word_tokenize(line)
# ['i', 'love', 'this', 'world', 'which', 'was', 'beloved', 'by',
# 'all', 'the', 'people', 'here']
pos_tags = nltk.pos_tag(tokens)
# [('i', 'RB'), ('love', 'VBP'), ('this', 'DT'), ('world', 'NN'), ('which', 'WDT'),
# ('was', 'VBD'), ('beloved', 'VBN'), ('by', 'IN'), ('all', 'PDT'), ('the', 'DT'),
# ('people', 'NNS'), ('here', 'RB')]
# Keep only the noun tags (singular, plural, and proper nouns).
for word, pos in pos_tags:
    if pos in ('NN', 'NNP', 'NNS', 'NNPS'):
        print(word, pos)
# world NN
# people NNS
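
Since all four noun tags share the `NN` prefix, the filter above can also be written as a prefix check. The following is a minimal sketch: the helper name `filter_by_prefix` is hypothetical, and it reuses the tagged list shown above so that it runs without any NLTK data.

```python
# Tagged output copied from the example above.
pos_tags = [('i', 'RB'), ('love', 'VBP'), ('this', 'DT'), ('world', 'NN'),
            ('which', 'WDT'), ('was', 'VBD'), ('beloved', 'VBN'), ('by', 'IN'),
            ('all', 'PDT'), ('the', 'DT'), ('people', 'NNS'), ('here', 'RB')]


def filter_by_prefix(tagged, prefix):
    """Return the (word, tag) pairs whose tag starts with the given prefix."""
    return [(word, tag) for word, tag in tagged if tag.startswith(prefix)]


nouns = filter_by_prefix(pos_tags, 'NN')   # matches NN, NNS, NNP, NNPS
verbs = filter_by_prefix(pos_tags, 'VB')   # matches VB, VBD, VBG, VBN, VBP, VBZ
print(nouns)
print(verbs)
```

The same helper works for any tag family in the Penn Treebank set, for example `'JJ'` for adjectives or `'RB'` for adverbs.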
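As an alternative to nltk, the TextBlob library goes a step further and extracts noun phrases. For example, "computer science" is treated as a single unit rather than as the separate words "computer" and "science":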
from textblob import TextBlob
txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
blob = TextBlob(txt)
print(blob.noun_phrases)
# ['natural language processing', 'nlp', 'computer science', 'artificial intelligence', 'computational linguistics']
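
To give an intuition for what phrase extraction involves, here is a minimal sketch that groups consecutive adjective and noun tags into phrases. It is a deliberately simplified illustration and not TextBlob's actual extractor; the function name `chunk_noun_phrases` and the hand-tagged sample are assumptions made for this example.

```python
def chunk_noun_phrases(tagged):
    """Group runs of adjectives/nouns into multi-word phrases ending in a noun."""
    phrases, current = [], []

    def flush():
        # Drop trailing adjectives so that a phrase always ends with a noun.
        while current and not current[-1][1].startswith('NN'):
            current.pop()
        # Keep only multi-word phrases; single nouns are skipped.
        if len(current) > 1:
            phrases.append(' '.join(word for word, _ in current))
        current.clear()

    for word, tag in tagged:
        if tag.startswith('NN') or tag.startswith('JJ'):
            current.append((word, tag))
        else:
            flush()
    flush()
    return phrases


# Hand-tagged sample (hypothetical tags, written out for illustration).
sample = [('a', 'DT'), ('field', 'NN'), ('of', 'IN'), ('computer', 'NN'),
          ('science', 'NN'), (',', ','), ('artificial', 'JJ'),
          ('intelligence', 'NN'), ('and', 'CC'), ('computational', 'JJ'),
          ('linguistics', 'NNS')]
phrases = chunk_noun_phrases(sample)
print(phrases)
```

Real extractors handle many more patterns (determiners, possessives, proper names), so results from TextBlob will generally differ from this sketch.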
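For more examples, see chapter 5 of the official nltk book.
The tags returned by pos_tag follow the University of Pennsylvania (Penn Treebank) tag set, listed below:
Tag | Meaning |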
---|---|
CC | Coordinating conjunction |
CD | Cardinal number |
DT | Determiner |
EX | Existential there |
FW | Foreign word |
IN | Preposition or subordinating conjunction |
JJ | Adjective |
JJR | Adjective, comparative |
JJS | Adjective, superlative |
LS | List item marker |
MD | Modal |
NN | Noun, singular or mass |
NNS | Noun, plural |
NNP | Proper noun, singular |
NNPS | Proper noun, plural |
PDT | Predeterminer |
POS | Possessive ending |
PRP | Personal pronoun |
PRP$ | Possessive pronoun |
RB | Adverb |
RBR | Adverb, comparative |
RBS | Adverb, superlative |
RP | Particle |
SYM | Symbol |
TO | to |
UH | Interjection |
VB | Verb, base form |
VBD | Verb, past tense |
VBG | Verb, gerund or present participle |
VBN | Verb, past participle |
VBP | Verb, non-3rd person singular present |
VBZ | Verb, 3rd person singular present |
WDT | Wh-determiner |
WP | Wh-pronoun |
WP$ | Possessive wh-pronoun |
WRB | Wh-adverb |
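
The table can be turned into a small lookup to make tagger output easier to read. The sketch below copies only a handful of rows; the dictionary name `TAG_MEANINGS` and the helper `describe` are hypothetical, and tags missing from the dictionary fall back to a placeholder string.

```python
# A few rows copied from the table above; extend as needed.
TAG_MEANINGS = {
    'DT': 'Determiner',
    'IN': 'Preposition or subordinating conjunction',
    'NN': 'Noun, singular or mass',
    'NNS': 'Noun, plural',
    'RB': 'Adverb',
    'VBD': 'Verb, past tense',
    'VBP': 'Verb, non-3rd person singular present',
}


def describe(tagged):
    """Attach a readable meaning to each (word, tag) pair."""
    return [(word, tag, TAG_MEANINGS.get(tag, 'unknown tag'))
            for word, tag in tagged]


described = describe([('love', 'VBP'), ('world', 'NN'), ('!', '.')])
for word, tag, meaning in described:
    print(word, tag, meaning)
```

nltk also provides `nltk.help.upenn_tagset()`, which prints tag descriptions when the corresponding data package is installed.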