Usage
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
snowball_stemmer = SnowballStemmer('english')
wordnet_lemmatizer = WordNetLemmatizer()
words = [('bottles', wordnet.NOUN), ('vases', wordnet.NOUN), ('lit', wordnet.VERB), ('said', wordnet.VERB), ('earlier', wordnet.ADJ)]
for word, pos in words:
    # Each comment lists the outputs for all five words, in the order of `words`.
    print(porter_stemmer.stem(word))  # output: 'bottl', 'vase', 'lit', 'said', 'earlier'
    print(lancaster_stemmer.stem(word))  # output: 'bottl', 'vas', 'lit', 'said', 'ear'
    print(snowball_stemmer.stem(word))  # output: 'bottl', 'vase', 'lit', 'said', 'earlier'
    print(wordnet_lemmatizer.lemmatize(word))  # output: 'bottle', 'vas', 'lit', 'said', 'earlier'
    print(wordnet_lemmatizer.lemmatize(word, pos=pos))  # output: 'bottle', 'vas', 'light', 'say', 'early'
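Conclusion
Judging from the example above alone, WordNetLemmatizer recovers the base form of English words better than the stemmers when it is given the part of speech. Without the pos argument, lemmatize() treats every word as a noun, which is why 'lit', 'said' and 'earlier' come back unchanged in that case.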
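In practice the part of speech usually comes from a tagger rather than being written by hand. A tagger such as nltk.pos_tag emits Penn Treebank tags ('NNS', 'VBD', 'JJR', ...), while lemmatize() expects one of WordNet's POS constants. The sketch below shows one possible mapping between the two. The helper name penn_to_wordnet is hypothetical and not part of NLTK, and the noun fallback for unrecognized tags is an assumption chosen to match lemmatize()'s default.

```python
# WordNet POS constants. These are the string values of wordnet.NOUN,
# wordnet.VERB, wordnet.ADJ and wordnet.ADV in nltk.corpus.wordnet.
WN_NOUN, WN_VERB, WN_ADJ, WN_ADV = "n", "v", "a", "r"


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS (hypothetical helper)."""
    if tag.startswith("JJ"):   # JJ, JJR, JJS: adjectives
        return WN_ADJ
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return WN_VERB
    if tag.startswith("RB"):   # RB, RBR, RBS: adverbs
        return WN_ADV
    # Assumption: treat everything else as a noun, which is also
    # what lemmatize() does when no pos is passed.
    return WN_NOUN
```

With NLTK installed, the result can be passed straight to the lemmatizer, for example wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) for each (word, tag) pair returned by nltk.pos_tag.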
[Note] A comparison of lemmatization tools