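sklearn.feature_extraction.text provides four text feature extraction classes:
- CountVectorizer
- TfidfVectorizer
- TfidfTransformer
- HashingVectorizer
CountVectorizer converts a collection of text documents into a matrix of term counts. Its fit_transform method learns the vocabulary and counts how many times each term occurs in each document.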
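To make the relationship between the four classes concrete, here is a minimal sketch, assuming default parameters and a toy corpus chosen only for illustration. With defaults, TfidfVectorizer gives the same result as CountVectorizer followed by TfidfTransformer, while HashingVectorizer keeps no vocabulary and maps terms to a fixed number of columns (`n_features=16` is an arbitrary value picked for this example).

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# Two steps: raw counts first, then TF-IDF weighting of those counts
counts = CountVectorizer().fit_transform(texts)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both internally
tfidf_one_step = TfidfVectorizer().fit_transform(texts)

same = np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())
print(same)

# HashingVectorizer stores no vocabulary; the column count is fixed up front
hashed = HashingVectorizer(n_features=16).fit_transform(texts)
print(hashed.shape)
```

Because HashingVectorizer keeps no vocabulary, there is no way to map a column back to the term that produced it.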
Parameters
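A few commonly used constructor parameters, as a rough sketch rather than a full reference. The toy documents and variable names are made up for illustration; the parameters themselves (`stop_words`, `ngram_range`, `min_df`, `binary`) are part of CountVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog chased the cat", "the cat sat", "a bird sang"]

# stop_words="english": drop words from the built-in English stop-word list
cv_stop = CountVectorizer(stop_words="english").fit(docs)

# ngram_range=(1, 2): extract both unigrams and bigrams
cv_ngram = CountVectorizer(ngram_range=(1, 2)).fit(docs)

# min_df=2: keep only terms that occur in at least two documents
cv_min_df = CountVectorizer(min_df=2).fit(docs)

# binary=True: record presence (1) instead of the raw count
cv_binary = CountVectorizer(binary=True)
X_binary = cv_binary.fit_transform(docs)

print("the" in cv_stop.vocabulary_)       # False
print("the dog" in cv_ngram.vocabulary_)  # True
print(sorted(cv_min_df.vocabulary_))      # ['cat', 'the']
print(X_binary[0, cv_binary.vocabulary_["the"]])  # 1, although "the" occurs twice
```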
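Attributes
Attribute | Description |
---|---|
vocabulary_ | Mapping from each term to its column index; dict |
get_feature_names_out() | Method (not an attribute) returning the vocabulary terms ordered by column index; array. It replaces get_feature_names(), which was removed in scikit-learn 1.2 |
stop_words_ | Terms ignored during fitting because of max_df, min_df, or max_features; set |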
Methods
Method | Description |
---|---|
fit_transform(X) | Learn the vocabulary and return the document-term matrix |
fit(raw_documents[, y]) | Learn the vocabulary dictionary from the documents |
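The difference between fit and transform matters once new documents arrive. A small sketch, with made-up documents: the vocabulary is learned once with fit, and transform then counts only terms already in that vocabulary, so unseen words are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["dog cat fish", "dog cat cat"]
new_docs = ["cat cat horse"]  # "horse" never appeared during fit

cv = CountVectorizer()
cv.fit(train_docs)             # learns the vocabulary: cat, dog, fish
X_new = cv.transform(new_docs) # reuses that vocabulary, does not extend it

print(sorted(cv.vocabulary_))  # ['cat', 'dog', 'fish']
print(X_new.toarray())         # [[2 0 0]]
```

Calling fit_transform on the new documents instead would rebuild the vocabulary, and the columns would no longer line up with the training matrix.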
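Getting started example
from sklearn.feature_extraction.text import CountVectorizer
texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]  # each list element, e.g. "dog cat fish", is a string representing one document
cv = CountVectorizer()  # create the bag-of-words vectorizer
cv_fit = cv.fit_transform(texts)
# The line above is equivalent to the following two lines
# cv.fit(texts)
# cv_fit = cv.transform(texts)
print(cv.get_feature_names_out())  # ['bird' 'cat' 'dog' 'fish'] the vocabulary as an array (scikit-learn < 1.0: use get_feature_names())
print(cv.vocabulary_)  # {'dog': 2, 'cat': 1, 'fish': 3, 'bird': 0} dict; key: term, value: the term's feature index, which is also its column in the count matrix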
[https://blog.csdn.net/weixin_38278334/article/details/82320307](https://blog.csdn.net/weixin_38278334/article/details/82320307)
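print(cv_fit)
# Each entry reads (document index, term index) count.
# The exact print format and entry order vary across SciPy and scikit-learn versions.
#(0,3)1 document 0, the term with vocabulary index 3, count 1
#(0,1)1
#(0,2)1
#(1,1)2
#(1,2)1
#(2,0)1
#(2,3)1
#(3,0)1
print(cv_fit.toarray())  # .toarray() converts the sparse matrix into a dense NumPy array
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]
print(cv_fit.toarray().sum(axis=0))  # total count of each term across all documents
#[2 3 2 2]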
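The column sums are easier to read once they are paired with their terms. A minimal sketch, assuming the same toy corpus; it relies only on `vocabulary_`, so it does not depend on which feature-name method the installed version offers. The name `term_totals` is introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

cv = CountVectorizer()
X = cv.fit_transform(texts)

# Summing the sparse matrix directly avoids building the dense array first
totals = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ maps term -> column index, so it can label each total
term_totals = {term: int(totals[idx]) for term, idx in cv.vocabulary_.items()}
print(term_totals)
```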
Reimplementation
The features covered are:
- text preprocessing such as stop-word removal
- fit
- transform
- n-gram support
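Of the features listed above, n-gram extraction is the least obvious, so here is a standalone sketch. The helper name `ngrams` is hypothetical and exists only in this example: it slides a window of n tokens over the list and joins each window with a space.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows, each joined with a space."""
    # range() is empty when the document has fewer than n tokens
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["dog", "cat", "fish"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
too_long = ngrams(tokens, 4)

print(unigrams)  # ['dog', 'cat', 'fish']
print(bigrams)   # ['dog cat', 'cat fish']
print(too_long)  # []
```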
import numpy as np
# Read the corpus: one document per line
with open('data.txt', 'r', encoding='utf-8') as f:
    data = [line.strip() for line in f]
class MyCountVectorizer(object):
    def __init__(self, n=1, remove_stop_words=False):
        self.n = n  # n-gram size
        self.remove_stop_words = remove_stop_words
        # Per-instance state; class-level mutable attributes would be shared by every instance
        self.vocabulary = {}
        self.corpus = []

    def clean(self, corpus):
        stop_words = set()
        if self.remove_stop_words:
            # Load the stop-word list (one word per line)
            with open('stopwords.txt', 'r', encoding='utf-8') as f:
                stop_words = {w.strip() for w in f}
        for text in corpus:
            # Lowercase
            text = text.lower()
            # Replace punctuation with spaces
            for c in """!"'#$%&\\()*+,-./:;<=>?@[]^_`{|}~“”‘’""":
                text = text.replace(c, ' ')
            # Keep alphanumeric tokens longer than one character, minus stop words
            word_ls = [word for word in text.split()
                       if word.isalnum() and len(word) > 1 and word not in stop_words]
            # Build the n-grams of this document
            n_gram_word_ls = [' '.join(word_ls[idx: idx + self.n])
                              for idx in range(len(word_ls) - self.n + 1)]
            self.corpus.append(n_gram_word_ls)

    def fit(self, corpus):
        # Build the vocabulary: map each term to a column of the term-frequency matrix
        self.clean(corpus)
        for row in self.corpus:
            for word in row:
                if word not in self.vocabulary:
                    self.vocabulary[word] = len(self.vocabulary)
        return self

    def transform(self):
        # Build the term-frequency matrix (number of documents x vocabulary size)
        # for the corpus stored by fit()
        size = len(self.vocabulary)
        tf_matrix = []
        for doc in self.corpus:
            # Count how often each term appears in the document
            row = [0] * size
            for word in doc:
                row[self.vocabulary[word]] += 1
            tf_matrix.append(row)
        return np.array(tf_matrix)

    def get_vocab(self):
        # Return the term -> column index dictionary
        return self.vocabulary

cv = MyCountVectorizer(1, True)
cv.fit(data)
print(cv.get_vocab())
term_frequency_matrix = cv.transform()
print(term_frequency_matrix.shape)
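One way to sanity-check a hand-written vectorizer is to compare it with CountVectorizer on a tiny corpus. The sketch below is self-contained, so it uses a simplified stand-in (whitespace tokenization, first-seen column order, as in the class above) rather than MyCountVectorizer itself, which needs data.txt and stopwords.txt. Since sklearn sorts its vocabulary alphabetically, the columns have to be compared term by term, not position by position.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# Simplified stand-in for the hand-written vectorizer (illustrative only):
# columns are assigned in first-seen order
vocab = {}
for text in texts:
    for word in text.split():
        vocab.setdefault(word, len(vocab))

manual = np.zeros((len(texts), len(vocab)), dtype=int)
for row, text in enumerate(texts):
    for word in text.split():
        manual[row, vocab[word]] += 1

cv = CountVectorizer()
reference = cv.fit_transform(texts).toarray()

# Column orders differ, so look each term up in both vocabularies
matches = set(vocab) == set(cv.vocabulary_) and all(
    np.array_equal(manual[:, vocab[term]], reference[:, cv.vocabulary_[term]])
    for term in vocab
)
print(matches)  # True
```

On real text the two can still disagree, because the tokenization and filtering rules of the class above are not the same as sklearn's defaults.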