Below is a complete example showing how to use the bag-of-words model and TF-IDF for text classification.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# Sample texts and their labels
texts = [
"The quick brown fox jumps over the lazy dog",
"I love watching the quick brown fox",
"The dog was lazy and the fox was quick",
"This is a test document about machine learning",
"Another document about deep learning"
]
labels = [0, 0, 0, 1, 1] # 0: Fox, 1: Machine Learning
# Split into training and test sets; with only 5 documents, test_size=0.2
# leaves a single test document, so the accuracies below are illustrative only
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)
# Bag-of-words features
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
# TF-IDF features
vectorizer_tfidf = TfidfVectorizer()
X_train_tfidf = vectorizer_tfidf.fit_transform(X_train)
X_test_tfidf = vectorizer_tfidf.transform(X_test)
# Train a Naive Bayes classifier on each representation
model_bow = MultinomialNB()
model_bow.fit(X_train_bow, y_train)
model_tfidf = MultinomialNB()
model_tfidf.fit(X_train_tfidf, y_train)
# Predict on the test set
y_pred_bow = model_bow.predict(X_test_bow)
y_pred_tfidf = model_tfidf.predict(X_test_tfidf)
# Evaluate both models
accuracy_bow = accuracy_score(y_test, y_pred_bow)
accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf)
print(f"词袋模型准确率: {accuracy_bow:.2f}")
print(f"TF-IDF模型准确率: {accuracy_tfidf:.2f}")
Notes
- Parameter tuning: `CountVectorizer` and `TfidfVectorizer` expose many tunable parameters, such as `max_df`, `min_df`, and `stop_words`, that can be adjusted to improve model performance (see the combined sketch after this list).
- Feature selection: in a high-dimensional feature space, feature-selection methods (such as recursive feature elimination) can reduce the number of features and improve model performance.
- Model selection: choose a model suited to the specific task, such as Naive Bayes, a support vector machine, or a random forest.
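The sketch below touches on all three notes: it builds a `Pipeline` with a tuned `TfidfVectorizer`, a feature-selection step, and a loop over several candidate classifiers. The parameter values (`stop_words="english"`, `min_df=2`, `max_df=0.9`, `k=5`) are arbitrary choices for this tiny corpus, and `SelectKBest` with a chi-squared test stands in for recursive feature elimination because it is a cheap, common choice for sparse text features. Treat it as a starting point, not a tuned configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Candidate classifiers for the same vectorize -> select -> classify pipeline
candidate_models = {
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, clf in candidate_models.items():
    pipeline = Pipeline([
        # Tuned vectorizer: drop English stop words, ignore terms that appear
        # in fewer than 2 documents or in more than 90% of documents
        ("vect", TfidfVectorizer(stop_words="english", min_df=2, max_df=0.9)),
        # Keep the 5 terms most associated with the labels (chi-squared test);
        # k must not exceed the vocabulary size left after filtering
        ("select", SelectKBest(chi2, k=5)),
        ("clf", clf),
    ])
    pipeline.fit(texts, labels)  # reuses the tiny corpus defined above
    preds = pipeline.predict([
        "the quick fox and the lazy dog",
        "a new document about machine learning",
    ])
    print(f"{name}: {preds}")
```

On a real corpus you would compare these pipelines with a proper train/test split or cross-validation rather than refitting on all five documents.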
With the example above, you can understand the basic usage of the bag-of-words model and TF-IDF and apply them to text classification tasks.