Text and image data differ greatly. An image encodes an RGB color value for every pixel, each in the range 0-255, so the data is very regular. Text has no fixed length, and the individual words in each passage vary. Text data therefore has to be preprocessed to make it suitable for machine learning.
Text goes through the following steps (a minimal sketch follows the list):
- Build a tokenizer: assign each word an integer code
- Digitize: convert the text into integer sequences according to the tokenizer
- Unify the text length: choose a fixed length, then truncate longer texts and pad shorter ones
- Embed the data: map each integer to a vector; see TensorFlow's description of embeddings. In code, this is done by adding an Embedding layer
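Below is a minimal sketch of these four steps on a toy two-sentence corpus; the corpus and the sizes (num_words=100, maxlen=6, output_dim=8) are illustrative, not the values used in the main example later.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding

texts = ["the movie was great", "the movie was boring"]  # toy corpus

# 1) Build a tokenizer: every word gets an integer code
token = Tokenizer(num_words=100)
token.fit_on_texts(texts)

# 2) Digitize: text -> sequences of integers
seqs = token.texts_to_sequences(texts)  # e.g. [[1, 2, 3, 4], [1, 2, 3, 5]]

# 3) Unify the length: pad (or truncate) to a fixed length
padded = sequence.pad_sequences(seqs, maxlen=6)  # shape (2, 6)

# 4) Embed: an Embedding layer maps each integer to a dense vector
model = Sequential()
model.add(Embedding(input_dim=100, output_dim=8, input_length=6))
print(model.predict(padded).shape)  # (2, 6, 8)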
Ways to improve accuracy (sketched after this list):
- Increase the tokenizer vocabulary size and lengthen the unified text length
- Replace the MLP/CNN with an RNN or LSTM; an MLP/CNN treats the input as static data, and once the problem involves a sequence ordered in time, an RNN/LSTM is needed
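Both ideas appear in the full listing below: the commented-out SimpleRNN/LSTM lines are drop-in replacements for Flatten(). As a self-contained sketch, with a hypothetical build_model helper whose vocab/maxlen defaults and layer sizes are illustrative:

from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN, LSTM

def build_model(variant="mlp", vocab=2000, maxlen=100):
    # vocab and maxlen are the two knobs for the first improvement
    model = Sequential()
    model.add(Embedding(output_dim=32, input_dim=vocab, input_length=maxlen))
    model.add(Dropout(0.2))
    if variant == "mlp":
        model.add(Flatten())            # static view of all timesteps at once
    elif variant == "rnn":
        model.add(SimpleRNN(units=16))  # reads the timesteps in order
    else:
        model.add(LSTM(32))             # gated recurrence
    model.add(Dense(units=256, activation="relu"))
    model.add(Dropout(0.25))
    model.add(Dense(units=1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model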
Movie review data source: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
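A sketch of fetching and unpacking the archive into the data/ directory that the code below expects (assumes network access; uses only the standard library):

import os
import tarfile
import urllib.request

url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
if not os.path.exists("data/aclImdb"):
    os.makedirs("data", exist_ok=True)
    fname, _ = urllib.request.urlretrieve(url, "aclImdb_v1.tar.gz")
    with tarfile.open(fname) as tar:  # auto-detects gzip compression
        tar.extractall("data")        # creates data/aclImdb/...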
import os
import re
import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN, LSTM

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # hide TensorFlow INFO/WARNING logs
np.random.seed(10)                        # fix the seed for reproducibility
def rm_tags(text):
    """Strip HTML tags (e.g. <br />) from the raw review text."""
    re_tag = re.compile(r'<[^>]+>')
    return re_tag.sub('', text)
def read_files(filetype):
    """Read the IMDB reviews under data/aclImdb/{train,test}."""
    path = "data/aclImdb/"
    file_list = []
    positive_path = os.path.join(path, filetype, "pos")
    for f in os.listdir(positive_path):
        file_list += [os.path.join(positive_path, f)]
    negative_path = os.path.join(path, filetype, "neg")
    for f in os.listdir(negative_path):
        file_list += [os.path.join(negative_path, f)]
    print("read ", filetype, " files: ", len(file_list))
    # The standard IMDB split has 12500 positive then 12500 negative
    # reviews, read in that order, so the labels can be generated directly
    all_labels = [1] * 12500 + [0] * 12500
    all_text = []
    for fi in file_list:
        with open(fi, encoding='utf8') as file_input:
            all_text += [rm_tags(" ".join(file_input.readlines()))]
    return all_labels, all_text
y_train_label, train_text = read_files("train")
y_test_label, test_text = read_files("test")
# convert the label lists to arrays for Keras
y_train_label = np.array(y_train_label)
y_test_label = np.array(y_test_label)
# Build a 2000-word vocabulary from the training texts
token = Tokenizer(num_words=2000)
token.fit_on_texts(train_text)
print(token.document_count)
#print(token.word_index)
print("*" * 20)
# Digitize: convert each review into a sequence of word indices
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)
# Unify the length: pad/truncate every sequence to 100 tokens
x_train = sequence.pad_sequences(x_train_seq, maxlen=100)
x_test = sequence.pad_sequences(x_test_seq, maxlen=100)
print(train_text[0])
print("*" * 20)
print("len = ", len(x_train_seq[0]))
print(x_train_seq[0])
print("*" * 20)
print("len = ", len(x_train[0]))
print(x_train[0])
model = Sequential()
# Map each of the 2000 word indices to a 32-dimensional vector
model.add(Embedding(output_dim=32,
                    input_dim=2000,
                    input_length=100))
model.add(Dropout(0.2))
# MLP, RNN, LSTM: enable exactly one of the next three lines
model.add(Flatten())
#model.add(SimpleRNN(units=16))
#model.add(LSTM(32))
model.add(Dense(units=256, activation="relu"))
model.add(Dropout(0.25))
model.add(Dense(units=1, activation="sigmoid"))
model.summary()
model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
train_history = model.fit(x_train, y_train_label,
                          batch_size=100,
                          epochs=10,
                          verbose=2,
                          validation_split=0.2)
def show_train_history(train_history, train, val):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[val])
    plt.title("Train History")
    plt.ylabel(train)
    plt.xlabel("Epochs")
    plt.legend(["train", "validation"], loc="upper left")
    plt.show()

# Standalone Keras logs the metric under "acc"/"val_acc";
# under tf.keras the keys are "accuracy"/"val_accuracy"
show_train_history(train_history, "acc", "val_acc")
show_train_history(train_history, "loss", "val_loss")
scores = model.evaluate(x_test, y_test_label, verbose=1)
print("    loss: ", scores[0])
print("accuracy: ", scores[1])
predict = model.predict_classes(x_test)
print(predict[:10])

def display_test_sentiment(i):
    sem = {1: "positive", 0: "negative"}
    print(test_text[i])
    print("   real: ", sem[y_test_label[i]])
    print("predict: ", sem[predict[i][0]])

display_test_sentiment(2)
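Note: predict_classes was removed from Sequential in TensorFlow 2.6. On newer versions, an equivalent for this model's single sigmoid output is to threshold model.predict:

predict = (model.predict(x_test) > 0.5).astype("int32")  # shape (N, 1), so predict[i][0] still works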
Network structure:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 100, 32) 64000
_________________________________________________________________
dropout_1 (Dropout) (None, 100, 32) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 3200) 0
_________________________________________________________________
dense_1 (Dense) (None, 256) 819456
_________________________________________________________________
dropout_2 (Dropout) (None, 256) 0
_________________________________________________________________
dense_2 (Dense) (None, 1) 257
=================================================================
Total params: 883,713
Trainable params: 883,713
Non-trainable params: 0
_________________________________________________________________
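These counts follow directly from the layer shapes: the Embedding layer stores 2000 × 32 = 64,000 weights; Flatten turns the (100, 32) output into a 3,200-dimensional vector, so the first Dense layer has 3200 × 256 + 256 = 819,456 parameters (weights plus biases); and the output layer adds 256 × 1 + 1 = 257.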
Training results:
Test-set accuracy:
- MLP: accuracy: 0.81576
- RNN: accuracy: 0.8096
- LSTM: accuracy: 0.83528
The RNN does not seem to do better than the MLP here; the reason needs further study. One plausible factor (an assumption, not verified here): a SimpleRNN with only 16 units is prone to vanishing gradients over the 100-step sequences and struggles to carry information across a whole review, while the LSTM's gating mitigates this, which would be consistent with the LSTM scoring highest.