The goal of this post is to work through the paper ACL 2017 - Deep Pyramid Convolutional Neural Networks for Text Categorization.
This was also my first time using the fastNLP framework, so I have written up the problems I ran into with it along the way.
I may later write a separate post analyzing the model construction, covering DPCNN's shortcut mechanism and the Pyramid principle.
Main reference: the blog of 与阳光共进早餐
Dataset: the AG-News dataset
fastNLP: fastNLP
Data preprocessing
The dataset already comes with the labels and texts separated, and it contains only four categories of news: 'Business', 'Sci/Tech', 'World', 'Sports'. For convenience later, I will merge the labels and texts and convert the labels to a numeric type. Note that fastNLP requires labels to start from 0.
import pandas as pd

# read the raw labels and map the four news categories to integers
# (fastNLP requires labels to start from 0)
data_label = pd.read_table("train_labels.txt", sep="\t")
type_mapping = {
    'Business': 0,
    'Sci/Tech': 1,
    'World': 2,
    'Sports': 3
}
data_label.columns = ['label']
data_label['label'] = data_label['label'].map(type_mapping)
label
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
Merge the two:
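A minimal sketch of the merge, assuming the raw texts live in a train_texts.txt file (a hypothetical filename) next to the labels:
data_text = pd.read_table("train_texts.txt", sep="\t")
data_text.columns = ['text']

# put text first and label second, matching the preview below
data_train = pd.concat([data_text, data_label], axis=1)
print(data_train.head(14))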
text label
0 Wall St. Bears Claw Back Into the Black (Reute... 0
1 Carlyle Looks Toward Commercial Aerospace (Reu... 0
2 Oil and Economy Cloud Stocks' Outlook (Reuters... 0
3 Iraq Halts Oil Exports from Main Southern Pipe... 0
4 Oil prices soar to all-time record posing new... 0
5 Stocks End Up But Near Year Lows (Reuters) Re... 0
6 Money Funds Fell in Latest Week (AP) AP - Asse... 0
7 Fed minutes show dissent over inflation (USATO... 0
8 Safety Net (Forbes.com) Forbes.com - After ear... 0
9 Wall St. Bears Claw Back Into the Black NEW Y... 0
10 Oil and Economy Cloud Stocks' Outlook NEW YOR... 0
11 No Need for OPEC to Pump More-Iran Gov TEHRAN... 0
12 Non-OPEC Nations Should Up Output-Purnomo JAK... 0
13 Google IPO Auction Off to Rocky Start WASHING... 0
Process the test set the same way, then save both to disk:
data_train.to_csv("data_train.txt")
data_test.to_csv("data_test.txt")
Everything above is pandas-based processing; next we use fastNLP's interfaces to prepare the data, which makes the network construction later more convenient.
Next comes the traditional word-embedding step. The paper uses an unsupervised embedding method, but I did not fully understand it, so I used a conventional embedding instead: build a vocabulary, convert words to indices, and then bring all sentences to the same length by appending '0's; the paper seems to call this step PAD. (It might also be possible to weight the word indices with TF-IDF, or to use GloVe directly?)
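For the GloVe idea mentioned in the aside, a rough sketch of how pre-trained vectors could be loaded, assuming a vectors file such as glove.6B.300d.txt (hypothetical path) and a word-to-index dict like the one the vocabulary below would provide (e.g. fastNLP's vocab.word2idx, if that attribute is available):
import numpy as np

def load_glove(path, word2idx, dim=300):
    # start from the same small-normal init the model uses,
    # then overwrite rows for words that have a pre-trained vector
    weights = np.random.normal(0, 0.01, (len(word2idx), dim)).astype('float32')
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word, vec = parts[0], parts[1:]
            if word in word2idx:
                weights[word2idx[word]] = np.asarray(vec, dtype='float32')
    return weights

# usage sketch: weights = load_glove('glove.6B.300d.txt', vocab.word2idx)
# then: model.embedding.weight.data.copy_(torch.from_numpy(weights))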
Load the data:
from fastNLP import DataSet
from fastNLP import Instance
from fastNLP import Vocabulary
# 'a' is the leftmost index column produced by the merge above;
# DataSet seems to require reading in every column
data_train = DataSet.read_csv("data_train.txt", headers=('a', 'text', 'label'))
data_test=DataSet.read_csv("data_test.txt",headers=('a','text','label'))
Lowercase the text and convert the labels to int:
data_train.apply(lambda x: int(x['label']),new_field_name='label')
data_train.apply(lambda x: x['text'].lower(), new_field_name='text')
data_test.apply(lambda x: int(x['label']),new_field_name='label')
data_test.apply(lambda x: x['text'].lower(), new_field_name='text')
Tokenization:
def split_sent(instance):
    return instance['text'].split()

data_train.apply(split_sent, new_field_name='description_words')
data_test.apply(split_sent, new_field_name='description_words')
Record each sentence's token count and find the maximum length; this determines how many '0's to pad with, and is later passed to the network as the max_sentence_length parameter.
data_train.apply(lambda x: len(x['description_words']), new_field_name='description_seq_len')
data_test.apply(lambda x: len(x['description_words']), new_field_name='description_seq_len')

max_seq_len_train = 0
max_seq_len_test = 0
for i in range(len(data_train)):
    if data_train[i]['description_seq_len'] > max_seq_len_train:
        max_seq_len_train = data_train[i]['description_seq_len']
for i in range(len(data_test)):
    if data_test[i]['description_seq_len'] > max_seq_len_test:
        max_seq_len_test = data_test[i]['description_seq_len']

max_sentence_length = max(max_seq_len_train, max_seq_len_test)
print('max_sentence_length:', max_sentence_length)
Build the vocabulary from the training set:
# words appearing fewer than 2 times are treated as unknown
vocab = Vocabulary(min_freq=2)
data_train.apply(lambda x: [vocab.add(word) for word in x['description_words']])
vocab.build_vocab()

# index both the training and test sets with the training vocabulary
data_train.apply(lambda x: [vocab.to_index(word) for word in x['description_words']], new_field_name='description_words')
data_test.apply(lambda x: [vocab.to_index(word) for word in x['description_words']], new_field_name='description_words')
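Because min_freq=2, words seen only once map to the unknown index. The resulting vocabulary size is also what the network later needs as max_features; a quick check (a sketch, assuming fastNLP's Vocabulary supports len()):
print('vocab size:', len(vocab))  # later used as max_features for the embedding layer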
PADDING
def padding_words(data):
    # append index 0 until every sentence reaches max_sentence_length
    for i in range(len(data)):
        if data[i]['description_seq_len'] <= max_sentence_length:
            padding = [0] * (max_sentence_length - data[i]['description_seq_len'])
            data[i]['description_words'] += padding
    return data

data_train = padding_words(data_train)
data_test = padding_words(data_test)

# refresh the length field after padding
data_train.apply(lambda x: len(x['description_words']), new_field_name='description_seq_len')
data_test.apply(lambda x: len(x['description_words']), new_field_name='description_seq_len')
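A quick sanity check that every sentence now has the same length (a sketch, using the same indexing style as the code above):
for i in range(len(data_train)):
    assert data_train[i]['description_seq_len'] == max_sentence_length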
fastNLP requires marking which fields of the dataset are inputs and which are outputs, i.e. setting input and target. The purpose of rename_field will be explained later.
data_train.rename_field("description_words","description_word_seq")
data_train.rename_field("label","label_seq")
data_test.rename_field("description_words","description_word_seq")
data_test.rename_field("label","label_seq")
data_train.set_input("description_word_seq")
data_test.set_input("description_word_seq")
data_train.set_target("label_seq")
data_test.set_target("label_seq")
print("dataset processed successfully!")
Building the network
DPCNN incorporates the shortcut mechanism of ResNet.
The earlier rename_field calls were needed because, in fastNLP, a model's forward parameters are matched by name against the dataset's input fields, so the field name must agree with the forward signature below (description_word_seq), and forward must return a dict.
For details, see the fastNLP Tutorials.
import torch
import torch.nn as nn

class ResnetBlock(nn.Module):
    def __init__(self, channel_size):
        super(ResnetBlock, self).__init__()
        self.channel_size = channel_size
        # downsampling: pad by 1 on the right so odd lengths survive,
        # then halve the sequence length with stride-2 pooling
        self.maxpool = nn.Sequential(
            nn.ConstantPad1d(padding=(0, 1), value=0),
            nn.MaxPool1d(kernel_size=3, stride=2)
        )
        # two pre-activation convolutions that preserve the sequence length
        self.conv = nn.Sequential(
            nn.BatchNorm1d(num_features=self.channel_size),
            nn.ReLU(),
            nn.Conv1d(self.channel_size, self.channel_size, kernel_size=3, padding=1),
            nn.BatchNorm1d(num_features=self.channel_size),
            nn.ReLU(),
            nn.Conv1d(self.channel_size, self.channel_size, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # shortcut: the pooled input is added back after the convolutions
        x_shortcut = self.maxpool(x)
        x = self.conv(x_shortcut)
        x = x + x_shortcut
        return x
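A quick shape check of the block's downsampling (a sketch; the batch size and sequence length here are arbitrary):
block = ResnetBlock(channel_size=250).eval()  # eval() so BatchNorm uses running stats
x = torch.randn(8, 250, 64)                   # (batch, channels, seq_len)
print(block(x).shape)                         # torch.Size([8, 250, 32]): seq_len halved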
class DPCNN(nn.Module):
    def __init__(self, max_features, word_embedding_dimension, max_sentence_length, num_classes):
        super(DPCNN, self).__init__()
        self.max_features = max_features
        self.embed_size = word_embedding_dimension
        self.maxlen = max_sentence_length
        self.num_classes = num_classes
        self.channel_size = 250

        self.embedding = nn.Embedding(self.max_features, self.embed_size)
        torch.nn.init.normal_(self.embedding.weight.data, mean=0, std=0.01)
        self.embedding.weight.requires_grad = True

        # region embedding: project word embeddings to 250 channels
        self.region_embedding = nn.Sequential(
            nn.Conv1d(self.embed_size, self.channel_size, kernel_size=3, padding=1),
            nn.BatchNorm1d(num_features=self.channel_size),
            nn.ReLU(),
            nn.Dropout(0.2)
        )

        self.conv_block = nn.Sequential(
            nn.BatchNorm1d(num_features=self.channel_size),
            nn.ReLU(),
            nn.Conv1d(self.channel_size, self.channel_size, kernel_size=3, padding=1),
            nn.BatchNorm1d(num_features=self.channel_size),
            nn.ReLU(),
            nn.Conv1d(self.channel_size, self.channel_size, kernel_size=3, padding=1),
        )

        # stack ResnetBlocks until the sequence length shrinks to 2: the "pyramid"
        self.seq_len = self.maxlen
        resnet_block_list = []
        while self.seq_len > 2:
            resnet_block_list.append(ResnetBlock(self.channel_size))
            self.seq_len = self.seq_len // 2
        self.resnet_layer = nn.Sequential(*resnet_block_list)

        self.fc = nn.Sequential(
            nn.Linear(self.channel_size * self.seq_len, self.num_classes),
            nn.BatchNorm1d(self.num_classes),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(self.num_classes, self.num_classes)
        )

    def forward(self, description_word_seq):
        # the parameter name must match the dataset's input field name
        x = self.embedding(description_word_seq)
        x = x.permute(0, 2, 1)  # (batch, embed, seq_len) for Conv1d
        x = self.region_embedding(x)
        x = self.conv_block(x)
        x = self.resnet_layer(x)
        x = x.permute(0, 2, 1)
        x = x.contiguous().view(x.size(0), -1)
        output = self.fc(x)
        return {'output': output}

    def predict(self, description_word_seq):
        """
        :param description_word_seq: torch.LongTensor, [batch_size, seq_len]
        :return predict: dict of torch.LongTensor, [batch_size]
        """
        output = self(description_word_seq)
        _, predict = output['output'].max(dim=1)
        return {'predict': predict}
Some of the hyperparameters used:
word_embedding_dimension = 300
num_classes = 4
pickle_path = 'result/'
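With these values the model can be instantiated. A minimal sketch, assuming max_features is the size of the vocabulary built earlier:
max_features = len(vocab)  # assumption: the Vocabulary from the preprocessing step
model = DPCNN(max_features, word_embedding_dimension, max_sentence_length, num_classes)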
For training and testing the network, see the blog of 与阳光共进早餐.
One thing to note: fastNLP seems to be built against PyTorch 0.4, while I am using 1.0.1. With the original blogger's setting save_path=None, fastNLP falls back to a dict from version 0.4 (i.e. the model parameters), which does not match my version:
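The loss, metric, and Adam in the snippet below are not defined in this post; with fastNLP's built-ins they would presumably look like this (a sketch, assuming fastNLP 0.4's CrossEntropyLoss, AccuracyMetric, and Adam wrapper, with the pred/target names matching the dicts returned by forward and predict above):
from fastNLP import Trainer
from fastNLP import CrossEntropyLoss, AccuracyMetric
from fastNLP import Adam  # fastNLP's optimizer wrapper, not torch.optim.Adam

loss = CrossEntropyLoss(pred='output', target='label_seq')
metric = AccuracyMetric(pred='predict', target='label_seq')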
trainer = Trainer(model=model, train_data=data_train, dev_data=data_test, loss=loss, metrics=metric,
                  save_path=None, batch_size=64, n_epochs=5,
                  optimizer=Adam(lr=0.001, weight_decay=0.0001))
trainer.train()
So when calling this interface I set save_path, so that it loads the state dict of the model saved during training:
trainer = Trainer(model=model, train_data=data_train, dev_data=data_test, loss=loss, metrics=metric,
                  save_path='CD', batch_size=64, n_epochs=5,
                  optimizer=Adam(lr=0.001, weight_decay=0.0001))
trainer.train()
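The [tester] block in the log below is produced by fastNLP's Tester; it might be invoked roughly like this (a sketch, reusing the metric defined above):
from fastNLP import Tester

tester = Tester(data=data_test, model=model, metrics=metric, batch_size=64)
tester.test()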
Final results:
Evaluation at Epoch 1/5. Step:1875/9375. AccuracyMetric: acc=0.881053
Evaluation at Epoch 2/5. Step:3750/9375. AccuracyMetric: acc=0.884474
Evaluation at Epoch 3/5. Step:5625/9375. AccuracyMetric: acc=0.889342
Evaluation at Epoch 4/5. Step:7500/9375. AccuracyMetric: acc=0.889737
Evaluation at Epoch 5/5. Step:9375/9375. AccuracyMetric: acc=0.895789
In Epoch:5/Step:9375, got best dev performance:AccuracyMetric: acc=0.895789
Reloaded the best model.
new_model.pkl saved in result/
[tester]
AccuracyMetric: acc=0.895789
{'AccuracyMetric': {'acc': 0.895789}}