Pytorch学习记录-更深的TorchText学习01
简单实现torchtext之后,我希望能够进一步学习torchtext。找到两个教程
1. practical-torchtext简介
有效使用torchtext的教程,包括两个部分
- 文本分类
- 词级别的语言模型
1.1 目标
torchtext的文档仍然相对不完整,目前有效地使用torchtext需要阅读相当多的代码。这组教程旨在提供使用torchtext的工作示例,以使更多用户能够充分利用这个风扇库。
1.2 使用
torchtext的当前pip版本存在一些错误,这些错误会导致某些代码运行不正确。这些错误目前只在torchtext的github存储库的主分支上修复。因此,教程建议使用以下命令从github存储库安装torchtext:
pip install --upgrade git+https://github.com/pytorch/text
2. 基于torchtext处理的文本分析
我看了一下,第一课是基于这两天反复操作的那个教程,但是作者进行了丰富和解释,就放在这里再跑一次了
2.0 简介
前面都是一致的,加载数据,预处理之后生成dataset,输入模型。
使用的数据集还是之前的Kaggle垃圾信息数据。
import pandas as pd
import numpy as np
import torch
from torch.nn import init
from torchtext.data import Field
2.1 声明Fields
Field类用于确定数据是如何预处理并转化为数字格式。
很简单。标签的预处理更加容易,因为它们已经转换为二进制编码。我们需要做的就是告诉Field类标签已经处理完毕。我们通过将use_vocab = False关键字传递给构造函数来完成此操作
tokenize=lambda x: x.split()
TEXT=Field(sequential=True, tokenize=tokenize, lower=True)
LABEL=Field(sequential=False, use_vocab=False)
2.2 创建Dataset
我们将使用TabularDataset类来读取我们的数据,因为它是csv格式(截至目前,TabularDataset处理csv,tsv和json文件)
对于列车和验证数据,我们需要处理标签。我们传入的字段必须与列的顺序相同。对于我们不使用的字段,我们传入一个元组,其中第二个元素是None
%%time
from torchtext.data import TabularDataset
tv_datafields=[
('id',None),
('comment_text',TEXT),
("toxic", LABEL),
("severe_toxic", LABEL),
("threat", LABEL),
("obscene", LABEL),
("insult", LABEL),
("identity_hate", LABEL)
]
trn,vld=TabularDataset.splits(
path=r'C:\Users\jwc19\Desktop\2001_2018jszyfz\code\data\torchtextdata',
train='train.csv',
validation='valid.csv',
format='csv',
skip_header=True,
fields=tv_datafields
)
Wall time: 4.99 ms
%%time
tst_datafields=[
('id',None),
('comment_text',TEXT)
]
tst=TabularDataset(
path=r'C:\Users\jwc19\Desktop\2001_2018jszyfz\code\data\torchtextdata\test.csv',
format='csv',
skip_header=True,
fields=tst_datafields
)
Wall time: 3.01 ms
2.3 构建字典
对于TEXT字段将单词转换为整数,需要告诉整个词汇是什么。为此,我们运行TEXT.build_vocab,传入数据集以构建词汇表。
%%time
TEXT.build_vocab(trn)
TEXT.vocab.freqs.most_common(10)
print(TEXT.vocab.freqs.most_common(10))
[('the', 78), ('to', 41), ('you', 33), ('of', 30), ('and', 26), ('a', 26), ('is', 24), ('that', 22), ('i', 20), ('if', 19)]
Wall time: 3.99 ms
在这里,dataset中每一个元素都是一个Example对象,包含有若干独立数据
# 查看trn这个field的标签
print(trn[0].__dict__.keys())
# 查看某一行中的文本,在结果中可以看到,已有的文本是已经被分好词的
print(trn[10].comment_text)
dict_keys(['comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])
['"', 'fair', 'use', 'rationale', 'for', 'image:wonju.jpg', 'thanks', 'for', 'uploading', 'image:wonju.jpg.', 'i', 'notice', 'the', 'image', 'page', 'specifies', 'that', 'the', 'image', 'is', 'being', 'used', 'under', 'fair', 'use', 'but', 'there', 'is', 'no', 'explanation', 'or', 'rationale', 'as', 'to', 'why', 'its', 'use', 'in', 'wikipedia', 'articles', 'constitutes', 'fair', 'use.', 'in', 'addition', 'to', 'the', 'boilerplate', 'fair', 'use', 'template,', 'you', 'must', 'also', 'write', 'out', 'on', 'the', 'image', 'description', 'page', 'a', 'specific', 'explanation', 'or', 'rationale', 'for', 'why', 'using', 'this', 'image', 'in', 'each', 'article', 'is', 'consistent', 'with', 'fair', 'use.', 'please', 'go', 'to', 'the', 'image', 'description', 'page', 'and', 'edit', 'it', 'to', 'include', 'a', 'fair', 'use', 'rationale.', 'if', 'you', 'have', 'uploaded', 'other', 'fair', 'use', 'media,', 'consider', 'checking', 'that', 'you', 'have', 'specified', 'the', 'fair', 'use', 'rationale', 'on', 'those', 'pages', 'too.', 'you', 'can', 'find', 'a', 'list', 'of', "'image'", 'pages', 'you', 'have', 'edited', 'by', 'clicking', 'on', 'the', '""my', 'contributions""', 'link', '(it', 'is', 'located', 'at', 'the', 'very', 'top', 'of', 'any', 'wikipedia', 'page', 'when', 'you', 'are', 'logged', 'in),', 'and', 'then', 'selecting', '""image""', 'from', 'the', 'dropdown', 'box.', 'note', 'that', 'any', 'fair', 'use', 'images', 'uploaded', 'after', '4', 'may,', '2006,', 'and', 'lacking', 'such', 'an', 'explanation', 'will', 'be', 'deleted', 'one', 'week', 'after', 'they', 'have', 'been', 'uploaded,', 'as', 'described', 'on', 'criteria', 'for', 'speedy', 'deletion.', 'if', 'you', 'have', 'any', 'questions', 'please', 'ask', 'them', 'at', 'the', 'media', 'copyright', 'questions', 'page.', 'thank', 'you.', '(talk', '•', 'contribs', '•', ')', 'unspecified', 'source', 'for', 'image:wonju.jpg', 'thanks', 'for', 'uploading', 'image:wonju.jpg.', 'i', 'noticed', 'that', 'the', "file's", 'description', 'page', 'currently', "doesn't", 'specify', 'who', 'created', 'the', 'content,', 'so', 'the', 'copyright', 'status', 'is', 'unclear.', 'if', 'you', 'did', 'not', 'create', 'this', 'file', 'yourself,', 'then', 'you', 'will', 'need', 'to', 'specify', 'the', 'owner', 'of', 'the', 'copyright.', 'if', 'you', 'obtained', 'it', 'from', 'a', 'website,', 'then', 'a', 'link', 'to', 'the', 'website', 'from', 'which', 'it', 'was', 'taken,', 'together', 'with', 'a', 'restatement', 'of', 'that', "website's", 'terms', 'of', 'use', 'of', 'its', 'content,', 'is', 'usually', 'sufficient', 'information.', 'however,', 'if', 'the', 'copyright', 'holder', 'is', 'different', 'from', 'the', "website's", 'publisher,', 'then', 'their', 'copyright', 'should', 'also', 'be', 'acknowledged.', 'as', 'well', 'as', 'adding', 'the', 'source,', 'please', 'add', 'a', 'proper', 'copyright', 'licensing', 'tag', 'if', 'the', 'file', "doesn't", 'have', 'one', 'already.', 'if', 'you', 'created/took', 'the', 'picture,', 'audio,', 'or', 'video', 'then', 'the', 'tag', 'can', 'be', 'used', 'to', 'release', 'it', 'under', 'the', 'gfdl.', 'if', 'you', 'believe', 'the', 'media', 'meets', 'the', 'criteria', 'at', 'wikipedia:fair', 'use,', 'use', 'a', 'tag', 'such', 'as', 'or', 'one', 'of', 'the', 'other', 'tags', 'listed', 'at', 'wikipedia:image', 'copyright', 'tags#fair', 'use.', 'see', 'wikipedia:image', 'copyright', 'tags', 'for', 'the', 'full', 'list', 'of', 'copyright', 'tags', 'that', 'you', 'can', 'use.', 'if', 'you', 'have', 'uploaded', 'other', 'files,', 'consider', 'checking', 'that', 'you', 'have', 'specified', 'their', 'source', 'and', 'tagged', 'them,', 'too.', 'you', 'can', 'find', 'a', 'list', 'of', 'files', 'you', 'have', 'uploaded', 'by', 'following', '[', 'this', 'link].', 'unsourced', 'and', 'untagged', 'images', 'may', 'be', 'deleted', 'one', 'week', 'after', 'they', 'have', 'been', 'tagged,', 'as', 'described', 'on', 'criteria', 'for', 'speedy', 'deletion.', 'if', 'the', 'image', 'is', 'copyrighted', 'under', 'a', 'non-free', 'license', '(per', 'wikipedia:fair', 'use)', 'then', 'the', 'image', 'will', 'be', 'deleted', '48', 'hours', 'after', '.', 'if', 'you', 'have', 'any', 'questions', 'please', 'ask', 'them', 'at', 'the', 'media', 'copyright', 'questions', 'page.', 'thank', 'you.', '(talk', '•', 'contribs', '•', ')', '"']
2.4 构建迭代器
在训练期间,将使用一种称为BucketIterator的特殊迭代器。当数据传递到神经网络时,我们希望将数据填充为相同的长度,以便我们可以批量处理它们。
如果序列的长度差异很大,则填充将消耗大量浪费的内存和时间。BucketIterator将每个批次的相似长度的序列组合在一起,以最小化填充。
from torchtext.data import Iterator, BucketIterator
# sort_key就是告诉BucketIterator使用哪个key值去进行组合,很明显,在这里是comment_text
# repeat设定为False是因为之后要打包这个迭代层
train_iter, val_iter=BucketIterator.splits(
(trn,vld),
batch_sizes=(64,64),
device=-1,
sort_key=lambda x:len(x.comment_text),
sort_within_batch=False,
repeat=False
)
# 现在就可以看一下输出的BucketIterator是怎样的。
batch=next(train_iter.__iter__());batch
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
[torchtext.data.batch.Batch of size 25]
[.comment_text]:[torch.LongTensor of size 494x25]
[.toxic]:[torch.LongTensor of size 25]
[.severe_toxic]:[torch.LongTensor of size 25]
[.threat]:[torch.LongTensor of size 25]
[.obscene]:[torch.LongTensor of size 25]
[.insult]:[torch.LongTensor of size 25]
[.identity_hate]:[torch.LongTensor of size 25]
batch.__dict__.keys()
dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])
test_iter = Iterator(tst, batch_size=64, device=-1, sort=False, sort_within_batch=False, repeat=False)
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
2.5 打包迭代器
目前,迭代器返回一个名为torchtext.data.Batch的自定义数据类型。这使得代码重用变得困难(因为每次列名更改时,我们都需要修改代码),并且使得torchtext很难与其他库一起用于某些用例(例如torchsample和fastai)。
这里教程将写了一个简单的包装器,使批量易于使用。具体地说,我们将批处理转换为元素形式(x,y),其中x是自变量(模型的输入),y是因变量(监督数据)。
class BatchWrapper:
def __init__(self, dl, x_var, y_vars):
self.dl, self.x_var, self.y_vars = dl, x_var, y_vars # we pass in the list of attributes for x and y
def __iter__(self):
for batch in self.dl:
x = getattr(batch, self.x_var) # we assume only one input in this wrapper
if self.y_vars is not None: # we will concatenate y into a single tensor
y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_vars], dim=1).float()
else:
y = torch.zeros((1))
yield (x, y)
def __len__(self):
return len(self.dl)
train_dl = BatchWrapper(train_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
valid_dl = BatchWrapper(val_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
test_dl = BatchWrapper(test_iter, "comment_text", None)
验证一下,这里有一个理解,iter方法是用来迭代出tensor的?似乎这样是可以的。
next(train_dl.__iter__())
(tensor([[ 63, 66, 354, ..., 334, 453, 778],
[ 4, 82, 63, ..., 55, 523, 650],
[664, 2, 4, ..., 520, 30, 22],
...,
[ 1, 1, 1, ..., 1, 1, 1],
[ 1, 1, 1, ..., 1, 1, 1],
[ 1, 1, 1, ..., 1, 1, 1]]),
tensor([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[1., 1., 0., 1., 1., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]]))
2.6 训练一个文本分类器
依旧是LSTM
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
class LSTM(nn.Module):
def __init__(self, hidden_dim, emb_dim=300, num_linear=1):
super().__init__()
self.embedding = nn.Embedding(len(TEXT.vocab), emb_dim)
self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=1)
self.linear_layers = []
for _ in range(num_linear - 1):
self.linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
self.linear_layer = nn.ModuleList(self.linear_layers)
self.predictor = nn.Linear(hidden_dim, 6)
def forward(self, seq):
hdn, _ = self.encoder(self.embedding(seq))
feature = hdn[-1, :, :]
for layer in self.linear_layers:
feature = layer(feature)
preds = self.predictor(feature)
return preds
em_sz = 100
nh = 500
nl = 3
model = LSTM(nh, emb_dim=em_sz)
%%time
import tqdm
opt=optim.Adam(model.parameters(),lr=1e-2)
loss_func=nn.BCEWithLogitsLoss()
epochs=2
Wall time: 0 ns
for epoch in range(1, epochs + 1):
running_loss = 0.0
running_corrects = 0
model.train()
for x, y in tqdm.tqdm(train_dl):
opt.zero_grad()
preds = model(x)
loss = loss_func(y, preds)
loss.backward()
opt.step()
running_loss += loss.item()* x.size(0)
epoch_loss = running_loss / len(trn)
val_loss = 0.0
model.eval() # 评估模式
for x, y in valid_dl:
preds = model(x)
loss = loss_func(y, preds)
val_loss += loss.item()* x.size(0)
val_loss /= len(vld)
print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))
test_preds = []
for x, y in tqdm.tqdm(test_dl):
preds = model(x)
preds = preds.data.numpy()
# 模型的实际输出是logit,所以再经过一个sigmoid函数
preds = 1 / (1 + np.exp(-preds))
test_preds.append(preds)
test_preds = np.hstack(test_preds)
print(test_preds)
100%|██████████| 1/1 [00:06<00:00, 6.28s/it]
Epoch: 1, Training Loss: 14.2130, Validation Loss: 4.4170
100%|██████████| 1/1 [00:04<00:00, 4.20s/it]
Epoch: 2, Training Loss: 10.5315, Validation Loss: 3.3947
100%|██████████| 1/1 [00:00<00:00, 2.87it/s]
[[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99978834 0.99761593 0.5279695 0.9961003 0.9957486 0.3662841 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]
[0.99982786 0.99812 0.53367174 0.99682033 0.9966144 0.3649216 ]]
2.7 测试数据并查看
test_preds = []
for x, y in tqdm.tqdm(test_dl):
preds = model(x)
# if you're data is on the GPU, you need to move the data back to the cpu
# preds = preds.data.cpu().numpy()
preds = preds.data.numpy()
# the actual outputs of the model are logits, so we need to pass these values to the sigmoid function
preds = 1 / (1 + np.exp(-preds))
test_preds.append(preds)
test_preds = np.hstack(test_preds)
100%|██████████| 1/1 [00:00<00:00, 2.77it/s]
df = pd.read_csv("./data/torchtextdata/test.csv")
for i, col in enumerate(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]):
df[col] = test_preds[:, i]
df.head(3)