That wraps up pad_sequences(). Next, let's step back up the call chain: train.py -> model.train() -> base_model.py -> train() -> run_epoch() -> ner_model.py -> run_epoch() -> get_feed_dict()
def get_feed_dict(self, words, labels=None, lr=None, dropout=None):
    """Given some data, pad it and build a feed dictionary

    Args:
        words: list of sentences. A sentence is a list of ids of a
            list of words. A word is a list of ids
        labels: list of ids
        lr: (float) learning rate
        dropout: (float) keep prob

    Returns:
        dict {placeholder: value}

    """
    # perform padding of the given data
    if self.config.use_chars:
        char_ids, word_ids = zip(*words)
        word_ids, sequence_lengths = pad_sequences(word_ids, 0)
        char_ids, word_lengths = pad_sequences(char_ids, pad_tok=0,
                nlevels=2)
    else:
        word_ids, sequence_lengths = pad_sequences(words, 0)

    # build feed dictionary
    feed = {
        self.word_ids: word_ids,
        self.sequence_lengths: sequence_lengths
    }

    if self.config.use_chars:
        feed[self.char_ids] = char_ids
        feed[self.word_lengths] = word_lengths

    if labels is not None:
        labels, _ = pad_sequences(labels, 0)
        feed[self.labels] = labels

    if lr is not None:
        feed[self.lr] = lr

    if dropout is not None:
        feed[self.dropout] = dropout

    return feed, sequence_lengths
After building the padded word_ids and sequence_lengths, we construct a dictionary feed and fill in the parameters that are always needed: word_ids and sequence_lengths. If char embeddings are in use (config.use_chars), we also add char_ids and word_lengths. If labels is not None, the labels are padded as well, which is easy to understand: the data has been padded, so the corresponding labels must be padded to the same length.
- If the lr (learning rate) argument is not None, it is also recorded in feed; the dropout argument is handled the same way. At the very end, feed and sequence_lengths are returned. A padding sketch follows below.
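To make the padding concrete, here is a minimal self-contained sketch of what pad_sequences does at nlevels=1 (the toy ids are made up; the implementation in the repo's data_utils.py differs in detail but produces the same kind of output):

def pad_batch(sequences, pad_tok=0):
    """Pad every sequence to the length of the longest one in the batch.

    Returns the padded batch plus the original lengths, mirroring
    what pad_sequences(word_ids, 0) produces.
    """
    max_length = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_tok] * (max_length - len(seq))
              for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths

# two toy sentences of word ids, lengths 3 and 5
word_ids, sequence_lengths = pad_batch([[4, 7, 1], [2, 9, 9, 3, 6]])
# word_ids         -> [[4, 7, 1, 0, 0], [2, 9, 9, 3, 6]]
# sequence_lengths -> [3, 5]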
That wraps up get_feed_dict(); now back to run_epoch().
train.py -> model.train() -> base_model.py -> train() -> run_epoch() -> ner_model.py -> run_epoch()
def run_epoch(self, train, dev, epoch):
    """Performs one complete pass over the train set and evaluate on dev

    Args:
        train: dataset that yields tuple of sentences, tags
        dev: dataset
        epoch: (int) index of the current epoch

    Returns:
        f1: (python float), score to select model on, higher is better

    """
    # progbar stuff for logging
    batch_size = self.config.batch_size
    nbatches = (len(train) + batch_size - 1) // batch_size
    prog = Progbar(target=nbatches)

    # iterate over dataset
    for i, (words, labels) in enumerate(minibatches(train, batch_size)):
        fd, _ = self.get_feed_dict(words, labels, self.config.lr,
                self.config.dropout)

        _, train_loss, summary = self.sess.run(
                [self.train_op, self.loss, self.merged], feed_dict=fd)

        prog.update(i + 1, [("train loss", train_loss)])

        # tensorboard
        if i % 10 == 0:
            self.file_writer.add_summary(summary, epoch*nbatches + i)

    metrics = self.run_evaluate(dev)
    msg = " - ".join(["{} {:04.2f}".format(k, v)
            for k, v in metrics.items()])
    self.logger.info(msg)

    return metrics["f1"]
get_feed_dict() returns feed (a dict) and sequence_lengths (a list).
Here we do not need sequence_lengths; we only need feed.
- Next comes the session.run call. In TensorFlow (1.x graph mode) we first lay out the computation graph, then launch a session and call tf.Session.run() to execute it. Concise and efficient.
In real code, this is usually written as:

with tf.Session() as sess:
    sess.run(...)

The function has the following signature:

run(fetches, feed_dict=None, options=None, run_metadata=None)

Arguments:
The fetches argument can be a single graph element, or an arbitrarily nested list, tuple, namedtuple, dict, or OrderedDict containing graph elements, e.g.:

sess.run([train_step, loss_mse], feed_dict=...)

Return value:
The return value has the same shape as the fetches argument.
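A tiny TensorFlow 1.x example of this shape-mirroring behaviour (toy constants, purely illustrative):

import tensorflow as tf  # TensorFlow 1.x graph mode

a = tf.constant(2.0)
b = tf.constant(3.0)
c = a * b

with tf.Session() as sess:
    # a nested fetches structure returns results in the same structure
    prod, pair = sess.run([c, (a, b)])
    # prod -> 6.0, pair -> (2.0, 3.0)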
So now we know what is happening here: we pass the session a nested list of graph elements (train_op, loss, merged) and feed in the data (a dict containing word_ids, char_ids, word_lengths, labels, learning_rate and dropout). train_op is the optimizer's training op, loss is the loss value, and merged is the merged TensorBoard summary; of the three return values we only keep the last two.
- Next, the Progbar class is used to print the progress bar together with the running train loss, as sketched below.
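Progbar lives in the repo's general_utils.py and is essentially the Keras progress bar: each prog.update(i + 1, [("train loss", train_loss)]) call advances the bar and folds the new value into a running average. A stripped-down sketch of that behaviour (illustrative, not the repo's actual class):

import sys

class TinyProgbar:
    """Minimal stand-in for general_utils.Progbar: counts steps and
    shows a running average for every named metric."""
    def __init__(self, target):
        self.target = target
        self.totals = {}  # metric name -> [sum, count]

    def update(self, current, values):
        for name, v in values:
            total = self.totals.setdefault(name, [0.0, 0])
            total[0] += v
            total[1] += 1
        shown = " - ".join("{}: {:.4f}".format(k, s / n)
                           for k, (s, n) in self.totals.items())
        sys.stdout.write("\r{}/{} {}".format(current, self.target, shown))
        sys.stdout.flush()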
- Then, every 10 batches (i % 10 == 0), the summary is written out so the results show up in TensorBoard. Here file_writer is defined in base_model.py -> add_summary(self):
self.file_writer = tf.summary.FileWriter(self.config.dir_output,
        self.sess.graph)
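merged, for its part, comes from tf.summary.merge_all(), set up in the same add_summary() method. A minimal TF1-style sketch of this summary pipeline end to end (the toy scalar and log directory are assumptions for illustration):

import tensorflow as tf  # TensorFlow 1.x

loss = tf.placeholder(tf.float32, name="loss")
tf.summary.scalar("loss", loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter("results/demo", sess.graph)
    for step in range(100):
        summary = sess.run(merged, feed_dict={loss: 1.0 / (step + 1)})
        if step % 10 == 0:
            # same pattern as run_epoch: a monotonically increasing global step
            writer.add_summary(summary, step)
    writer.close()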
- Finally, the metrics are logged. They come from self.run_evaluate(dev), which runs one evaluation pass over the dev set. This function is quite useful; it will be needed again later when evaluating on the test set. Let's take a look.