Named Entity Recognition Explained in Detail (Part 17)

With pad_sequences() covered, we now step back down the call chain: train.py -> model.train() -> base_model.py -> train() -> run_epoch() -> ner_model.py -> run_epoch() -> get_feed_dict()

    def get_feed_dict(self, words, labels=None, lr=None, dropout=None):
        """Given some data, pad it and build a feed dictionary

        Args:
            words: list of sentences. A sentence is a list of ids of a list of
                words. A word is a list of ids
            labels: list of ids
            lr: (float) learning rate
            dropout: (float) keep prob

        Returns:
            dict {placeholder: value}

        """
        # perform padding of the given data
        if self.config.use_chars:
            char_ids, word_ids = zip(*words)
            word_ids, sequence_lengths = pad_sequences(word_ids, 0)
            char_ids, word_lengths = pad_sequences(char_ids, pad_tok=0,
                nlevels=2)
        else:
            word_ids, sequence_lengths = pad_sequences(words, 0)

        # build feed dictionary
        feed = {
            self.word_ids: word_ids,
            self.sequence_lengths: sequence_lengths
        }

        if self.config.use_chars:
            feed[self.char_ids] = char_ids
            feed[self.word_lengths] = word_lengths

        if labels is not None:
            labels, _ = pad_sequences(labels, 0)
            feed[self.labels] = labels

        if lr is not None:
            feed[self.lr] = lr

        if dropout is not None:
            feed[self.dropout] = dropout

        return feed, sequence_lengths

After building the padded word_ids and sequence_lengths, we construct a dictionary feed and fill in the two entries that are always needed, word_ids and sequence_lengths. If the char embedding technique is used (config.use_chars), char_ids and word_lengths are added as well. If labels is not None, the labels are padded too and added to the feed (easy to understand: the data has been padded, so the corresponding labels must be padded to the same length).

  • If the lr (learning rate) argument is not None, it is also put into feed, and the same goes for dropout. Finally, the function returns feed and sequence_lengths.
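
To make this padding symmetry concrete, here is a minimal, self-contained sketch. pad_sequences_1d below is a simplified stand-in for the single-level pad_sequences (it is not the project's own implementation); it only illustrates that the word ids and the label ids of a batch end up padded to the same length:

def pad_sequences_1d(sequences, pad_tok):
    """Pad every sequence to the length of the longest one.

    Returns the padded sequences and the original lengths.
    """
    max_length = max(len(seq) for seq in sequences)
    padded, lengths = [], []
    for seq in sequences:
        padded.append(list(seq) + [pad_tok] * (max_length - len(seq)))
        lengths.append(len(seq))
    return padded, lengths

# Toy batch: two sentences of different lengths (word ids and tag ids).
word_ids = [[4, 7, 2], [9, 1]]
labels   = [[1, 0, 2], [0, 0]]

padded_words, sequence_lengths = pad_sequences_1d(word_ids, 0)
padded_labels, _               = pad_sequences_1d(labels, 0)

print(padded_words)      # [[4, 7, 2], [9, 1, 0]]
print(padded_labels)     # [[1, 0, 2], [0, 0, 0]]
print(sequence_lengths)  # [3, 2]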

With get_feed_dict() covered, let's now go back to run_epoch()

train.py -> model.train() -> base_model.py -> train() -> run_epoch() -> ner_model.py -> run_epoch()

    def run_epoch(self, train, dev, epoch):
        """Performs one complete pass over the train set and evaluate on dev

        Args:
            train: dataset that yields tuple of sentences, tags
            dev: dataset
            epoch: (int) index of the current epoch

        Returns:
            f1: (python float), score to select model on, higher is better

        """
        # progbar stuff for logging
        batch_size = self.config.batch_size
        nbatches = (len(train) + batch_size - 1) // batch_size
        prog = Progbar(target=nbatches)

        # iterate over dataset
        for i, (words, labels) in enumerate(minibatches(train, batch_size)):
            fd, _ = self.get_feed_dict(words, labels, self.config.lr,
                    self.config.dropout)

            _, train_loss, summary = self.sess.run(
                    [self.train_op, self.loss, self.merged], feed_dict=fd)

            prog.update(i + 1, [("train loss", train_loss)])

            # tensorboard
            if i % 10 == 0:
                self.file_writer.add_summary(summary, epoch*nbatches + i)

        metrics = self.run_evaluate(dev)
        msg = " - ".join(["{} {:04.2f}".format(k, v)
                for k, v in metrics.items()])
        self.logger.info(msg)

        return metrics["f1"]
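
A small aside on the nbatches line above: (len(train) + batch_size - 1) // batch_size is integer ceiling division, so a final partial batch still counts as one batch. A quick arithmetic check:

# Ceiling division: how many batches are needed to cover the dataset,
# counting a final partial batch as one batch.
len_train, batch_size = 25, 10
nbatches = (len_train + batch_size - 1) // batch_size
print(nbatches)  # 3  (two full batches of 10 plus one batch of 5)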

get_feed_dict() returns feed, which is a dict, and sequence_lengths, which is a list.
Here we do not need sequence_lengths; we only keep feed (hence the fd, _ = ... unpacking).

  • Next comes the sess.run call. In TensorFlow we first lay out the computation graph and only then execute it by calling tf.Session.run(); concise and efficient.
    In real code this is usually written in the following form:
with tf.Session() as sess:
    sess.run( )

The run() call has the following signature:

run(fetches, feed_dict=None, options=None, run_metadata=None)

Parameters:
fetches can be a single graph element, or an arbitrarily nested list, tuple, namedtuple, dict or OrderedDict containing graph elements.
For example:

sess.run([train_step, loss_mse], feed_dict = ...)

Return value:
the return value of run() has the same shape as the fetches argument.
So now we know what happens here: we pass the session a nested list of graph elements (train_op, loss, merged) and feed in the data (a dict containing word_ids, char_ids, word_lengths, labels, learning_rate and dropout). train_op is the optimizer step, loss is the loss value, and merged is the merged TensorBoard summary; of the three returned values we only need the last two.
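
To see this shape correspondence in isolation, here is a minimal sketch (TensorFlow 1.x style, as in the project, with toy constants of my own): a list of two graph elements is fetched, and two values come back in the same order:

import tensorflow as tf

a = tf.constant(2.0)
b = tf.constant(3.0)
total = a + b        # one graph element
product = a * b      # another graph element

with tf.Session() as sess:
    # fetches is a list, so the result is a list of the same shape
    sum_val, prod_val = sess.run([total, product])
    print(sum_val, prod_val)  # 5.0 6.0

In run_epoch() the same pattern returns three values for the three fetched elements, and the first one (the result of train_op) is simply discarded with _.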

  • Next, the Progbar class is used to print a progress bar together with the running train loss.
  • Next, every 10 batches a summary is written to TensorBoard (a standalone sketch of this summary pipeline follows after this list). Here file_writer is defined in base_model.py -> add_summary(self):
        self.file_writer = tf.summary.FileWriter(self.config.dir_output,
                self.sess.graph)
  • Finally the metrics are logged. self.run_evaluate(dev) runs one evaluation pass, taking the development set (dev) as its argument. This function is quite useful: it will be needed again later when we evaluate on the test set. Let's take a look at it.
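
For the TensorBoard bullet above, here is a minimal standalone sketch of the TF1 summary pipeline that merged and file_writer plug into. The summary name "train_loss" and the directory "results/test" are illustrative placeholders of my own, not the project's configuration:

import tensorflow as tf

# Build a graph with one scalar summary and merge all summaries.
loss = tf.placeholder(tf.float32, name="loss")
tf.summary.scalar("train_loss", loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    # The writer records the graph plus any summaries added later.
    writer = tf.summary.FileWriter("results/test", sess.graph)
    for step in range(100):
        summary = sess.run(merged, feed_dict={loss: 1.0 / (step + 1)})
        if step % 10 == 0:  # mirror the i % 10 == 0 logging cadence
            writer.add_summary(summary, step)
    writer.close()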