Named Entity Recognition Explained (Part 11)

from model.data_utils import CoNLLDataset
from model.ner_model import NERModel
from model.config import Config


def main():
    # create instance of config
    config = Config()

    # build model
    model = NERModel(config)
    model.build()
    # model.restore_session("results/crf/model.weights/") # optional, restore weights
    # model.reinitialize_weights("proj")

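First, three classes are imported.

The Config class is instantiated first. We analyzed it earlier, so I won't repeat that here; I'll go into details when we actually use it.
Next, the NER model is instantiated.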

class NERModel(BaseModel):
    """Specialized class of Model for NER"""

    def __init__(self, config):
        super(NERModel, self).__init__(config)
        self.idx_to_tag = {idx: tag for tag, idx in
                           self.config.vocab_tags.items()}

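Wait, wait, I got totally confused here. What does this mean? The argument passed to NERModel is the config instance, but in the class definition the thing in the parentheses is BaseModel! And how can a class's argument be another class? I give up.
After looking it up: NERModel inherits from BaseModel, so the name in parentheses after the class name is the parent class, not an argument. NERModel then overrides the initializer, which takes config as its parameter. Inside it, super() is used to call a method of the parent class (superclass), so here it calls BaseModel's initializer and passes config along.
In that case, nothing more to say, let's go look at the damn BaseModel class first.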

class BaseModel(object):
    """Generic class for general methods that are not specific to NER"""

    def __init__(self, config):
        """Defines self.config and self.logger

        Args:
            config: (Config instance) class with hyper parameters,
                vocab and embeddings

        """
        self.config = config
        self.logger = config.logger
        self.sess   = None
        self.saver  = None

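First, BaseModel inherits from object (funny that the o isn't capitalized, lol). object is the base class of all classes.
Now the docstring: "Generic class for general methods that are not specific to NER". In other words, this is a general-purpose class holding the common methods that are not specific to NER. (I ran it through Youdao and Baidu Translate; Baidu's wording was a bit more natural, but either way the meaning is clear.)
The argument passed to the class is config (an instance of Config), and the initializer then does some basic setup: it stores config and config.logger, and sets self.sess and self.saver to None for now.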
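
The inheritance relationship that confused me above can be reproduced with two toy classes. This is only a minimal sketch: Base, Child and toy_config are made-up stand-ins for BaseModel, NERModel and the real Config, not the repository's code.

```python
class Base(object):
    """Toy stand-in for BaseModel: stores the config it is given."""

    def __init__(self, config):
        self.config = config
        self.sess = None  # filled in later, as in BaseModel


class Child(Base):  # Base in parentheses is the parent class, not an argument
    """Toy stand-in for NERModel."""

    def __init__(self, config):
        # Run the parent's initializer first so that self.config exists
        super(Child, self).__init__(config)
        # Invert a tag -> idx mapping, like idx_to_tag in NERModel
        self.idx_to_tag = {idx: tag for tag, idx in
                           self.config["vocab_tags"].items()}


# Hypothetical config: a plain dict instead of a Config instance
toy_config = {"vocab_tags": {"O": 0, "B-PER": 1, "I-PER": 2}}
model = Child(toy_config)
print(model.idx_to_tag)
print(isinstance(model, Base))
```

Calling Child(toy_config) passes the config to Child.__init__, which hands it to Base.__init__ through super(); the class in the parentheses never receives anything at call time.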

---------------------- Updated July 15, 2019 ----------------------
Next comes the build method of the model instance. Let's look at it in ner_model.py.

    def build(self):
        # NER specific functions
        self.add_placeholders()
        self.add_word_embeddings_op()
        self.add_logits_op()
        self.add_pred_op()
        self.add_loss_op()

        # Generic functions that add training op and initialize session
        self.add_train_op(self.config.lr_method, self.lr, self.loss,
                self.config.clip)
        self.initialize_session() # now self.sess is defined and vars are init

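Taken literally, this first builds the parts of the graph that are specific to the NER model, and then adds the generic training op and initializes the session.

Step 1: build the placeholders, which for now I think of as formal parameters.
Step 2: build the word embedding op.
Step 3: add the logits op.
Step 4: add the prediction op.
Step 5: add the loss op.
Step 6: add the training op.
Step 7: initialize the session.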

Step 1: add_placeholders()

    def add_placeholders(self):
        """Define placeholders = entries to computational graph"""
        # shape = (batch size, max length of sentence in batch)
        self.word_ids = tf.placeholder(tf.int32, shape=[None, None],
                        name="word_ids")

        # shape = (batch size)
        self.sequence_lengths = tf.placeholder(tf.int32, shape=[None],
                        name="sequence_lengths")

        # shape = (batch size, max length of sentence, max length of word)
        self.char_ids = tf.placeholder(tf.int32, shape=[None, None, None],
                        name="char_ids")

        # shape = (batch_size, max_length of sentence)
        self.word_lengths = tf.placeholder(tf.int32, shape=[None, None],
                        name="word_lengths")

        # shape = (batch size, max length of sentence in batch)
        self.labels = tf.placeholder(tf.int32, shape=[None, None],
                        name="labels")

        # hyper parameters
        self.dropout = tf.placeholder(dtype=tf.float32, shape=[],
                        name="dropout")
        self.lr = tf.placeholder(dtype=tf.float32, shape=[],
                        name="lr")

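Defining placeholders is where building the computational graph begins; they are the entry points to the graph.
Here word_ids is defined: it holds every word of each sentence converted to an id, so its shape is (batch size, max length of sentence in batch).
char_ids holds the characters converted to ids. I really didn't get why at first; shouldn't understanding a sentence start from words? Never mind, we'll come back to this later. Its shape is (batch size, max length of sentence, max length of word). One more note: since we don't yet know how large batch size, max length of sentence and max length of word are (that only becomes clear during actual processing), these dimensions are set to None for now.
word_lengths is the length of each word. At this point I had no idea what the author wanted it for; it turns out to be passed as sequence_length to the character-level BiLSTM in add_word_embeddings_op.
labels holds the tags, with shape (batch size, max length of sentence in batch), the same shape as word_lengths.
Finally, two hyper parameters are defined: dropout and lr (the learning rate).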
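
To make these shapes concrete, here is a small sketch of how a batch could be padded before being fed to the placeholders. The helper names pad_sentences and pad_chars, the ids, and the choice of 0 as the padding id are all assumptions for illustration, not the repository's actual padding code.

```python
def pad_sentences(sentences, pad_id=0):
    """Pad a batch of word-id lists to the longest sentence in the batch."""
    max_len = max(len(s) for s in sentences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sentences]
    lengths = [len(s) for s in sentences]
    return padded, lengths


def pad_chars(batch, pad_id=0):
    """Pad char ids to (batch, max sentence length, max word length)."""
    max_sent = max(len(sent) for sent in batch)
    max_word = max(len(word) for sent in batch for word in sent)
    char_ids, word_lengths = [], []
    for sent in batch:
        # Pad every word to max_word characters
        rows = [w + [pad_id] * (max_word - len(w)) for w in sent]
        # Pad the sentence itself with all-padding words
        rows += [[pad_id] * max_word for _ in range(max_sent - len(sent))]
        char_ids.append(rows)
        word_lengths.append([len(w) for w in sent]
                            + [0] * (max_sent - len(sent)))
    return char_ids, word_lengths


# Two sentences with made-up ids: one with 3 words, one with 2 words
word_batch = [[4, 7, 9], [5, 2]]
char_batch = [[[1, 2], [3], [4, 5, 6]], [[7, 8], [9, 9, 9, 9]]]

word_ids, sequence_lengths = pad_sentences(word_batch)
char_ids, word_lengths = pad_chars(char_batch)

print(word_ids)          # [[4, 7, 9], [5, 2, 0]]
print(sequence_lengths)  # [3, 2]
print(word_lengths)      # [[2, 1, 3], [2, 4, 0]]
```

Both dimensions depend on the batch at hand (longest sentence, longest word), which is why the placeholders leave them as None.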

Step 2: add_word_embeddings_op()

Build the word embedding op.

    def add_word_embeddings_op(self):
        """Defines self.word_embeddings

        If self.config.embeddings is not None and is a np array initialized
        with pre-trained word vectors, the word embeddings is just a look-up
        and we don't train the vectors. Otherwise, a random matrix with
        the correct shape is initialized.
        """
        with tf.variable_scope("words"):
            if self.config.embeddings is None:
                self.logger.info("WARNING: randomly initializing word vectors")
                _word_embeddings = tf.get_variable(
                        name="_word_embeddings",
                        dtype=tf.float32,
                        shape=[self.config.nwords, self.config.dim_word])
            else:
                _word_embeddings = tf.Variable(
                        self.config.embeddings,
                        name="_word_embeddings",
                        dtype=tf.float32,
                        trainable=self.config.train_embeddings)

            word_embeddings = tf.nn.embedding_lookup(_word_embeddings,
                    self.word_ids, name="word_embeddings")

        with tf.variable_scope("chars"):
            if self.config.use_chars:
                # get char embeddings matrix
                _char_embeddings = tf.get_variable(
                        name="_char_embeddings",
                        dtype=tf.float32,
                        shape=[self.config.nchars, self.config.dim_char])
                char_embeddings = tf.nn.embedding_lookup(_char_embeddings,
                        self.char_ids, name="char_embeddings")

                # put the time dimension on axis=1
                s = tf.shape(char_embeddings)
                char_embeddings = tf.reshape(char_embeddings,
                        shape=[s[0]*s[1], s[-2], self.config.dim_char])
                word_lengths = tf.reshape(self.word_lengths, shape=[s[0]*s[1]])

                # bi lstm on chars
                cell_fw = tf.contrib.rnn.LSTMCell(self.config.hidden_size_char,
                        state_is_tuple=True)
                cell_bw = tf.contrib.rnn.LSTMCell(self.config.hidden_size_char,
                        state_is_tuple=True)
                _output = tf.nn.bidirectional_dynamic_rnn(
                        cell_fw, cell_bw, char_embeddings,
                        sequence_length=word_lengths, dtype=tf.float32)

                # read and concat output
                _, ((_, output_fw), (_, output_bw)) = _output
                output = tf.concat([output_fw, output_bw], axis=-1)

                # shape = (batch size, max sentence length, char hidden size)
                output = tf.reshape(output,
                        shape=[s[0], s[1], 2*self.config.hidden_size_char])
                word_embeddings = tf.concat([word_embeddings, output], axis=-1)

        self.word_embeddings =  tf.nn.dropout(word_embeddings, self.dropout)

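First, the docstring:
If self.config.embeddings is not None and is a NumPy array initialized with pre-trained word vectors, the word embedding step is just a lookup and the vectors are not trained; otherwise, a random matrix of the correct shape is initialized.
As an overview: two variable scopes, words and chars, are created with `with` and the work happens inside them; at the end we get self.word_embeddings after dropout.

Let's look at the words scope first:
If embeddings in the config instance has not been set, logger.info() is called to log "WARNING: randomly initializing word vectors", and a temporary _word_embeddings is created with tf.get_variable(), with shape [self.config.nwords, self.config.dim_word], i.e. [number of words, word vector dimension].
If embeddings has already been built in config, a temporary variable _word_embeddings is created as well, initialized from the word vectors we built earlier (self.config.embeddings). Note that config's train_embeddings is passed as the trainable argument here. Let's look at config.py: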

    # training
    train_embeddings = False
    nepochs          = 15
    dropout          = 0.5
    batch_size       = 20
    lr_method        = "adam"
    lr               = 0.001
    lr_decay         = 0.9
    clip             = -1 # if negative, no clipping
    nepoch_no_imprv  = 3

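Here train_embeddings is set to False. At first I didn't know why, or what it meant. Also, tf.get_variable() and tf.Variable() both create variables, so what's the difference? I looked it up: if trainable is True, the variable is added to the graph collection GraphKeys.TRAINABLE_VARIABLES, which is the default list of variables an optimizer updates. So with train_embeddings = False the pre-trained vectors stay fixed during training, which matches the docstring ("we don't train the vectors"). OK, it looks like I need to write another post on tf.Variable(), tf.get_variable() and their parameters.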
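The last step uses TensorFlow's embedding_lookup function, which looks up ids in an embedding tensor: the first argument is the embedding tensor and the second is the word ids, so each word id in a sentence is replaced by its vector. That finishes the words scope; next, the chars scope.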
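
What the lookup does to the shapes can be imitated with NumPy integer-array indexing. This is a sketch of the idea only, with made-up sizes and values; it is not how tf.nn.embedding_lookup is implemented.

```python
import numpy as np

# Hypothetical sizes: 5 words in the vocabulary, 3-dimensional vectors
nwords, dim_word = 5, 3
embeddings = np.arange(nwords * dim_word, dtype=np.float32)
embeddings = embeddings.reshape(nwords, dim_word)

# A batch of 2 sentences, each padded to 4 word ids
word_ids = np.array([[0, 2, 4, 1],
                     [3, 3, 0, 0]])

# Integer-array indexing picks one row of the matrix per id
looked_up = embeddings[word_ids]
print(looked_up.shape)  # (2, 4, 3)
```

The result has the shape of word_ids with one extra trailing axis of size dim_word, i.e. (batch size, max length of sentence, dim_word).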

The chars scope
Start from these lines in config.py:

    # NOTE: if both chars and crf, only 1.6x slower on GPU
    use_crf = True # if crf, training is 1.7x slower on CPU
    use_chars = True # if char embedding, training is 3.5x slower on CPU

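This switches character embeddings on. Why would NER need character embeddings? I was still fuzzy on that.
OK, I did a quick search and found a Zhihu post that explains it in plain terms. In the current bilstm+char+crf models for NER, the char representation (obtained from char embeddings) is computed by treating a word's characters as a sequence and running it through a CNN or RNN, and the result is concatenated with the corresponding word embedding. For example, for the word "word", [w, o, r, d] forms a sequence; the char embedding layer maps each character to an n-dimensional embedding, the sequence is fed to the CNN or RNN to get an m-dimensional char representation, and that is concatenated with the word embedding of "word". The word embeddings are usually pre-trained, while the char representation is randomly initialized and adjusted as the network trains.
OK, so first build the char embeddings. In the chars scope, create a temporary matrix _char_embeddings, just like _word_embeddings before (heh), but with shape [self.config.nchars, self.config.dim_char], and then, as above, call embedding_lookup.
The next block puts the time dimension on axis 1. It first gets the shape of char_embeddings, then calls tf.reshape on char_embeddings to produce a new char_embeddings. TensorFlow uses tf.reshape to reshape a tensor: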

reshape(
    tensor,
    shape,
    name=None
)

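So here tensor is char_embeddings and shape is [s[0]*s[1], s[-2], self.config.dim_char], where s[-2] is the second-to-last element of s, i.e. the max length of word. Every word in the batch thus becomes its own sequence of characters.
word_lengths is then reshaped too: it was [batch size, max length of sentence] and becomes a flat vector of length s[0]*s[1], one length per word.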
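
The two reshapes can be tried out in NumPy, whose reshape uses the same row-major ordering. The sizes and the random values here are made up for illustration.

```python
import numpy as np

# Hypothetical sizes
batch, max_sent, max_word, dim_char = 2, 3, 4, 5
char_embeddings = np.random.rand(batch, max_sent, max_word, dim_char)
word_lengths = np.array([[2, 1, 3],
                         [2, 4, 0]])

s = char_embeddings.shape
# Merge the batch and sentence axes: one row per word
flat_chars = char_embeddings.reshape(s[0] * s[1], s[-2], dim_char)
flat_lengths = word_lengths.reshape(s[0] * s[1])

print(flat_chars.shape)  # (6, 4, 5)
print(flat_lengths)      # [2 1 3 2 4 0]
```

After this, axis 0 indexes words and axis 1 indexes characters, so the RNN sees each word as a separate sequence whose true length comes from flat_lengths.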
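Then the bidirectional LSTM is defined. I'm not very familiar with it yet, but judging from the names cell_fw and cell_bw, these should be the forward and backward cells; it's a bidirectional LSTM, so the literal meaning is easy to get. I'll do a dedicated BiLSTM post later (lol, I owe so many).
Next comes the output, a temporary _output variable returned by bidirectional_dynamic_rnn, which as the name says is a bidirectional dynamic RNN function.
Finally output_fw and output_bw are extracted from _output and joined with TensorFlow's concat function; tf.concat merges several tensors along a given dimension, similar to numpy.concatenate.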
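
The nested unpacking is easier to read once the return structure is spelled out. As far as I understand the TF 1.x API, bidirectional_dynamic_rnn returns (outputs, output_states), where each state of an LSTMCell with state_is_tuple=True is an LSTMStateTuple with fields c and h. The sketch below uses a namedtuple stand-in and placeholder strings instead of real tensors.

```python
from collections import namedtuple

# Stand-in for TensorFlow's LSTMStateTuple, which has fields c and h
LSTMStateTuple = namedtuple("LSTMStateTuple", ["c", "h"])

# Placeholder strings instead of tensors, just to show the nesting
outputs = ("all_outputs_fw", "all_outputs_bw")
output_states = (LSTMStateTuple(c="c_fw", h="h_fw"),
                 LSTMStateTuple(c="c_bw", h="h_bw"))
_output = (outputs, output_states)

# Same unpacking pattern as in add_word_embeddings_op
_, ((_, output_fw), (_, output_bw)) = _output
print(output_fw, output_bw)  # h_fw h_bw
```

So despite their names, output_fw and output_bw here are the final hidden states h of the two directions, one vector per word, rather than the per-character outputs.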
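Then TensorFlow's reshape function is applied to output, giving it shape [s[0], s[1], 2*self.config.hidden_size_char].
After that, word_embeddings and our output (the char representation) are concatenated and assigned to the new word_embeddings.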
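
Here is a NumPy sketch of these last shape manipulations, again with made-up sizes and dummy values; np.concatenate plays the role of tf.concat.

```python
import numpy as np

# Hypothetical sizes
batch, max_sent, hidden_size_char, dim_word = 2, 3, 4, 5
output_fw = np.ones((batch * max_sent, hidden_size_char))
output_bw = np.zeros((batch * max_sent, hidden_size_char))
word_embeddings = np.random.rand(batch, max_sent, dim_word)

# Join forward and backward states for every word
output = np.concatenate([output_fw, output_bw], axis=-1)
# Restore the (batch, sentence) layout
output = output.reshape(batch, max_sent, 2 * hidden_size_char)
# Append the char representation to each word vector
combined = np.concatenate([word_embeddings, output], axis=-1)
print(combined.shape)  # (2, 3, 13)
```

Each word ends up with a vector of size dim_word + 2 * hidden_size_char.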
And last of all, dropout is applied to the whole thing.
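
In TF 1.x the second positional argument of tf.nn.dropout is keep_prob, so the 0.5 in config.py is the probability of keeping an element. A rough NumPy sketch of this kind of (inverted) dropout, with a hypothetical helper name and a fixed seed, could look like this:

```python
import numpy as np


def dropout(x, keep_prob, rng):
    """Inverted dropout: keep each element with probability keep_prob."""
    mask = rng.random(x.shape) < keep_prob
    # Scale the kept elements so the expected value stays the same
    return np.where(mask, x / keep_prob, 0.0)


rng = np.random.default_rng(0)
x = np.ones((2, 3, 4))
dropped = dropout(x, keep_prob=0.5, rng=rng)    # entries are 0.0 or 2.0
unchanged = dropout(x, keep_prob=1.0, rng=rng)  # nothing is dropped
```

Feeding 1.0 into the dropout placeholder, presumably what one would do at evaluation time, leaves the embeddings untouched.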
I'm exhausted, but really I've only covered the rough flow; I'm still a newbie (heh).
That's it for this post!
