I. Model Definition

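NJUNMT provides two kinds of models, Sequence2Sequence and Transformer. Both can be viewed as encoder-decoder models, but they differ in the concrete encoder and decoder types they use: the Seq2Seq model is built on RNNs, whereas the Transformer uses no RNN and relies on the attention mechanism instead.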

[Figure: 模型.jpg (model diagram)]

1. Sequence2Sequence

This part introduces the main building blocks of the seq2seq model in NJUNMT.
Example seq2seq model configuration:

# njunmt/example_configs/toy_seq2seq.yml
model: njunmt.models.SequenceToSequence
model_params:
  # whether to apply layer normalization to all feed-forward layers
  fflayers.layer_norm: true
  # source-language embedding dimension
  embedding.dim.source: 8
  # target-language embedding dimension
  embedding.dim.target: 8
  modality.params:
    multiply_embedding_mode:
    # whether the embedding and softmax weights are shared
    share_embedding_and_softmax_weights: false
    # keep probability for the logits; 1.0 means no dropout
    dropout_logit_keep_prob: 1.0
    # loss function
    loss: crossentropy
    timing:
  # parameter initializer
  initializer: random_uniform
  ...

1.1 Embedding

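Embedding(词嵌入)到底是什么 (What exactly is an embedding?)
The parameters of an embedding layer in a neural network form a matrix that is trained together with the rest of the model. Multiplying this matrix by a word's one-hot vector selects the corresponding row, which is that word's embedding vector.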
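The equivalence between the one-hot product and a plain row lookup can be checked with a small sketch. It uses pure Python lists instead of TensorFlow, and the matrix values and the helper names (`one_hot`, `lookup_by_matmul`, `lookup_by_index`) are made up for illustration; they are not part of NJUNMT.

```python
def one_hot(index, vocab_size):
    """Return a one-hot row vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]


def lookup_by_matmul(embedding, index):
    """Multiply the one-hot row vector by the embedding matrix."""
    vocab_size, dim = len(embedding), len(embedding[0])
    vec = one_hot(index, vocab_size)
    return [sum(vec[r] * embedding[r][c] for r in range(vocab_size))
            for c in range(dim)]


def lookup_by_index(embedding, index):
    """Pick the row directly, which is what an embedding lookup does."""
    return list(embedding[index])


# Toy embedding matrix: 3 words, 2 dimensions (values are made up).
toy_embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
same = lookup_by_matmul(toy_embedding, 2) == lookup_by_index(toy_embedding, 2)
print(same)
```

Both paths return the same row, which is why a lookup by index can stand in for the far more expensive matrix product.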

NJUNMT uses tf.nn.embedding_lookup(), wrapped in several layers of abstraction.
详解TF中的Embedding操作！ (A detailed look at the Embedding operation in TF)

1.1 Encoder

Example encoder configuration for the seq2seq model:

model_params:
  ...
  # encoder type
  encoder.class: njunmt.encoders.BiUnidirectionalRNNEncoder
  encoder.params:
    rnn_cell:
      # RNN cell type; can be LSTMCell or GRUCell
      cell_class: GRUCell
      cell_params:
        # number of units in each cell
        num_units: 9
      # keep probability for the cell input; 1.0 means no dropout
      dropout_input_keep_prob: 0.9
      # keep probability for the cell state; 1.0 means no dropout
      dropout_state_keep_prob: 1.0
      # number of RNN layers
      num_layers: 2
    ...

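The seq2seq model uses an RNN encoder. njunmt/encoders/rnn_encoder.py defines three encoder types:
UnidirectionalRNNEncoder: unidirectional multi-layer RNN
StackBidirectionalRNNEncoder: bidirectional multi-layer RNN
BiUnidirectionalRNNEncoder: a bidirectional RNN at the bottom layer, with unidirectional multi-layer RNNs on top
*Question: why use this kind of RNN architecture?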

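Every Encoder class has an encode() method:

def encode(inputs, sequence_length, **kwargs):
    outputs, final_states = tf.nn.dynamic_rnn(...)
    return (outputs, final_states)

tf.nn.dynamic_rnn() runs an RNN cell that has already been created. It is "dynamic" in that the number of time steps is not fixed when the graph is built, so different batches can have different lengths. Within one batch the sequences are padded to a common length, and sequence_length tells dynamic_rnn where each one really ends: past that step the outputs are zeroed and the state is copied through unchanged.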
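The role of sequence_length can be illustrated without TensorFlow. The sketch below is only an analogy: `pad_batch` and `run_toy_rnn` are hypothetical helpers, and the "cell" is a running sum rather than a real GRU or LSTM. It shows per-batch padding, zeroed outputs past the true length, and the state being carried through unchanged.

```python
def pad_batch(sequences, pad_value=0):
    """Pad every sequence to the longest one in this batch."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


def run_toy_rnn(padded, lengths):
    """Toy recurrent "cell" (a running sum) that respects sequence_length."""
    outputs, final_states = [], []
    for seq, length in zip(padded, lengths):
        state, seq_outputs = 0, []
        for t, x in enumerate(seq):
            if t < length:
                state = state + x          # real step: update the state
                seq_outputs.append(state)
            else:
                seq_outputs.append(0)      # past the end: zero output, state kept
        outputs.append(seq_outputs)
        final_states.append(state)
    return outputs, final_states


padded, lengths = pad_batch([[1, 2, 3], [4, 5]])
outputs, final_states = run_toy_rnn(padded, lengths)
print(padded, lengths)
print(outputs, final_states)
```

The second sequence is padded to length 3, yet its final state is the one reached after its second (last real) step.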

The _encode() method of the Sequence2Sequence class:
it embeds the input and then calls the Encoder's encode() method

    def _encode(self, input_fields):
        features = self._input_to_embedding_fn(...)
        encoder_output = self._encoder.encode(features, feature_length)
        return encoder_output

1.2 Bridge

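The bridge connects the encoder final state to the decoder initial state.
njunmt/utils/bridges.py defines four kinds of Bridge:
ZeroBridge: the decoder initial state is all zeros
PassThroughBridge: passes the encoder final state directly to the decoder
InitialStateBridge: maps the encoder final state through one fully connected layer before passing it to the decoder
VariableBridge: the decoder initial state is learned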
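The four strategies listed above can be sketched as plain functions. This is a minimal illustration, not the code in njunmt/utils/bridges.py: the function names, the list-based state representation, and the tanh activation in the fully connected mapping are all assumptions.

```python
import math


def zero_bridge(encoder_final_state, decoder_state_size):
    """Ignore the encoder and start the decoder from all zeros."""
    return [0.0] * decoder_state_size


def pass_through_bridge(encoder_final_state, decoder_state_size):
    """Hand the encoder final state over unchanged (sizes must match)."""
    if len(encoder_final_state) != decoder_state_size:
        raise ValueError("encoder and decoder state sizes differ")
    return list(encoder_final_state)


def initial_state_bridge(encoder_final_state, weights, bias):
    """Map the encoder final state through one dense layer (tanh is assumed)."""
    return [math.tanh(sum(w * x for w, x in zip(row, encoder_final_state)) + b)
            for row, b in zip(weights, bias)]


def variable_bridge(learned_state):
    """Return a trainable vector that does not depend on the encoder at all."""
    return list(learned_state)


encoder_state = [1.0, 2.0]
print(zero_bridge(encoder_state, 3))
print(pass_through_bridge(encoder_state, 2))
print(initial_state_bridge(encoder_state, [[0.0, 0.0]], [0.0]))
print(variable_bridge([0.5]))
```

Only the dense-layer variant can change the state size, which matters when the encoder and decoder use different numbers of units.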

1.3 Decoder

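The seq2seq model uses an RNN decoder. njunmt/decoders/rnn_decoder.py defines three decoder types:
SimpleDecoder: multi-layer RNN without attention
AttentionDecoder: multi-layer RNN that takes concat(decoder_input, attention_context) as input
CondAttentionDecoder: a single-layer RNN at the bottom that takes decoder_input as input, with a multi-layer RNN on top that takes attention_context as input

The decode() method of the Decoder class mainly calls dynamic_decode():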

def dynamic_decode(decoder,
                   encoder_output,
                   bridge,
                   helper,
                   target_to_embedding_fn,
                   outputs_to_logits_fn,
                   parallel_iterations=32,
                   swap_memory=False,
                   **kwargs):
    ...
    # Initialization, including turning the target inputs into embeddings

    with tf.variable_scope(decoder.name):
        # Call decoder.prepare(), which
        # 1. initializes the decoder states through the bridge, and
        # 2. builds the attention inputs from encoder_output (if needed)
        initial_cache = decoder.prepare(encoder_output, bridge, helper)  
        if decoder.mode == ModeKeys.INFER:
            assert "beam_size" in kwargs
            beam_size = kwargs["beam_size"]
            initial_cache = stack_beam_size(initial_cache, beam_size)

    # Loop body of the while_loop
    def body_traininfer(time, inputs, cache, outputs_ta,
                        finished, *args):

        with tf.variable_scope(decoder.name):
            # Call decoder.step()
            outputs, next_cache = decoder.step(inputs, cache)
        outputs_ta = nest.map_structure(lambda ta, out: ta.write(time, out),
                                        outputs_ta, decoder_output_remover.apply(outputs))
        inner_loop_vars = [time + 1, None, None, outputs_ta, None]
        sample_ids = None
        if decoder.mode == ModeKeys.INFER:
            log_probs, lengths = args[0], args[1]
            bs_stat_ta = args[2]
            predicted_ids = args[3]
            with tf.variable_scope(decoder.name):
                decoder_top_features = decoder.merge_top_features(outputs)
            logits = outputs_to_logits_fn(decoder_top_features)
            # sample next symbols
            sample_ids, beam_ids, next_log_probs, next_lengths \
                = helper.sample_symbols(logits, log_probs, finished, lengths, time=time)
            predicted_ids = gather_states(tf.reshape(predicted_ids, [-1, time + 1]), beam_ids)

            next_cache["decoding_states"] = gather_states(next_cache["decoding_states"], beam_ids)
            bs_stat = BeamSearchStateSpec(
                log_probs=next_log_probs,
                beam_ids=beam_ids)
            bs_stat_ta = nest.map_structure(lambda ta, out: ta.write(time, out),
                                            bs_stat_ta, bs_stat)
            next_predicted_ids = tf.concat([predicted_ids, tf.expand_dims(sample_ids, axis=1)], axis=1)
            next_predicted_ids = tf.reshape(next_predicted_ids, [-1])
            next_predicted_ids.set_shape([None])
            inner_loop_vars.extend([next_log_probs, next_lengths, bs_stat_ta, next_predicted_ids])

        next_finished, next_input_symbols = helper.next_symbols(time=time, sample_ids=sample_ids)
        next_inputs = target_to_embedding_fn(next_input_symbols, time + 1)

        next_finished = tf.logical_or(next_finished, finished)
        inner_loop_vars[1] = next_inputs
        inner_loop_vars[2] = next_cache
        inner_loop_vars[4] = next_finished
        return inner_loop_vars

    loop_vars = [initial_time, initial_inputs, initial_cache,
                 initial_outputs_ta, initial_finished]

    if decoder.mode == ModeKeys.INFER:  
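        ...  # add the extra loop variables needed for inference

    # while cond(*loop_vars): loop_vars = body(*loop_vars)
    # cond: tf.logical_not(tf.reduce_all(finished))
    # tf.logical_not: element-wise logical NOT of a boolean tensor
    # tf.reduce_all: logical AND of the elements across the given dimensions
    # finished is a boolean tensor that marks which sequences in the batch have ended;
    # the loop runs until every entry of finished is True
    cond = lambda *args: tf.logical_not(tf.reduce_all(args[4]))
    res = tf.while_loop(
        cond,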
        body_traininfer,
        loop_vars=loop_vars,
        parallel_iterations=parallel_iterations,
        swap_memory=swap_memory)

    ...  # extract final_outputs from res

    if decoder.mode == ModeKeys.INFER:
        ...  # return final_outputs together with some decoding states

    return final_outputs

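Data flow: decoder output -(merge_top_features)-> top features -(outputs_to_logits_fn)-> logits -(softmax)-> probabilities over target symbols
*This part is fairly complex, and there is still a lot I have not figured out.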
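The stopping rule of the while_loop (run until every entry of finished is True) is easier to see in a stripped-down greedy loop. The sketch below is plain Python under stated assumptions: `greedy_decode`, `scripted_next_symbol`, and the `max_steps` safety limit are hypothetical and stand in for decoder.step(), helper.next_symbols(), and tf.while_loop; beam search is left out entirely.

```python
def greedy_decode(next_symbol_fn, batch_size, bos_id, eos_id, max_steps):
    """Decode a batch step by step until every sentence has emitted EOS."""
    time = 0
    inputs = [bos_id] * batch_size
    finished = [False] * batch_size
    outputs = [[] for _ in range(batch_size)]
    # Same stopping rule as cond: keep going while not all(finished).
    while not all(finished) and time < max_steps:
        symbols = [next_symbol_fn(b, time, inputs[b]) for b in range(batch_size)]
        for b in range(batch_size):
            if not finished[b]:
                outputs[b].append(symbols[b])
        # A sentence stays finished once it has been marked finished.
        finished = [done or sym == eos_id for done, sym in zip(finished, symbols)]
        inputs = symbols
        time += 1
    return outputs, time


EOS = 2
scripted = [[5, 6, EOS], [7, EOS]]  # made-up "model predictions"


def scripted_next_symbol(sentence, time, previous_symbol):
    """Stand-in for a decoder step: replay a fixed script, then emit EOS."""
    steps = scripted[sentence]
    return steps[time] if time < len(steps) else EOS


decoded, steps_run = greedy_decode(scripted_next_symbol, 2,
                                   bos_id=1, eos_id=EOS, max_steps=10)
print(decoded, steps_run)
```

The shorter sentence finishes first, but the loop keeps running for the whole batch until the longer one ends, which mirrors the logical_or update of next_finished in the loop body.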

The _decode() method of the Sequence2Sequence class:

    def _decode(self, encoder_output, input_fields):

        if mode == ModeKeys.TRAIN or mode == ModeKeys.EVAL:
            ...  # prepare the labels and the helper
        else:  # mode == ModeKeys.INFER
            ...  # prepare the helper

        # Call the decode() method of the Decoder
        decoder_output, decoding_res = self._decoder.decode(
            encoder_output, bridge, helper,
            self._target_to_embedding_fn,
            self._outputs_to_logits_fn,
            beam_size=...)
        return decoder_output, decoding_res

1.3 Attention

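Without attention, the decoder uses the same context vector (Context) at every time step, namely the encoder output.
In a seq2seq model with attention, each decoder time step gets its own Context, and each Context assigns different weights to the encoder hidden states (h) at the different time steps.

[Figure: context计算公式.png (context formula)]

Attention lets the output of decoder time step i draw not only on the hidden state of the corresponding encoder time step j, but also on the surrounding hidden states, each with its own weight.

The α in the figure above is the attention weight, computed from the encoder hidden states (h) and the decoder hidden state (H). Different kinds of attention compute the attention weight in different ways.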
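For reference, the context computation described above can be written out in its standard form, using the same symbols as the text (h for encoder hidden states, H for the decoder hidden state, e for scores, α for weights). Which decoder step H is taken from (current or previous) differs between formulations, so the index on H is an assumption here.

```latex
\begin{aligned}
e_{ij} &= \mathrm{score}(H_i, h_j) \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})} \\
c_i &= \sum_{j} \alpha_{ij}\, h_j
\end{aligned}
```

For each decoder step i the weights α sum to 1 over the source positions j, so the context c is a weighted average of the encoder hidden states.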

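1.3.1 Additive attention

BahdanauAttention uses additive attention: a fully connected layer produces the attention score (the logits, e in the figure below), and a softmax then turns the scores into the attention weight (α in the figure below).

image.png

Another common notation replaces the decoder hidden state H in the formula above with Q (query) and the encoder hidden state h with K (keys), which gives:

attention_weight = softmax(v^T * tanh(W1*Q + W2*K))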
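A small numeric sketch of additive attention may help here. It is pure Python under stated assumptions: the helper names, the identity projection matrices W1 and W2, and the scoring vector v (which reduces each tanh output to one scalar score per key) are illustrative choices, not values taken from NJUNMT.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_scores(query, keys, w1, w2, v):
    """score_j = v . tanh(W1 query + W2 key_j), one scalar per key."""
    q_proj = matvec(w1, query)
    scores = []
    for key in keys:
        k_proj = matvec(w2, key)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    return scores


identity = [[1.0, 0.0], [0.0, 1.0]]   # toy projection matrices
v = [1.0, 1.0]                        # toy scoring vector
query = [0.0, 0.0]
keys = [[1.0, 1.0], [0.0, 0.0]]

weights = softmax(additive_scores(query, keys, identity, identity, v))
print(weights)
```

The first key produces the larger score, so it receives the larger weight, and the weights sum to 1.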

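1.3.2 Multiplicative attention

Other attention models use multiplicative attention. It needs no extra fully connected layer, so it is more space-efficient; and because the products can use optimized matrix multiplication, it is usually faster as well.

[Figure: 乘法注意力.png (multiplicative attention)]

Replacing the decoder hidden state H in the formula above with Q (query) and the encoder hidden state h with K (keys) gives: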

attention_weight = softmax(dot_product(Q, K))

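For attention, read this one rather than the material above:
[深度概念]·Attention机制实践解读 ([Deep Concepts] A practical reading of the attention mechanism)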

#in encoder
EO, _ = encoder(...)
attention_values = EO
attention_keys = FC(EO)

#in decoder
attention_query = cell_output
#in attention.build()
query = FC(attention_query)
keys = attention_keys
memory = attention_values
score = att_fn(query, keys)  # att_fn can be additive, multiplicative, etc.
attention_weight = softmax(score)
context = attention_weight * memory  # weighted sum of memory over source positions
#then use context to infer
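The pseudocode above can be turned into a runnable sketch. It uses plain Python lists and leaves out the FC projections of the keys and the query; `attend` and `dot_product` are hypothetical names, and the dot product stands in for att_fn as one multiplicative choice.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product(query, key):
    """Multiplicative score: inner product of query and key."""
    return sum(q * k for q, k in zip(query, key))


def attend(query, keys, memory, att_fn):
    """Return (attention weights, context) for one decoder step."""
    scores = [att_fn(query, key) for key in keys]
    weights = softmax(scores)
    dim = len(memory[0])
    # The context is the weighted sum of the memory rows.
    context = [sum(w * row[d] for w, row in zip(weights, memory))
               for d in range(dim)]
    return weights, context


memory = [[1.0, 2.0], [3.0, 4.0]]     # attention_values (toy encoder outputs)
keys = [[10.0, 0.0], [0.0, 10.0]]     # attention_keys (toy values)

sharp_weights, sharp_context = attend([1.0, 0.0], keys, memory, dot_product)
flat_weights, flat_context = attend([0.0, 0.0], keys, memory, dot_product)
print(sharp_weights, sharp_context)
print(flat_weights, flat_context)
```

A query aligned with the first key pulls the context almost entirely to the first memory row, while a zero query gives uniform weights and the plain average of the rows.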

image.png

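More details on how attention is computed:
"Attention Is All You Need", the original paper that proposed the Transformer
《attention is all you need》解读 (a walkthrough of "Attention Is All You Need")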

2. Transformer

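[Figure: Transformer模型.png (Transformer model)]

Transformer模型简介 (An introduction to the Transformer model)

*The following is from: Transformer for NMT
Overall, the model consists of an encoder and a decoder. On the left, Stages 2 and 3 make up one encoder layer; on the right, Stages 2, 3 and 4 make up one decoder layer. Each Stage within a layer is called a sublayer. In the Transformer, encoder and decoder layers can likewise be stacked N times to build a deeper model; the paper uses 6 encoder layers and 6 decoder layers. Adding the input embeddings, the positional encoding and the output dense layer gives the complete Transformer. Below we analyze each part in detail.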
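Of the parts listed above, the positional encoding is the easiest to show in isolation. The sketch below follows the sinusoidal formulation from the "Attention Is All You Need" paper (sine on even dimensions, cosine on odd ones); whether NJUNMT lays out the sine and cosine channels in exactly this interleaved way is not confirmed here, and the function name is hypothetical.

```python
import math


def positional_encoding(max_len, d_model):
    """Sinusoidal position table with max_len rows and d_model columns."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same wavelength.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(max_len=4, d_model=4)
for row in pe:
    print([round(value, 4) for value in row])
```

Each row is added to the embedding of the token at that position, which gives the otherwise order-agnostic attention layers access to word order.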

*The Transformer is complicated and I have not studied it closely yet.

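II. Running Process

2.1 Some TF basics

tensorflow中的Graph（图）和Session（会话）的关系（大盘鸡与红烧肉） (The relationship between Graph and Session in TensorFlow)

TensorFlow 中的几个关键概念：Tensor，Operation，Graph，Session (Key concepts in TensorFlow: Tensor, Operation, Graph, Session)

tensorflow中的Session()和run() (Session() and run() in TensorFlow)
Hook? tf.train.SessionRunHook()介绍 (An introduction to tf.train.SessionRunHook())

*A reference with a similar end-to-end workflow:
[tf]使用attention机制进行NMT ([tf] NMT with the attention mechanism)

NJUNMT
bin/train.py instantiates a TrainingExperiment object with the model configuration and runs training.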

    training_runner = TrainingExperiment(
        model_configs=model_configs)

    training_runner.run()

The run() method of TrainingExperiment is defined as:

# njunmt/nmt_experiment.py
    def run(self):
        # build the source- and target-language vocabularies
        vocab_source = Vocab(...)
        vocab_target = Vocab(...)
        eval_dataset = {
            "vocab_source": vocab_source,
            "vocab_target": vocab_target,
            "features_file": self._model_configs["data"]["eval_features_file"],
            "labels_file": self._model_configs["data"]["eval_labels_file"]}

        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True
        config.allow_soft_placement = True

        # model_fn() builds the model
        estimator_spec = model_fn(...)
        train_ops = estimator_spec.train_ops
        hooks = estimator_spec.training_hooks

        # build the training session
        sess = tf.train.MonitoredSession(
            session_creator=tf.train.ChiefSessionCreator(
                scaffold=tf.train.Scaffold(),
                checkpoint_dir=None,
                master="",
                config=config),
            hooks=tuple(hooks) + tuple(build_eval_metrics
                                       (self._model_configs, eval_dataset,
                                        model_name=estimator_spec.name)))

        ...
        # somehow obtain the preprocessed training data
        train_data = ...

        eidx = [0, 0]
        update_cycle = [self._model_configs["train"]["update_cycle"], 1]  # [target cycle count, current cycle count]

        def step_fn(step_context):
            step_context.session.run(train_ops["zeros_op"])
            try:
                while update_cycle[0] != update_cycle[1]:
                    data = train_data.next()
                    step_context.session.run(
                        train_ops["collect_op"], feed_dict=data["feed_dict"])
                    update_cycle[1] += 1
                data = train_data.next()
                update_cycle[1] = 1
                return step_context.run_with_hooks(
                    train_ops["train_op"], feed_dict=data["feed_dict"])
            except StopIteration:
                eidx[1] += 1

        while not sess.should_stop():
            if eidx[0] != eidx[1]:
                tf.logging.info("STARTUP Epoch {}".format(eidx[1]))
                eidx[0] = eidx[1]
            sess.run_step_fn(step_fn)

*Question: how does sess.run_step_fn(step_fn) actually run here? I could not find a run_step_fn() method in the tf.train.MonitoredSession() class.
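In TF 1.x, run_step_fn() is a documented method of tf.train.MonitoredSession (it is implemented in the base class it inherits from); it calls step_fn with a step context that exposes session and run_with_hooks(), as used above. What step_fn does with them is gradient accumulation, which the following plain-Python sketch imitates. The function names, the scalar "weight", and the choice to average the accumulated gradient are assumptions for illustration; whether NJUNMT sums or averages is not shown in the excerpt.

```python
def accumulate_and_update(weight, batches, grad_fn, update_cycle, learning_rate):
    """Apply one parameter update after accumulating update_cycle batches."""
    batch_iter = iter(batches)
    accumulated = 0.0                                      # like zeros_op
    for _ in range(update_cycle - 1):
        accumulated += grad_fn(weight, next(batch_iter))   # like collect_op
    # The last batch of the cycle is the one run together with train_op.
    accumulated += grad_fn(weight, next(batch_iter))
    # Averaging is an assumption of this sketch.
    return weight - learning_rate * accumulated / update_cycle


def toy_grad(weight, batch):
    """Made-up gradient: simply the batch value itself."""
    return batch


new_weight = accumulate_and_update(0.0, [1.0, 2.0, 3.0], toy_grad,
                                   update_cycle=3, learning_rate=0.5)
print(new_weight)
```

With update_cycle set to N, the parameters change once per N batches, which emulates an N times larger batch on limited GPU memory. When the data iterator runs out, next() raises StopIteration, which is the signal step_fn uses to count a new epoch.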

About the tf.train.MonitoredSession() class
The official documentation defines it as:
Session-like object that handles initialization, recovery and hooks.

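I still know too little about the Experiment and the training_hooks used here.

About tf.estimator
An Estimator object wraps the model specified by model_fn, which, given the inputs and a few other parameters, returns the operations needed for training, evaluation or prediction.
EstimatorSpec is the class that defines what model_fn returns (it inherits from namedtuple) and is used to initialize the Estimator.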

# njunmt/models/model_builder.py
class EstimatorSpec(
    namedtuple('EstimatorSpec', ['name', 'input_fields',
                                 'predictions', 'loss', 'train_ops',
                                 'training_chief_hooks', 'training_hooks'])):
    ...

# Build the NMT model; what is returned depends on the mode (train, evaluation or inference)
def model_fn(...):
    ...
    # create the model instance
    model = eval(model_str)(
        params=model_configs["model_params"],...)
    ...
    # call the build() method of the Model class
    def _build_model():
        ...
        if mode == ModeKeys.INFER:
            # model_output is prediction
            return _input_fields, _model_output
        elif mode == ModeKeys.EVAL:
            # model_output = (loss_sum, weight_sum), attention
            return _input_fields, _model_output[0], _model_output[1]
        elif mode == ModeKeys.TRAIN:  # mode == TRAIN
            ...  # something for train
            return _input_fields, _loss, grads
        ...

    model_returns = parallelism(_build_model)
    input_fields = model_returns[0]
    if mode == ModeKeys.INFER:
        ...  # something for inference
        return EstimatorSpec(...)

    if mode == ModeKeys.EVAL:
        ...  # something for evaluation
        return EstimatorSpec(...)

    assert mode == ModeKeys.TRAIN
    ...  # something for train
    return EstimatorSpec(...)
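The shape of model_fn (one spec type, filled differently per mode) can be shown with a toy version. This is only a sketch: `ToySpec`, `toy_model_fn`, the string mode constants and the placeholder values are hypothetical, and only the three train_ops keys are taken from the training loop shown earlier.

```python
from collections import namedtuple

# Reduced stand-in for EstimatorSpec with fewer fields.
ToySpec = namedtuple("ToySpec", ["name", "predictions", "loss", "train_ops"])

TRAIN, EVAL, INFER = "train", "eval", "infer"   # stand-ins for ModeKeys


def toy_model_fn(mode, name="toy_model"):
    """Return a spec whose populated fields depend on the mode."""
    if mode == INFER:
        # Inference only needs predictions.
        return ToySpec(name, predictions="predicted ids", loss=None, train_ops=None)
    if mode == EVAL:
        # Evaluation only needs the loss.
        return ToySpec(name, predictions=None, loss="eval loss", train_ops=None)
    assert mode == TRAIN
    # Training needs the loss and the ops that step_fn runs.
    ops = {"zeros_op": "reset", "collect_op": "accumulate", "train_op": "update"}
    return ToySpec(name, predictions=None, loss="train loss", train_ops=ops)


train_spec = toy_model_fn(TRAIN)
infer_spec = toy_model_fn(INFER)
print(sorted(train_spec.train_ops))
print(infer_spec.predictions)
```

Because the spec is a namedtuple, callers read fields by name (as in estimator_spec.train_ops) and fields that do not apply to a mode are simply left empty.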

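III. Tuning Basics

机器学习中的正则化(Regularization) (Regularization in machine learning)
Normalization（归一化） (Normalization)
《理解Dropout》分享 (Notes on "Understanding Dropout")
Seq2Seq中的beam search算法 (The beam search algorithm in Seq2Seq)
Tensorflow中优化器--AdamOptimizer详解 (TensorFlow optimizers: AdamOptimizer explained)

[Figure: 激活函数.png (activation functions)]

TensorFlow之RNN：堆叠RNN、LSTM、GRU及双向LSTM (RNNs in TensorFlow: stacked RNN, LSTM, GRU and bidirectional LSTM)
[深度概念]·Attention机制实践解读 ([Deep Concepts] A practical reading of the attention mechanism)
“变形金刚”为何强大：从模型到代码全面解析Google Tensor2Tensor系统 (Why the "Transformer" is so powerful: a full analysis of Google's Tensor2Tensor system, from model to code)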

NJUNMT study notes