http://blog.csdn.net/wuzqChom/article/details/77918780
https://zhuanlan.zhihu.com/p/27769667
tf.contrib.legacy_seq2seq.embedding_attention_seq2seq is the function called by tf.contrib.legacy_seq2seq.model_with_buckets inside seq2seq_model. model_with_buckets is responsible for computing the outputs and the loss, while embedding_attention_seq2seq computes the outputs (and the state). Its interface looks like this:
tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(
    encoder_inputs,    # a list of encoder_size tensors of shape (?,); overall shape=[encoder_size, batch_size] = (6, 32)
    decoder_inputs,    # shape=[decoder_size, batch_size] = (6, 32)
    cell,              # the cell structure returned by MultiRNNCell
    num_encoder_symbols=source_vocab_size,  # vocabulary size, 10
    num_decoder_symbols=target_vocab_size,  # vocabulary size, 10
    embedding_size=size,                    # embedding dimension, 32
    output_projection=output_projection,    # sampled softmax is used, so this is a pair (w, b)
    feed_previous=do_decode,                # train or predict
    dtype=dtype)
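For context, in the tutorial-style seq2seq_model.py this call is usually wrapped in a small closure and handed to model_with_buckets, which builds the graph for every bucket and also attaches the loss. A rough sketch of that wiring (targets, target_weights, buckets and softmax_loss_function are the usual surrounding variables, assumed here rather than quoted):

def seq2seq_f(encoder_inputs, decoder_inputs, do_decode):
  # the embedding_attention_seq2seq call shown above
  return tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(
      encoder_inputs, decoder_inputs, cell,
      num_encoder_symbols=source_vocab_size,
      num_decoder_symbols=target_vocab_size,
      embedding_size=size,
      output_projection=output_projection,
      feed_previous=do_decode,
      dtype=dtype)

# model_with_buckets runs seq2seq_f once per bucket and returns both outputs and losses.
outputs, losses = tf.contrib.legacy_seq2seq.model_with_buckets(
    encoder_inputs, decoder_inputs, targets, target_weights, buckets,
    lambda x, y: seq2seq_f(x, y, do_decode),
    softmax_loss_function=softmax_loss_function)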
This takes us into the seq2seq.py source file. The call chain there is:
embedding_attention_seq2seq() -> embedding_attention_decoder() -> attention_decoder() -> attention()
- embedding_attention_seq2seq() converts encoder_inputs into their embedding form (token ids become vectors). The code is as follows:
with variable_scope.variable_scope(
    scope or "embedding_attention_seq2seq", dtype=dtype) as scope:
  dtype = scope.dtype
  # Encoder.
  encoder_cell = core_rnn_cell.EmbeddingWrapper(
      cell,
      embedding_classes=num_encoder_symbols,  # 10
      embedding_size=embedding_size)          # 32; the wrapper adds the embedding lookup to the cell
  encoder_outputs, encoder_state = core_rnn.static_rnn(
      encoder_cell, encoder_inputs, dtype=dtype)  # run the wrapped cell: encoder_inputs are embedded and encoded into encoder_outputs
  # Compute attention_states.
  top_states = [
      array_ops.reshape(e, [-1, 1, cell.output_size]) for e in encoder_outputs
  ]
  attention_states = array_ops.concat(top_states, 1)
  # Decoder.
  output_size = None
  if isinstance(feed_previous, bool):
    return embedding_attention_decoder(
        decoder_inputs,
        encoder_state,
        attention_states,
        cell,
        num_decoder_symbols,
        embedding_size,
        num_heads=num_heads,                  # 1; how many times the weighted sum of formula (3) is computed
        output_size=output_size,              # None
        output_projection=output_projection,  # (w, b)
        feed_previous=feed_previous,
        initial_state_attention=initial_state_attention)  # False
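The reshape/concat in the middle simply stacks the per-step encoder outputs into a single [batch, attn_length, attn_size] tensor. A tiny self-contained shape check (numpy stand-in, using the sizes from this note):

import numpy as np

batch_size, encoder_size, output_size = 32, 6, 32
# static_rnn returns a Python list of encoder_size tensors, each [batch_size, output_size]
encoder_outputs = [np.zeros((batch_size, output_size)) for _ in range(encoder_size)]
top_states = [e.reshape(-1, 1, output_size) for e in encoder_outputs]  # each becomes (32, 1, 32)
attention_states = np.concatenate(top_states, axis=1)
print(attention_states.shape)  # (32, 6, 32) = [batch, attn_length, attn_size]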
- embedding_attention_decoder(): its interface was covered above. It uses embedding_ops.embedding_lookup() to turn decoder_inputs into vectors (the encoder side did this through the EmbeddingWrapper; here an embedding variable is created explicitly and then looked up). The code:
if output_size is None:
  output_size = cell.output_size  # 32
if output_projection is not None:
  proj_biases = ops.convert_to_tensor(output_projection[1], dtype=dtype)
  proj_biases.get_shape().assert_is_compatible_with([num_symbols])  # convert b to a tensor and check that it has vocabulary size 10
with variable_scope.variable_scope(
    scope or "embedding_attention_decoder", dtype=dtype) as scope:
  embedding = variable_scope.get_variable("embedding",
                                          [num_symbols, embedding_size])
  loop_function = _extract_argmax_and_embed(
      embedding, output_projection,
      update_embedding_for_previous) if feed_previous else None  # feed_previous is False here, so no loop_function
  emb_inp = [
      embedding_ops.embedding_lookup(embedding, i) for i in decoder_inputs
  ]  # each decoder input goes from shape (?,) to (?, 32)
  return attention_decoder(
      emb_inp,
      initial_state,
      attention_states,
      cell,
      output_size=output_size,
      num_heads=num_heads,
      loop_function=loop_function,  # None
      initial_state_attention=initial_state_attention)  # False
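When feed_previous is True, _extract_argmax_and_embed is what builds the loop_function used during decoding. Roughly (a simplified sketch of its behaviour, not a verbatim copy of the library code): project the previous output with (w, b), take the argmax token id, look it up in the embedding, and optionally stop the gradient:

def _extract_argmax_and_embed_sketch(embedding, output_projection=None,
                                     update_embedding=True):
  """Simplified sketch of the loop_function used when feed_previous=True."""
  def loop_function(prev, _):
    if output_projection is not None:
      prev = nn_ops.xw_plus_b(prev, output_projection[0], output_projection[1])
    prev_symbol = math_ops.argmax(prev, 1)  # greedy choice of the previous token
    emb_prev = embedding_ops.embedding_lookup(embedding, prev_symbol)
    if not update_embedding:
      emb_prev = array_ops.stop_gradient(emb_prev)  # do not backprop into the embedding
    return emb_prev
  return loop_function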
- attention_decoder() is the core function. In the attention figure from the posts linked at the top, z corresponds to state in the code below and h to hidden (the attention states); the associated weights are the attention_vec_size-dimensional vector v and the kernel k. The actual score computation is delegated to the attention() function.
- x, the input fed to the cell, is obtained by combining the current input (inp) with attns.
- x and the previous state go into the cell, which returns a new state and cell_output.
- the new state is passed to the attention function, which returns new attns.
- attns and cell_output are combined to give the final output.
The beam search discussed later hooks in exactly here: the input is no longer taken directly from decoder_inputs, but is computed by loop_function from the previous output (prev).
with variable_scope.variable_scope(
    scope or "attention_decoder", dtype=dtype) as scope:
  dtype = scope.dtype
  # Derive the relevant dimensions from the inputs.
  batch_size = array_ops.shape(decoder_inputs[0])[0]  # 32; the inputs keep their shape=[decoder_size, batch_size] (6, 32) structure
  attn_length = attention_states.get_shape()[1].value
  attn_size = attention_states.get_shape()[2].value   # same as embedding_size
  hidden = array_ops.reshape(attention_states,
                             [-1, attn_length, 1, attn_size])  # hidden is h (the attention_states)
  hidden_features = []
  v = []
  attention_vec_size = attn_size  # size of the attention query vector
  for a in xrange(num_heads):  # 1
    k = variable_scope.get_variable("AttnW_%d" % a,
                                    [1, 1, attn_size, attention_vec_size])
    hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))  # convolve the attention states with k, i.e. W_1 * h_i
    v.append(
        variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))  # v gets the same size as attn_size
  state = initial_state
  # Preparation.
  outputs = []
  prev = None
  batch_attn_size = array_ops.stack([batch_size, attn_size])
  attns = [
      array_ops.zeros(
          batch_attn_size, dtype=dtype) for _ in xrange(num_heads)
  ]
  for a in attns:  # Ensure the second shape of attention vectors is set.
    a.set_shape([None, attn_size])
  if initial_state_attention:
    attns = attention(initial_state)
  for i, inp in enumerate(decoder_inputs):
    # Loop over time steps: attend over the encoder states using the current state,
    # then combine that with this step's decoder_inputs value to form this step's input.
    if loop_function is not None and prev is not None:
      with variable_scope.variable_scope("loop_function", reuse=True):
        inp = loop_function(prev, i)
    input_size = inp.get_shape().with_rank(2)[1]  # 32
    x = linear([inp] + attns, input_size, True)   # x merges the input with the attention result
    # Run the RNN.
    # The input and the previous hidden state jointly determine the new hidden state and the cell output.
    cell_output, state = cell(x, state)
    # Call the attention function.
    if i == 0 and initial_state_attention:
      with variable_scope.variable_scope(
          variable_scope.get_variable_scope(), reuse=True):
        attns = attention(state)  # state is z, the decoder's hidden state
    else:
      attns = attention(state)
    with variable_scope.variable_scope("AttnOutputProjection"):
      output = linear([cell_output] + attns, output_size, True)  # merge the attention result with the cell's own output
    outputs.append(output)  # the final outputs
  return outputs, state
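The linear() helper used above (core_rnn_cell._linear in this TF version) simply concatenates its inputs along the feature axis and applies one fully connected layer. A rough equivalent for reference (illustrative only; the real helper also accepts a single tensor instead of a list):

def linear_sketch(args, output_size, bias):
  """Rough equivalent of linear([a, b, ...], output_size, True)."""
  total_arg_size = sum(a.get_shape()[1].value for a in args)
  with variable_scope.variable_scope("Linear"):
    matrix = variable_scope.get_variable("Matrix", [total_arg_size, output_size])
    res = math_ops.matmul(array_ops.concat(args, 1), matrix)  # concat, then a single matmul
    if not bias:
      return res
    bias_term = variable_scope.get_variable("Bias", [output_size])
  return res + bias_term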
- Finally, the attention() function itself. The query passed in is the $d_t$ of formula (1).
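The formulas (1)-(3) referred to in the comments are not reproduced in this note; they are the standard content-based attention equations from the posts linked at the top, roughly:
$u_i = v^\top \tanh(W_1 h_i + W_2 d_t)$  (1)
$a = \mathrm{softmax}(u)$  (2)
$d_t' = \sum_i a_i h_i$  (3)
Here $h_i$ is hidden, $W_1 h_i$ is hidden_features (the 1x1 convolution with k), $W_2 d_t$ is the y computed below, $v$ is v[a], and $d_t'$ is the attention vector d that gets returned.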
def attention(query):
  """Put attention masks on hidden using hidden_features and query."""
  ds = []  # holds the final results
  if nest.is_sequence(query):  # If the query is a tuple, flatten it.
    query_list = nest.flatten(query)
    for q in query_list:  # Check that ndims == 2 if specified.
      ndims = q.get_shape().ndims
      if ndims:
        assert ndims == 2
    query = array_ops.concat(query_list, 1)
  for a in xrange(num_heads):
    with variable_scope.variable_scope("Attention_%d" % a):
      # y is the $W_2 d_t$ term of formula (1)
      y = linear(query, attention_vec_size, True)
      y = array_ops.reshape(y, [-1, 1, 1, attention_vec_size])
      # Attention mask is a softmax of v^T * tanh(...).
      # s is the result of formula (1)
      s = math_ops.reduce_sum(v[a] * math_ops.tanh(hidden_features[a] + y),
                              [2, 3])
      # formula (2)
      a = nn_ops.softmax(s)
      # Now calculate the attention-weighted vector d.
      # formula (3)
      d = math_ops.reduce_sum(
          array_ops.reshape(a, [-1, attn_length, 1, 1]) * hidden, [1, 2])
      ds.append(array_ops.reshape(d, [-1, attn_size]))
  return ds
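A side note on the conv2d used to build hidden_features in attention_decoder: because the kernel k has spatial size 1x1, the convolution is just a per-position matrix multiply W_1 h_i. A small numpy illustration of the idea (shapes are the ones used in this note):

import numpy as np

batch, attn_length, attn_size, attention_vec_size = 32, 6, 32, 32
hidden = np.random.randn(batch, attn_length, 1, attn_size)  # h, laid out as a one-pixel-high "image"
k = np.random.randn(1, 1, attn_size, attention_vec_size)    # AttnW, i.e. W_1

# A 1x1 convolution with kernel k multiplies every (batch, position) slice by W_1,
# which is what nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME") computes here.
hidden_features = np.einsum('blcs,sv->blcv', hidden, k[0, 0])
print(hidden_features.shape)  # (32, 6, 1, 32)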
In GNMT
The last three of the four functions are unchanged. The structure is as follows. In seq2seq_model.py, the cell for the whole network is first redefined with
cell = Stack_Residual_RNNCell.Stack_Residual_RNNCell(list_of_cell)
instead of using tf.contrib.rnn.MultiRNNCell directly. The new code is:
def __call__(self, inputs, state, scope=None):
  with vs.variable_scope(scope or type(self).__name__):
    cur_state_pos = 0
    cur_inp = inputs
    if self._use_residual_connections:  # new
      past_inp = tf.zeros_like(cur_inp)  # past_inp keeps the previous cur_inp so it can be added back in when the new cur_inp is computed
    new_states = []
    for i, cell in enumerate(self._cells):
      with vs.variable_scope("Cell%d" % i):
        if self._state_is_tuple:
          if not nest.is_sequence(state):
            raise ValueError(
                "Expected state to be a tuple of length %d, but received: %s"
                % (len(self.state_size), state))
          cur_state = state[i]
        else:
          cur_state = array_ops.slice(
              state, [0, cur_state_pos], [-1, cell.state_size])
          cur_state_pos += cell.state_size
        if self._use_residual_connections:  # new: residual / skip connection
          cur_inp += past_inp
          past_inp = cur_inp
        cur_inp, new_state = cell(cur_inp, cur_state)
        new_states.append(new_state)
  new_states = (tuple(new_states) if self._state_is_tuple
                else array_ops.concat(1, new_states))
  return cur_inp, new_states
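To make the difference from MultiRNNCell concrete, a hypothetical usage sketch (the constructor arguments are read off the call quoted earlier, not taken from the repo's documented API):

# Hypothetical usage of the residual stack, mirroring the call quoted above.
list_of_cell = [tf.contrib.rnn.GRUCell(size) for _ in range(num_layers)]
cell = Stack_Residual_RNNCell.Stack_Residual_RNNCell(list_of_cell)

# Compared with tf.contrib.rnn.MultiRNNCell(list_of_cell), the only change is that
# from the second layer onwards each cell receives cur_inp + past_inp, i.e. the previous
# layer's output plus that layer's own input: the GNMT-style residual connection.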
Execution then moves into embedding_attention_seq2seq, the first of the four functions described above. The first and second layers are defined as a bidirectional RNN with the following statements, which produce this part's outputs.
encoder_fw_cell = rnn_cell.EmbeddingWrapper(single_cell_1, embedding_classes=num_encoder_symbols,
                                            embedding_size=embedding_size / 2)
encoder_bw_cell = rnn_cell.EmbeddingWrapper(single_cell_2, embedding_classes=num_encoder_symbols,
                                            embedding_size=embedding_size / 2)
outputs, _, _ = rnn.bidirectional_rnn(encoder_fw_cell, encoder_bw_cell, encoder_inputs, dtype=dtype)
The remaining num_layers layers of the encoder are then built with the Stack_Residual_RNNCell defined above (cell2). The resulting encoder_outputs and encoder_state correspond to the post-embedding outputs in the original code, and from this point on the code is the same as the original.
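If that reading is right, the rest of the encoder is presumably just the original static_rnn call run with the residual stack over the bidirectional outputs; a sketch of what that would look like (cell2 is assumed to be the Stack_Residual_RNNCell built for the upper num_layers layers; this line is not quoted from the repo):

# Sketch only: feed the bidirectional layer's per-step outputs through the residual stack.
encoder_outputs, encoder_state = core_rnn.static_rnn(cell2, outputs, dtype=dtype)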
---- I do not understand why two cells (cell and cell2) have to be defined here. Why can't the output of the bidirectional embedding layer be used directly?
In Neural_Conversation_Models
This repo discusses beam search in the attention setting; the only part it rewrites is the computation of inp:
if loop_function is not None:
  with variable_scope.variable_scope("loop_function", reuse=True):
    if prev is not None:
      inp = loop_function(prev, i, log_beam_probs, beam_path, beam_symbols)
That is, at each step beam search keeps the best beam_size candidates and turns them into inp. The loop_function itself:
def loop_function(prev, i, log_beam_probs, beam_path, beam_symbols):
  if output_projection is not None:
    prev = nn_ops.xw_plus_b(
        prev, output_projection[0], output_projection[1])
    # after the projection, prev is a vocabulary-sized score vector
  probs = tf.log(tf.nn.softmax(prev))  # log-softmax over the vocabulary
  if i > 1:
    probs = tf.reshape(probs + log_beam_probs[-1],
                       [-1, beam_size * num_symbols])  # add the log-probs accumulated so far
  best_probs, indices = tf.nn.top_k(probs, beam_size)  # keep the beam_size best candidates
  indices = tf.stop_gradient(tf.squeeze(tf.reshape(indices, [-1, 1])))
  best_probs = tf.stop_gradient(tf.reshape(best_probs, [-1, 1]))
  symbols = indices % num_symbols       # the chosen word ids
  beam_parent = indices // num_symbols  # which beam each candidate extends
  beam_symbols.append(symbols)
  beam_path.append(beam_parent)
  log_beam_probs.append(best_probs)
  emb_prev = embedding_ops.embedding_lookup(embedding, symbols)
  emb_prev = tf.reshape(emb_prev, [beam_size, embedding_size])  # output size!
  if not update_embedding:
    emb_prev = array_ops.stop_gradient(emb_prev)
  return emb_prev
return loop_function
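The indices % num_symbols and indices // num_symbols trick works because probs was reshaped to [-1, beam_size * num_symbols]: a flat top_k index then encodes both the parent beam and the word. A tiny numeric check (made-up numbers):

import numpy as np

num_symbols, beam_size = 10, 3
# pretend tf.nn.top_k picked these flat indices out of a [1, beam_size * num_symbols] row
indices = np.array([4, 13, 27])
print(indices % num_symbols)   # [4 3 7]  -> the chosen word ids
print(indices // num_symbols)  # [0 1 2]  -> which parent beam each word extends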
What gets returned here has shape (beam_size, input_size), whereas prev is actually expected to have shape (batch_size, input_size), so the code does not run as written...
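A hedged note, not taken from the repo: the usual way to make these shapes line up is to decode one sentence at a time with an effective batch of beam_size, i.e. tile that sentence's encoder and decoder inputs (and therefore the initial state and attention states) beam_size times, so that prev really does carry beam_size rows. A sketch of the idea:

# Hypothetical workaround: tile a single example to beam_size before decoding,
# so that the batch dimension and the beam dimension coincide.
tiled_encoder_inputs = [tf.tile(inp, [beam_size]) for inp in encoder_inputs]  # each (1,) becomes (beam_size,)
tiled_decoder_inputs = [tf.tile(inp, [beam_size]) for inp in decoder_inputs]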