Without further ado, let's start with the diagram.
The basic seq2seq model:
outputs, states = basic_rnn_seq2seq(encoder_inputs, decoder_inputs, cell)
Here encoder_inputs are the encoder's inputs and decoder_inputs are the decoder's inputs.
The feed_previous parameter controls whether the decoder is fed the provided decoder_inputs or its own output from the previous time step. Typically the ground-truth decoder inputs are used during training, and the previous step's output is used during inference.
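A minimal sketch of building this graph, assuming TensorFlow 1.x (tf.contrib.legacy_seq2seq); the sequence lengths, input size, and cell size below are made-up values:

import tensorflow as tf

# toy sizes, chosen only for illustration
encoder_inputs = [tf.placeholder(tf.float32, [None, 32], name="enc%d" % i)
                  for i in range(10)]   # 10 encoder time steps
decoder_inputs = [tf.placeholder(tf.float32, [None, 32], name="dec%d" % i)
                  for i in range(12)]   # 12 decoder time steps
cell = tf.contrib.rnn.GRUCell(64)

# basic_rnn_seq2seq takes already-embedded inputs and has no feed_previous argument;
# feed_previous belongs to the embedding_* variants (e.g. embedding_attention_seq2seq) used later.
outputs, states = tf.contrib.legacy_seq2seq.basic_rnn_seq2seq(
    encoder_inputs, decoder_inputs, cell)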
output_projection:
A parameter used in the decoder. If it is not specified, the model's outputs have shape [batch_size, num_decoder_symbols]. When num_decoder_symbols is very large, the model instead keeps the decoder outputs at the smaller cell output size and computes the loss with a sampled softmax over only num_samples classes (a sampled softmax loss function); output_projection is the (weights, biases) pair that projects those outputs back to the full [batch_size, num_decoder_symbols] tensor.
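A minimal sketch of how output_projection and the sampled softmax loss are usually wired together (following the pattern of TensorFlow's translate.py tutorial); size (the cell output size), target_vocab_size, and num_samples are assumed hyperparameters, and the (labels, logits) argument order assumes a recent TF 1.x release:

w = tf.get_variable("proj_w", [size, target_vocab_size])   # size = cell output size (assumed defined)
b = tf.get_variable("proj_b", [target_vocab_size])
output_projection = (w, b)

def sampled_loss(labels, logits):
    labels = tf.reshape(labels, [-1, 1])
    # sampled softmax evaluates only num_samples classes instead of the full vocabulary
    return tf.nn.sampled_softmax_loss(
        weights=tf.transpose(w), biases=b, labels=labels, inputs=logits,
        num_sampled=num_samples, num_classes=target_vocab_size)

softmax_loss_function = sampled_loss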
bucket:
A mechanism used for variable-length seq2seq: inputs and targets are placed into buckets of different lengths, which avoids having to build a separate graph for every possible length combination.
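For instance, the buckets can be a short list of (max source length, max target length) pairs, and each training pair is assigned to the smallest bucket that fits it. A minimal sketch with made-up bucket sizes:

BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]   # (max source len, max target len), arbitrary values

def choose_bucket(source_ids, target_ids):
    # put the pair into the smallest bucket that can hold both sequences
    for bucket_id, (source_size, target_size) in enumerate(BUCKETS):
        if len(source_ids) < source_size and len(target_ids) < target_size:
            return bucket_id
    return None   # too long for every bucket; such pairs are typically dropped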
At each SGD update, one bucket is sampled at random (with probability proportional to the amount of training data it holds), batch_size training examples are drawn from it, and the parameters of the corresponding sub-graph are optimized; all sub-graphs share their weights. The bucket sampling looks like this:
import numpy as np

# how much training data falls into each bucket
train_bucket_sizes = [len(train_data.inputs[b])
                      for b in xrange(len(BUCKETS))]
train_total_size = float(sum(train_bucket_sizes))
# cumulative distribution over buckets, proportional to their data size
train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                       for i in xrange(len(train_bucket_sizes))]

for step in xrange(num_train_steps):   # one SGD update per step (num_train_steps assumed defined)
    random_number_01 = np.random.random_sample()
    # pick a bucket at random according to the data distribution above
    bucket_id = min([i for i in xrange(len(train_buckets_scale))
                     if train_buckets_scale[i] > random_number_01])
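After a bucket_id has been chosen, batch_size examples are drawn from that bucket and padded to its fixed lengths. A rough sketch, assuming train_data.inputs[bucket_id] holds (source_ids, target_ids) pairs and PAD_ID is the padding token:

import random

PAD_ID = 0   # assumed id of the padding symbol

def get_batch(data, bucket_id, batch_size):
    source_size, target_size = BUCKETS[bucket_id]
    encoder_batch, decoder_batch = [], []
    for _ in xrange(batch_size):
        source_ids, target_ids = random.choice(data.inputs[bucket_id])
        # pad every example to the bucket's fixed lengths
        encoder_batch.append(source_ids + [PAD_ID] * (source_size - len(source_ids)))
        decoder_batch.append(target_ids + [PAD_ID] * (target_size - len(target_ids)))
    # the real model then re-packs these into time-major lists
    # (one tensor per time step) before feeding model_with_buckets
    return encoder_batch, decoder_batch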
# attention-based seq2seq with input embeddings; feed_previous switches between
# training (feed ground-truth decoder inputs) and inference (feed previous outputs)
def seq2seq_f(encoder_inputs,
              decoder_inputs,
              cell,
              num_encoder_symbols,
              num_decoder_symbols,
              embedding_size,
              output_projection,
              do_decode):
    return tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(
        encoder_inputs,
        decoder_inputs,
        cell,
        num_encoder_symbols,
        num_decoder_symbols,
        embedding_size,
        output_projection=output_projection,
        feed_previous=do_decode)
# model_with_buckets builds one sub-graph per bucket and returns per-bucket outputs and losses
outputs, losses = tf.contrib.legacy_seq2seq.model_with_buckets(
    encoder_inputs,
    decoder_inputs,
    targets,
    target_weights,
    buckets,
    lambda x, y: seq2seq_f(x, y,
                           cell,
                           num_encoder_symbols,
                           num_decoder_symbols,
                           embedding_size,
                           output_projection,
                           False),
    softmax_loss_function=softmax_loss_function)
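Since model_with_buckets returns one list of outputs and one loss per bucket, training typically builds one update op per bucket on top of the shared parameters. A sketch with a made-up learning rate and clipping norm:

params = tf.trainable_variables()             # shared by all bucket sub-graphs
opt = tf.train.GradientDescentOptimizer(0.5)  # 0.5: assumed learning rate
updates = []
for b in xrange(len(buckets)):
    gradients = tf.gradients(losses[b], params)
    clipped_gradients, _ = tf.clip_by_global_norm(gradients, 5.0)  # 5.0: assumed max norm
    updates.append(opt.apply_gradients(zip(clipped_gradients, params)))
# at each training step, run updates[bucket_id] for the sampled bucket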