tensor2tensor abstracts a `Modality` class to decouple the model body from task-dependent data transformations. A self-attention block, for example, can run on a sequence of discrete tokens just as well as on a sequence of image feature vectors, provided the input is first converted into the expected format. `Modality` performs these concrete conversions: word embedding, dimension permutation, output projection, loss computation, and so on. Because it depends on the specific data, the modality is configured in the `hparams` method of the `Problem` class, and the `bottom`, `top`, and `loss` methods of `T2TModel` call into it to perform the corresponding format conversions and loss computation.

`Modality` has four main methods: `bottom`, `targets_bottom`, `top`, and `loss`. Below, the `SymbolModality` used in machine translation serves as the running example to explain what each method does.
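Before diving into `SymbolModality`, a toy sketch of this division of labour may help. This is illustrative only, not T2T code; `ToySymbolModality`, `forward`, and `body_fn` are made-up names. The modality owns the embedding table and handles `bottom`, `top`, and `loss`, while the model body stays task-agnostic:

```python
import tensorflow as tf

class ToySymbolModality:
  """Toy stand-in for a Modality: owns the embedding and the task-specific ends."""

  def __init__(self, vocab_size, hidden_dim):
    self.embedding = tf.Variable(
        tf.random.normal([vocab_size, hidden_dim], stddev=hidden_dim ** -0.5))

  def bottom(self, ids):
    # Discrete ids -> dense embeddings for the model body.
    return tf.nn.embedding_lookup(self.embedding, ids)

  def top(self, body_output):
    # Hidden states -> vocabulary logits, reusing the embedding table (weight tying).
    return tf.einsum("blh,vh->blv", body_output, self.embedding)

  def loss(self, logits, targets):
    # Cross entropy, ignoring padded (index-0) positions.
    xent = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=targets, logits=logits)
    weights = tf.cast(tf.not_equal(targets, 0), tf.float32)
    return tf.reduce_sum(xent * weights) / (tf.reduce_sum(weights) + 1e-9)

def forward(modality, body_fn, inputs, targets):
  """Mirrors the bottom -> body -> top -> loss call order of T2TModel."""
  x = modality.bottom(inputs)       # task-specific input transform
  body_out = body_fn(x)             # task-agnostic model body
  logits = modality.top(body_out)   # task-specific output projection
  return logits, modality.loss(logits, targets)

# Usage: an identity "body" just to exercise the interface.
modality = ToySymbolModality(vocab_size=6, hidden_dim=4)
ids = tf.constant([[3, 5, 1, 0], [2, 4, 0, 0]])
logits, loss = forward(modality, lambda x: x, inputs=ids, targets=ids)
```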
bottom & targets_bottom
In `SymbolModality`, the core methods for converting the input data are `bottom_simple` and `_get_weights`; `bottom` and `targets_bottom` are thin wrappers around them.

- `_get_weights` creates the vocabulary-sized embedding table. Through the variable-reuse mechanism, the same table can serve the encoder's and decoder's embedding lookups and can also act as the projection that maps hidden states to logits.

  ```python
  def _get_weights(self, hidden_dim=None):
    """Create or get concatenated embedding or softmax variable.

    Args:
      hidden_dim: dim of the variable. Defaults to self._body_input_depth

    Returns:
      a list of self._num_shards Tensors.
    """
    if hidden_dim is None:
      hidden_dim = self._body_input_depth
    num_shards = self._model_hparams.symbol_modality_num_shards
    shards = []
    for i in range(num_shards):
      shard_size = (self._vocab_size // num_shards) + (
          1 if i < self._vocab_size % num_shards else 0)
      var_name = "weights_%d" % i
      shards.append(
          tf.get_variable(
              var_name, [shard_size, hidden_dim],
              initializer=tf.random_normal_initializer(0.0, hidden_dim ** -0.5)))
    if num_shards == 1:
      ret = shards[0]
    else:
      ret = tf.concat(shards, 0)
    # Convert ret to tensor.
    if not tf.contrib.eager.in_eager_mode():
      ret = common_layers.convert_gradient_to_tensor(ret)
    return ret
  ```
- `bottom_simple` embeds the discrete input ids via the `gather` function. Here `gather` is implemented by one-hot encoding the ids and multiplying the result with the embedding matrix (see the sketch after this list).

  ```python
  def bottom_simple(self, x, name, reuse):
    with tf.variable_scope(name, reuse=reuse):
      # Ensure the inputs are 3-D
      if len(x.get_shape()) == 4:
        x = tf.squeeze(x, axis=3)
      while len(x.get_shape()) < 3:
        x = tf.expand_dims(x, axis=-1)

      var = self._get_weights()
      x = common_layers.dropout_no_scaling(
          x, 1.0 - self._model_hparams.symbol_dropout)
      ret = common_layers.gather(var, x)
      if self._model_hparams.multiply_embedding_mode == "sqrt_depth":
        ret *= self._body_input_depth ** 0.5
      ret *= tf.expand_dims(tf.to_float(tf.not_equal(x, 0)), -1)
      return ret
  ```

  Since tensor2tensor reserves index 0 for the padding symbol `<PAD>` by default, the line `ret *= tf.expand_dims(tf.to_float(tf.not_equal(x, 0)), -1)` resets the embeddings at index-0 positions to all zeros. The true sequence lengths and the attention mask can then be recovered directly from the embeddings (see the sketch after this list).
- `bottom` and `targets_bottom` control the embedding-sharing mechanism. By default the encoder and decoder share one embedding table, which reduces the number of parameters while giving that table more gradient updates.

  ```python
  def bottom(self, x):
    if (self._model_hparams.shared_embedding_and_softmax_weights or
        self._model_hparams.get("shared_embedding")):
      return self.bottom_simple(x, "shared", reuse=None)
    return self.bottom_simple(x, "input_emb", reuse=None)

  def targets_bottom(self, x):
    if (self._model_hparams.shared_embedding_and_softmax_weights or
        self._model_hparams.get("shared_embedding")):
      try:
        return self.bottom_simple(x, "shared", reuse=True)
      except ValueError:
        # perhaps there were no inputs, and this is a new variable.
        return self.bottom_simple(x, "shared", reuse=None)
    else:
      return self.bottom_simple(x, "target_emb", reuse=None)
  ```
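Two details from the list above deserve a concrete illustration: the embedding lookup expressed as one-hot encoding plus a matmul, and the recovery of sequence lengths and the padding mask from embeddings zeroed at `<PAD>` positions. The following is a minimal standalone sketch under those assumptions, not T2T code; `one_hot_gather` and `mask_from_embeddings` are made-up names:

```python
import tensorflow as tf

def one_hot_gather(embedding, ids):
  """Embedding lookup written as one-hot encoding followed by a matmul."""
  vocab_size = tf.shape(embedding)[0]
  one_hot = tf.one_hot(ids, depth=vocab_size, dtype=embedding.dtype)  # [batch, length, vocab]
  return tf.einsum("blv,vh->blh", one_hot, embedding)                 # [batch, length, hidden]

def mask_from_embeddings(emb):
  """Recover padding mask and sequence lengths from embeddings zeroed at <PAD>."""
  nonpad = tf.cast(tf.reduce_any(tf.not_equal(emb, 0.0), axis=-1), tf.float32)  # [batch, length]
  pad_mask = 1.0 - nonpad                   # 1.0 at padded positions
  lengths = tf.reduce_sum(nonpad, axis=-1)  # true length per example
  return pad_mask, lengths

# Toy usage: vocab of 6 symbols, 4-dim embeddings, index 0 is <PAD>.
embedding = tf.random.normal([6, 4])
ids = tf.constant([[3, 5, 1, 0, 0],
                   [2, 4, 0, 0, 0]])
emb = one_hot_gather(embedding, ids)
emb *= tf.expand_dims(tf.cast(tf.not_equal(ids, 0), emb.dtype), -1)  # zero out <PAD> rows
pad_mask, lengths = mask_from_embeddings(emb)                        # lengths -> [3., 2.]
```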
top
`top` maps the hidden vectors back to the vocabulary dimension. The projection matrix can be shared with the embedding matrix; with this weight tying, the gradient path from the output back to the embedding table is noticeably shorter, so the embeddings get trained more thoroughly.
```python
def top(self, body_output, _):
  """Generate logits.

  Args:
    body_output: A Tensor with shape [batch, p0, p1, body_input_depth]

  Returns:
    logits: A Tensor with shape [batch, p0, p1, ?, vocab_size].
  """
  if self._model_hparams.symbol_modality_skip_top:
    return tf.expand_dims(body_output, 3)
  if self._model_hparams.shared_embedding_and_softmax_weights:
    scope_name = "shared"
    reuse = True
  else:
    scope_name = "softmax"
    reuse = False
  with tf.variable_scope(scope_name, reuse=reuse):
    body_output_shape = common_layers.shape_list(body_output)
    var = self._get_weights(body_output_shape[-1])
    if (self._model_hparams.factored_logits and
        self._model_hparams.mode == tf.estimator.ModeKeys.TRAIN):
      # insert channels dimension
      body_output = tf.expand_dims(body_output, 3)
      return common_layers.FactoredTensor(body_output, var)
    else:
      body_output = tf.reshape(body_output, [-1, body_output_shape[-1]])
      logits = tf.matmul(body_output, var, transpose_b=True)
      if (common_layers.is_xla_compiled() and
          self._model_hparams.mode == tf.estimator.ModeKeys.TRAIN):
        # TPU does not react kindly to extra dimensions.
        # TODO(noam): remove this once TPU is more forgiving of extra dims.
        return logits
      else:
        return tf.reshape(logits,
                          body_output_shape[:-1] + [1, self._vocab_size])
```
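The core of the shared-weights path above is just a matmul against the transposed embedding table plus some shape bookkeeping. A minimal standalone sketch, not T2T code; `tied_softmax_logits` is a made-up name:

```python
import tensorflow as tf

def tied_softmax_logits(body_output, embedding):
  """Project hidden states to vocabulary logits, reusing the embedding table."""
  shape = tf.shape(body_output)                          # dynamic [batch, length, hidden]
  hidden_dim = body_output.shape[-1]                     # static hidden size
  flat = tf.reshape(body_output, [-1, hidden_dim])       # [batch*length, hidden]
  logits = tf.matmul(flat, embedding, transpose_b=True)  # [batch*length, vocab]
  return tf.reshape(logits, [shape[0], shape[1], tf.shape(embedding)[0]])

# Usage: the same table serves bottom() lookups and the softmax projection.
embedding = tf.random.normal([6, 4])             # [vocab_size, hidden_dim]
hidden = tf.random.normal([2, 5, 4])             # [batch, length, hidden_dim]
logits = tied_softmax_logits(hidden, embedding)  # -> shape [2, 5, 6]
```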
loss
`loss` is a cross-entropy loss with the label-smoothing trick. With `weights_fn=weights_nonzero`, positions where the target equals zero (padding) are ignored when computing the loss.
```python
def loss(self, top_out, targets, weights_fn=None):
  """Compute loss numerator and denominator for one shard of output."""
  logits = top_out
  if weights_fn is None:
    weights_fn = self.targets_weights_fn
  return common_layers.padded_cross_entropy(
      logits,
      targets,
      self._model_hparams.label_smoothing,
      weights_fn=weights_fn)
```
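For intuition, here is a minimal standalone sketch of a label-smoothed cross entropy that ignores index-0 (padding) target positions and returns a numerator/denominator pair, roughly mirroring what `padded_cross_entropy` with `weights_nonzero` is described to do above. This is not the T2T implementation; `smoothed_padded_cross_entropy` is a made-up name:

```python
import tensorflow as tf

def smoothed_padded_cross_entropy(logits, targets, label_smoothing=0.1):
  """Label-smoothed cross entropy that ignores positions where targets == 0."""
  vocab_size = logits.shape[-1]                 # assumed static
  confidence = 1.0 - label_smoothing
  low_confidence = label_smoothing / float(vocab_size - 1)
  # Soft targets: `confidence` on the true id, the rest spread uniformly.
  soft_targets = tf.one_hot(
      targets, depth=vocab_size, on_value=confidence, off_value=low_confidence)
  xent = tf.nn.softmax_cross_entropy_with_logits(labels=soft_targets, logits=logits)
  weights = tf.cast(tf.not_equal(targets, 0), tf.float32)   # weights_nonzero
  numerator = tf.reduce_sum(xent * weights)
  denominator = tf.reduce_sum(weights) + 1e-9
  return numerator, denominator

# Usage: mean loss over non-padded target positions.
targets = tf.constant([[3, 5, 1, 0, 0], [2, 4, 0, 0, 0]])
num, den = smoothed_padded_cross_entropy(tf.random.normal([2, 5, 6]), targets)
mean_loss = num / den
```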