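Learning the tf.slim Module

Translated from the official TF-Slim documentation.

tf.slim is a lightweight TensorFlow library for defining, training, and evaluating complex models. TF-Slim components can be freely mixed with native TensorFlow as well as with other frameworks such as tf.contrib.learn.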

Why TF-Slim?

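TF-Slim makes building, training, and evaluating neural networks simpler:

TF-Slim lets us do the same work with less code by eliminating repetitive boilerplate. This is achieved mainly through argument scoping (arg_scope); code written with slim is not only much shorter but also more readable and easier to maintain;
Building models is simpler: slim provides commonly used regularizers that are easy to apply;
slim ships with official implementations of many popular networks, such as VGG, ResNet, and Inception, that can be used directly;
slim makes it easier to extend models and to load pre-trained models.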

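Main Modules in TF-Slim

TF-Slim consists of several independent modules (many of which have since been merged into TensorFlow's standard API):

arg_scope provides a new scope in which users can define default arguments for specific operations;
data contains TF-Slim's dataset definition, data providers, parallel_reader, and decoding utilities;
evaluation contains routines for evaluating how well a model has been trained;
layers contains high-level layers for building models with TensorFlow;
learning contains routines for training models;
losses contains commonly used loss functions, such as cross-entropy;
metrics contains popular evaluation metrics, such as accuracy;
nets contains popular network definitions;
queues provides a context manager for input data queues;
regularizers contains weight regularizers;
variables provides convenience wrappers for variable creation.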

Defining Models

TF-Slim lets us define models concisely by combining its variables, layers, and scopes.

Variables

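Defining variables with native TensorFlow functions is fairly involved, whereas slim makes it simple.

For example, to create a weight variable on a specific CPU device and attach L2 regularization to it:

weights = slim.variable('weights',
                        shape=[10, 10, 3, 3],
                        initializer=tf.truncated_normal_initializer(stddev=0.1),
                        regularizer=slim.l2_regularizer(0.05),
                        device='/CPU:0')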

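Native TensorFlow has two kinds of variables: regular variables and local (transient) variables. The vast majority are regular variables: once created, they persist and can be written to disk with a saver. Local variables exist only for the duration of a session and are not saved to disk.

TF-Slim further distinguishes model variables from non-model variables. Model variables are the variables that are trained or fine-tuned during learning and are loaded from a checkpoint during evaluation or inference, such as those created by slim.fully_connected or slim.conv2d. Non-model variables are all other variables that are used during training or evaluation but are not needed for inference, for example global_step. Likewise, moving-average variables may mirror model variables, but the moving averages are not themselves model variables.

Both model variables and regular variables are easy to create with TF-Slim: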

# Model Variables
weights = slim.model_variable('weights',
                              shape=[10, 10, 3 , 3],
                              initializer=tf.truncated_normal_initializer(stddev=0.1),
                              regularizer=slim.l2_regularizer(0.05),
                              device='/CPU:0')
model_variables = slim.get_model_variables()

# Regular variables
my_var = slim.variable('my_var',
                       shape=[20, 1],
                       initializer=tf.zeros_initializer())
regular_variables_and_model_variables = slim.get_variables()

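How does this work? When you create a variable through a TF-Slim layer or directly through the slim.model_variable function, TF-Slim adds it to the tf.GraphKeys.MODEL_VARIABLES collection. What if you have your own custom layers or variable-creation routine but still want TF-Slim to manage your model variables? TF-Slim provides a convenience function for adding a model variable to its collection: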

my_model_variable = CreateViaCustomCode()

# Letting TF-Slim know about the additional variable.
slim.add_model_variable(my_model_variable)
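
The bookkeeping described above can be pictured with a small pure-Python sketch. It does not use TensorFlow: the collections dictionary and the helpers variable, model_variable, and add_model_variable below are hypothetical stand-ins that only mimic how a variable might be registered in one or two collections, not TF-Slim's actual implementation.

```python
# Pure-Python sketch (no TensorFlow): variables are plain strings, and
# "collections" are lists keyed by name. All names here are illustrative.
GLOBAL_VARIABLES = "variables"
MODEL_VARIABLES = "model_variables"

collections = {GLOBAL_VARIABLES: [], MODEL_VARIABLES: []}


def variable(name):
    """Register a regular variable: it goes into the global collection only."""
    collections[GLOBAL_VARIABLES].append(name)
    return name


def add_model_variable(name):
    """Mark an existing variable as a model variable (no duplicates)."""
    if name not in collections[MODEL_VARIABLES]:
        collections[MODEL_VARIABLES].append(name)


def model_variable(name):
    """Register a model variable: global collection plus model collection."""
    variable(name)
    add_model_variable(name)
    return name


model_variable("conv1/weights")
variable("global_step")
custom = variable("custom/weights")   # created by some other code path
add_model_variable(custom)            # tell the registry it belongs to the model

print(collections[MODEL_VARIABLES])   # model variables only
print(collections[GLOBAL_VARIABLES])  # regular and model variables together
```

In this sketch global_step never reaches the model collection, which mirrors the distinction between model and non-model variables drawn earlier.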

Layers

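Although the set of TensorFlow operations is very extensive, developers of neural networks usually think about models in terms of higher-level concepts such as layers, losses, metrics, and networks. A layer, such as a convolutional layer, a fully connected layer, or a batch-norm layer, is more abstract than a single TensorFlow operation and typically involves several of them. Moreover, a layer usually has variables (trainable parameters) associated with it, unlike lower-level operations. For example, a convolutional layer is composed of several low-level operations:

1. Creating the weight and bias variables;
2. Convolving the weights with the input;
3. Adding the biases to the result of the convolution;
4. Applying an activation function.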
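
To make the four steps concrete, here is a minimal pure-Python sketch. It is an illustration only: it uses a 1-D convolution over a list instead of a 2-D convolution over a tensor, and the function name conv1d_layer and the fixed kernel values are assumptions chosen for this example.

```python
# Pure-Python sketch of the four steps, using a 1-D "valid" convolution
# (cross-correlation, as deep-learning libraries compute it) on a list.

def conv1d_layer(inputs, kernel, bias):
    # Step 1 happened outside: `kernel` and `bias` are the layer's variables.
    out_len = len(inputs) - len(kernel) + 1
    outputs = []
    for i in range(out_len):
        # Step 2: slide the kernel over the input.
        acc = sum(inputs[i + j] * kernel[j] for j in range(len(kernel)))
        # Step 3: add the bias.
        acc += bias
        # Step 4: apply the activation function (ReLU).
        outputs.append(max(0.0, acc))
    return outputs


kernel = [1.0, 0.0, -1.0]   # step 1: weights (fixed here for reproducibility)
bias = 0.5                  # step 1: bias
result = conv1d_layer([1.0, 2.0, 4.0, 7.0, 3.0], kernel, bias)
print(result)  # [0.0, 0.0, 1.5]
```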

Writing a convolutional layer in raw TensorFlow code is rather laborious:

input = ...
with tf.name_scope('conv1_1') as scope:
  kernel = tf.Variable(tf.truncated_normal([3, 3, 64, 128], dtype=tf.float32,
                                           stddev=1e-1), name='weights')
  conv = tf.nn.conv2d(input, kernel, [1, 1, 1, 1], padding='SAME')
  biases = tf.Variable(tf.constant(0.0, shape=[128], dtype=tf.float32),
                       trainable=True, name='biases')
  bias = tf.nn.bias_add(conv, biases)
  conv1 = tf.nn.relu(bias, name=scope)

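To reduce this duplication, TF-Slim provides a number of convenient, more abstract operations at the layer level. For example, the convolutional layer above can be created with a single call: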

input = ...
net = slim.conv2d(input, 128, [3, 3], scope='conv1_1')

TF-Slim provides functions for many standard neural-network layers:

Layer	TF-Slim
BiasAdd	slim.bias_add
BatchNorm	slim.batch_norm
Conv2d	slim.conv2d
Conv2dInPlane	slim.conv2d_in_plane
Conv2dTranspose (Deconv)	slim.conv2d_transpose
FullyConnected	slim.fully_connected
AvgPool2D	slim.avg_pool2d
Dropout	slim.dropout
Flatten	slim.flatten
MaxPool2D	slim.max_pool2d
OneHotEncoding	slim.one_hot_encoding
SeparableConv2	slim.separable_conv2d
UnitNorm	slim.unit_norm

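TF-Slim also provides two meta-operations, repeat and stack, that make it easy to apply the same operation repeatedly. Consider, for example, the following snippet from the VGG network, in which several convolutions are applied in a row between pooling layers: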

net = ...
net = slim.conv2d(net, 256, [3, 3], scope='conv3_1')
net = slim.conv2d(net, 256, [3, 3], scope='conv3_2')
net = slim.conv2d(net, 256, [3, 3], scope='conv3_3')
net = slim.max_pool2d(net, [2, 2], scope='pool2')

One way to reduce the duplication is a for loop:

net = ...
for i in range(3):
  net = slim.conv2d(net, 256, [3, 3], scope='conv3_%d' % (i+1))
net = slim.max_pool2d(net, [2, 2], scope='pool2')

With TF-Slim's repeat method, the same thing takes just two lines:

net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
net = slim.max_pool2d(net, [2, 2], scope='pool2')

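Note that slim.repeat not only applies the same arguments in each call but also unrolls the scope so that each successive call gets a numbered suffix. In the example above, the three convolutional layers are scoped 'conv3/conv3_1', 'conv3/conv3_2', and 'conv3/conv3_3'.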
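
The scope naming can be illustrated with a small pure-Python sketch. The repeat helper and the fake_layer function below are hypothetical stand-ins written for this example; they only reproduce the naming pattern described above and are not TF-Slim's implementation.

```python
# Pure-Python sketch of a repeat helper. `fake_layer` records the scope it was
# called with instead of building a real convolution.
created_scopes = []


def fake_layer(net, num_outputs, kernel_size, scope):
    created_scopes.append(scope)
    return net + [(scope, num_outputs, tuple(kernel_size))]


def repeat(net, repetitions, layer, *args, scope):
    """Apply `layer` `repetitions` times with identical arguments."""
    for i in range(repetitions):
        # Each call gets its own numbered scope under the shared prefix.
        net = layer(net, *args, scope="%s/%s_%d" % (scope, scope, i + 1))
    return net


net = repeat([], 3, fake_layer, 256, [3, 3], scope="conv3")
print(created_scopes)  # ['conv3/conv3_1', 'conv3/conv3_2', 'conv3/conv3_3']
```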

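slim.repeat applies the same arguments every time (kernel size, number of output channels, padding type, and stride), whereas slim.stack lets us call the same operation repeatedly with different arguments and build the resulting tower of layers in one line. slim.stack also creates a new scope for each layer. For example, a multi-layer perceptron (MLP):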

# Verbose way:
x = slim.fully_connected(x, 32, scope='fc/fc_1')
x = slim.fully_connected(x, 64, scope='fc/fc_2')
x = slim.fully_connected(x, 128, scope='fc/fc_3')

# Equivalent, TF-Slim way using slim.stack:
x = slim.stack(x, slim.fully_connected, [32, 64, 128], scope='fc')

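In this example, slim.stack calls slim.fully_connected three times, passing the output of each call to the next, and each call receives a different argument: the three layers have 32, 64, and 128 units respectively. Similarly, slim.stack can chain convolutional layers that take different arguments: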

# Verbose way:
x = slim.conv2d(x, 32, [3, 3], scope='core/core_1')
x = slim.conv2d(x, 32, [1, 1], scope='core/core_2')
x = slim.conv2d(x, 64, [3, 3], scope='core/core_3')
x = slim.conv2d(x, 64, [1, 1], scope='core/core_4')

# Using stack:
x = slim.stack(x, slim.conv2d, [(32, [3, 3]), (32, [1, 1]), (64, [3, 3]), (64, [1, 1])], scope='core')
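
How a stack helper might turn one list of arguments into a chain of calls can be sketched in plain Python. The stack and describe_layer functions below are hypothetical illustrations; in particular, the rule that a bare value is treated as a single positional argument while a tuple is unpacked is an assumption of this sketch.

```python
def stack(net, layer, stack_args, scope):
    """Apply `layer` once per entry in `stack_args`, chaining the outputs."""
    for i, args in enumerate(stack_args, start=1):
        # A bare value such as 32 is treated as a single positional argument;
        # a tuple such as (32, [3, 3]) is unpacked into several.
        if not isinstance(args, (list, tuple)):
            args = (args,)
        net = layer(net, *args, scope="%s/%s_%d" % (scope, scope, i))
    return net


def describe_layer(net, num_outputs, kernel_size=None, scope=None):
    # Stand-in for slim.fully_connected / slim.conv2d: just log the call.
    return net + ["%s: outputs=%d kernel=%s" % (scope, num_outputs, kernel_size)]


mlp = stack([], describe_layer, [32, 64, 128], scope="fc")
cnn = stack([], describe_layer, [(32, [3, 3]), (32, [1, 1])], scope="core")
print(mlp)
print(cnn)
```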

Scopes

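In addition to TensorFlow's two native scope mechanisms (name_scope and variable_scope), TF-Slim adds a third one called arg_scope. It lets us name one or more operations (such as slim.conv2d or slim.fully_connected) and set default arguments for them (such as the initializer or regularizer). To see why this is useful, consider the following code: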

net = slim.conv2d(inputs, 64, [11, 11], 4, padding='SAME',
                  weights_initializer=tf.truncated_normal_initializer(stddev=0.01),
                  weights_regularizer=slim.l2_regularizer(0.0005), scope='conv1')
net = slim.conv2d(net, 128, [11, 11], padding='VALID',
                  weights_initializer=tf.truncated_normal_initializer(stddev=0.01),
                  weights_regularizer=slim.l2_regularizer(0.0005), scope='conv2')
net = slim.conv2d(net, 256, [11, 11], padding='SAME',
                  weights_initializer=tf.truncated_normal_initializer(stddev=0.01),
                  weights_regularizer=slim.l2_regularizer(0.0005), scope='conv3')

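Clearly, these three convolutional layers share many arguments. Two of them use SAME padding, and all three use the same weight initializer and regularizer. Written this way, the code is tedious and easy to get wrong when copying. One option is to store the shared values in variables: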

padding = 'SAME'
initializer = tf.truncated_normal_initializer(stddev=0.01)
regularizer = slim.l2_regularizer(0.0005)
net = slim.conv2d(inputs, 64, [11, 11], 4,
                  padding=padding,
                  weights_initializer=initializer,
                  weights_regularizer=regularizer,
                  scope='conv1')
net = slim.conv2d(net, 128, [11, 11],
                  padding='VALID',
                  weights_initializer=initializer,
                  weights_regularizer=regularizer,
                  scope='conv2')
net = slim.conv2d(net, 256, [11, 11],
                  padding=padding,
                  weights_initializer=initializer,
                  weights_regularizer=regularizer,
                  scope='conv3')

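This guarantees that the three calls share exactly the same argument values, but it does not reduce the clutter. With arg_scope we can both guarantee consistent arguments and greatly simplify the code: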

with slim.arg_scope([slim.conv2d], padding='SAME',
                    weights_initializer=tf.truncated_normal_initializer(stddev=0.01),
                    weights_regularizer=slim.l2_regularizer(0.0005)):
  net = slim.conv2d(inputs, 64, [11, 11], scope='conv1')
  net = slim.conv2d(net, 128, [11, 11], padding='VALID', scope='conv2')
  net = slim.conv2d(net, 256, [11, 11], scope='conv3')

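As the example shows, arg_scope makes the code simpler, more readable, and easier to maintain. Note that although arg_scope sets default argument values, they can still be overridden at the call site: here padding defaults to 'SAME', but the second convolution overrides it with 'VALID'.

Like tf.variable_scope and tf.name_scope, arg_scope can also be nested: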

with slim.arg_scope([slim.conv2d, slim.fully_connected],
                    activation_fn=tf.nn.relu,
                    weights_initializer=tf.truncated_normal_initializer(stddev=0.01),
                    weights_regularizer=slim.l2_regularizer(0.0005)):
  with slim.arg_scope([slim.conv2d], stride=1, padding='SAME'):
    net = slim.conv2d(inputs, 64, [11, 11], 4, padding='VALID', scope='conv1')
    net = slim.conv2d(net, 256, [5, 5],
                      weights_initializer=tf.truncated_normal_initializer(stddev=0.03),
                      scope='conv2')
    net = slim.fully_connected(net, 1000, activation_fn=None, scope='fc')

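In this example, the outer arg_scope applies the same activation_fn, weights_initializer, and weights_regularizer defaults to both slim.conv2d and slim.fully_connected. The inner arg_scope adds two further defaults, stride and padding, for slim.conv2d only.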
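
The mechanism behind default arguments, nesting, and call-site overrides can be sketched in pure Python with a decorator and a context manager. This is a simplified illustration under stated assumptions, not TF-Slim's implementation: the names add_arg_scope, arg_scope, _scope_stack, and the toy conv2d below are defined only for this example, and defaults are looked up by function name.

```python
import contextlib
import functools

_scope_stack = [{}]  # innermost entry maps function name -> default kwargs


def add_arg_scope(func):
    """Make `func` pick up defaults from the active arg_scope."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        merged = dict(_scope_stack[-1].get(func.__name__, {}))
        merged.update(kwargs)  # explicit call-site arguments win
        return func(*args, **merged)
    return wrapper


@contextlib.contextmanager
def arg_scope(funcs, **defaults):
    """Set default keyword arguments for `funcs` inside the with-block."""
    current = {name: dict(kw) for name, kw in _scope_stack[-1].items()}
    for func in funcs:
        current.setdefault(func.__name__, {}).update(defaults)
    _scope_stack.append(current)
    try:
        yield
    finally:
        _scope_stack.pop()


@add_arg_scope
def conv2d(inputs, num_outputs, stride=1, padding="SAME", activation="relu"):
    return {"outputs": num_outputs, "stride": stride,
            "padding": padding, "activation": activation}


with arg_scope([conv2d], activation="tanh"):
    with arg_scope([conv2d], padding="VALID"):
        inner = conv2d(None, 64)                       # inherits both defaults
        overridden = conv2d(None, 64, padding="SAME")  # call-site override
    outer = conv2d(None, 64)                           # inner scope has ended

print(inner, overridden, outer)
```

Copying the enclosing scope's defaults when a new scope is entered is what lets the inner scope extend the outer one without modifying it.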

Working Example: Specifying the VGG16 Layers

Next, we combine TF-Slim variables, operations, and scopes to write the VGG16 architecture:

def vgg16(inputs):
  with slim.arg_scope([slim.conv2d, slim.fully_connected],
                      activation_fn=tf.nn.relu,
                      weights_initializer=tf.truncated_normal_initializer(0.0, 0.01),
                      weights_regularizer=slim.l2_regularizer(0.0005)):
    net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
    net = slim.max_pool2d(net, [2, 2], scope='pool1')
    net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
    net = slim.max_pool2d(net, [2, 2], scope='pool2')
    net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
    net = slim.max_pool2d(net, [2, 2], scope='pool3')
    net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
    net = slim.max_pool2d(net, [2, 2], scope='pool4')
    net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
    net = slim.max_pool2d(net, [2, 2], scope='pool5')
    net = slim.fully_connected(net, 4096, scope='fc6')
    net = slim.dropout(net, 0.5, scope='dropout6')
    net = slim.fully_connected(net, 4096, scope='fc7')
    net = slim.dropout(net, 0.5, scope='dropout7')
    net = slim.fully_connected(net, 1000, activation_fn=None, scope='fc8')
  return net

Training Models

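Training a model requires a model, a loss function, an optimizer, gradient computation, and a training routine that repeatedly computes gradients and updates the weights. TF-Slim provides common loss functions together with a set of helper functions for training and evaluation.

Losses

The loss function defines the quantity we want to minimize. For classification problems this is typically cross-entropy; for regression problems it is often the mean squared error.

Some models, such as multi-task learning models, need several loss functions at once. In other words, the loss being minimized is the sum of several losses. Consider, for example, a model that must predict both the class of the main object in an image and the object's boundary: its loss is the sum of a classification loss and a segmentation loss.

TF-Slim's losses module provides an easy-to-use mechanism for defining loss functions. Take training the VGG network as an example: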

import tensorflow as tf
import tensorflow.contrib.slim.nets as nets

slim = tf.contrib.slim
vgg = nets.vgg

# Load the images and labels.
images, labels = ...

# Create the model.
predictions, _ = vgg.vgg_16(images)

# Define the loss functions and get the total loss.
loss = slim.losses.softmax_cross_entropy(predictions, labels)

In this example, cross-entropy serves as the classification loss. Now consider a multi-task model that produces multiple outputs:

# Load the images and labels.
images, scene_labels, depth_labels = ...

# Create the model.
scene_predictions, depth_predictions = CreateMultiTaskModel(images)

# Define the loss functions and get the total loss.
classification_loss = slim.losses.softmax_cross_entropy(scene_predictions, scene_labels)
sum_of_squares_loss = slim.losses.sum_of_squares(depth_predictions, depth_labels)

# The following two lines have the same effect:
total_loss = classification_loss + sum_of_squares_loss
total_loss = slim.losses.get_total_loss(add_regularization_losses=False)

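In this example there are two losses, created by calling slim.losses.softmax_cross_entropy and slim.losses.sum_of_squares. The total loss can be obtained either by adding them together or by calling slim.losses.get_total_loss(). How does this work? Whenever a loss is created through a function in TF-Slim's losses module, it is automatically added to a collection of losses. This lets us either manage the total loss by hand (adding the individual losses ourselves) or let TF-Slim manage it for us (calling slim.losses.get_total_loss to sum the losses in the collection).

What if the loss is custom and not created with a slim loss function? loss_ops.py also provides a function that adds such a loss to the loss collection: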

# Load the images and labels.
images, scene_labels, depth_labels, pose_labels = ...

# Create the model.
scene_predictions, depth_predictions, pose_predictions = CreateMultiTaskModel(images)

# Define the loss functions and get the total loss.
classification_loss = slim.losses.softmax_cross_entropy(scene_predictions, scene_labels)
sum_of_squares_loss = slim.losses.sum_of_squares(depth_predictions, depth_labels)
pose_loss = MyCustomLossFunction(pose_predictions, pose_labels)
slim.losses.add_loss(pose_loss) # Letting TF-Slim know about the additional loss.

# The following two ways to compute the total loss are equivalent:
regularization_loss = tf.add_n(slim.losses.get_regularization_losses())
total_loss1 = classification_loss + sum_of_squares_loss + pose_loss + regularization_loss

# (Regularization Loss is included in the total loss by default).
total_loss2 = slim.losses.get_total_loss()

Here, too, the total loss can be obtained in two ways: by summing the individual losses by hand or by calling slim.losses.get_total_loss().
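
The loss collection can be pictured with a short pure-Python sketch in which losses are plain floats. The helpers add_loss, squared_error, and get_total_loss and the two module-level lists are hypothetical stand-ins defined for this example; they mimic the behaviour described above rather than reproduce TF-Slim's code.

```python
# Pure-Python sketch: losses are floats, and the "collections" are lists.
_losses = []
_regularization_losses = []


def add_loss(value):
    """Register a loss so that get_total_loss can find it later."""
    _losses.append(value)
    return value


def squared_error(prediction, label):
    # A built-in style loss registers itself as a side effect.
    return add_loss((prediction - label) ** 2)


def get_total_loss(add_regularization_losses=True):
    total = sum(_losses)
    if add_regularization_losses:
        total += sum(_regularization_losses)
    return total


_regularization_losses.append(0.25)     # e.g. produced by a weight regularizer
depth_loss = squared_error(3.0, 1.0)    # registered automatically
pose_loss = add_loss(abs(2.0 - 0.5))    # custom loss, registered by hand

manual_total = depth_loss + pose_loss + sum(_regularization_losses)
print(manual_total, get_total_loss(), get_total_loss(add_regularization_losses=False))
```

Both ways of computing the total agree because every loss, built-in or custom, ends up in the same list.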

Training Loop

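TF-Slim provides a set of tools for training models in learning.py. These include a train function that repeatedly measures the loss, computes gradients, and saves the model to disk, as well as several convenience functions for manipulating gradients. For example, once the model, the loss function, and the optimizer have been specified, we can call slim.learning.create_train_op and slim.learning.train to perform the optimization: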

g = tf.Graph()

# Create the model and specify the losses...
...

total_loss = slim.losses.get_total_loss()
optimizer = tf.train.GradientDescentOptimizer(learning_rate)

# create_train_op ensures that each time we ask for the loss, the update_ops
# are run and the gradients being computed are applied too.
train_op = slim.learning.create_train_op(total_loss, optimizer)
logdir = ... # Where checkpoints are stored.

slim.learning.train(
    train_op,
    logdir,
    number_of_steps=1000,
    save_summaries_secs=300,
    save_interval_secs=600)

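In this example, slim.learning.create_train_op creates train_op, which computes the loss and applies the gradient step. logdir specifies the directory in which checkpoints and event files are stored, and number_of_steps limits the number of gradient steps, here 1000. Finally, save_summaries_secs=300 means summaries are computed every 5 minutes, and save_interval_secs=600 means a model checkpoint is saved every 10 minutes.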
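
The interplay between a step limit and a time-based save interval can be sketched without TensorFlow. Everything below is an assumption made for illustration: the train function, its clock and save parameters, the simulated one-second steps, and the final save at the end are choices of this sketch, not a description of how slim.learning.train is implemented.

```python
def train(train_step, number_of_steps, save_interval_secs, clock, save):
    """Run `train_step` repeatedly, saving whenever enough time has passed."""
    last_save = clock()
    loss = None
    for step in range(1, number_of_steps + 1):
        loss = train_step()
        if clock() - last_save >= save_interval_secs:
            save(step)
            last_save = clock()
    save(number_of_steps)  # this sketch always saves once at the end
    return loss


# Toy problem: minimise (w - 3)^2 with plain gradient descent.
state = {"w": 0.0, "time": 0.0}


def train_step(learning_rate=0.1):
    grad = 2.0 * (state["w"] - 3.0)
    state["w"] -= learning_rate * grad
    state["time"] += 1.0            # pretend every step takes one second
    return (state["w"] - 3.0) ** 2  # loss after the update


saved_steps = []
final_loss = train(train_step, number_of_steps=50, save_interval_secs=20,
                   clock=lambda: state["time"], save=saved_steps.append)
print(saved_steps, round(state["w"], 3), final_loss)
```

Passing the clock in as a function keeps the sketch deterministic; a real loop would read wall-clock time instead.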

Working Example: Training the VGG16 Model

The following example puts the pieces above together to train VGG16:

import tensorflow as tf
import tensorflow.contrib.slim.nets as nets

slim = tf.contrib.slim
vgg = nets.vgg

...

train_log_dir = ...
if not tf.gfile.Exists(train_log_dir):
  tf.gfile.MakeDirs(train_log_dir)

with tf.Graph().as_default():
  # Set up the data loading:
  images, labels = ...

  # Define the model:
  predictions, _ = vgg.vgg_16(images, is_training=True)

  # Specify the loss function:
  slim.losses.softmax_cross_entropy(predictions, labels)

  total_loss = slim.losses.get_total_loss()
  tf.summary.scalar('losses/total_loss', total_loss)

  # Specify the optimization scheme:
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=.001)

  # create_train_op that ensures that when we evaluate it to get the loss,
  # the update_ops are done and the gradient updates are computed.
  train_tensor = slim.learning.create_train_op(total_loss, optimizer)

  # Actually runs training.
  slim.learning.train(train_tensor, train_log_dir)

Fine-Tuning Existing Models

Brief Recap on Restoring Variables from a Checkpoint

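After a model has been trained, we usually save it and later restore it with tf.train.Saver() at inference time. In most cases, tf.train.Saver() is sufficient for restoring all or only some of the variables.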

# Create some variables.
v1 = tf.Variable(..., name="v1")
v2 = tf.Variable(..., name="v2")
...
# Add ops to restore all the variables.
restorer = tf.train.Saver()

# Add ops to restore some variables.
restorer = tf.train.Saver([v1, v2])

# Later, launch the model, use the saver to restore variables from disk, and
# do some work with the model.
with tf.Session() as sess:
  # Restore variables from disk.
  restorer.restore(sess, "/tmp/model.ckpt")
  print("Model restored.")
  # Do some work with the model
  ...

Partially Restoring Models

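The section above shows how to restore a model with tf.train.Saver(). Often, however, we want to fine-tune a pre-trained model, that is, continue training from its pre-trained weights. For these cases, TF-Slim provides a number of helper functions for selecting the subset of variables to restore: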

# Create some variables.
v1 = slim.variable(name="v1", ...)
v2 = slim.variable(name="nested/v2", ...)
...

# Get list of variables to restore (which contains only 'v2'). These are all
# equivalent methods:
variables_to_restore = slim.get_variables_by_name("v2")
# or
variables_to_restore = slim.get_variables_by_suffix("2")
# or
variables_to_restore = slim.get_variables(scope="nested")
# or
variables_to_restore = slim.get_variables_to_restore(include=["nested"])
# or
variables_to_restore = slim.get_variables_to_restore(exclude=["v1"])

# Create the saver which will be used to restore the variables.
restorer = tf.train.Saver(variables_to_restore)

with tf.Session() as sess:
  # Restore variables from disk.
  restorer.restore(sess, "/tmp/model.ckpt")
  print("Model restored.")
  # Do some work with the model
  ...

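The example above shows several ways to select the variables to restore:

slim.get_variables_by_name('v2') selects variables that were given the name 'v2', regardless of the enclosing scope;
slim.get_variables_by_suffix('2') selects variables whose names end with '2';
slim.get_variables(scope="nested") selects variables inside the 'nested' scope;
slim.get_variables_to_restore(include=["nested"]) selects variables under the scopes listed in include;
slim.get_variables_to_restore(exclude=["v1"]) selects all variables except those under the scopes listed in exclude.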
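
The selection rules can be mimicked on plain strings. The sketch below is a simplification under stated assumptions: variables are represented by their names only, scopes are matched as simple string prefixes, and the two functions are hypothetical re-implementations written for this example rather than the TF-Slim functions themselves.

```python
def get_variables(names, scope=None, suffix=None):
    """Select names that start with `scope` and end with `suffix`."""
    selected = []
    for name in names:
        if scope is not None and not name.startswith(scope):
            continue
        if suffix is not None and not name.endswith(suffix):
            continue
        selected.append(name)
    return selected


def get_variables_to_restore(names, include=None, exclude=None):
    """Keep names under any `include` scope, then drop `exclude` scopes."""
    if include is None:
        kept = list(names)
    else:
        kept = [n for scope in include for n in get_variables(names, scope)]
    excluded = set()
    for scope in exclude or []:
        excluded.update(get_variables(names, scope))
    return [n for n in kept if n not in excluded]


all_names = ["v1", "nested/v2"]
print(get_variables(all_names, suffix="2"))
print(get_variables(all_names, scope="nested"))
print(get_variables_to_restore(all_names, include=["nested"]))
print(get_variables_to_restore(all_names, exclude=["v1"]))
```

With the two names used in the example above, every call selects only 'nested/v2'.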

Restoring models with different variable names

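When restoring variables from a checkpoint, the Saver looks up variable names in the checkpoint file and maps them to variables in the current graph. In the examples above, the names looked up in the checkpoint are taken from each variable's var.op.name.

This works well when the variable names in the checkpoint match those in the current graph. But what if a variable in the checkpoint has a different name from the variable that plays the same role in the graph? In that case we must give the Saver a dictionary that maps each checkpoint variable name to the corresponding graph variable, as in the following example: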

# Assuming that 'conv1/weights' should be restored from 'vgg16/conv1/weights'
def name_in_checkpoint(var):
  return 'vgg16/' + var.op.name

# Alternatively, assuming that 'conv1/weights' and 'conv1/bias' should be restored from 'conv1/params1' and 'conv1/params2'
def name_in_checkpoint(var):
  if "weights" in var.op.name:
    return var.op.name.replace("weights", "params1")
  if "bias" in var.op.name:
    return var.op.name.replace("bias", "params2")

variables_to_restore = slim.get_model_variables()
variables_to_restore = {name_in_checkpoint(var):var for var in variables_to_restore}
restorer = tf.train.Saver(variables_to_restore)

with tf.Session() as sess:
  # Restore variables from disk.
  restorer.restore(sess, "/tmp/model.ckpt")
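
The effect of the name mapping can be sketched with dictionaries standing in for the checkpoint and the graph. The checkpoint contents, the 'vgg16/' prefix, and the restore helper below are assumptions invented for this illustration; a real Saver works on variables and checkpoint files rather than on Python dicts.

```python
# Pure-Python sketch: a "checkpoint" is a dict of saved values keyed by the
# names used when it was written; the "graph" holds the current variables.
checkpoint = {"vgg16/conv1/weights": [0.1, 0.2], "vgg16/conv1/biases": [0.0]}
graph = {"conv1/weights": None, "conv1/biases": None}


def name_in_checkpoint(graph_name):
    # Assumption for this example: the checkpoint adds a 'vgg16/' prefix.
    return "vgg16/" + graph_name


def restore(graph, checkpoint, name_map):
    """Copy each checkpoint entry into the graph variable it maps to."""
    for ckpt_name, graph_name in name_map.items():
        if ckpt_name not in checkpoint:
            raise KeyError("not found in checkpoint: %s" % ckpt_name)
        graph[graph_name] = checkpoint[ckpt_name]


name_map = {name_in_checkpoint(name): name for name in graph}
restore(graph, checkpoint, name_map)
print(graph)
```

The dictionary is keyed by checkpoint name, which is the same orientation as the mapping passed to the Saver in the example above.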

Fine-Tuning a Model on a different task

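Consider a model that has been trained on ImageNet, which has 1000 classes. We now want to use it to classify the Pascal VOC dataset, which has only 20 classes. To do so, we initialize the new model from the pre-trained checkpoint but skip the layers that should not be carried over. In the code below, only the convolutional layers are restored; the fully connected layers fc6, fc7, and fc8 are re-initialized and trained from scratch: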

# Load the Pascal VOC data
image, label = MyPascalVocDataLoader(...)
images, labels = tf.train.batch([image, label], batch_size=32)

# Create the model
predictions, _ = vgg.vgg_16(images, num_classes=20)

train_op = slim.learning.create_train_op(...)

# Specify where the Model, trained on ImageNet, was saved.
model_path = '/path/to/pre_trained_on_imagenet.checkpoint'

# Specify where the new model will live:
log_dir = '/path/to/my_pascal_model_dir/'

# Restore only the convolutional layers:
variables_to_restore = slim.get_variables_to_restore(exclude=['fc6', 'fc7', 'fc8'])
init_fn = slim.assign_from_checkpoint_fn(model_path, variables_to_restore)

# Start training.
slim.learning.train(train_op, log_dir, init_fn=init_fn)

Evaluating Models

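Once training has started, we want to know how well the model is doing: whether it is underfitting or overfitting, whether the learning rate is too high, whether training has converged, and so on. This is done by choosing a set of evaluation metrics, running inference with the model currently being trained, and comparing the predictions with the ground truth. This evaluation step is usually run periodically, for example once every 10 epochs.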

Metrics

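The metrics used to judge a model are usually not the loss itself but are chosen according to the model's purpose. For example, although we minimize the loss during training, the evaluation metric might be the F1 score or the Intersection over Union (IoU) score.

TF-Slim provides a set of metric operations that make evaluating models easy. Conceptually, computing a metric value can be divided into three parts:

Initialization: initialize the variables used to compute the metric;
Aggregation: perform the operations (for example, sums) used to compute the metric;
Finalization (optional): perform any final operation needed to obtain the metric value, such as computing a mean, minimum, or maximum.

For example, to compute mean_absolute_error, two variables, count and total, are initialized to zero. During aggregation, the absolute difference between each prediction and its label is computed and added to total; each time an error is added to total, count is incremented by one. Finally, total divided by count gives the mean_absolute_error.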
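
The three phases can be written out as a small pure-Python class. The class name StreamingMeanAbsoluteError and its update and value methods are assumptions chosen for this sketch; they illustrate the total and count bookkeeping described above and are not the TF-Slim metric ops.

```python
class StreamingMeanAbsoluteError:
    """Accumulates absolute errors batch by batch."""

    def __init__(self):
        # Initialization: both accumulators start at zero.
        self.total = 0.0
        self.count = 0

    def update(self, predictions, labels):
        # Aggregation: add this batch's absolute errors and sample count.
        for p, y in zip(predictions, labels):
            self.total += abs(p - y)
            self.count += 1
        return self.value()

    def value(self):
        # Finalization: divide the running total by the running count.
        return self.total / self.count if self.count else 0.0


mae = StreamingMeanAbsoluteError()
mae.update([1.0, 2.0], [1.5, 2.0])   # errors 0.5 and 0.0
mae.update([4.0, 0.0], [2.0, 1.0])   # errors 2.0 and 1.0
print(mae.value())  # 0.875
```

Calling value() repeatedly does not change the result, while each update() call folds another batch into the running totals.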

The following example demonstrates how these metrics are declared:

images, labels = LoadTestData(...)
predictions = MyModel(images)

mae_value_op, mae_update_op = slim.metrics.streaming_mean_absolute_error(predictions, labels)
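# The relative error is normalized by the labels here (an assumption of this example).
mre_value_op, mre_update_op = slim.metrics.streaming_mean_relative_error(predictions, labels, normalizer=labels)
# mean_relative_errors is assumed to be a tensor of per-example relative errors defined elsewhere.
pl_value_op, pl_update_op = slim.metrics.streaming_percentage_less(mean_relative_errors, 0.3)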

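As the example shows, creating a metric returns two values: a value_op and an update_op. The value_op is an idempotent operation that returns the current value of the metric. The update_op performs the aggregation step described above and also returns the value of the metric.

Keeping track of each value_op and update_op can be laborious. To deal with this, TF-Slim provides two convenience functions: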

# Aggregates the value and update ops in two lists:
value_ops, update_ops = slim.metrics.aggregate_metrics(
    slim.metrics.streaming_mean_absolute_error(predictions, labels),
    slim.metrics.streaming_mean_squared_error(predictions, labels))

# Aggregates the value and update ops in two dictionaries:
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
    "eval/mean_absolute_error": slim.metrics.streaming_mean_absolute_error(predictions, labels),
    "eval/mean_squared_error": slim.metrics.streaming_mean_squared_error(predictions, labels),
})

Working example: Tracking Multiple Metrics

The following example ties the steps above together:

import tensorflow as tf
import tensorflow.contrib.slim.nets as nets

slim = tf.contrib.slim
vgg = nets.vgg


# Load the data
images, labels = load_data(...)

# Define the network
predictions, _ = vgg.vgg_16(images)

# Choose the metrics to compute:
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
    "eval/mean_absolute_error": slim.metrics.streaming_mean_absolute_error(predictions, labels),
    "eval/mean_squared_error": slim.metrics.streaming_mean_squared_error(predictions, labels),
})

# Evaluate the model using 1000 batches of data:
num_batches = 1000

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  sess.run(tf.local_variables_initializer())

  for batch_id in range(num_batches):
    sess.run(list(names_to_updates.values()))

  metric_values = sess.run(list(names_to_values.values()))
  for metric, value in zip(names_to_values.keys(), metric_values):
    print('Metric %s has value: %f' % (metric, value))

Note that metric_ops.py can be used on its own, independently of layers.py and loss_ops.py.

Evaluation Loop

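TF-Slim provides an evaluation module (evaluation.py) with helper functions for evaluating models using the metrics from metric_ops.py. These include a function that periodically runs the evaluation, computes the metrics over batches of data, and prints and summarizes the results. For example: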

import math

import tensorflow as tf

slim = tf.contrib.slim

# Load the data
images, labels = load_data(...)

# Define the network
predictions = MyModel(images)

# Choose the metrics to compute:
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
    'accuracy': slim.metrics.streaming_accuracy(predictions, labels),
    'precision': slim.metrics.streaming_precision(predictions, labels),
    'recall': slim.metrics.streaming_recall(predictions, labels),
})

# Create the summary ops such that they also print out to std output:
summary_ops = []
for metric_name, metric_value in names_to_values.items():
  op = tf.summary.scalar(metric_name, metric_value)
  op = tf.Print(op, [metric_value], metric_name)
  summary_ops.append(op)

num_examples = 10000
batch_size = 32
num_batches = math.ceil(num_examples / float(batch_size))

# Setup the global step.
slim.get_or_create_global_step()

checkpoint_dir = ... # Where the checkpoints to evaluate are stored.
log_dir = ... # Where the summaries are stored.
eval_interval_secs = ... # How often to run the evaluation.
slim.evaluation.evaluation_loop(
    'local',
    checkpoint_dir,
    log_dir,
    num_evals=num_batches,
    eval_op=list(names_to_updates.values()),
    summary_op=tf.summary.merge(summary_ops),
    eval_interval_secs=eval_interval_secs)

Reference:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim