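ResNet is the landmark work that Kaiming He (何恺明) and his co-authors published in 2015. Its influence has been enormous, and I have to say: these people are seriously good.
Here is the original paper first: https://arxiv.org/abs/1512.03385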

Background of ResNet

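First, why is deep learning called "deep"? Because its models have many layers. In general, the more layers a network has, the greater its capacity to learn. My understanding is that a deep learning model is one complicated function, and the more parameters it has, the more situations it can fit (given a suitable architecture, of course). Hence the saying: the deeper, the better.
But does simply stacking more layers help? The answer is no. The first obstacle is the notorious vanishing and exploding gradient problem. These two problems have been studied continuously, and the remedies proposed for them, such as better activation functions like ReLU and Mish, have greatly advanced deep learning. The paper also points out that even when normalized initialization and normalization layers let a deep plain network converge, its accuracy still degrades as depth grows. This degradation problem is what ResNet sets out to solve.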

Figure: training and test error of 20-layer and 56-layer plain networks (the degradation problem)

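The figure above is taken from the paper. The 56-layer model has a clearly higher error rate than the 20-layer model, on both the training set and the test set. Overfitting cannot be the reason, because an overfitted model would show a lower training error, not a higher one. As mentioned earlier, a deeper model has more parameters, so it is harder to optimize. My take is that with more parameters the model can fit more situations, but nothing guarantees that it fits the one we want, or that the optimizer finds a good fit at all.
So the authors proposed residual learning. Don't be put off by the fancy name; the idea is not that complicated.
Originally, a stack of layers is asked to represent a function x -> H(x): the input is x and the output is H(x). Learning H(x) directly becomes too hard when the network is very deep. So let the layers learn a difference instead, F(x) = H(x) - x, and compute the output as x -> F(x) + x. Intuitively, x carries part of the learning task. Here is a not entirely rigorous example to build intuition: if x = 100 and H(x) = 100.1, then F(x) = 0.1, so instead of learning 100.1 the model only needs to learn 0.1.
The paper also brings up identity mapping, the function x -> x. With residual learning the layers only need to drive F(x) to 0, which is easier than fitting H(x) = x with a stack of nonlinear layers.
The figure below is a simple example: x is taken out and added to F(x), the result of the convolutions that follow, to obtain the final result F(x) + x. The paper calls this connection a "shortcut". The actual code is covered later and is fairly simple.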
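To make the identity-mapping argument concrete, here is a minimal sketch in plain Python. The names linear and residual_block are made up for this illustration and are not part of slim, and F is assumed to be a single linear layer rather than a stack of convolutions. With every weight at zero, F(x) = 0 and the block passes x through unchanged.

```python
def linear(x, weights, bias):
    """Hypothetical stand-in for F: one linear layer on a list of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def residual_block(x, weights, bias):
    """Compute F(x) + x, where F is the linear layer above."""
    fx = linear(x, weights, bias)
    return [f + xi for f, xi in zip(fx, x)]


x = [100.0, -3.0]

# All-zero parameters give F(x) = 0, so the block is an identity mapping.
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
zero_bias = [0.0, 0.0]
identity_out = residual_block(x, zero_weights, zero_bias)

# A plain (non-residual) layer with the same zero parameters outputs 0,
# so it would have to learn an identity matrix to reproduce x.
plain_out = linear(x, zero_weights, zero_bias)

print(identity_out, plain_out)
```

The point of the comparison is only that "do nothing" is the zero point of the residual parameterization, while a plain layer has to learn it.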

[Figure: a residual block with a shortcut connection]

ResNet network structure

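ResNet comes in many versions, which differ in the number of layers. The overall structure, however, matches the rightmost network in the figure below. "conv" stands for convolution, 7x7 and 3x3 are kernel sizes, "pool" stands for pooling, and /2 means the height and width of the feature map are halved. The network starts with a 7x7 convolution followed by a 3x3 pooling layer, and then comes a long stack of 3x3 convolutions with shortcuts.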

Figure: ResNet network structure
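As a rough illustration of the /2 markers, the sketch below traces the feature-map side length through the stages. It assumes a 224x224 input and the per-stage strides listed in the paper's architecture table; STAGES and trace_sizes are hypothetical names, not slim code.

```python
import math

# (stage name, stride applied by that stage), following the paper's table.
STAGES = [
    ("conv1 7x7", 2),
    ("pool 3x3", 2),
    ("conv2_x", 1),
    ("conv3_x", 2),
    ("conv4_x", 2),
    ("conv5_x", 2),
]


def trace_sizes(input_size, stages=STAGES):
    """Return [(stage, side length)] assuming each stage outputs ceil(n / stride)."""
    sizes = []
    size = input_size
    for name, stride in stages:
        size = math.ceil(size / stride)
        sizes.append((name, size))
    return sizes


for name, size in trace_sizes(224):
    print(f"{name:10s} -> {size} x {size}")
```

Five halvings give an overall stride of 32, which is why a 224x224 image ends up as a 7x7 feature map before global pooling.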

Code walkthrough

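Code: resnet_utils.py
resnet_v1.py
If the explanation so far still feels foggy, reading code is more concrete. Let's walk through the ResNet-50 implementation in the slim library.
The figure below lists the network structure of each ResNet version. We focus on the 50-layer column.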

[Figure: table of ResNet architectures for each depth]

Start with the resnet_v1_50 function in resnet_v1.py, which is the entry point for building ResNet-50.

def resnet_v1_50(inputs,
                 num_classes=None,
                 is_training=True,
                 global_pool=True,
                 output_stride=None,
                 spatial_squeeze=True,
                 store_non_strided_activations=False,
                 reuse=None,
                 scope='resnet_v1_50'):
  """ResNet-50 model of [1]. See resnet_v1() for arg and return description."""
  blocks = [
      resnet_v1_block('block1', base_depth=64, num_units=3, stride=2),
      resnet_v1_block('block2', base_depth=128, num_units=4, stride=2),
      resnet_v1_block('block3', base_depth=256, num_units=6, stride=2),
      resnet_v1_block('block4', base_depth=512, num_units=3, stride=1),
  ]
  return resnet_v1(inputs, blocks, num_classes, is_training,
                   global_pool=global_pool, output_stride=output_stride,
                   include_root_block=True, spatial_squeeze=spatial_squeeze,
                   store_non_strided_activations=store_non_strided_activations,
                   reuse=reuse, scope=scope)
resnet_v1_50.default_image_size = resnet_v1.default_image_size

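The implementation is organized around blocks and bottleneck units, which makes the code look quite "elegant" compared with simply piling up layers one by one. The four resnet_v1_block calls correspond to conv2_x through conv5_x in the ResNet structure table.
Now look at resnet_v1_block. It returns a Block that holds num_units bottleneck units, and the parameters in each dict are the ones the bottleneck function receives later. Note that only the last unit of a block uses the given stride; all the others use stride 1.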

def resnet_v1_block(scope, base_depth, num_units, stride):
  return resnet_utils.Block(scope, bottleneck, [{
      'depth': base_depth * 4,
      'depth_bottleneck': base_depth,
      'stride': 1
  }] * (num_units - 1) + [{
      'depth': base_depth * 4,
      'depth_bottleneck': base_depth,
      'stride': stride
  }])
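To see what these four blocks expand to, here is a small stand-alone sketch. It mirrors the argument construction above with a plain namedtuple instead of resnet_utils.Block, so make_block and this Block are hypothetical stand-ins and the unit function is just a string placeholder. It also checks where the "50" comes from, assuming each bottleneck unit counts as 3 convolutions plus the initial 7x7 convolution and the final classification layer.

```python
from collections import namedtuple

# Stand-in for resnet_utils.Block: a scope name, a unit function, and unit args.
Block = namedtuple("Block", ["scope", "unit_fn", "args"])


def make_block(scope, base_depth, num_units, stride):
    """Build the per-unit argument dicts the same way resnet_v1_block does."""
    unit = {"depth": base_depth * 4,
            "depth_bottleneck": base_depth,
            "stride": 1}
    last_unit = dict(unit, stride=stride)  # only the last unit is strided
    return Block(scope, "bottleneck", [unit] * (num_units - 1) + [last_unit])


blocks = [
    make_block("block1", base_depth=64, num_units=3, stride=2),
    make_block("block2", base_depth=128, num_units=4, stride=2),
    make_block("block3", base_depth=256, num_units=6, stride=2),
    make_block("block4", base_depth=512, num_units=3, stride=1),
]

num_units = sum(len(b.args) for b in blocks)
# 3 convolutions per bottleneck unit, plus conv1 and the classifier layer.
num_layers = 3 * num_units + 2

for b in blocks:
    print(b.scope, [(u["depth"], u["stride"]) for u in b.args])
print("units:", num_units, "layers:", num_layers)
```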

resnet_v1_50 then calls resnet_v1(). Let's keep going.

def resnet_v1(inputs,
              blocks,
              num_classes=None,
              is_training=True,
              global_pool=True,
              output_stride=None,
              include_root_block=True,
              spatial_squeeze=True,
              store_non_strided_activations=False,
              reuse=None,
              scope=None):
  with tf.variable_scope(scope, 'resnet_v1', [inputs], reuse=reuse) as sc:
    end_points_collection = sc.original_name_scope + '_end_points'
    with slim.arg_scope([slim.conv2d, bottleneck,
                         resnet_utils.stack_blocks_dense],
                        outputs_collections=end_points_collection):
      with (slim.arg_scope([slim.batch_norm], is_training=is_training)
            if is_training is not None else NoOpScope()):
        net = inputs
        if include_root_block:
          if output_stride is not None:
            if output_stride % 4 != 0:
              raise ValueError('The output_stride needs to be a multiple of 4.')
            output_stride /= 4
          net = resnet_utils.conv2d_same(net, 64, 7, stride=2, scope='conv1')
          net = slim.max_pool2d(net, [3, 3], stride=2, scope='pool1')
        net = resnet_utils.stack_blocks_dense(net, blocks, output_stride,
                                              store_non_strided_activations)
        # Convert end_points_collection into a dictionary of end_points.
        end_points = slim.utils.convert_collection_to_dict(
            end_points_collection)

        if global_pool:
          # Global average pooling.
          net = tf.reduce_mean(net, [1, 2], name='pool5', keep_dims=True)
          end_points['global_pool'] = net
        if num_classes:
          net = slim.conv2d(net, num_classes, [1, 1], activation_fn=None,
                            normalizer_fn=None, scope='logits')
          end_points[sc.name + '/logits'] = net
          if spatial_squeeze:
            net = tf.squeeze(net, [1, 2], name='SpatialSqueeze')
            end_points[sc.name + '/spatial_squeeze'] = net
          end_points['predictions'] = slim.softmax(net, scope='predictions')
        return net, end_points
resnet_v1.default_image_size = 224

The code is rather long, and I removed some of the comments. Let's go through it piece by piece.

                if include_root_block:
                    if output_stride is not None:
                        if output_stride % 4 != 0:
                            raise ValueError('The output_stride needs to be a multiple of 4.')
                        output_stride /= 4
                    # 7x7 convolution
                    net = resnet_utils.conv2d_same(net, 64, 7, stride=2, scope='conv1')
                    # max pooling
                    net = slim.max_pool2d(net, [3, 3], stride=2, scope='pool1')

                    net = slim.utils.collect_named_outputs(end_points_collection, 'pool2', net)

This part is the initial 7x7 convolution and pooling of ResNet. The function to pay attention to here is conv2d_same.

def conv2d_same(inputs, num_outputs, kernel_size, stride, rate=1, scope=None):
    if stride == 1:
        return slim.conv2d(inputs, num_outputs, kernel_size, stride=1, rate=rate,
                           padding='SAME', scope=scope)
    else:
        kernel_size_effective = kernel_size + (kernel_size - 1) * (rate - 1)
        pad_total = kernel_size_effective - 1
        pad_beg = pad_total // 2
        pad_end = pad_total - pad_beg
        inputs = tf.pad(inputs,
                        [[0, 0], [pad_beg, pad_end], [pad_beg, pad_end], [0, 0]])
        return slim.conv2d(inputs, num_outputs, kernel_size, stride=stride,
                           rate=rate, padding='VALID', scope=scope)

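This function makes the output size of a strided convolution well defined: with stride=2, the height and width of the result are half those of the input (rounded up for odd sizes). Instead of relying on 'SAME' padding, whose amount depends on the input size, it pads explicitly with tf.pad based on the kernel size and then applies a 'VALID' convolution. Note that the input tensor has 4 dimensions, including batch_size, and only the height and width dimensions need padding.
Next comes the main event. The line below is the heart of the whole network.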
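The padding arithmetic of conv2d_same can be checked without TensorFlow. The sketch below reproduces the pad_beg / pad_end computation from the listing and compares it with the padding that the usual 'SAME' rule would choose. All three function names are hypothetical helpers written for this post, and same_padding is my reading of TensorFlow's documented 'SAME' rule for an undilated convolution, not a call into the library.

```python
import math


def fixed_padding(kernel_size, rate=1):
    """Padding used by conv2d_same: depends only on kernel size and rate."""
    kernel_size_effective = kernel_size + (kernel_size - 1) * (rate - 1)
    pad_total = kernel_size_effective - 1
    pad_beg = pad_total // 2
    return pad_beg, pad_total - pad_beg


def same_padding(size, kernel_size, stride):
    """Padding under the 'SAME' rule: depends on the input size as well."""
    out = math.ceil(size / stride)
    pad_total = max((out - 1) * stride + kernel_size - size, 0)
    pad_beg = pad_total // 2
    return pad_beg, pad_total - pad_beg


def conv2d_same_output_size(size, kernel_size, stride, rate=1):
    """Output side length after fixed padding and a 'VALID' convolution."""
    kernel_size_effective = kernel_size + (kernel_size - 1) * (rate - 1)
    pad_beg, pad_end = fixed_padding(kernel_size, rate)
    return (size + pad_beg + pad_end - kernel_size_effective) // stride + 1


for size in (224, 225):
    print(size,
          "fixed:", fixed_padding(7),
          "SAME:", same_padding(size, 7, 2),
          "output:", conv2d_same_output_size(size, 7, 2))
```

For an even input such as 224 the two rules pad differently, while conv2d_same always pads 3 pixels on each side for a 7x7 kernel, so the result does not depend on whether the input size is even or odd.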

net = resnet_utils.stack_blocks_dense(net, blocks, output_stride)

It calls stack_blocks_dense, so let's look at that next.

def stack_blocks_dense(net, blocks, output_stride=None,
                       store_non_strided_activations=False,
                       outputs_collections=None):
  current_stride = 1

  # The atrous convolution rate parameter.
  rate = 1

  for block in blocks:
    with tf.variable_scope(block.scope, 'block', [net]) as sc:
      block_stride = 1
      for i, unit in enumerate(block.args):
        if store_non_strided_activations and i == len(block.args) - 1:
          # Move stride from the block's last unit to the end of the block.
          block_stride = unit.get('stride', 1)
          unit = dict(unit, stride=1)

        with tf.variable_scope('unit_%d' % (i + 1), values=[net]):
 
          if output_stride is not None and current_stride == output_stride:
            net = block.unit_fn(net, rate=rate, **dict(unit, stride=1))
            rate *= unit.get('stride', 1)

          else:
            net = block.unit_fn(net, rate=1, **unit)
            current_stride *= unit.get('stride', 1)
            if output_stride is not None and current_stride > output_stride:
              raise ValueError('The target output_stride cannot be reached.')

      # Collect activations at the block's end before performing subsampling.
      net = slim.utils.collect_named_outputs(outputs_collections, sc.name, net)

      # Subsampling of the block's output activations.
      if output_stride is not None and current_stride == output_stride:
        rate *= block_stride
      else:
        net = subsample(net, block_stride)
        current_stride *= block_stride
        if output_stride is not None and current_stride > output_stride:
          raise ValueError('The target output_stride cannot be reached.')

  if output_stride is not None and current_stride != output_stride:
    raise ValueError('The target output_stride cannot be reached.')

  return net

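There are two variables here, output_stride and current_stride, which control the size of the output feature map. For example, current_stride=2 means the height and width of the output are 1/2 of those of the input. ResNet is usually used without setting output_stride, although you can set it if you have special needs. (Once output_stride has been reached, the remaining strided convolutions run with stride 1 and atrous (dilated) convolution is used instead, with the atrous rate multiplied by each stride that was skipped.) We will ignore output_stride here.
The function loops over blocks, which are the 4 blocks we defined at the start. For every unit in each block, it performs the following operation.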
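The interplay between current_stride, output_stride and rate is easier to follow on plain numbers. The sketch below keeps only the bookkeeping of the inner loop and drops the convolutions and the store_non_strided_activations branch. plan_strides is a hypothetical helper, and the unit strides are the ones produced by the four ResNet-50 blocks defined earlier.

```python
def plan_strides(unit_strides, output_stride=None):
    """Return ([(stride used, atrous rate)] per unit, final current_stride)."""
    current_stride = 1
    rate = 1
    plan = []
    for stride in unit_strides:
        if output_stride is not None and current_stride == output_stride:
            # Target reached: run the unit with stride 1 and the current rate,
            # then fold the skipped stride into the rate for later units.
            plan.append((1, rate))
            rate *= stride
        else:
            plan.append((stride, 1))
            current_stride *= stride
            if output_stride is not None and current_stride > output_stride:
                raise ValueError('The target output_stride cannot be reached.')
    if output_stride is not None and current_stride != output_stride:
        raise ValueError('The target output_stride cannot be reached.')
    return plan, current_stride


# Unit strides of block1..block4 in ResNet-50 (3, 4, 6 and 3 units).
RESNET50_UNIT_STRIDES = [1, 1, 2] + [1, 1, 1, 2] + [1] * 5 + [2] + [1, 1, 1]

default_plan, default_stride = plan_strides(RESNET50_UNIT_STRIDES)
dense_plan, dense_stride = plan_strides(RESNET50_UNIT_STRIDES, output_stride=4)

print("no output_stride :", default_stride, default_plan[-1])
print("output_stride = 4:", dense_stride, dense_plan[-1])
```

Keep in mind that resnet_v1 divides the requested output_stride by 4 before calling stack_blocks_dense, because conv1 and pool1 have already reduced the resolution by that factor. The value 4 passed here therefore corresponds to a user-facing output_stride of 16.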

net = block.unit_fn(net, rate=1, **unit)
current_stride *= unit.get('stride', 1)

Here unit_fn is a function. Going back to where we defined the block, unit_fn points to the bottleneck function.

def bottleneck(inputs,
               depth,
               depth_bottleneck,
               stride,
               rate=1,
               outputs_collections=None,
               scope=None,
               use_bounded_activations=False):
  with tf.variable_scope(scope, 'bottleneck_v1', [inputs]) as sc:
    depth_in = slim.utils.last_dimension(inputs.get_shape(), min_rank=4)
    if depth == depth_in:
      shortcut = resnet_utils.subsample(inputs, stride, 'shortcut')
    else:
      shortcut = slim.conv2d(
          inputs,
          depth, [1, 1],
          stride=stride,
          activation_fn=tf.nn.relu6 if use_bounded_activations else None,
          scope='shortcut')

    residual = slim.conv2d(inputs, depth_bottleneck, [1, 1], stride=1,
                           scope='conv1')
    residual = resnet_utils.conv2d_same(residual, depth_bottleneck, 3, stride,
                                        rate=rate, scope='conv2')
    residual = slim.conv2d(residual, depth, [1, 1], stride=1,
                           activation_fn=None, scope='conv3')

    if use_bounded_activations:
      # Use clip_by_value to simulate bandpass activation.
      residual = tf.clip_by_value(residual, -6.0, 6.0)
      output = tf.nn.relu6(shortcut + residual)
    else:
      output = tf.nn.relu(shortcut + residual)

    return slim.utils.collect_named_outputs(outputs_collections,
                                            sc.name,
                                            output)

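And now the final piece. The arguments that bottleneck needs are exactly the block parameters mentioned earlier. Let's look at the function in detail.
First it reads the last dimension of inputs, which is the number of channels.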

depth_in = slim.utils.last_dimension(inputs.get_shape(), min_rank=4)
if depth == depth_in:
  shortcut = resnet_utils.subsample(inputs, stride, 'shortcut')
else:
  shortcut = slim.conv2d(
      inputs,
      depth, [1, 1],
      stride=stride,
      activation_fn=tf.nn.relu6 if use_bounded_activations else None,
      scope='shortcut')

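This part makes sure that shortcut and residual have the same height, width, and number of channels. The subsampling stride and the convolution stride keep the height and width equal, and the 1x1 convolution keeps the channel count equal. When the channels already match, subsample returns the input unchanged if the stride is 1; otherwise it applies a max pool with a 1x1 window and the given stride, which simply keeps every stride-th pixel.
When the channels do not match, the 1x1 convolution on the shortcut changes the number of channels to depth.
The final output is output = tf.nn.relu(shortcut + residual) (tf.nn.relu6 is used only when use_bounded_activations is set), that is, F(x) + x followed by a ReLU.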
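To check that the shortcut and the residual really line up, the sketch below tracks only tensor shapes through one bottleneck unit. bottleneck_shapes is a hypothetical helper, shapes are (height, width, channels) without the batch dimension, and every strided operation is assumed to produce ceil(n / stride) pixels per side.

```python
import math


def bottleneck_shapes(in_shape, depth, depth_bottleneck, stride):
    """Return the shortcut type and the shapes of the two branches."""
    h, w, depth_in = in_shape
    out_h, out_w = math.ceil(h / stride), math.ceil(w / stride)

    # Shortcut branch: subsample if channels match, else a strided 1x1 conv.
    shortcut_kind = "subsample" if depth == depth_in else "1x1 conv"
    shortcut = (out_h, out_w, depth)

    # Residual branch: 1x1 reduce -> 3x3 (strided) -> 1x1 expand.
    conv1 = (h, w, depth_bottleneck)
    conv2 = (out_h, out_w, depth_bottleneck)
    conv3 = (out_h, out_w, depth)

    assert shortcut == conv3, "shortcut and residual must be addable"
    return shortcut_kind, shortcut, [conv1, conv2, conv3]


# First unit of block1: channels grow from 64 to 256, so a 1x1 conv is needed.
first = bottleneck_shapes((56, 56, 64), depth=256, depth_bottleneck=64, stride=1)
# Last unit of block1: channels already match, but stride 2 halves the size.
last = bottleneck_shapes((56, 56, 256), depth=256, depth_bottleneck=64, stride=2)

print(first)
print(last)
```

The 1x1 convolutions are what make this a bottleneck: the 3x3 convolution only ever sees depth_bottleneck channels, a quarter of the unit's output depth.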
