Very Deep Convolutional Networks for Large-Scale Image Recognition: Translation [Part 1]

Very Deep Convolutional Networks for Large-Scale Image Recognition: Translation [Part 2]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Paper: http://arxiv.org/pdf/1409.1556v6.pdf

ABSTRACT

In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small ($3 \times 3$) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

1 INTRODUCTION

Convolutional networks (ConvNets) have recently enjoyed great success in large-scale image and video recognition (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014; Simonyan & Zisserman, 2014), which has become possible due to large public image repositories, such as ImageNet (Deng et al., 2009), and high-performance computing systems, such as GPUs or large-scale distributed clusters (Dean et al., 2012). In particular, an important role in the advance of deep visual recognition architectures has been played by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2014), which has served as a testbed for a few generations of large-scale image classification systems, from high-dimensional shallow feature encodings (Perronnin et al., 2010) (the winner of ILSVRC-2011) to deep ConvNets (Krizhevsky et al., 2012) (the winner of ILSVRC-2012).

With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. (2012) in a bid to achieve better accuracy. For instance, the best-performing submissions to ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised a smaller receptive window size and a smaller stride in the first convolutional layer. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014). In this paper, we address another important aspect of ConvNet architecture design: its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small ($3 \times 3$) convolution filters in all layers.

As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve state-of-the-art accuracy on the ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as part of relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning). We have released our two best-performing models¹ to facilitate further research.

The rest of the paper is organised as follows. In Sect. 2, we describe our ConvNet configurations. The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4. Sect. 5 concludes the paper. For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions.

∗ Current affiliation: Google DeepMind
+ Current affiliation: University of Oxford and Google DeepMind
¹ http://www.robots.ox.ac.uk/~vgg/research/very_deep/

2 CONVNET CONFIGURATIONS

To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011) and Krizhevsky et al. (2012). In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2). Our design choices are then discussed and compared to the prior art in Sect. 2.3.

2.1 ARCHITECTURE

During training, the input to our ConvNets is a fixed-size $224 \times 224$ RGB image. The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: $3 \times 3$ (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations we also utilise $1 \times 1$ convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for $3 \times 3$ conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a $2 \times 2$ pixel window, with stride 2.
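
The effect of these choices on spatial resolution can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the paper; the helper name `conv_out` is hypothetical, and it assumes the usual floor-based output-size formula for convolution and pooling windows.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


size = 224

# A 3x3 conv with stride 1 and 1-pixel padding keeps the resolution.
same = conv_out(size, kernel=3, stride=1, padding=1)

# A 1x1 conv with stride 1 needs no padding to keep the resolution.
same_1x1 = conv_out(size, kernel=1)

# Each of the five 2x2 max-pooling layers (stride 2) halves the resolution.
sizes = [size]
for _ in range(5):
    sizes.append(conv_out(sizes[-1], kernel=2, stride=2))
```

Under these assumptions a 224-pixel side shrinks to 7 pixels after the fifth pooling layer, while the conv. layers in between leave the resolution untouched.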

A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.

All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity. We note that none of our networks (except for one) contain Local Response Normalisation (LRN) (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).

2.2 CONFIGURATIONS

The ConvNet configurations evaluated in this paper are outlined in Table 1, one per column. In the following we will refer to the nets by their names (A–E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers). The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.

In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).
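
The parameter counts can be reproduced approximately from the layer widths. The sketch below is illustrative only: the per-configuration layer lists are an assumption taken from the published Table 1 (which is not reproduced in this text), the encoding of layers and the function name `count_parameters` are hypothetical, and biases are included in the count.

```python
# Layer lists assumed from the paper's Table 1 (not reproduced in this text).
# An int is a 3x3 conv with that many output channels, (1, n) is a 1x1 conv
# with n output channels, and "M" is a 2x2 max-pooling layer.
CONFIGS = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "B": [64, 64, "M", 128, 128, "M", 256, 256, "M",
          512, 512, "M", 512, 512, "M"],
    "C": [64, 64, "M", 128, 128, "M", 256, 256, (1, 256), "M",
          512, 512, (1, 512), "M", 512, 512, (1, 512), "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}


def count_parameters(layers, in_channels=3, fc_input_side=7, num_classes=1000):
    """Weights plus biases of the conv stack and the three FC layers."""
    total = 0
    channels = in_channels
    for layer in layers:
        if layer == "M":
            continue  # pooling layers have no parameters
        kernel, out_channels = layer if isinstance(layer, tuple) else (3, layer)
        total += kernel * kernel * channels * out_channels + out_channels
        channels = out_channels
    fan_in = fc_input_side * fc_input_side * channels
    for width in (4096, 4096, num_classes):
        total += fan_in * width + width
        fan_in = width
    return total


weight_layers = {name: sum(1 for layer in cfg if layer != "M") + 3
                 for name, cfg in CONFIGS.items()}
millions = {name: round(count_parameters(cfg) / 1e6)
            for name, cfg in CONFIGS.items()}
```

With these assumed layer lists, every configuration stays at or below the 144M weights quoted for the shallower net of Sermanet et al. (2014), and most of the parameters sit in the first fully-connected layer rather than in the conv. stack.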

2.3 DISCUSSION

Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014). Rather than using relatively large receptive fields in the first conv. layers (e.g. $11 \times 11$ with stride 4 in (Krizhevsky et al., 2012), or $7 \times 7$ with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small $3 \times 3$ receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two $3 \times 3$ conv. layers (without spatial pooling in between) has an effective receptive field of $5 \times 5$; three such layers have a $7 \times 7$ effective receptive field. So what have we gained by using, for instance, a stack of three $3 \times 3$ conv. layers instead of a single $7 \times 7$ layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer $3 \times 3$ convolution stack have $C$ channels, the stack is parametrised by $3(3^2 C^2) = 27C^2$ weights; at the same time, a single $7 \times 7$ conv. layer would require $7^2 C^2 = 49C^2$ parameters, i.e. 81% more. This can be seen as imposing a regularisation on the $7 \times 7$ conv. filters, forcing them to have a decomposition through the $3 \times 3$ filters (with non-linearity injected in between).
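
The receptive-field and parameter arithmetic in this comparison is easy to verify. The following is a small sketch with hypothetical helper names; it ignores biases and assumes stride-1 layers with the same number of input and output channels, as in the text.

```python
def stacked_receptive_field(kernel, num_layers):
    """Effective receptive field of stride-1 conv layers stacked without pooling."""
    return 1 + num_layers * (kernel - 1)


def stack_weights(kernel, num_layers, channels):
    """Weights (biases ignored) in a stack whose layers map `channels` to `channels`."""
    return num_layers * kernel * kernel * channels * channels


C = 64  # example channel count; the ratio below does not depend on it

three_3x3 = stack_weights(kernel=3, num_layers=3, channels=C)  # 27 * C**2
one_7x7 = stack_weights(kernel=7, num_layers=1, channels=C)    # 49 * C**2
extra = one_7x7 / three_3x3 - 1                                # fraction of extra weights
```

Here `stacked_receptive_field(3, 2)` and `stacked_receptive_field(3, 3)` give the 5 and 7 pixel fields mentioned above, and `extra` is the "81% more" figure (49/27 - 1).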

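[Table 1: ConvNet configurations A–E, one per column. Table 2: number of parameters for each configuration.]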

The incorporation of $1 \times 1$ conv. layers (configuration C, Table 1) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers. Even though in our case the $1 \times 1$ convolution is essentially a linear projection onto the space of the same dimensionality (the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function. It should be noted that $1 \times 1$ conv. layers have recently been utilised in the "Network in Network" architecture of Lin et al. (2014).
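
To make the "linear projection plus rectification" reading concrete, here is a minimal pure-Python sketch. The function name, the nested-list data layout and the toy numbers are all assumptions made for illustration; biases are omitted.

```python
def conv1x1_relu(feature_map, weights):
    """Apply a 1x1 convolution followed by ReLU.

    feature_map[y][x] is the list of input channels at one pixel;
    weights is a C_out x C_in matrix shared by every pixel.
    """
    return [
        [
            [max(0.0, sum(w * v for w, v in zip(row, pixel))) for row in weights]
            for pixel in line
        ]
        for line in feature_map
    ]


fmap = [[[1.0, -2.0], [0.5, 0.5]]]   # a 1 x 2 spatial grid with 2 channels
W = [[1.0, 1.0], [1.0, -1.0]]        # maps 2 channels to 2 channels
out = conv1x1_relu(fmap, W)
```

Each output pixel depends only on the same input pixel, so the receptive field is unchanged; only the channels are mixed and then rectified.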

Small-size convolution filters have been previously used by Ciresan et al. (2011), but their nets are significantly less deep than ours, and they did not evaluate on the large-scale ILSVRC dataset. Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition, and showed that the increased depth led to better performance. GoogLeNet (Szegedy et al., 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets (22 weight layers) and small convolution filters (apart from $3 \times 3$, they also use $1 \times 1$ and $5 \times 5$ convolutions). Their network topology is, however, more complex than ours, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation. As will be shown in Sect. 4.5, our model outperforms that of Szegedy et al. (2014) in terms of the single-network classification accuracy.

3 CLASSIFICATION FRAMEWORK

In the previous section we presented the details of our network configurations. In this section, we describe the details of classification ConvNet training and evaluation.

3.1 TRAINING

The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later). Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al., 1989)) with momentum. The batch size was set to 256, momentum to 0.9. The training was regularised by weight decay (the $L_2$ penalty multiplier set to $5 \cdot 10^{-4}$) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5). The learning rate was initially set to $10^{-2}$, and then decreased by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs). We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al., 2012), the nets required fewer epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.
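
The hyper-parameters above can be put together in a short sketch. This is not the authors' training code: the update rule is the common momentum formulation with the $L_2$ penalty folded into the gradient, the function names are hypothetical, and the training-set size of 1,281,167 images is an assumption (the ILSVRC-2012 training set) that is not stated in this section.

```python
def sgd_momentum_step(w, grad, velocity, lr=1e-2, momentum=0.9, weight_decay=5e-4):
    """One update of a scalar weight; the L2 penalty adds weight_decay * w to the gradient."""
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity


def learning_rate(num_drops, base_lr=1e-2):
    """Learning rate after it has been divided by 10 `num_drops` times."""
    return base_lr / (10 ** num_drops)


# With a zero data gradient, only weight decay moves the weight.
w, v = sgd_momentum_step(w=1.0, grad=0.0, velocity=0.0)

# Assumed ILSVRC-2012 training-set size; relates iterations to epochs.
TRAIN_IMAGES = 1_281_167
epochs = 370_000 * 256 / TRAIN_IMAGES
```

Under that assumed dataset size, 370K iterations at batch size 256 come to just under 74 passes over the training set, which agrees with the epoch count quoted above, and three drops take the learning rate from $10^{-2}$ to $10^{-5}$.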

The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets. To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with zero mean and $10^{-2}$ variance. The biases were initialised with zero. It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).
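
A minimal sketch of the random initialisation described above, using only the Python standard library. The function name `init_layer` and the layer sizes are hypothetical; the one point worth noting is that a variance of $10^{-2}$ corresponds to a standard deviation of 0.1, which is the value a Gaussian sampler expects.

```python
import random


def init_layer(fan_in, fan_out, rng, variance=1e-2):
    """Zero-mean Gaussian weights with the given variance, and zero biases."""
    std = variance ** 0.5  # variance 1e-2 corresponds to standard deviation 0.1
    weights = [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]
    biases = [0.0] * fan_out
    return weights, biases


rng = random.Random(0)  # fixed seed so the sketch is repeatable
weights, biases = init_layer(fan_in=200, fan_out=100, rng=rng)

# Empirical check of the sampled statistics.
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
var = sum((w - mean) ** 2 for w in flat) / len(flat)
```

The pre-initialisation from net A is not sketched here, since the text does not spell out how layers of the shallower net are matched to layers of the deeper ones.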

The fixed-size $224 \times 224$ ConvNet input images were obtained by randomly cropping rescaled training images (one crop per image per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al., 2012). Training image rescaling is explained below.
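
The cropping and flipping steps can be sketched as follows. This is an illustrative version on nested Python lists rather than real image tensors; the function name is hypothetical, the flip probability of 0.5 is an assumption, and the RGB colour shift is left out.

```python
import random


def random_crop_flip(image, crop=224, rng=random):
    """Return a crop x crop patch of `image` (a list of rows), mirrored half the time."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop)    # inclusive bounds
    left = rng.randint(0, width - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal flip
    return patch


# Toy "image" whose pixels record their own (y, x) coordinates.
image = [[(y, x) for x in range(320)] for y in range(256)]
patch = random_crop_flip(image, rng=random.Random(0))
```

Because a fresh crop position and flip are drawn every time an image is visited, the network rarely sees exactly the same input twice.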

Training image size. Let $S$ be the smallest side of an isotropically-rescaled training image, from which the ConvNet input is cropped (we also refer to $S$ as the training scale). While the crop size is fixed to $224 \times 224$, in principle $S$ can take on any value not less than 224: for $S = 224$ the crop will capture whole-image statistics, completely spanning the smallest side of a training image; for $S \gg 224$ the crop will correspond to a small part of the image, containing a small object or an object part.

We consider two approaches for setting the training scale $S$. The first is to fix $S$, which corresponds to single-scale training (note that image content within the sampled crops can still represent multi-scale image statistics). In our experiments, we evaluated models trained at two fixed scales: $S = 256$ (which has been widely used in the prior art (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014)) and $S = 384$. Given a ConvNet configuration, we first trained the network using $S = 256$. To speed up training of the $S = 384$ network, it was initialised with the weights pre-trained with $S = 256$, and we used a smaller initial learning rate of $10^{-3}$.

The second approach to setting $S$ is multi-scale training, where each training image is individually rescaled by randomly sampling $S$ from a certain range $[S_{min}, S_{max}]$ (we used $S_{min} = 256$ and $S_{max} = 512$). Since objects in images can be of different size, it is beneficial to take this into account during training. This can also be seen as training set augmentation by scale jittering, where a single model is trained to recognise objects over a wide range of scales. For speed reasons, we trained multi-scale models by fine-tuning all layers of a single-scale model with the same configuration, pre-trained with fixed $S = 384$.
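
Scale jittering amounts to drawing a new smallest-side length for every image and rescaling isotropically. The sketch below only computes the target size; the function name is hypothetical, and sampling $S$ as a uniformly distributed integer is an assumption, since the text does not specify the distribution.

```python
import random


def jittered_size(height, width, rng, s_min=256, s_max=512):
    """Sample a training scale S and return it with the isotropically rescaled size."""
    scale = rng.randint(s_min, s_max)   # training scale S for this image
    short = min(height, width)
    new_height = round(height * scale / short)
    new_width = round(width * scale / short)
    return scale, (new_height, new_width)


rng = random.Random(0)
samples = [jittered_size(375, 500, rng) for _ in range(100)]
```

After rescaling, the smallest side equals the sampled $S$ and the aspect ratio is preserved up to rounding; a $224 \times 224$ crop is then taken from the result.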

3.2 TESTING

At test time, given a trained ConvNet and an input image, it is classified in the following way. First, it is isotropically rescaled to a pre-defined smallest image side, denoted as $Q$ (we also refer to it as the test scale). We note that $Q$ is not necessarily equal to the training scale $S$ (as we will show in Sect. 4, using several values of $Q$ for each $S$ leads to improved performance). Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014). Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a $7 \times 7$ conv. layer, the last two FC layers to $1 \times 1$ conv. layers). The resulting fully-convolutional net is then applied to the whole (uncropped) image. The result is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled). We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image.
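
The size of the class score map and the final averaging can be sketched as below. The helper names are hypothetical, and the size formula assumes a square input whose side is divisible by 32, so that the rounding behaviour of the pooling layers does not matter.

```python
def score_map_side(image_side, num_pools=5, fc_kernel=7):
    """Side of the class score map for a square input divisible by 2**num_pools."""
    feature_side = image_side // (2 ** num_pools)  # after five stride-2 poolings
    # The first FC layer becomes a 7x7 conv without padding; the 1x1 convs keep the size.
    return feature_side - fc_kernel + 1


def spatial_average(score_map):
    """score_map[c] is a 2-D grid of scores for class c; returns one score per class."""
    return [sum(map(sum, grid)) / (len(grid) * len(grid[0])) for grid in score_map]


def average_with_flip(posteriors, flipped_posteriors):
    """Average the class posteriors of an image and its horizontal mirror."""
    return [(p + q) / 2 for p, q in zip(posteriors, flipped_posteriors)]


sides = {q: score_map_side(q) for q in (224, 256, 384, 512)}
scores = spatial_average([[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [2.0, 2.0]]])
final = average_with_flip([0.5, 0.5], [0.75, 0.25])
```

For a $224 \times 224$ input the score map collapses to a single position, which is the ordinary fully-connected case; larger test scales yield a grid of predictions that is then averaged.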

Since the fully-convolutional network is applied over the whole image, there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop. At the same time, using a large set of crops, as done by Szegedy et al. (2014), can lead to improved accuracy, as it results in a finer sampling of the input image compared to the fully-convolutional net. Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured. While we believe that in practice the increased computation time of multiple crops does not justify the potential gains in accuracy, for reference we also evaluate our networks using 50 crops per scale ($5 \times 5$ regular grid with 2 flips), for a total of 150 crops over 3 scales, which is comparable to 144 crops over 4 scales used by Szegedy et al. (2014).
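
The crop count follows directly from the grid layout. The sketch below enumerates crop positions under stated assumptions: evenly spaced top-left corners that span the image are one plausible reading of "regular grid", the three square image sizes are example values only, and the function name is hypothetical.

```python
def grid_crops(image_height, image_width, crop=224, grid=5):
    """Top-left corners of a grid x grid set of crops, each listed unflipped and flipped."""
    def positions(extent):
        span = extent - crop
        return [round(i * span / (grid - 1)) for i in range(grid)]

    return [
        (top, left, flipped)
        for top in positions(image_height)
        for left in positions(image_width)
        for flipped in (False, True)
    ]


crops_256 = grid_crops(256, 256)
crops_per_scale = len(crops_256)

# Example scales only; the text states 3 scales but not their values here.
total = sum(len(grid_crops(q, q)) for q in (256, 384, 512))
```

Each crop requires its own forward pass, which is the extra computation the paragraph above weighs against the accuracy gain.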

3.3 IMPLEMENTATION DETAILS

Our implementation is derived from the publicly available C++ Caffe toolbox (Jia, 2013) (branched out in December 2013), but contains a number of significant modifications, allowing us to perform training and evaluation on multiple GPUs installed in a single system, as well as train and evaluate on full-size (uncropped) images at multiple scales (as described above). Multi-GPU training exploits data parallelism, and is carried out by splitting each batch of training images into several GPU batches, processed in parallel on each GPU. After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch. Gradient computation is synchronous across the GPUs, so the result is exactly the same as when training on a single GPU.
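
Why averaging the per-GPU gradients reproduces the full-batch gradient can be seen on a toy problem. The sketch below is not the Caffe implementation: it uses a hypothetical one-parameter model, simulates the four GPUs with list slices, and assumes the batch is split into equally sized parts.

```python
def mean_gradient(batch, w):
    """Gradient of the mean squared error of y = w * x over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


batch = [(float(x), 3.0 * x + 1.0) for x in range(1, 9)]  # 8 toy samples
w = 0.5

# Reference: gradient computed on the whole batch at once.
full = mean_gradient(batch, w)

# Data parallelism: each simulated GPU handles an equal share of the batch.
num_gpus = 4
shard = len(batch) // num_gpus
per_gpu = [mean_gradient(batch[i * shard:(i + 1) * shard], w) for i in range(num_gpus)]
averaged = sum(per_gpu) / num_gpus
```

The two values agree up to floating-point rounding because the mean over the full batch is the mean of the equally sized shard means; with unequal shards a weighted average would be needed.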

While more sophisticated methods of speeding up ConvNet training have been recently proposed (Krizhevsky, 2014), which employ model and data parallelism for different layers of the net, we have found that our conceptually much simpler scheme already provides a speedup of 3.75 times on an off-the-shelf 4-GPU system, as compared to using a single GPU. On a system equipped with four NVIDIA Titan Black GPUs, training a single net took 2–3 weeks depending on the architecture.

Article quoted from http://tongtianta.site/paper/122
Edited by Lornatang
Proofread by Lornatang