Very Deep Convolutional Networks for Large-Scale Image Recognition: translation (part 2)
Very Deep Convolutional Networks for Large-Scale Image Recognition
Paper: http://arxiv.org/pdf/1409.1556v6.pdf
ABSTRACT
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
1 INTRODUCTION
Convolutional networks (ConvNets) have recently enjoyed a great success in large-scale image and video recognition (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014; Simonyan & Zisserman, 2014) which has become possible due to the large public image repositories, such as ImageNet (Deng et al., 2009), and high-performance computing systems, such as GPUs or large-scale distributed clusters (Dean et al., 2012). In particular, an important role in the advance of deep visual recognition architectures has been played by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2014), which has served as a testbed for a few generations of large-scale image classification systems, from high-dimensional shallow feature encodings (Perronnin et al., 2010) (the winner of ILSVRC-2011) to deep ConvNets (Krizhevsky et al., 2012) (the winner of ILSVRC-2012).
With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. (2012) in a bid to achieve better accuracy. For instance, the best-performing submissions to ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised a smaller receptive window size and a smaller stride in the first convolutional layer. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014). In this paper, we address another important aspect of ConvNet architecture design: its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3×3) convolution filters in all layers.
As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve state-of-the-art accuracy on the ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as part of relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning). We have released our two best-performing models (http://www.robots.ox.ac.uk/~vgg/research/very_deep/) to facilitate further research.
The rest of the paper is organised as follows. In Sect. 2, we describe our ConvNet configurations. The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4. Sect. 5 concludes the paper. For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions.
2 CONVNET CONFIGURATIONS
To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012). In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2). Our design choices are then discussed and compared to the prior art in Sect. 2.3.
2.1 ARCHITECTURE
During training, the input to our ConvNets is a fixed-size [...]
A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.
All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity. We note that none of our networks (except for one) contains Local Response Normalisation (LRN) (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).
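The fully connected head described above (4096, 4096 and 1000 channels, ReLU on the hidden layers, soft-max at the end) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the input feature size `in_dim`, the random weights and the helper names `softmax` and `fc_head` are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable soft-max over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fc_head(features, weights, biases):
    """Apply the FC layers in order; ReLU after every layer except the last."""
    h = features
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)  # ReLU non-linearity on hidden layers
    return softmax(h)

rng = np.random.default_rng(0)
in_dim = 512  # assumption: size of the flattened conv features, chosen for the demo
sizes = [in_dim, 4096, 4096, 1000]  # two 4096-channel layers, then 1000 classes
weights = [rng.normal(0.0, 0.01, (a, b)).astype(np.float32)
           for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b, dtype=np.float32) for b in sizes[1:]]

probs = fc_head(rng.normal(size=(2, in_dim)).astype(np.float32), weights, biases)
print(probs.shape)  # (2, 1000)
```

Each row of `probs` is a distribution over the 1000 classes, so it sums to one up to floating-point rounding.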
2.2 CONFIGURATIONS
The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column. In the following we will refer to the nets by their names (A–E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers). The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.
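The width rule and the layer counts stated above are easy to check with a few lines of Python. The function names `conv_widths` and `weight_layers` are illustrative assumptions; only the numbers 64, 512, 8, 16 and 3 come from the text.

```python
def conv_widths(first=64, cap=512):
    """Channel widths: start at `first`, double after each max-pooling layer, stop at `cap`."""
    widths = [first]
    while widths[-1] < cap:
        widths.append(min(widths[-1] * 2, cap))
    return widths

def weight_layers(num_conv, num_fc=3):
    """Weight layers are the conv. layers plus the fully connected layers."""
    return num_conv + num_fc

print(conv_widths())                         # [64, 128, 256, 512]
print(weight_layers(8), weight_layers(16))   # 11 19
```

The two counts correspond to networks A and E respectively.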
In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).
2.3 DISCUSSION
Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014). Rather than using relatively large receptive fields in the first conv. layers (e.g. [...]
3 CLASSIFICATION FRAMEWORK
In the previous section we presented the details of our network configurations. In this section, we describe the details of classification ConvNet training and evaluation.
3.1 TRAINING
The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later). Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al., 1989)) with momentum. The batch size was set to 256, momentum to 0.9. The training was regularised by weight decay (the [...]
The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradients in deep nets. To circumvent this problem, we began by training configuration A (Table 1), which is shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with zero mean and [...] variance. The biases were initialised with zero. It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).
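To make the optimiser in Sect. 3.1 concrete, here is a minimal sketch of one common form of the momentum update with weight decay. The momentum value 0.9 is from the text; the learning rate, the weight-decay coefficient, the toy objective and the function name `sgd_momentum_step` are assumptions for illustration, since those values are not available in this copy of the text.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0):
    """One update: the velocity accumulates past gradients, weight decay adds an L2 pull."""
    v = momentum * v - lr * (grad + weight_decay * w)
    return w + v, v

# Toy objective (assumption): f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0, 3.0])
v = np.zeros_like(w)
start_norm = np.linalg.norm(w)
for _ in range(200):
    grad = w
    w, v = sgd_momentum_step(w, v, grad, lr=0.1)  # lr=0.1 is a demo value

print(np.linalg.norm(w) < start_norm)  # True
```

In real training `grad` would be the back-propagated gradient averaged over a mini-batch of 256 images rather than the gradient of a toy quadratic.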
To obtain the fixed-size ConvNet input images, crops were taken at random from rescaled training images (one crop per image per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al., 2012). Training image rescaling is explained below.
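The cropping and flipping steps can be sketched as below. This is a hedged illustration, not the authors' pipeline: the function name `random_crop_flip`, the 0.5 flip probability and the tiny demo sizes are assumptions, and the RGB colour shift is left out.

```python
import numpy as np

def random_crop_flip(image, crop, rng):
    """Take one random crop x crop patch from an H x W x 3 image and flip it half the time."""
    h, w = image.shape[:2]
    if h < crop or w < crop:
        raise ValueError("image is smaller than the requested crop")
    top = rng.integers(0, h - crop + 1)    # upper bound is exclusive
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                 # assumption: flips are equally likely
        patch = patch[:, ::-1]             # horizontal flip reverses the width axis
    return patch

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(12, 16, 3), dtype=np.uint8)  # stand-in for a rescaled image
patch = random_crop_flip(image, crop=8, rng=rng)
print(patch.shape)  # (8, 8, 3)
```

Calling the function once per image per iteration matches the "one crop per image per SGD iteration" rule in the text.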
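Training image size. Let S be the smallest side of an isotropically-rescaled training image, from which the ConvNet input is cropped (we also refer to S as the training scale). While the crop size is fixed to [...]
The second approach to setting S is multi-scale training, where the scale for each training image is drawn from a certain range [...]. The multi-scale models were trained with the help of pre-training [...].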
3.2 TESTING
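At test time, given a trained ConvNet and an input image, it is classified in the following way. First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q (we also refer to it as the test scale). We note that Q is not necessarily equal to the training scale S (as we will show in Sect. 4, using several values of Q for each S leads to improved performance). Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014). Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a [...]
Since the fully-convolutional network is applied over the whole image, there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop. At the same time, using a large set of crops, as done by Szegedy et al. (2014), can lead to improved accuracy, as it results in a finer sampling of the input image compared to the fully-convolutional net. Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured. While we believe that in practice the increased computation time of multiple crops does not justify the potential gains in accuracy, for reference we also evaluate our networks using 50 crops per scale (5×5 regular grid with 2 flips), for a total of 150 crops over 3 scales, which is comparable to 144 crops over 4 scales used by Szegedy et al. (2014).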
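Two pieces of arithmetic from this section can be sketched in Python: rescaling an image isotropically so that its smallest side equals Q, and counting the crops used in multi-crop evaluation. The function names, the rounding rule and the example image size are assumptions; the 5×5 grid is inferred from 50 crops per scale with 2 flips.

```python
def isotropic_size(height, width, q):
    """New (height, width) after scaling both sides by the same factor so the smaller one equals q."""
    scale = q / min(height, width)
    return round(height * scale), round(width * scale)

def total_crops(grid=5, flips=2, scales=3):
    """Crops per scale are grid * grid positions times the number of flips."""
    return grid * grid * flips * scales

print(isotropic_size(480, 640, 256))  # (256, 341)
print(total_crops())                  # 150
```

Because both sides are scaled by the same factor, the aspect ratio of the test image is preserved up to rounding.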
3.3 IMPLEMENTATION DETAILS
Our implementation is derived from the publicly available C++ Caffe toolbox (Jia, 2013) (branched out in December 2013), but contains a number of significant modifications, allowing us to perform training and evaluation on multiple GPUs installed in a single system, as well as train and evaluate on full-size (uncropped) images at multiple scales (as described above). Multi-GPU training exploits data parallelism, and is carried out by splitting each batch of training images into several GPU batches, processed in parallel on each GPU. After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch. Gradient computation is synchronous across the GPUs, so the result is exactly the same as when training on a single GPU.
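The claim that synchronous data parallelism reproduces the single-GPU result can be illustrated with NumPy. The sketch below is an assumption-laden stand-in: a linear least-squares model replaces the ConvNet, four equal array shards play the role of the GPU batches, and the name `mse_grad` is hypothetical.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y)**2) with respect to w."""
    residual = x @ w - y
    return x.T @ residual / len(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))   # one full batch of 256 examples
y = rng.normal(size=256)
w = rng.normal(size=8)

full = mse_grad(w, x, y)        # gradient computed on the whole batch at once

# Split the batch into 4 equal "GPU batches", compute each gradient, then average.
shards = zip(np.split(x, 4), np.split(y, 4))
averaged = np.mean([mse_grad(w, xs, ys) for xs, ys in shards], axis=0)

print(np.allclose(full, averaged))  # True
```

The equality relies on the shards having equal size; in floating-point arithmetic the two results agree up to rounding.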
While more sophisticated methods of speeding up ConvNet training have been recently proposed (Krizhevsky, 2014), which employ model and data parallelism for different layers of the net, we have found that our conceptually much simpler scheme already provides a speedup of 3.75 times on an off-the-shelf 4-GPU system, as compared to using a single GPU. On a system equipped with four NVIDIA Titan Black GPUs, training a single net took 2–3 weeks depending on the architecture.
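The reported speedup translates directly into a parallel efficiency figure. The helper name `parallel_efficiency` is an illustrative assumption; the numbers 3.75 and 4 are from the text.

```python
def parallel_efficiency(speedup, num_devices):
    """Fraction of ideal linear scaling that was achieved."""
    return speedup / num_devices

print(parallel_efficiency(3.75, 4))  # 0.9375
```

In other words, the simple data-parallel scheme reaches roughly 94% of ideal linear scaling on four GPUs.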
Article sourced from http://tongtianta.site/paper/122
Editor: Lornatang
Proofreader: Lornatang