Going Deeper With Convolutions: Translation [Part 2]

The network was designed with computational efficiency and practicality in mind, so that inference can run on individual devices, including those with limited computational resources and, in particular, a low memory footprint. The network is 22 layers deep when counting only layers with parameters (or 27 layers if we also count pooling). The overall number of layers (independent building blocks) used to construct the network is about 100, although this number depends on the machine learning infrastructure used. The use of average pooling before the classifier is based on [12], although our implementation differs in that we use an extra linear layer. This makes it easy to adapt and fine-tune our networks for other label sets, but it is mostly a convenience and we do not expect it to have a major effect. Moving from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%; however, the use of dropout remained essential even after removing the fully connected layers.

Given the relatively large depth of the network, the ability to propagate gradients back through all the layers effectively was a concern. One interesting insight is that the strong performance of relatively shallow networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages of the classifier, increase the gradient signal that gets propagated back, and provide additional regularization. These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss is added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.
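The discounted loss described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name and the plain-float losses are assumptions, and only the 0.3 weight comes from the text.

```python
# Discount weight for auxiliary classifier losses, as stated in the text.
AUX_LOSS_WEIGHT = 0.3


def total_training_loss(main_loss, aux_losses, aux_weight=AUX_LOSS_WEIGHT):
    """Return the main loss plus the discounted sum of auxiliary losses.

    `main_loss` and `aux_losses` are plain floats here (hypothetical
    stand-ins for the per-batch losses of the main and auxiliary heads).
    At inference time the auxiliary heads are discarded, so only the
    main classifier's output is used.
    """
    return main_loss + aux_weight * sum(aux_losses)


if __name__ == "__main__":
    # Two auxiliary heads, attached after Inception (4a) and (4d).
    print(total_training_loss(1.0, [2.0, 3.0]))
```

Because the auxiliary terms only appear in the training objective, dropping them at inference changes nothing about the main classifier's predictions.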

The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:

• An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the (4a) stage and a 4×4×528 output for the (4d) stage.

• A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.

• A fully connected layer with 1024 units and rectified linear activation.

• A dropout layer with 70% ratio of dropped outputs.

• A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).

A schematic view of the resulting network is depicted in Figure 3.

Figure 3: GoogLeNet network with all the bells and whistles
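The layer list above can be checked with a little shape arithmetic. The sketch below assumes that the Inception (4a) and (4d) outputs are 14×14 spatially with 512 and 528 channels (values taken from the architecture table of the paper, which is not part of this section) and that the pooling uses no padding. The function names are illustrative only.

```python
def pooled_size(size, kernel, stride):
    """Spatial size after pooling with no padding."""
    return (size - kernel) // stride + 1


def auxiliary_head_shapes(in_size, in_channels, num_classes=1000):
    """List (layer description, output shape) pairs for one auxiliary head."""
    side = pooled_size(in_size, kernel=5, stride=3)
    return [
        ("average pool 5x5, stride 3", (side, side, in_channels)),
        ("conv 1x1, 128 filters, ReLU", (side, side, 128)),
        ("flatten", (side * side * 128,)),
        ("fully connected 1024, ReLU", (1024,)),
        ("dropout, 70% dropped", (1024,)),
        ("linear + softmax", (num_classes,)),
    ]


if __name__ == "__main__":
    # Assumed input sizes: 14x14x512 for (4a), 14x14x528 for (4d).
    for name, channels in (("4a", 512), ("4d", 528)):
        print("Inception", name)
        for layer, shape in auxiliary_head_shapes(14, channels):
            print("  ", layer, "->", shape)
```

Under these assumptions the pooled feature map is 4×4 in both cases, and the 1×1 convolution reduces either stage to the same 4×4×128 tensor before the fully connected layer.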

6 Training Methodology

Our networks were trained using the DistBelief [4] distributed machine learning system with a modest amount of model and data parallelism. Although we used a CPU-based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using a few high-end GPUs within a week, the main limitation being memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum [17] and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.
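The learning rate schedule and the parameter averaging mentioned above can be illustrated with a small sketch. The base learning rate is not given in the text, so the value used below is purely hypothetical, and the averager is a plain running mean over parameter snapshots rather than the DistBelief implementation.

```python
def step_decay_lr(base_lr, epoch, drop=0.04, every=8):
    """Learning rate after `epoch` epochs: reduced by 4% every 8 epochs."""
    return base_lr * (1.0 - drop) ** (epoch // every)


class PolyakAverager:
    """Running mean of parameter vectors seen during training."""

    def __init__(self):
        self.count = 0
        self.mean = None

    def update(self, params):
        """Fold one parameter snapshot into the running mean and return it."""
        self.count += 1
        if self.mean is None:
            self.mean = list(params)
        else:
            self.mean = [m + (p - m) / self.count
                         for m, p in zip(self.mean, params)]
        return self.mean


if __name__ == "__main__":
    base_lr = 0.01  # hypothetical value, not stated in the text
    for epoch in (0, 7, 8, 16, 80):
        print(epoch, step_decay_lr(base_lr, epoch))

    averager = PolyakAverager()
    averager.update([1.0, 2.0])
    print(averager.update([3.0, 4.0]))
```

The averaged parameters, not the last iterate, would then be the ones used for inference.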

Our image sampling methods changed substantially over the months leading up to the competition, and already converged models were trained further with other options, sometimes in conjunction with changed hyperparameters such as dropout and learning rate, so it is hard to give definitive guidance on the single most effective way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops and others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition is to sample patches of various sizes from the image, with the patch size distributed evenly between 8% and 100% of the image area and the aspect ratio chosen randomly between 3/4 and 4/3. We also found that the photometric distortions by Andrew Howard [8] were useful for combating overfitting to some extent. In addition, we started to use random interpolation methods (bilinear, area, nearest neighbor and cubic, with equal probability) for resizing relatively late and in conjunction with other hyperparameter changes, so we could not tell definitively whether the final results were affected positively by their use.
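The patch sampling prescription above can be sketched as follows. The text only says the aspect ratio is chosen randomly between 3/4 and 4/3, so drawing it uniformly is an assumption, as are the retry limit and the fallback to the full image; the function name is illustrative.

```python
import math
import random


def sample_crop(width, height, rng,
                area_range=(0.08, 1.0), ratio_range=(3 / 4, 4 / 3),
                max_tries=10):
    """Sample a crop box (x, y, w, h) inside a width x height image.

    The crop area is a uniform fraction of the image area and the aspect
    ratio (w / h) is drawn uniformly from `ratio_range` (an assumption).
    """
    for _ in range(max_tries):
        target_area = rng.uniform(*area_range) * width * height
        ratio = rng.uniform(*ratio_range)
        w = int(round(math.sqrt(target_area * ratio)))
        h = int(round(math.sqrt(target_area / ratio)))
        if 0 < w <= width and 0 < h <= height:
            x = rng.randint(0, width - w)
            y = rng.randint(0, height - h)
            return x, y, w, h
    # Fallback when no sampled box fits: use the whole image.
    return 0, 0, width, height


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_crop(500, 375, rng))
```

The sampled patch would then be resized to the network input size, for example with one of the randomly chosen interpolation methods mentioned in the text.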

7 ILSVRC 2014 Classification Challenge Setup and Results

The ILSVRC 2014 classification challenge involves the task of classifying an image into one of 1000 leaf-node categories in the ImageNet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top 5, regardless of its rank among them. The challenge uses the top-5 error rate for ranking purposes.
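The two metrics described above differ only in how many of the highest scoring classes are checked against the ground truth. A minimal sketch, using small made-up score lists rather than real challenge data:

```python
def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k top scores.

    `scores` is a list of per-class score lists, `labels` the matching
    list of ground truth class indices.
    """
    errors = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


if __name__ == "__main__":
    # Toy example with 3 classes and 3 images (illustrative values only).
    scores = [[0.1, 0.7, 0.2],
              [0.5, 0.3, 0.2],
              [0.2, 0.3, 0.5]]
    labels = [1, 1, 0]
    print("top-1 error:", top_k_error(scores, labels, k=1))
    print("top-2 error:", top_k_error(scores, labels, k=2))
```

With k=5 and 1000 classes this corresponds to the top-5 error rate used for ranking; the top-1 accuracy is one minus the top-1 error.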

We participated in the challenge with no external data used for training. In addition to the training techniques described earlier in this paper, we adopted a set of techniques during testing to obtain higher performance, which we elaborate on below.

1. We independently trained 7 versions of the same GoogLeNet model (including one wider version) and performed ensemble prediction with them. These models were trained with the same initialization (even with the same initial weights, mainly because of an oversight) and learning rate policies, and they differ only in sampling methodologies and the random order in which they see input images.

2. During testing, we adopted a more aggressive cropping approach than that of Krizhevsky et al. [9]. Specifically, we resize the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, and take the left, center and right squares of these resized images (in the case of portrait images, we take the top, center and bottom squares). For each square, we then take the 4 corners and the center 224×224 crop, as well as the square resized to 224×224, and their mirrored versions. This results in 4×3×6×2 = 144 crops per image. A similar approach was used by Andrew Howard [8] in the previous year's entry, which we empirically verified to perform slightly worse than the proposed scheme. We note that such aggressive cropping may not be necessary in real applications, as the benefit of more crops becomes marginal after a reasonable number of crops are present (as we will show later on).

3. The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction. In our experiments we analyzed alternative approaches on the validation data, such as max pooling over crops and averaging over classifiers, but they led to inferior performance compared with simple averaging.

In the remainder of this paper, we analyze the multiple factors that contribute to the overall performance of the final submission.

Our final submission in the challenge obtains a top-5 error of 6.67% on both the validation and testing data, ranking first among the participants. This is a 56.5% relative reduction compared to the SuperVision approach in 2012, and about a 40% relative reduction compared to the previous year's best approach (Clarifai), both of which used external data for training the classifiers. The following table shows the statistics of some of the top-performing approaches.

Table 2: Classification performance

We also analyze and report the performance of multiple testing choices, by varying the number of models and the number of crops used when predicting an image, in the following table. When we use one model, we choose the one with the lowest top-1 error rate on the validation data. All numbers are reported on the validation dataset in order not to overfit to the testing data statistics.

Table 3: GoogLeNet classification performance breakdown
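The test-time procedure described in this section (multi-scale cropping followed by averaging of softmax outputs) can be sketched by enumerating crop descriptors and averaging probability vectors. The descriptor names are hypothetical labels chosen for illustration, and no actual image resizing is performed here.

```python
from itertools import product

SCALES = (256, 288, 320, 352)           # shorter side after resizing
SQUARES = ("first", "center", "last")   # left/center/right or top/center/bottom
VIEWS = ("top_left", "top_right", "bottom_left", "bottom_right",
         "center", "resized_square")    # five 224x224 crops plus the resized square
MIRRORS = (False, True)                 # original and horizontally mirrored


def enumerate_crops():
    """Return every (scale, square, view, mirrored) combination."""
    return list(product(SCALES, SQUARES, VIEWS, MIRRORS))


def square_offsets(long_side, short_side):
    """Start offsets of the three squares along the longer image side."""
    slack = long_side - short_side
    return (0, slack // 2, slack)


def average_probabilities(prob_vectors):
    """Element-wise mean of softmax vectors from all crops and all models."""
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]


if __name__ == "__main__":
    crops = enumerate_crops()
    print(len(crops), "crops per image")
    print(square_offsets(long_side=384, short_side=256))
    print(average_probabilities([[0.2, 0.8], [0.6, 0.4]]))
```

Averaging over crops and over the ensemble members is a single mean here, which matches the simple averaging that the text reports as working best.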

8 ILSVRC 2014 Detection Challenge Setup and Results

The ILSVRC detection task is to produce bounding boxes around objects in images among 200 possible classes. Detected objects count as correct if they match the class of the ground truth and their bounding boxes overlap by at least 50% (using the Jaccard index). Extraneous detections count as false positives and are penalized. Contrary to the classification task, each image may contain many objects or none, and their scale may vary from large to tiny. Results are reported using the mean average precision (mAP).

Table 4: Detection performance

Table 5: Single model performance for detection
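The 50% overlap criterion for the detection task uses the Jaccard index, that is, the area of intersection of two boxes divided by the area of their union. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (the box format is an assumption, not something the text specifies):

```python
def jaccard_index(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_correct_detection(pred_class, pred_box, gt_class, gt_box,
                         threshold=0.5):
    """A detection counts if the class matches and the overlap is >= 50%."""
    return (pred_class == gt_class
            and jaccard_index(pred_box, gt_box) >= threshold)


if __name__ == "__main__":
    print(jaccard_index((0, 0, 2, 2), (1, 1, 3, 3)))
    print(is_correct_detection("dog", (0, 0, 10, 10), "dog", (0, 0, 10, 8)))
```

Detections that fail this check count as false positives when the mean average precision is computed.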

The approach taken by GoogLeNet for detection is similar to the R-CNN of [6], but is augmented with the Inception model as the region classifier. Additionally, the region proposal step is improved by combining the Selective Search [20] approach with multi-box [5] predictions for higher object bounding box recall. In order to cut down the number of false positives, the superpixel size was increased by 2×. This halves the proposals coming from the selective search algorithm. We added back 200 region proposals coming from multi-box [5], resulting, in total, in about 60% of the proposals used by [6], while increasing the coverage from 92% to 93%. The overall effect of cutting the number of proposals with increased coverage is a 1% improvement of the mean average precision for the single model case. Finally, we use an ensemble of 6 ConvNets when classifying each region, which improves results from 40% to 43.9% accuracy. Note that, contrary to R-CNN, we did not use bounding box regression due to lack of time.

We first report the top detection results and show the progress since the first edition of the detection task. Compared to the 2013 result, the accuracy has almost doubled. The top-performing teams all use convolutional networks. We report the official scores in Table 4 along with common strategies for each team: the use of external data, ensemble models or contextual models. The external data is typically the ILSVRC12 classification data for pre-training a model that is later refined on the detection data. Some teams also mention the use of the localization data. Since a good portion of the localization task bounding boxes are not included in the detection dataset, one can pre-train a general bounding box regressor with this data in the same way that classification data is used for pre-training. The GoogLeNet entry did not use the localization data for pre-training.

In Table 5, we compare results using a single model only. The top-performing model is by Deep Insight and, surprisingly, only improves by 0.3 points with an ensemble of 3 models, while GoogLeNet obtains significantly stronger results with the ensemble.

9 Conclusions

Our results seem to yield solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision. The main advantage of this method is a significant quality gain at a modest increase in computational requirements compared to shallower and less wide networks. Also note that our detection work was competitive despite neither utilizing context nor performing bounding box regression, and this fact provides further evidence of the strength of the Inception architecture. Although it is expected that a similar quality of result can be achieved by much more expensive networks of similar depth and width, our approach yields solid evidence that moving to sparser architectures is a feasible and useful idea in general. This suggests promising future work towards creating sparser and more refined structures in automated ways on the basis of [2].

10 Acknowledgements

We would like to thank Sanjeev Arora and Aditya Bhaskara for fruitful discussions on [2]. We are also indebted to the DistBelief [4] team for their support, especially to Rajat Monga, Jon Shlens, Alex Krizhevsky, Jeff Dean, Ilya Sutskever and Andrea Frome. We would also like to thank Tom Duerig and Ning Ye for their help on photometric distortions. Our work would also not have been possible without the support of Chuck Rosenberg and Hartwig Adam.

References

[1] Know your meme: We need to go deeper. http://knowyourmeme.com/memes/we-need-to-go-deeper. Accessed: 2014-09-15.

[2] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. CoRR, abs/1310.6343, 2013.

[3] Ümit V. Çatalyürek, Cevdet Aykanat, and Bora Uçar. On two-dimensional sparse matrix partitioning: Models, methods, and a recipe. SIAM J. Sci. Comput., 32(2):656–683, February 2010.

[4] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng. Large scale distributed deep networks. In P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1232–1240. 2012.

[5] Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov. Scalable object detection using deep neural networks. In Computer Vision and Pattern Recognition, 2014. CVPR 2014. IEEE Conference on, 2014.

[6] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014. CVPR 2014. IEEE Conference on, 2014.

[7] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

[8] Andrew G. Howard. Some improvements on deep convolutional neural network based image classification. CoRR, abs/1312.5402, 2013.

[9] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.

[10] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541–551, December 1989.

[11] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[12] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. CoRR, abs/1312.4400, 2013.

[13] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, July 1992.

[14] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.

[15] Thomas Serre, Lior Wolf, Stanley M. Bileschi, Maximilian Riesenhuber, and Tomaso Poggio. Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell., 29(3):411–426, 2007.

[16] Fengguang Song and Jack Dongarra. Scaling up matrix computations on shared-memory manycore systems with 1000 cpu cores. In Proceedings of the 28th ACM International Conference on Supercomputing, ICS '14, pages 333–342, New York, NY, USA, 2014. ACM.

[17] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Proceedings, pages 1139–1147. JMLR.org, 2013.

[18] Christian Szegedy, Alexander Toshev, and Dumitru Erhan. Deep neural networks for object detection. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 2553–2561, 2013.

[19] Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. CoRR, abs/1312.4659, 2013.

[20] Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, and Arnold W. M. Smeulders. Segmentation as selective search for object recognition. In Proceedings of the 2011 International Conference on Computer Vision, ICCV '11, pages 1879–1886, Washington, DC, USA, 2011. IEEE Computer Society.

[21] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In David J. Fleet, Tomás Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, volume 8689 of Lecture Notes in Computer Science, pages 818–833. Springer, 2014.

Article sourced from http://tongtianta.site/paper/237
Editor: Lornatang
Proofreader: Lornatang
