Modern History of Object Recognition

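I recently read an excellent article, "Modern History of Object Recognition", which charts how object recognition algorithms developed from 2012 to 2017. It is worth taking notes on, and doing so also helps me organize my own knowledge of this area.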

1. Topics

Image classification:
Classify an image based on the dominant object inside it.
Object localization:
Predict the image region that contains the dominant object. Image classification can then be used to recognize the object in that region.
Object recognition:
Localize and classify all objects appearing in the image. This task typically involves proposing regions, then classifying the objects inside them.
Semantic segmentation:
Label each pixel of an image with the object class it belongs to, such as human, sheep, and grass in the example.
Instance segmentation:
Label each pixel of an image with the object class and the object instance it belongs to.
Keypoint detection:
Detect the locations of a set of predefined keypoints of an object, such as keypoints on a human body or a human face.

2. Algorithm Development Roadmap

[Figure: roadmap]

3. Important Object Recognition Concepts

3.1 Bounding box proposal

The following terms all mean the same thing:

region of interest
region proposal
box proposal

A rectangular region of the input image that potentially contains an object. These proposals can be generated by a heuristic search (objectness, selective search) or by a region proposal network (RPN).

A bounding box can be represented as a 4-element vector, storing either its two corner coordinates (x0, y0, x1, y1), i.e. the top-left and bottom-right points, or (more commonly) its center location together with its width and height (x, y, w, h). A bounding box is usually accompanied by a confidence score indicating how likely the box is to contain an object.
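
The two representations carry the same information, so converting between them is simple arithmetic. A minimal sketch, assuming image coordinates where x1 > x0 and y1 > y0; the function names are illustrative and not from the article:

```python
def corners_to_center(box):
    """Convert (x0, y0, x1, y1) to (x, y, w, h), where (x, y) is the box center."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0, x1 - x0, y1 - y0)


def center_to_corners(box):
    """Convert (x, y, w, h) back to (x0, y0, x1, y1)."""
    x, y, w, h = box
    return (x - w / 2.0, y - h / 2.0, x + w / 2.0, y + h / 2.0)


corner_box = (10.0, 20.0, 50.0, 100.0)
center_box = corners_to_center(corner_box)
print(center_box)                     # (30.0, 60.0, 40.0, 80.0)
print(center_to_corners(center_box))  # (10.0, 20.0, 50.0, 100.0)
```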

The difference between two bounding boxes is usually measured by the L2 (Euclidean) distance between their vector representations. w and h can be log-transformed before the distance is calculated.
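
As a rough illustration of the distance described above, the sketch below works on (x, y, w, h) vectors and optionally log-transforms the sizes first. The function name and the `log_size` flag are assumptions made for this example. One effect of the log transform is that a doubling in size contributes the same amount whether the boxes are small or large.

```python
import math


def box_distance(a, b, log_size=True):
    """L2 distance between two (x, y, w, h) boxes, optionally on log(w), log(h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    if log_size:
        aw, ah, bw, bh = (math.log(v) for v in (aw, ah, bw, bh))
    return math.sqrt((ax - bx) ** 2 + (ay - by) ** 2
                     + (aw - bw) ** 2 + (ah - bh) ** 2)


# Same size, centers 3 and 4 pixels apart: the distance is 5.
print(box_distance((0, 0, 10, 10), (3, 4, 10, 10)))  # 5.0
```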

3.2 Intersection over Union (IoU)

A metric that measures the similarity between two bounding boxes: the area of their overlap divided by the area of their union.

[Figure: intersection over union]
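
The definition translates directly into code. A minimal sketch, assuming boxes in (x0, y0, x1, y1) corner format; the function name is illustrative:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1).
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```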

3.3 Non-Maximum Suppression (NMS)

A common algorithm for merging overlapping bounding boxes (proposals or detections). Any bounding box that significantly overlaps (IoU > IoU_threshold) with a higher-confidence bounding box is suppressed (removed).

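[Figure: non-maximum suppression]

The figure shows two boxes with confidence scores of 0.9 and 0.6. Assuming their IoU exceeds the given threshold of 0.5, the box with confidence 0.6 is discarded.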
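
The suppression rule can be sketched as a greedy loop: visit boxes from highest to lowest confidence and keep a box only if it does not overlap too much with any box already kept. This is a plain-Python illustration with assumed names and corner-format boxes, not an optimized implementation:

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive, highest confidence first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Suppress box i if it overlaps a higher-confidence kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.6, 0.8]
print(nms(boxes, scores))  # [0, 2]
```

In this example the second box (score 0.6) overlaps the first (score 0.9) with an IoU of about 0.68, so it is removed, which mirrors the two-box case in the figure.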

3.4 Bounding box regression (bounding box refinement)

By looking at an input region, we can infer the bounding box that better fits the object inside, even if the object is only partly visible.

The example in the figure below illustrates that the ground truth (GT) box can be inferred by looking at only part of an object. Therefore, a regressor can be trained to look at an input region and predict the offset Δ(x, y, w, h) between the input region box and the ground truth box.

[Figure: bounding box regression]
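
The notes do not say how the offset Δ(x, y, w, h) is parameterized. The sketch below assumes one widely used choice (center shifts scaled by the region size, and log ratios for width and height); treat that parameterization and the function names as assumptions for illustration only.

```python
import math


def encode_offset(region, truth):
    """Offset from an input region to the ground truth, both as (x, y, w, h).

    Assumed parameterization: scaled center shift plus log size ratio.
    """
    px, py, pw, ph = region
    gx, gy, gw, gh = truth
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def apply_offset(region, delta):
    """Apply a predicted offset to the input region to get the refined box."""
    px, py, pw, ph = region
    dx, dy, dw, dh = delta
    return (px + dx * pw, py + dy * ph, pw * math.exp(dw), ph * math.exp(dh))


region = (50.0, 50.0, 20.0, 40.0)   # the box the regressor looks at
truth = (55.0, 60.0, 40.0, 40.0)    # the box it should predict
delta = encode_offset(region, truth)  # training target for the regressor
refined = apply_offset(region, delta)  # recovers the ground truth box
```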

If we have one regressor for each object class, it is called class-specific regression; otherwise, it is called class-agnostic (one regressor for all classes). A bounding box regressor is often accompanied by a bounding box classifier (confidence scorer) that estimates the confidence that an object exists in the box. The classifier can also be class-specific or class-agnostic. When no prior boxes are defined, the input region box plays the role of a prior box.

The following passage helps clarify class-specific versus class-agnostic; in short, the terms mean dedicated to one class versus shared across all classes:

For a class-aware detector, if you feed it an image, it will return a set of bounding boxes, each box associated with the class of the object inside (e.g. dog, cat, car). It means that by the time the detector has finished detecting, it knows what type of object was detected.

A class-agnostic detector detects a bunch of objects without knowing what class they belong to. To put it simply, it only detects “foreground” objects. Foreground is a broad term, but usually it is a set that contains all specific classes we want to find in an image, i.e. foreground = {cat, dog, car, airplane, …}. Since it doesn’t know the class of the object it detected, we call it class-agnostic.

Class-agnostic detectors are often used as a pre-processor: to produce a bunch of interesting bounding boxes that have a high chance of containing a cat, dog, car, etc. Obviously, we need a specialized classifier after a class-agnostic detector to actually know what class each bounding box contains.

3.5 Prior box (default box, anchor box)

Instead of using the input region as the only prior box, we can train multiple bounding box regressors, each looking at the same input region but with a different prior box, and each learning to predict the offset between its own prior box and the ground truth box. This way, regressors with different prior boxes can learn to predict bounding boxes with different properties (aspect ratio, scale, location). Prior boxes can be predefined relative to the input region, or learned by clustering. An appropriate box matching strategy is crucial for making the training converge.
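
One way to predefine prior boxes relative to an input region is to enumerate a few scales and aspect ratios around the region's center. The sketch below is only an illustration: the specific scales, ratios, and the function name are assumptions, not values from the article.

```python
import math


def make_prior_boxes(region, scales=(0.5, 1.0), aspect_ratios=(0.5, 1.0, 2.0)):
    """Build prior boxes (x, y, w, h) centered on an input region (x, y, w, h).

    Each aspect ratio is width / height; for a given scale, every prior
    covers the same area as the region scaled by that factor.
    """
    x, y, w, h = region
    priors = []
    for scale in scales:
        for ratio in aspect_ratios:
            prior_w = scale * w * math.sqrt(ratio)
            prior_h = scale * h / math.sqrt(ratio)
            priors.append((x, y, prior_w, prior_h))
    return priors


priors = make_prior_boxes((50.0, 50.0, 32.0, 32.0))
for prior in priors:
    print(prior)
```

Each prior would then get its own regressor output, so that wide, tall, small, and large objects are each handled by a prior that already roughly resembles them.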

3.6 Box Matching Strategy

We cannot expect a bounding box regressor to predict the bounding box of an object that is too far away from its input region or (more commonly) its prior box. Therefore, we need a box matching strategy to decide which prior box is matched with which ground truth box. Each match is a training example for the regressor. Possible strategies: (MultiBox) match each ground truth box with the one prior box that has the highest IoU; (SSD, Faster R-CNN) match a prior box with any ground truth box whose IoU with it is higher than 0.5.
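
The two strategies differ in what they loop over: the first picks one prior per ground truth box, the second accepts every (prior, ground truth) pair above a threshold. A simplified sketch with assumed function names and corner-format boxes; real detectors add further rules that are not covered in these notes:

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_best_prior(priors, truths):
    """For each ground truth index, the index of the prior with the highest IoU."""
    return {t: max(range(len(priors)), key=lambda p: iou(priors[p], truths[t]))
            for t in range(len(truths))}


def match_by_threshold(priors, truths, threshold=0.5):
    """All (prior index, ground truth index) pairs with IoU above the threshold."""
    return [(p, t)
            for p in range(len(priors))
            for t in range(len(truths))
            if iou(priors[p], truths[t]) > threshold]


priors = [(0, 0, 10, 10), (4, 0, 14, 10), (20, 20, 30, 30)]
truths = [(1, 0, 11, 10)]
print(match_best_prior(priors, truths))    # {0: 0}
print(match_by_threshold(priors, truths))  # [(0, 0), (1, 0)]
```

Here the threshold strategy yields two training examples for the same ground truth box, while the best-prior strategy yields one.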

3.7 Hard negative example mining

For each prior box, there is a bounding box classifier that estimates the likelihood of an object being inside. After box matching, all matched prior boxes are positive examples for the classifier, and all other prior boxes are negatives. If we used all of these negative examples, there would be a significant imbalance between positives and negatives. Possible solutions: pick negative examples at random (Faster R-CNN), or pick the ones on which the classifier makes the most serious errors (SSD), so that the ratio of negatives to positives is roughly 3:1.
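
The "most serious error" selection can be sketched as sorting the negatives by their classification loss and keeping the top few. Using the per-box loss as the error measure is an assumption of this example, as are the names and the sample numbers:

```python
def select_hard_negatives(losses, is_positive, neg_pos_ratio=3):
    """Indices of the highest-loss negatives, capped at neg_pos_ratio per positive."""
    num_positive = sum(is_positive)
    negatives = [i for i, positive in enumerate(is_positive) if not positive]
    # Hardest negatives first: the ones the classifier gets most wrong.
    negatives.sort(key=lambda i: losses[i], reverse=True)
    return negatives[: neg_pos_ratio * num_positive]


# One matched (positive) prior box and six unmatched (negative) ones.
losses = [0.1, 2.0, 0.3, 1.5, 0.05, 0.9, 0.2]
is_positive = [True, False, False, False, False, False, False]
print(select_hard_negatives(losses, is_positive))  # [1, 3, 5]
```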

Important CNN Concepts

Feature

A hidden neuron that is activated when a particular pattern (feature) is present in its input region (receptive field). The pattern that a neuron detects can be visualized by (1) optimizing its input region to maximize the neuron’s activation (deep dream), (2) visualizing the gradient or guided gradient of the neuron's activation with respect to its input pixels (back propagation and guided back propagation), or (3) visualizing the set of image regions in the training dataset that activate the neuron the most.

Receptive Field

The region of the input image that affects the activation of a feature. In other words, it is the region that the feature is looking at. Generally, a feature in a higher layer has a bigger receptive field, which allows it to learn to capture a more complex or abstract pattern. The ConvNet architecture determines how the receptive field changes layer by layer.
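
How the receptive field grows through a stack of layers can be computed with a simple recurrence: each layer with kernel size k widens the field by (k - 1) times the current spacing between neighboring features, and its stride multiplies that spacing. A minimal sketch, assuming a plain chain of convolution or pooling layers without dilation; the function name is illustrative:

```python
def receptive_field(layers):
    """Receptive field size (in input pixels) of one feature after the last layer.

    layers is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    size = 1   # a single input pixel sees itself
    jump = 1   # distance in input pixels between neighboring features
    for kernel_size, stride in layers:
        size += (kernel_size - 1) * jump
        jump *= stride
    return size


print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

The second example shows why downsampling layers matter: after a stride-2 pooling layer, the same 3x3 convolution widens the receptive field twice as much.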

Feature Map

A set of features created by applying the same feature detector at different locations of an input map in a sliding window fashion (i.e. convolution). Features in the same feature map have the same receptive field size and look for the same pattern, but at different locations. This gives a ConvNet its spatial invariance properties.
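
A one-dimensional toy example may make the sliding-window idea concrete. The detector below responds to a rising edge; when the edge moves one step in the input, the peak in the feature map moves with it. The signal values and names are made up for illustration:

```python
def feature_map(signal, detector):
    """Apply the same detector at every position of a 1-D input (valid convolution)."""
    k = len(detector)
    return [sum(signal[i + j] * detector[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


detector = [-1, 1]  # fires on a 0 -> 1 step
map_a = feature_map([0, 0, 1, 1, 0, 0], detector)
map_b = feature_map([0, 0, 0, 1, 1, 0], detector)  # same pattern, shifted right
print(map_a)  # [0, 1, 0, -1, 0]
print(map_b)  # [0, 0, 1, 0, -1]
```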

Feature Volume

A set of feature maps, each of which searches for a particular feature at a fixed set of locations on the input map. All features have the same receptive field size.

Fully connected layer as Feature Volume

Fully connected layers (fc layers, usually attached to the end of a ConvNet for classification) with k hidden nodes can be seen as a 1x1xk feature volume. This feature volume has one feature in each feature map, and its receptive field covers the whole image.

The weight matrix W in an fc layer can be converted to a CNN kernel. Convolving k filters, each of size wxhxd, with a wxhxd CNN feature volume creates a 1x1xk feature volume (equivalent to an fc layer with k nodes). Likewise, convolving k filters of size 1x1xd with a 1x1xd feature volume creates a 1x1xk feature volume. Replacing fully connected layers with convolution layers allows us to apply a ConvNet to an image of arbitrary size.
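
The equivalence can be checked numerically on a tiny example. The sketch below reshapes each row of an fc weight matrix into a filter that covers the whole feature volume, then shows that the convolution gives the same numbers as the fc layer, and that the same filters also run on a larger input. The data layout (height x width x depth nested lists), the function names, and the numbers are all assumptions for illustration:

```python
def fc_layer(volume, weights):
    """Fully connected layer: flatten the h x w x d volume, multiply by each weight row."""
    flat = [v for row in volume for cell in row for v in cell]
    return [sum(wi * xi for wi, xi in zip(weight_row, flat)) for weight_row in weights]


def fc_to_kernels(weights, h, w, d):
    """Reshape each fc weight row (length h*w*d) into an h x w x d filter."""
    return [[[row[(a * w + b) * d:(a * w + b) * d + d] for b in range(w)]
             for a in range(h)]
            for row in weights]


def conv_valid(volume, kernels):
    """Stride-1 convolution without padding; output is out_h x out_w x k."""
    h, w, d = len(volume), len(volume[0]), len(volume[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    return [[[sum(kern[a][b][c] * volume[i + a][j + b][c]
                  for a in range(kh) for b in range(kw) for c in range(d))
              for kern in kernels]
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]


volume = [[[1], [2]], [[3], [4]]]          # 2 x 2 x 1 feature volume
weights = [[1, 0, 0, 1], [1, 1, 1, 1]]     # fc layer with k = 2 nodes
kernels = fc_to_kernels(weights, 2, 2, 1)

print(fc_layer(volume, weights))           # [5, 10]
print(conv_valid(volume, kernels))         # [[[5, 10]]], i.e. a 1 x 1 x 2 volume

# The same filters also apply to a larger 3 x 3 x 1 input, giving a 2 x 2 x 2 output.
bigger = [[[1], [2], [0]], [[3], [4], [0]], [[0], [0], [0]]]
bigger_out = conv_valid(bigger, kernels)
```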

Transposed Convolution (deconvolution)

The operation that back-propagates the gradient of a convolution operation. In other words, it is simply the backward pass of a convolution layer. A transposed convolution can be implemented as a normal convolution with zeros inserted between the input features. A convolution with filter size k, stride s and zero padding p has an associated transposed convolution with filter size k’=k, stride s’=1, zero padding p’=k-p-1, and s-1 zeros inserted between each pair of input units.
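
The zero-insertion recipe can be verified in one dimension. The sketch below computes a transposed convolution in two ways: directly, by scattering each input value through the filter (which is what the backward pass of a convolution does), and as a stride-1 convolution over the zero-inserted, padded input. Because the sliding-window sum here is a cross-correlation, the second version uses the flipped filter; that detail, the names, and the sample numbers belong to this sketch and are not taken from the article.

```python
def transposed_conv_scatter(y, w, stride, pad):
    """Direct form: every input value adds a scaled copy of the filter to the output."""
    k = len(w)
    full = [0] * ((len(y) - 1) * stride + k)
    for i, yi in enumerate(y):
        for j, wj in enumerate(w):
            full[i * stride + j] += yi * wj
    # Remove the positions that correspond to the forward convolution's padding.
    return full[pad:len(full) - pad]


def transposed_conv_zero_insert(y, w, stride, pad):
    """Same result as a stride-1 convolution with zeros inserted between inputs."""
    k = len(w)
    z = []
    for i, yi in enumerate(y):
        z.append(yi)
        if i < len(y) - 1:
            z.extend([0] * (stride - 1))   # s - 1 zeros between input units
    border = [0] * (k - pad - 1)           # zero padding p' = k - p - 1
    z = border + z + border
    flipped = w[::-1]
    return [sum(z[m + j] * flipped[j] for j in range(k))
            for m in range(len(z) - k + 1)]


y = [1, 2, 3]
w = [1, 10, 100]
print(transposed_conv_scatter(y, w, stride=2, pad=1))      # [10, 102, 20, 203, 30]
print(transposed_conv_zero_insert(y, w, stride=2, pad=1))  # [10, 102, 20, 203, 30]
```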

End-To-End object recognition pipeline

An object recognition pipeline in which all stages (pre-processing, region proposal generation, proposal classification, post-processing) can be trained together by optimizing a single objective function, which is a differentiable function of all stages’ variables. This end-to-end pipeline is the opposite of the traditional object recognition pipeline, which connects stages in a non-differentiable fashion. In such traditional systems, we do not know how changing one stage’s variables affects the overall performance, so each stage must be trained independently or alternately, or be heuristically programmed.