YOLOv3 paper: YOLOv3: An Incremental Improvement
YOLOv3 code: pytorch-yolov3
1. YOLOv3 architecture:
(1). CBL: Conv + BN + Leaky ReLU
(2). Res unit: a residual block built from 2 CBLs
(3). ResX: one CBL followed by X Res units
(a PyTorch sketch of these blocks follows at the end of 1.1)
1.1. Backbone:
Darknet53 = CBL + Res1 + Res2 + Res8 + Res8 + Res4 = 1*Conv + (1+2*1)*Conv + (1+2*2)*Conv + (1+2*8)*Conv + (1+2*8)*Conv + (1+2*4)*Conv = 52*Conv + 1*FC
The input is downsampled 5 times, so the output feature map is 1/32 the size of the input; the first CBL of each ResX does the downsampling with a stride-2 convolution.
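A minimal PyTorch sketch of these building blocks, assuming standard darknet conventions (class and argument names are mine; the referenced pytorch-yolov3 repo differs in detail):

```python
import torch.nn as nn

class CBL(nn.Module):
    """Conv + BatchNorm + LeakyReLU."""
    def __init__(self, c_in, c_out, k, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, k, stride, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.1),
        )
    def forward(self, x):
        return self.block(x)

class ResUnit(nn.Module):
    """Res unit: two CBLs (1x1 bottleneck, then 3x3 back up) plus a skip."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Sequential(CBL(c, c // 2, 1), CBL(c // 2, c, 3))
    def forward(self, x):
        return x + self.conv(x)

def res_x(c_in, c_out, x):
    """ResX: one stride-2 CBL (the downsampling step) followed by x ResUnits."""
    return nn.Sequential(CBL(c_in, c_out, 3, stride=2),
                         *[ResUnit(c_out) for _ in range(x)])

# Darknet53 feature extractor, per the composition above:
# CBL(3, 32, 3) -> res_x(32, 64, 1) -> res_x(64, 128, 2)
#              -> res_x(128, 256, 8) -> res_x(256, 512, 8) -> res_x(512, 1024, 4)
```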
1.2. Neck:
The neck uses an FPN structure.
FPN is a top-down structure:
Going top-down, FPN upsamples the high-level semantic features and fuses them with the lower-level localization features, and the resulting features at three scales are used for multi-scale prediction. Note that while the original FPN fuses by element-wise addition, YOLOv3's neck fuses by channel-wise concatenation (darknet's route layer).
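A sketch of one top-down fusion step as YOLOv3's cfg expresses it (1x1 conv, 2x upsample, route/concat); the function and argument names here are illustrative, not repo code:

```python
import torch
import torch.nn.functional as F

def fuse(high, low, reduce_conv):
    """One FPN-style step in YOLOv3: 1x1-reduce the deep feature map,
    upsample it 2x, then concatenate (not add) with the shallower map."""
    x = reduce_conv(high)                      # 1x1 conv reduces channels
    x = F.interpolate(x, scale_factor=2, mode="nearest")
    return torch.cat([x, low], dim=1)          # channel-wise concat (route layer)
```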
2. Loss function
The loss function in YOLOv1:

$$
\begin{aligned}
&\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right]\\
&+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right]\\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2\\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in\text{classes}}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$

where $\mathbb{1}_{i}^{obj}$ denotes if object appears in cell $i$ and $\mathbb{1}_{ij}^{obj}$ denotes that the jth bounding box predictor in cell $i$ is "responsible" for that prediction. Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is "responsible" for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
The loss function in YOLOv3 differs from YOLOv1's in the following ways:
(1). In YOLOv3 every bbx has its own conditional class probabilities, whereas in YOLOv1 the N bbxes of a cell share one set of conditional class probabilities, because YOLOv1 assumes each cell predicts only one object.
(2). YOLOv3 drops the square roots that YOLOv1 applies to the width and height in the coordinate loss.
(3). YOLOv3 adds an ignore mask to handle bbxes whose IOU with the gt exceeds the threshold but that are not the best match. We do not optimize these bbxes, i.e. all of their loss terms are 0; the mask is 0 in this case and 1 in all other cases.
(4). In the actual implementation, the object confidence and the object classifier use cross-entropy loss.
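A hedged sketch of how points (3) and (4) combine in the objectness term (mask semantics as described above; the function and variable names are assumptions, not code from either repo):

```python
import torch.nn.functional as F

def objectness_loss(pred_conf_logits, obj_mask, ignore_mask):
    """BCE objectness loss. obj_mask is 1 for the best-matching anchors;
    ignore_mask is 0 for anchors whose IOU with a gt exceeds the threshold
    but that are not the best match, so they contribute no loss at all."""
    target = obj_mask.float()
    bce = F.binary_cross_entropy_with_logits(pred_conf_logits, target,
                                             reduction="none")
    return (bce * ignore_mask).sum()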
3. NMS
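A minimal sketch of greedy NMS run per class (see note 5.1 below for the per-class behavior); the 0.45 IOU threshold is a common default and an assumption here, not a value this post specifies:

```python
import numpy as np

def iou(box, others):
    """IOU between one box (x1, y1, x2, y2) and an array of boxes."""
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2])
    y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms_per_class(boxes, scores, classes, iou_thr=0.45):
    """Greedy NMS run independently for each predicted class.
    boxes: (N, 4); scores: (N,); classes: (N,) int labels."""
    keep = []
    for c in np.unique(classes):
        idx = np.where(classes == c)[0]
        order = idx[np.argsort(-scores[idx])]       # highest score first
        while order.size > 0:
            best = order[0]
            keep.append(best)
            if order.size == 1:
                break
            ious = iou(boxes[best], boxes[order[1:]])
            order = order[1:][ious < iou_thr]       # drop heavy overlaps
    return keep
```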
4. How YOLOv3 works
4.1 YOLOv3 predictions
The network outputs four quantities, $t_x, t_y, t_w, t_h$, used to predict the bbx position and size:

$$b_x = \sigma(t_x) + c_x,\quad b_y = \sigma(t_y) + c_y,\quad b_w = p_w e^{t_w},\quad b_h = p_h e^{t_h}$$

where:
$t_x, t_y$ are offsets relative to the top-left corner $(c_x, c_y)$ of the cell, used to predict the bbx center $(b_x, b_y)$; the sigmoid keeps the center from drifting out of the current cell (each cell has side length "1");
$t_w, t_h$ are used to predict the bbx width and height $(b_w, b_h)$, normalized with respect to the input image;
$p_w, p_h$ are the anchor box's width and height.
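A small sketch decoding the four raw outputs into a box, directly following the formulas above (grid and anchor handling simplified to a single cell):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw network outputs to a box in grid units:
    bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy,
    bw = pw * exp(tw),     bh = ph * exp(th)."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx          # center stays inside the current cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)         # anchor width scaled by exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```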
4.2 Which bbxes take part in object prediction:
Only the prior box with the highest IOU with the gt takes part in predicting it;
non-best anchor boxes whose IOU with the gt exceeds the threshold have all of their losses set to 0, i.e. these anchor boxes are ignored;
anchor boxes whose IOU with the gt is below the threshold keep only the object-confidence loss, with label 0 (see the sketch after the excerpt below);
the prior boxes at the three scales, obtained by clustering: (10×13), (16×30), (33×23); (30×61), (62×45), (59×119); (116×90), (156×198), (373×326).
This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [17]. We use the threshold of .5. Unlike [17] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.
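Putting 4.2 together, a hedged sketch of the per-anchor label assignment for one gt box (the 0.5 threshold is from the excerpt above; the function and label encoding are mine):

```python
def assign_anchors(anchor_ious, ignore_thr=0.5):
    """anchor_ious: IOUs between one gt box and every anchor prior.
    Returns per-anchor labels: 1 = best match (all loss terms),
    -1 = ignored (no loss at all), 0 = negative (objectness loss only, label 0)."""
    best = max(range(len(anchor_ious)), key=lambda i: anchor_ious[i])
    labels = []
    for i, iou_val in enumerate(anchor_ious):
        if i == best:
            labels.append(1)
        elif iou_val > ignore_thr:
            labels.append(-1)   # over threshold but not best: ignored
        else:
            labels.append(0)    # negative: only objectness loss, target 0
    return labels
```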
4.3 Object class prediction
YOLOv3 uses logistic classifiers instead of a softmax, which makes multi-label classification possible. For example, in the Open Images Dataset an object can carry several labels (e.g. woman and person); softmax assumes each object belongs to exactly one class, and that assumption clearly does not hold on Open Images.
Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.
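A quick contrast of the two choices (the example scores are made up): with independent sigmoids an object can score high on several labels at once, e.g. both woman and person, while softmax forces the scores to compete.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, 2.5, -4.0])       # raw class scores for one box
p_softmax = torch.softmax(logits, dim=0)      # forced to sum to 1: single label
p_sigmoid = torch.sigmoid(logits)             # independent: several can be near 1
# the training target can then be multi-hot, e.g. [1., 1., 0.]
loss = F.binary_cross_entropy_with_logits(logits, torch.tensor([1., 1., 0.]))
```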
4.4 Multi-scale detection with FPN
Deep features have large receptive fields and suit large objects, while shallower features suit medium and small objects. The network's final output is a 3d tensor; for the COCO dataset its shape is N × N × [3 × (4 + 1 + 80)], where N is the grid size (the input image is divided into N × N cells), 3 means each cell predicts three bbxes, 4 stands for $t_x, t_y, t_w, t_h$, 1 is the object confidence, and 80 is COCO's 80 classes. For the VOC dataset the shape is N × N × [3 × (4 + 1 + 20)].
YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [8]. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [10] we predict 3 boxes at each scale so the tensor is N × N × [3 ∗ (4 + 1 + 80)] for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.
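As a quick sanity check of those shapes (assuming the standard 416×416 input, so the three grids are 52, 26 and 13 cells wide):

```python
for stride in (8, 16, 32):        # the three detection scales
    n = 416 // stride             # grid size N at this scale
    print(f"COCO: {n}x{n}x{3 * (4 + 1 + 80)}   VOC: {n}x{n}x{3 * (4 + 1 + 20)}")
# COCO: 52x52x255 / 26x26x255 / 13x13x255; VOC: 52x52x75 / 26x26x75 / 13x13x75
```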
4.5 Output structure
5. Notes
5.1. NMS is run separately for each class. For a multi-label object, a bbox may therefore be suppressed for one class label but kept for another, leaving bboxes of two classes on a single object. Theoretically possible, but not yet observed in practice.
5.2. Which anchor box is actually optimized during training: the anchor whose grid cell contains the object's center and whose IOU with the object is highest predicts that object; the object-confidence label of all other anchor boxes is set to 0.
5.3. For the prediction of $t_x, t_y, t_w, t_h$, the YOLOv3 paper says mean squared error is used for training, but in practice $t_x, t_y$ appear to be trained with a cross-entropy loss. To be verified against the backward-pass code in the official source.
During training we use sum of squared error loss. If the ground truth for some coordinate prediction is $\hat{t}_*$ our gradient is the ground truth value (computed from the ground truth box) minus our prediction: $\hat{t}_* - t_*$. This ground truth value can be easily computed by inverting the equations above.
The loss in the keras reproduction (a condensed sketch follows):
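A condensed sketch in the spirit of the common keras-yolo3 reproduction, showing the split discussed in 5.3: binary cross-entropy for x, y and for objectness/classes, squared error for w, h. The function name and tensor layout are mine, and the per-box loss weighting (box_loss_scale in the original) is omitted:

```python
from tensorflow.keras import backend as K

def yolo_loss_terms(raw_xy, raw_wh, raw_conf, raw_cls,
                    true_xy, true_wh, obj_mask, ignore_mask, true_cls):
    """All raw_* tensors are unactivated network outputs; true_xy / true_wh are
    the encoded targets t*. obj_mask marks best-match anchors; ignore_mask
    zeroes out over-threshold non-best anchors in the no-object term."""
    xy_loss = obj_mask * K.binary_crossentropy(true_xy, raw_xy, from_logits=True)
    wh_loss = obj_mask * 0.5 * K.square(true_wh - raw_wh)   # plain MSE on tw, th
    conf_loss = (obj_mask * K.binary_crossentropy(obj_mask, raw_conf, from_logits=True)
                 + (1 - obj_mask) * ignore_mask
                   * K.binary_crossentropy(obj_mask, raw_conf, from_logits=True))
    cls_loss = obj_mask * K.binary_crossentropy(true_cls, raw_cls, from_logits=True)
    return K.sum(xy_loss) + K.sum(wh_loss) + K.sum(conf_loss) + K.sum(cls_loss)
```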