YOLOv3 Code Analysis (Keras + TensorFlow)

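Having discussed the paper and the overall design in an earlier post (An In-Depth Look at YOLO v3), we now turn to the code. The original YOLO author implemented it in C; here I use a Keras + TensorFlow version taken from experiencor's git project keras-yolo3. I have added some comments of my own, and the annotated project is available as keras-yolo3 + annotations. Corrections are welcome.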

Figure 1: Detecting a raccoon

The sections below cover how the training samples are set up and how the loss is computed.

Figure 2: Input -> output

Training sample setup

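Referring to Figure 2 above, an input image of, say, 416*416*3 produces 13*13*3 + 26*26*3 + 52*52*3 = 10647 predicted boxes. We want these predictions to reflect as accurately as possible where objects are, which class each one belongs to, and where its bounding box lies.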
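
The box count above can be checked with a few lines of Python. This is only a sketch: the function name `count_predictions` is made up for illustration, and it assumes the three YOLOv3 output scales downsample the input by factors of 32, 16 and 8.

```python
def count_predictions(input_size=416, strides=(32, 16, 8), anchors_per_cell=3):
    """Return the grid side length at each scale and the total number of predicted boxes."""
    # Each scale divides the input into (input_size // stride)^2 grid cells.
    grids = [input_size // s for s in strides]
    # Every grid cell predicts one box per anchor.
    total = sum(g * g * anchors_per_cell for g in grids)
    return grids, total


grids, total = count_predictions()
print(grids, total)  # [13, 26, 52] 10647
```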

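When building the label y (a tensor of 10647 predicted boxes * (4 + 1 + number of classes)), YOLO's design is this: for each object in the input image, the grid cell that contains the center of the object's actual bounding box (the ground truth) is responsible for predicting that object. Because there are 3 scales and each grid cell has 3 prior boxes (anchors), one object center corresponds to 9 prior boxes. Only the prior box with the largest IOU against the ground-truth box is made responsible for the object (its confidence is set to 1); none of the other prior boxes is responsible for it (confidence = 0). In the output vector of the responsible prior box, the box position is set to the object's ground-truth box and the entry for the object's class is set to 1.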
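
The anchor selection described above can be sketched in NumPy as follows. This is an illustration, not the project's BatchGenerator code: the names `shape_iou` and `assign_anchor` are hypothetical, the anchor sizes are the default COCO values from the YOLOv3 paper (the project's config may use others), and the mapping of small anchors to the fine 52x52 grid is assumed to follow the paper.

```python
import numpy as np

# Assumed anchors (width, height in pixels): the COCO defaults from the YOLOv3 paper.
ANCHORS = np.array([[10, 13], [16, 30], [33, 23],
                    [30, 61], [62, 45], [59, 119],
                    [116, 90], [156, 198], [373, 326]], dtype=np.float32)


def shape_iou(box_wh, anchors_wh):
    """IOU between one box and every anchor, with all boxes aligned at a common corner."""
    inter = np.minimum(box_wh[0], anchors_wh[:, 0]) * np.minimum(box_wh[1], anchors_wh[:, 1])
    union = box_wh[0] * box_wh[1] + anchors_wh[:, 0] * anchors_wh[:, 1] - inter
    return inter / union


def assign_anchor(box_xywh, strides=(8, 16, 32)):
    """Pick the responsible anchor, scale and grid cell for one ground-truth box.

    box_xywh: (center_x, center_y, width, height) in input-image pixels.
    Returns (anchor index 0..8, scale index 0..2, (row, col) of the grid cell).
    """
    cx, cy, w, h = box_xywh
    best = int(np.argmax(shape_iou(np.array([w, h], dtype=np.float32), ANCHORS)))
    scale = best // 3                 # assumed: anchors 0-2 -> stride 8, 3-5 -> 16, 6-8 -> 32
    stride = strides[scale]
    row, col = int(cy // stride), int(cx // stride)
    return best, scale, (row, col)


print(assign_anchor((208, 208, 120, 100)))  # (6, 2, (6, 6))
```

Only the width and height enter the IOU here, because the anchor is placed at the object's own center when the two are compared.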

Loss computation

The loss has 3 main parts: confidence, box position, and object class.

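The first thing to note is again the confidence. As described above, for a real object only the prior box with the largest IOU has its confidence set to 1; all other prior boxes get 0. Yet some of those other prior boxes lie quite close to the object's ground-truth box and may well detect the object too, just not as the closest match. The target confidence for these boxes should therefore not be 0, but YOLO does not want them to output 1 either. So when computing the loss, an IOU threshold is set: boxes above the threshold (close to the target box but not the one with the largest IOU) are excluded from the loss, while boxes below the threshold are required to have confidence 0, that is, to detect background.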
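
The three-way split (responsible, ignored, background) can be written as a small NumPy sketch. The function name `confidence_targets` and the default threshold of 0.5 are assumptions made for illustration; the project reads its threshold from `self.ignore_thresh`.

```python
import numpy as np


def confidence_targets(object_mask, best_ious, ignore_thresh=0.5):
    """Split predicted boxes into responsible / ignored / background for the confidence loss.

    object_mask: 1 where the box is responsible for an object, else 0.
    best_ious:   for each predicted box, its highest IOU with any ground-truth box.
    Returns (target, weight): the target confidence, and a 0/1 weight that says
    whether the box contributes to the confidence loss at all.
    """
    object_mask = np.asarray(object_mask, dtype=np.float32)
    best_ious = np.asarray(best_ious, dtype=np.float32)
    # Background: not responsible for any object AND poor overlap with every true box.
    background = (1.0 - object_mask) * (best_ious < ignore_thresh)
    # Boxes that are neither responsible nor background get weight 0 (ignored).
    weight = object_mask + background
    target = object_mask
    return target, weight


target, weight = confidence_targets([1, 0, 0], [0.8, 0.6, 0.1])
print(target, weight)  # [1. 0. 0.] [1. 0. 1.]
```

The second box overlaps a true box well (IOU 0.6) without being responsible for it, so it is neither pushed toward 1 nor toward 0.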

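For boxes that detect an object, the box position loss and the object class loss are computed as well. For boxes that detect background, only the confidence loss is computed, since their box position and object class carry no meaning.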

Also note that the box position has to be transformed as designed in the paper; see Figure 3 below.

Figure 3: Bounding box prediction
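
The transform from the paper is b_x = sigma(t_x) + c_x, b_y = sigma(t_y) + c_y, b_w = p_w * exp(t_w), b_h = p_h * exp(t_h), where (c_x, c_y) is the grid cell offset and (p_w, p_h) the anchor size. A minimal NumPy sketch of decoding and of the inverse mapping for width and height is given below; the function names and the choice to return pixels are illustrative assumptions, not the project's API.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode_box(t_xywh, cell_xy, anchor_wh, stride):
    """Turn raw outputs (t_x, t_y, t_w, t_h) into a box in input-image pixels.

    cell_xy:   (c_x, c_y), column and row index of the grid cell.
    anchor_wh: (p_w, p_h), anchor size in pixels.
    stride:    pixels per grid cell at this scale (for example 32 on the 13x13 grid).
    """
    t_x, t_y, t_w, t_h = t_xywh
    b_x = (sigmoid(t_x) + cell_xy[0]) * stride   # center x: offset inside the cell plus cell index
    b_y = (sigmoid(t_y) + cell_xy[1]) * stride   # center y
    b_w = anchor_wh[0] * np.exp(t_w)             # width scales the anchor width
    b_h = anchor_wh[1] * np.exp(t_h)             # height scales the anchor height
    return b_x, b_y, b_w, b_h


def encode_wh(box_wh, anchor_wh):
    """Inverse of the width/height transform: t_wh = log(box_wh / anchor_wh)."""
    return np.log(box_wh[0] / anchor_wh[0]), np.log(box_wh[1] / anchor_wh[1])


# With all raw outputs at 0 the box sits at the cell center and has the anchor's size.
print(decode_box((0.0, 0.0, 0.0, 0.0), cell_xy=(6, 6), anchor_wh=(116, 90), stride=32))
```

The sigmoid keeps the predicted center inside its own grid cell, and storing log-ratios for width and height is why the loss code later applies tf.exp before comparing sizes.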

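Network structure

The full YOLOv3 network is fairly deep. Interested readers can look at the network structure diagram printed by Keras (the image is rather large).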

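Code

Overall, YOLO's design is not complicated, but a Python implementation involves a fair amount of tensor computation. For implementation details, see the project code and comments.

For training sample setup, see class BatchGenerator in generator.py.
For the loss computation, see call(self, x) in yolo.py.
The network structure is create_yolov3_model() in yolo.py.

To train and test the YOLO network in this project, follow the project's instructions.

Only the loss computation code is excerpted below.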


    """
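    The computation of one neural-network layer, which in practice computes the loss.
    YOLOv3 outputs feature maps at 3 scales; this computes the loss for the feature map of 1 scale.
    input
        x = input_image, y_pred, y_true, true_boxes
        Respectively: the input image, the tensor output by YOLO, the label y (the tensor we want it to output), and all ground truth boxes in the input image.
    return
        loss = box xy loss + box wh loss + box confidence loss + object class loss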
    """
    def call(self, x):
        # true_boxes corresponds to t_batch in BatchGenerator, shape=(batch, 1, 1, 1, max boxes per image, 4 coords)
        # y_true corresponds to yolo_1/yolo_2/yolo_3 in BatchGenerator, i.e. the tensor for one feature map
        input_image, y_pred, y_true, true_boxes = x

        # adjust the shape of the y_predict [batch, grid_h, grid_w, 3, 4+1+nb_class]
        # shape=(batch, grid_h, grid_w, 3 anchors, 4 box coords + 1 confidence + number of classes)
        y_pred = tf.reshape(y_pred, tf.concat([tf.shape(y_pred)[:3], tf.constant([3, -1])], axis=0))
        
        # initialize the masks
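        # object_mask is the confidence (objectness) of every predicted box on one feature map. It comes from the label y_true: every confidence is 0 except for the anchors responsible for detecting an object.
        # shape = (batch, grid_h, grid_w, 3 anchors, 1 confidence)
        # y_true[..., 4] extracts the box confidence (in the last dimension the first 4 entries are box coordinates and the 5th is the confidence); expand_dims restores the original tensor rank.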
        object_mask     = tf.expand_dims(y_true[..., 4], 4)

        # the variable to keep track of number of batches processed
        batch_seen = tf.Variable(0.)        

        # compute grid factor and net factor
        # width and height of the feature map
        grid_h      = tf.shape(y_true)[1]
        grid_w      = tf.shape(y_true)[2]
        grid_factor = tf.reshape(tf.cast([grid_w, grid_h], tf.float32), [1,1,1,1,2])

        # width and height of the input image
        net_h       = tf.shape(input_image)[1]
        net_w       = tf.shape(input_image)[2]            
        net_factor  = tf.reshape(tf.cast([net_w, net_h], tf.float32), [1,1,1,1,2])
        
        """
        Adjust prediction
        """
        # pred_box_xy is the predicted box center on the feature map, with each grid cell normalized to 1*1, = (sigma(t_xy) + c_xy)
        pred_box_xy    = (self.cell_grid[:,:grid_h,:grid_w,:,:] + tf.sigmoid(y_pred[..., :2]))  # shape=(batch, grid_h, grid_w, 3 anchors, 2 coords)
        # pred_box_wh is the predicted t_w, t_h. Note: truth_wh = anchor_wh * exp(t_wh)
        pred_box_wh    = y_pred[..., 2:4]                                                       # shape=(batch, grid_h, grid_w, 3 anchors, 2 coords)
        pred_box_conf  = tf.expand_dims(tf.sigmoid(y_pred[..., 4]), 4)                          # shape=(batch, grid_h, grid_w, 3 anchors, 1 confidence)
        pred_box_class = y_pred[..., 5:]                                                        # shape=(batch, grid_h, grid_w, 3 anchors, c classes)

        """
        Adjust ground truth
        """
        # true_box_xy is the ground-truth box center on the feature map, = (sigma(t_xy) + c_xy); see y_true
        true_box_xy    = y_true[..., 0:2]                  # shape=(batch, grid_h, grid_w, 3 anchors, 2 coords)
        # true_box_wh is the object's t_w, t_h. Note: truth_wh = anchor_wh * exp(t_wh)
        true_box_wh    = y_true[..., 2:4]                  # shape=(batch, grid_h, grid_w, 3 anchors, 2 coords)
        true_box_conf  = tf.expand_dims(y_true[..., 4], 4) # shape=(batch, grid_h, grid_w, 3 anchors, 1 confidence)
        true_box_class = tf.argmax(y_true[..., 5:], -1)    # shape=(batch, grid_h, grid_w, 3 anchors)

        """
        Compare each predicted box to all true boxes
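        This part finds the predicted boxes whose IOU is below the threshold, i.e. the boxes that are treated as detecting background.
        One feature map has width * height * 3 anchors predicted boxes. YOLO's strategy is that, among the 3 anchors of the grid cell containing an object's center, the anchor with the largest IOU is responsible for predicting that object (its confidence=1).
        Some nearby anchors also have a fairly large IOU, however. Requiring confidence=0 from them would be unreasonable, so they are left out of the loss. The remaining boxes are background, with confidence=0.
        Below, the IOU of every predicted box against every true box (iou_scores) is computed first; then the largest IOU is taken for each predicted box, and boxes below the threshold are treated as background and enter the loss.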
        """
        # initially, drag all objectness of all boxes to 0
        conf_delta  = pred_box_conf - 0 

        # then, ignore the boxes which have good overlap with some true box
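        # true_xy and true_wh are expressed as if the original image's width and height were normalized to 1*1
        true_xy = true_boxes[..., 0:2] / grid_factor  # shape=(batch, 1, 1, 1, max boxes per image (e.g. 3), 2 xy coords); xy are feature-map coordinates, same as xy in y_true
        true_wh = true_boxes[..., 2:4] / net_factor   # shape=(batch, 1, 1, 1, max boxes per image (e.g. 3), 2 wh values); wh are the object's width and height in the original image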
        true_wh_half = true_wh / 2.
        true_mins    = true_xy - true_wh_half
        true_maxes   = true_xy + true_wh_half
        
        pred_xy = tf.expand_dims(pred_box_xy / grid_factor, 4)                        # shape=(batch, grid_h, grid_w, 3 anchors, 1, 2 coords)
        pred_wh = tf.expand_dims(tf.exp(pred_box_wh) * self.anchors / net_factor, 4)  # shape=(batch, grid_h, grid_w, 3 anchors, 1, 2 coords)
        
        pred_wh_half = pred_wh / 2.
        pred_mins    = pred_xy - pred_wh_half
        pred_maxes   = pred_xy + pred_wh_half    

        intersect_mins  = tf.maximum(pred_mins,  true_mins)  # shape=(batch, grid_h, grid_w, 3 anchors, max boxes per image, 2 coords)
        intersect_maxes = tf.minimum(pred_maxes, true_maxes) # shape=(batch, grid_h, grid_w, 3 anchors, max boxes per image, 2 coords)

        intersect_wh    = tf.maximum(intersect_maxes - intersect_mins, 0.)  # shape=(batch, grid_h, grid_w, 3 anchors, max boxes per image, 2 coords)
        intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]       # shape=(batch, grid_h, grid_w, 3 anchors, max boxes per image)

        true_areas = true_wh[..., 0] * true_wh[..., 1]  # shape=(batch, 1, 1, 1, max boxes per image)
        pred_areas = pred_wh[..., 0] * pred_wh[..., 1]  # shape=(batch, grid_h, grid_w, 3 anchors, 1)

        union_areas = pred_areas + true_areas - intersect_areas  # shape=(batch, grid_h, grid_w, 3 anchors, max boxes per image)
        iou_scores  = tf.truediv(intersect_areas, union_areas)   # shape=(batch, grid_h, grid_w, 3 anchors, max boxes per image)

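        # IOU of each predicted box with its closest ground-truth object
        best_ious   = tf.reduce_max(iou_scores, axis=4)  # shape=(batch, grid_h, grid_w, 3 anchors)

        # only predicted boxes whose IOU is below the threshold contribute the (background) confidence loss
        conf_delta *= tf.expand_dims(tf.to_float(best_ious < self.ignore_thresh), 4) # shape=(batch, grid_h, grid_w, 3 anchors, 1 confidence)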

        """
        Compute some online statistics
        """            
        true_xy = true_box_xy / grid_factor
        true_wh = tf.exp(true_box_wh) * self.anchors / net_factor

        true_wh_half = true_wh / 2.
        true_mins    = true_xy - true_wh_half
        true_maxes   = true_xy + true_wh_half

        pred_xy = pred_box_xy / grid_factor
        pred_wh = tf.exp(pred_box_wh) * self.anchors / net_factor 
        
        pred_wh_half = pred_wh / 2.
        pred_mins    = pred_xy - pred_wh_half
        pred_maxes   = pred_xy + pred_wh_half      

        intersect_mins  = tf.maximum(pred_mins,  true_mins)
        intersect_maxes = tf.minimum(pred_maxes, true_maxes)
        intersect_wh    = tf.maximum(intersect_maxes - intersect_mins, 0.)
        intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]
        
        true_areas = true_wh[..., 0] * true_wh[..., 1]
        pred_areas = pred_wh[..., 0] * pred_wh[..., 1]

        union_areas = pred_areas + true_areas - intersect_areas
        iou_scores  = tf.truediv(intersect_areas, union_areas)
        iou_scores  = object_mask * tf.expand_dims(iou_scores, 4)
        
        count       = tf.reduce_sum(object_mask)
        count_noobj = tf.reduce_sum(1 - object_mask)
        detect_mask = tf.to_float((pred_box_conf*object_mask) >= 0.5)
        class_mask  = tf.expand_dims(tf.to_float(tf.equal(tf.argmax(pred_box_class, -1), true_box_class)), 4)
        recall50    = tf.reduce_sum(tf.to_float(iou_scores >= 0.5 ) * detect_mask  * class_mask) / (count + 1e-3)
        recall75    = tf.reduce_sum(tf.to_float(iou_scores >= 0.75) * detect_mask  * class_mask) / (count + 1e-3)    
        avg_iou     = tf.reduce_sum(iou_scores) / (count + 1e-3)
        avg_obj     = tf.reduce_sum(pred_box_conf  * object_mask)  / (count + 1e-3)
        avg_noobj   = tf.reduce_sum(pred_box_conf  * (1-object_mask))  / (count_noobj + 1e-3)
        avg_cat     = tf.reduce_sum(object_mask * class_mask) / (count + 1e-3) 

        """
        Warm-up training
        """
        batch_seen = tf.assign_add(batch_seen, 1.)
        
        true_box_xy, true_box_wh, xywh_mask = tf.cond(tf.less(batch_seen, self.warmup_batches+1),
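                              # Following the design introduced in YOLOv2, the first self.warmup_batches batches compute the error between the predicted boxes and the prior boxes (anchors), not against the ground-truth boxes.
                              # The code here looks slightly off, though.
                              lambda: [true_box_xy + (0.5 + self.cell_grid[:,:grid_h,:grid_w,:,:]) * (1-object_mask), 
                                       true_box_wh + tf.zeros_like(true_box_wh) * (1-object_mask),   # zeros_like makes the added term 0, so this is still just true_box_wh (which is t_wh = 0, i.e. the anchor size, wherever y_true is zero); may need fixing
                                       tf.ones_like(object_mask)],                                   # the position of every predicted box enters the loss
                              # later batches get no special treatment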
                              lambda: [true_box_xy, 
                                       true_box_wh,
                                       object_mask])

        """
        Compare each true box to all anchor boxes
        """
        # Note: exp(true_box_wh) = exp(t_wh) = truth_wh / anchor_wh
        # exp(true_box_wh) * self.anchors / net_factor = truth_wh / anchor_wh * self.anchors / net_factor = truth_wh / net_factor
        # wh_scale is the size of the actual object relative to the input image.
        wh_scale = tf.exp(true_box_wh) * self.anchors / net_factor   # shape=(batch, grid_h, grid_w, 3 anchors, 2 coords)
        # wh_scale decreases as the ground-truth box area grows, so box errors on small objects weigh more: the smaller the box, the bigger the scale
        wh_scale = tf.expand_dims(2 - wh_scale[..., 0] * wh_scale[..., 1], axis=4)

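        # Normally (after warmup_batches), xywh_mask = object_mask, so only predicted boxes that contain an object (whose position, confidence and class are meaningful) contribute to these losses.
        # For predicted boxes without an object, the confidence is meaningful (conf_delta has already filtered out boxes whose IOU exceeds the threshold) and enters the loss; position and class are meaningless and do not.
        xy_delta    = xywh_mask   * (pred_box_xy-true_box_xy) * wh_scale * self.xywh_scale  # shape=(batch, grid_h, grid_w, 3 anchors, 2 coords)
        wh_delta    = xywh_mask   * (pred_box_wh-true_box_wh) * wh_scale * self.xywh_scale  # shape=(batch, grid_h, grid_w, 3 anchors, 2 coords)
        # shape=(batch, grid_h, grid_w, 3 anchors, 1 confidence); the first term is the confidence of boxes detecting an object, the second term that of boxes detecting background
        conf_delta  = object_mask * (pred_box_conf-true_box_conf) * self.obj_scale + (1-object_mask) * conf_delta * self.noobj_scale
        # shape=(batch, grid_h, grid_w, 3 anchors, 1 cross-entropy)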
        class_delta = object_mask * \
                      tf.expand_dims(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=true_box_class, logits=pred_box_class), 4) * \
                      self.class_scale

        # shape=(batch_size,)
        loss_xy    = tf.reduce_sum(tf.square(xy_delta),       list(range(1,5)))
        loss_wh    = tf.reduce_sum(tf.square(wh_delta),       list(range(1,5)))
        loss_conf  = tf.reduce_sum(tf.square(conf_delta),     list(range(1,5)))
        loss_class = tf.reduce_sum(class_delta,               list(range(1,5)))

        loss = loss_xy + loss_wh + loss_conf + loss_class

        loss = tf.Print(loss, [grid_h, avg_obj], message='avg_obj \t\t', summarize=1000)
        loss = tf.Print(loss, [grid_h, avg_noobj], message='avg_noobj \t\t', summarize=1000)
        loss = tf.Print(loss, [grid_h, avg_iou], message='avg_iou \t\t', summarize=1000)
        loss = tf.Print(loss, [grid_h, avg_cat], message='avg_cat \t\t', summarize=1000)
        loss = tf.Print(loss, [grid_h, recall50], message='recall50 \t\t', summarize=1000)
        loss = tf.Print(loss, [grid_h, recall75], message='recall75 \t\t', summarize=1000)
        loss = tf.Print(loss, [grid_h, count], message='count \t\t\t', summarize=1000)
        loss = tf.Print(loss, [grid_h, tf.reduce_sum(loss_xy), 
                                       tf.reduce_sum(loss_wh), 
                                       tf.reduce_sum(loss_conf), 
                                       tf.reduce_sum(loss_class)],  message='loss xy, wh, conf, class: \t',   summarize=1000)   

        # loss has shape=(batch_size,)
        return loss*self.grid_scale
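
The "compare each predicted box to all true boxes" step relies on broadcasting: predictions get an extra axis of size 1, the true boxes vary along that axis, and the arithmetic expands to every pair. The same idea in plain NumPy, reduced to two dimensions, might look like the sketch below. The function name `pairwise_iou` is hypothetical and the boxes are assumed to be given as centers and sizes in normalized image coordinates.

```python
import numpy as np


def pairwise_iou(pred_xy, pred_wh, true_xy, true_wh):
    """IOU of every predicted box against every ground-truth box.

    pred_xy, pred_wh: arrays of shape (N, 2); true_xy, true_wh: arrays of shape (M, 2).
    Returns an array of shape (N, M).
    """
    pred_xy = np.asarray(pred_xy, dtype=float)[:, None, :]   # (N, 1, 2)
    pred_wh = np.asarray(pred_wh, dtype=float)[:, None, :]
    true_xy = np.asarray(true_xy, dtype=float)[None, :, :]   # (1, M, 2)
    true_wh = np.asarray(true_wh, dtype=float)[None, :, :]

    # Corner coordinates of both sets of boxes.
    pred_mins, pred_maxes = pred_xy - pred_wh / 2.0, pred_xy + pred_wh / 2.0
    true_mins, true_maxes = true_xy - true_wh / 2.0, true_xy + true_wh / 2.0

    # Broadcasting (N, 1, 2) against (1, M, 2) yields (N, M, 2).
    intersect_wh = np.maximum(np.minimum(pred_maxes, true_maxes)
                              - np.maximum(pred_mins, true_mins), 0.0)
    intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]

    pred_areas = pred_wh[..., 0] * pred_wh[..., 1]            # (N, 1)
    true_areas = true_wh[..., 0] * true_wh[..., 1]            # (1, M)
    return intersect_areas / (pred_areas + true_areas - intersect_areas)


ious = pairwise_iou(pred_xy=[[0.5, 0.5], [0.1, 0.1]], pred_wh=[[0.2, 0.2], [0.1, 0.1]],
                    true_xy=[[0.5, 0.5], [0.6, 0.5]], true_wh=[[0.2, 0.2], [0.2, 0.2]])
best_ious = ious.max(axis=1)   # counterpart of tf.reduce_max(iou_scores, axis=4)
```

Taking the maximum over the true-box axis gives, for each prediction, the overlap with its closest object, which is the quantity compared against the ignore threshold.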

References

[1] YOLOv3: An Incremental Improvement
[2] An In-Depth Look at YOLO v3 (YOLO v3深入理解)
[3] keras-yolo3 + annotations
