Network Architecture
The network can be divided into three parts:
- Head
- Region Proposal Network (RPN)
- Classification Network
Region Proposal Network
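The feature map produced by the CNN (the head) is the input. It passes through a 2D convolution and then splits into two branches: 1) one branch classifies each anchor box produced by the Anchor Generation Layer as background or foreground (class scores and probabilities); 2) the other produces regression coefficients that refine the corresponding anchor boxes. Both outputs are then passed to the proposal layer.
Training
Training learns the parameters of each layer. The layers can be roughly grouped as: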
- Anchor Generation Layer
- Region Proposal Layer
- Proposal Layer
- Anchor Target Layer
- Proposal Target Layer
- ROI Pooling Layer
- Classification Layer
Anchor Generation Layer
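Generates a fixed number of rectangular boxes (anchor boxes) of different sizes and aspect ratios. These are then passed to a convolutional network, which decides which boxes contain an object (positive anchors) and which do not (negative anchors). This decision is a binary classification problem.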
Anchor Generation Layer: This layer generates a fixed number of “anchors” (bounding boxes) by first generating 9 anchors of different scales and aspect ratios and then replicating these anchors by translating them across uniformly spaced grid points spanning the input image.
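The anchor generation step can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the stride of 16, the scales (8, 16, 32), the ratios (0.5, 1, 2), the placement of anchors at cell centres, and the function names `base_anchors` and `grid_anchors` are all assumptions chosen for the example.

```python
import math

def base_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Return 9 anchors (x1, y1, x2, y2) centred on the origin.

    All default values are assumptions for illustration.
    """
    anchors = []
    for ratio in ratios:              # ratio is height / width
        for scale in scales:
            area = (base_size * scale) ** 2
            w = math.sqrt(area / ratio)
            h = w * ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors

def grid_anchors(feat_h, feat_w, stride=16):
    """Replicate the base anchors at every grid point of the feature map."""
    base = base_anchors()
    out = []
    for gy in range(feat_h):
        for gx in range(feat_w):
            # Centre of this feature-map cell in input-image pixels (assumed convention).
            cx = gx * stride + stride / 2
            cy = gy * stride + stride / 2
            for x1, y1, x2, y2 in base:
                out.append((x1 + cx, y1 + cy, x2 + cx, y2 + cy))
    return out

anchors = grid_anchors(feat_h=2, feat_w=3)
print(len(anchors))  # 2 * 3 * 9 = 54
```

The total number of anchors is fixed by the feature-map size: 9 per grid point, regardless of image content.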
Region Proposal Layer
It can be divided into three parts:
- Proposal Layer
- Anchor Target Layer
- Proposal Target Layer
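Its main functions are:
- Anchor classification: classify a given set of anchors into those that contain an object (positive anchors) and those that do not (negative anchors).
- Refinement: use regression to refine the anchor boxes (adjusting size and position) to obtain more accurate boxes.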
Proposal Layer
The goal of the Proposal Layer is to select some of the refined anchor boxes from the Region Proposal Network as proposals.
Proposal Layer: Transform the anchors according to the bounding box regression coefficients to generate transformed anchors. Then prune the number of anchors by applying non-maximum suppression (see Appendix) using the probability of an anchor being a foreground region.
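The "transform the anchors according to the regression coefficients" step can be illustrated as follows. This sketch assumes the common centre-and-log-scale parametrization (shift the centre by a fraction of the anchor size, scale width and height by an exponential) and measures width as `x2 - x1`; the function name `apply_deltas` is hypothetical.

```python
import math

def apply_deltas(anchor, deltas):
    """Transform an anchor (x1, y1, x2, y2) by coefficients (tx, ty, tw, th).

    Assumed parametrization: tx, ty shift the centre in units of the anchor's
    width and height; tw, th are log scale factors for width and height.
    """
    x1, y1, x2, y2 = anchor
    tx, ty, tw, th = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    new_cx = cx + tx * w
    new_cy = cy + ty * h
    new_w = w * math.exp(tw)
    new_h = h * math.exp(th)
    return (new_cx - new_w / 2, new_cy - new_h / 2,
            new_cx + new_w / 2, new_cy + new_h / 2)

# Shift a 10 x 20 anchor right by 10% of its width, keeping its size.
box = apply_deltas((0.0, 0.0, 10.0, 20.0), (0.1, 0.0, 0.0, 0.0))
print(box)  # (1.0, 0.0, 11.0, 20.0)
```

With all coefficients equal to zero the anchor is returned unchanged, which is why a background anchor needs no regression target.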
Anchor Target Layer
Its role is to select the anchor boxes that can be used to train the RPN and to compute the RPN loss. The RPN loss has two parts: 1) Classification Loss; 2) Bounding Box Regression Loss. Note that not every anchor box is used for training: only a subset is selected, based on how well each anchor overlaps the ground truth boxes. Training on an invalid anchor box is pointless.
Anchor Target Layer: The goal of the anchor target layer is to produce a set of “good” anchors and the corresponding foreground/background labels and target regression coefficients to train the Region Proposal Network. The output of this layer is only used to train the RPN network and is not used by the classification layer. Given a set of anchors (produced by the anchor generation layer), the anchor target layer identifies promising foreground and background anchors. Promising foreground anchors are those whose overlap with some ground truth box is higher than a threshold. Background boxes are those whose overlap with any ground truth box is lower than a threshold. The anchor target layer also outputs a set of bounding box regressors, i.e., a measure of how far each anchor target is from the closest bounding box. These regressors only make sense for the foreground boxes as there is no notion of “closest bounding box” for a background box.
Input:
- RPN Network Outputs (predicted foreground/background class labels, regression coefficients)
- Anchor boxes (generated by the anchor generation layer)
- Ground truth boxes
Output:
- Good foreground/background boxes and associated class labels
- Target regression coefficients
Parameters:
- TRAIN.RPN_POSITIVE_OVERLAP: Threshold used to select if an anchor box is a good foreground box (Default: 0.7)
- TRAIN.RPN_NEGATIVE_OVERLAP: If the max overlap of an anchor with any ground truth box is lower than this threshold, it is marked as background. Boxes whose overlap is greater than RPN_NEGATIVE_OVERLAP but less than RPN_POSITIVE_OVERLAP are marked “don’t care”. (Default: 0.3)
- TRAIN.RPN_BATCHSIZE: Total number of background and foreground anchors (default: 256)
- TRAIN.RPN_FG_FRACTION: fraction of the batch size that is foreground anchors (default: 0.5). If the number of foreground anchors found is larger than TRAIN.RPN_BATCHSIZE*TRAIN.RPN_FG_FRACTION, the excess (indices are selected randomly) is marked “don’t care”.
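The labelling rule described by these parameters can be sketched as below. It is a simplified illustration under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, at least one ground truth box exists, and labels are encoded as 1 (foreground), 0 (background) and -1 (“don’t care”). The helper names `iou` and `label_anchors` are hypothetical, and refinements a real implementation may apply (for example, forcing the best anchor for each ground truth box to be foreground) are left out.

```python
import random

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_overlap=0.7, neg_overlap=0.3,
                  batch_size=256, fg_fraction=0.5, rng=None):
    """Return one label per anchor: 1 foreground, 0 background, -1 don't care."""
    rng = rng or random.Random(0)
    labels = []
    for anchor in anchors:
        max_overlap = max(iou(anchor, gt) for gt in gt_boxes)
        if max_overlap >= pos_overlap:
            labels.append(1)
        elif max_overlap < neg_overlap:
            labels.append(0)
        else:
            labels.append(-1)

    # Too many foreground anchors: mark a random excess as don't care.
    max_fg = int(batch_size * fg_fraction)
    fg = [i for i, lab in enumerate(labels) if lab == 1]
    if len(fg) > max_fg:
        for i in rng.sample(fg, len(fg) - max_fg):
            labels[i] = -1

    # Fill the rest of the batch with background anchors, dropping any excess.
    max_bg = batch_size - labels.count(1)
    bg = [i for i, lab in enumerate(labels) if lab == 0]
    if len(bg) > max_bg:
        for i in rng.sample(bg, len(bg) - max_bg):
            labels[i] = -1
    return labels

gt = [(0, 0, 10, 10)]
anchors = [(0, 0, 10, 10), (0, 0, 10, 14), (20, 20, 30, 30), (0, 0, 10, 20)]
print(label_anchors(anchors, gt))  # [1, 1, 0, -1]
```

The last anchor has an overlap of 0.5 with the ground truth box, which falls between the two thresholds, so it contributes nothing to the RPN loss.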
Proposal Target Layer
Selects the ROIs that meet the selection criteria from those output by the Proposal Layer. The selected ROIs are then used for crop pooling on the feature maps. As in the Anchor Target Layer, not all ROIs are used for training; only a qualifying subset is picked, because training on an ROI that cannot be a proposal is pointless.
Proposal Target Layer: The goal of the proposal target layer is to prune the list of anchors produced by the proposal layer and produce class specific bounding box regression targets that can be used to train the classification layer to produce good class labels and regression targets.
Input:
- ROIs produced by the proposal layer
- ground truth information
Output:
- Selected foreground and background ROIs that meet overlap criteria.
- Class specific target regression coefficients for the ROIs
Parameters:
- TRAIN.FG_THRESH: (default: 0.5) Used to select foreground ROIs. ROIs whose max overlap with a ground truth box exceeds FG_THRESH are marked foreground
- TRAIN.BG_THRESH_HI: (default 0.5)
- TRAIN.BG_THRESH_LO: (default 0.1) These two thresholds are used to select background ROIs. ROIs whose max overlap falls between BG_THRESH_HI and BG_THRESH_LO are marked background
- TRAIN.BATCH_SIZE: (default 128) Maximum number of foreground and background boxes selected.
- TRAIN.FG_FRACTION: (default 0.25). Number of foreground boxes can’t exceed BATCH_SIZE*FG_FRACTION.
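The way these thresholds pick foreground and background ROIs can be sketched as follows. This is an illustrative simplification: it takes the precomputed max overlap of each ROI as input, uses the hypothetical name `sample_rois`, and simply returns fewer ROIs when there are not enough candidates, which a real implementation may handle differently.

```python
import random

def sample_rois(max_overlaps, fg_thresh=0.5, bg_thresh_hi=0.5, bg_thresh_lo=0.1,
                batch_size=128, fg_fraction=0.25, rng=None):
    """Pick foreground and background ROI indices.

    max_overlaps[i] is the max IoU of ROI i with any ground truth box.
    """
    rng = rng or random.Random(0)
    fg = [i for i, o in enumerate(max_overlaps) if o >= fg_thresh]
    bg = [i for i, o in enumerate(max_overlaps) if bg_thresh_lo <= o < bg_thresh_hi]

    # Foreground count is capped at BATCH_SIZE * FG_FRACTION.
    n_fg = min(int(batch_size * fg_fraction), len(fg))
    # Background ROIs fill the remainder of the batch.
    n_bg = min(batch_size - n_fg, len(bg))
    return rng.sample(fg, n_fg), rng.sample(bg, n_bg)

overlaps = [0.9, 0.6, 0.3, 0.05, 0.2, 0.55]
fg_idx, bg_idx = sample_rois(overlaps, batch_size=4)
print(len(fg_idx), sorted(bg_idx))  # 1 [2, 4]
```

ROI 3, with an overlap of 0.05, is below BG_THRESH_LO and is therefore neither foreground nor background.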
RoI Pooling
Works with images of different sizes, since a region of any size is pooled to a fixed-size output.
ROI Pooling Layer: Implements a spatial transformation network that samples the input feature map given the bounding box coordinates of the region proposals produced by the proposal target layer. These coordinates will generally not lie on integer boundaries, thus interpolation based sampling is required.
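Interpolation-based sampling at non-integer coordinates can be illustrated with bilinear interpolation on a single-channel feature map. This is a sketch under assumptions: the feature map is a list of rows, the ROI is given in feature-map coordinates, one sample is taken at the centre of each output bin, and the names `bilinear_sample` and `crop_pool` are hypothetical.

```python
import math

def bilinear_sample(fmap, x, y):
    """Sample a 2D feature map (list of rows) at real-valued (x, y)."""
    h, w = len(fmap), len(fmap[0])
    # Clamp to the valid range so border samples stay inside the map.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def crop_pool(fmap, roi, out_size=2):
    """Resample roi = (x1, y1, x2, y2) to a fixed out_size x out_size grid."""
    x1, y1, x2, y2 = roi
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # One sample at the centre of each output bin (assumed scheme).
            sx = x1 + (j + 0.5) * (x2 - x1) / out_size
            sy = y1 + (i + 0.5) * (y2 - y1) / out_size
            row.append(bilinear_sample(fmap, sx, sy))
        pooled.append(row)
    return pooled

fmap = [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]
print(crop_pool(fmap, (0, 0, 2, 2)))  # [[2.0, 3.0], [5.0, 6.0]]
```

Whatever the size of the ROI, the output grid has the same shape, which is what lets the layers that follow use fixed-size inputs.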
Classification Layer
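This differs from the earlier anchor classification. Anchor classification is binary, whereas this layer outputs the probability of each object class, and there can be many classes.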
Classification Layer: The classification layer takes the output feature maps produced by the ROI Pooling Layer and passes them through a series of convolutional layers. The output is fed through two fully connected layers. The first layer produces the class probability distribution for each region proposal and the second layer produces a set of class specific bounding box regressors.
Appendix
Bounding Box Regression Coefficients
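$T_x, T_y, T_w, T_h$ and $O_x, O_y, O_w, O_h$ denote the “target” and “original” parameters respectively, where the subscripts $x$, $y$, $w$, $h$ stand for the x and y coordinates, the width and the height. The coefficients are invariant under affine transformation. This matters when computing the classification loss, because the target coefficients are computed from the original parameters, while the coefficients the network is expected to output are computed after ROI pooling.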
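The equations themselves are not reproduced in these notes. A commonly used Faster R-CNN parametrization, assumed here, writes the coefficients $t_x, t_y, t_w, t_h$ in terms of a target box $T$ and an original box $O$. Under that assumption the invariance can be checked directly for a scaling by $a > 0$ and a translation by $b$ along one axis:

```latex
\begin{aligned}
t_x &= \frac{T_x - O_x}{O_w}, & t_y &= \frac{T_y - O_y}{O_h}, \\
t_w &= \log\frac{T_w}{O_w},   & t_h &= \log\frac{T_h}{O_h}, \\[6pt]
t_x' &= \frac{(a T_x + b) - (a O_x + b)}{a\,O_w} = \frac{T_x - O_x}{O_w} = t_x, \\
t_w' &= \log\frac{a\,T_w}{a\,O_w} = \log\frac{T_w}{O_w} = t_w .
\end{aligned}
```

The same argument applies to $t_y$ and $t_h$ with an independent scale factor for the vertical axis, so resizing a region to a square grid leaves the coefficients unchanged.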
Intersection over Union (IoU) Overlap
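IoU is the area of the intersection of two boxes divided by the area of their union. A minimal sketch, assuming boxes are `(x1, y1, x2, y2)` tuples with width measured as `x2 - x1`:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Two 10 x 10 boxes sharing half their area: 50 / 150.
print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 4))  # 0.3333
```

The value ranges from 0 for disjoint boxes to 1 for identical boxes.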
Non-Maximum Suppression
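When there are multiple candidate boxes, non-maximum suppression (NMS) is used to suppress the redundant ones. Suppression is an iterate, scan and eliminate process:
- Sort all boxes by score and select the highest-scoring box;
- Scan the remaining boxes and delete any whose overlap (IoU) with the current highest-scoring box exceeds a given threshold;
- Pick the highest-scoring box among the unprocessed ones and repeat the process until the list of bounding boxes is empty.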
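The three steps above can be sketched in plain Python. This is a minimal greedy version for illustration, not an optimized implementation; the 0.5 threshold, the `(x1, y1, x2, y2)` box format and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Return the indices of the boxes kept, highest score first."""
    # Step 1: sort box indices by descending score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Take the highest-scoring box that has not been processed yet.
        best = order.pop(0)
        keep.append(best)
        # Step 2: drop every remaining box that overlaps it too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
        # Step 3: the loop repeats until no boxes are left.
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed; the third box does not overlap either and is kept.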