Original paper: Feature Pyramid Networks for Object Detection
1、The FPN structure (using ResNet-50 as an example)
Fig. 3 shows the building block that constructs our top-down feature maps. With a coarser-resolution feature map, we upsample the spatial resolution by a factor of 2 (using nearest neighbor upsampling for simplicity). The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition. This process is iterated until the finest resolution map is generated. To start the iteration, we simply attach a 1×1 convolutional layer on C5 to produce the coarsest resolution map. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} that are respectively of the same spatial sizes.
For each lateral connection, the bottom-up feature map is first compressed to 256 channels by a 1×1 convolution, then added element-wise to the upsampled top-down map, and finally passed through a 3×3 convolution (not drawn in the figure, but present in both the paper and the code) to produce the merged pyramid feature map.
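The merge step above can be sketched in plain NumPy. This is a minimal illustration of the top-down pathway only: the 1×1 channel-reduction conv and the trailing 3×3 conv are assumed to have already been applied (here the inputs are simply taken to be 256-channel maps), and `merge_topdown` is a name chosen for this sketch, not from the paper's code.

```python
import numpy as np

def upsample2x_nearest(x):
    # x: (C, H, W) feature map; nearest-neighbor upsampling by a
    # factor of 2 just repeats each spatial cell into a 2x2 block.
    return np.repeat(np.repeat(x, 2, axis=1), 2, axis=2)

def merge_topdown(coarse, lateral):
    # coarse:  (256, H, W)   map from the level above
    # lateral: (256, 2H, 2W) bottom-up map, assumed already reduced
    #          to 256 channels by the (elided) 1x1 convolution
    return upsample2x_nearest(coarse) + lateral
```

In the real network this sum would then go through the 3×3 convolution to reduce upsampling aliasing before becoming Pk.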
2、Combining RPN with FPN
Original text:
We adapt RPN by replacing the single-scale feature map with our FPN. We attach a head of the same design (3×3 conv and two sibling 1×1 convs) to each level on our feature pyramid. Because the head slides densely over all locations in all pyramid levels, it is not necessary to have multi-scale anchors on a specific level. Instead, we assign anchors of a single scale to each level. Formally, we define the anchors to have areas of {32², 64², 128², 256², 512²} pixels on {P2, P3, P4, P5, P6} respectively. As in [29] we also use anchors of multiple aspect ratios {1:2, 1:1, 2:1} at each level. So in total there are 15 anchors over the pyramid.
Each of the five pyramid feature maps {P2, P3, P4, P5, P6} is assigned anchors of a single scale (areas 32², 64², 128², 256², 512² respectively); with the three aspect ratios this gives 15 anchor types over the whole pyramid.
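A small sketch of how a single-scale anchor set per level works out: given one anchor area per level, the three aspect ratios determine the width and height by solving w·h = area with h/w equal to the ratio. The function name `anchor_sizes` is chosen for this illustration.

```python
import numpy as np

def anchor_sizes(area, aspect_ratios=(0.5, 1.0, 2.0)):
    # ratio r = h / w; solve w * h = area with h = r * w,
    # so w = sqrt(area / r) and h = r * w.
    sizes = []
    for r in aspect_ratios:
        w = np.sqrt(area / r)
        h = r * w
        sizes.append((w, h))
    return sizes

# e.g. the P2 anchors: one scale (area 32^2), three aspect ratios
p2_anchors = anchor_sizes(32 * 32)
```

Note that every (w, h) pair at a level has the same area; only the shape varies, since the scale variation is handled by the pyramid levels themselves.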
Original text:
We note that the parameters of the heads are shared across all feature pyramid levels; we have also evaluated the alternative without sharing parameters and observed similar accuracy. The good performance of sharing parameters indicates that all levels of our pyramid share similar semantic levels. This advantage is analogous to that of using a featurized image pyramid, where a common head classifier can be applied to features computed at any image scale.
The RPN heads on the five pyramid levels share parameters; the authors also evaluated the non-shared alternative and observed similar accuracy. This indirectly shows that the five pyramid feature maps carry similar semantic levels, i.e. the top-down pathway is doing its job.
3、Combining FPN with Faster R-CNN
We view our feature pyramid as if it were produced from an image pyramid. Thus we can adapt the assignment strategy of region-based detectors [15, 11] in the case when they are run on image pyramids. Formally, we assign an RoI of width w and height h (on the input image to the network) to the level Pk of our feature pyramid by:

k = ⌊k₀ + log₂(√(wh) / 224)⌋

Here 224 is the canonical ImageNet pre-training size, and k₀ is the target level onto which an RoI with w×h = 224² should be mapped; the paper sets k₀ = 4.
This formula assigns proposal regions of different sizes to different pyramid levels, and RoI pooling is then performed on the assigned feature map.
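The assignment rule k = ⌊k₀ + log₂(√(wh)/224)⌋ with k₀ = 4 can be written directly, with the result clamped to the available levels P2–P5 (the clamping bounds here follow the pyramid used for RoI pooling; the exact clamp is an implementation detail not spelled out in the quoted passage):

```python
import math

def assign_pyramid_level(w, h, k0=4, k_min=2, k_max=5):
    # k = floor(k0 + log2(sqrt(w*h) / 224)); 224 is the canonical
    # ImageNet pre-training size, and k0 = 4 maps a 224x224 RoI to P4.
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    # Clamp to the levels actually used for RoI pooling (assumed P2..P5).
    return max(k_min, min(k_max, k))
```

For example, a 224×224 RoI goes to P4, a 112×112 RoI to P3, and very small or very large RoIs are clamped to P2 and P5 respectively.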
We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels. Again, the heads all share parameters, regardless of their levels. In [16], a ResNet’s conv5 layers (a 9-layer deep subnetwork) are adopted as the head on top of the conv4 features, but our method has already harnessed conv5 to construct the feature pyramid. So unlike [16], we simply adopt RoI pooling to extract 7×7 features, and attach two hidden 1,024-d fully-connected (fc) layers (each followed by ReLU) before the final classification and bounding box regression layers. These layers are randomly initialized, as there are no pre-trained fc layers available in ResNets. Note that compared to the standard conv5 head, our 2-fc MLP head is lighter weight and faster.
All RoI-pooled features (7×7) from all levels share the same predictor head: two 1024-d fully-connected layers (each followed by ReLU), then the final classification and bounding box regression layers.
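The shared 2-fc MLP head can be sketched as follows. This is only a shape-level illustration: the weights are random (as the paper notes, these fc layers have no pre-trained counterpart in ResNets), and the class count of 81 (80 COCO classes + background) is an assumption for the example, not stated in the quoted passage.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 81  # assumed: 80 COCO classes + background
W1 = rng.standard_normal((7 * 7 * 256, 1024)) * 0.01
W2 = rng.standard_normal((1024, 1024)) * 0.01
W_cls = rng.standard_normal((1024, NUM_CLASSES)) * 0.01
W_box = rng.standard_normal((1024, NUM_CLASSES * 4)) * 0.01

def mlp_head(roi_feat):
    # roi_feat: one (7, 7, 256) RoI-pooled feature map.
    # Flatten, then two 1024-d fc layers with ReLU, then the two
    # sibling output layers (classification scores, box deltas).
    x = roi_feat.reshape(-1)
    x = np.maximum(x @ W1, 0.0)
    x = np.maximum(x @ W2, 0.0)
    return x @ W_cls, x @ W_box
```

Because the same weights are applied to RoIs from every pyramid level, the head is shared exactly as described above, and this 2-fc MLP is lighter than the conv5-based head it replaces.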