Paper title:
RepPoints: Point Set Representation for Object Detection
Summary of the main ideas:
- Instead of representing an object with the rectangular bounding box standard in object detection, RepPoints represents an object's outline with a set of points.
- After feature extraction, deformable convolution learns offsets from the object's center point, yielding the positions of the point set.
- Proposes RPDet, an anchor-free detector built on the RepPoints representation and deformable convolution.
- Proposes three transform functions that convert the point set into a rectangular box, so the detector can be evaluated with standard bounding-box metrics.
RepPoints representation
Traditional object detection represents an object with a 4-d vector $B = (x, y, w, h)$: the coordinates of the object's center plus the width and height of its box.
RepPoints instead represents an object with a point set $R = \{(x_k, y_k)\}_{k=1}^{n}$, where $n$ is the number of sample points (set to 9 in the paper). $n$ should be a square number, since the points map onto a deformable-convolution kernel.
As the figure shows, after the backbone extracts features, the RepPointsHead turns them into 9 contour points for each object; these 9 points form the object's pseudo box, which is then converted into a conventional bbox.
A review of traditional multi-stage object detection
The traditional two-stage detection pipeline:
- Preset anchors cover a range of bounding-box scales and aspect ratios.
- For each anchor, the image feature at its center point serves as the object feature; from it the network produces a confidence score for whether the anchor covers an object, and bounding-box regression produces refined bbox proposals.
- In the second stage, RoI-pooling or RoI-Align extracts object features from the bbox proposals obtained in the previous step (a tiny runnable example of this step follows the list).
- The refined features then go through bounding-box regression again to produce the final bounding-box targets.
- Multi-stage methods additionally use the refined features and bounding-box regression to generate intermediate refined bbox proposals. This step can be repeated several times before producing the final targets, progressively correcting the box boundaries.
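As a tiny concrete example of the RoI feature-extraction step, here is a minimal runnable sketch using torchvision's roi_align; the toy tensors and the one-conv "backbone" are made up purely for illustration:

```python
import torch
from torchvision.ops import roi_align

# toy stand-ins for the pipeline stages (illustrative, not a real detector)
image = torch.randn(1, 3, 64, 64)
feats = torch.nn.Conv2d(3, 8, 3, padding=1)(image)      # stand-in "backbone"
proposals = torch.tensor([[0, 4.0, 4.0, 32.0, 32.0]])   # (batch_idx, x1, y1, x2, y2)
roi_feats = roi_align(feats, proposals, output_size=(7, 7), spatial_scale=1.0)
print(roi_feats.shape)  # torch.Size([1, 8, 7, 7]) -> fed to the second-stage head
```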
Bounding-box regression vs. point-set regression
Progressive refinement of bounding-box localization and feature extraction is essential to the success of multi-stage detection methods.
For the bbox representation:
A 4-d regression vector $(\Delta x_p, \Delta y_p, \Delta w_p, \Delta h_p)$ maps the original bounding box proposal $B_p = (x_p, y_p, w_p, h_p)$ to a refined box

$$B_r = (x_p + w_p \Delta x_p,\; y_p + h_p \Delta y_p,\; w_p e^{\Delta w_p},\; h_p e^{\Delta h_p}).$$

For a ground-truth bounding box $B_t = (x_t, y_t, w_t, h_t)$, the loss should pull $B_r$ towards $B_t$, so the 4-d regression target is

$$\left(\frac{x_t - x_p}{w_p},\; \frac{y_t - y_p}{h_p},\; \log\frac{w_t}{w_p},\; \log\frac{h_t}{h_p}\right),$$

and the localization loss is computed between the predicted 4-d vector and this target.
For the RepPoints representation:
$\{(\Delta x_k, \Delta y_k)\}_{k=1}^{n}$ are the predicted offsets of the points, so refinement is simply

$$R_r = \{(x_k + \Delta x_k,\; y_k + \Delta y_k)\}_{k=1}^{n}:$$

we only need to learn the offsets and add them to the original point coordinates (see the sketch below).
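The contrast is easy to see in code. Below is a minimal sketch (not the paper's implementation) of both refinement rules; `decode_bbox` follows the standard (dx, dy, dw, dh) parameterization, while points need only an element-wise addition:

```python
import torch

def decode_bbox(proposal: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    # proposal: (..., 4) as (x, y, w, h); delta: (..., 4) as (dx, dy, dw, dh)
    x, y, w, h = proposal.unbind(-1)
    dx, dy, dw, dh = delta.unbind(-1)
    return torch.stack([x + w * dx, y + h * dy, w * dw.exp(), h * dh.exp()], -1)

def refine_points(points: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    # points, offsets: (..., n, 2); refinement is just adding the offsets
    return points + offsets
```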
RPDet: an anchor-free RepPoints detector
Its pipeline is shown in the figure below:
- The center point is used as the initial representation of an object.
- Starting from the center point, deformable convolution learns an offset for each point; to represent an object with 9 offset points, a 3 × 3 deformable convolution is used. The offsets then regress the object's location.
- After two rounds of deformable-convolution offset regression and correction, the RepPoints representation of the object is formed.
The main algorithmic structure of the RPDet head is shown in the figure:
The localization subnet and the classification subnet both take as input the same image features extracted by the backbone and FPN.
The trick that turns a center point into RepPoints lies in the 3 × 3 deformable convolution in the localization subnet: the receptive-field positions it learns automatically land on the object.
Three methods for generating a bbox from RepPoints:
- Min-max function: a min-max operation over both axes of all the RepPoints determines $B_p$, i.e. the bounding values over all sample points.
- Partial min-max function: a min-max operation over a subset of the sample points on each axis yields the rectangular box.
- Moment-based function: the mean and standard deviation of the RepPoints are used to compute the center and scale of the rectangular box $B_p$, with the scale multiplied by globally shared learnable multipliers $\lambda_x$ and $\lambda_y$. (This is the default in the code; a sketch follows the list.)
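A minimal sketch of the moment-based transform, loosely following the mmdetection implementation; the learnable multipliers are modeled here as a plain `moment_transfer` tensor with $\lambda = \exp(\text{moment\_transfer})$, and names and shapes are simplified assumptions:

```python
import torch

def moment_points2bbox(pts: torch.Tensor,
                       moment_transfer: torch.Tensor) -> torch.Tensor:
    # pts: (N, num_points, 2) as (x, y); moment_transfer: (2,) learnable
    mean = pts.mean(dim=1)                 # (N, 2): box center
    std = pts.std(dim=1)                   # (N, 2): point spread per axis
    half_wh = std * moment_transfer.exp()  # scale by the learnable multipliers
    return torch.cat([mean - half_wh, mean + half_wh], dim=1)  # (x1, y1, x2, y2)

pts = torch.randn(8, 9, 2)
bbox = moment_points2bbox(pts, torch.zeros(2, requires_grad=True))
print(bbox.shape)  # torch.Size([8, 4])
```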
Loss computation:
- Location loss: the RepPoints are first converted into a pseudo box, and the loss is computed between the pseudo box and the ground-truth bounding box (the paper uses the smooth L1 distance between the top-left and bottom-right corners); a minimal sketch follows this list.
- Classification loss: focal loss is used to handle class imbalance.
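A minimal sketch of the location loss under the min-max transform, assuming single-object tensors and PyTorch ≥ 1.7 for the `beta` argument (the real head operates on batched multi-level feature maps):

```python
import torch
import torch.nn.functional as F

def minmax_pseudo_box(pts: torch.Tensor) -> torch.Tensor:
    # pts: (num_points, 2) as (x, y) -> pseudo box (x1, y1, x2, y2)
    return torch.cat([pts.min(dim=0).values, pts.max(dim=0).values])

def location_loss(pts: torch.Tensor, gt_box: torch.Tensor,
                  beta: float = 0.11) -> torch.Tensor:
    # smooth-l1 between the pseudo-box corners and the ground-truth corners
    return F.smooth_l1_loss(minmax_pseudo_box(pts), gt_box, beta=beta)

loss = location_loss(torch.randn(9, 2), torch.tensor([0., 0., 1., 1.]))
```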
Code analysis
The RPDet code lives at https://github.com/microsoft/RepPoints and has been merged into the mmdetection framework, so let's read the mmdetection code.
The config file:
config/reppoints/reppoints_moment_r50_fpn_1x.py
# model definition
# norm_cfg is defined earlier in the config file (GN, as in the original RepPoints configs)
norm_cfg = dict(type='GN', num_groups=32, requires_grad=True)
model = dict(
    type='RepPointsDetector',
    pretrained='torchvision://resnet50',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        style='pytorch'),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs=True,
        num_outs=5,
        norm_cfg=norm_cfg),
    bbox_head=dict(
        type='RepPointsHead',
        num_classes=81,
        in_channels=256,
        feat_channels=256,
        point_feat_channels=256,
        stacked_convs=3,
        num_points=9,
        gradient_mul=0.1,
        point_strides=[8, 16, 32, 64, 128],
        point_base_scale=4,
        norm_cfg=norm_cfg,
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox_init=dict(type='SmoothL1Loss', beta=0.11, loss_weight=0.5),
        loss_bbox_refine=dict(type='SmoothL1Loss', beta=0.11, loss_weight=1.0),
        transform_method='moment'))
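If you want to poke at the head interactively, the model can be built from this config with mmdetection's own helpers (this is the v1.x-era API matching this config layout; adjust for your version):

```python
from mmcv import Config
from mmdet.models import build_detector

cfg = Config.fromfile('config/reppoints/reppoints_moment_r50_fpn_1x.py')
model = build_detector(cfg.model, train_cfg=cfg.train_cfg, test_cfg=cfg.test_cfg)
print(model.bbox_head)  # inspect the RepPointsHead analyzed below
```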
The backbone is ResNet + FPN, the usual multi-scale feature extraction.
Next, with the RepPoints head structure diagram in mind, let's look at how the two subnets do their work:
mmdet/models/anchor_heads/reppoints_head.py
@HEADS.register_module
class RepPointsHead(nn.Module):

    def __init__(self, ...):  # most of the __init__ arguments and assignments are omitted
        # we use deformable conv to extract points features
        # the dcn kernel size is sqrt(num_points), i.e. the receptive field
        # of one deformable conv represents the object outline
        self.dcn_kernel = int(np.sqrt(num_points))
        self.dcn_pad = int((self.dcn_kernel - 1) / 2)
        assert self.dcn_kernel * self.dcn_kernel == num_points, \
            'The points number should be a square number.'
        assert self.dcn_kernel % 2 == 1, \
            'The points number should be an odd square number.'
        # initial x, y offsets of the deformable convolution (a regular grid)
        dcn_base = np.arange(-self.dcn_pad,
                             self.dcn_pad + 1).astype(np.float64)
        dcn_base_y = np.repeat(dcn_base, self.dcn_kernel)
        dcn_base_x = np.tile(dcn_base, self.dcn_kernel)
        dcn_base_offset = np.stack([dcn_base_y, dcn_base_x], axis=1).reshape(
            (-1))
        self.dcn_base_offset = torch.tensor(dcn_base_offset).view(1, -1, 1, 1)
        self._init_layers()
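For the default num_points = 9, the buffer computed above is just the regular 3 × 3 grid; a quick standalone check (NumPy only):

```python
import numpy as np

dcn_kernel, dcn_pad = 3, 1                      # num_points = 9
dcn_base = np.arange(-dcn_pad, dcn_pad + 1).astype(np.float64)  # [-1. 0. 1.]
dcn_base_y = np.repeat(dcn_base, dcn_kernel)    # [-1 -1 -1  0  0  0  1  1  1]
dcn_base_x = np.tile(dcn_base, dcn_kernel)      # [-1  0  1 -1  0  1 -1  0  1]
print(np.stack([dcn_base_y, dcn_base_x], axis=1))
# nine (y, x) pairs of the regular 3x3 grid; the predicted offsets are taken
# relative to this base grid before being fed to the deformable convolution
```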
    def _init_layers(self):
        self.relu = nn.ReLU(inplace=True)
        self.cls_convs = nn.ModuleList()
        self.reg_convs = nn.ModuleList()
        # each of the two subnets stacks three 3x3 convs for feature extraction
        for i in range(self.stacked_convs):
            chn = self.in_channels if i == 0 else self.feat_channels
            self.cls_convs.append(
                ConvModule(
                    chn,
                    self.feat_channels,
                    3,
                    stride=1,
                    padding=1,
                    conv_cfg=self.conv_cfg,
                    norm_cfg=self.norm_cfg))
            self.reg_convs.append(
                ConvModule(
                    chn,
                    self.feat_channels,
                    3,
                    stride=1,
                    padding=1,
                    conv_cfg=self.conv_cfg,
                    norm_cfg=self.norm_cfg))
        # layers where RepPoints learns the offsets with deformable convolution
        pts_out_dim = 4 if self.use_grid_points else 2 * self.num_points
        self.reppoints_cls_conv = DeformConv(self.feat_channels,
                                             self.point_feat_channels,
                                             self.dcn_kernel, 1, self.dcn_pad)
        self.reppoints_cls_out = nn.Conv2d(self.point_feat_channels,
                                           self.cls_out_channels, 1, 1, 0)
        self.reppoints_pts_init_conv = nn.Conv2d(self.feat_channels,
                                                 self.point_feat_channels, 3,
                                                 1, 1)
        self.reppoints_pts_init_out = nn.Conv2d(self.point_feat_channels,
                                                pts_out_dim, 1, 1, 0)
        self.reppoints_pts_refine_conv = DeformConv(self.feat_channels,
                                                    self.point_feat_channels,
                                                    self.dcn_kernel, 1,
                                                    self.dcn_pad)
        self.reppoints_pts_refine_out = nn.Conv2d(self.point_feat_channels,
                                                  pts_out_dim, 1, 1, 0)
    # forward computation of the head for a single feature level
    def forward_single(self, x):
        dcn_base_offset = self.dcn_base_offset.type_as(x)
        # If we use center_init, the initial reppoints is from center points.
        # If we use bounding bbox representation, the initial reppoints is
        # from regular grid placed on a pre-defined bbox.
        if self.use_grid_points or not self.center_init:
            scale = self.point_base_scale / 2
            points_init = dcn_base_offset / dcn_base_offset.max() * scale
            bbox_init = x.new_tensor([-scale, -scale, scale,
                                      scale]).view(1, 4, 1, 1)
        else:
            points_init = 0
        cls_feat = x
        pts_feat = x
        for cls_conv in self.cls_convs:
            cls_feat = cls_conv(cls_feat)
        for reg_conv in self.reg_convs:
            pts_feat = reg_conv(pts_feat)
        # initialize reppoints
        pts_out_init = self.reppoints_pts_init_out(
            self.relu(self.reppoints_pts_init_conv(pts_feat)))
        if self.use_grid_points:
            pts_out_init, bbox_out_init = self.gen_grid_from_reg(
                pts_out_init, bbox_init.detach())
        else:
            pts_out_init = pts_out_init + points_init
        # refine and classify reppoints
        pts_out_init_grad_mul = (1 - self.gradient_mul) * pts_out_init.detach(
        ) + self.gradient_mul * pts_out_init
        dcn_offset = pts_out_init_grad_mul - dcn_base_offset
        cls_out = self.reppoints_cls_out(
            self.relu(self.reppoints_cls_conv(cls_feat, dcn_offset)))
        pts_out_refine = self.reppoints_pts_refine_out(
            self.relu(self.reppoints_pts_refine_conv(pts_feat, dcn_offset)))
        if self.use_grid_points:
            pts_out_refine, bbox_out_refine = self.gen_grid_from_reg(
                pts_out_refine, bbox_out_init.detach())
        else:
            pts_out_refine = pts_out_refine + pts_out_init.detach()
        return cls_out, pts_out_init, pts_out_refine
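One subtle detail in forward_single is the gradient_mul trick: the forward value of pts_out_init is unchanged, but the gradient flowing back into the init branch from the refine/classify stage is scaled down (0.1 by default). The pattern in isolation:

```python
import torch

mul = 0.1
x = torch.tensor(2.0, requires_grad=True)
y = (1 - mul) * x.detach() + mul * x  # forward: y == x
y.backward()
print(x.grad)  # tensor(0.1000): only `mul` of the gradient flows back into x
```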
Summary and tips
To my understanding, this paper is best read as an application of deformable convolution to object detection: the localization and classification losses supervise how the deformable convolution learns object offsets, which makes what the convolution learns interpretable. It suggests that deformable convolution can be steered with other kinds of supervision as well.
How RepPoints handles multiple overlapping objects at the same location:
In RPDet, we show that this issue can be greatly alleviated by using the FPN structure [24] for the following reasons: first, objects of different scales will be assigned to different image feature levels, which addresses objects of different scales and the same center points locations; second, FPN has a high-resolution feature map for small objects, which also reduces the chance of two objects having centers located at the same feature position.
The authors' answer is that the FPN structure assigns objects of different scales to different feature levels, so objects with the same center rarely collide;
whether this is enough for heavy-occlusion scenarios such as crowded pedestrian detection, however, remains debatable.