本系列的代码来自:https://github.com/jwyang/faster-rcnn.pytorch
大家可以去star一下,目前支持pytorch1.0系列
参考:原理部分摘自两篇文章,个人记录
详解 ROI Align 的基本原理和实现细节
RoIPooling、RoIAlign笔记
Why roi-align
ROI Align 是在Mask-RCNN这篇论文里提出的一种区域特征聚集方式, 很好地解决了ROI Pooling操作中两次量化造成的区域不匹配(mis-alignment)的问题。实验显示,在检测测任务中将 ROI Pooling 替换为 ROI Align 可以提升检测模型的准确性。
1. roi pooling的局限性(造成mis-alignment问题)
在常见的两级检测框架(比如Fast-RCNN,Faster-RCNN,RFCN)中,ROI Pooling 的作用是根据预选框的位置坐标在特征图中将相应区域池化为固定尺寸的特征图,以便进行后续的分类和包围框回归操作。由于预选框的位置通常是由模型回归得到的,一般来讲是浮点数,而池化后的特征图要求尺寸固定。故ROI Pooling这一操作存在两次量化的过程。
- 将候选框边界量化为整数点坐标值。
- 将量化后的边界区域平均分割成 k x k 个单元(bin),对每一个单元的边界进行量化。
事实上,经过上述两次量化,此时的候选框已经和最开始回归出来的位置有一定的偏差,这个偏差会影响检测或者分割的准确度。在论文里,作者把它总结为“不匹配问题(misalignment)。
针对上图
Conv layers使用的是VGG16,feat_stride=32(即表示,经过网络层后图片缩小为原图的1/32),原图800*800,最后一层特征图feature map大小:25*25
假定原图中有一region proposal,大小为665*665,这样,映射到特征图中的大小:665/32=20.78,即20.78*20.78,如果你看过Caffe的Roi Pooling的C++源码,在计算的时候会进行取整操作,于是,进行所谓的第一次量化,即映射的特征图大小为20*20
假定pooled_w=7,pooled_h=7,即pooling后固定成7*7大小的特征图,所以,将上面在 feature map上映射的20*20的 region proposal划分成49个同等大小的小区域,每个小区域的大小20/7=2.86,即2.86*2.86,此时,进行第二次量化,故小区域大小变成2*2
每个2*2的小区域里,取出其中最大的像素值,作为这一个区域的‘代表’,这样,49个小区域就输出49个像素值,组成7*7大小的feature map
总结,所以,通过上面可以看出,经过两次量化,即将浮点数取整,原本在特征图上映射的20*20大小的region proposal,偏差成大小为7*7的,这样的像素偏差势必会对后层的回归定位产生影响。
所以,产生了替代方案,RoiAlign
- roi align的原理
同样,针对上图,有着类似的映射
- Conv layers使用的是VGG16,feat_stride=32(即表示,经过网络层后图片缩小为原图的1/32),原图800*800,最后一层特征图feature map大小:25*25
- 假定原图中有一region proposal,大小为665*665,这样,映射到特征图中的大小:665/32=20.78,即20.78*20.78,此时,没有像RoiPooling那样就行取整操作,保留浮点数
- 假定pooled_w=7,pooled_h=7,即pooling后固定成7*7大小的特征图,所以,将在 feature map上映射的20.78*20.78的region proposal 划分成49个同等大小的小区域,每个小区域的大小20.78/7=2.97,即2.97*2.97
- 假定采样点数为4,即表示,对于每个2.97*2.97的小区域,平分四份,每一份取其中心点位置,而中心点位置的像素,采用双线性插值法进行计算,这样,就会得到四个点的像素值,如下图
上图中,四个红色叉叉‘×’的像素值是通过双线性插值算法计算得到的。最后,取四个像素值中最大值作为这个小区域(即:2.97*2.97大小的区域)的像素值,如此类推,同样是49个小区域得到49个像素值,组成7*7大小的feature map
roi-align代码解读
该部分目录结构与roi_pooling一致,参见:Faster RCNN源码解读(2)-roi_pooling
- 这里还是先看cpu版本的c语言roi_align.c
#include <TH/TH.h> // pytorch的 c拓展
#include <math.h>
#include <omp.h> // 多线程openMP
// 定义实现forward和backword的两个函数,C语言先定义
void ROIAlignForwardCpu(const float* bottom_data, const float spatial_scale, const int num_rois,
const int height, const int width, const int channels,
const int aligned_height, const int aligned_width, const float * bottom_rois,
float* top_data);
void ROIAlignBackwardCpu(const float* top_diff, const float spatial_scale, const int num_rois,
const int height, const int width, const int channels,
const int aligned_height, const int aligned_width, const float * bottom_rois,
float* top_data);
int roi_align_forward(int aligned_height, int aligned_width, float spatial_scale,
THFloatTensor * features, THFloatTensor * rois, THFloatTensor * output)
{
//Grab the input tensor
// 推测出features数据的格式,实际为一维数组(里面的[]是为了区分):
// [...,[c1,c2,c3,...,c_num_channels],[c1,c2,c3,...,c_num_channels],...]
// 一共data_height*data_width个[c1,c2,c3,...,c_num_channels]
float * data_flat = THFloatTensor_data(features);
// rois_flat = [...,[batch_index x1 y1 x2 y2],[batch_index x1 y1 x2 y2],...]
float * rois_flat = THFloatTensor_data(rois);
float * output_flat = THFloatTensor_data(output);
// Number of ROIs
int num_rois = THFloatTensor_size(rois, 0);
int size_rois = THFloatTensor_size(rois, 1);
// ROI = [batch_index x1 y1 x2 y2]
if (size_rois != 5)
{
return 0;
}
// data height
int data_height = THFloatTensor_size(features, 2);
// data width
int data_width = THFloatTensor_size(features, 3);
// Number of channels
int num_channels = THFloatTensor_size(features, 1);
// do ROIAlignForward,调用单独的forward函数
ROIAlignForwardCpu(data_flat, spatial_scale, num_rois, data_height, data_width, num_channels,
aligned_height, aligned_width, rois_flat, output_flat);
return 1;
}
int roi_align_backward(int aligned_height, int aligned_width, float spatial_scale,
THFloatTensor * top_grad, THFloatTensor * rois, THFloatTensor * bottom_grad)
{
//Grab the input tensor
float * top_grad_flat = THFloatTensor_data(top_grad);
float * rois_flat = THFloatTensor_data(rois);
float * bottom_grad_flat = THFloatTensor_data(bottom_grad);
// Number of ROIs
int num_rois = THFloatTensor_size(rois, 0);
int size_rois = THFloatTensor_size(rois, 1);
if (size_rois != 5)
{
return 0;
}
// batch size
// int batch_size = THFloatTensor_size(bottom_grad, 0);
// data height
int data_height = THFloatTensor_size(bottom_grad, 2);
// data width
int data_width = THFloatTensor_size(bottom_grad, 3);
// Number of channels
int num_channels = THFloatTensor_size(bottom_grad, 1);
// do ROIAlignBackward,调用单独的backward函数
ROIAlignBackwardCpu(top_grad_flat, spatial_scale, num_rois, data_height,
data_width, num_channels, aligned_height, aligned_width, rois_flat, bottom_grad_flat);
return 1;
}
void ROIAlignForwardCpu(const float* bottom_data, const float spatial_scale, const int num_rois,
const int height, const int width, const int channels,
const int aligned_height, const int aligned_width, const float * bottom_rois,
float* top_data)
{
// 输出数据大小
const int output_size = num_rois * aligned_height * aligned_width * channels;
int idx = 0;
for (idx = 0; idx < output_size; ++idx)
{
// (n, c, ph, pw) is an element in the aligned output
int pw = idx % aligned_width; // 水平第几个
int ph = (idx / aligned_width) % aligned_height; // 垂直第几个
int c = (idx / aligned_width / aligned_height) % channels; // 第几个通道
int n = idx / aligned_width / aligned_height / channels; // 第几个roi
// bottom_rois:rois_flat
// 分别对应ROI = [batch_index x1 y1 x2 y2]五个值
float roi_batch_ind = bottom_rois[n * 5 + 0];
float roi_start_w = bottom_rois[n * 5 + 1] * spatial_scale;
float roi_start_h = bottom_rois[n * 5 + 2] * spatial_scale;
float roi_end_w = bottom_rois[n * 5 + 3] * spatial_scale;
float roi_end_h = bottom_rois[n * 5 + 4] * spatial_scale;
// Force malformed ROI to be 1x1
float roi_width = fmaxf(roi_end_w - roi_start_w + 1., 0.);
float roi_height = fmaxf(roi_end_h - roi_start_h + 1., 0.);
// 每个bin的高度和宽度
float bin_size_h = roi_height / (aligned_height - 1.);
float bin_size_w = roi_width / (aligned_width - 1.);
//每个bin的坐标
float h = (float)(ph) * bin_size_h + roi_start_h;
float w = (float)(pw) * bin_size_w + roi_start_w;
int hstart = fminf(floor(h), height - 2);
int wstart = fminf(floor(w), width - 2);
int img_start = roi_batch_ind * channels * height * width;
// bilinear interpolation 双线性插值
if (h < 0 || h >= height || w < 0 || w >= width)
{
top_data[idx] = 0.;
}
else
{
float h_ratio = h - (float)(hstart);
float w_ratio = w - (float)(wstart);
int upleft = img_start + (c * height + hstart) * width + wstart;
int upright = upleft + 1;
int downleft = upleft + width;
int downright = downleft + 1;
top_data[idx] = bottom_data[upleft] * (1. - h_ratio) * (1. - w_ratio)
+ bottom_data[upright] * (1. - h_ratio) * w_ratio
+ bottom_data[downleft] * h_ratio * (1. - w_ratio)
+ bottom_data[downright] * h_ratio * w_ratio;
}
}
}
void ROIAlignBackwardCpu(const float* top_diff, const float spatial_scale, const int num_rois,
const int height, const int width, const int channels,
const int aligned_height, const int aligned_width, const float * bottom_rois,
float* bottom_diff)
{
const int output_size = num_rois * aligned_height * aligned_width * channels;
int idx = 0;
for (idx = 0; idx < output_size; ++idx)
{
// (n, c, ph, pw) is an element in the aligned output
int pw = idx % aligned_width;
int ph = (idx / aligned_width) % aligned_height;
int c = (idx / aligned_width / aligned_height) % channels;
int n = idx / aligned_width / aligned_height / channels;
float roi_batch_ind = bottom_rois[n * 5 + 0];
float roi_start_w = bottom_rois[n * 5 + 1] * spatial_scale;
float roi_start_h = bottom_rois[n * 5 + 2] * spatial_scale;
float roi_end_w = bottom_rois[n * 5 + 3] * spatial_scale;
float roi_end_h = bottom_rois[n * 5 + 4] * spatial_scale;
// Force malformed ROI to be 1x1
float roi_width = fmaxf(roi_end_w - roi_start_w + 1., 0.);
float roi_height = fmaxf(roi_end_h - roi_start_h + 1., 0.);
float bin_size_h = roi_height / (aligned_height - 1.);
float bin_size_w = roi_width / (aligned_width - 1.);
float h = (float)(ph) * bin_size_h + roi_start_h;
float w = (float)(pw) * bin_size_w + roi_start_w;
int hstart = fminf(floor(h), height - 2);
int wstart = fminf(floor(w), width - 2);
int img_start = roi_batch_ind * channels * height * width;
// bilinear interpolation 双线性插值
if (h < 0 || h >= height || w < 0 || w >= width)
{
float h_ratio = h - (float)(hstart);
float w_ratio = w - (float)(wstart);
int upleft = img_start + (c * height + hstart) * width + wstart;
int upright = upleft + 1;
int downleft = upleft + width;
int downright = downleft + 1;
bottom_diff[upleft] += top_diff[idx] * (1. - h_ratio) * (1. - w_ratio);
bottom_diff[upright] += top_diff[idx] * (1. - h_ratio) * w_ratio;
bottom_diff[downleft] += top_diff[idx] * h_ratio * (1. - w_ratio);
bottom_diff[downright] += top_diff[idx] * h_ratio * w_ratio;
}
}
}
- 然后看functions下的roi_align.py,此处调用src实现的具体roi_align操作
# --------------------
# 此处实现roi align自定义层的function
# 包括forward和backward
# --------------------
import torch
from torch.autograd import Function
from .._ext import roi_align
# TODO use save_for_backward instead
class RoIAlignFunction(Function):
def __init__(self, aligned_height, aligned_width, spatial_scale):
self.aligned_width = int(aligned_width)
self.aligned_height = int(aligned_height)
self.spatial_scale = float(spatial_scale)
self.rois = None
self.feature_size = None
def forward(self, features, rois):
self.rois = rois
self.feature_size = features.size()
batch_size, num_channels, data_height, data_width = features.size()
num_rois = rois.size(0)
output = features.new(num_rois, num_channels, self.aligned_height, self.aligned_width).zero_()
if features.is_cuda:
roi_align.roi_align_forward_cuda(self.aligned_height,
self.aligned_width,
self.spatial_scale, features,
rois, output)
else:
roi_align.roi_align_forward(self.aligned_height,
self.aligned_width,
self.spatial_scale, features,
rois, output)
# raise NotImplementedError
return output
def backward(self, grad_output):
assert(self.feature_size is not None and grad_output.is_cuda)
batch_size, num_channels, data_height, data_width = self.feature_size
grad_input = self.rois.new(batch_size, num_channels, data_height,
data_width).zero_()
roi_align.roi_align_backward_cuda(self.aligned_height,
self.aligned_width,
self.spatial_scale, grad_output,
self.rois, grad_input)
# print grad_input
return grad_input, None
- 最后是modules下的roi_align.py,此处我们就实现了roi_align层了,此处调用functions下的roi_align.py定义的RoIAlignFunction()函数
# --------------------
# 此处调用function实现roi align自定义层的module
# 包括forward,实现了层的定义
# 有average pooling 和max pooling
# --------------------
from torch.nn.modules.module import Module
from torch.nn.functional import avg_pool2d, max_pool2d
from ..functions.roi_align import RoIAlignFunction
class RoIAlign(Module):
def __init__(self, aligned_height, aligned_width, spatial_scale):
super(RoIAlign, self).__init__()
self.aligned_width = int(aligned_width)
self.aligned_height = int(aligned_height)
self.spatial_scale = float(spatial_scale)
def forward(self, features, rois):
return RoIAlignFunction(self.aligned_height, self.aligned_width,
self.spatial_scale)(features, rois)
class RoIAlignAvg(Module):
def __init__(self, aligned_height, aligned_width, spatial_scale):
super(RoIAlignAvg, self).__init__()
self.aligned_width = int(aligned_width)
self.aligned_height = int(aligned_height)
self.spatial_scale = float(spatial_scale)
def forward(self, features, rois):
x = RoIAlignFunction(self.aligned_height+1, self.aligned_width+1,
self.spatial_scale)(features, rois)
return avg_pool2d(x, kernel_size=2, stride=1)
class RoIAlignMax(Module):
def __init__(self, aligned_height, aligned_width, spatial_scale):
super(RoIAlignMax, self).__init__()
self.aligned_width = int(aligned_width)
self.aligned_height = int(aligned_height)
self.spatial_scale = float(spatial_scale)
def forward(self, features, rois):
x = RoIAlignFunction(self.aligned_height+1, self.aligned_width+1,
self.spatial_scale)(features, rois)
return max_pool2d(x, kernel_size=2, stride=1)