20190722_CVPR2019_[Feature Points]D2-Net

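Paper: D2-Net: A Trainable CNN for Joint Description and Detection of Local Features
Link: https://arxiv.org/abs/1905.03561
Authors: Mihai Dusmanu, ……, Marc Pollefeys
Institutions: ETH, Microsoft, JSPS KAKENHI
Summary of the abstract: Traditional local-feature pipelines first detect keypoints and then extract descriptors (detect-then-describe). This paper instead uses a single CNN that takes the raw image $I$ as input and outputs a feature map, a 3D tensor $F=\mathcal{F}\left( I \right)$ , $F\in \mathbb{R}^{h\times w\times n}$ , where $h \times w$ is the spatial resolution of the feature map and $n$ is the number of channels. Keypoints and descriptors are then extracted simultaneously from this feature map (detect-and-describe). The results look good in some scenes, as shown in the figure below.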

Figure 1

1. Descriptors

$\mathbf{d}_{ij}=F_{ij:},\mathbf{d}\in \mathbb{R}^n \tag{1}$
In practice, the paper uses L2-normalized descriptors: $\mathbf{\hat{d}}_{ij}=\mathbf{d}_{ij}/\lVert \mathbf{d}_{ij} \rVert _2$ .
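
The descriptor definition and its normalization amount to slicing the feature map along the channel axis. A minimal NumPy sketch follows, assuming `F` is stored as an array of shape `(h, w, n)`; the function name `extract_descriptors` and the `eps` guard against zero-norm vectors are illustrative choices, not taken from the paper or its code.

```python
import numpy as np

def extract_descriptors(F, eps=1e-12):
    """Return L2-normalized descriptors, one per spatial position.

    F: feature map of shape (h, w, n); F[i, j, :] is the raw descriptor d_ij.
    eps: hypothetical guard so an all-zero descriptor does not divide by zero.
    """
    F = np.asarray(F, dtype=float)
    norms = np.linalg.norm(F, axis=2, keepdims=True)  # ||d_ij||_2, shape (h, w, 1)
    return F / np.maximum(norms, eps)

# Usage on a random stand-in for a real feature map.
rng = np.random.default_rng(0)
F = rng.random((4, 5, 8))
d_hat = extract_descriptors(F)
```

Each `d_hat[i, j]` then has unit length, so Euclidean distance between descriptors depends only on the angle between them.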

2. Feature points

Define the 2D response maps:
$D^k=F_{::k},D^k\in \mathbb{R}^{h\times w} \tag{2}$
Hard feature detection. To decide whether $(i,j)$ is a detection: among $D_{ij}^{1},D_{ij}^{2},...,D_{ij}^{n}$ , find the channel $k$ with the strongest response $D_{ij}^{k}$ , then check whether $D_{ij}^{k}$ is a local maximum within channel $k$ . If it is, $(i,j)$ is a detection, as shown in Figure 2.
$\left( i,j \right) \text{ is a detection} \Longleftrightarrow D_{ij}^{k} \text{ is a local max. in } D^k, \text{ with } k=\underset{t}{\arg\max}\, D_{ij}^{t} \tag{3}$

Figure 2
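
The hard detection rule of Eq. (3) can be sketched in NumPy as below. This is an illustration under stated assumptions, not the official implementation: "local max" is taken to mean strictly greater than the 8 neighbours in the same channel, borders are padded with negative infinity, and the name `hard_detect` is hypothetical.

```python
import numpy as np

def hard_detect(F):
    """Boolean (h, w) mask of detections following Eq. (3).

    Assumption: a local max is strictly greater than its 8 neighbours
    in the same channel; positions outside the map count as -inf.
    """
    F = np.asarray(F, dtype=float)
    h, w, _ = F.shape
    k = np.argmax(F, axis=2)  # strongest channel at each position
    D_k = np.take_along_axis(F, k[..., None], axis=2)[..., 0]  # D_ij^k
    padded = np.pad(F, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    mask = np.ones((h, w), dtype=bool)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            # Neighbour response in the SAME channel k as the centre position.
            neighbour = padded[ii + 1 + di, jj + 1 + dj, k]
            mask &= D_k > neighbour
    return mask

# Usage: a single strong response in channel 1 at position (2, 2).
F = np.zeros((5, 5, 2))
F[2, 2, 1] = 5.0
mask = hard_detect(F)
```

Note that the local-maximum test is run only in the winning channel $k$: a position that peaks spatially in some weaker channel is not a detection.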

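Soft feature detection. Hard feature detection is a binary (0-or-1) decision, so it is only usable at test time; during training the decision has to be softened so that gradients can be back-propagated. The idea is to assign a score to every position (somewhat like a class probability in classification).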
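
To make the score idea concrete, here is a NumPy sketch of one soft formulation, written from my reading of the paper and worth checking against it: a spatial soft local-max $\alpha_{ij}^{k}=\exp(D_{ij}^{k})/\sum_{(i',j')\in \mathcal{N}(i,j)}\exp(D_{i'j'}^{k})$ over the 3x3 neighbourhood, a channel ratio $\beta_{ij}^{k}=D_{ij}^{k}/\max_t D_{ij}^{t}$ , then $\gamma_{ij}=\max_k \alpha_{ij}^{k}\beta_{ij}^{k}$ normalized over the image to give $s_{ij}$ . Assumptions that are mine: `F` is non-negative (for example after a ReLU), neighbours outside the map contribute nothing to the sum, and `eps` is a hypothetical guard against division by zero.

```python
import numpy as np

def soft_detection_scores(F, eps=1e-12):
    """Soft detection scores s of shape (h, w), summing to about 1.

    F: non-negative feature map of shape (h, w, n) (assumption).
    """
    F = np.asarray(F, dtype=float)
    h, w, _ = F.shape
    # Shifting by the global max keeps exp() stable; it cancels in alpha.
    exp_F = np.exp(F - F.max())
    padded = np.pad(exp_F, ((1, 1), (1, 1), (0, 0)))  # zeros outside the map
    neigh_sum = np.zeros_like(exp_F)
    for di in range(3):
        for dj in range(3):
            neigh_sum += padded[di:di + h, dj:dj + w, :]  # 3x3 window, centre included
    alpha = exp_F / neigh_sum                        # soft local-max, per channel
    beta = F / (F.max(axis=2, keepdims=True) + eps)  # ratio to the strongest channel
    gamma = (alpha * beta).max(axis=2)               # best channel per position
    return gamma / (gamma.sum() + eps)               # image-level normalization

# Usage: a flat single-channel map with one peak at (2, 2).
F = np.full((5, 5, 1), 0.1)
F[2, 2, 0] = 3.0
s = soft_detection_scores(F)
```

Every operation here is differentiable almost everywhere, which is what lets the scores take part in training while Eq. (3) is kept for testing.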

3. Joint training

The ground truth is a given set of correspondences $\mathcal{C}$ , and the loss function is:
$\mathcal{L}\left( I_1,I_2 \right) =\sum_{c\in \mathcal{C}}{\frac{s_{c}^{\left( 1 \right)}s_{c}^{\left( 2 \right)}}{\sum_{q\in \mathcal{C}}{s_{q}^{\left( 1 \right)}s_{q}^{\left( 2 \right)}}}}m\left( p\left( c \right) ,n\left( c \right) \right) \tag{13}$
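See the paper for details. Roughly: $m$ is a margin term that pulls the descriptor distance $p(c)$ of a correct match down and pushes the distance $n(c)$ to incorrect matches up, and the weights built from the detection scores $s_{c}^{(1)}s_{c}^{(2)}$ reward high keypoint scores at correspondences whose descriptors are already well separated.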
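
The score-weighted sum in Eq. (13) can be sketched as follows. The sketch takes the positive distances $p(c)$ and negative distances $n(c)$ as precomputed inputs, and assumes the margin term has the form $m(p,n)=\max(0, M+p^2-n^2)$ , which is my reading of the paper; the function name `d2_loss` and the default `margin=1.0` are hypothetical.

```python
import numpy as np

def d2_loss(s1, s2, p, n, margin=1.0):
    """Score-weighted margin loss over a set of correspondences.

    s1, s2: detection scores of each correspondence in image 1 and image 2.
    p: descriptor distance of each correct match.
    n: descriptor distance to the hardest incorrect match.
    margin: hypothetical default for M.
    """
    s1, s2, p, n = (np.asarray(x, dtype=float) for x in (s1, s2, p, n))
    m = np.maximum(0.0, margin + p**2 - n**2)  # assumed form of m(p(c), n(c))
    weights = s1 * s2
    weights = weights / weights.sum()          # normalize over all correspondences
    return float(np.sum(weights * m))

# Usage: correspondence 0 is well separated (m = 0), correspondence 1 is not (m = 1.75).
p = [0.0, 1.0]
n = [2.0, 0.5]
loss_uniform = d2_loss([1.0, 1.0], [1.0, 1.0], p, n)
loss_shifted = d2_loss([3.0, 1.0], [1.0, 1.0], p, n)
```

In this toy case, raising the score of the well-separated correspondence lowers the loss without changing any descriptor, which is the mechanism that couples detection to description.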

4. Miscellaneous

The backbone is VGG16, fine-tuned.
The project is open-sourced on GitHub: https://github.com/mihaidusmanu/d2-net