Real-Time Human Pose Recognition in Parts from Single Depth Images

Abstract We propose a new method to quickly and accurately predict 3D positions of body joints from a single depth image, using no temporal information. We take an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Our large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, etc. Finally we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes. The system runs at 200 frames per second on consumer hardware. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state of the art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.

Overview

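The paper proposes a method for recognizing human pose in real time from a single depth image. The main idea is to split the problem into two steps: classifying each pixel of the depth image by the body part it belongs to, and then reconstructing 3D joint positions from the recognized body parts.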

Figure 1: From a single input depth image, infer the body part to which each pixel belongs.

We focus on how the paper uses random forests to solve the first step: recognizing body parts from a depth image.

Training Data Collection

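Obtaining a large and varied training set is the key to the problem. There are two obstacles. First, realistic images generated with computer graphics are heavily affected by variation in color and texture, so the useful information in the raw data degrades to a 2D silhouette. Second, although a depth camera avoids the influence of color and texture, the variety of body and clothing shapes is still hard to capture completely.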

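Starting from images captured by a depth camera, the paper generates additional synthetic images by slightly varying the height and weight of character models, which covers a wider range of body shapes. The synthetic images are meant to be both realistic and varied. In addition, given the intended use case, data collection aims to cover the poses people are likely to make in an entertainment setting. In practice it is not necessary to capture every possible combination of poses; a large set of poses spanning a wide range is enough. Within a captured continuous pose sequence, neighboring poses are similar and largely redundant, so a Euclidean distance between poses is defined and used to discard some of the redundant data.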
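The redundancy filter described above can be sketched as a greedy pass over the pose sequence. This is a minimal illustration under stated assumptions, not the paper's exact procedure: the function names `pose_distance` and `drop_redundant_poses` are hypothetical, a pose is assumed to be a list of 3D joint coordinates, the distance between two poses is assumed to be the largest Euclidean distance between corresponding joints, and the threshold `min_dist` is a free parameter.

```python
import math


def pose_distance(pose_a, pose_b):
    """Distance between two poses.

    Assumption: a pose is a sequence of (x, y, z) joint positions, and the
    distance is the largest Euclidean distance over corresponding joints.
    """
    return max(math.dist(a, b) for a, b in zip(pose_a, pose_b))


def drop_redundant_poses(poses, min_dist):
    """Greedily keep a pose only if it is at least `min_dist` away
    from every pose kept so far (illustrative strategy, an assumption)."""
    kept = []
    for pose in poses:
        if all(pose_distance(pose, other) >= min_dist for other in kept):
            kept.append(pose)
    return kept


# Three single-joint poses; the second is nearly identical to the first.
sequence = [
    [(0.0, 0.0, 0.0)],
    [(0.01, 0.0, 0.0)],
    [(1.0, 0.0, 0.0)],
]
subset = drop_redundant_poses(sequence, min_dist=0.05)
print(len(subset))  # 2: the near-duplicate second pose is dropped
```

The greedy pass compares each pose against every kept pose, so it is quadratic in the worst case; that is acceptable for a sketch, though a large capture would likely need something more efficient.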

To fill gaps left by earlier rounds of collection, the capture process can be repeated, steadily improving the database.

Feature Representation

The paper uses depth comparisons to compute features for a pixel $\mathbf{x}$ in image $I$.

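First, for pixel $\mathbf{x}$, define a pair of offset probes $\theta = (\mathbf{u}, \mathbf{v})$ that capture the depth difference between two locations near $\mathbf{x}$.
The two probe pixels generated from $\mathbf{u}$ and $\mathbf{v}$ can be written as $\mathbf{x}+\frac{\mathbf{u}}{d_I(\mathbf{x})}$ and $\mathbf{x}+\frac{\mathbf{v}}{d_I(\mathbf{x})}$, where $d_I(\mathbf{x})$ is the depth at pixel $\mathbf{x}$ in image $I$.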
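To make the depth normalization concrete, the sketch below computes the two probe positions for one pixel. It is only an illustration: the function name `probe_positions` is hypothetical, pixels and offsets are assumed to be 2D tuples, and the positions are left as unrounded floats (how they are snapped to the pixel grid is not specified here).

```python
def probe_positions(x, u, v, depth_at_x):
    """Return the two probe positions x + u/d and x + v/d.

    x          -- (col, row) of the pixel being classified
    u, v       -- 2D offsets of the probe pair theta = (u, v)
    depth_at_x -- depth d_I(x) at pixel x; must be non-zero
    """
    probe_u = (x[0] + u[0] / depth_at_x, x[1] + u[1] / depth_at_x)
    probe_v = (x[0] + v[0] / depth_at_x, x[1] + v[1] / depth_at_x)
    return probe_u, probe_v


# The same offsets reach half as far in the image when the depth doubles.
near = probe_positions((100, 100), (60, 0), (0, -40), depth_at_x=2.0)
far = probe_positions((100, 100), (60, 0), (0, -40), depth_at_x=4.0)
print(near)  # ((130.0, 100.0), (100.0, 80.0))
print(far)   # ((115.0, 100.0), (100.0, 90.0))
```

Dividing the offsets by $d_I(\mathbf{x})$ shrinks the pixel displacement for a body farther from the camera, so a given probe pair covers roughly the same physical extent regardless of distance.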
The depth difference between the two probe pixels is computed as follows: