Abstract
-
Purpose
- We present a novel method to realistically puppeteer and animate a face from a single RGB image using a source video sequence.
-
Procedures
- fitting a multilinear PCA model to obtain the 3D geometry and a single texture of the target face.
- dynamic per-frame textures that capture subtle wrinkles and deformations corresponding to the animated facial expressions
-
Problems
- dynamic textures cannot be obtained directly from a single image
- not possible to obtain actual images of the mouth interior.
-
Solution
- a Deep Generative Network that can infer realistic per-frame texture deformations of the target identity
-
Conclusion
By retargeting the PCA expression geometry from the source, as well as using the newly inferred texture, we can both animate the face and perform video face replacement on the source video using the target appearance.
The goal of this paper is to transplant a single static 2D face image onto another face's dynamic video. First, a PCA face model is fit to build the 3D geometry; next, dynamic details such as wrinkles are captured for the animated expressions. Since dynamic skin details cannot be obtained directly from a single image, and mouth-interior features such as teeth are almost entirely absent, the paper introduces a conditional GAN to generate these missing features.
1. Introduction
Prior work includes video rewriting, face replacement, and real-time video reenactment. Their common limitation is that they all require carefully processed high-quality video as input, whereas this paper's method generates a facial-expression video from a single image.
One option is to fit a 3D model, but that loses fine details such as wrinkles. This paper aims to make those facial details change along with the expression, present when they should be and absent when they should not, and also to synthesize the details of the mouth interior.
2. Related Work
2.1 Facial Retargeting and Reenactment
The current state of the art uses two or more images for generation.
2.2 Capturing and Retargeting Photorealistic Mouth Interior
An interesting one: some approaches reconstruct the mouth shape from audio. The most recent retrieves the closest-matching mouth-interior model to reconstruct the mouth region.
2.3 Deep Generative Model for Texture Synthesis
Earlier work used Markov networks to generate high-resolution faces, but with artifacts; statistical models to synthesize wrinkles, but not at high resolution; and deep learning frameworks that reach high resolution but fail to produce wrinkles.
3. Overview
Our pipeline consists of the following steps (illustrated in Fig. 1):
- Fit a 3D model to extract static albedo textures from each frame in the source video sequence and the single RGB target image (Section 4).
- Infer dynamic textures and retarget the per-frame texture expressions from the source video frames onto the target image texture using a generative adversarial framework (Section 5).
- Composite the target mesh with the generated dynamic textures into each frame in the source video (Section 6).
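The three steps above amount to a loop over source frames; a minimal sketch with stub functions (the names, signatures, and array shapes are illustrative assumptions, not the paper's actual code):

```python
import numpy as np

def fit_face_model(image):
    """Stage 1 (Sec. 4): fit a 3D face model and unwrap a static albedo
    texture in UV space. Stub: return fixed-size geometry and texture."""
    return {"geometry": np.zeros((100, 3)), "texture": np.zeros((256, 256, 3))}

def infer_dynamic_texture(source_texture, target_texture):
    """Stage 2 (Sec. 5): a GAN would map the source's per-frame expression
    texture onto the target identity. Stub: blend the two textures."""
    return 0.5 * source_texture + 0.5 * target_texture

def composite(target_geometry, dynamic_texture, frame):
    """Stage 3 (Sec. 6): render the textured target mesh into the frame.
    Stub: rendering and blending omitted."""
    return frame

def puppeteer(source_frames, target_image):
    target = fit_face_model(target_image)
    output = []
    for frame in source_frames:
        src = fit_face_model(frame)
        tex = infer_dynamic_texture(src["texture"], target["texture"])
        output.append(composite(target["geometry"], tex, frame))
    return output
```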
4. Fitting the Face Model
5. Dynamic Texture Synthesis
5.1 Deep Learning Framework
5.2 Loss Function
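For intuition, a common objective for conditional-GAN texture synthesis combines an adversarial term with a weighted L1 reconstruction term (pix2pix-style); the exact form and the weight `lam` below are assumptions, not necessarily the paper's loss:

```python
import numpy as np

def cgan_generator_loss(d_fake, fake_tex, real_tex, lam=100.0):
    """Assumed pix2pix-style generator objective: fool the discriminator
    while staying close to the ground-truth texture in L1.

    d_fake   : discriminator probabilities D(x, G(x)) in (0, 1)
    fake_tex : generated dynamic texture
    real_tex : ground-truth texture
    """
    adv = -np.mean(np.log(d_fake + 1e-8))      # non-saturating GAN term
    l1 = np.mean(np.abs(fake_tex - real_tex))  # reconstruction term
    return adv + lam * l1

def cgan_discriminator_loss(d_real, d_fake):
    """Standard binary cross-entropy: real textures -> 1, generated -> 0."""
    return -np.mean(np.log(d_real + 1e-8)) - np.mean(np.log(1.0 - d_fake + 1e-8))
```

The L1 term keeps the generated texture globally faithful, while the adversarial term pushes it toward sharp, realistic wrinkle detail that plain L1 would blur away.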
5.3 Network Architecture
wiki: UV mapping is the 3D modeling process of projecting a 2D image to a 3D model's surface for texture mapping.
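The texture-lookup half of UV mapping can be illustrated with a minimal bilinear sampler; conventions such as the v-axis direction are assumptions and vary between renderers:

```python
import numpy as np

def sample_uv(texture, u, v):
    """Bilinearly sample an H x W x C texture at UV coordinates in [0, 1].
    u maps to columns (x), v to rows (y); v = 0 is the top row here."""
    h, w = texture.shape[:2]
    x = u * (w - 1)
    y = v * (h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the top and bottom rows, then along y.
    top = (1 - fx) * texture[y0, x0] + fx * texture[y0, x1]
    bot = (1 - fx) * texture[y1, x0] + fx * texture[y1, x1]
    return (1 - fy) * top + fy * bot
```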
5.4 Mouth Synthesis
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.
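A minimal illustration of the optical-flow constraint Ix·u + Iy·v + It = 0: Lucas-Kanade with one global window recovers a pure translation between two synthetic frames (a generic sketch of the concept, not the flow method used in the paper):

```python
import numpy as np

def lucas_kanade_global(img1, img2):
    """Estimate a single translational flow (u, v) between two grayscale
    frames by solving Ix*u + Iy*v + It = 0 in least squares over all
    pixels (Lucas-Kanade with one window covering the image)."""
    iy, ix = np.gradient(img1)   # spatial gradients (rows = y, cols = x)
    it = img2 - img1             # temporal derivative
    a = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                  [np.sum(ix * iy), np.sum(iy * iy)]])
    b = -np.array([np.sum(ix * it), np.sum(iy * it)])
    return np.linalg.solve(a, b)  # (u, v)

# Toy check: a Gaussian blob translated by one pixel along x.
yy, xx = np.mgrid[0:64, 0:64]
blob = lambda cx, cy: np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * 4.0 ** 2))
u, v = lucas_kanade_global(blob(30, 32), blob(31, 32))
```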
First, the UV texture of the mouth region, whether the mouth is open or closed, is projected onto the 3D model; for a closed mouth this yields only a patch of pink skin with no teeth. The open-mouth expressions from the source video are then transferred onto the static target image through the deep learning framework, which infers the appearance of the mouth and its interior. Because training data for the mouth interior is scarce, the synthesized mouth region comes out at low resolution, so the authors use a technique called SIFT-Flow: features are extracted from each source video frame and merged into the target image via a matching objective.
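SIFT-Flow itself matches dense SIFT descriptors under a smoothness prior; the much-simplified sketch below conveys only the core idea of borrowing higher-quality detail from a source frame via nearest-neighbor matching (plain intensity patches and all names here are simplifications, not SIFT-Flow proper):

```python
import numpy as np

def nn_detail_transfer(low_res, exemplar, patch=4):
    """Simplified stand-in for SIFT-Flow-style detail transfer: for each
    patch of the low-quality synthesized mouth, find the most similar
    patch in a higher-quality exemplar frame and copy it over."""
    h, w = low_res.shape
    out = low_res.copy()
    # Collect all exemplar patches once, as flattened rows.
    coords = [(y, x) for y in range(0, exemplar.shape[0] - patch + 1, patch)
                     for x in range(0, exemplar.shape[1] - patch + 1, patch)]
    ex_patches = np.stack([exemplar[y:y + patch, x:x + patch].ravel()
                           for y, x in coords])
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            q = low_res[y:y + patch, x:x + patch].ravel()
            best = np.argmin(np.sum((ex_patches - q) ** 2, axis=1))
            by, bx = coords[best]
            out[y:y + patch, x:x + patch] = exemplar[by:by + patch, bx:bx + patch]
    return out
```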
6. Video Face Replacement via Blending
6.1 Graph-Cut
(skipped)
7. Experiments
(skipped)