Original article: https://pkhungurn.github.io/talking-head-anime/
Abstract. Fascinated by virtual YouTubers, I put together a deep neural network system that makes becoming one much easier. More specifically, the network takes as input an image of an anime character's face and a desired pose, and it outputs another image of the same character in the given pose. What it can do is shown in the video below. https://youtu.be/kMQCERkTdO0
I also connected the system to a face tracker. This allows the character to mimic my face movements: https://youtu.be/T1Gp-RxFZwU
I can also transfer face movements from existing videos: https://youtu.be/FioRJ6x_RbI
You can find the code for the above tools here.
1 Introduction
In the past two years, I have been really into virtual YouTubers (VTubers). For the uninitiated, these are anime characters, acted and voiced by real people, that publish videos and/or do live streams on YouTube. Perhaps the easiest way to get what they are is to see one in action. Below is Shirakami Fubuki, one of my favorite VTubers:
VTubers form a new cohort of entertainers, and they are gaining traction in Japan. According to this article by BBC, a new industry is being developed around them, with a company planning to invest millions of dollars.
On the other hand, I have also been fascinated by recent advances in deep learning, especially when it comes to anime-related stuff. In 2017, a team of dedicated researchers successfully trained a generative adversarial network (GAN) to generate images of anime characters of very good quality. Recently, Gwern, a freelance writer, released the largest corpus of anime images and also managed to train another GAN that generated anime characters that are eye-poppingly beautiful. Sizigi Studios, a San Francisco game developer, opened WaifuLabs, a website that allows you to customize a GAN-generated female character and buy merchandise featuring her.
Everything seems to point to a future where artificial intelligence is an important tool for anime creation, and I want to take part in realizing it. In particular, how can I make creating anime easier with deep learning? It seems that the lowest hanging fruit is creating VTuber content. So, in early 2019, I embarked on a quest to answer the following question: Can I use deep learning to make becoming a VTuber easier?
2 What am I trying to do?
So how do you become a VTuber to begin with? You need a character model whose movement can be controlled. One approach is to create a full 3D model of the character, and it is taken by many famous VTubers such as Kizuna AI, Mirai Akari, and Dennou Shoujo Shiro. However, crafting a beautiful 3D model is expensive because it requires multiple types of talent: a great character designer is a must, and a highly skilled 3D modeler is also needed. It is rare for a person to be both, not to mention that creating a character is off limits for someone with no art skills like me. Of course, you can always throw money at the problem, but a simple Google search reveals that the asking price for a commission is around 500,000 yen (about 5,000 dollars).
Instead of 3D models, you can create 2D ones. A 2D model is a collection of movable images arranged into layers. Most VTubers use this type of model because it is much cheaper to create: commissioning seems to cost around 30,000 yen (about 300 dollars). Still, 2D modeling requires extra work on top of designing and drawing the character. The body needs to be divided into multiple movable parts. The modeler then has to assemble them using specialized software such as Live2D. Specifying the parts' movements is also time-consuming.
However, the possible movements of most 2D VTubers are rather simple. They can open and close their mouths and eyes, lower and raise their eyebrows, rotate their faces by some small angles, and rock their bodies left and right. They rarely rotate their bodies or move their arms and legs. The reason is that it is much harder to create such movements from a fixed collection of 2D images.
Given that the movements of 2D VTubers are simple and limited, can we automatically generate them on the fly instead of creating movable 2D models first? Being able to do so would make it much easier to become a VTuber. I can just go ahead and commission a drawing, which probably costs no more than 20,000 yen. Better yet, I can use a GAN to generate a character at virtually no cost! This would be a boon not only to someone like me who cannot draw, but also to artists: they can draw and get the character to move immediately without modeling. I also see immediate applications in game production. It would be super simple to make all characters in visual novels move as they go through the dialogue.
3 Overview
Now that the goal is established, let's get a little more specific about the project. The problem I'm trying to solve is this: given an image of an anime character's face and a "pose," generate another image of the same character such that its face is changed according to the pose. Here, a pose is a collection of numbers that specifies the character's facial expression and head orientation. In particular, my pose has 6 numbers, corresponding to the sliders in the first video of the abstract. I will discuss the specifics of inputs and outputs in the Problem Specification section.
[Figure: image not available] (© Kizuna AI). I used the 3D model downloaded from the official web page to render this image and other similar ones.
As you may have guessed, I will solve the problem with deep learning. This requires me to answer the following two questions:
- What data am I going to train the network with?
- What network architecture am I going to use, and how in particular would I train the network?
It turns out that the main challenge is the first question. I need a dataset that contains face images annotated with their poses. EmotioNet is a large dataset of human faces with the desired type of annotations. However, there is no such dataset for anime characters as far as I know.
I therefore generated a new dataset specifically for the project. I took advantage of the fact that there are tens of thousands of downloadable 3D models of anime characters, created for a 3D animation software called MikuMikuDance. I downloaded about 8,000 models and used them to render anime faces under random poses. I will discuss the steps to prepare the data in the Dataset section.
I designed the network according to how a 3D character model is animated. I decompose the process into two steps. The first changes the facial expression; i.e., it controls how much the eyes and the mouth are opened. The second rotates the face. I use a separate network for each step, with the second network taking the output of the first as its input. Let us call the first network the face morpher, and the second the face rotator.
For the face morpher, I use the generator architecture employed by Pumarola et al. in their ECCV 2018 paper. The network changes facial expression by producing another image that represents changes to the original image. The change image is combined with the original using an alpha mask, also produced by the network itself. I found that their architecture works excellently for changing small parts of the image: closing eyes and mouths in my case.
The face rotator is much more complicated. I use two algorithms, implemented in a single network, to rotate the face, thereby producing two outputs. The algorithms are:
- Pumarola et al.'s algorithm. This is the one just used to modify facial expression, but now I tell the network to rotate the face.
- Zhou et al.'s view synthesis algorithm. Their goal is to rotate a 3D object in an image. They do so by having a neural network compute an appearance flow: a map that tells, for each pixel in the output, which pixel in the input to copy color from.
Appearance flow produces sharp results that preserve the original texture, but it is not good at hallucinating occluded parts that become visible after rotation. On the other hand, Pumarola et al.'s architecture produces blurry results but can hallucinate disoccluded parts as it is trained to change the original image's pixels without copying from existing ones. (See Figure 6C for a visual demonstration of the pros and the cons of the algorithms.) To combine both approaches' advantages, I train another network to blend the two outputs together through an alpha mask. The network also outputs a "retouch" image, which is blended with the combined image with yet another alpha mask.
I will discuss the architectures of all the networks and how they were trained in detail in the Networks section.
4 Problem Specification
The input to the system consists of an image of an anime character and a desired pose vector. The image is of size 256 × 256, has RGBA format, and must have a transparent background. More specifically, pixels that do not belong to the character must have the RGBA value of (0,0,0,0), and those that do must have non-zero alpha values. The character's head must be looking straight in the direction perpendicular to the image plane. The head must be contained in the center 128 × 128 box, and the eyes and the mouth must be wide open. (The network can handle images with eyes and mouth closed as well. However, in such a case, it cannot open them because there's not enough information on what the opened eyes and mouth look like.) In 3D character animation terms, the input is the rest pose shape to be deformed. See a course on 3D character animation if you are curious.
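The mechanical parts of this specification can be checked in code. The following is a minimal sketch, assuming the image is loaded as a NumPy array of shape (height, width, 4); the function name `check_input_image` is hypothetical and not part of the released code. It cannot verify that the head is centered or that the eyes and mouth are open, which still have to be checked by eye.

```python
import numpy as np


def check_input_image(image: np.ndarray) -> list:
    """Return a list of violated requirements (empty if none were found)."""
    problems = []
    # The system expects a 256 x 256 image with four channels (RGBA).
    if image.shape != (256, 256, 4):
        problems.append(f"expected shape (256, 256, 4), got {image.shape}")
        return problems

    rgb = image[..., :3]
    alpha = image[..., 3]
    background = alpha == 0  # pixels that do not belong to the character

    if not background.any():
        problems.append("no transparent background pixels found")
    # Background pixels must be exactly (0, 0, 0, 0), so their RGB must be 0.
    if (rgb[background] != 0).any():
        problems.append("background pixels must be exactly (0, 0, 0, 0)")
    return problems
```

An empty list means only that the format requirements hold, not that the drawing is a suitable rest pose.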
As said earlier, the character's face configuration is controlled by a "pose." In my case, it is a 6-dimensional vector. Three components control the facial features and have values in the closed interval $[0,1]$.
- Two components control the opening of the eyes; one for the left eye and the other for the right. The value of $0$ means the eye is fully open, and $1$ means the eye is fully closed.
- One component controls the opening of the mouth. This time, however, $0$ means the mouth is fully closed, and $1$ means the mouth is fully open. The contradicting semantics of the eye and the mouth parameters stem from the semantics of morph weights of the 3D models: the eye morphs close the eyes, while the mouth morph opens the mouth.
The three other components control how the head is rotated. In 3D animation terms, the head is controlled by two "joints," connected by a "bone." The neck root joint is at where the neck is connected to the body, and the neck tip joint is at where the neck is connected to the head. In the skeleton of the character, the tip is a child of the root. So, a 3D transformation applied to the root would also affect the tip, but not the other way around.
Figure 4B. Two joints that control the character's head.
The three components of the pose vector have values in the interval $[-1,1]$.
- One component controls the rotation around the $z$-axis of the neck root joint. Here, I use a coordinate system where the $y$-axis points up, the $x$-axis points to the left side of the character, and the $z$-axis points to the front. (See Figure 4B.) So, the component controls how much the neck is tilted sideways. I limit the rotation angle to the range $[-15^\circ, 15^\circ]$. The value of $-1$ corresponds to $-15^\circ$, and the value of $1$ corresponds to $15^\circ$.
- One component controls the rotation around the $x$-axis of the neck tip joint. Physically, it indicates how much the head is tilted up or down. Again, we map the component value's range of $[-1,1]$ to the rotation angle's range of $[-15^\circ, 15^\circ]$. A positive value means the head is tilted up, a negative value means it is tilted down, and the value of $0$ means the head is facing in the direction parallel to the $z$-axis.
- The last component has the same angular range as the previous one, but it controls the rotation around the $y$-axis of the neck tip joint. In other words, it controls the horizontal direction of the face.
I omitted many types of movement, including those of eyebrows, irises, and the upper body. I did so to make the problem smaller and easier to solve, so that I could finish the system and demonstrate sooner that it works. While the system is limited right now, adding more types of movement is not conceptually different from what I do here. I leave this as future work.
To recap, the input consists of an image of a character's face and a 6-dimensional pose vector. The output is another image of the face that is posed accordingly.
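To make the pose vector concrete, here is a small sketch of how it could be represented. The class name, field names, component order, and the `MAX_ANGLE_DEG` constant are assumptions made for illustration; in particular, the ±15° limit is an assumed value for the angular range, and the released code may organize the pose differently.

```python
from dataclasses import dataclass

# Assumed angular limit: a head component of +/-1 maps to +/-15 degrees.
MAX_ANGLE_DEG = 15.0


@dataclass
class Pose:
    left_eye: float     # 0 = fully open, 1 = fully closed
    right_eye: float    # 0 = fully open, 1 = fully closed
    mouth: float        # 0 = fully closed, 1 = fully open
    neck_root_z: float  # sideways tilt of the neck, in [-1, 1]
    neck_tip_x: float   # head tilted up (+) or down (-), in [-1, 1]
    neck_tip_y: float   # horizontal direction of the face, in [-1, 1]

    def __post_init__(self):
        # Facial feature components live in [0, 1].
        for name in ("left_eye", "right_eye", "mouth"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must be in [0, 1]")
        # Head rotation components live in [-1, 1].
        for name in ("neck_root_z", "neck_tip_x", "neck_tip_y"):
            if not -1.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must be in [-1, 1]")

    def angles_deg(self):
        """Map the three head components linearly to rotation angles."""
        return tuple(
            MAX_ANGLE_DEG * v
            for v in (self.neck_root_z, self.neck_tip_x, self.neck_tip_y)
        )

    def as_vector(self):
        """Flatten to the 6-dimensional vector (order is an assumption)."""
        return [self.left_eye, self.right_eye, self.mouth,
                self.neck_root_z, self.neck_tip_x, self.neck_tip_y]
```

For example, `Pose(0, 0, 1, -1, 0.5, 0)` describes open eyes, a fully open mouth, the neck tilted fully to one side, and the head tilted halfway up.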
5 Dataset
I want to animate drawn characters, so it would be most advantageous to train the network with drawings. However, I created a training dataset by rendering 3D character models. While 3D renderings are not the same as drawings, they are much easier to work with because 3D models are controllable. I can come up with any pose, apply it to a model, and render an image showing exactly that pose. Moreover, a 3D model can be used to generate hundreds of training images, so I only need to collect several thousand models. If I were to use drawings, I would have to collect hundreds of thousands of them and annotate each one with the pose of the character. Annotating hundreds of thousands of images is much harder than processing several thousand 3D models.
I use models created for a 3D animation software called MikuMikuDance (MMD). The main reason is that there are tens of thousands of downloadable models of anime characters. I am also quite familiar with the file format because I used MMD models to generate training data for one of my previous research papers. Over the years, I have developed a library to manipulate and render the models, and it has allowed me to automate much of the data generation process.
To create a training dataset, I downloaded around 13,000 MMD models from websites such as Niconi Solid and BowlRoll. I also found models by following links from VPVP wiki, みさきる!, and Nico Nico Pedia. Downloading alone took about two months.
Not all models are usable. Some of them are not even character models, and I had to discard models that my library could not handle. To reduce repetitive training data, I also subjectively removed models whose appearances, I thought, were too close to those of other models. After all elimination, I was left with around 8,000 models. Some of them are shown in the video below. https://youtu.be/6-zpzdH6j30
5.1 Data Annotation
The raw model data are not enough to generate training data. In particular, there are two problems.
The first problem is that I did not know exactly where each model's head was. I need to know this because the input specification requires that the head be contained in the middle 128 × 128 box of the input image. So, I created a tool that allowed me to annotate each model with the $y$-position of the bottom and the top of the head. The bottom corresponds to the tip of the chin, but the top does not have a precise definition. I mostly set the top so that the whole skull and the flat portion of hair that covers it are included in the range, arbitrarily excluding hair that pointed upward. If the character wears a hat, I simply guessed the location of the head's top. Fortunately, the positions do not have to be precise for a neural network to work. You can see the tool in action in the video below: https://youtu.be/hTn3ErAMaDQ
The second problem is that I did not know how to exactly control each model's eyes. Facial expressions of MMD models are implemented with "morphs" (aka blend shapes). A morph typically corresponds to a facial feature being deformed in a particular way. For example, for most models, there is a morph corresponding to closing both eyes and another corresponding to opening the mouth as if to say "ah."
To generate the training data, I need to know the names of three morphs: one that closes the left eye, one that closes the right, and one that opens the mouth. The last one is named "あ" in almost all models, so I did not have a problem with it. The situation is more difficult with the eye-closing morphs. Different modelers name them differently, and one or both of them might be missing from some models.
I created a tool that allowed me to cycle through the eye controlling morphs and mark ones that have the right semantics. You can see a session of me using the tool in the following video.
https://youtu.be/3M7OukBVpNo
You can see in the video that I collected 6 morphs instead of 2. The reason is that MMD models generally come with two types of winks. Normal winks have eyelids curved downward, and smile winks have eyelids curved upward, resulting in a happy look. Moreover, for each type of wink, there can be three different morphs: one that closes the right eye, one that closes the left, and one that closes both. At the point of data annotation, I was not sure which type of wink and morph to use, so I decided to collect them all. In the end, I decided to use only the normal winks because more models have them. While it seems that morphs that close both eyes are superfluous, some models do not have any morphs that close only one eye.
Annotating the models, including developing tools to do so, took about 4 months. It was the most time consuming part of the project.
5.2 Pose Sampling
Another important part of the training data is the pose, which I need to specify for every training example. I generated poses by sampling each component of the pose vector independently. For the eye and mouth controlling parameters, I sampled them uniformly from the $[0,1]$ interval. For the head joint parameters, I sampled from a probability distribution whose density grows linearly from the center of the range (i.e., $0$) to the extreme values (i.e., $-1$ and $1$). The density is depicted in the figure below:
Figure 5A. Probability distribution for sampling the head joint parameters.
I chose this distribution to increase the frequency of hard training examples: when some head joint parameters are far from $0$, there would be a large difference between the head configuration and that of the rest pose. I believe that forcing the network to solve challenging problems from the get-go would make it perform better in general.
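The sampling scheme can be sketched with inverse transform sampling. The code below assumes the density of the head joint parameters is exactly $p(x) = |x|$ on $[-1,1]$, i.e., zero at the center and growing linearly to the extremes; this exact shape is my reading of the description, and the function names are hypothetical. Under that assumption, $P(|X| \le r) = r^2$, so the magnitude can be drawn as the square root of a uniform number.

```python
import math
import random


def sample_head_param(rng: random.Random) -> float:
    """Draw from the assumed density p(x) = |x| on [-1, 1]."""
    # P(|X| <= r) = r^2, so |X| = sqrt(U) with U uniform in [0, 1).
    magnitude = math.sqrt(rng.random())
    # The density is symmetric, so the sign is a fair coin flip.
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return sign * magnitude


def sample_pose(rng: random.Random) -> list:
    """Sample a 6-dimensional pose, each component independently."""
    face = [rng.random() for _ in range(3)]             # eyes and mouth
    head = [sample_head_param(rng) for _ in range(3)]   # three head angles
    return face + head
```

With this density, only a quarter of the samples fall within $|x| < 0.5$, so most training examples involve a noticeable head rotation.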
5.3 Rendering
To generate a training image, I decided on a model and a pose. I then rendered the posed model using an orthographic projection so that the $y$-positions of the top and bottom of the head (obtained through manual annotation in Section 5.1) correspond to the middle 128-pixel vertical strip of the image. The reason for using the orthographic projection rather than the perspective projection is that drawings, especially of VTubers, do not seem to have foreshortening effects.
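As a worked example of this framing, the sketch below computes the vertical mapping of an orthographic camera from the two annotated head positions. It assumes pixel rows are counted from the top of a 256-pixel-tall image, so the head top lands on row 64 and the chin on row 192; the function names and the sample coordinates are made up for illustration.

```python
IMAGE_SIZE = 256   # output image height in pixels
HEAD_PIXELS = 128  # the head must span the middle 128 rows


def world_y_to_row(y: float, y_bottom: float, y_top: float) -> float:
    """Map a world-space height to a pixel row (row 0 is the image top)."""
    scale = HEAD_PIXELS / (y_top - y_bottom)      # pixels per world unit
    chin_row = (IMAGE_SIZE + HEAD_PIXELS) / 2     # row 192
    return chin_row - (y - y_bottom) * scale


def visible_y_range(y_bottom: float, y_top: float) -> tuple:
    """World-space heights at the bottom and top edges of the image."""
    scale = HEAD_PIXELS / (y_top - y_bottom)
    chin_row = (IMAGE_SIZE + HEAD_PIXELS) / 2
    y_min = y_bottom - (IMAGE_SIZE - chin_row) / scale
    y_max = y_bottom + chin_row / scale
    return y_min, y_max
```

For a head whose chin is at height 17 and top at height 19, the camera would need to show heights 16 to 20, which is twice the head height, centered on the head.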
Rendering a 3D model requires specifying the light scattering properties of the model's surface. MMD generally uses toon shading, but I used a more standard Phong reflection model because I was too lazy to implement toon shading. Depending on the model data, the resulting training images might look more 3D-like than typical drawings. However, in the end, the system still worked well on drawings despite being trained on 3D-like images.
Figure 5B. Comparison between (a) a rendering by the MikuMikuDance software, and (b) a rendering by my library. MMD produces a flat appearance that is more similar to a drawing. You can notice that the nose is much more noticeable in (b) than in (a). Nonetheless, because the images are very similar overall, the network trained with (b) would still generalize well to drawings. The character is Aduchi Momo and is © Ichikara Inc. The 3D model was created by 弐形 and is available here.
Rendering also requires specifying the lighting in the scene. I used two light sources. First is a directional white light of the magnitude that points straight in the direction. The light's direction was chosen to minimize shadow in the rendering. Second is a white ambient light source of magnitude .
Another detail of the data generation process is that each training example consists of three images. The first is that of the character in the rest pose. The second only contains changes to facial features. The third adds face rotation to the second. I do this because I have separate networks for manipulating facial features and rotating the face, and they need different training data. Note that, since the image with the rest pose does not depend on the sampled pose, we only need to render it once for each model.
Figure 5C. For each training example, I rendered three images: (a) one with the character in the rest pose, (b) one with only facial expression changes, and (c) one with both facial expression changes and face rotation. Training the face morpher uses (a) and (b), but training the face rotator uses (b) and (c).
5.4 Datasets
I divided the models into three subsets so that I can use them to generate the training, validation, and test datasets. While downloading the models, I organized them into folders according to the source materials. For example, models of Fate/Grand Order characters and those of Kantai Collection characters would go into different folders. I used models of VTubers from Nijisanji to generate the validation set and models of other VTubers to generate the test set. The training set was created from characters from anime, manga, and video games. Because the origins for the characters are different, there are no overlaps between the three datasets.
The numerical breakdown of the three datasets is as follows:
| | Training Set | Validation Set | Test Set |
|---|---|---|---|
| Models | 7,881 | 79 | 72 |
| Sampled Poses | 500,000 | 10,000 | 10,000 |
| Rest Pose Images | 7,881 | 79 | 72 |
| Expression Changed Images | 500,000 | 10,000 | 10,000 |
| Fully Posed Images | 500,000 | 10,000 | 10,000 |
| Total Number of Images | 1,007,881 | 20,079 | 20,072 |
Data generation was fully automated. The whole process took about 16 hours.
6 Networks
As discussed in the Overview section, my neural network system consists of several subnetworks. I'll now describe them in detail.
6.1 Face Morpher
The first step to pose a character's face is to modify its facial features. More specifically, we need to close its eyes and mouth.
In their paper, Pumarola et al. describe a network that can modify human facial features according to the given Action Units (AUs), which represent movements of facial muscles. As the AU is a very general coding system, their network can do much more than closing eyes and mouth. As a result, I thought it would be effective at the task we have at hand. I tried it out, and it did not disappoint.
However, I did not use everything from the paper because my problem is much simpler than theirs. In particular, their training data do not come in pairs of faces of the same person with different facial expressions. So, they need to use a GAN with a cycle consistency loss to perform unsupervised learning. My data, however, are paired (i.e., I have Images (a) and (b) in Figure 5C for every training example), so I can do vanilla supervised learning. As a result, I only need their generator network.
For completeness, I shall describe Pumarola et al.'s generator in detail. You can see an overview of the architecture in Figure 6A.
Figure 6A. The architecture of the face morpher. This is a reproduction of Figure 3 in Pumarola et al.'s paper.
The network modifies facial expression by producing a change image, which is combined with the original input image through an alpha mask. (Pumarola et al. call it an attention mask, but I use a more common term here.) To do so, the input image and the pose are fed to an encoder-decoder network, which produces a 64-dimensional feature vector for each pixel of the input image. This image of feature vectors is then processed by two separate branches, each consisting of a 2D convolutional unit and a suitable nonlinearity, to produce the alpha mask and the change image. The detailed specification of the network is given in Appendix A.1.
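The final blending step can be sketched in a few lines of NumPy. This is only an illustration of alpha blending, not the network itself; the convention that an alpha of 1 selects the change image (rather than the original) is an assumption, and the function name is hypothetical.

```python
import numpy as np


def alpha_blend(original: np.ndarray,
                change: np.ndarray,
                alpha: np.ndarray) -> np.ndarray:
    """Blend two (H, W, 4) images with a per-pixel (H, W, 1) alpha mask.

    Assumed convention: alpha = 1 takes the change image,
    alpha = 0 keeps the original pixel untouched.
    """
    return alpha * change + (1.0 - alpha) * original
```

Because most pixels of a face do not change when an eye or the mouth closes, a network using such a blend only has to produce a mask that is non-zero around the eyes and mouth, which matches the observation that the architecture is good at changing small parts of the image.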
Pumarola et al. trained their network with a rather complex loss function. To my surprise, a simple L1 pixel difference loss sufficed for my problem. Mathematically, the loss function is given by: \begin{align} \mathcal{L}_{fm} = E_{(I_{rest}, \rho, I_e) \sim p_{data}} \big[ \| I_e - G_{fm}(I_{rest}, \rho) \|_1 \big]. \end{align} Here:
- The subscript $fm$ stands for "face morpher".
- $p_{data}$ denotes the probability distribution of the training data. A sample is a tuple $(I_{rest}, \rho, I_e)$ where $I_{rest}$ is the character image in rest pose (Figure 5Ca), $\rho$ is the pose vector, and $I_e$ is the character image with facial expression (Figure 5Cb).
- $G_{fm}$ denotes the face morpher network.
I optimized the network with the Adam algorithm using the same settings as Pumarola et al.'s: learning rate of $10^{-4}$, $\beta_1 = 0.5$, $\beta_2 = 0.999$, and batch size of 25. The network was trained for 6 epochs (3,000,000 examples), taking about 2 days with my GeForce GTX 1080 Ti GPU.
6.2 Face Rotator
The face rotator consists of two subnetworks. The two-algorithm rotator rotates the character's face with two different algorithms, each with its own strengths and weaknesses. To combine their strengths, the combiner takes the two output images, blends them together with an alpha mask, and also retouches the image to improve quality.
6.2.1 Two-Algorithm Rotator
The architecture is depicted in Figure 6B and is specified in detail in Appendix A.2.
Figure 6B. The architecture of the two-algorithm rotator.
The network can be thought of as an extension to Pumarola et al.'s generator: it has all the units of the generator but now contains a new output pathway. The old one is just Pumarola et al.'s network being asked to rotate the face instead of closing eyes and mouth.
The new pathway produces output using an approach for rotating objects described in Zhou et al.'s paper. The idea is that rotating an object, especially by a small angle, largely involves moving pixels in the input image to different locations. Zhou et al. thus propose computing the appearance flow: a map that tells where in the input image each pixel of the output image should be copied from. This map and the original image are then passed to a pixel sampling unit to generate the output image. In my architecture, the appearance flow is computed simply by passing the output of the encoder-decoder network to a new convolution unit.
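To illustrate what a pixel sampling unit does, here is a minimal NumPy sketch of bilinear sampling driven by an appearance flow. It assumes the flow stores, for every output pixel, the absolute $(x, y)$ source location in pixel units; deep learning frameworks typically use normalized coordinates instead, and the function name here is hypothetical.

```python
import numpy as np


def sample_with_flow(image: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Build an output image by copying colors from `image`.

    image: (H, W, C) float array.
    flow:  (H, W, 2) array; flow[r, c] = (x, y) source location in pixels.
    """
    h, w = image.shape[:2]
    # Clamp source locations so that they stay inside the image.
    x = np.clip(flow[..., 0], 0, w - 1)
    y = np.clip(flow[..., 1], 0, h - 1)

    # Integer corners surrounding each source location.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets, with a trailing axis to broadcast over channels.
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]

    # Bilinear interpolation between the four neighboring pixels.
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom
```

An identity flow reproduces the input, and adding 1 to every $x$ coordinate shifts the picture one pixel to the left. Every output color is an interpolation of existing input colors, which is why this pathway stays sharp but cannot invent pixels for regions that were hidden in the input.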
I trained the network using two different losses. The first is just the L1 pixel difference loss: \begin{align} \mathcal{L}_{fr}^{L1} = E_{(I_e, \rho, I_p) \sim p_{data}} \bigg[ \sum_{k=1}^{2} \| I_p - I_k \|_1 \bigg]. \end{align} Here:
- The subscript $fr$ stands for face rotator, and the superscript $L1$ stands for the L1 loss.
- $p_{data}$ again denotes the probability distribution of the training data. A sample is now $(I_e, \rho, I_p)$ where $I_e$ is the image with facial expression change (Figure 5Cb), $\rho$ the pose, and $I_p$ the fully posed image (Figure 5Cc).
- The index $k$ goes through the 2 output pathways, and $I_k$ denotes the output of the $k$th pathway: \begin{align} I_k = G_k(I_e, \rho). \end{align} Here, $G_1$ denotes Pumarola et al.'s pathway, and $G_2$ denotes Zhou et al.'s.
The second loss is a sum of the L1 pixel difference loss and Johnson et al.'s perceptual feature reconstruction loss: \begin{align} \mathcal{L}_{fr}^{P} = E_{(I_e, \rho, I_p) \sim p_{data}} \bigg[ \sum_{k=1}^{2} \bigg( \| I_p - I_k \|_1 + \sum_{j=1}^{3} \lambda_j \Phi_j(I_k, I_p) \bigg) \bigg]. \end{align} Here:
- The superscript $P$ stands for "perceptual."
- $\Phi_1$, $\Phi_2$, and $\Phi_3$ denote Johnson et al.'s feature reconstruction loss defined using the `relu1_2`, `relu2_2`, and `relu3_3` layers of VGG16, respectively. It is defined as:
\begin{align} \Phi_j(I, \hat{I}) = \| \phi_j(I^{RGB}) - \phi_j(\hat{I}^{RGB}) \|_1 + \| \phi_j(I^{A}) - \phi_j(\hat{I}^{A}) \|_1 \end{align} where $\phi_j(\cdot)$ denotes the feature tensor outputted by the $j$th used layer. The loss is defined as a sum of two L1 norms instead of one because of a difference in the input format: VGG16 takes a 3-channel image as input while all images in this article have 4 channels. To solve the format disagreement, I create two 3-channel images from each 4-channel image $I$. The first image, denoted by $I^{RGB}$, is formed by taking only the RGB channels of $I$. The second image, denoted by $I^{A}$, is a 3-channel grayscale image formed by copying the A channel of $I$ to the three output channels.
- $\lambda_j$ is the weight for the $j$th feature reconstruction loss. I define it in terms of $C_j$, $H_j$, and $W_j$, which are the number of channels, height, and width of $\phi_j(\cdot)$. In effect, the feature reconstruction loss is rescaled so that it "matches" the size of the processed images before being multiplied by a constant I picked arbitrarily.
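The trick of feeding 4-channel images to a 3-channel feature extractor can be sketched as follows. The `features` argument is a stand-in for one VGG16 layer and is purely hypothetical here; any function mapping a 3-channel image to an array will do for the illustration. The L1 norm is taken as a plain sum of absolute differences, with normalization left to the weights.

```python
import numpy as np


def split_rgba(image: np.ndarray):
    """Turn one (H, W, 4) RGBA image into two (H, W, 3) images."""
    rgb = image[..., :3]
    # Copy the alpha channel into three channels to get a grayscale image.
    alpha_gray = np.repeat(image[..., 3:4], 3, axis=-1)
    return rgb, alpha_gray


def feature_reconstruction(features, a: np.ndarray, b: np.ndarray) -> float:
    """Sum of two L1 norms: one on RGB features, one on alpha features."""
    a_rgb, a_alpha = split_rgba(a)
    b_rgb, b_alpha = split_rgba(b)
    rgb_term = np.abs(features(a_rgb) - features(b_rgb)).sum()
    alpha_term = np.abs(features(a_alpha) - features(b_alpha)).sum()
    return float(rgb_term + alpha_term)
```

Splitting the image this way lets the alpha channel contribute to the loss, so errors in the character's silhouette are penalized as well as errors in color.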
Again, I trained the network with Adam, using the same hyperparameters as those of the face morpher. When using the L1 loss, I set the batch size to 25 and trained both pathways for 6 epochs (3,000,000 examples). Training again took about 2 days. However, because evaluating the feature reconstruction loss requires much more memory, I had to reduce the batch size to 8 when training with the perceptual loss. I also found that, starting from a randomized state, training with both the L1 and the perceptual loss terms leads to instability. As a result, I started with the state after being trained with only the L1 term for half an epoch (obtained from a snapshot of training with only the L1 loss), and trained with the full perceptual loss for 6 epochs. Training took 6 days in this case.
6.2.2 Combiner
It is instructive to see the outputs of the two-algorithm rotator to see that one of them alone does not suffice.
Figure 6C. Using various algorithms to rotate a character's neck to the right. The character is Tokino Sora (© Tokino Sora Ch.), and I used the official model downloaded from Niconi 3D.
In Figure 6C, a character's neck is rotated, and, as a result, part of her long hair that was occluded by the body becomes visible. We can see that Pumarola et al.'s algorithm produced a blurry face. I surmise that this is due to requiring the network to produce all the new pixels from a compressed feature encoding, which can lose the high frequency details of the original image. Similar behavior from other encoder-decoder architectures is observed in previous works; for example, those by Tatarchenko et al. and Park et al. Zhou et al.'s algorithm, on the other hand, reuses pixels from the input image and so is capable of producing sharp results.
Nevertheless, it is difficult to reconstruct disoccluded parts by copying existing pixels, especially when the right location to copy from is far away. We see in Figure 6Cb that Zhou et al.'s algorithm used the arm pixels to reconstruct the disoccluded hair. On the other hand, Pumarola et al.'s hair has a more natural color.
By combining the outputs of the two algorithms, we can get a much better result: relocated visible pixels would remain sharp, and pixels of disoccluded parts would have natural colors. The combiner network is depicted in Figure 6D and specified in detail in Appendix A.3.
Figure 6D. The architecture of the combiner.
I use U-Net as the main body of the combiner in order to facilitate per pixel operations. Its output is then transformed into two alpha masks and a change image. The first alpha mask is used to combine the two input images. The second alpha mask and the change image are then combined with the output of the previous step to produce the final output. This last step "retouches" the combined image to improve its quality.
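The two blending steps can be written down directly. In the sketch below, which of the two images each mask favors is an assumption (an alpha of 1 is taken to select the first argument of each blend), and the function name is hypothetical; only the two-stage structure comes from the description above.

```python
import numpy as np


def combine(image_1: np.ndarray, image_2: np.ndarray,
            alpha_1: np.ndarray, alpha_2: np.ndarray,
            retouch: np.ndarray) -> np.ndarray:
    """Two-stage blend of the rotator outputs.

    image_1, image_2: (H, W, 4) outputs of the two rotation pathways.
    alpha_1, alpha_2: (H, W, 1) alpha masks with values in [0, 1].
    retouch:          (H, W, 4) change image used for retouching.
    """
    # Stage 1: pick, per pixel, between the two rotated images.
    combined = alpha_1 * image_1 + (1.0 - alpha_1) * image_2
    # Stage 2: retouch the combined image with the change image.
    return alpha_2 * combined + (1.0 - alpha_2) * retouch
```

Ideally, the first mask would favor the appearance flow output where pixels were visible before rotation and the other output in disoccluded areas, while the second mask would leave most pixels untouched.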
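The combiner was trained separately from the two-algorithm rotator to reduce memory usage. I ran the latter on all training examples to generate the former's input. Again, I experimented with two loss functions. The first is the L1 loss: \begin{align} \mathcal{L}_{cb}^{L1} = E_{(I_e, \rho, I_p) \sim p_{data}} \big[ \| I_p - I_{cb} \|_1 \big]. \end{align} Here:
- The subscript $cb$ stands for "combiner."
- $I_{cb}$ stands for the output of the combiner, and $I_p$ is the fully posed ground-truth image (Figure 5Cc). More precisely: \begin{align} I_{cb} = C(I_1, I_2, \rho) \end{align} where $C$ denotes the combiner itself, and $I_1$ and $I_2$ are the two outputs of the two-algorithm rotator.
The second is the perceptual loss, which adds Johnson et al.'s feature reconstruction terms to the L1 pixel loss, as in the perceptual loss for the two-algorithm rotator. The feature reconstruction terms are multiplied by a weight relative to the L1 pixel loss.
The weight is lower for the combiner as I chose it so that the two losses would have roughly the same magnitude when evaluated on the training and validation sets.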
The training procedure was similar to that of the face morpher. However, for expediency, the duration was 3 epochs instead of 6. The batch size for the L1 loss was 20, and training finished in a day. For the perceptual loss, the batch size was 12, and training lasted 2 days.
7 Evaluation
Performance is evaluated using two metrics. The first is the average per-pixel root mean square error (RMSE) between the network's output and the ground truth image. The second is the average structural similarity index (SSIM). The scores are computed using the 10,000 examples in the test dataset.
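The first metric is simple enough to sketch. The code below assumes RMSE is computed per image over all pixels and channels and then averaged over the test examples; the exact averaging order used for the reported scores is an assumption, and the function names are hypothetical.

```python
import numpy as np


def rmse(output: np.ndarray, truth: np.ndarray) -> float:
    """Root mean square error over all pixels and channels of one image."""
    return float(np.sqrt(np.mean((output - truth) ** 2)))


def average_rmse(outputs, truths) -> float:
    """Average the per-image RMSE over a collection of examples."""
    return float(np.mean([rmse(o, t) for o, t in zip(outputs, truths)]))
```

SSIM is more involved, so in practice one would likely rely on an existing implementation, such as `structural_similarity` in scikit-image, rather than writing it by hand.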
8 Related Works
Image translation. The networks I developed solve an instance of the image translation problem: we are given an image and some optional extra information, and we are asked to produce another image that satisfies some requirements. Pix2pix is a general framework for image translation that casts the problem as creating a conditional generative adversarial network (cGAN) that takes the input image as conditioning information. Subsequent works extend it to allow unsupervised learning and the modification of multiple image attributes by a single network. To animate eyes and mouths, my work relies on Pumarola et al.'s paper, in which the extra information is the AUs describing a human facial expression. All these works use a GAN to automatically discover a loss function tailored to their domains. However, I only used fixed loss functions and was surprised that the approach worked as well as it did.
Object rotation. My work also borrows from previous works that attempt to rotate objects in images. This problem is an instance of image translation where the extra information is the angles to rotate the object by. Tatarchenko et al. trained a neural network to rotate renderings of cars and chairs from the ShapeNet dataset. However, because their network produces the pixels directly and uses L2 loss on the output images, the results are too blurry to be used in media production. Zhou et al. use appearance flow to make the results much sharper, and I use their algorithm to rotate faces. Nevertheless, appearance flow can produce nonsensical pixels in disoccluded areas. Park et al. solve this problem by keeping only the pixels that were visible before rotation and having a GAN fill in the missing disoccluded parts. My approach also fixes disoccluded parts, but it combines appearance flow results with those of another algorithm rather than generating missing pixels from scratch.
Animating human images. Creating animation from a single or a few human images has been researched extensively by the graphics and vision communities, and much progress was made recently thanks to deep learning. These works typically seek to make the human in a given "target" image or video imitate the action of another human in a separate "driving" video. Chan et al. devise a pipeline to make the target person imitate the full-body dance moves of the driving video. It makes extensive use of neural pose estimation and requires a video of the target person in the training phase. Zakharov et al. train an embedder network to map one or several target images into a latent code, which is used to control a generator network that translates a facial landmark image so that it has the face of the target human. Wang et al. propose a general framework for video translation in which target images are transformed into the weights of a part of an image generation network. Their approach can transfer facial expressions as well as full-body poses. Note that these works solve problems that are different from mine because their source of movement is the driving video rather than a sequence of abstract parameters.
Animating artworks. Researchers have also used machine learning to animate drawings. Hamada et al. trained a GAN to generate full-body images of anime characters, conditioned on a stick figure image that specifies the character's pose. Their work is different from mine in two ways. First, the character's appearance is determined by the latent code and so cannot be specified directly as in my work. Second, their GAN requires a full-body stick figure image as conditioning information while my network requires only the pose parameter values. Researchers from Preferred Networks showed on their website that they could smoothly animate facial expressions of anime characters. Their approach seems to be able to generate a variety of mouth and eye shapes, and the outputs are overall sharp and free of artifacts. However, it is unclear whether the character's identity is well preserved by their algorithm: one can see that the irises change color as the eyes are closed. Kevin Frans, who interned at Sizigi Studios (the creator of Waifu Labs), demonstrated a system that can make a Waifu Labs-generated character imitate the actions of another character in a video. As the target character's appearance is controlled by a latent code, the system may not be able to animate existing characters like my approach does. In August 2019, AlgoAge, a Japanese AI startup, promoted DeepAnime: a system that can animate the eyes and mouth of existing characters given voice recordings. The system takes different inputs from my networks, and face rotation does not seem to be a feature. Lastly, Poursaeed et al. trained a network that deforms a 2.5D character model to a given image of the same character with a different pose, allowing in-between frames to be generated by interpolating the model parameters. However, their approach requires human intervention because the model must be created by manually segmenting a reference frame.
9 Conclusion
I have presented a neural network system that can modify facial expression and rotate the face of an anime character, given only a single image of the character looking straight at the viewer. Despite the 2D input, the system can rotate the character's face as if it were a 3D object. It also infers how to plausibly close the character's eyes and mouth, taking into account the fact that these facial features may be occluded. As a result, it can be used to generate talking head animations without creating character models, significantly reducing the cost of animation production.
Another strength of the system is its ease of use. For many previous works, the character being animated is tied to a GAN latent code, making it difficult to customize appearance or to preserve identity as the character moves. On the other hand, my system takes the character image as a direct input and can animate existing characters. Moreover, the character's pose is determined by 6 numerical parameters, allowing it to be controlled by any process that modulates the numbers. I demonstrated controlling characters with UI manipulation and face tracking performed on a live video stream.
Central to this project's success is a scalable way to generate training data by leveraging downloadable 3D models. Working alone in my free time over a period of 6 months and spending next to no money, I was able to create a large, paired dataset that allows straightforward supervised learning. With high-quality training data, I could use relatively simple networks to produce good animations.
I see applications of the system in VTuber content creation and video game production. I believe this work shows that machine learning can be a useful tool in animation.
The approach described in this article, of course, has several limitations:
- The input image must adhere strictly to the specification in Section 4. The character has to stand upright and look straight ahead. Moreover, the image must include the alpha channel. As a result, the system cannot be applied to character images in the wild.
- Currently, the system only knows one way to manipulate the eyes and the mouth: closing them. It would be much more useful if it learned to hallucinate various mouth and eye shapes.
- Generated images are blurry and contain artifacts in disoccluded parts.
Lastly, I see many possible directions for future work:
- Change the training data and process to allow character images in the wild to be animated.
- Fix blurriness and visual artifacts by incorporating generative models.
- Extend the approach to enable incorporating multiple drawings to take advantage of multiple views of the same characters present in almost all character design sheets.
- Enable multiple mouth, eye, and eyebrow shapes in order to make the animation more expressive.
- Make the network recurrent in order to simulate dynamic elements such as hair and cloth movement.
- Infer 2.5D layered models or full 3D models from drawings.
10 Disclaimer
While I am an employee of Google Japan, this work was done in my free time without using Google's resources. My day job is writing backends for a part of Google Maps, and I do not belong to any of Google's research organizations. In other words, this project has nothing to do with my work. Moreover, I do not list myself as affiliated with Google when publishing this article. What is expressed here is my opinion and should not be considered to be the company's.
By the terms of my employment, Google may claim rights to the intellectual property of the invention. I am trying to have the copyright to the software assigned to me via an internal review process, and we will see how that goes. I will also try to publish an academic paper out of this.