[Repost] Image Operations with cGAN

http://www.k4ai.com/imageops/index.html


Figure 1. A typical sample image pair used in the experiment that trains cGAN to erase the background. At left is the input image, at right is the expected output image where the background has been manually erased.

In this report we explore the possibility of using cGAN (Conditional Generative Adversarial Networks) to perform automatic graphic operations on photographs or videos of human faces, learned from examples, similar to operations typically done manually with a software tool such as Photoshop or After Effects.

Motivation

A good part of my research in Machine Learning has to do with images, videos, and 3D objects (e.g., Monocular Depth Perception, Generate photo from sketch, Generate Photo-realistic Avatars, and the How to build a Holodeck series), so I find myself constantly in need of an artist for the tedious task of creating a large number of suitable image/video training datasets for such research. Given that cGAN is a versatile tool for learning some sort of image-to-image mapping, the natural question is whether cGAN is useful for automating some of these tasks.

More importantly, we seek to find out whether cGAN can be used for an atypical type of supervised learning that involves no category labels assigned to the training samples, and whose goal is not categorization. Instead, in our experiments the image pairs used for training embody the non-textual intention of the teacher, and the goal is for the system to learn the image mapping operations that achieve the intended result, so that such operations can be applied successfully to unseen data samples.

For this report we will focus on dealing with human facial images. The image operations investigated include erasing the background, image level adjustment, patching small flaws, as well as translation, scaling, and alignment. Experiments on removing outlier video frames will be reported in a separate post.

This report is part of a series of studies on the possibility of using GAN/cGAN as a latent representation for human faces, which is why the datasets used here consist mostly of images of human faces.

It should be noted that we have used relatively small datasets in these experiments, mainly because this is only an exploratory study. For a more complete study, the more promising directions should be followed up with larger-scale experiments.

Experimental Setup

The setup for the experiments is as follows:

Hardware: (unless noted otherwise) Amazon AWS/EC2 g2.2xlarge GPU instance (current generation), 8 vCPUs, 15GB memory, 60GB SSD.

Software: 

Amazon AWS EC2 OS image (AMI): ami-75e3aa62 ubuntu14.04-cuda7.5-tensorflow0.9-smartmind.

Torch 7, Python 2.7, CUDA 8

cGAN implementation: pix2pix, a Torch implementation for cGAN based on the paper Image-to-Image Translation with Conditional Generative Adversarial Networks by Isola, et al.

Training parameters. Unless noted otherwise, all training sessions use a batch size of 1, L1 regularization, beta1 of 0.5, a learning rate of 0.0002, and horizontal flipping of images to augment the training data. All training images are scaled to 286 pixels wide, then cropped to 256 pixels wide, with a small random jitter applied in the process.
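To make the scale-then-crop preprocessing concrete, below is a minimal sketch (not taken from the pix2pix code) of how one training pair could be scaled to 286 pixels, randomly cropped to 256 pixels, and randomly flipped, using Pillow. It assumes square images for simplicity; the function name and paths are illustrative.

```python
import random
from PIL import Image

LOAD_SIZE, FINE_SIZE = 286, 256  # scale-then-crop sizes described above

def augment_pair(input_path, target_path):
    """Scale both images of a training pair to 286x286, take the same random
    256x256 crop (the small 'jitter'), and flip both horizontally half the time."""
    a = Image.open(input_path).resize((LOAD_SIZE, LOAD_SIZE), Image.BICUBIC)
    b = Image.open(target_path).resize((LOAD_SIZE, LOAD_SIZE), Image.BICUBIC)

    # Random jitter: the same crop offset must be used for input and target.
    x = random.randint(0, LOAD_SIZE - FINE_SIZE)
    y = random.randint(0, LOAD_SIZE - FINE_SIZE)
    a = a.crop((x, y, x + FINE_SIZE, y + FINE_SIZE))
    b = b.crop((x, y, x + FINE_SIZE, y + FINE_SIZE))

    # Horizontal flip, applied to both images, to augment the training data.
    if random.random() < 0.5:
        a = a.transpose(Image.FLIP_LEFT_RIGHT)
        b = b.transpose(Image.FLIP_LEFT_RIGHT)
    return a, b
```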

Experiment #1: erasing image background

Figure 2a. Example of a 5-star (i.e., best) test result, where the generated output image (at right) has the background erased completely.

Figure 2b. Example of a 4-star test result, where the output image is pretty good but shows some vestiges of the background.

Figure 2c. Example of a 3-star test result, where a large area of the background remains, but it is quite likely that the problem can be remedied with more extensive training.

Figure 2d. Example of a 2-star test result, where aside from a large area of background remaining visible, there is also significant encroachment into the left side of the subject.

Figure 2e. Example of a 1-star (i.e., worst) test result, where the erasure pattern is almost random.

The purpose of this experiment is to see whether cGAN (using the pix2pix implementation) can learn to erase the background of photos, in particular those containing human faces. Figure 1 shows a typical training sample.

Datasets: Photos from previous experiments are recycled here, augmented with additional facial images scraped from the Internet. These photos are then paired with target images in which the background has been manually erased. Since the manual preparation of the target images is labor intensive, we started with a very small training dataset of only 118 sample image pairs.
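The pix2pix code used here takes each training sample as a single image with the input and the target placed side by side (as in Figure 1). Below is a minimal sketch of how such pairs could be assembled from a folder of originals and a folder of manually erased counterparts; the directory names are hypothetical.

```python
import os
from PIL import Image

# Hypothetical directories: originals and their background-erased counterparts
# share the same file names; the combined side-by-side pairs go to OUT_DIR.
IN_DIR, TARGET_DIR, OUT_DIR = "photos", "photos_erased", "train"
if not os.path.exists(OUT_DIR):
    os.makedirs(OUT_DIR)

for name in sorted(os.listdir(IN_DIR)):
    a = Image.open(os.path.join(IN_DIR, name)).convert("RGB")
    b = Image.open(os.path.join(TARGET_DIR, name)).convert("RGB").resize(a.size)
    w, h = a.size
    pair = Image.new("RGB", (2 * w, h))
    pair.paste(a, (0, 0))   # input photo on the left
    pair.paste(b, (w, 0))   # manually erased target on the right
    pair.save(os.path.join(OUT_DIR, name))
```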

Training: The training session took 10 hours of computing using the setup described above.

Evaluating test results: A test dataset of 75 samples is used. The output images generated by the trained cGAN are subjectively ranked into the following five categories (see Figures 2a-2e for samples), with a quick tally after the list:

5-stars: almost perfect. Total: 18 samples.

4-stars: pretty close, but with some minor problems. Total: 17 samples.

3-stars: so-so result. Total: 24 samples.

2-stars: not good, but still within bounds. Total: 13 samples.

1-star: complete disaster. Total: 3 samples.
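For reference, the counts above can be turned into shares of the 75-sample test set with a few lines of arithmetic (the numbers are copied from the list above):

```python
# Star-rating counts from the 75-sample test set listed above.
counts = {5: 18, 4: 17, 3: 24, 2: 13, 1: 3}
total = sum(counts.values())  # 75

for stars in sorted(counts, reverse=True):
    share = 100.0 * counts[stars] / total
    print("%d-star: %2d samples (%4.1f%%)" % (stars, counts[stars], share))

acceptable = counts[5] + counts[4]  # 4 stars and up: 35 of 75, about 47%
print("4+ stars: %d/%d = %.1f%%" % (acceptable, total, 100.0 * acceptable / total))
```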

The test results above may seem unimpressive, but the following facts should be kept in perspective:

The training dataset is actually quite tiny, considering that cGAN training typically requires a very large dataset to properly learn the underlying data distribution. It is in fact quite impressive that such results can be achieved with so little training, which leads us to conclude that the direction is quite promising.

The training dataset is also not very diverse (aside from being small). For example, it does not contain photos that are black-and-white, far off-center, or close-ups with no background, so it naturally does not test well against those types of photos (which were in fact intentionally included in the test dataset). Adding more training photos of those types has been shown to almost always improve the result.

Overall we judge that cGAN could work well for background erasure, provided that a large and sufficiently diverse training dataset (relative to the expected test samples) is used.

Experiment #2: image alignment

In this experiment we try to get cGAN to learn how to align images from examples, which involves translation, scaling, and cropping.

We actually did not expect this to work, but ran the experiment anyway so that it can serve as a goalpost for others to explore further.

Figure 3a. A typical training image pair for the alignment experiment. Here the goal is for cGAN to learn to convert the input image (at left) to a target image (at right), through scaling, translation, and cropping, so that the two eyes are roughly at the same positions for all samples.

Figure 3b. An example of the test result, where the output image (at right) bears little resemblance to the input image (at left) and is very far from the expected output image (not shown here).

Datasets: Photos from previous experiments are reused here (manually cropped), augmented with their original uncropped versions. See Figure 3a for a sample training image pair.

Since the manual preparation of the target images is labor intensive, we started with a very small training dataset of only 18 sample image pairs to probe the possibilities.

Training: The training session took 4 hours, after which cGAN was able to map the image pairs in the training set almost perfectly. The test results, however, are an entirely different matter.

Analysis

All test results look like Figure 3b, where the output image (at right) looks like a jumble of many faces, roughly in the right position but totally unrecognizable. In other words, this experiment failed miserably, as expected.

So why wouldn't cGAN work for this task? To human eyes the translation and scaling of an image are fairly simple operations, but that is not the case for cGAN. The successive convolutional layers in cGAN are very good at capturing local dependencies among nearby pixels, but global operations such as scaling or translation, which affect all pixels, are not what cGAN is designed for. The cGAN design is still in its infancy, and as it stands it does not handle translation, scaling, and rotation well.

So what would it take to make this work? One approach is to incorporate something like Spatial Transformer Networks to see whether it makes a difference, which we shall explore in a future post.
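To illustrate what a spatial transformer adds: it predicts a handful of global parameters (scale, translation) with a small localization network and then resamples the image accordingly, which is exactly the kind of whole-image operation a stack of convolutions handles poorly. Below is a minimal NumPy sketch of just the resampling step, with the parameters supplied by hand rather than predicted by a network.

```python
import numpy as np

def affine_resample(img, scale, tx, ty):
    """Resample a grayscale image under the map
    (x_src, y_src) = (scale * x_dst + tx, scale * y_dst + ty),
    using nearest-neighbor sampling for brevity. In a spatial transformer,
    (scale, tx, ty) would come from a learned localization network."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]                        # destination pixel grid
    src_x = np.clip(np.round(scale * xs + tx).astype(int), 0, w - 1)
    src_y = np.clip(np.round(scale * ys + ty).astype(int), 0, h - 1)
    return img[src_y, src_x]

# Example: zoom in 2x and shift the sampling window 10 pixels to the right.
img = np.random.rand(256, 256)
out = affine_resample(img, scale=0.5, tx=10, ty=0)
```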

Experiment #3: video processing

In this experiment we want to find out if cGAN can be used for some simple video processing, including background erasure, image tone level adjustments, and making small repairs.

In other words, we seek to find out whether cGAN can be used to learn the intended image operations from a small number of training samples, and then apply the operations to an entire video to achieve satisfactory result.

In this experiment we treat a video essentially as a collection of images, without taking advantage of its sequential nature (which we shall explore in another report). While this is similar to Experiment #1 with regard to background erasure, there are some differences:

We also want to try incorporating other image operations at the same time, such as image tone level adjustment and the patching of minor flaws.

The video sequence consists essentially of images of the same person, which allows us to explore more efficient training methods. Here we apply the technique of drill training to get good results from very few training samples. Drill training refers to using a small number of images for intense training, with the goal of getting good results on this particular training set, while possibly increasing test errors against a wider test dataset.

Experimental setup: The setup is the same as in Experiments #1 and #2, except for the following:

Figure 4a. Example of a training sample, where aside from levels adjustment and background erasure, the intruding fingers of another person at the lower-left corner are also erased and patched manually.


Figure 4b. A short video segment that demonstrates the result of applying a trained cGAN model to an original video (at left), which effectively erases the background as well as brightens up the image (at right). Video source credit: Interview with Adele - The bigger your career gets, the smaller your life gets | Skavlan, acquired for research purposes per the YouTube fair use guidelines

Figure 4c. Example of a more successful patch-up operation, where a trained cGAN attempts to patch the intruding fingers at the lower corner with color patterns that somewhat match the clothing. Some other observed cases are less successful at this.

Hardware: a standard laptop (Intel i7 CPU with 8GB RAM) is used instead due to resource constraints. This hardware runs the experiment about 10-15 times slower than using an AWS/EC2 g2.2xlarge GPU instance.

Model: a pix2pix cGAN model trained in Experiment #1 is used as the initial model.

Dataset: a video segment of a celebrity interview is used for the test. The video is sampled at 10 fps and cropped to 400x400 pixels at the image center (see the frame-extraction sketch after the list below). 1185 frames are selected for this test, most of which (1128 frames) have the same person as the main subject. No manual alignment or color adjustment is applied. Out of these 1185 frames, 22 are selected and manually modified for use as the training dataset, with the rest used as the test dataset. The manual modifications made to the training dataset are as follows:

The background is erased to pure white.

Images are adjusted using the Photoshop Levels tool for better brightness and contrast.

Minor intrusions of other people into the image are erased and patched up as appropriate (see Figure 4a).
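As referenced above, here is a minimal sketch of the 10 fps sampling and 400x400 center crop using OpenCV. The file names are illustrative, and in practice one would also check that each frame is at least 400 pixels in both dimensions.

```python
import os
import cv2

VIDEO, OUT_DIR, TARGET_FPS, CROP = "interview.mp4", "frames", 10, 400
if not os.path.exists(OUT_DIR):
    os.makedirs(OUT_DIR)

cap = cv2.VideoCapture(VIDEO)
src_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0      # fall back if fps is unreported
step = max(1, int(round(src_fps / TARGET_FPS)))  # keep every Nth frame for ~10 fps

i = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if i % step == 0:
        h, w = frame.shape[:2]
        y0, x0 = (h - CROP) // 2, (w - CROP) // 2          # center the crop window
        cv2.imwrite("%s/%06d.png" % (OUT_DIR, saved),
                    frame[y0:y0 + CROP, x0:x0 + CROP])     # 400x400 center crop
        saved += 1
    i += 1
cap.release()
```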

For training we apply the technique of drill training to reduce the training time and the number of samples required. Overall the training took several days using the non-GPU setup.
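The original training was run with the pix2pix tooling, but one simple way to approximate drill training without modifying the data loader is to oversample the 22 hand-edited pairs, so that each pass over the training directory revisits them many times (each copy still receives its own random crop and flip when loaded). A sketch with hypothetical directory names:

```python
import os
import shutil

SRC, DST, COPIES = "train_pairs", "train_drill", 20   # 22 pairs x 20 copies
if not os.path.exists(DST):
    os.makedirs(DST)

for name in os.listdir(SRC):
    base, ext = os.path.splitext(name)
    for k in range(COPIES):
        # Distinct file names make the loader treat each copy as a separate
        # sample; random jitter/flips still differ between copies per epoch.
        shutil.copy(os.path.join(SRC, name),
                    os.path.join(DST, "%s_%02d%s" % (base, k, ext)))
```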

Figure 4b shows a 10-second segment of the test result. Our observations of the result are:

The background erasure has worked remarkably well, even with only 22 training samples. The outline of the clothing appears a bit wavy, which is due to the difficulty of guessing the outline of dark clothing over a dark background, and to the fact that the current method provides no continuity between frames.

The levels adjustment applied in the training samples, which brightens up the images, is successfully transferred to the test results and makes the resulting video brighter.

cGAN can be seen patching up the intruding-fingers problem in some of the test samples (see Figure 4c), even though only two such patching examples (see Figure 4a) were provided in the training dataset. The result is by no means satisfactory, but it points to the possibility of getting much better results with more training.

Conclusion

This is a preliminary study using very small datasets to demonstrate the possibilities. Further comprehensive experimentation is definitely needed.

The experiments conducted in this report are not meant merely as fun applications of the cGAN method. They show that, as an atypical type of supervised learning, cGAN can be used to perform certain types of image operations for practical purposes.

Overall, in our limited experiments we have shown that operations such as background erasure and image levels adjustment work well. For such image operations, training on just 2% of the frames in a video is sufficient to transfer the operations to the entire video with good results. The operation of patching up minor flaws worked to some limited degree.

The operations of scaling and alignment did not work at all, as expected. This shows a limitation of the current cGAN architecture. We may conduct a more detailed study on this in a separate post.

It is worth noting that the background erasure operation may seem to bear some surface resemblance to semantic segmentation (e.g., as described in this paper), in the sense that both can be used to separate certain recognizable targets out of an image. They are in fact very different, because cGAN is generative, and the method here does not require any training on category labels.

Going Forward

Following are some planned follow-up studies:

Figure 5. A t-SNE visualization of the 512-dimensional data points in a latent representation space, with each point representing a frame in a video. The outlier video frames are shown in red.

As extensions to Experiment #3, explore how to take advantage of the sequential nature of a video, where adjacent frames are similar, in order to achieve better test quality or faster training.

Use cGAN to synthesize missing frames in a video, or for creating smooth slow-motion replay.

Detect outlier video frames that are substantially different from the training dataset. This can be used for carrying out semi-automatic cleanup of a video in order to remove unwanted frames, which is quite useful for my own research since processing video for machine learning is a very tedious process.

The idea here is that GAN is generally believed to be able to learn a meaningful latent representation, which implies that unwanted data samples could be detected as outliers far apart from the training dataset in this latent representation (see Figure 5 and the sketch after this list). It should be interesting to find out whether this in fact works well in the context of the automatic removal of unwanted video frames.

Use cGAN for image/video indexing and retrieval. The idea here is related to the last point regarding the latent representation learned by cGAN, since a good latent representation should make it easier to do indexing and retrieval.
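Referring back to the outlier-detection idea above: assuming a 512-dimensional latent vector is available per frame (e.g., from an encoder), a Figure 5 style t-SNE scatter with a crude distance-based outlier flag could be sketched as follows. The latent vectors here are faked with random data purely for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# One 512-d latent vector per video frame; replaced here by random data
# purely for illustration (in practice these would come from an encoder).
latents = np.random.randn(1185, 512).astype(np.float32)

# Crude outlier flag: frames far from the latent-space centroid (2-sigma cut).
dist = np.linalg.norm(latents - latents.mean(axis=0), axis=1)
outlier = dist > dist.mean() + 2 * dist.std()

# Project to 2-D for a Figure 5 style scatter plot.
xy = TSNE(n_components=2).fit_transform(latents)
plt.scatter(xy[~outlier, 0], xy[~outlier, 1], s=4, c="gray")
plt.scatter(xy[outlier, 0], xy[outlier, 1], s=12, c="red")  # candidate outliers
plt.show()
```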

Acknowledgments

I want to show my appreciation to the pix2pix team for their excellent paper and implementation, without which this work would have been much harder to complete.

Last but not least, I want to show my gratitude to Fonchin Chen for helping with the unending process of collecting and processing the images needed for the project.

References

Isola et al., Image-to-Image Translation with Conditional Generative Adversarial Networks, 2016.

pix2pix, a Torch implementation for cGAN

kaihuchen

Kaihu is an AI researcher, software architect, programmer, and entrepreneur. He has a PhD in AI, four U.S. software patents, and won 3rd place in the 1987 World Computer Go Competition.

U. S. A. http://smesh.net/p/ai
