Generative Adversarial Nets (NIPS 2014)

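I understood the principle but never read the actual mathematical definitions... and got stumped when asked about them...
Studying without digging into the details does not work, sigh
Read it carefully this time!!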

Introduction

The framework simultaneously trains two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation.
In this case, we can train both models using only the highly successful backpropagation and dropout algorithms and sample from the generative model using only forward propagation.

Adversarial nets

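Given data $x$ and the generator's distribution $p_g$, define a prior $p_z(z)$ on the input noise variables $z$. The mapping to data space is $G(z;\theta_g)$, where $G$ is a differentiable function represented by a multilayer perceptron.
Define $D(x;\theta_d)$, a second multilayer perceptron that outputs a single scalar. $D(x)$ is the probability that $x$ came from the data rather than from $p_g$.
Label the inputs to $D$ and train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$. Simultaneously train $G$ to minimize $\log(1 - D(G(z)))$.
The resulting objective (the value function of a two-player minimax game):
$\min_{G}\max_{D}V(D,G)=\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_z(z)}[\log(1-D(G(z)))]$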

Here $\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]$ denotes the expectation of $\log D(x)$ when $x$ is drawn from the distribution $p_{data}(x)$.
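To make the two expectations concrete, here is a minimal sketch that estimates the value function by Monte Carlo sampling. The 1-D Gaussian data and noise distributions, the toy discriminators, and the helper name `estimate_value` are all assumptions chosen for illustration; none of them come from the paper.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def estimate_value(D, G, sample_data, sample_noise, m=10000, seed=0):
    """Monte Carlo estimate of V(D, G) from m data samples and m noise samples."""
    rng = random.Random(seed)
    # First expectation: log D(x) averaged over x ~ p_data.
    real_term = sum(math.log(D(sample_data(rng))) for _ in range(m)) / m
    # Second expectation: log(1 - D(G(z))) averaged over z ~ p_z.
    fake_term = sum(math.log(1.0 - D(G(sample_noise(rng)))) for _ in range(m)) / m
    return real_term + fake_term

# Hypothetical toy setup: data ~ N(4, 1), noise ~ N(0, 1), G is the identity.
sample_data = lambda rng: rng.gauss(4.0, 1.0)
sample_noise = lambda rng: rng.gauss(0.0, 1.0)
G = lambda z: z

# A discriminator that always outputs 1/2 gives log(1/2) + log(1/2) = -log 4.
v_blind = estimate_value(lambda x: 0.5, G, sample_data, sample_noise)

# A discriminator that separates the two distributions reaches a higher value.
D_sharp = lambda x: sigmoid(4.0 * (x - 2.0))
v_sharp = estimate_value(D_sharp, G, sample_data, sample_noise)

print(v_blind, v_sharp)
```

The comparison shows why D wants to maximize $V$: the better it separates real from generated samples, the larger the value becomes.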

the training criterion allows one to recover the data generating distribution as G and D are given enough capacity, i.e., in the non-parametric limit

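Early in training, when G is poor, D can reject its samples with high confidence because they are clearly different from the training data. In that case $\log(1-D(G(z)))$ saturates and its gradient is very small, which hampers the training of G. Instead of minimizing $\log(1-D(G(z)))$, G can be trained to maximize $\log D(G(z))$.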
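The saturation can be checked numerically. The sketch below assumes that D ends in a sigmoid, $D = \sigma(s)$ with logit $s$ (an assumption; the notes only say D outputs a single scalar), and compares the gradient of each generator objective with respect to $s$. The function names are illustrative.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def grad_saturating(s):
    """d/ds of log(1 - sigmoid(s)): the original minimax generator objective."""
    return -sigmoid(s)

def grad_non_saturating(s):
    """d/ds of log(sigmoid(s)): the alternative objective that G maximizes."""
    return 1.0 - sigmoid(s)

# A very negative logit means D confidently rejects the generated sample.
for s in (-6.0, -2.0, 0.0, 2.0):
    print(f"D(G(z))={sigmoid(s):.4f}  "
          f"saturating={grad_saturating(s):+.4f}  "
          f"non-saturating={grad_non_saturating(s):+.4f}")
```

When $D(G(z))$ is close to 0, the first gradient nearly vanishes while the second stays close to 1, which is exactly the regime G is in early in training.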

Theoretical Results

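G implicitly defines a probability distribution $p_g$ as the distribution of the samples $G(z)$ obtained when $z \sim p_z$. We would therefore like Algorithm 1 to converge to a good estimator of $p_{data}$.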

Algorithm 1

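Minibatch stochastic gradient descent.
The number of steps $k$ applied to D is a hyperparameter; the experiments use $k=1$, the simplest and least expensive choice.
The gradient-based updates can use any standard gradient-based learning rule; the experiments used momentum.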

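for number of training iterations do
$\quad$ for $k$ steps do
$\qquad$ sample a minibatch of $m$ noise samples $\{z^{(1)},\dots,z^{(m)}\}$ from $p_z(z)$
$\qquad$ sample a minibatch of $m$ examples $\{x^{(1)},\dots,x^{(m)}\}$ from $p_{data}(x)$
$\qquad$ update the discriminator by ascending its stochastic gradient: $\nabla_{\theta_d}\frac{1}{m}\sum_{i=1}^{m}[\log D(x^{(i)})+\log(1-D(G(z^{(i)})))]$
$\quad$ end for
$\quad$ sample a minibatch of $m$ noise samples $\{z^{(1)},\dots,z^{(m)}\}$ from $p_z(z)$
$\quad$ update the generator by descending its stochastic gradient: $\nabla_{\theta_g}\frac{1}{m}\sum_{i=1}^{m}\log(1-D(G(z^{(i)})))$
end for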
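The loop above can be sketched on a 1-D toy problem. Everything specific here is an assumption made to keep the example self-contained: the models are a logistic discriminator $D(x)=\sigma(wx+c)$ and an affine generator $G(z)=az+b$ instead of multilayer perceptrons, the data are drawn from $N(4,1)$, gradients are written out by hand, and plain SGD replaces the momentum used in the experiments.

```python
import math
import random

def sigmoid(s):
    # Clamp the logit so math.exp cannot overflow.
    s = max(-30.0, min(30.0, s))
    return 1.0 / (1.0 + math.exp(-s))

def train_gan(iterations=200, k=1, m=32, lr=0.05, seed=0):
    """Toy version of the loop with D(x) = sigmoid(w*x + c) and G(z) = a*z + b."""
    rng = random.Random(seed)
    w, c = 0.1, 0.0   # discriminator parameters (theta_d)
    a, b = 1.0, 0.0   # generator parameters (theta_g)
    for _ in range(iterations):
        for _ in range(k):
            zs = [rng.gauss(0.0, 1.0) for _ in range(m)]   # m noise samples
            xs = [rng.gauss(4.0, 1.0) for _ in range(m)]   # m data samples
            gw = gc = 0.0
            for x, z in zip(xs, zs):
                g = a * z + b
                d_real = sigmoid(w * x + c)
                d_fake = sigmoid(w * g + c)
                # Gradient of log D(x) + log(1 - D(G(z))) w.r.t. (w, c).
                gw += (1.0 - d_real) * x - d_fake * g
                gc += (1.0 - d_real) - d_fake
            # Ascend: D maximizes the value function.
            w += lr * gw / m
            c += lr * gc / m
        zs = [rng.gauss(0.0, 1.0) for _ in range(m)]
        ga = gb = 0.0
        for z in zs:
            d_fake = sigmoid(w * (a * z + b) + c)
            # Gradient of log(1 - D(G(z))) w.r.t. (a, b) via the chain rule.
            ga += -d_fake * w * z
            gb += -d_fake * w
        # Descend: G minimizes log(1 - D(G(z))).
        a -= lr * ga / m
        b -= lr * gb / m
    return w, c, a, b

w, c, a, b = train_gan(iterations=50)
print(f"D: w={w:.3f}, c={c:.3f}   G: a={a:.3f}, b={b:.3f}")
```

The sketch only illustrates the alternating update order (k ascent steps for D, then one descent step for G); with such small models it is not expected to reproduce the convergence behaviour discussed in the paper.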

1. Global optimality of $p_g=p_{data}$

With G fixed, the optimal discriminator is
$D^*_G(x)=\frac{p_{data}(x)}{p_{data}(x)+p_g(x)}$
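A short sketch of why this is the maximizer, using the usual pointwise argument. The shorthand $a$, $b$, $y$ and $f$ is introduced here only for the derivation. Rewriting the second expectation as an integral over $x \sim p_g$ gives
$V(D,G)=\int_x \left[p_{data}(x)\log D(x)+p_g(x)\log(1-D(x))\right]dx$
For a fixed $x$, let $a=p_{data}(x)$, $b=p_g(x)$ and $y=D(x)$. The integrand is $f(y)=a\log y+b\log(1-y)$ on $(0,1)$, with
$f'(y)=\frac{a}{y}-\frac{b}{1-y}=0 \;\Rightarrow\; y=\frac{a}{a+b}$
Since $f''(y)=-\frac{a}{y^2}-\frac{b}{(1-y)^2}<0$ whenever $(a,b)\neq(0,0)$, this stationary point is the maximum, and maximizing the integrand at every $x$ maximizes the integral.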
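Substituting $D^*_G$ back gives the training criterion $C(G)=\max_D V(D,G)$. Its global minimum is $-\log 4$, attained if and only if $p_g=p_{data}$ (where $D^*_G(x)=\frac{1}{2}$).
In general:
$C(G)=-\log 4+\mathrm{KL}\left(p_{data}\,\|\,\frac{p_{data}+p_g}{2}\right)+\mathrm{KL}\left(p_g\,\|\,\frac{p_{data}+p_g}{2}\right)$
This can be written in terms of the Jensen–Shannon divergence (JSD):
$C(G)=-\log 4+2\cdot\mathrm{JSD}(p_{data}\,\|\,p_g)$
The JS divergence between two distributions is always non-negative, and it is zero only when they are equal.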
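The identity can be verified numerically on small discrete distributions. The sketch below is only a sanity check: the example probability vectors and the helper names `kl`, `jsd` and `criterion` are made up for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between p and q."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def criterion(p_data, p_g):
    """C(G): the value function evaluated at the optimal discriminator D*."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd + pg == 0:
            continue  # neither distribution puts mass on this point
        d_star = pd / (pd + pg)
        if pd > 0:
            total += pd * math.log(d_star)
        if pg > 0:
            total += pg * math.log(1.0 - d_star)
    return total

p_data = [0.5, 0.3, 0.2]   # illustrative data distribution
p_g = [0.2, 0.2, 0.6]      # illustrative generator distribution

lhs = criterion(p_data, p_g)
rhs = -math.log(4.0) + 2.0 * jsd(p_data, p_g)
print(lhs, rhs)                                       # the two sides agree
print(criterion(p_data, p_data), -math.log(4.0))      # minimum when p_g = p_data
```

Because the JSD term is zero only for identical distributions, any mismatch between $p_g$ and $p_{data}$ pushes $C(G)$ strictly above $-\log 4$.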