Updated 2020.03.21

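Self-supervised learning has drawn very wide discussion recently, and several Zhihu experts have written excellent summaries, listed here for reference: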

Representation Learning with Contrastive Predictive Coding

$z_t=g_{enc}(x_t)$

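Observation sequence $x_t$ -> nonlinear encoder $g_{enc}$ -> latent representation sequence $z_t$

$c_t=g_{ar}(z_{\le t})$

Latent representation sequence $z_{\le t}$ -> autoregressive model $g_{ar}$ -> context latent representation $c_t$ (-> observation $x_{t+k}$)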

The future observation $x_{t+k}$ is not predicted directly with a generative model $p_k(x_{t+k}|c_t)$.

Density ratio: preserves the mutual information between $x_{t+k}$ and $c_t$

$f_k(x_{t+k},c_t)\propto\frac{p(x_{t+k}|c_t)}{p(x_{t+k})}$

log-bilinear model

$f_k(x_{t+k},c_t)=\exp{(z_{t+k}^TW_kc_t)}$
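To make the log-bilinear score concrete, here is a minimal NumPy sketch. The dimensions, the random `W_k`, and the two hand-built latents are toy assumptions chosen only to show that a latent aligned with $W_kc_t$ scores higher than one pointing the opposite way.

```python
import numpy as np

def log_bilinear_score(z_future, c_t, W_k):
    """f_k = exp(z_{t+k}^T W_k c_t) for one (future latent, context) pair."""
    return float(np.exp(z_future @ W_k @ c_t))

rng = np.random.default_rng(0)
dim_z, dim_c = 4, 3                       # hypothetical toy sizes
W_k = rng.normal(scale=0.1, size=(dim_z, dim_c))
c_t = rng.normal(size=dim_c)

z_pos = W_k @ c_t                         # latent aligned with the prediction W_k c_t
z_neg = -z_pos                            # latent pointing the opposite way

score_pos = log_bilinear_score(z_pos, c_t, W_k)   # exp(+||W_k c_t||^2)
score_neg = log_bilinear_score(z_neg, c_t, W_k)   # exp(-||W_k c_t||^2)
```

The score is unnormalized. It only needs to be proportional to the density ratio, which is why it can be used inside a softmax over candidates later on.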

https://blog.csdn.net/u013265285/article/details/69062795

http://licstar.net/archives/328

Three New Graphical Models for Statistical Language Modeling

http://www.doc88.com/p-9089781351111.html

Mutual information: a measure of the dependence between two variables
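For discrete variables, mutual information is $I(X;Y)=\sum_{x,y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$. The sketch below evaluates it on two hand-made 2×2 joint tables (illustrative values, not data from the papers): one independent, one fully dependent.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in nats for a discrete joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)      # marginal p(x), shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)      # marginal p(y), shape (1, ny)
    mask = joint > 0                           # 0 * log 0 is treated as 0
    ratio = joint[mask] / (px @ py)[mask]
    return float(np.sum(joint[mask] * np.log(ratio)))

independent = np.array([[0.25, 0.25], [0.25, 0.25]])
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])

mi_independent = mutual_information(independent)   # 0: knowing X says nothing about Y
mi_dependent = mutual_information(dependent)       # log 2: X determines Y
```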

http://www.omegaxyz.com/2018/08/02/mi/

Data-Efficient Image Recognition with Contrastive Predictive Coding

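1 Framework

1. Unsupervised pre-training (trains the encoder shown in blue): a spatial prediction task

a. patch -> encoder -> mean pooling -> single vector

b. center of image -> context network -> context vector -> predict unseen

2. Classification with the CPC representation

Trained encoder + classifier (context network removed) -> classification result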

2 Details

2.1 Feature Encoder

a patch $x_{i,j}$ -> deep residual network -> mean-pooling -> a single vector

256×256 image -> 64×64 patches (32-pixel overlap) -> 7×7 grid of feature vectors $z_{i,j}=f_\theta(x_{i,j})$
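The 7×7 grid follows from the arithmetic $(256-64)/32+1=7$. A minimal NumPy sketch of the patch extraction, assuming a stride of 32 pixels; the function name `extract_patches` is hypothetical and the encoder itself is not shown:

```python
import numpy as np

def extract_patches(image, patch=64, stride=32):
    """Cut an (H, W, C) image into overlapping patches on a regular grid."""
    h, w = image.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    return np.stack([
        np.stack([
            image[i * stride:i * stride + patch, j * stride:j * stride + patch]
            for j in range(cols)
        ])
        for i in range(rows)
    ])  # shape (rows, cols, patch, patch, C)

image = np.zeros((256, 256, 3), dtype=np.float32)
patches = extract_patches(image)
print(patches.shape)  # (7, 7, 64, 64, 3)
```

Each of the 49 patches would then be passed through the encoder and mean-pooled into one vector $z_{i,j}$.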

2.2 Context Network

$c_{i,j}=f_{context}(z_{i,j})$

PixelCNN (Pixel Recurrent Neural Networks)

PixelCNN (with the softmax removed)

Gated PixelCNN (Conditional Image Generation with PixelCNN Decoders)

1 Gate

$y=\tanh(W_{k,f}*x)\odot\sigma(W_{k,g}*x)$

$*$ denotes convolution, $\odot$ the elementwise product, and $\sigma$ the sigmoid
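A small sketch of the gated activation itself. It assumes the two convolution outputs $W_{k,f}*x$ and $W_{k,g}*x$ come from a single convolution with twice the channels that is split in half, a common implementation choice rather than something stated in these notes; the random array stands in for that convolution output.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_activation(feat_f, feat_g):
    """y = tanh(feat_f) * sigmoid(feat_g), applied elementwise."""
    return np.tanh(feat_f) * sigmoid(feat_g)

rng = np.random.default_rng(0)
channels = 8                                        # hypothetical channel count
conv_out = rng.normal(size=(2 * channels, 5, 5))    # stand-in for the conv output
feat_f, feat_g = np.split(conv_out, 2, axis=0)      # "filter" half and "gate" half
y = gated_activation(feat_f, feat_g)
```

The sigmoid half acts as a gate in (0, 1) that scales the tanh half, so every output stays strictly inside (-1, 1).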

2 Blind spot

Horizontal Stack

1 x (n//2+1) conv with pad & crop: everything after the center pixel is cut off (e.g. 3 becomes 2, 7 becomes 4)

Mask B

2 w/o mask

3 w/ mask

Mask A

Vertical Stack

(n//2+1) x n conv with pad:n//2+1

A single layer in Gated PixelCNN
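The two mask types can be sketched as binary kernels. This is a simplified single-channel version that ignores the per-color-channel ordering of the original PixelCNN; `pixelcnn_mask` is a hypothetical helper name.

```python
import numpy as np

def pixelcnn_mask(n, mask_type):
    """n x n binary mask; 1 marks kernel positions the convolution may use."""
    assert mask_type in ("A", "B")
    mask = np.zeros((n, n), dtype=np.float32)
    center = n // 2
    mask[:center, :] = 1            # every row above the center row
    mask[center, :center] = 1       # pixels left of the center in the same row
    if mask_type == "B":
        mask[center, center] = 1    # type B additionally sees the center itself
    return mask

mask_a = pixelcnn_mask(3, "A")   # [[1, 1, 1], [1, 0, 0], [0, 0, 0]]
mask_b = pixelcnn_mask(3, "B")   # [[1, 1, 1], [1, 1, 0], [0, 0, 0]]
```

The center row of a type-B mask has n//2+1 ones, which matches the 1 x (n//2+1) kernel of the horizontal stack: 2 for n=3 and 4 for n=7.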

NOTE:

pixelCNN https://blog.csdn.net/p_lart/article/details/88602253

gated pixelCNN https://blog.csdn.net/Jasminexjf/article/details/82499513

CODE:

http://sergeiturukin.com/2017/02/24/gated-pixelcnn.html

https://github.com/kundan2510/pixelCNN/blob/master/layers.py

2.3 Predictor

Predict the future feature vector $z_{i+k,j}$ from the current context feature vector $c_{i,j}$ with a linear prediction:

$\hat{z}_{i+k,j}=W_kc_{i,j}$
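A shape-level sketch of the linear prediction on a 7×7 grid. The feature size, the random values standing in for $z$ and $c$, and the helper `predict_offset` are all illustrative assumptions; the point is only how predictions from row $i$ line up with targets in row $i+k$.

```python
import numpy as np

rng = np.random.default_rng(0)
grid, dim = 7, 16                        # 7x7 grid of patches; dim=16 is a toy size
z = rng.normal(size=(grid, grid, dim))   # patch features z_{i,j} (stand-in values)
c = rng.normal(size=(grid, grid, dim))   # context vectors c_{i,j} (stand-in values)

def predict_offset(c, W_k, k):
    """Return z_hat_{i+k,j} = W_k c_{i,j} for every row i that has a target."""
    return c[:-k] @ W_k.T                # shape (grid - k, grid, dim)

k = 2
W_k = rng.normal(scale=0.1, size=(dim, dim))   # one matrix per offset k
z_hat = predict_offset(c, W_k, k)
targets = z[k:]                          # z_{i+k,j}, aligned elementwise with z_hat
```

A separate `W_k` is used for each offset $k$, so predicting several rows ahead means holding several such matrices.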

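3 Contrastive loss (InfoNCE)

Goal: correctly pick out the target from a set of randomly sampled patch representations $\{z_l\}$ drawn from the dataset (a bit like finding the right piece among many jigsaw pieces)

The target probability is computed with a softmax and the loss with cross-entropy, summed over locations and prediction offsets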
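Written out, a loss matching this description looks as follows. This is a reconstruction from the description in these notes (a softmax over the target and the negatives, with the cross-entropy summed over locations $i,j$ and offsets $k$), using the symbols already defined in section 2.3:

```latex
\mathcal{L}_{CPC}
  = -\sum_{i,j,k} \log
    \frac{\exp(\hat{z}_{i+k,j}^T z_{i+k,j})}
         {\exp(\hat{z}_{i+k,j}^T z_{i+k,j}) + \sum_{l} \exp(\hat{z}_{i+k,j}^T z_l)}
```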

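Here the negative samples $\{z_l\}$ are other patches from the same image or from other images. This loss is called InfoNCE; minimizing it maximizes a lower bound on the mutual information between $c_{i,j}$ and $z_{i+k,j}$.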

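Explanation from 鼠老师

Prediction -> classification -> NCE

Degree of match, measured by dot-product similarity: $\hat{z}^T_{i+k,j}z_{i+k,j}$, (1×4096) · (4096×1) -> a single scalar

If $y$ is large where $x$ is large and small where $x$ is small, the inner product as a whole tends to be large, which means $x$ and $y$ are similar.

softmax normalizes the scores to values between 0 and 1

Cross-entropy: $H(p,q)=-\sum_xp(x)\log q(x)=-\sum 1\cdot\log(\mathrm{softmax})$, since $p$ is 1 on the target and 0 elsewhere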
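The chain above (dot-product score, softmax, cross-entropy on the target) can be sketched in a few lines of NumPy. The random vectors, the 16 negatives, and the function name `info_nce` are assumptions for illustration; the 4096-dimensional size follows the example above.

```python
import numpy as np

def info_nce(z_hat, z_target, z_negatives):
    """Cross-entropy of a softmax over [target, negatives] dot-product scores."""
    candidates = np.vstack([z_target[None, :], z_negatives])  # target at index 0
    logits = candidates @ z_hat                               # dot-product similarity
    logits = logits - logits.max()                            # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum())
    return float(-log_softmax[0])                             # -log q(target)

rng = np.random.default_rng(0)
dim, num_neg = 4096, 16
z_target = rng.normal(size=dim)
z_negatives = rng.normal(size=(num_neg, dim))

good_prediction = z_target.copy()    # matches the target exactly
poor_prediction = -z_target          # points away from the target

loss_good = info_nce(good_prediction, z_target, z_negatives)
loss_poor = info_nce(poor_prediction, z_target, z_negatives)
```

A prediction that matches the target gives a loss near zero, while one pointing away from it gives a large loss, which is the signal that trains the encoder, the context network, and $W_k$ jointly.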

NCE loss https://www.cnblogs.com/arachis/p/NCE_Loss.html

An accessible explanation of NCE https://www.zhihu.com/question/50043438

Negative sampling https://blog.csdn.net/qq_28444159/article/details/77514563

NCE + negative sampling https://blog.csdn.net/wizardforcel/article/details/84075703

4 Mismatch between patches and images

symmetric padding

https://blog.csdn.net/guyuealian/article/details/78113325 (with figures)

https://blog.csdn.net/Hansry/article/details/84071316

BN (batch normalization)

https://www.cnblogs.com/wanghui-garcia/p/10877700.html

Unsupervised learning with CPC

Avoiding trivial shortcuts: ways of solving the problem without learning semantics.

1. make the network larger (deeper & wider ResNet-170)

2. layer normalization

3. upward + downward direction: use different context networks

4. patch augmentation: color dropping / randomly flipping patches horizontally / jitter
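A rough sketch of the patch augmentation in item 4. It assumes that color dropping means keeping one randomly chosen color channel and zeroing the other two, and it leaves out jitter; both the interpretation and the helper name `augment_patch` are assumptions, not details taken from these notes.

```python
import numpy as np

def augment_patch(patch, rng):
    """Randomly flip an (H, W, 3) patch horizontally, then drop two color channels."""
    out = patch.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]            # horizontal flip
    keep = rng.integers(0, 3)            # the one channel that survives (assumption)
    dropped = np.zeros_like(out)
    dropped[..., keep] = out[..., keep]
    return dropped

rng = np.random.default_rng(0)
patch = rng.uniform(size=(64, 64, 3))
augmented = augment_patch(patch, rng)
```

Applying the augmentation independently to each patch removes low-level cues, such as color statistics shared across neighboring patches, that would let the network solve the task without learning semantics.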

Semi-supervised learning with CPC

Given a dataset of $N$ unlabeled images $\{x_n\}$:

$\theta^*= \arg \min_\theta \frac 1N \sum_{n=1}^NL_{CPC}[f_\theta(x_n)]$

Given a small dataset of $M$ labeled images $(x_m,y_m)$:

$\phi^*= \arg \min_\phi \frac 1M \sum_{m=1}^ML_{Sup}[g_\phi \circ f_{\theta^*}(x_m),y_m]$
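The second stage can be illustrated with a toy example: the encoder parameters $\theta^*$ stay fixed and only the classifier parameters $\phi$ are optimized. Everything here is a stand-in: a fixed random projection plays the role of the pretrained encoder, the labels are synthetic, and $g_\phi$ is a plain logistic classifier trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 stand-in: pretend theta* is already trained. A fixed random
# projection plays the role of the frozen encoder f_{theta*} (hypothetical).
theta_star = rng.normal(size=(10, 5))

def f_theta(x):
    return np.tanh(x @ theta_star)

# Stage 2: a small labeled set (x_m, y_m) with M = 40 and two classes.
M = 40
x = rng.normal(size=(M, 10))
features = f_theta(x)                      # computed once; the encoder is not updated
y = (features[:, 0] > 0).astype(float)     # toy labels, separable in feature space

def sup_loss(phi, features, y):
    """Mean binary cross-entropy of a logistic classifier g_phi."""
    p = 1.0 / (1.0 + np.exp(-(features @ phi)))
    return float(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))

phi = np.zeros(5)
loss_before = sup_loss(phi, features, y)   # log 2 for an untrained classifier
for _ in range(200):                       # gradient descent on phi only
    p = 1.0 / (1.0 + np.exp(-(features @ phi)))
    phi -= 0.5 * features.T @ (p - y) / M
loss_after = sup_loss(phi, features, y)
```

In practice the classifier would be a deep network on top of real CPC features; the sketch only shows which parameters each of the two objectives touches.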

Experimental results

Experiment 1 (fraction of labels vs. accuracy)

Experiment 2 (other methods)

Experiment 3 (Transfer: classification to detection)

Experiment 4 (Transfer: frozen / fine-tune)

Experiment 5 (Linear separability)

Experiment 6 (Iteration)

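Summary

Our results show that previous work has been far from fully exploiting the potential of context as a supervisory signal for visual representation learning.

By building a more powerful architecture for the CPC task and making the task harder, we trained better image representations that improve on all previous methods on ImageNet by a wide margin, even when training on only 13 images per class. Without fine-tuning, these features perform almost as well, which suggests the potential of generic unsupervised features applicable to many vision tasks.

However, to simplify comparison and allow a thorough exploration of architecture design, this paper uses only spatial prediction and explores context only within a single image; many other prediction tasks might improve these results further, as suggested in [15]. An ideal task would also include time and other modalities, and we believe contrastive feature prediction can serve as a unifying foundation for many of these approaches. Given the rapid progress in self-supervised feature learning, we believe further improvements may lead unsupervised features to outperform supervised ones on many tasks of interest to the vision community.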

Distributed training

https://blog.csdn.net/m0_38008956/article/details/86559432

https://pytorch.org/docs/master/distributed.html
