Paper reading: "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning"
Paper: https://arxiv.org/abs/2006.07733
Chinese translation: https://blog.csdn.net/qq_41344430/article/details/108362989
Modern networks are large and hard to train; supervised training demands a huge amount of labeled data, which is costly. Self-supervised learning offers a way to train networks so that they generalize better. A network applied directly to one's own labeled data without pretraining may perform poorly and converge slowly. If the network first completes self-supervised training on a large-scale dataset, all it needs to acquire is strong feature-extraction ability; in downstream tasks, whether the weights are frozen or fine-tuned on labeled data, it then provides strong features and converges much faster. This approach mainly prepares a network for transfer learning, especially when data is very scarce: without strong pretrained features, downstream learning on a specific task will not work well, and generalization also suffers.
Before discussing the paper itself, let us start with the collapse problem in self-supervised training. Most current self-supervised methods learn features by constraining the feature difference between different views of the same image, where the views are produced by specified data augmentations. If that is all we do (positive pairs only), the network can easily learn to output a constant value for every input: the feature difference is then zero, which perfectly satisfies the objective but is useless. This is training collapse. A natural fix is to not only pull together the features of the same data but also push apart the features of different data; in other words, to use negative pairs as well as positive pairs. This does solve the collapse problem, but it creates a new one: a large number of negative pairs is needed to learn sufficiently strong features, which is why representative methods such as the SimCLR family require a large batch size to work well. The novelty of BYOL, proposed in this paper, is that it uses no negative pairs at all; it avoids degenerate training by adding a predictor and stop-gradient.
First apply data augmentation to the image, extract features with a ResNet, then transform them with an MLP. The online network then uses another MLP to predict the target network's output. Why does the online network use two MLPs? The first MLP (the projector) is there because SimCLR found it helpful, and the authors simply followed that design. The second MLP (the predictor) is crucial to this paper. Finally, the two output features are used to compute the loss. The gradient of the loss is back-propagated only through the online network; the double slash in the figure denotes stop-gradient. The target network's parameters are a slow-moving (exponential moving) average of the online network's.
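The loss computed between the two output features is, per the BYOL paper, a mean squared error between L2-normalized vectors, which reduces to 2 minus twice their cosine similarity. A minimal sketch in plain Python, using toy vectors in place of real network outputs (the function names `byol_loss` and `l2_normalize` are my own, not from the paper):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def byol_loss(online_pred, target_proj):
    """Normalized MSE: ||p - z||^2 = 2 - 2 * <p, z> for unit vectors p, z.

    online_pred: output of the online network's predictor.
    target_proj: output of the target network's projector
                 (treated as a constant, i.e. stop-gradient).
    """
    p = l2_normalize(online_pred)
    z = l2_normalize(target_proj)
    dot = sum(a * b for a, b in zip(p, z))
    return 2.0 - 2.0 * dot
```

Vectors pointing the same way give a loss of 0; orthogonal vectors give the maximum of 2 (opposite vectors would give 4). In the real method this loss is also symmetrized by swapping the two augmented views.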
An unsupervised version of Mean Teacher
Among these methods, mean teacher (MT) also uses a slow-moving average network, called teacher, to produce targets for an online network, called student. A consistency loss between the softmax predictions of the teacher and the student is added to the classification loss.
Description of BYOL
Overall, BYOL consists of two parts, an online network and a target network, as the flow chart above shows. The online network is trained by constraining the mean squared error (MSE) between the features output by the two networks, while the target network's parameters are updated from the freshly updated online network and the current target network parameters. This is the slow-moving average mentioned in the paper, an idea inspired by reinforcement learning.
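The slow-moving average update can be sketched in a few lines; here the target weights are a list of floats and `tau` is the decay rate (the paper starts it at 0.996 and anneals it toward 1; the value below is just for illustration):

```python
def ema_update(target_params, online_params, tau=0.99):
    """Exponential moving average: xi <- tau * xi + (1 - tau) * theta.

    No gradient ever flows into the target network; its weights are
    only ever updated through this averaging step.
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]
```

With `tau` close to 1, the target network changes slowly, giving the online network a stable (but gradually improving) regression target.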
The online network is defined by a set of weights θ and is comprised of three stages: an encoder f_θ, a projector g_θ and a predictor q_θ.
The target network has the same architecture as the online network, but uses a different set of weights ξ.
Bootstrapping
Bootstrapping is any test or metric that uses random sampling with replacement (e.g. mimicking the sampling process), and falls under the broader class of resampling methods. Bootstrapping assigns measures of accuracy (bias, variance, confidence intervals, prediction error, etc.) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.
Bootstrapping estimates the properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. One standard choice for an approximating distribution is the empirical distribution function of the observed data. In the case where a set of observations can be assumed to be from an independent and identically distributed population, this can be implemented by constructing a number of resamples with replacement, of the observed data set (and of equal size to the observed data set).
It may also be used for constructing hypothesis tests. It is often used as an alternative to statistical inference based on the assumption of a parametric model when that assumption is in doubt, or where parametric inference is impossible or requires complicated formulas for the calculation of standard errors.
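The resampling-with-replacement procedure described above is easy to demonstrate. A small sketch (the function name `bootstrap_means` is my own) that estimates the sampling distribution of the sample mean:

```python
import random

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Draw resamples with replacement, each the same size as the
    observed data set, and record the mean of each resample.

    The resulting list approximates the sampling distribution of
    the sample mean, from which variance or confidence intervals
    can be read off.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(data) for _ in range(len(data))]
        means.append(sum(sample) / len(sample))
    return means
```

For example, `bootstrap_means([1, 2, 3, 4, 5])` yields 1000 resample means whose spread estimates the standard error of the mean, with no parametric assumptions about the underlying distribution.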
References:
https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
https://zhuanlan.zhihu.com/p/343288895
https://blog.csdn.net/u014380165/article/details/110408249
https://blog.csdn.net/weixin_44070509/article/details/120241756
https://zhuanlan.zhihu.com/p/163811116
https://blog.csdn.net/weixin_48866452/article/details/117991840