Semi-supervised learning

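In addition to the R labelled examples there is a set of U unlabeled examples, usually U >> R (far more unlabeled data than labelled data).
Transductive learning: the unlabeled data is the testing data. Training uses the features of the testing data, but never its labels.
Inductive learning: the unlabeled data is not the testing data. Training does not look at the testing data at all.
Why it works: the features of unlabeled data carry useful information. For example, in the figure below, the distribution of the unlabeled samples determines where the SVM hyperplane should be drawn: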

who knows

Why semi-supervised learning?

Collecting data is easy, but collecting “labelled” data is expensive
We do semi-supervised learning in our lives.

Why does semi-supervised learning help?
Semi-supervised learning always comes with some assumptions; whether it helps depends on whether those assumptions are reasonable.

why semi

Outline

-Semi-supervised generative model
-Low-density separation assumption
-Smoothness assumption
-Better Representation

Semi-supervised generative model

supervised generative model

semi-supervised generative model

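Unlabeled data affects the estimates of the probabilities and therefore the decision boundary.
Start from an initial set of parameters θ.
Step 1: compute the posterior probability of the unlabeled data under the current model θ.
Step 2: update the model θ, then go back to Step 1.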

semi-supervised computation

The result the algorithm converges to depends on the initial values.
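The two steps above can be sketched as follows. This is a minimal sketch, not the lecture's exact update rule: it assumes Gaussian class-conditional densities with one shared covariance matrix, at least one labelled example per class, and an initial θ estimated from the labelled data alone. The function names (`m_step`, `log_gaussian`, `semi_supervised_em`) are made up for illustration.

```python
import numpy as np

def m_step(X, resp):
    """Update priors, class means and the shared covariance from soft assignments."""
    Nk = resp.sum(axis=0)                      # effective number of points per class
    priors = Nk / len(X)
    means = (resp.T @ X) / Nk[:, None]
    d = X.shape[1]
    cov = np.zeros((d, d))
    for c in range(resp.shape[1]):
        diff = X - means[c]
        cov += (resp[:, c, None] * diff).T @ diff
    cov = cov / len(X) + 1e-6 * np.eye(d)      # small ridge keeps cov invertible
    return priors, means, cov

def log_gaussian(X, mean, cov):
    """Log density of a multivariate normal at each row of X."""
    d = X.shape[1]
    diff = X - mean
    maha = np.sum(diff @ np.linalg.inv(cov) * diff, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def semi_supervised_em(X_lab, y_lab, X_unl, n_classes=2, n_iter=20):
    resp_lab = np.eye(n_classes)[y_lab]        # labelled points keep their known class
    X_all = np.vstack([X_lab, X_unl])
    priors, means, cov = m_step(X_lab, resp_lab)   # initial theta (an assumption)
    for _ in range(n_iter):
        # Step 1: posterior P(C | x_u) of every unlabeled point under current theta
        log_lik = np.column_stack([
            np.log(priors[c]) + log_gaussian(X_unl, means[c], cov)
            for c in range(n_classes)
        ])
        log_lik -= log_lik.max(axis=1, keepdims=True)   # numerical stability
        resp_unl = np.exp(log_lik)
        resp_unl /= resp_unl.sum(axis=1, keepdims=True)
        # Step 2: update theta using labelled data and soft-labelled unlabeled data
        priors, means, cov = m_step(X_all, np.vstack([resp_lab, resp_unl]))
    return priors, means, cov, resp_unl
```

A different initial θ can lead to a different fixed point, which is the sensitivity to initial values noted above.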

Low-density Separation Assumption

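The "black-or-white" assumption: the classes are separated by a clear boundary that runs through a low-density region.

black-or-white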

Self-training model

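First train a model on the labelled data, then apply this model to the unlabeled data to generate pseudo-labels.
Pick a set of examples from the unlabeled data and add them to the labelled data set. How to pick them is up to you; for example, you can assign a weight to each unlabeled example.
Then repeat the whole process.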
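The loop above can be sketched as follows. This is only an illustration under assumptions of my own: the model is a stand-in nearest-centroid classifier (any classifier that outputs class probabilities would do), and the selection rule is a confidence threshold, which is one possible choice among many. `fit_centroids`, `predict_proba` and `threshold` are hypothetical names.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    """Stand-in model: one mean vector per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_proba(centroids, X):
    """Softmax over negative squared distances to each class centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def self_training(X_lab, y_lab, X_unl, n_classes=2, threshold=0.9, max_rounds=10):
    X_lab, y_lab, X_unl = X_lab.copy(), y_lab.copy(), X_unl.copy()
    for _ in range(max_rounds):
        if len(X_unl) == 0:
            break
        # Train on the current labelled set, then pseudo-label the unlabeled set
        centroids = fit_centroids(X_lab, y_lab, n_classes)
        proba = predict_proba(centroids, X_unl)
        pseudo = proba.argmax(axis=1)              # hard pseudo-labels
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        # Move the confident examples into the labelled set and repeat
        X_lab = np.vstack([X_lab, X_unl[confident]])
        y_lab = np.concatenate([y_lab, pseudo[confident]])
        X_unl = X_unl[~confident]
    return fit_centroids(X_lab, y_lab, n_classes), X_lab, y_lab
```

Examples the model is unsure about stay unlabeled, so a wrong early guess on an ambiguous point is less likely to contaminate later rounds.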

Self-training

Self-training vs. generative model

Hard label (force-assign a single class) vs. soft label (keep the class probabilities)
Suppose the model is a neural network: which one works?
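One way to think about the question, as a sketch under assumptions: the network ends in a softmax, is trained with cross-entropy, and the pseudo-label is treated as a fixed target. The gradient of the loss with respect to the logits is then softmax(z) - target. A soft label that equals the network's own output gives a zero gradient, so nothing is learned, while a hard label still pushes the output toward one class. The logits below are made-up numbers.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_grad_wrt_logits(z, target):
    """Gradient of cross-entropy(target, softmax(z)) with respect to the logits z."""
    return softmax(z) - target

z = np.array([0.8, 0.0])                  # hypothetical logits, softmax is about [0.69, 0.31]
y = softmax(z)

soft_target = y.copy()                    # soft label: the model's own output
hard_target = np.eye(len(z))[y.argmax()]  # hard label: one-hot of the most likely class

grad_soft = ce_grad_wrt_logits(z, soft_target)   # exactly zero: no update
grad_hard = ce_grad_wrt_logits(z, hard_target)   # non-zero: pushes class 0 up
```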

Hard v.s. Soft label

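Advanced version: Entropy-based Regularization

Do not force-assign a label; instead assume the output follows some distribution over the classes, and prefer outputs whose distribution is concentrated on one class.
How to measure how concentrated y is? Use its entropy.
The smaller the entropy, the more concentrated the distribution.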
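As a small sketch of the idea: the entropy of an output distribution y is E(y) = -Σ_m y_m ln y_m, and it can be added to the supervised loss as a penalty on the unlabeled outputs. The weight `lam` and the function names are my own choices, and cross-entropy is assumed for the labelled part.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Entropy of each distribution along the last axis (eps avoids log(0))."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps), axis=-1)

def entropy_regularized_loss(p_lab, y_lab, p_unl, lam=0.1, eps=1e-12):
    """Cross-entropy on labelled outputs plus lam * entropy of unlabeled outputs."""
    p_lab = np.asarray(p_lab, dtype=float)
    picked = p_lab[np.arange(len(y_lab)), y_lab]   # probability given to the true class
    cross_entropy = -np.sum(np.log(picked + eps))
    return cross_entropy + lam * np.sum(entropy(p_unl, eps))
```

A one-hot output has entropy close to 0 and a uniform output over 5 classes has entropy ln 5, so minimizing this loss favours confident predictions on the unlabeled data.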

entropy

Semi-supervised SVM

Semi-supervised SVM enumerates every possible labelling of the unlabeled data and fits an SVM for each one, then keeps the labelling that gives the largest margin with the smallest error.
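A toy sketch of the enumeration, under stated assumptions: labels are -1/+1, the SVM is linear with the soft-margin objective 0.5·||w||² + C·Σ hinge, and it is solved only approximately by a hand-written subgradient descent (a stand-in for a real SVM solver). `fit_linear_svm` and `brute_force_s3vm` are hypothetical names.

```python
import itertools
import numpy as np

def fit_linear_svm(X, y, C=1.0, lr=0.001, n_steps=5000):
    """Approximate soft-margin linear SVM; returns the best (objective, w, b) seen."""
    w, b = np.zeros(X.shape[1]), 0.0
    best = (np.inf, w.copy(), b)
    for _ in range(n_steps):
        margins = y * (X @ w + b)
        # Small ||w|| means a large margin; the hinge term counts the errors
        obj = 0.5 * w @ w + C * np.maximum(0.0, 1.0 - margins).sum()
        if obj < best[0]:
            best = (obj, w.copy(), b)
        viol = margins < 1.0                     # points inside the margin or misclassified
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w = w - lr * grad_w
        b = b - lr * grad_b
    return best

def brute_force_s3vm(X_lab, y_lab, X_unl, C=1.0):
    """Try every labelling of X_unl and keep the one with the lowest SVM objective."""
    X = np.vstack([X_lab, X_unl])
    best = None
    for guess in itertools.product([-1.0, 1.0], repeat=len(X_unl)):
        y = np.concatenate([y_lab, guess])
        obj, w, b = fit_linear_svm(X, y, C)
        if best is None or obj < best[0]:
            best = (obj, np.array(guess), w, b)
    return best
```

With U unlabeled points there are 2^U labellings, so this enumeration is only feasible for a handful of points.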

semi svm

Smoothness Assumption

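Near vermilion you turn red, near ink you turn black (you resemble your neighbors).
Assumption: the distribution of x is not uniform; it is dense in some regions and sparse in others. If two x are close in a high-density region, they have the same y.
"Close" here means connected by a high-density path.

Smoothness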

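Smoothness in document classification

Because the vocabulary is huge, your labelled data and unlabeled data may have no overlap at all.

Smoothness in document classification 1

But if you collect enough unlabeled data, some overlap will appear.

Smoothness in document classification 2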

How to implement it?

Method 1: Cluster and then label

cluster

This method does not always work, because it needs the clustering to be very good.
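A minimal sketch of cluster-then-label, with assumptions of my own: the clustering is plain k-means with a deterministic farthest-first initialisation, each cluster takes the majority label of the labelled points that fall into it, and `-1` marks unlabeled points whose cluster contains no labelled point. Class labels are assumed to be non-negative integers.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain k-means; the first center is X[0], the rest are picked farthest-first."""
    X = np.asarray(X, dtype=float)
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        assign = ((X[:, None] - centers[None]) ** 2).sum(axis=2).argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = X[assign == c].mean(axis=0)
    return ((X[:, None] - centers[None]) ** 2).sum(axis=2).argmin(axis=1)

def cluster_then_label(X_lab, y_lab, X_unl, k):
    """Cluster all points together, then label each cluster by majority vote."""
    assign = kmeans(np.vstack([X_lab, X_unl]), k)
    lab_assign, unl_assign = assign[:len(X_lab)], assign[len(X_lab):]
    y_unl = np.full(len(X_unl), -1)            # -1: no labelled point in this cluster
    for c in range(k):
        votes = y_lab[lab_assign == c]
        if len(votes) > 0:
            y_unl[unl_assign == c] = np.bincount(votes).argmax()
    return y_unl
```

If the clusters do not line up with the classes, every point in a mixed cluster gets the same (possibly wrong) label, which is why the method depends so heavily on the clustering.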

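Method 2: Graph-based approach

Represent every data point as a node in a graph; if two points are connected by a path in the graph, they belong to the same class.
How to build the graph?
Sometimes it comes for free, e.g. hyperlinks between webpages or citations between papers,
but sometimes you have to work out how to build it yourself:

Define a similarity measure between two x;
Add edges: KNN, ε-neighborhood;
Give each edge a weight proportional to the similarity of the two x it connects, e.g. the Gaussian Radial Basis Function s(x_i, x_j) = exp(-γ ||x_i - x_j||^2).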
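The three steps above can be sketched as follows. It is a minimal sketch: the RBF width `gamma`, the neighbor count `k`, the radius `eps` and the choice to symmetrise the KNN graph are all assumptions made for illustration.

```python
import numpy as np

def rbf_similarity(X, gamma=1.0):
    """s(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) for every pair of rows."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def knn_graph(X, k=2, gamma=1.0):
    """Connect each point to its k most similar points; edge weight = similarity."""
    S = rbf_similarity(X, gamma)
    np.fill_diagonal(S, 0.0)                   # no self-loops
    W = np.zeros_like(S)
    for i in range(len(S)):
        nbrs = np.argsort(-S[i])[:k]           # indices of the k largest similarities
        W[i, nbrs] = S[i, nbrs]
    return np.maximum(W, W.T)                  # keep an edge if either end chose it

def epsilon_graph(X, eps, gamma=1.0):
    """Connect every pair of points closer than eps; edge weight = similarity."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.where(d2 <= eps ** 2, np.exp(-gamma * d2), 0.0)
    np.fill_diagonal(W, 0.0)
    return W
```

Because the RBF weight decays exponentially with distance, only points that are really close end up strongly connected.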

Graph-based1

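How the graph-based approach works:
The labelled data influence their neighbors.
The labels propagate through the graph.
This only works when there is enough data; otherwise the graph is broken and the labels cannot propagate across the gap.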
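A sketch of the propagation idea, under assumptions: this is a simple iterative label propagation (each node repeatedly takes the weighted average of its neighbors' label distributions while the known labels stay clamped), not necessarily the exact procedure from the lecture. `y` uses `-1` for unlabeled nodes, and nodes that no label can reach are reported as `-1`.

```python
import numpy as np

def label_propagation(W, y, n_classes, n_iter=100):
    """W: symmetric weight matrix; y[i]: class index, or -1 if node i is unlabeled."""
    y = np.asarray(y)
    labelled = y >= 0
    one_hot = np.eye(n_classes)[y[labelled]]
    F = np.full((len(y), n_classes), 1.0 / n_classes)   # start undecided
    F[labelled] = one_hot
    deg = W.sum(axis=1, keepdims=True)
    P = W / np.where(deg > 0, deg, 1.0)        # row-normalised weights
    for _ in range(n_iter):
        F = P @ F                              # each node averages its neighbors
        F[labelled] = one_hot                  # clamp the known labels
    pred = F.argmax(axis=1)
    # No label reached this node: its scores are still all equal
    undecided = np.isclose(F.max(axis=1), F.min(axis=1))
    pred[undecided] = -1
    return pred, F
```

A node in a component without any labelled point stays at `-1`, which is the "labels cannot propagate across the gap" case described above.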

Graph-based2

Define the smoothness of the labels on the graph

Graph smoothness1

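S can also be obtained with matrix operations: S = y^T L y with L = D - W, where y is the vector of all labels (labelled and unlabeled), W is the adjacency (weight) matrix of the graph, and D is a diagonal matrix whose entries are the row sums of W.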
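A quick numerical check of the matrix form, assuming the pairwise definition S = 1/2 Σ_i Σ_j w_ij (y_i - y_j)^2 with the sum running over all ordered pairs (i, j), so that each edge is counted twice and the factor 1/2 cancels it. The 4-node graph below is a made-up example.

```python
import numpy as np

def smoothness_pairwise(W, y):
    """S = 1/2 * sum over all ordered pairs (i, j) of w_ij * (y_i - y_j)^2."""
    diff = y[:, None] - y[None, :]
    return 0.5 * np.sum(W * diff ** 2)

def smoothness_laplacian(W, y):
    """S = y^T L y with L = D - W, D = diag(row sums of W)."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    return y @ L @ y

# Hypothetical graph: edges 0-1 (weight 2), 0-2 (weight 3), 1-2 (weight 1), 2-3 (weight 1)
W = np.array([[0.0, 2.0, 3.0, 0.0],
              [2.0, 0.0, 1.0, 0.0],
              [3.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
smooth_labels = np.array([1.0, 1.0, 1.0, 0.0])   # only the weak edge 2-3 is cut
rough_labels = np.array([0.0, 1.0, 1.0, 0.0])    # the strong edges 0-1 and 0-2 are cut
```

Both functions give the same value, and the labelling that cuts only the weak edge gets the smaller S, i.e. it is the smoother one.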

Graph smoothness2

When training a neural network, add S, multiplied by a weight λ, to the loss function as a regularization term:
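Written out, the regularized loss would look like the following. The notation is an assumption: C is the per-example loss (e.g. cross-entropy) on the labelled data, y^r is the network output, \hat{y}^r is the true label, and S is the graph smoothness computed over all labelled and unlabeled outputs.

```latex
L = \sum_{x^r} C\left(y^r, \hat{y}^r\right) + \lambda S,
\qquad
S = \frac{1}{2} \sum_{i,j} w_{i,j} \left(y^i - y^j\right)^2 = \mathbf{y}^{T} L_{\text{graph}} \, \mathbf{y}
```

Here L_graph = D - W is the graph Laplacian; it is renamed only to avoid clashing with the loss L.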

Graph smoothness3

Better Representation

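Get rid of the dross and keep the essence; turn the complicated into the simple.
(To be covered in the unsupervised learning part.)

How Better Representation works: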

better represent

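My own summary:
Semi-supervised learning has far more unlabeled data than labelled data, and its learning process relies on some assumptions. The semi-supervised generative model starts from an initial θ, uses it to compute the probability that each unlabeled example belongs to each class (soft label), then updates the model; the initial values affect what it converges to. The self-training model is based on the black-or-white (Low-density Separation) assumption and force-assigns labels (hard labels), producing pseudo-labels. An advanced version, entropy-based regularization, does not force a label but assumes the output follows some distribution, and the more concentrated that distribution the better. Semi-supervised SVM enumerates every possible labelling of the unlabeled data, fits an SVM for each, and keeps the one with the largest margin and the smallest error. Another assumption is the Smoothness Assumption (you resemble your neighbors): if two x are close in a high-density region, their y are the same. Cluster-based and graph-based algorithms build on this assumption. The former is limited because it depends on a good clustering. An example of the latter is the Label Propagation Algorithm, a graph-based semi-supervised algorithm that builds a graph (data points as nodes, similarities between points as edges) to find the relation between the labelled and unlabeled training data.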

Bilibili course link: https://www.bilibili.com/video/BV1Ht411g7Ef?p=23

Hung-yi Lee (李宏毅), Machine Learning (《机器学习》), Lecture 21: Semi-supervised Learning
