Adaboost

Adaboost(Adaptive Boosting，自适应增强)
学习步骤：
1、Adaboost算法流程；
2、通过例子学习Adaboost；
3、Adaboost公式推导；
4、总结

算法流程

输入：
训练数据集
$S=\{(x_1,y_1),(x_2,y_2),...,(x_n,y_n)\}$
输出：
最终分类器 $G_m(x;d)$ ， $d$ 表示划分点；
初始化：
第1次训练时，样本权重均匀分布，既有：
$W_1=\{w_{1,1},w_{1,2},...,w_{1,n}\}$ ，
其中 $w_{1,i}=\frac{1}{n},i \in[1,n]$
循环： $m=1,2,3,...,M$
（1）生成划分点，将弱学习器作用于不同的划分点；
（2）选择错误率最小的划分点，得到 $G_m(x;d_m)$ ，对应的分类错误率为：
$ε_m=P(G_m(x_i;d)_m \neq y_i)=\sum_{i=1}^nw_{m,i}*I((G_m(x_i;d_m) \neq y_i)$
其中 $I((G_m(x_i;d_m) \neq y_i)$ 为指示函数，当 $G_m(x_i;d_m) = y_i$ 时返回1，当 $G_m(x_i;d_m) \neq y_i$ 时返回0。
（3）计算弱分类器 $G_m(x_i;d_m)$ 对应的系数，该系数表示该弱分类器在最终分类器（强分类器）中的最要程度：
$α_m=η*ln(\frac{1-ε_m}{ε_m})$
其中η表示学习率，一般设置为 $\frac{1}{2}$ 或1。
显然， $α_m$ 随着 $ε_m$ 的减小二增大，意味着分类误差率越小的弱分类器在最终分类器中的最用越大。

此时得到分类器： $F_m(x)=F_{m-1}(x)+α_mG_m(x;d_m)$

（4）更新参数 $w$ :
$w_{m+1,i}=\frac{w_{m,i}*e^{-y_i*α_m*G_m(x;d_m)}}{Z_m}$
其中， $Z_m=\sum_{ k=1}^nw_{m,k}*e^{-y_k*α_m*G_m(x; d_m)}$ ,
$Z_m$ 为规范化因子，这一点和Softmax一样。但是很多AdaBoost中也有不使用规范化因子的。
循环结束条件：
$ε_m$ 小于某个阈值（一般是0.5），或迭代达到最大次数。
最终分类器为：
$G=Sign(\sum_{i=1}^Mα_m*G_m(x;d_m))$
其中Sign(x)是符号函数:
当x>0，sign(x)=1;
当x=0，sign(x)=0;
当x<0， sign(x)=-1

误差为分类错误样本的权重之和。
Adaboost采用的是指数损失：

https://www.julyedu.com/question/big/kp_id/23/ques_id/989

Adaboost例子

https://zhuanlan.zhihu.com/p/27126737

图1

一共有n=23条数据，其中13个正例，10个反例。
1、初始化权重

w_{1,i}=\frac{1}{23}

2、对属性

x_1

去重后排序：
[0.05, 0.08, 0.1, 0.12, 0.2, 0.25, 0.3, 0.33, 0.4, 0.5, 0.55, 0.6, 0.66, 0.7, 0.77, 0.8, 0.88]
求划分点：
[0.065, 0.09, 0.11, 0.16, 0.225, 0.275, 0.315, 0.365, 0.45, 0.525, 0.575, 0.63, 0.68, 0.735, 0.785, 0.848]

对属性 $x_2$ 去重后排序：
[0.1, 0.15, 0.2, 0.3, 0.4, 0.44, 0.5, 0.55, 0.6, 0.65, 0.66, 0.68, 0.7, 0.77]
求划分点：
[0.125, 0.175, 0.25, 0.35, 0.42, 0.47, 0.525, 0.575, 0.625, 0.655, 0.67, 0.69, 0.735]

3、由弱分类器（决策树桩）：
$G_1(x)=\begin{cases}1，&x<d\\-1，&x>d\end{cases}$

图2

当

G_1(x)

对样本分类错误率>0.5时，使用其反转后弱分类器，其反转为：

G_2(x)=\begin{cases}1，&x>d\\-1，&x<d\end{cases}

图3

4、第1（m=0）次迭代，对属性 $x_2$ ，
为什么先是 $x_2$ 而不是 $x_1$ ？
应该是通过Gini系数筛选的。
（1）当划分点 $d=0.125$ 时， $G_1$ 分错了13个，错误率为0.565>0.5, 则进行反转，使用 $G_2$ 进行分类，分错了10个；
（2）当划分点 $d=0.175$ 时， $G_2$ 分错了8个；
（3）当划分点 $d=0.11$ 时， $G_2$ 分错了8个；
...
[(0.125, 10, 0.435, 'G2'), (0.175, 8, 0.348, 'G2'), (0.25, 8, 0.348, 'G2'), (0.35, 7, 0.304, 'G2'), (0.42, 6, 0.261, 'G2'), (0.47, 7, 0.304, 'G2'), (0.525, 6, 0.261, 'G2'), (0.575, 6, 0.261, 'G2'), (0.625, 7, 0.304, 'G2'), (0.655, 9, 0.391, 'G2'), (0.67, 10, 0.435, 'G2'), (0.69, 11, 0.478, 'G2'), (0.735, 11, 0.478, 'G2')]
上面的列表中的元素（划分点，划分错误样本数，错误率，分类器）。我们现在错误率最小的即 $ε_1=0.261$ ，对应划分点有[0.42,0.525,0.575]，这里选择划分点为0.575（后面在更新权重的时候需要再次用到基于该划分点的弱分类器），有图：

图4

图4中点的大小反映对应权重，此时它们权重都相同；点的颜色反映类别。

计算系数：
$α_1=\frac{1}{2}ln(\frac{1-ε}{ε})=\frac{1}{2}ln(\frac{1-0.261}{0.261}) =0.52$

更新权重 $w_1 \to w_2$
$w_{2,i}=\frac{w_{1,i}*e^{-y_i*α_1*G'(x)}}{\sum_{ k=1}^nw_{1,k}*e^{-y_k*α_1*G'(x)}}$
我们把分母最为规范化因子拿出来先计算：
（需要说明的是有时候并没有用规范化因子）
$Z_1=\sum_{ k=1}^nw_{1,k}*e^{-y_k*α_1*G'(x)}=0.878$
因为分母都一样，都是 $Z_1$ ，如果分类正确分子为 $w_{1,k}*e^{-α_1}$ ，其中 $e^{-α_1}<1$ ；如果分类错误分子为 $w_{1,k}*e^{α_1}$ ，其中 $e^{α_1}>1$ ；这样分类正确的权重会变小，分类错误的权重会变大。

图5

5、第2（m=1）次迭代，对属性 $x_1$ ，使用权重 $w_2$
[(0.065, 11, 0.373, 'G2'), (0.09, 11, 0.535, 'G2'), (0.11, 10, 0.452, 'G2'), (0.16, 9, 0.423, 'G2'), (0.225, 9, 0.423, 'G2'), (0.275, 10, 0.452, 'G2'), (0.315, 11, 0.481, 'G2'), (0.365, 11, 0.481, 'G2'), (0.45, 11, 0.481, 'G2'), (0.525, 10, 0.452, 'G2'), (0.575, 11, 0.481, 'G2'), (0.63, 9, 0.423, 'G2'), (0.68, 10, 0.452, 'G2'), (0.735, 9, 0.423, 'G2'), (0.785, 10, 0.506, 'G2'), (0.848, 11, 0.373, 'G2')]
错误率最小的即 $ε_2=0.423$ ，对应划分点有[0.16,0.225,0.63,0.735]，这里选择划分点为0.16，有图：

图6

计算系数：
$α_2=\frac{1}{2}ln(\frac{1-0.423}{0.423}) =0.26$

更新权重 $w_2 \to w_3$
$w_{3,i}=\frac{w_{2,i}*e^{-y_i*α_2*G'(x)}}{\sum_{ k=1}^nw_{2,k}*e^{-y_k*α_2*G'(x)}}$
$Z_2=1.026$
权重更新如下：

图7

最终得到的AdaBoost为：

图8

代码如下：

   import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np
import seaborn as sns
sns.set_style('white')

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# Toy Dataset
x1 = np.array([.1, .2, .4, .8, .8, .05, .08, .12, .33, .55, .66, .77, .88, .2, .3, .4, .5, .6, .25, .3, .5, .7, .6])
x2 = np.array([.2, .65, .7, .6, .3, .1, .4, .66, .77, .65, .68, .55, .44, .1, .3, .4, .3, .15, .15, .5, .55, .2, .4])
y = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
X = np.vstack((x1, x2)).T


def plot_decision_boundary(classifier, X, y, N=10, scatter_weights=np.ones(len(y)), ax=None):
    '''Utility function to plot decision boundary and scatter plot of data'''
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, N), np.linspace(y_min, y_max, N))

    # Check what methods are available
    if hasattr(classifier, "decision_function"):
        zz = np.array([classifier.decision_function(np.array([xi, yi]).reshape(1, -1)) for xi, yi in
                       zip(np.ravel(xx), np.ravel(yy))])
    elif hasattr(classifier, "predict_proba"):
        zz = np.array([classifier.predict_proba(np.array([xi, yi]).reshape(1, -1))[:, 1] for xi, yi in
                       zip(np.ravel(xx), np.ravel(yy))])
    else:
        zz = np.array([classifier(np.array([xi, yi]).reshape(1, -1)) for xi, yi in zip(np.ravel(xx), np.ravel(yy))])

    # reshape result and plot
    Z = zz.reshape(xx.shape)
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])

    # Get current axis and plot
    if ax is None:
        ax = plt.gca()
    ax.contourf(xx, yy, Z, 2, cmap='RdBu', alpha=.5)
    ax.contour(xx, yy, Z, 2, cmap='RdBu')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm_bright, s=scatter_weights * 40)
    ax.set_xlabel('$X_1$')
    ax.set_ylabel('$X_2$')


def AdaBoost_scratch(X, y, M=10, learning_rate=1):
    # Initialization of utility variables
    N = len(y)
    estimator_list, y_predict_list, estimator_error_list, estimator_weight_list, sample_weight_list = [], [], [], [], []

    # Initialize the sample weights
    sample_weight = np.ones(N) / N
    sample_weight_list.append(sample_weight.copy())

    # For m = 1 to M
    for m in range(M):
        # Fit a classifier
        estimator = DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2)
        estimator.fit(X, y, sample_weight=sample_weight)
        y_predict = estimator.predict(X)

        # Misclassifications
        incorrect = (y_predict != y)

        # Estimator error
        estimator_error = np.mean(np.average(incorrect, weights=sample_weight, axis=0))

        # Boost estimator weights
        estimator_weight = learning_rate * np.log((1. - estimator_error) / estimator_error)

        # Boost sample weights
        sample_weight *= np.exp(estimator_weight * incorrect * ((sample_weight > 0) | (estimator_weight < 0)))

        # Save iteration values
        estimator_list.append(estimator)
        y_predict_list.append(y_predict.copy())
        estimator_error_list.append(estimator_error.copy())
        estimator_weight_list.append(estimator_weight.copy())
        sample_weight_list.append(sample_weight.copy())

    # Convert to np array for convenience
    estimator_list = np.asarray(estimator_list)
    y_predict_list = np.asarray(y_predict_list)
    estimator_error_list = np.asarray(estimator_error_list)
    estimator_weight_list = np.asarray(estimator_weight_list)
    sample_weight_list = np.asarray(sample_weight_list)

    # Predictions
    preds = (np.array([np.sign((y_predict_list[:, point] * estimator_weight_list).sum()) for point in range(N)]))
    print('Accuracy = ', (preds == y).sum() / N)

    return estimator_list, estimator_weight_list, sample_weight_list


def plot_AdaBoost_scratch_boundary(estimators, estimator_weights, X, y, N=10, ax=None):
    def AdaBoost_scratch_classify(x_temp, est, est_weights):
        '''Return classification prediction for a given point X and a previously fitted AdaBoost'''
        temp_pred = np.asarray([(e.predict(x_temp)).T * w for e, w in zip(est, est_weights)]) / est_weights.sum()
        return np.sign(temp_pred.sum(axis=0))

    '''Utility function to plot decision boundary and scatter plot of data'''
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, N), np.linspace(y_min, y_max, N))

    zz = np.array(
        [AdaBoost_scratch_classify(np.array([xi, yi]).reshape(1, -1), estimators, estimator_weights) for xi, yi in
         zip(np.ravel(xx), np.ravel(yy))])

    # reshape result and plot
    Z = zz.reshape(xx.shape)
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])

    if ax is None:
        ax = plt.gca()
    ax.contourf(xx, yy, Z, 2, cmap='RdBu', alpha=.5)
    ax.contour(xx, yy, Z, 2, cmap='RdBu')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm_bright)
    ax.set_xlabel('$X_1$')
    ax.set_ylabel('$X_2$')

# 这是使用sklearn自带的AdaBoost
boost = AdaBoostClassifier( base_estimator = DecisionTreeClassifier(max_depth = 1, max_leaf_nodes=2),algorithm = 'SAMME',n_estimators=10, learning_rate=1.0)
boost.fit(X,y)
plot_decision_boundary(boost, X,y, N = 50)#, weights)
plt.show()
boost.score(X,y)

# 这是我们自己实现的
estimator_list, estimator_weight_list, sample_weight_list = AdaBoost_scratch(X,y, M=10, learning_rate = 1)
plot_AdaBoost_scratch_boundary(estimator_list, estimator_weight_list, X, y, N = 50 )
plt.title('Estimator decision boundary')
plt.show()

# 下面是画出各迭代过程的图
estimator_list, estimator_weight_list, sample_weight_list  = AdaBoost_scratch(X,y, M=10, learning_rate = 1)
fig = plt.figure(figsize = (14,14))
a= list(range(0,9))
a.append(0)

for m in a:
    fig.add_subplot(3,3,m+1)
    s_weights = (sample_weight_list[m,:] / sample_weight_list[m,:].sum() ) * 40
    plot_decision_boundary(estimator_list[m], X,y,N = 50, scatter_weights =s_weights )
    plt.title('Estimator decision boundary, m = {}'.format(m))
    plt.show()

上面的例子参考了如下文章：
但是该文章中并没有使用规范化因子，其次该文章中计算系数 $α$ 使用的是learning_rate=1，而不是上面例子中使用的 $\frac{1}{2}$
AdaBoost: Implementation and intuition

另外比较好的文章：
Adaboost算法流程及示例
 A Step by Step Adaboost Example

Adaboost公式推导

主要是推导为什么 $α_m=η*ln(\frac{1-ε_m}{ε_m})$ ？

$F_m(x)=F_{m-1}(x)+α_mG_m(x;d_m)$
$F_1(x)=α_1G_1(x;d_1)$
$\qquad=\frac{1}{2}ln(\frac{1-ε_1}{ε_1})G_1(x;d_1)$
$\qquad=\frac{1}{2}ln(\frac{1-\sum_{i=1}^nw_{m,i}*I((G_m(x_i;d_m) \neq y_i)}{\sum_{i=1}^nw_{m,i}*I((G_m(x_i;d_m) \neq y_i)})G_1(x;d_1)$

https://www.cnblogs.com/liuq/p/9883621.html

总结：

1、Adaboost是加法模型（加性模型），加法模型的优化通常是一个复杂的优化问题，前向分布算法求解这一优化问题的思路是：从前往后每一步只学习一个基函数及其系数，逐步逼近优化目标函数；
2、Adaboost采用的是指数损失，损失函数为
$L=\sum_{i=1}^ne^{-y_i*F_m}$ ，
其中， $F_m=F_{m-1}+α_M*G_m(x)$ 。
所以说，最优化损失函数 $L$ ，等同于最优化 $F_m$ ，最优化 $F_m$ 等同于最优化 $G_m(x)$ ，而最优化 $G_m(x)$ 。

随机森林会对变量做子抽样，比如变量是p，随机森林每次会随机抽取log p个变量拟合一棵决策树。显然，随机森林适合p比较大的情况。否则log p可能就是1.+ 2.+这种情况，毫无意义。adaboost和GBDT很类似，可以理解成前者就是后者取指数损失的一个特例。适合p比较小的时候用。当然，这两者都只适用于n>>p的情况，此时样本携带了足够多的信息去拟合非线性的关系。也就是说，随机森林也不适合p特别大的情况。如果p>>n，以LASSO为首的惩罚回归是首选工具。

Adaboost

算法流程

Adaboost例子

Adaboost公式推导

总结：

推荐阅读更多精彩内容