偏差和方差的判别
高偏差和高方差本质上为学习模型的欠拟合和过拟合问题。
对于高偏差和高方差问题,即学习模型的欠拟合和过拟合问题,我们通常绘制如下图表进行判断:
高偏差——欠拟合问题
- Jtrain(Θ)误差大
- JCV(Θ)误差 ≈ Jtrain(Θ)误差
高方差——过拟合问题
- Jtrain(Θ)误差小
- JCV(Θ)误差 >> Jtrain(Θ)误差
补充笔记
Diagnosing Bias vs. Variance
In this section we examine the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis.
- We need to distinguish whether bias or variance is the problem contributing to bad predictions.
- High bias is underfitting and high variance is overfitting. Ideally, we need to find a golden mean between these two.
The training error will tend to decrease as we increase the degree d of the polynomial.
At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.
High bias (underfitting): both Jtrain(Θ) and JCV(Θ) will be high. Also, JCV(Θ)≈Jtrain(Θ).
High variance (overfitting): Jtrain(Θ) will be low and JCV(Θ) will be much greater than Jtrain(Θ).
The is summarized in the figure below:
正则化的偏差与方差
在训练模型的过程中,为了避免过拟合问题我们通常使用正则化方法。但对于正则化参数λ的选择,我们是需要谨慎考虑的。
之前,我们在考虑正则化参数λ的选择时,只是考虑单变量的情况。现在,我们要考虑在多项式的情况下,正则化参数λ的取值问题。
例如:对于某一多项式模型,我们使用正则化方法。其中,正则化参数λ=0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10。现求出最佳的正则化参数λ的值。
首先,我们将数据集分为训练集、交叉验证集和测试集三部分。
然后,当正则化参数λ=0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10时,我们分别求出Jtran(θ)和JCV(θ)。
最后,我们利用测试集对JCV(θ)最小时的某个正则化参数λ值进行计算,求出其Jtest(θ)。
图中,假设正则化参数λ=0.08时,JCV(θ)最小。
为了便于理解,以及便于找到最佳的正则化参数λ的值,我们可以画出下图:
补充笔记
Regularization and Bias/Variance
In the figure above, we see that as λ increases, our fit becomes more rigid. On the other hand, as λ approaches 0, we tend to over overfit the data. So how do we choose our parameter λ to get it 'just right' ? In order to choose the model and the regularization term λ, we need to:
- Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});
- Create a set of models with different degrees or any other variants.
- Iterate through the λs and for each λ go through all the models to learn some Θ.
- Compute the cross validation error using the learned Θ (computed with λ) on the JCV(Θ) without regularization or λ = 0.
- Select the best combo that produces the lowest error on the cross validation set.
- Using the best combo Θ and λ, apply it on Jtest(Θ) to see if it has a good generalization of the problem.
学习曲线
通过绘制学习曲线可以帮助我们了解学习算法是否运行正常。学习曲线为训练集误差、交叉验证集误差与训练集样本数量m之间的函数关系图。
上图中,假设函数为hθ(x) = θ0 + θ1x + θ2x2,且此处不考虑正则化。当m = 1时,我们的假设函数hθ(x)能完美拟合训练集,其Jtrain(θ) = 0,但对于交叉验证集而言,假设函数hθ(x)的泛化能力差,其JCV(θ)的值将较大;当m=2时,我们的假设函数hθ能较好地拟合训练集,其Jtrain(θ)的值将稍微增大,但对于交叉验证集而言,假设函数hθ(x)的泛化能力依旧较差,其JCV(θ)的值将较比之前有略微减小;······;但m足够大时,Jtrain(θ)的值将增大到某一特定值后保持水平,JCV(θ)的值将减小到某一特定值后保持水平,且Jtrain(θ)的值与JCV(θ)的值非常接近。
因此,当学习算法处于高偏差的情况时,我们增加训练集样本数量是毫无用处的。
上图中,我们的假设函数hθ(x) = θ0 + θ1x + θ2x2 + ... + θ100x100,此处考虑正则化,其中正则化参数λ的值很小。当m = 5时,假设函数hθ(x)能够较好地拟合训练集,其Jtrain(θ)的值较小,但假设函数hθ(x)的泛化能力较差,其JCV(θ)的值较大;当m = 12时,假设函数hθ(x)依旧能够较好地拟合训练集,但其Jtrain(θ)的值稍微增大一些,JCV(θ)的值略微减小一些;······;当m足够大时,Jtrain(θ)的值逐渐增大,JCV(θ)的值逐渐减小。
因此,此时学习算法处于高偏差的情况时,我们增加训练集样本数量可能会有些帮助。
注:当m足够大时,Jtrain(θ)的值逐渐增大,JCV(θ)的值逐渐减小,这两者是否会相交,视频中尚未交代清楚。
补充笔记
Learning Curves
Training an algorithm on a very few number of data points (such as 1, 2 or 3) will easily have 0 errors because we can always find a quadratic curve that touches exactly those number of points. Hence:
- As the training set gets larger, the error for a quadratic function increases.
- The error value will plateau out after a certain m, or training set size.
Experiencing high bias:
Low training set size: causes Jtrain(Θ) to be low and JCV(Θ) to be high.
Large training set size: causes both Jtrain(Θ) and JCV(Θ) to be high with Jtrain(Θ)≈JCV(Θ).
If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.
Experiencing high variance:
Low training set size: Jtrain(Θ) will be low and JCV(Θ) will be high.
Large training set size: Jtrain(Θ) increases with training set size and JCV(Θ) continues to decrease without leveling off. Also, Jtrain(Θ) < JCV(Θ) but the difference between them remains significant.
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
下一步决定做什么
在机器学习应用建议(一)一文的开头,我们就预测结果存在高误差而提出了如下的解决方法:
- 获取更多的样本
- 尝试减少特征变量的数量
- 尝试获取更多的特征变量
- 尝试增加多项式特征
- 尝试减小正则化参数λ的值
- 尝试增大正则化参数λ的值
对于这些方法,我们分别进行了研究得出了如下结论:
- 获取更多的样本——适合高方差(过拟合)问题
- 尝试减少特征变量的数量——适合高方差(过拟合)问题
- 尝试获取更多的特征变量——适合高偏差(欠拟合)问题
- 尝试增加多项式特征——适合高偏差(欠拟合)问题
- 尝试减小正则化参数λ的值——适合高偏差(欠拟合)问题
- 尝试增大正则化参数λ的值 ——适合高方差(过拟合)问题
对于神经网络模型而言,使用“小”的模型,其容易出现高偏差(欠拟合)问题,但其优势在于计算代价较小;使用“大”的模型(即隐藏层激活单元较多或有多个隐藏层。),其容易出现高方差(过拟合)问题,且其计算代价较大。但一般而言,正则化的神经网络模型越“大”其性能越好。
通常我们选择只含有一层隐藏层的神经网络模型。但对于其他情况,只含有一层隐藏层的神经网络模型并不是最优的模型。因此,我们可以将数据集分为训练集、交叉验证集和测试集三部分,分别对隐藏层层数不同的神经网络模型进行训练,找到一个JCV(Θ)最小的神经网络模型为止。
补充笔记
Deciding What to Do Next Revisited
Our decision process can be broken down as follows:
- Getting more training examples: Fixes high variance
- Trying smaller sets of features: Fixes high variance
- Adding features: Fixes high bias
- Adding polynomial features: Fixes high bias
- Decreasing λ: Fixes high bias
- Increasing λ: Fixes high variance.
Diagnosing Neural Networks
- A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.
- A large neural network with more parameters is prone to overfitting. It is also computationally expensive. In this case you can use regularization (increase λ) to address the overfitting.
Using a single hidden layer is a good starting default. You can train your neural network on a number of hidden layers using your cross validation set. You can then select the one that performs best.
Model Complexity Effects:
- Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly consistently.
- Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
- In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.