Overfitting and underfitting

For a machine learning or deep learning model, we not only want it to fit the training dataset well (low training error), but also expect it to fit unseen data (the test set) well, i.e. to have good generalization ability; the resulting error on unseen data is called the generalization error. The most intuitive way to gauge generalization ability is to look at whether the model is overfitting or underfitting. Overfitting and underfitting describe two states of a model during training; a typical training run follows the pattern described below.

At the beginning of training, the model is still learning and is in the underfitting region; as training progresses, both the training error and the test error decrease. After a critical point is reached, the training error keeps decreasing while the test error starts to increase, and the model enters the overfitting region: the trained network fits the training set too closely and no longer works well on data outside the training set.

  1. What is underfitting?

Underfitting means the model fails to achieve a sufficiently low error on the training set. In other words, the model's capacity is too low: it performs poorly even on the training set and cannot learn the patterns behind the data.

How to solve underfitting?

Underfitting mostly occurs at the beginning of training and usually stops being a concern as training continues. If it persists, increasing the capacity of the network or adding features to the model are both effective remedies.

  2. What is overfitting?

Overfitting means the gap between the training error and the test error is too large. In other words, the model is more complex than the actual problem requires: it performs well on the training set but poorly on the test set. The model essentially memorizes the training set (it remembers properties or idiosyncrasies of the training set that do not carry over to the test set), fails to learn the patterns behind the data, and generalizes poorly.

Why does overfitting occur?

The main reasons are as follows:

1. The training data is not diverse enough or there are too few samples. If the training set contains only negative samples, the resulting model certainly cannot predict positive samples accurately. The training samples should therefore be as comprehensive as possible and cover all types of data.

2. There is too much noise in the training data. Noise refers to interfering data in the training set; too much of it causes the model to record many noisy features and ignore the true relationship between inputs and outputs.

3. The model is too complex. An overly complex model can memorize the training data, but it cannot adapt when it encounters unseen data, so its generalization ability is poor. We want the model to produce stable outputs for new inputs drawn from the same distribution; an overly complex model is an important cause of overfitting.

  3. How to prevent overfitting?

To solve the overfitting problem, we need to significantly reduce the test error without excessively increasing the training error, thereby improving the generalization ability of the model. For this we can use regularization. So what is regularization? Regularization refers to modifying the learning algorithm so that it reduces the generalization error rather than the training error.

Commonly used regularization methods can be divided into: (1) parameter regularization methods that directly impose regularization constraints, such as L1/L2 regularization; (2) methods that achieve lower generalization error through engineering tricks, such as early stopping and dropout; (3) implicit regularization methods that do not directly impose constraints, such as data augmentation.

Acquiring and Using More Data (Dataset Augmentation) - A Fundamental Approach to Overfitting
The best way to make a machine learning or deep learning model generalize better is to train it on more data. In practice, however, the amount of data we have is limited. One way around this is to create "fake data" and add it to the training set, i.e. dataset augmentation: the training set is enlarged with modified copies of existing examples, which improves the generalization ability of the model.

Taking an image dataset as an example, we can rotate images, zoom them, crop them randomly, add random noise, translate them, mirror them, and so on to increase the amount of data. In addition, for object classification we want recognition to be invariant to such changes: the pose, position, and overall brightness of the object in the image should not affect the classification result. We can therefore multiply the dataset by translating, flipping, scaling, and cropping the images.
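As a minimal sketch of this idea (assuming PyTorch and torchvision are available; the dataset and parameter choices here are illustrative, not from the original text), random transforms can be composed into the training pipeline so that every epoch sees slightly different copies of each image:

```python
# Data-augmentation sketch, assuming PyTorch/torchvision (illustrative choices).
import torchvision.transforms as T
from torchvision.datasets import CIFAR10  # any image dataset would do

train_transform = T.Compose([
    T.RandomHorizontalFlip(),                    # mirror
    T.RandomRotation(degrees=15),                # small random rotation
    T.RandomResizedCrop(32, scale=(0.8, 1.0)),   # random zoom + crop
    T.ColorJitter(brightness=0.2),               # vary overall brightness
    T.ToTensor(),
])

# Transforms are applied on the fly, so the model rarely sees exactly the same image twice.
train_set = CIFAR10(root="./data", train=True, download=True, transform=train_transform)
```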

Adopt a suitable model (control the complexity of the model)
Overly complex models can lead to overfitting. In model design, a widely repeated rule of thumb in deep learning is "deeper is better": experience from experiments and competitions suggests that for CNNs, more layers generally give better results, but deeper networks are also more prone to overfitting and take longer to compute.

According to Occam's razor, among the hypotheses that equally well explain the known observations, we should pick the simplest one. Likewise, in model design we should choose a simple but adequate model for the problem at hand.

reduce the number of features
With some feature engineering it is possible to reduce the number of features: remove redundant ones and manually choose which to keep. This, too, can alleviate overfitting.
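As a brief sketch (assuming scikit-learn; the data here is a synthetic toy example, not from the article), a univariate test can score features and keep only the most informative ones:

```python
# Feature-selection sketch, assuming scikit-learn (toy data for illustration).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))             # 200 samples, 50 candidate features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # only the first two features actually matter

selector = SelectKBest(score_func=f_classif, k=10)  # keep the 10 highest-scoring features
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)                     # (200, 10)
```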

L1/L2 regularization
(1) L1 regularization

Add an L1 regularization term to the original loss function $C_0$: the sum of the absolute values of all weights $w$, multiplied by $\frac{\lambda}{n}$ (where $n$ is the size of the training set). The loss function becomes:

$$C = C_0 + \frac{\lambda}{n}\sum_w |w|$$

The corresponding gradient (derivative) is:

$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\,\mathrm{sgn}(w)$$

where $\mathrm{sgn}(w)$ simply takes the sign of each element of $w$.

The weight update for $w$ during gradient descent then becomes:

$$w \;\to\; w - \eta\,\frac{\partial C_0}{\partial w} - \frac{\eta\lambda}{n}\,\mathrm{sgn}(w)$$

When $w = 0$, $|w|$ is not differentiable, so $w$ is updated with the original, unregularized rule.

When $w > 0$, $\mathrm{sgn}(w) = 1 > 0$, so the extra term makes the updated $w$ smaller during gradient descent.

When $w < 0$, $\mathrm{sgn}(w) = -1 < 0$, so the extra term makes the updated $w$ larger during gradient descent, again moving it towards 0. In other words, L1 regularization pushes the weights $w$ towards 0, driving as many weights in the network to 0 as possible, which is equivalent to reducing the complexity of the network and thus prevents overfitting.

This is also why L1 regularization produces sparser solutions; sparsity here means that some parameters of the optimum are exactly 0. The sparsity-inducing property of L1 regularization is widely used for feature selection, i.e. picking out a meaningful subset of the available features.
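As a minimal sketch (assuming PyTorch; the model, data, and the value of lam are placeholders), the L1 penalty can be added to the loss by hand, so that autograd contributes the $\frac{\lambda}{n}\,\mathrm{sgn}(w)$ term to each weight gradient:

```python
# L1-regularization sketch, assuming PyTorch (placeholder model and data).
import torch
import torch.nn as nn

model = nn.Linear(20, 1)                   # toy model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
lam = 1e-3                                 # regularization strength (lambda/n folded into one constant)

x, y = torch.randn(64, 20), torch.randn(64, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
# Add lam * sum(|w|); its gradient is lam * sgn(w) (0 at w == 0, matching the unregularized update there).
l1_penalty = sum(p.abs().sum() for p in model.parameters())
(loss + lam * l1_penalty).backward()
optimizer.step()
```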

(2) L2 regularization

L2 regularization is often referred to as weight decay. It adds an L2 regularization term to the original loss function $C_0$: the sum of the squares of all weights $w$, multiplied by $\frac{\lambda}{2n}$. The loss function becomes:

$$C = C_0 + \frac{\lambda}{2n}\sum_w w^2$$

The corresponding gradients (derivatives) are:

$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\,w, \qquad \frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}$$

It can be seen that the L2 regularization term has no effect on the update of the bias $b$, but it does affect the update of the weight $w$:

$$w \;\to\; w - \eta\,\frac{\partial C_0}{\partial w} - \frac{\eta\lambda}{n}\,w \;=\; \Bigl(1 - \frac{\eta\lambda}{n}\Bigr)w - \eta\,\frac{\partial C_0}{\partial w}$$

Here $\eta$, $\lambda$, and $n$ are all greater than 0, so the factor $1 - \frac{\eta\lambda}{n}$ is less than 1. During gradient descent the weight $w$ is therefore repeatedly shrunk, tending towards 0 but never forced exactly to 0. This is where the name weight decay comes from.

L2 regularization thus makes the weight parameters $w$ smaller. Why does that prevent overfitting? Smaller weights correspond to a lower-complexity model, which fits the training data "just right" instead of overfitting it, and therefore generalizes better.
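As a minimal sketch (assuming PyTorch; the model and values are placeholders), most frameworks expose L2 regularization directly as a weight_decay option on the optimizer, which for plain SGD is equivalent to adding the $\frac{\lambda}{2n}\sum_w w^2$ term to the loss:

```python
# L2 / weight-decay sketch, assuming PyTorch (placeholder model and data).
import torch
import torch.nn as nn

model = nn.Linear(20, 1)
criterion = nn.MSELoss()
# weight_decay adds (weight_decay * w) to each parameter's gradient, i.e. an L2 penalty.
# Note: applied this way it also decays the bias; in practice biases are often put in a
# separate parameter group with weight_decay=0, matching the text above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

x, y = torch.randn(64, 20), torch.randn(64, 1)
optimizer.zero_grad()
criterion(model(x), y).backward()
optimizer.step()   # each step effectively multiplies w by (1 - lr * weight_decay) before the usual update
```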

dropout
Dropout is a trick used while training the network; it is equivalent to adding noise to the hidden units. Dropout randomly "removes" a fraction of the hidden units (neurons) with a certain probability (for example 50%) at each training step. The "removal" is not literal: the outputs (activations) of those neurons are simply set to 0, so they take no part in the computation for that step.

Why does dropout help prevent overfitting?

(a) Each training step effectively trains a different sub-model, and different sub-models produce different outputs. As training continues these outputs fluctuate within a range, but their mean does not change much, so the final result can be regarded as the average output of many different models.

(b) It removes or weakens the co-adaptation between neurons, reducing the network's dependence on any single neuron and thereby improving generalization.
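As a minimal sketch (assuming PyTorch; layer sizes are arbitrary), dropout is applied to the hidden activations during training and automatically disabled at evaluation time:

```python
# Dropout sketch, assuming PyTorch (arbitrary layer sizes).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # zero out ~50% of hidden activations on each training forward pass
    nn.Linear(256, 10),
)

x = torch.randn(32, 784)

model.train()            # dropout active: a different random mask is sampled per forward pass
train_out = model(x)

model.eval()             # dropout disabled at inference (inverted dropout: no rescaling needed here)
test_out = model(x)
```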

Early stopping
Training a model means learning and updating its parameters, and this parameter learning is usually done with an iterative method such as gradient descent. Early stopping prevents overfitting by truncating the number of iterations: training is stopped before the model has fully converged on the training dataset.

To obtain a well-performing neural network, training may run for many epochs (one epoch is one pass over the entire dataset). If the number of epochs is too small, the network may underfit; if it is too large, overfitting may occur. Early stopping removes the need to set the number of epochs by hand. Specific approach: after each epoch (or every N epochs), evaluate the model on a validation set; once the validation error starts to increase as the epochs go on, stop training and take the weights at that point as the final parameters of the network.
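As a minimal sketch of this loop (plain Python; train_one_epoch and validate are hypothetical callbacks supplied by the caller, and the model is assumed to expose PyTorch-style state_dict / load_state_dict):

```python
# Early-stopping sketch; train_one_epoch and validate are hypothetical caller-supplied functions.
import copy

def fit(model, train_one_epoch, validate, max_epochs=100, patience=5):
    best_error = float("inf")
    best_weights = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)          # one pass over the training set
        val_error = validate(model)     # error on the held-out validation set

        if val_error < best_error:
            best_error = val_error
            best_weights = copy.deepcopy(model.state_dict())  # remember the best weights so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                   # validation error has kept rising: stop early

    model.load_state_dict(best_weights)  # fall back to the best validation-point weights
    return model
```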

Why does this prevent overfitting? When the network has not run many iterations, the weights $w$ are close to 0, because they are initialized to small random values. As the iterations proceed, the weights grow larger and larger, and late in training they can become very large. Early stopping halts the iterative process at an intermediate point, so we end up with moderately sized weights and obtain an effect similar to L2 regularization: a network with small weight values.

Disadvantage of early stopping: instead of tackling the two problems of optimizing the loss function and avoiding overfitting with separate tools, a single mechanism is used for both at once, which makes the trade-offs harder to reason about. The two problems can no longer be handled independently: if you stop optimizing early, the loss may not yet be small enough, yet you also do not want to keep training and overfit.
