Using the results we can plot the way the cost changes as the network learns (this and the next four graphs were generated by the program overfitting.py):
This looks encouraging, showing a smooth decrease in the cost, just as we expect. Note that I've only shown training epochs 200 through 399. This gives us a nice up-close view of the later stages of learning, which, as we'll see, turns out to be where the interesting action is.
Let's now look at how the classification accuracy on the test data changes over time:
Again, I've zoomed in quite a bit. In the first 200 epochs (not shown) the accuracy rises to just under 82 percent. The learning then gradually slows down. Finally, at around epoch 280 the classification accuracy pretty much stops improving. Later epochs merely see small stochastic fluctuations near the value of the accuracy at epoch 280. Contrast this with the earlier graph, where the cost associated to the training data continues to smoothly drop. If we just look at that cost, it appears that our model is still getting "better". But the test accuracy results show the improvement is an illusion. Just like the model that Fermi disliked, what our network learns after epoch 280 no longer generalizes to the test data. And so it's not useful learning. We say the network is overfitting or overtraining beyond epoch 280.
To put it another way, you can think of the validation data as a type of training data that helps us learn good hyper-parameters.
The training and validation sets are used during training.
for each epoch
    for each training data instance
        propagate error through the network
        adjust the weights
        calculate the accuracy over training data
    for each validation data instance
        calculate the accuracy over the validation data
    if the threshold validation accuracy is met
        exit training
    else
        continue training
Once you're finished training, then you run against your testing set and verify that the accuracy is sufficient.
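The loop above can be sketched in runnable form. This is only a minimal illustration, not the setup used by any particular program: the perceptron update rule, the toy data generator `make_data`, and the parameters `threshold`, `max_epochs` and `lr` are all assumptions chosen to keep the example small.

```python
import random

def accuracy(weights, data):
    """Fraction of (features, label) pairs the linear classifier gets right."""
    correct = 0
    for x, y in data:
        score = sum(w * xi for w, xi in zip(weights, x))
        correct += int((score > 0) == (y == 1))
    return correct / len(data)

def train(train_set, val_set, threshold=0.95, max_epochs=100, lr=0.1):
    """Perceptron training that exits once validation accuracy meets the threshold."""
    weights = [0.0] * len(train_set[0][0])
    for epoch in range(max_epochs):
        # Training data is used directly to adjust the weights.
        for x, y in train_set:
            score = sum(w * xi for w, xi in zip(weights, x))
            error = y - (1 if score > 0 else 0)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
        # Validation data only decides whether to keep training.
        val_acc = accuracy(weights, val_set)
        if val_acc >= threshold:
            break
    return weights, epoch + 1, val_acc

def make_data(n, rng):
    """Hypothetical toy data: label is 1 when x0 + x1 > 1 (constant bias feature appended)."""
    points = []
    for _ in range(n):
        x0, x1 = rng.random(), rng.random()
        points.append(([x0, x1, 1.0], 1 if x0 + x1 > 1 else 0))
    return points

rng = random.Random(0)
train_set, val_set, test_set = make_data(200, rng), make_data(50, rng), make_data(50, rng)
weights, epochs_run, val_acc = train(train_set, val_set)
# The test set is touched only once, after training has finished.
test_acc = accuracy(weights, test_set)
```

The test accuracy is computed a single time at the end, so it plays no part in choosing the weights or in deciding when to stop.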
The error surface will be different for different sets of data drawn from your data set (batch learning). Therefore, if you find a very good local minimum for your training set data, it may not be a very good point, and may even be a very bad point, on the surface generated by some other set of data for the same problem. You therefore need a model that not only finds a good weight configuration for the training set but can also predict new data (which is not in the training set) with low error. In other words, the network should generalize from the examples, so that it learns the underlying data rather than simply memorizing the training set by overfitting it.
The validation data set is a set of data for the function you want to learn that you do not use directly to train the network. You train the network with a set of data called the training data set. If you are using a gradient-based algorithm to train the network, then the error surface and the gradient at any point depend entirely on the training data set, so the training data set is used directly to adjust the weights. To make sure you don't overfit the network, you feed the validation data set to the network and check whether the error is within some range. Because the validation set is not used directly to adjust the weights of the network, a good error on the validation set (and also on the test set) indicates that the network not only predicts well on the training set examples but can also be expected to perform well when presented with new examples that were not used in the training process.
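As a rough sketch of how the three sets might be carved out of one data set: the function name `split_dataset`, the 15 percent fractions, and the fixed seed below are illustrative assumptions, not something prescribed above.

```python
import random

def split_dataset(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut into disjoint training, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]  # everything left over is used for training
    return train, val, test

train, val, test = split_dataset(range(100))
```

Shuffling before cutting matters when the examples are stored in some order (for instance sorted by class), since otherwise the held-out sets would not resemble the training set.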
Early stopping is a way to decide when to stop training. There are several variations, but the main outline is this: both the training and the validation set errors are monitored. The training error decreases at each iteration (backprop and its variants), and at first the validation error decreases too. Training is stopped at the moment the validation error starts to rise. The weight configuration at this point gives a model that predicts the training data well, as well as data the network has not seen. However, because the validation data is used to select the weight configuration, it affects the model indirectly, so the validation error cannot be treated as a fully independent estimate of performance. This is where the test set comes in. This set of data is never used in the training process. Once a model is selected based on the validation set, the test set data is applied to the network model and the error for this set is found. This error is representative of the error we can expect from absolutely new data for the same problem.
More details: function of train set, validation set and test set
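The stopping rule can be sketched as a small function over the monitored validation errors. The name `early_stopping_epoch` and the `patience` parameter are assumptions added for illustration; with `patience=1` the function stops at the first epoch where the validation error fails to improve, which matches the outline above. The error values in the usage lines are made up purely to exercise the function.

```python
def early_stopping_epoch(val_errors, patience=1):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation errors.

    best_epoch is the epoch with the lowest validation error seen before stopping;
    stop_epoch is where training halts, i.e. the first epoch that is `patience`
    epochs past the best one without improvement (or the last epoch if that
    never happens).
    """
    best_error = float("inf")
    best_epoch = -1
    stop_epoch = len(val_errors) - 1
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            stop_epoch = epoch
            break
    return best_epoch, stop_epoch

# Made-up validation errors: falling at first, then rising.
val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, stop_epoch = early_stopping_epoch(val_errors)
```

In practice one would keep a copy of the weights from `best_epoch` and evaluate that saved model on the test set, since the weights at `stop_epoch` have already started to drift away from the best validation error.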