Recognize Digits
training input - x: It'll be convenient to regard each training input x as a 784-dimensional vector (since 28×28=784). Each entry in the vector represents the grey value for a single pixel in the image.
desired output - y=y(x): y is a 10-dimensional vector. For example, if a particular training image, x, depicts a 6, then y(x)=(0,0,0,0,0,0,1,0,0,0)T is the desired output from the network. Note that T here is the transpose operation, turning a row vector into an ordinary (column) vector.
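For concreteness, here is a minimal NumPy sketch of both representations (the helper name one_hot is ours, purely illustrative):

```python
import numpy as np

def one_hot(digit):
    """Return the 10-dimensional desired output y(x) for a digit 0-9,
    as a column vector with 1.0 in the digit's slot and 0.0 elsewhere."""
    y = np.zeros((10, 1))
    y[digit] = 1.0
    return y

# A 28x28 image becomes a 784-dimensional column vector:
image = np.random.rand(28, 28)   # stand-in for a real MNIST image
x = image.reshape(784, 1)

print(one_hot(6).T)  # [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]
```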
cost function (sometimes referred to as a loss or objective function):

C(w,b) ≡ (1/2n) Σx ∥y(x) − a∥². (6)
Here, w denotes the collection of all weights in the network, b all the biases, n is the total number of training inputs, a is the vector of outputs from the network when x is input, and the sum is over all training inputs, x. Of course, the output a depends on x, w and b, but to keep the notation simple I haven't explicitly indicated this dependence. The notation ∥v∥ just denotes the usual length function for a vector v.
We'll call C the quadratic cost function; it's also sometimes known as the mean squared error or just MSE.
Inspecting the form of the quadratic cost function, we see that C(w,b) is non-negative, since every term in the sum is non-negative. Furthermore, the cost C(w,b) becomes small, i.e., C(w,b)≈0, precisely when y(x) is approximately equal to the output, a, for all training inputs, x.
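In code, the quadratic cost is a short function. Here is a minimal NumPy sketch, assuming the network outputs and desired outputs are available as lists of 10×1 column vectors (the names are ours, for illustration):

```python
import numpy as np

def quadratic_cost(outputs, desired):
    """C(w,b) = (1/2n) * sum over x of ||y(x) - a||^2.

    `outputs` holds the network outputs a and `desired` holds the
    desired outputs y(x), both as lists of 10x1 column vectors."""
    n = len(outputs)
    return sum(np.linalg.norm(y - a) ** 2
               for a, y in zip(outputs, desired)) / (2 * n)
```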
The aim of our training algorithm will be to minimize the cost C(w,b) as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.
Why not maximize the number of images correctly classified by the network?
The number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance.
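A toy illustration of this point (our own construction, not from the original text): for a one-parameter classifier, nudging the weight leaves the count of correctly classified points unchanged, while the quadratic cost varies smoothly:

```python
import numpy as np

# Toy 1-D "network": predict 1 if sigmoid(w*x) > 0.5, else 0.
xs = np.array([-2.0, -1.0, 0.5, 1.0, 2.0])
labels = np.array([0, 0, 1, 1, 1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for w in [0.9, 1.0, 1.1]:
    a = sigmoid(w * xs)
    correct = np.sum((a > 0.5) == labels)             # piecewise constant in w
    cost = np.sum((labels - a) ** 2) / (2 * len(xs))  # varies smoothly in w
    print(f"w={w:.1f}  correct={correct}  cost={cost:.4f}")
```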
Gradient Descent
Suppose we're trying to minimize some function, C(v). This could be any real-valued function of many variables, v=v1,v2,…
Think of C as describing the height of a valley, and imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of C - those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.
Let's think about what happens when we move the ball a small amount Δv1 in the v1 direction, and a small amount Δv2 in the v2 direction. Calculus tells us that C changes as follows:

ΔC ≈ (∂C/∂v1)Δv1 + (∂C/∂v2)Δv2. (7)
We're going to find a way of choosing Δv1 and Δv2 so as to make ΔC negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define Δv to be the vector of changes in v, Δv ≡ (Δv1,Δv2)T. We'll also define the gradient of C to be the vector of partial derivatives, ∇C, i.e.:

∇C ≡ (∂C/∂v1, ∂C/∂v2)T. (8)

With these definitions, the expression (7) for ΔC can be rewritten as

ΔC ≈ ∇C⋅Δv. (9)
What's really exciting about Equation (9) is that it lets us see how to choose Δv so as to make ΔC negative. In particular, suppose we choose

Δv = −η∇C, (10)
where η is a small, positive parameter (known as the learning rate).
Then Equation (9) tells us that ΔC ≈ −η∇C⋅∇C = −η∥∇C∥². Because ∥∇C∥² ≥ 0, this guarantees that ΔC ≤ 0, i.e., C will always decrease, never increase, if we change v according to the prescription in (10). This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for Δv, then move the ball's position v by that amount:

v → v′ = v − η∇C. (11)
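To see this "law of motion" in action, here is a minimal sketch (a toy example of ours, not from the original text) applying the update rule to the two-variable bowl C(v) = v1² + v2², whose gradient we can write down by hand:

```python
import numpy as np

def C(v):
    """A toy cost: a simple bowl with its minimum at the origin."""
    return v[0] ** 2 + v[1] ** 2

def grad_C(v):
    """Gradient of C, computed analytically: (2*v1, 2*v2)."""
    return np.array([2 * v[0], 2 * v[1]])

eta = 0.1                     # learning rate
v = np.array([2.0, -3.0])     # starting point for the "ball"

for step in range(30):
    v = v - eta * grad_C(v)   # the update rule v -> v' = v - eta*grad(C)

print(v, C(v))   # v is now very close to the minimum at (0, 0)
```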
To make gradient descent work correctly, we need to choose the learning rate η to be small enough that Equation (9) is a good approximation. At the same time, we don't want η to be too small, since that will make the changes Δv tiny, and thus the gradient descent algorithm will work very slowly.
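Rerunning the toy loop above with different values of η illustrates the trade-off (again our own illustration, reusing C and grad_C from the previous sketch):

```python
# Reuses C, grad_C, and numpy from the sketch above.
for eta in [1.1, 0.1, 0.001]:
    v = np.array([2.0, -3.0])
    for step in range(30):
        v = v - eta * grad_C(v)
    print(f"eta={eta}: C(v) after 30 steps = {C(v):.6f}")
# eta=1.1 overshoots the minimum each step and diverges,
# eta=0.1 converges quickly, eta=0.001 has barely moved.
```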
Generalizing to Many Variables
Suppose in particular that C is a function of m variables, v1,…,vm. Then the change ΔC in C produced by a small change Δv = (Δv1,…,Δvm)T is

ΔC ≈ ∇C⋅Δv, (12)

where the gradient ∇C is the vector

∇C ≡ (∂C/∂v1, …, ∂C/∂vm)T. (13)

Just as for the two variable case, we can choose

Δv = −η∇C, (14)

and we're guaranteed that our (approximate) expression (12) for ΔC will be negative. This gives us a way of following the gradient to a minimum, even when C is a function of many variables, by repeatedly applying the update rule

v → v′ = v − η∇C. (15)
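A sketch of this general rule, using a finite-difference estimate of ∇C so it works for any smooth C of m variables (the numerical gradient is our own convenience, not part of the original discussion):

```python
import numpy as np

def numerical_grad(C, v, h=1e-6):
    """Estimate the gradient of C at v by central differences."""
    grad = np.zeros_like(v)
    for j in range(len(v)):
        e = np.zeros_like(v)
        e[j] = h
        grad[j] = (C(v + e) - C(v - e)) / (2 * h)
    return grad

def gradient_descent(C, v0, eta=0.1, steps=1000):
    """Repeatedly apply the update rule v -> v - eta * grad(C)(v)."""
    v = np.array(v0, dtype=float)
    for _ in range(steps):
        v -= eta * numerical_grad(C, v)
    return v

# Example: minimize a bowl in m = 4 variables, minimum at v = (1, 1, 1, 1).
C = lambda v: np.sum((v - 1.0) ** 2)
print(gradient_descent(C, [5.0, -2.0, 0.0, 3.0]))
```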
You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position v in order to find a minimum of the function C. The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of C, a point we'll return to explore in later chapters. But in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.
Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move Δv in position so as to decrease C as much as possible. This is equivalent to minimizing ΔC ≈ ∇C⋅Δv. We'll constrain the size of the move so that ∥Δv∥ = ϵ for some small fixed ϵ > 0. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases C as much as possible. It can be proved that the choice of Δv which minimizes ∇C⋅Δv is Δv = −η∇C, where η = ϵ/∥∇C∥ is determined by the size constraint ∥Δv∥ = ϵ. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease C.
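A quick numerical check of this claim (our own sketch): among many random directions of the same length ϵ, none decreases ∇C⋅Δv more than the step Δv = −(ϵ/∥∇C∥)∇C:

```python
import numpy as np

rng = np.random.default_rng(0)
grad = np.array([3.0, -1.0, 2.0])   # some gradient vector grad(C)
eps = 0.01                          # fixed step size ||dv|| = eps

best_dv = -eps * grad / np.linalg.norm(grad)   # dv = -(eps/||grad||) grad
best_change = grad @ best_dv                   # equals -eps * ||grad||

for _ in range(10000):
    d = rng.normal(size=3)
    dv = eps * d / np.linalg.norm(d)           # random direction, length eps
    assert grad @ dv >= best_change - 1e-12    # never beats steepest descent

print(best_change, -eps * np.linalg.norm(grad))  # identical values
```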