1. Logistic Regression (Classification)
Linear regression is not well suited to some classification problems, such as classifying whether an email is spam or not, or judging whether a tumor is malignant based on its size.
So there is another algorithm, logistic regression, which takes several features xi and whose output y has only two possible values: zero or one.
Hypothesis Representation
In linear regression, the hypothesis output θ'x can be larger than 1 or smaller than 0, so we use the sigmoid function to squash the hypothesis output into the range between 0 and 1.
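In symbols, the standard hypothesis of logistic regression is

$$h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}},$$

so 0 < hθ(x) < 1, and hθ(x) can be read as the estimated probability that y = 1 for input x.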
Decision Boundary
The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.
The decision boundary can be linear or nonlinear, and sometimes even a complicated curve.
As we can see from the shape of the sigmoid function, if we define:
h(x) = g(z) ≥ 0.5 —> y = 1 ;
h(x) = g(z) < 0.5 —> y = 0 ;
this means we predict y = 1 exactly when z ≥ 0, so z = 0 is the boundary.
So, if z = θ'x, then θ'x = 0 is the boundary that divides the area into two parts, y = 0 and y = 1. For example, z = θ'x = θ0*x0 + θ1*x1 + θ2*x2 gives a linear boundary.
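As a small worked example (the numbers are illustrative, not taken from any particular dataset): take θ = (-3, 1, 1)', so

$$z = \theta^T x = -3 + x_1 + x_2 .$$

We predict y = 1 whenever -3 + x1 + x2 ≥ 0, so the straight line x1 + x2 = 3 is the (linear) decision boundary.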
Cost Function
We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function.
So, we define the cost function of logistic regression as follows:
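The standard choice is the log loss, defined piecewise:

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

It penalizes a confident wrong prediction very heavily and a confident correct prediction hardly at all.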
We can rewrite the cost equation into the form:
Cost(h(x), y) = -y·log(h(x)) - (1-y)·log(1-h(x))
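Averaging this cost over all m training examples gives the full (and convex) cost function:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr]$$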
Gradient Descent
The form is the same as the gradient descent update for linear regression.
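Written out per parameter, the repeated update is

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)},$$

the only difference from linear regression being that hθ(x) is now the sigmoid of θ'x rather than θ'x itself.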
A vectorized implementation is:
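In matrix form, with X the design matrix whose rows are the training examples:

$$\theta := \theta - \frac{\alpha}{m} X^T \bigl(g(X\theta) - \vec{y}\bigr)$$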
Advanced Optimization
"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they're already tested and highly optimized. Octave provides them.
2. Multi-class Classification: One-vs-all
If we have more than two categories, then instead of y = {0,1} we expand our definition so that y = {0,1,...,n}. We divide the problem into n+1 binary classification problems (+1 because the index starts at 0).
To summarize:
Train a logistic regression classifier hθ(x) for each class i to predict the probability that y = i.
To make a prediction on a new x, pick the class whose classifier gives the largest hθ(x) (see the sketch below).
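A minimal numpy sketch of this one-vs-all scheme, assuming a hypothetical helper train_binary(X, y) that fits a single logistic regression classifier (for example with the BFGS routine above) and returns its θ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all(X, y, num_labels, train_binary):
    # fit one classifier per class: class i is relabeled as 1, everything else as 0
    all_theta = np.zeros((num_labels, X.shape[1]))
    for i in range(num_labels):
        all_theta[i] = train_binary(X, (y == i).astype(float))
    return all_theta

def predict_one_vs_all(all_theta, X):
    # probabilities for every class, shape (m, num_labels); pick the largest per row
    probs = sigmoid(X @ all_theta.T)
    return np.argmax(probs, axis=1)
```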
3. Problem: Over-fitting
The hypothesis function may fit the examples in the training set very well, but fail to predict unseen data.
As shown in the picture above, the first curve uses too few features, so it does not fit the data well; this is called "under-fitting" or "high bias". The second curve fits just right. The last curve fits every example in the training set, but it looks like an unreasonable, complicated drawing and may not predict unseen data well; this condition is called "over-fitting" or "high variance".
What are the reasons for over-fitting?
1). too many features
2). too complicated a hypothesis function
How to solve it?
1). reduce the number of features
2). regularization
  - Keep all the features, but reduce the magnitude of the parameters θj.
  - Regularization works well when we have a lot of slightly useful features.
Cost Function
The regularized cost function adds a penalty term to the original cost:
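Taking the squared-error cost of linear regression as the base, the standard regularized cost is

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right],$$

where λ is the regularization parameter: the larger λ is, the more the θj are shrunk toward zero. By convention θ0 is not penalized, so the inner sum starts at j = 1.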
Regularized Linear Regression
It changes the form of both gradient descent and the normal equation.
Gradient Descent
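Since θ0 is not regularized, it keeps its old update, while every other θj gets an extra shrinkage term:

$$\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)} + \frac{\lambda}{m}\, \theta_j \right], \qquad j \ge 1$$

The second update can be rearranged as θj := θj(1 - α·λ/m) - (α/m)·Σ(hθ(x) - y)·xj, which shows that each step first shrinks θj slightly and then applies the usual gradient step.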
Normal Equation
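The regularized normal equation adds λ times a modified identity matrix:

$$\theta = \bigl(X^T X + \lambda L\bigr)^{-1} X^T y, \qquad L = \mathrm{diag}(0, 1, 1, \dots, 1),$$

where L is (n+1)×(n+1) and the 0 in its top-left corner keeps θ0 unpenalized.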
Recall that if m < n, then X'X is non-invertible. However, when we add the term λ⋅L, then X'X + λ⋅L becomes invertible.
Regularized Logistic Regression
We can regularize logistic regression in a similar way to how we regularize linear regression.
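The cost function gains the same penalty term (again excluding θ0):

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$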
So the gradient descent update changes as follows:
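The updates look identical to those of regularized linear regression, except that hθ(x) is now the sigmoid hypothesis:

$$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)} + \frac{\lambda}{m}\, \theta_j \right], \qquad j \ge 1,$$

with θ0 still updated without the λ term.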