I spent two weeks studying support vector machines, but I didn't get through them thoroughly; I found them quite difficult, so I switched to linear regression for a while.
Our target:
A linear model makes a prediction by computing a weighted sum of the input features and then adding a bias term:
y_hat = theta0 + theta1*x1 + theta2*x2 + ... + thetan*xn
We can write this more simply in vectorized form:
y_hat = theta.T * x
To compute the parameter vector theta of f(x) directly, we can use the Normal Equation:
theta = (X.T * X)**(-1) * X.T * y
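To make this concrete, here is a minimal NumPy sketch of the Normal Equation on made-up noisy data (the variable names, values, and data are just for illustration):
import numpy as np

# Made-up data: y = 4 + 3*x plus Gaussian noise, just for illustration
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

X_b = np.c_[np.ones((100, 1)), X]                      # add x0 = 1 to every instance for the bias term
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y    # theta = (X.T * X)**(-1) * X.T * y
print(theta_best)                                      # roughly [4, 3]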
But noise makes it impossible to recover the exact linear relationship, so we need to optimize this estimate; common approaches are the Least Squares Method, Gradient Descent, and so on.
The Least Squares Method:
Our goal is to make the predictions f(x) as close as possible to the real targets y, i.e. f(x_i) ≈ y_i.
Then we want the MSE (mean squared error) to be as small as possible:
MSE(w, b) = 1/m * SUM_i((y_i - (w*x_i + b))**2)
So our target is to minimize the formula above. Taking the derivatives of this formula with respect to w and b and setting them to zero (the exact derivation is omitted here), we can get a closed-form solution for w and b.
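For the single-feature case, the standard result of setting those derivatives to zero is easy to check numerically; here is a small sketch (the data is made up, and the closed-form formulas are in the comments):
import numpy as np

# Closed-form least squares for one feature (w and b solve dMSE/dw = 0 and dMSE/db = 0):
#   w = SUM(y_i * (x_i - x_mean)) / (SUM(x_i**2) - (1/m) * SUM(x_i)**2)
#   b = (1/m) * SUM(y_i - w * x_i)
rng = np.random.default_rng(0)
x = rng.random(50)
y = 4 + 3 * x + 0.1 * rng.normal(size=50)              # made-up noisy data

m = len(x)
w = np.sum(y * (x - x.mean())) / (np.sum(x**2) - np.sum(x)**2 / m)
b = np.sum(y - w * x) / m
print(w, b)                                            # roughly 3 and 4
print(np.polyfit(x, y, deg=1))                         # NumPy's least-squares fit gives the same values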
Gradient Descent
It's easy to understand: first we initialize the parameter vector theta, then we choose a learning rate eta. If eta is too big, the learning curve may not converge; if eta is too small, we need a long time to adjust theta. Then we iterate over the samples, using the learning rate to move theta so that the MSE keeps shrinking, bringing the cost function closer to our target. In general, gradient descent can get trapped in a local minimum instead of the global one, but the MSE cost function of linear regression is convex, so Batch Gradient Descent (BGD) will reach the global minimum.
Batch Gradient Descent
grad_MSE(theta) = 2/m * X.T * (X*theta - y)   # the gradient of the MSE cost function
# m is the number of training instances, X is the sample matrix, y is the vector of target values
BGD's algorithm:
for each iteration:
    theta = theta - learning_rate * grad_MSE(theta)
Using the formula above we reach the global minimum, but it takes a long time to compute, because at every step we have to use all the training instances. Stochastic Gradient Descent is a way to reduce the computing time.
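Here is what that loop looks like as a runnable NumPy sketch on the same kind of made-up data as before (the learning rate and iteration count are arbitrary choices):
import numpy as np

# Batch Gradient Descent: every step uses the whole training set
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)
X_b = np.c_[np.ones((100, 1)), X]                      # add the bias column

m = len(X_b)
learning_rate = 0.1
theta = rng.normal(size=2)                             # random initialization

for iteration in range(1000):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)      # gradient of MSE over all m instances
    theta = theta - learning_rate * gradients
print(theta)                                           # converges to roughly [4, 3], like the Normal Equation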
Stochastic Gradient Descent
To make up for BGD's long running time, we introduce SGD. SGD trains on one randomly picked instance at a time, so we can't be sure every sample gets used during training. Its learning curve doesn't walk step by step toward the global minimum the way BGD's does; it bounces around, sometimes even going up, but on average it keeps heading toward the global minimum. With a constant learning rate it never settles on the optimal solution, so we make the learning rate smaller and smaller as training goes on, which lets it end up close to the optimum.
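A minimal SGD sketch with a shrinking learning rate (the schedule and its hyperparameters t0, t1 are arbitrary choices for illustration):
import numpy as np

# Stochastic Gradient Descent: each step uses one randomly picked instance
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)
X_b = np.c_[np.ones((100, 1)), X]

m = len(X_b)
t0, t1 = 5, 50                                         # learning-schedule hyperparameters (arbitrary)
theta = rng.normal(size=2)

for epoch in range(50):
    for i in range(m):
        idx = rng.integers(m)                          # pick one instance at random
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)       # gradient estimated from a single instance
        eta = t0 / (epoch * m + i + t1)                # learning rate shrinks as training goes on
        theta = theta - eta * gradients
print(theta)                                           # close to [4, 3], but it keeps bouncing around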
Mini-batch Gradient Descent
If you want to get close to the optimal solution without taking such a long time, you can use Mini-batch Gradient Descent. It combines BGD and SGD: each step uses a small random subset of the instances, and the learning rate can still be decreased over time, so we don't waste as much time as BGD and we avoid the instability of SGD.
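And the mini-batch variant, again as a sketch on made-up data (the batch size and learning rate are arbitrary):
import numpy as np

# Mini-batch Gradient Descent: each step uses a small random subset of the training set
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)
X_b = np.c_[np.ones((100, 1)), X]

m, batch_size = len(X_b), 20
theta = rng.normal(size=2)

for epoch in range(100):
    shuffled = rng.permutation(m)
    for start in range(0, m, batch_size):
        batch = shuffled[start:start + batch_size]
        gradients = 2 / batch_size * X_b[batch].T @ (X_b[batch] @ theta - y[batch])
        theta = theta - 0.05 * gradients               # a fixed small learning rate, for simplicity
print(theta)                                           # roughly [4, 3] again, smoother than pure SGD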
Regularized Linear Models
The previous models are prone to overfitting; a good way to avoid this is regularization, for example Ridge Regression and Lasso Regression.
Ridge Regression: a regularization term equal to alpha * SUM(theta(i)**2) is added to the cost function, which gives the ridge regression cost function; the hyperparameter alpha controls how much you want to regularize the model:
J(theta) = MSE(theta) + 1/2 * alpha * SUM(theta(i)**2)
Then this function's closed-form solution is:
theta = (X.T * X + alpha * A)**(-1) * X.T * y
# A is the (n+1)*(n+1) identity matrix, except with a 0 in the top-left cell so that the bias term theta0 is not regularized
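If scikit-learn is available, Ridge can be used directly (note that sklearn's alpha scaling convention differs slightly from the cost function written above, so the values aren't directly comparable):
import numpy as np
from sklearn.linear_model import Ridge

# Ridge Regression with scikit-learn; alpha is the regularization strength
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

ridge = Ridge(alpha=1.0)                               # alpha=0 would reduce this to plain Linear Regression
ridge.fit(X, y)
print(ridge.intercept_, ridge.coef_)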
Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The following is Elastic Net's cost function:
J(theta) = MSE(theta) + ratio * alpha * SUM(|theta(i)|) + (1-ratio)/2 * alpha * SUM(theta(i)**2)
The bigger the ratio, the heavier the Lasso part; the smaller the ratio, the lighter it is. In general, Elastic Net is preferred over Lasso, since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
Also, if we train for too long the model may overfit, and if we train for too short a time it may underfit, so we stop training when the validation error (the MSE between predictions and true values on a validation set) reaches its minimum. This is called early stopping.
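A quick Elastic Net sketch with scikit-learn; l1_ratio plays the role of the ratio in the cost function above (again, sklearn's alpha scaling differs a little from the formula as written):
import numpy as np
from sklearn.linear_model import ElasticNet

# Elastic Net with scikit-learn: l1_ratio=1 is a pure L1 (Lasso) penalty, l1_ratio=0 is a pure L2 penalty
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
print(elastic_net.intercept_, elastic_net.coef_)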
Logistic Regression
The model: y = 1 / (1 + e**(-(w.T*x + b)))
This is the sigmoid (logistic) function. It is very steep around the point where (w.T*x + b) equals zero, which helps avoid ambiguous, unclassified outputs as much as possible. In practice, you can just use this model from sklearn.
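For example, a sketch with scikit-learn's LogisticRegression on the classic iris dataset, classifying Iris virginica from petal width alone (the feature choice is just for illustration):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Binary logistic regression: is this flower an Iris virginica?
iris = load_iris()
X = iris.data[:, 3:]                                   # petal width (cm) only
y = (iris.target == 2).astype(int)                     # 1 if Iris virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)
print(log_reg.predict_proba([[1.7], [1.5]]))           # estimated probabilities for two petal widths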
Softmax Regression
Softmax Regression (or Multinomial Logistic Regression) is a generalization of Logistic Regression to the case where we want to handle multiple classes; each class has its own dedicated parameter vector:
s_k(x) = theta(k).T * x   # softmax score for class k
p_k(x) = e**(s_k(x)) / SUM_j(e**(s_j(x)))   # softmax function: the estimated probability that x belongs to class k
grad_k(J) = 1/m * SUM_i((p_k(x(i)) - y_k(i)) * x(i))   # cross-entropy gradient vector for class k
We compute a score and a probability for each possible class, which effectively separates that class from all the others: y_k(i) is equal to 1 if the target class for the ith instance is k, and 0 otherwise.
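To make the score-to-probability step concrete, here is a tiny NumPy sketch with made-up parameter vectors (the numbers mean nothing, they just show the mechanics):
import numpy as np

# Softmax step: scores s_k(x) = theta(k).T * x, then normalize into probabilities
Theta = np.array([[ 0.5, -1.0],                        # made-up parameter vector for class 0
                  [ 0.1,  0.3],                        # class 1
                  [-0.4,  0.7]])                       # class 2
x = np.array([1.0, 2.0])                               # one made-up instance (bias feature omitted)

scores = Theta @ x                                     # s_k(x) for each class k
probs = np.exp(scores) / np.sum(np.exp(scores))        # p_k = e**(s_k) / SUM_j(e**(s_j))
print(probs, probs.argmax())                           # the prediction is the class with the highest probability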