I spent two weeks studying support vector machines, but I didn't get through them thoroughly; I found them quite difficult, so I switched to linear regression for a while.
Our target:
A linear model makes a prediction by computing a weighted sum of the input features and then adding a bias term:
y_hat = theta0 + theta1*x1 + theta2*x2 + ... + thetan*xn
We can write this more simply in vectorized form:
y_hat = theta.T * x
To compute the parameter vector theta of f(x) directly, we can use the Normal Equation:
theta = (X.T * X)**(-1) * X.T * y
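To make this concrete, here is a minimal NumPy sketch of the Normal Equation on made-up noisy data (the variable names, values, and data are just for illustration):
import numpy as np

# Made-up data: y = 4 + 3*x plus Gaussian noise, just for illustration
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

X_b = np.c_[np.ones((100, 1)), X]                      # add x0 = 1 to every instance for the bias term
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y    # theta = (X.T * X)**(-1) * X.T * y
print(theta_best)                                      # roughly [4, 3]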
But noise makes it impossible to recover the exact linear relationship, so we need to optimize this estimate; common approaches are the Least Squares Method, Gradient Descent, and so on.
The Least Squares Method:
Our goal is to make the predictions f(x) as close as possible to the real targets y, i.e. f(x_i) ≈ y_i.
Then we want the MSE (mean squared error) to be as small as possible:
MSE(w, b) = 1/m * SUM_i((y_i - (w*x_i + b))**2)
So our target is to minimize the formula above. Taking the derivatives of this formula with respect to w and b and setting them to zero (the exact derivation is omitted here), we can get a closed-form solution for w and b.
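For the single-feature case, the standard result of setting those derivatives to zero is easy to check numerically; here is a small sketch (the data is made up, and the closed-form formulas are in the comments):
import numpy as np

# Closed-form least squares for one feature (w and b solve dMSE/dw = 0 and dMSE/db = 0):
#   w = SUM(y_i * (x_i - x_mean)) / (SUM(x_i**2) - (1/m) * SUM(x_i)**2)
#   b = (1/m) * SUM(y_i - w * x_i)
rng = np.random.default_rng(0)
x = rng.random(50)
y = 4 + 3 * x + 0.1 * rng.normal(size=50)              # made-up noisy data

m = len(x)
w = np.sum(y * (x - x.mean())) / (np.sum(x**2) - np.sum(x)**2 / m)
b = np.sum(y - w * x) / m
print(w, b)                                            # roughly 3 and 4
print(np.polyfit(x, y, deg=1))                         # NumPy's least-squares fit gives the same values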
Gradient Descent
It's easy to understand: first we initialize the parameter vector theta, then we choose a learning rate eta. If eta is too big, the learning curve may not converge; if eta is too small, we need a long time to adjust theta. Then we iterate over the samples, using the learning rate to move theta so that the MSE keeps shrinking, bringing the cost function closer to our target. In general, gradient descent can get trapped in a local minimum instead of the global one, but the MSE cost function of linear regression is convex, so Batch Gradient Descent (BGD) will reach the global minimum.
Batch Gradient Descent
grad_MSE(theta) = 2/m * X.T * (X*theta - y)   # the gradient of the MSE cost function
# m is the number of training instances, X is the sample matrix, y is the vector of target values
BGD's algorithm:
for each iteration:
    theta = theta - learning_rate * grad_MSE(theta)
Using the formula above we reach the global minimum, but it takes a long time to compute, because at every step we have to use all the training instances. Stochastic Gradient Descent is a way to reduce the computing time.
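Here is what that loop looks like as a runnable NumPy sketch on the same kind of made-up data as before (the learning rate and iteration count are arbitrary choices):
import numpy as np

# Batch Gradient Descent: every step uses the whole training set
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)
X_b = np.c_[np.ones((100, 1)), X]                      # add the bias column

m = len(X_b)
learning_rate = 0.1
theta = rng.normal(size=2)                             # random initialization

for iteration in range(1000):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)      # gradient of MSE over all m instances
    theta = theta - learning_rate * gradients
print(theta)                                           # converges to roughly [4, 3], like the Normal Equation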
Stochastic Gradient Descent
To make up for BGD's long running time, we introduce SGD. SGD trains on one randomly picked instance at a time, so we can't be sure every sample gets used during training. Its learning curve doesn't walk step by step toward the global minimum the way BGD's does; it bounces around, sometimes even going up, but on average it keeps heading toward the global minimum. With a constant learning rate it never settles on the optimal solution, so we make the learning rate smaller and smaller as training goes on, which lets it end up close to the optimum.
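A minimal SGD sketch with a shrinking learning rate (the schedule and its hyperparameters t0, t1 are arbitrary choices for illustration):
import numpy as np

# Stochastic Gradient Descent: each step uses one randomly picked instance
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)
X_b = np.c_[np.ones((100, 1)), X]

m = len(X_b)
t0, t1 = 5, 50                                         # learning-schedule hyperparameters (arbitrary)
theta = rng.normal(size=2)

for epoch in range(50):
    for i in range(m):
        idx = rng.integers(m)                          # pick one instance at random
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)       # gradient estimated from a single instance
        eta = t0 / (epoch * m + i + t1)                # learning rate shrinks as training goes on
        theta = theta - eta * gradients
print(theta)                                           # close to [4, 3], but it keeps bouncing around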
Mini-batch Gradient Descent
If you want to get close to the optimal solution without taking such a long time, you can use Mini-batch Gradient Descent. It combines BGD and SGD: each step uses a small random subset of the instances, and the learning rate can still be decreased over time, so we don't waste as much time as BGD and we avoid the instability of SGD.
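And the mini-batch variant, again as a sketch on made-up data (the batch size and learning rate are arbitrary):
import numpy as np

# Mini-batch Gradient Descent: each step uses a small random subset of the training set
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)
X_b = np.c_[np.ones((100, 1)), X]

m, batch_size = len(X_b), 20
theta = rng.normal(size=2)

for epoch in range(100):
    shuffled = rng.permutation(m)
    for start in range(0, m, batch_size):
        batch = shuffled[start:start + batch_size]
        gradients = 2 / batch_size * X_b[batch].T @ (X_b[batch] @ theta - y[batch])
        theta = theta - 0.05 * gradients               # a fixed small learning rate, for simplicity
print(theta)                                           # roughly [4, 3] again, smoother than pure SGD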
Regularized Linear Models
The previous models are prone to overfitting; a good way to avoid this is regularization, for example Ridge Regression and Lasso Regression.
Ridge Regression: a regularization term equal to alpha * SUM(theta(i)**2) is added to the cost function, which gives the ridge regression cost function; the hyperparameter alpha controls how much you want to regularize the model:
J(theta) = MSE(theta) + 1/2 * alpha * SUM(theta(i)**2)
Then this function's closed-form solution is:
theta = (X.T * X + alpha * A)**(-1) * X.T * y
# A is the (n+1)*(n+1) identity matrix, except with a 0 in the top-left cell so that the bias term theta0 is not regularized
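If scikit-learn is available, Ridge can be used directly (note that sklearn's alpha scaling convention differs slightly from the cost function written above, so the values aren't directly comparable):
import numpy as np
from sklearn.linear_model import Ridge

# Ridge Regression with scikit-learn; alpha is the regularization strength
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

ridge = Ridge(alpha=1.0)                               # alpha=0 would reduce this to plain Linear Regression
ridge.fit(X, y)
print(ridge.intercept_, ridge.coef_)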
Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The following is Elastic Net's cost function:
J(theta) = MSE(theta) + ratio * alpha * SUM(|theta(i)|) + (1-ratio)/2 * alpha * SUM(theta(i)**2)
The bigger the ratio, the heavier the Lasso part; the smaller the ratio, the lighter it is. In general, Elastic Net is preferred over Lasso, since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
Also, if we train for too long the model may overfit, and if we train for too short a time it may underfit, so we stop training when the validation error (the MSE between predictions and true values on a validation set) reaches its minimum. This is called early stopping.
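A quick Elastic Net sketch with scikit-learn; l1_ratio plays the role of the ratio in the cost function above (again, sklearn's alpha scaling differs a little from the formula as written):
import numpy as np
from sklearn.linear_model import ElasticNet

# Elastic Net with scikit-learn: l1_ratio=1 is a pure L1 (Lasso) penalty, l1_ratio=0 is a pure L2 penalty
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
print(elastic_net.intercept_, elastic_net.coef_)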
Logistic Regression
The model: y = 1 / (1 + e**(-(w.T*x + b)))
This is the sigmoid (logistic) function. It is very steep around the point where (w.T*x + b) equals zero, which helps avoid ambiguous, unclassified outputs as much as possible. In practice, you can just use this model from sklearn.
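For example, a sketch with scikit-learn's LogisticRegression on the classic iris dataset, classifying Iris virginica from petal width alone (the feature choice is just for illustration):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Binary logistic regression: is this flower an Iris virginica?
iris = load_iris()
X = iris.data[:, 3:]                                   # petal width (cm) only
y = (iris.target == 2).astype(int)                     # 1 if Iris virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)
print(log_reg.predict_proba([[1.7], [1.5]]))           # estimated probabilities for two petal widths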
Softmax Regression
Softmax Regression (or Multinomial Logistic Regression) is a generalization of Logistic Regression to the case where we want to handle multiple classes; each class has its own dedicated parameter vector:
s_k(x) = theta(k).T * x   # softmax score for class k
p_k(x) = e**(s_k(x)) / SUM_j(e**(s_j(x)))   # softmax function: the estimated probability that x belongs to class k
grad_k(J) = 1/m * SUM_i((p_k(x(i)) - y_k(i)) * x(i))   # cross-entropy gradient vector for class k
We compute a score and a probability for each possible class, which effectively separates that class from all the others: y_k(i) is equal to 1 if the target class for the ith instance is k, and 0 otherwise.
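To make the score-to-probability step concrete, here is a tiny NumPy sketch with made-up parameter vectors (the numbers mean nothing, they just show the mechanics):
import numpy as np

# Softmax step: scores s_k(x) = theta(k).T * x, then normalize into probabilities
Theta = np.array([[ 0.5, -1.0],                        # made-up parameter vector for class 0
                  [ 0.1,  0.3],                        # class 1
                  [-0.4,  0.7]])                       # class 2
x = np.array([1.0, 2.0])                               # one made-up instance (bias feature omitted)

scores = Theta @ x                                     # s_k(x) for each class k
probs = np.exp(scores) / np.sum(np.exp(scores))        # p_k = e**(s_k) / SUM_j(e**(s_j))
print(probs, probs.argmax())                           # the prediction is the class with the highest probability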