MML(skl)——C2

The goal of a regression problem is to predict the value of a continuous response (dependent) variable.

Steps: prepare training data, choose a model, apply a learning algorithm, and evaluate with appropriate metrics.

Theoretical Part

Linear Regression

Data

training data

Training instance | Diameter (in inches) | Price (in dollars)
1                 | 6                    | 7
2                 | 8                    | 9
3                 | 10                   | 13
4                 | 14                   | 17.5
5                 | 18                   | 18
(sample size n = 5; x = diameter, y = price)
Visualize via matplotlib
>>> import matplotlib.pyplot as plt
>>> X = [[6], [8], [10], [14], [18]]
>>> y = [[7], [9], [13], [17.5], [18]]
>>> plt.figure()
>>> plt.title('Pizza price plotted against diameter')
>>> plt.xlabel('Diameter in inches')
>>> plt.ylabel('Price in dollars')
>>> plt.plot(X, y, 'k.')
>>> plt.axis([0, 25, 0, 25])
>>> plt.grid(True)
>>> plt.show()
Model fitting
>>> from sklearn.linear_model import LinearRegression
>>> # Training data
>>> X = [[6], [8], [10], [14], [18]]
>>> y = [[7], [9], [13], [17.5], [18]]
>>> # Create and fit the model
>>> model = LinearRegression()
>>> model.fit(X, y)
>>> print('A 12" pizza should cost: $%.2f' % model.predict([[12]])[0])
A 12" pizza should cost: $13.68

The sklearn.linear_model.LinearRegression class is an estimator.
Estimators predict a value based on the observed data. In scikit-learn, all estimators implement the fit() and predict() methods.
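
Once fitted, the estimator's learned parameters can also be inspected directly; a minimal check, continuing the snippet above:

>>> print('Intercept:', model.intercept_)  # alpha, about 1.97 for this data
>>> print('Slope:', model.coef_)           # beta, about 0.98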

Comparison
import numpy as np
xs = np.linspace(0, 20, 100)
plt.figure()
plt.plot(xs, model.predict(xs.reshape(-1, 1)), 'r')  # fitted regression line
plt.plot(X, y, 'bo', markersize=10)                  # training samples
plt.grid(True)
plt.title('predicted vs. sample')
plt.show()
LinearR_comparison.png
Evaluation of model fitness

some definitions:
cost function / loss function := defines and measures the error of a model
residuals (training errors) := the differences between the predicted values and the y values of the training data
prediction errors (test errors) := the differences between the predicted values and the y values of the test data
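
A minimal sketch computing both kinds of error for the pizza model (the test diameters and prices are the ones used in the Evaluation subsection below):

import numpy as np
from sklearn.linear_model import LinearRegression

X_train, y_train = [[6], [8], [10], [14], [18]], [7, 9, 13, 17.5, 18]
X_test, y_test = [[8], [9], [11], [16], [12]], [11, 8.5, 15, 18, 11]

model = LinearRegression().fit(X_train, y_train)
residuals = np.array(y_train) - model.predict(X_train)        # training errors
prediction_errors = np.array(y_test) - model.predict(X_test)  # test errors
print(residuals)
print(prediction_errors)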

some definitions for linear regression:
residual sum of squares (RSS) cost function
LSE: least squares estimators

when we have a cost function, we can find the values of our model's parameters
that minimize it.

Note: an unbiased estimator of the variance of a dataset uses N-1 rather than N as the denominator.
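
For simple linear regression, the least squares estimators have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the sample means. A minimal numpy sketch on the pizza training data (ddof=1 gives the N-1 denominator from the note above):

import numpy as np

x = np.array([6, 8, 10, 14, 18])
y = np.array([7, 9, 13, 17.5, 18])

beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope = cov(x, y) / var(x)
alpha = y.mean() - beta * x.mean()                      # intercept through the means
print(alpha, beta)            # roughly 1.97 and 0.98
print(alpha + beta * 12)      # price of a 12" pizza, about 13.68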

Evaluation

test data

Test instance | Diameter (in inches) | Observed price (in dollars) | Predicted price (in dollars)
1             | 8                    | 11                          | 9.7759
2             | 9                    | 8.5                         | 10.7522
3             | 11                   | 15                          | 12.7048
4             | 16                   | 18                          | 17.5863
5             | 12                   | 11                          | 13.6811
(sample size n = 5; x = diameter, y = observed price, y_predicted = model prediction)

Several measures can be used to assess our model's predictive capabilities. We will
evaluate our pizza-price predictor using r-squared
r^2 = 1: the model predicts with no error
r^2 = 0.5: half of the variance in the response variable can be predicted from the model
In the case of simple linear regression, r-squared is equal to the square of the Pearson product moment correlation coefficient, or Pearson's r.

R^2 := 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^n (y_i - f(x_i))^2}{\sum_{i=1}^n (y_i - \bar{y})^2}

>>> from sklearn.linear_model import LinearRegression
>>> X = [[6], [8], [10], [14], [18]]
>>> y = [[7], [9], [13], [17.5], [18]]
>>> X_test = [[8], [9], [11], [16], [12]]
>>> y_test = [[11], [8.5], [15], [18], [11]]
>>> model = LinearRegression()
>>> model.fit(X, y)
>>> print('R-squared: %.4f' % model.score(X_test, y_test))
R-squared: 0.6620
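
The same value can be reproduced directly from the R^2 definition above; a minimal sketch:

import numpy as np
from sklearn.linear_model import LinearRegression

X, y = [[6], [8], [10], [14], [18]], [7, 9, 13, 17.5, 18]
X_test, y_test = [[8], [9], [11], [16], [12]], [11, 8.5, 15, 18, 11]

model = LinearRegression().fit(X, y)
y_pred = model.predict(X_test)

ss_res = np.sum((np.array(y_test) - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((np.array(y_test) - np.mean(y_test)) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)    # about 0.6620, matching model.score(X_test, y_test)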

Multiple Linear Regression

for Y = X \beta
solution: \beta = (X^T X)^{-1} X^T Y

so a package for matrix inversion is introduced: np.linalg (linear algebra)

>>> from numpy.linalg import inv
>>> from numpy import dot, transpose
>>> # normal equation: beta = (X^T X)^{-1} X^T y
>>> beta = dot(inv(dot(transpose(X), X)), dot(transpose(X), y))

NumPy also provides a least squares solver: np.linalg.lstsq
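
A minimal np.linalg.lstsq sketch on the pizza training data, with an explicit column of ones for the intercept:

import numpy as np

X = np.array([[1, 6], [1, 8], [1, 10], [1, 14], [1, 18]])  # intercept column + diameter
y = np.array([7, 9, 13, 17.5, 18])

beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # [intercept, slope], roughly [1.97, 0.98]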

Polynomial regression

Quadratic regression

y = \alpha + \beta_1 x + \beta_2 x^2
e.g.

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X_train = [[6], [8], [10], [14], [18]]
>>> y_train = [[7], [9], [13], [17.5], [18]]
>>> X_test = [[6], [8], [11], [16]]
>>> y_test = [[8], [12], [15], [18]]
>>> regressor = LinearRegression()
>>> regressor.fit(X_train, y_train)
>>> xx = np.linspace(0, 26, 100)
>>> yy = regressor.predict(xx.reshape(xx.shape[0], 1))
>>> plt.plot(xx, yy)
1.png

Note: here is the part that differs from simple linear regression.

>>> quadratic_featurizer = PolynomialFeatures(degree=2)
>>> X_train_quadratic = quadratic_featurizer.fit_transform(X_train)
>>> X_test_quadratic = quadratic_featurizer.transform(X_test)
"""X_train_quadratic:
array([[  1.,   6.,  36.],
       [  1.,   8.,  64.],
       [  1.,  10., 100.],
       [  1.,  14., 196.],
       [  1.,  18., 324.]])"""
>>> regressor_quadratic = LinearRegression()
>>> regressor_quadratic.fit(X_train_quadratic, y_train)
>>> xx_quadratic = quadratic_featurizer.transform(xx.reshape(xx.shape[0], 1))

PolynomialFeatures(degree=N).fit_transform(x) maps x to (1, x, x^2, ..., x^N).
The MAIN POINT is to transform x into multiple feature columns and then still use LinearRegression().

>>> plt.plot(xx, regressor_quadratic.predict(xx_quadratic), c='r', linestyle='--')
2.png
>>> plt.title('Pizza price regressed on diameter')
>>> plt.xlabel('Diameter in inches')
>>> plt.ylabel('Price in dollars')
>>> plt.axis([0, 25, 0, 25])
>>> plt.grid(True)
>>> plt.scatter(X_train, y_train)
>>> plt.show()
>>> print(X_train)
>>> print(X_train_quadratic)
>>> print(X_test)
>>> print(X_test_quadratic)
>>> print('Simple linear regression r-squared', regressor.score(X_test, y_test))
>>> print('Quadratic regression r-squared', regressor_quadratic.score(X_test_quadratic, y_test))
3.png

R^2 increases to about 0.87.

When degree = 9, R^2 drops to about -0.09,
which indicates over-fitting.
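
A minimal sketch to reproduce the degree-9 experiment; the point is only that a polynomial flexible enough to pass through every training point generalizes poorly to the test set:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X_train, y_train = [[6], [8], [10], [14], [18]], [[7], [9], [13], [17.5], [18]]
X_test, y_test = [[6], [8], [11], [16]], [[8], [12], [15], [18]]

ninth = PolynomialFeatures(degree=9)
X_train_ninth = ninth.fit_transform(X_train)
X_test_ninth = ninth.transform(X_test)

regressor_ninth = LinearRegression().fit(X_train_ninth, y_train)
print('Degree-9 regression r-squared:', regressor_ninth.score(X_test_ninth, y_test))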

Regularization

Regularization is a collection of techniques that can be used to prevent over-fitting.
Regularization adds information to a problem, often in the form of a penalty against complexity.

Occam's razor : a hypothesis with the fewest assumptions is the best

Ridge regression (Tikhonov regularization) (L2 penalty):
RSS_{ridge} = \sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2

\lambda is a hyperparameter: a parameter of the model that is not learned automatically and must be set manually.

Least Absolute Shrinkage and Selection Operator (LASSO) (L1 penalty):
RSS_{lasso} = \sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|
NOTE: The LASSO produces sparse parameters: most of the coefficients become zero, so the model depends on a small subset of the features, whereas ridge regression keeps most coefficients nonzero.
When explanatory variables are correlated, the LASSO shrinks the coefficient of one of them toward zero, while ridge regression shrinks them more uniformly.

Elastic Net:
RSS_{elastic\ net} = \sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda_2 \sum_{j=1}^p \beta_j^2 + \lambda_1 \sum_{j=1}^p |\beta_j|
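
All three penalized regressors are available in sklearn.linear_model and follow the same fit/predict interface as LinearRegression; a minimal sketch on the pizza data (the alpha values below are arbitrary illustrations, not tuned):

from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = [[6], [8], [10], [14], [18]], [7, 9, 13, 17.5, 18]

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_, model.intercept_)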

Down To Earth

dataset url: https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

Data Exploring

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
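# Note: names lists 13 labels for a 14-column file, so pandas treats the leftover
# first column (the wine class label) as the index; it is renamed 'quality' below.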
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',header=None,names=['Alcohol','Malic_acid ','Ash','Alcalinity_of_ash','Magnesium', 'Total_phenols','Flavanoids','Nonflavanoid_phenols','Proanthocyanins','Color_intensity','Hue','OD280/OD315_of_diluted wines','Proline'])
data.index.name='quality'

plt.figure(figsize=(15,15))

plt.subplot(2,2,1)
plt.title('alcohol vs quality ')
plt.xlabel('alcohol')
plt.ylabel('quality')
plt.scatter(data['Alcohol'], data.index)

plt.subplot(2,2,2)
plt.title('Ash vs quality ')
plt.xlabel('Ash')
plt.ylabel('quality')
plt.scatter(data['Ash'], data.index)

plt.subplot(2,2,3)
plt.title('Proline vs quality ')
plt.xlabel('Proline')
plt.ylabel('quality')
plt.scatter(data['Proline'], data.index)

plt.subplot(2,2,4)
plt.title('Hue vs quality ')
plt.xlabel('Hue')
plt.ylabel('quality')
plt.scatter(data['Hue'], data.index)

plt.show()
4.png

Model Fitting

Hold-out validation

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',header=None,names=['Alcohol','Malic_acid ','Ash','Alcalinity_of_ash','Magnesium', 'Total_phenols','Flavanoids','Nonflavanoid_phenols','Proanthocyanins','Color_intensity','Hue','OD280/OD315_of_diluted wines','Proline'])
data.index.name='quality'

X = data.loc[:,['Alcohol','Ash','Proline','Hue']]
y = data.index
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 40)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_predictions = regressor.predict(X_test)
print('R-squared:', regressor.score(X_test, y_test))


R-squared: 0.630209361477557
  1. Load the data.
  2. Split the data set via model_selection.train_test_split.
    Note
    i. train_test_split(data, label, stratify=y, test_size=0.25, random_state=40); test_size defaults to 0.25. This is the hold-out method with random (or, when stratify is given, stratified) sampling.
    ii. With a stratified split, the R-squared increases to 0.6554701296431691 (see the sketch after this list).
  3. Train the model and evaluate it on the test set.
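
A minimal sketch of the stratified variant from note (ii), reusing X and y from the hold-out code above:

# stratify=y keeps the class proportions of y the same in the train and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=40)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
print('R-squared (stratified split):', regressor.score(X_test, y_test))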

Cross validation

model_selection.cross_val_score(estimator, data, target, cv=5); when cv is a number k, k-fold cross-validation is performed.
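
A minimal sketch on the wine data, reusing X and y from the hold-out example above:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# cv=5: fit on four folds, score (R-squared) on the held-out fold, five times
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores, scores.mean())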

GD (Gradient Descent)

reasons: it reduces computational complexity, and the matrix X^T X may not be invertible
Gradient descent is an optimization algorithm that can be used to estimate the local minimum of a function. Fortunately, the residual sum of squares cost function is convex, so a local minimum is also the global minimum.

to minimize SS_{res} = \sum_{i=1}^n (y_i - f(x_i))^2
learning rate: too large, and the algorithm keeps overshooting and oscillates around the minimum; too small, and convergence takes too long

(Batch) gradient descent

uses all of the training instances to update the model parameters in each iteration
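
A minimal numpy sketch of batch gradient descent for simple linear regression on the pizza data; the learning rate and iteration count are arbitrary illustrative choices:

import numpy as np

x = np.array([6, 8, 10, 14, 18], dtype=float)
y = np.array([7, 9, 13, 17.5, 18], dtype=float)

alpha, beta = 0.0, 0.0   # intercept and slope, initialized at zero
lr = 0.001               # learning rate
for _ in range(200000):  # every update uses ALL training instances
    error = (alpha + beta * x) - y
    alpha -= lr * error.mean()        # gradient of (1/2)*MSE with respect to alpha
    beta -= lr * (error * x).mean()   # gradient of (1/2)*MSE with respect to beta
print(alpha, beta)       # approaches the least squares solution, about 1.97 and 0.98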

Stochastic Gradient Descent (SGD)

updates the parameters using only a single training instance in each iteration. The training instance is usually selected randomly.

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
data = load_boston()
X_train, X_test, y_train, y_test = train_test_split(data.data,data.target)
X_scaler = StandardScaler()
y_scaler = StandardScaler()
X_train = X_scaler.fit_transform(X_train)
y_train = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()  # scale, then flatten back to 1-D
X_test = X_scaler.transform(X_test)
y_test = y_scaler.transform(y_test.reshape(-1, 1)).ravel()
regressor = SGDRegressor()  # squared error loss is the default
scores = cross_val_score(regressor, X_train, y_train, cv = 5)
print('Cross validation r-squared scores:', scores)
print('Average cross validation r-squared score:', np.mean(scores))
regressor.fit(X_train, y_train)
print('Test set r-squared score', regressor.score(X_test, y_test))

Cross validation r-squared scores: [0.59439483 0.613529   0.72415499 0.78472194 0.69196096]
Average cross validation r-squared score: 0.6817523439301019
