机器学习(3)线性回归

线性回归算法简介

image

线性回归算法以一个坐标系里一个维度为结果,其他维度为特征(如二维平面坐标系中横轴为特征,纵轴为结果),无数的训练集放在坐标系中,发现他们是围绕着一条执行分布。线性回归算法的期望,就是寻找一条直线,最大程度的“拟合”样本特征和样本输出标记的关系

image.png

########## 样本特征只有一个的线性回归问题,为简单线性回归,如房屋价格-房屋面积

将横坐标作为x轴,纵坐标作为y轴,每一个点为(X(i) ,y(i)),那么我们期望寻找的直线就是y=ax+b,当给出一个新的点x(j)的时候,我们希望预测的y^(j)=ax(j)+b

image.png
  • 不使用直接相减的方式,由于差值有正有负,会抵消
  • 不适用绝对值的方式,由于绝对值函数存在不可导的点
image
image

########## 通过上面的推导,我们可以归纳出一类机器学习算法的基本思路,如下图;其中损失函数是计算期望值和预测值的差值,期望其差值(也就是损失)越来越小,而效用函数则是描述拟合度,期望契合度越来越好

image

简单线性回归的最小二乘法推导过程

image.png
image.png
image.png
image.png
image.png
image.png

实现简单线性回归法

import numpy as np
import matplotlib.pyplot as plt
x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])
plt.scatter(x, y)
plt.axis([0, 6, 0, 6])
plt.show()
image.png
x_mean = np.mean(x)
y_mean = np.mean(y)
num = 0.0
d = 0.0
for x_i, y_i in zip(x, y):
    num += (x_i - x_mean) * (y_i - y_mean)
    d += (x_i - x_mean) ** 2
a = num/d
b = y_mean - a * x_mean
y_hat = a * x + b
plt.scatter(x, y)
plt.plot(x, y_hat, color='r')
plt.axis([0, 6, 0, 6])
plt.show()
image.png
x_predict = 6
y_predict = a * x_predict + b
y_predict
5.2000000000000002
封装我们自己的SimpleLinearRegression

代码SimpleLinearRegression.py

class SimpleLinearRegression1:

    def __init__(self):
        """初始化Simple Linear Regression 模型"""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """根据训练集x_train,y_train 训练Simple Linear Regression 模型"""
        assert x_train.ndim == 1,\
            "Simple Linear Regression can only solve simple feature training data"
        assert len(x_train) == len(y_train),\
            "the size of x_train must be equal to the size of y_train"

        ## 求均值
        x_mean = x_train.mean()
        y_mean = y_train.mean()

        ## 分子
        num = 0.0
        ## 分母
        d = 0.0

        ## 计算分子分母
        for x_i, y_i in zip(x_train, y_train):
            num += (x_i-x_mean)*(y_i-y_mean)
            d += (x_i-x_mean) ** 2

        ## 计算参数a和b
        self.a_ = num/d
        self.b_ = y_mean - self.a_ * x_mean

        return self

    def predict(self, x_predict):
        """给定待预测集x_predict,返回x_predict对应的预测结果值"""
        assert x_predict.ndim == 1,\
            "Simple Linear Regression can only solve simple feature training data"
        assert self.a_ is not None and self.b_ is not None,\
            "must fit before predict!"

        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """给定单个待预测数据x_single,返回x_single对应的预测结果值"""
        return self.a_*x_single+self.b_

    def __repr__(self):
        return "SimpleLinearRegression1()"


from playML.SimpleLinearRegression import SimpleLinearRegression1

reg1 = SimpleLinearRegression1()
reg1.fit(x, y)
reg1.predict(np.array([x_predict]))
array([ 5.2])
reg1.a_
0.80000000000000004
reg1.b_
0.39999999999999947
y_hat1 = reg1.predict(x)
plt.scatter(x, y)
plt.plot(x, y_hat1, color='r')
plt.axis([0, 6, 0, 6])
plt.show()
image.png

向量化

image
image
向量化实现SimpleLinearRegression

代码SimpleLinearRegression.py

import numpy as np


class SimpleLinearRegression2:

    def __init__(self):
        """初始化Simple Linear Regression模型"""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """根据训练数据集x_train,y_train训练Simple Linear Regression模型"""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        self.a_ = (x_train - x_mean).dot(y_train - y_mean) / (x_train - x_mean).dot(x_train - x_mean)
        self.b_ = y_mean - self.a_ * x_mean

        return self

    def predict(self, x_predict):
        """给定待预测数据集x_predict,返回表示x_predict的结果向量"""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict!"

        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """给定单个待预测数据x_single,返回x_single的预测结果值"""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression2()"

from playML.SimpleLinearRegression import SimpleLinearRegression2

reg2 = SimpleLinearRegression2()
reg2.fit(x, y)
reg2.predict(np.array([x_predict]))
array([ 5.2])
reg2.a_
0.80000000000000004
reg2.b_
0.39999999999999947
向量化实现的性能测试
m = 1000000
big_x = np.random.random(size=m)
big_y = big_x * 2 + 3 + np.random.normal(size=m)
%timeit reg1.fit(big_x, big_y)
%timeit reg2.fit(big_x, big_y)
1 loop, best of 3: 984 ms per loop
100 loops, best of 3: 18.7 ms per loop
reg1.a_
1.9998479120324177
reg1.b_
2.9989427131166595
reg2.a_
1.9998479120324153
reg2.b_
2.9989427131166604

衡量线性回归算法的指标

衡量标准

image

其中衡量标准是和m有关的,因为越多的数据量产生的误差和可能会更大,但是毫无疑问越多的数据量训练出来的模型更好,为此需要一个取消误差的方法,如下

image

MSE 的缺点,量纲不准确,如果y的单位是万元,平方后就变成了万元的平方,这可能会给我们带来一些麻烦

image
image

RMSE 平方累加后再开根号,如果某些预测结果和真实结果相差非常大,那么RMSE的结果会相对变大,所以RMSE有放大误差的趋势,而MAE没有,他直接就反应的是预测结果和真实结果直接的差距,正因如此,从某种程度上来说,想办法我们让RMSE变的更小小对于我们来说比较有意义,因为这意味着整个样本的错误中,那个最值相对比较小,而且我们之前训练样本的目标,就是RMSE根号里面1/m的这一部分,而这一部分的本质和优化RMSE是一样的.

衡量回归算法的标准,MSE vs MAE

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
波士顿房产数据
boston = datasets.load_boston()
boston.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
print(boston.DESCR)
    Boston House Prices dataset
    ===========================
    
    Notes
    ------
    Data Set Characteristics:  
    
        :Number of Instances: 506 
    
        :Number of Attributes: 13 numeric/categorical predictive
        
        :Median Value (attribute 14) is usually the target
    
        :Attribute Information (in order):
            - CRIM     per capita crime rate by town
            - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
            - INDUS    proportion of non-retail business acres per town
            - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
            - NOX      nitric oxides concentration (parts per 10 million)
            - RM       average number of rooms per dwelling
            - AGE      proportion of owner-occupied units built prior to 1940
            - DIS      weighted distances to five Boston employment centres
            - RAD      index of accessibility to radial highways
            - TAX      full-value property-tax rate per $10,000
            - PTRATIO  pupil-teacher ratio by town
            - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
            - LSTAT    % lower status of the population
            - MEDV     Median value of owner-occupied homes in $1000's
    
        :Missing Attribute Values: None
    
        :Creator: Harrison, D. and Rubinfeld, D.L.
    
    This is a copy of UCI ML housing dataset.
    http://archive.ics.uci.edu/ml/datasets/Housing


​    
​    This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
​    
    The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
    prices and the demand for clean air', J. Environ. Economics & Management,
    vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
    ...', Wiley, 1980.   N.B. Various transformations are used in the table on
    pages 244-261 of the latter.
    
    The Boston house-price data has been used in many machine learning papers that address regression
    problems.   
         
    **References**
    
       - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
       - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
       - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)


boston.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], 
      dtype='<U7')
x = boston.data[:,5] ## 只使用房间数量这个特征
x.shape
(506,)
y = boston.target
y.shape
(506,)
plt.scatter(x, y)
plt.show()
image.png
np.max(y)
50.0
x = x[y < 50.0]
y = y[y < 50.0]
x.shape
(490,)
y.shape
(490,)
plt.scatter(x, y)
plt.show()
image.png
使用简单线性回归法
from playML.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
x_train.shape
(392,)
y_train.shape
(392,)
x_test.shape
(98,)
y_test.shape
(98,)
from playML.SimpleLinearRegression import SimpleLinearRegression
reg = SimpleLinearRegression()
reg.fit(x_train, y_train)
SimpleLinearRegression()
reg.a_
7.8608543562689555
reg.b_
-27.459342806705543
plt.scatter(x_train, y_train)
plt.plot(x_train, reg.predict(x_train), color='r')
plt.show()
image.png
plt.scatter(x_train, y_train)
plt.scatter(x_test, y_test, color="c")
plt.plot(x_train, reg.predict(x_train), color='r')
plt.show()
image.png
y_predict = reg.predict(x_test)
MSE
mse_test = np.sum((y_predict - y_test)**2) / len(y_test)
mse_test
24.156602134387438
RMSE
from math import sqrt

rmse_test = sqrt(mse_test)
rmse_test
4.914936635846635
MAE
mae_test = np.sum(np.absolute(y_predict - y_test))/len(y_test)
mae_test
3.5430974409463873
封装我们自己的评测函数

代码:

import numpy as np
from math import sqrt


def accuracy_score(y_true, y_predict):
    """计算y_true和y_predict之间的准确率"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum(y_true == y_predict) / len(y_true)


def mean_squared_error(y_true, y_predict):
    """计算y_true和y_predict之间的MSE"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum((y_true - y_predict)**2) / len(y_true)


def root_mean_squared_error(y_true, y_predict):
    """计算y_true和y_predict之间的RMSE"""

    return sqrt(mean_squared_error(y_true, y_predict))


def mean_absolute_error(y_true, y_predict):
    """计算y_true和y_predict之间的MAE"""

    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)
from playML.metrics import mean_squared_error
from playML.metrics import root_mean_squared_error
from playML.metrics import mean_absolute_error
mean_squared_error(y_test, y_predict)
24.156602134387438
root_mean_squared_error(y_test, y_predict)
4.914936635846635
mean_absolute_error(y_test, y_predict)
3.5430974409463873
scikit-learn中的MSE和MAE
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
mean_squared_error(y_test, y_predict)
24.156602134387438
mean_absolute_error(y_test, y_predict)
3.5430974409463873

最好的衡量线性回归法的指标 R Squared

RMSE 和 MAE的局限性
image

可能预测房源准确度,RMSE或者MAE的值为5,预测学生的分数,结果的误差是10,这个5和10没有判断性,因为5和10对应不同的单位和量纲,无法比较

解决办法-R Squared简介

image
R Squared 意义
image

使用BaseLine Model产生的错误会很大,使用我们的模型预测产生的错误会相对少些(因为我们的模型充分的考虑了y和x之间的关系),用这两者相减,结果就是拟合了我们的错误指标,用1减去这个商结果就是我们的模型没有产生错误的指标

image
image

实现 R Squared (R^2)

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()
x = boston.data[:,5] ## 只使用房间数量这个特征
y = boston.target

x = x[y < 50.0]
y = y[y < 50.0]
from playML.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
from playML.SimpleLinearRegression import SimpleLinearRegression

reg = SimpleLinearRegression()
reg.fit(x_train, y_train)
SimpleLinearRegression()
reg.a_
7.8608543562689555
reg.b_
-27.459342806705543
y_predict = reg.predict(x_test)
R Square
from playML.metrics import mean_squared_error

1 - mean_squared_error(y_test, y_predict)/np.var(y_test)
---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

<ipython-input-2-a7a5d5c1ca17> in <module>()
      1 from playML.metrics import mean_squared_error
      2 
----> 3 1 - mean_squared_error(y_test, y_predict)/np.var(y_test)


NameError: name 'y_test' is not defined
封装我们自己的 R Score

代码(playML/metrics.py)

def r2_score(y_true, y_predict):
    """计算y_true和y_predict之间的R Square"""

    return 1 - mean_squared_error(y_true, y_predict)/np.var(y_true)
from playML.metrics import r2_score

r2_score(y_test, y_predict)
0.61293168039373225
scikit-learn中的 r2_score
from sklearn.metrics import r2_score

r2_score(y_test, y_predict)
0.61293168039373236

scikit-learn中的LinearRegression中的score返回r2_score:http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

在我们的SimpleRegression中添加score
import numpy as np
from .metrics import r2_score


class SimpleLinearRegression:

 

    def score(self, x_test, y_test):
        """根据测试数据集 x_test 和 y_test 确定当前模型的准确度"""

        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)



reg.score(x_test, y_test)
0.61293168039373225

多元线性回归

多元线性回归简介和正规方程解

image
image
image
image

补充(矩阵点乘:A(m行)·B(n列) = A的每一行与B的每一列相乘再相加,等到结果是m行n列的)

image

补充(一个1xm的行向量乘以一个mx1的列向量等于一个数)

image
多元线性回归公式推导过程

######## 基础知识






######## 多元线性回归公式推导过程



多元线性回归实现

image

实现我们自己的 Linear Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()

X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]
X.shape
(490, 13)
from playML.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
使用我们自己制作 Linear Regression

代码playML/LinearRegression.py

import numpy as np
from .metrics import r2_score


class LinearRegression:

    def __init__(self):
        """初始化Linear Regression模型"""

        ## 系数向量(θ1,θ2,.....θn)
        self.coef_ = None
        ## 截距 (θ0)
        self.interception_ = None
        ## θ向量
        self._theta = None

    def fit_normal(self, X_train, y_train):
        """根据训练数据集X_train,y_train 训练Linear Regression模型"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        ## np.ones((len(X_train), 1)) 构造一个和X_train 同样行数的,只有一列的全是1的矩阵
        ## np.hstack 拼接矩阵
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        ## X_b.T 获取矩阵的转置
        ## np.linalg.inv() 获取矩阵的逆
        ## dot() 矩阵点乘
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)

        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def predict(self, X_predict):
        """给定待预测数据集X_predict,返回表示X_predict的结果向量"""
        assert self.coef_ is not None and self.interception_ is not None,\
            "must fit before predict"
        assert X_predict.shape[1] == len(self.coef_),\
            "the feature number of X_predict must be equal to X_train"

        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        """根据测试数据集 X_test 和 y_test 确定当前模型的准确度"""

        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

from playML.LinearRegression import LinearRegression

reg = LinearRegression()
reg.fit_normal(X_train, y_train)
LinearRegression()
reg.coef_
array([ -1.18919477e-01,   3.63991462e-02,  -3.56494193e-02,
         5.66737830e-02,  -1.16195486e+01,   3.42022185e+00,
        -2.31470282e-02,  -1.19509560e+00,   2.59339091e-01,
        -1.40112724e-02,  -8.36521175e-01,   7.92283639e-03,
        -3.81966137e-01])
reg.intercept_
34.161435496224712
reg.score(X_test, y_test)
0.81298026026584658

09 scikit-learn中的回归问题

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()

X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]
X.shape
(490, 13)
from playML.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
scikit-learn中的线性回归
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lin_reg.coef_
array([ -1.18919477e-01,   3.63991462e-02,  -3.56494193e-02,
         5.66737830e-02,  -1.16195486e+01,   3.42022185e+00,
        -2.31470282e-02,  -1.19509560e+00,   2.59339091e-01,
        -1.40112724e-02,  -8.36521175e-01,   7.92283639e-03,
        -3.81966137e-01])
lin_reg.intercept_
34.161435496246924
lin_reg.score(X_test, y_test)
0.81298026026584758
kNN Regressor
from sklearn.preprocessing import StandardScaler

standardScaler = StandardScaler()
standardScaler.fit(X_train, y_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)
from sklearn.neighbors import KNeighborsRegressor

knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train_standard, y_train)
knn_reg.score(X_test_standard, y_test)
0.84664511530389497
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1,6)]
    }
]

knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train_standard, y_train)
Fitting 3 folds for each of 60 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:    1.5s finished





GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, {'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4, 5]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
grid_search.best_params_
{'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
grid_search.best_score_
0.79917999890996905
grid_search.best_estimator_.score(X_test_standard, y_test)
0.88099665099417701

10 线性回归参数的可解释性

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()

X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lin_reg.coef_
array([ -1.05574295e-01,   3.52748549e-02,  -4.35179251e-02,
         4.55405227e-01,  -1.24268073e+01,   3.75411229e+00,
        -2.36116881e-02,  -1.21088069e+00,   2.50740082e-01,
        -1.37702943e-02,  -8.38888137e-01,   7.93577159e-03,
        -3.50952134e-01])
np.argsort(lin_reg.coef_)
array([ 4,  7, 10, 12,  0,  2,  6,  9, 11,  1,  8,  3,  5])
boston.feature_names[np.argsort(lin_reg.coef_)]
array(['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'INDUS', 'AGE', 'TAX',
       'B', 'ZN', 'RAD', 'CHAS', 'RM'], 
      dtype='<U7')
print(boston.DESCR)
    Boston House Prices dataset
    ===========================
    
    Notes
    ------
    Data Set Characteristics:  
    
        :Number of Instances: 506 
    
        :Number of Attributes: 13 numeric/categorical predictive
        
        :Median Value (attribute 14) is usually the target
    
        :Attribute Information (in order):
            - CRIM     per capita crime rate by town
            - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
            - INDUS    proportion of non-retail business acres per town
            - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
            - NOX      nitric oxides concentration (parts per 10 million)
            - RM       average number of rooms per dwelling
            - AGE      proportion of owner-occupied units built prior to 1940
            - DIS      weighted distances to five Boston employment centres
            - RAD      index of accessibility to radial highways
            - TAX      full-value property-tax rate per $10,000
            - PTRATIO  pupil-teacher ratio by town
            - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
            - LSTAT    % lower status of the population
            - MEDV     Median value of owner-occupied homes in $1000's
    
        :Missing Attribute Values: None
    
        :Creator: Harrison, D. and Rubinfeld, D.L.
    
    This is a copy of UCI ML housing dataset.
    http://archive.ics.uci.edu/ml/datasets/Housing


​    
    This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
    
    The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
    prices and the demand for clean air', J. Environ. Economics & Management,
    vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
    ...', Wiley, 1980.   N.B. Various transformations are used in the table on
    pages 244-261 of the latter.
    
    The Boston house-price data has been used in many machine learning papers that address regression
    problems.   
         
    **References**
    
       - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
       - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
       - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)


RM对应的是房间数,是正相关最大的特征,也就是说房间数越多,房价越高,这是很合理的
NOX对应的是一氧化氮浓度,也就是说一氧化氮浓度越低,房价越低,这也是非常合理的
由此说明,我们的线性回归具有可解释性,我们可以在对研究一个模型的时候,可以先用线性回归模型看一下,然后根据感性的认识去直观的判断一下是否符合我们的语气

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 216,125评论 6 498
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 92,293评论 3 392
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 162,054评论 0 351
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,077评论 1 291
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,096评论 6 388
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,062评论 1 295
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,988评论 3 417
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,817评论 0 273
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,266评论 1 310
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,486评论 2 331
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,646评论 1 347
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,375评论 5 342
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,974评论 3 325
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,621评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,796评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,642评论 2 368
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,538评论 2 352