第二章 端对端的机器学习项目 Part II

这篇文章是本人学习 《Hands-On-Machine-Learning-with-Scikit-Learn-and-TensorFlow》的读书笔记第二篇。整理出来是希望在巩固自己的学习效果的同时,希望能够帮助到同样想学习的人。本人也是小白,可能很多地方理解和翻译不是很到位,希望大家多多谅解和提意见。

4. 为机器学习算法准备数据

把特征和目标值分开,方便后续做特征转换。

housing = strat_train_set.drop('median_house_value',axis=1) #删除目标值
housing_labels = strat_train_set['median_house_value'].copy()

数据清洗

total_bedrooms 属性中存在缺失值,缺失值的处理:

  • 删除有缺失值的数据点
  • 删除整个 total_bedrooms 属性
  • 用值来填充缺失值(0,平均数,中位数等)
housing.dropna(subset = ['total_bedrooms']) #option1
housing.drop('total_bedrooms',axis=1) #option2
median = housing['total_bedrooms'].median()
housing['total_bedrooms'].fillna(median) #option3

使用第三种方法来填充缺失值时,在测试集上也应该使用同样的中位数值填充缺失值。使用 Scikit-Learn 的 Imputer来实现缺失值的填充。

try:
    from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
    from sklearn.preprocessing import Imputer as SimpleImputer
    #create an imputer instances
imputer = SimpleImputer(strategy='median') #specify median method

housing_num = housing.drop("ocean_proximity", axis=1) #drop non-numerical attribute
imputer.fit(housing_num) #fit the imputer instance to the training data
X = imputer.transform(housing_num) #replacing missing values with learned medians
housing_tr = pd.DataFrame(X, columns=housing_num.columns,index=housing.index) #convert Numpy arrays into pandas dataframe

文本和类别数据的处理

使用 Scikit-Learn 的 OrdinalEncoder 将文本数据转变为数值型数据。

try:
    from sklearn.preprocessing import OrdinalEncoder
except ImportError:
    from future_encoders import OrdinalEncoder # Scikit-Learn < 0.20

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

ordinal_encoder.categories_
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

Scikit-Learn 中提供 OneHotEncoder 编码可以将字符型的类别变量转换成独热编码的向量。

try:
    from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20
    from sklearn.preprocessing import OneHotEncoder
except ImportError:
    from future_encoders import OneHotEncoder # Scikit-Learn < 0.20

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
    with 16512 stored elements in Compressed Sparse Row format>

得到的 housing_cat_1hot 是一个SciPy格式的稀疏矩阵而不是一个 NumPy的数组,可以使用 toarray()的方法将它转换为稠密的 Numpy 数组。

housing_cat_1hot.toarray()

自定义转换器

使用 Scikit-Learn的 FunctionTransformer类可以基于转换函数构建转换器。

from sklearn.preprocessing import FunctionTransformer

def add_extra_features(X, add_bedrooms_per_room=True):
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = X[:, population_ix] / X[:, household_ix]
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household,
                     bedrooms_per_room]
    else:
        return np.c_[X, rooms_per_household, population_per_household]

attr_adder = FunctionTransformer(add_extra_features, validate=False,
                                kw_args={'add_bedrooms_per_room':False}) 
housing_extra_attribs = attr_adder.fit_transform(housing.values)
Figure 1:Housing_extra_attribs

特征缩放

机器学习算法的效果不会太好,当各特征的取值在不同范围时,我们的数据集中total_rooms 的特征取值为[6,39320],median_income的取值为[0,15]。注意,一般目标值是不需要做特征缩放的。常见的特征缩放的方法有最小-最大值缩放,标准化。

  • Min-max Scaling:(num - min)/ (max - min),使数据缩放到(0,1)。Scikit-Learn 中提供了 MinMaxScaler可实现该功能。
  • Standardization:(num - mean)/ variance。不像 Min-max Scaling把数据缩放到0-1的范围,标准化将数据缩放到0均值,单位方差。这对于像神经网络这种希望输入范围在0-1之间的模型来说,标准化可能不是最佳。但是标准化对异常值不敏感,假如median_income中有个错误值为100,则 min-max scaling会将数据缩放到(0,0.15)的范围,而影响数据整体的分布。Scikit-Learn 中提供了StandardScaler实现标准化缩放。

Transformation Pipelines

Scikit-Learn 中提供了 Pipeline类来完成转换序列,使得程序能够按顺序执行每个转换。

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer',SimpleImputer(strategy='median')),
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

同样,我们也可以为类别型的变量设置 transformer pipeline。Scikit-Learn 中提供了ColumnTransformer 类整合数值型和类别型的转换器。传入一个转换器的列表,当需要用到 fit()或者transform()的方法时,它并行地运行各个转换器的 fit()或者transform()方法,然后等待他们的结果并最终把他们合并到一起输出。

try:
    from sklearn.compose import ColumnTransformer
except ImportError:
    from future_encoders import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

full_pipeline = ColumnTransformer([
        ('num', num_pipeline, num_attribs),
        ('cat', OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

5 在训练集上进行训练和验证

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

在部分数据上查看预测效果

# try it out on some training instances
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data) #data transformation
print('Predictions:\t\t', lin_reg.predict(some_data_prepared))

使用 Scikit-Learn 中的 mean_squared_error函数,计算 RMSE。

# calculate the mean_squared_error
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels,housing_predictions)
lin_rmse = np.sqrt(lin_mse)

使用 Scikit-Learn 中的 mean_absolute_error函数,计算 MAE。

# calculate the mean_absolute_error
from sklearn.metrics import mean_absolute_error
lin_mae = mean_absolute_error(housing_labels, housing_predictions)

计算后发现 RMSE和 MAE都很多,考虑可能是模型欠拟合,使用决策树算法对数据进行拟合。

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_mse = np.sqrt(tree_mse)

此时计算出来的 tree_mse 为 0。很明显,决策树算法过拟合了。

6 微调模型

计算交叉验证的得分

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                        scoring='neg_mean_squared_error', cv=10)
tree_rmse_scores = np.sqrt(-scores)
Figure 2:Cross validation score for decision tree

计算线性回归的交叉验证得分。

# cross validation scores for linear regression
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

使用随机森林来作预测,同时计算其交叉验证得分。

# choose Random Forest as a regressor
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)

# calculate the mean_squared_error for Random Forest Regressor
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)

forest_scores = cross_val_score(forest_reg, housing_prepared,housing_labels,
                               scoring='neg_mean_squared_error',cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

使用线性核的SVM作为分类器,并计算其 RMSE。

from sklearn.svm import SVR

svm_reg = SVR(kernel='linear')
svm_reg.fit(housing_prepared, housing_labels)
housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_labels, housing_predictions)
svm_rmse = np.sqrt(svm_mse)

使用 Scikit-Learn 的 GridSearchCV来帮助选择参数

from sklearn.model_selection import GridSearchCV

param_grid = [
   # try 12 (3×4) combinations of hyperparameters
    {'n_estimators':[3,10,30],'max_features':[2,4,6,8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap':[False],'n_estimators':[3,10],'max_features':[2,3,4]},
]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                          scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

输出最佳的参数组合和最优的估计参数。


Figure 3: Best params and best estimator
# look at the score of each hyperparameter combination tested during the grid search
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres['mean_test_score'],cvres['params']):
    print(np.sqrt(-mean_score),params)

# 以 dataframe 的方式显示结果
pd.DataFrame(grid_search.cv_results_)

使用随机搜索来进行参数选择。

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                               n_iter=10, cv=5, scoring='neg_mean_squared_error',
                               random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
Figure 4:Random search result

输出每个属性值对于正确预测的相对重要程度。

feature_importances = grid_search.best_estimator_.feature_importances_

extra_attribs = ['rooms_per_hhold','pop_per_hhold', 'bedrooms_per_room']
cat_encoder = full_pipeline.named_transformers_['cat']
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
Figure 5: Relative importance of each attribute

有了这个信息,可以考虑删除点一些不是很重要的变量。

在测试集上评估系统性能

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop('median_house_value', axis=1)
y_test = strat_test_set['median_house_value'].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

计算测试集的 RMSE 95%的置信区间。

# we can compute a 95% confidence interval for the test RMSE
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
mean = squared_errors.mean()
m = len(squared_errors)

np.sqrt(stats.t.interval(confidence, m-1,loc=np.mean(squared_errors),
                        scale=stats.sem(squared_errors)))
# we could also compute the interval manually like this
tscore = stats.t.ppf((1 + confidence)/2, df=m-1)
tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)
# Alternatively, we could use a z-scores rather than t-scores
zscore = stats.norm.ppf((1 + confidence) / 2)
zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)

整合数据准备和预测的Pipeline

full_pipeline_with_predictor = Pipeline([
    ('preparation', full_pipeline),
    ('linear', LinearRegression())
])

full_pipeline_with_predictor.fit(housing, housing_labels)
full_pipeline_with_predictor.predict(some_data)

使用 joblib保存模型

my_model = full_pipeline_with_predictor

from sklearn.externals import joblib
joblib.dump(my_model, 'my_model.pkl') #save model

my_model_loaded = joblib.load('my_model.pkl') #load model

7 上线、监督、维护你的系统

  • 需要编写程序监督你的系统运行,当性能出问题时应及时预警。
  • 评估系统性能时需要对系统的预测进行抽样,评估是否准确,可能需要人为的分析。
  • 时常评估系统输入的数据质量。
  • 定期使用新数据重新训练模型。

8 练习题的解答

  1. Question: 构建一个SVM回归算法,尝试使用多种参数,比如 kernel='linear'(C有多种值)或者 kernel='rbf'(C和gamma有多种值)。
from sklearn.model_selection import GridSearchCV

param_grid = [
        {'kernel': ['linear'], 'C': [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},
        {'kernel': ['rbf'], 'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
         'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
    ]

svm_reg = SVR()
grid_search = GridSearchCV(svm_reg, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
grid_search.fit(housing_prepared, housing_labels)

negative_mse = grid_search.best_score_
rmse = np.sqrt(-negative_mse)
  1. Question: 使用RandomizedSearchCV代替GridSearchCV 。
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import expon, reciprocal

param_distribs = {
        'kernel': ['linear', 'rbf'],
        'C': reciprocal(20, 200000),
        'gamma': expon(scale=1.0),
    }

svm_reg = SVR()
rnd_search = RandomizedSearchCV(svm_reg, param_distributions=param_distribs,
                                n_iter=50, cv=5, scoring='neg_mean_squared_error',
                                verbose=2, n_jobs=4, random_state=42)
rnd_search.fit(housing_prepared, housing_labels)

negative_mse = rnd_search.best_score_
rmse = np.sqrt(-negative_mse)
  1. Question:在数据处理的Pipeline中加入转换器用来选择最重要的特征
from sklearn.base import BaseEstimator, TransformerMixin

# np.argpartition(arr,k) 将数组arr中所有元素(包括重复元素)从小到大排列,比第k大的元素
# 小的放在前面,大的放在后面,输出新数组索引
def indices_of_top_k(arr, k):  
    return np.sort(np.argpartition(np.array(arr),-k)[-k:])

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k
        
    def fit(self, X, y=None):
        self.feature_indices = indices_of_top_k(self.feature_importances, self.k)
        return self
    
    def transform(self,X):
        return X[:, self.feature_indices]

找到最大的 k 个特征对应的序号

k = 5
top_k_feature_indices = indices_of_top_k(feature_importances, k)
top_k_feature_indices
Figure 6:Top k features
preparation_and_feature_selection_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importances, k))
])

housing_prepared_top_k_features = preparation_and_feature_selection_pipeline.fit_transform(housing)
Figure 7:Features of the first 3 instances
  1. Question:创建一个Pipeline完成所有的数据处理过程和最后的预测。
prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importances, k)),
    ('svm_reg', SVR(**rnd_search.best_params_))
])

prepare_select_and_predict_pipeline.fit(housing, housing_labels)
Figure 8:Prediction on several instances
  1. Question:使用GridSearchCV自动发现一些数据处理的方法。
param_grid = [{
    'preparation__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
    'feature_selection__k': list(range(1, len(feature_importances) + 1))
}]

grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=5,
                                scoring='neg_mean_squared_error', verbose=2, n_jobs=4)

grid_search_prep.fit(housing, housing_labels)

程序

我把书中的程序都用 Python 3运行了一遍,确保没有Bug并且都加了注释,方便大家理解。原书的数据集和代码在这个网站上,我自己运行的程序在我的GitHub上。

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 215,539评论 6 497
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,911评论 3 391
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 161,337评论 0 351
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,723评论 1 290
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,795评论 6 388
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,762评论 1 294
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,742评论 3 416
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,508评论 0 271
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,954评论 1 308
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,247评论 2 331
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,404评论 1 345
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,104评论 5 340
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,736评论 3 324
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,352评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,557评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,371评论 2 368
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,292评论 2 352

推荐阅读更多精彩内容