1. Problem Description
1-1. Background
Thermal power generation works as follows: burning fuel heats water into steam, the steam pressure drives a turbine, and the turbine in turn drives a generator that produces electricity. In this chain of energy conversions, the key factor for generation efficiency is the boiler's combustion efficiency, i.e., how effectively the burning fuel heats water into high-temperature, high-pressure steam. Combustion efficiency depends on many factors, including the boiler's adjustable parameters (fuel feed rate, primary and secondary air, induced draft, return-feed air, feed-water volume) and the boiler's operating conditions (bed temperature and pressure, furnace temperature and pressure, superheater temperature, and so on).
1-2. Task Description
Given anonymized boiler sensor data (collected at minute-level frequency), predict the amount of steam produced from the boiler's operating conditions.
1-3. Data Description
The data is split into a training set (train.txt) and a test set (test.txt). The 38 fields "V0"-"V37" are the feature variables and "target" is the target variable. Contestants train a model on the training data and predict the target variable for the test data; the ranking is based on the MSE (mean squared error) of the predictions.
1-4. Evaluation
Predictions are scored by mean squared error (MSE).
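For local validation the same metric is available in scikit-learn; a minimal sketch (the y_true/y_pred values here are hypothetical placeholders, not competition data):
from sklearn.metrics import mean_squared_error
# hypothetical held-out targets and model predictions
y_true = [0.175, 0.676, 0.633]
y_pred = [0.20, 0.65, 0.60]
# MSE is the mean of the squared residuals
print(mean_squared_error(y_true, y_pred))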
1-5. Competition URL
URL: https://tianchi.aliyun.com/competition/entrance/231693/information
2. Prediction Pipeline
2-1. Loading the Data
- Code:
from pandas import read_csv
# the competition files are tab-separated .txt files, not Excel files
train_file_path = 'train.txt'
test_file_path = 'test.txt'
train_data = read_csv(train_file_path, sep='\t')
test_data_X = read_csv(test_file_path, sep='\t')
print(train_data.head())
- Result:
V0 V1 V2 V3 V4 ... V34 V35 V36 V37 target
0 0.566 0.016 -0.143 0.407 0.452 ... -4.789 -5.101 -2.608 -3.508 0.175
1 0.968 0.437 0.066 0.566 0.194 ... 0.160 0.364 -0.335 -0.730 0.676
2 1.013 0.568 0.235 0.370 0.112 ... 0.160 0.364 0.765 -0.589 0.633
3 0.733 0.368 0.283 0.165 0.599 ... -0.065 0.364 0.333 -0.112 0.206
4 0.684 0.638 0.260 0.209 0.337 ... -0.215 0.364 -0.280 -0.028 0.384
[5 rows x 39 columns]
2-2. Data Preprocessing (1)
None of the columns contain missing values, so no missing-value handling is needed. This section compares the feature distributions of the training and test data and excludes features that differ substantially between the two sets.
- Code:
import seaborn
import matplotlib.pyplot as plt
from pandas import concat
# separate the features from the target
train_data_X = train_data.drop(columns=['target'])
train_data_y = train_data['target']
all_data_X = concat([train_data_X, test_data_X])
# plot the target distribution, then overlay the train and test
# distributions of each feature to spot features that differ
seaborn.distplot(train_data_y)
plt.show()
for col in all_data_X.columns:
    seaborn.distplot(train_data_X[col])
    seaborn.distplot(test_data_X[col])
    plt.show()
- Result: one distribution plot for the target, plus one train-vs-test overlay plot per feature (39 figures in total, omitted here).
Conclusion:
From the plots (39 in total, so not all are shown), the features 'V5', 'V9', 'V11', 'V17', 'V19', 'V22', 'V28' have clearly different distributions in the training and test data and should be dropped.
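As a quantitative complement to the visual inspection (an addition, not part of the original workflow), the train/test distribution shift per feature could be ranked with a two-sample Kolmogorov-Smirnov test; a sketch with an assumed significance threshold:
from scipy.stats import ks_2samp
# flag features whose train and test distributions differ significantly
for col in train_data_X.columns:
    stat, p_value = ks_2samp(train_data_X[col], test_data_X[col])
    if p_value < 0.01:  # assumed threshold; tune to taste
        print(col, 'differs between train and test (KS statistic = %.3f)' % stat)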
2-3. Data Preprocessing (2)
Drop the flagged features from both datasets, then hold out 20% of the training data for model comparison.
- Code:
from sklearn.model_selection import train_test_split
train_data = train_data.drop(['V5', 'V9', 'V11', 'V17', 'V19', 'V22', 'V28'], axis=1)
test_data_X = test_data_X.drop(['V5', 'V9', 'V11', 'V17', 'V19', 'V22', 'V28'], axis=1)
train_data_X = train_data.drop(columns=['target'])
train_data_y = train_data['target']
X_train, X_test, y_train, y_test = train_test_split(train_data_X, train_data_y, test_size=0.2, random_state=1)
2-4. Model Selection
Benchmark mainstream regression models and pick one as the final predictor. Note that model.score reports R² on the given data, not the competition's MSE.
- Code:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor, BaggingRegressor
from pandas import DataFrame
def Linear_Regression(X_train, X_test, y_train, y_test, n_jobs=None):
    """Ordinary least-squares linear regression."""
    model = LinearRegression(n_jobs=n_jobs)
    model.fit(X_train, y_train)
    print("Linear_Regression train score:", model.score(X_train, y_train))
    print("Linear_Regression test score:", model.score(X_test, y_test))
    return model

def KNeighbors_Regressor(X_train, X_test, y_train, y_test, n_neighbors=5, n_jobs=None):
    """K-nearest-neighbors regression."""
    model = KNeighborsRegressor(n_neighbors=n_neighbors, n_jobs=n_jobs)
    model.fit(X_train, y_train)
    print("KNeighbors_Regressor train score:", model.score(X_train, y_train))
    print("KNeighbors_Regressor test score:", model.score(X_test, y_test))
    return model

def Linear_SVR(X_train, X_test, y_train, y_test):
    """Linear support vector regression."""
    model = LinearSVR()
    model.fit(X_train, y_train)
    print("Linear_SVR train score:", model.score(X_train, y_train))
    print("Linear_SVR test score:", model.score(X_test, y_test))
    return model

def DecisionTree_Regressor(X_train, X_test, y_train, y_test):
    """Decision tree regression."""
    model = DecisionTreeRegressor()
    model.fit(X_train, y_train)
    print("DecisionTree_Regressor train score:", model.score(X_train, y_train))
    print("DecisionTree_Regressor test score:", model.score(X_test, y_test))
    return model

def RandomForest_Regressor(X_train, X_test, y_train, y_test, n_estimators=20):
    """Random forest regression."""
    model = RandomForestRegressor(n_estimators=n_estimators)
    model.fit(X_train, y_train)
    print("RandomForest_Regressor train score:", model.score(X_train, y_train))
    print("RandomForest_Regressor test score:", model.score(X_test, y_test))
    return model

def AdaBoost_Regressor(X_train, X_test, y_train, y_test, n_estimators=50):
    """AdaBoost regression."""
    model = AdaBoostRegressor(n_estimators=n_estimators)
    model.fit(X_train, y_train)
    print("AdaBoost_Regressor train score:", model.score(X_train, y_train))
    print("AdaBoost_Regressor test score:", model.score(X_test, y_test))
    return model

def GradientBoosting_Regressor(X_train, X_test, y_train, y_test, n_estimators=100):
    """Gradient-boosted tree regression."""
    model = GradientBoostingRegressor(n_estimators=n_estimators)
    model.fit(X_train, y_train)
    print("GradientBoosting_Regressor train score:", model.score(X_train, y_train))
    print("GradientBoosting_Regressor test score:", model.score(X_test, y_test))
    return model

def Bagging_Regressor(X_train, X_test, y_train, y_test):
    """Bagging regression."""
    model = BaggingRegressor()
    model.fit(X_train, y_train)
    print("Bagging_Regressor train score:", model.score(X_train, y_train))
    print("Bagging_Regressor test score:", model.score(X_test, y_test))
    return model

def ExtraTree_Regressor(X_train, X_test, y_train, y_test):
    """Extremely randomized tree regression."""
    model = ExtraTreeRegressor()
    model.fit(X_train, y_train)
    print("ExtraTree_Regressor train score:", model.score(X_train, y_train))
    print("ExtraTree_Regressor test score:", model.score(X_test, y_test))
    return model

if __name__ == '__main__':
    Linear_Regression(X_train, X_test, y_train, y_test)
    KNeighbors_Regressor(X_train, X_test, y_train, y_test)
    Linear_SVR(X_train, X_test, y_train, y_test)
    DecisionTree_Regressor(X_train, X_test, y_train, y_test)
    RandomForest_Regressor(X_train, X_test, y_train, y_test)
    AdaBoost_Regressor(X_train, X_test, y_train, y_test)
    GradientBoosting_Regressor(X_train, X_test, y_train, y_test)
    Bagging_Regressor(X_train, X_test, y_train, y_test)
    ExtraTree_Regressor(X_train, X_test, y_train, y_test)
- Result:
Linear_Regression train score: 0.8793754817117719
Linear_Regression test score: 0.9024323511787031
KNeighbors_Regressor train score: 0.873760139567705
KNeighbors_Regressor test score: 0.8203560236180196
Linear_SVR train score: 0.8771322330424864
Linear_SVR test score: 0.9043250479026952
DecisionTree_Regressor train score: 1.0
DecisionTree_Regressor test score: 0.769991117350511
RandomForest_Regressor train score: 0.9769759090722859
RandomForest_Regressor test score: 0.8825221601691665
AdaBoost_Regressor train score: 0.857086400923137
AdaBoost_Regressor test score: 0.8474148168174068
GradientBoosting_Regressor train score: 0.9249193241979118
GradientBoosting_Regressor test score: 0.8904802410675052
Bagging_Regressor train score: 0.9711876503403293
Bagging_Regressor test score: 0.868885350012993
ExtraTree_Regressor train score: 1.0
ExtraTree_Regressor test score: 0.7707929178071851
Conclusion:
Weighing these scores (the single-tree models clearly overfit the training set, while the random forest is the strongest of the tunable ensembles on the held-out split), this article selects the random forest as the final prediction algorithm.
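Since the leaderboard metric is MSE rather than R², it can be worth re-checking the chosen model against that metric directly; a minimal sketch with cross_val_score (the estimator settings and fold count here are illustrative assumptions):
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
# scikit-learn maximizes scores, so MSE is exposed as its negative
scores = cross_val_score(RandomForestRegressor(n_estimators=20), train_data_X, train_data_y,
                         scoring='neg_mean_squared_error', cv=5)
print('CV MSE:', -scores.mean())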
2-5. Simple Parameter Tuning
- Code:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from pandas import DataFrame
# grid-search n_estimators with 5-fold cross-validation on the full training set
parameters = {'n_estimators': [20, 50, 100, 200, 350, 400, 450, 500, 600, 700]}
model = RandomForestRegressor()
model = GridSearchCV(estimator=model, param_grid=parameters, n_jobs=-1, cv=5, return_train_score=True, verbose=2)
model.fit(train_data_X, train_data_y)
result = DataFrame.from_dict(model.cv_results_)
with open('cv_result.csv', 'w') as f:
    result.to_csv(f)
- Result (key columns of cv_result.csv):
n_estimators  mean_fit_time(s)  mean_test_score  std_test_score  mean_train_score  rank_test_score
          20              1.27          0.85709         0.01665           0.97604               10
          50              3.24          0.86119         0.01873           0.97973                9
         100              6.14          0.86398         0.01693           0.98064                8
         200             12.38          0.86430         0.01735           0.98107                6
         350             21.63          0.86448         0.01652           0.98137                3
         400             25.03          0.86444         0.01706           0.98121                5
         450             27.64          0.86466         0.01717           0.98132                2
         500             30.87          0.86399         0.01752           0.98125                7
         600             37.01          0.86485         0.01702           0.98137                1
         700             39.33          0.86448         0.01750           0.98137                4
(scores rounded to five decimal places; per-split and timing-variance columns omitted)
Conclusion:
n_estimators=600 achieves the best mean cross-validation score (rank 1), so it is chosen as the model's parameter.
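Alternatively, instead of reading the CSV by eye, the fitted GridSearchCV object exposes the winning setting directly; a short sketch (assuming `model` is the fitted search object from above):
# best parameter setting and its mean cross-validated score
print(model.best_params_)  # e.g. {'n_estimators': 600}
print(model.best_score_)   # mean test score achieved by best_params_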
2-6. Prediction
- Code:
# retrain on the full (filtered) training set with the tuned parameter, then predict the test set
predict_model = RandomForestRegressor(n_estimators=600, n_jobs=-1).fit(train_data_X, train_data_y)
predict_result = predict_model.predict(test_data_X)
# write one prediction per line, without index or header, for submission
result = DataFrame(predict_result)
result.to_csv("result.txt", index=False, header=False)
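A quick sanity check before submitting (illustrative only): the file must contain exactly one prediction per test row.
# the number of predictions must match the number of test samples
assert len(predict_result) == len(test_data_X)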