Tianchi Competition -- Industrial Steam Volume Prediction


1. Problem Description

1-1. Background

The basic principle of thermal power generation is: burning fuel heats water into steam, the steam pressure drives a turbine, and the turbine drives a generator to produce electricity. In this chain of energy conversions, the key factor for generation efficiency is the boiler's combustion efficiency, i.e. how effectively the burning fuel heats water into high-temperature, high-pressure steam. Combustion efficiency is influenced by many factors, including adjustable boiler parameters such as fuel feed rate, primary and secondary air, induced draft, recirculation air, and feedwater flow, as well as the boiler's operating conditions, such as bed temperature and bed pressure, furnace temperature and pressure, and superheater temperature.

1-2. Task Description

Given anonymized boiler sensor data (sampled at minute-level frequency), predict the amount of steam produced from the boiler's operating conditions.

1-3. Data Description

The data is split into a training set (train.txt) and a test set (test.txt). The 38 fields "V0" through "V37" are the feature variables and "target" is the target variable. Competitors train a model on the training data and predict the target for the test data; submissions are ranked by the MSE (mean squared error) of the predictions.

1-4. Evaluation

Predictions are judged by mean squared error (MSE).
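
For offline validation, the same metric can be computed locally with scikit-learn; a minimal sketch (the values below are illustrative, not competition data):

from sklearn.metrics import mean_squared_error

# MSE is the mean of the squared residuals between truth and prediction;
# the leaderboard computes the same quantity on the hidden test targets.
y_true = [0.175, 0.676, 0.633]
y_pred = [0.200, 0.650, 0.600]
print(mean_squared_error(y_true, y_pred))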

1-5. Competition URL

URL: https://tianchi.aliyun.com/competition/entrance/231693/information


2. Prediction Workflow

2-1. Loading the Data

  • Code:
from pandas import read_csv

# Load the competition files named in section 1-3; they are plain text,
# assumed here to be tab-separated
train_file_path = 'train.txt'
test_file_path = 'test.txt'
train_data = read_csv(train_file_path, sep='\t')
test_data_X = read_csv(test_file_path, sep='\t')

print(train_data.head())
  • Result:
V0     V1     V2     V3     V4  ...    V34    V35    V36    V37  target
0  0.566  0.016 -0.143  0.407  0.452  ... -4.789 -5.101 -2.608 -3.508   0.175
1  0.968  0.437  0.066  0.566  0.194  ...  0.160  0.364 -0.335 -0.730   0.676
2  1.013  0.568  0.235  0.370  0.112  ...  0.160  0.364  0.765 -0.589   0.633
3  0.733  0.368  0.283  0.165  0.599  ... -0.065  0.364  0.333 -0.112   0.206
4  0.684  0.638  0.260  0.209  0.337  ... -0.215  0.364 -0.280 -0.028   0.384

[5 rows x 39 columns]


2-2. Data Preprocessing (1)

Since none of the columns contain missing values, no missing-value handling is needed. In this post, we mainly compare the feature distributions of the training and test data and drop features whose distributions differ substantially between the two sets.

  • Code:
import seaborn
import matplotlib.pyplot as plt
from pandas import concat

# Separate the features from the target variable
train_data_X = train_data.drop(columns=['target'])
train_data_y = train_data['target']
all_data_X = concat([train_data_X, test_data_X])

# Plot the target distribution, then overlay the train and test
# distributions of each feature to spot features that differ between
# the two sets (distplot is deprecated in recent seaborn versions;
# kdeplot/histplot are its replacements)
seaborn.distplot(train_data_y)
plt.show()
for col in all_data_X.columns:
    seaborn.distplot(train_data_X[col])
    seaborn.distplot(test_data_X[col])
    plt.show()


  • Result: 39 distribution plots in total (the target distribution plus a train-vs-test overlay for each feature); omitted here.

  • Conclusion:
    From the plots (39 in total, so not all are shown), the features 'V5', 'V9', 'V11', 'V17', 'V19', 'V22', 'V28' have clearly different train and test distributions and should be dropped.
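
The visual screening can also be cross-checked numerically. Below is a minimal sketch, assuming scipy is available, that flags features whose train and test distributions differ under a two-sample Kolmogorov-Smirnov test; the 0.05 threshold is an illustrative choice, not part of the original analysis:

from scipy.stats import ks_2samp

# Flag features whose train/test distributions differ significantly;
# with large samples this is a coarse screen, so treat flagged columns
# as candidates for removal rather than a verdict.
suspect_cols = []
for col in train_data_X.columns:
    statistic, p_value = ks_2samp(train_data_X[col], test_data_X[col])
    if p_value < 0.05:
        suspect_cols.append(col)
print(suspect_cols)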


2-3. Data Preprocessing (2)

  • Code:
from sklearn.model_selection import train_test_split

# Drop the features flagged in section 2-2
train_data = train_data.drop(['V5', 'V9', 'V11', 'V17', 'V19', 'V22', 'V28'], axis=1)
test_data_X = test_data_X.drop(['V5', 'V9', 'V11', 'V17', 'V19', 'V22', 'V28'], axis=1)
train_data_X = train_data.drop(columns=['target'])
train_data_y = train_data['target']
# Hold out 20% of the training data for model comparison
X_train, X_test, y_train, y_test = train_test_split(train_data_X, train_data_y, test_size=0.2, random_state=1)


2-4. Model Selection

Try the mainstream regression models and pick one as the final prediction model. Note that score() below reports R^2 on the given data, not the competition's MSE.

  • Code:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor, BaggingRegressor
from pandas import DataFrame

def Linear_Regression(X_train, X_test, y_train, y_test, n_jobs=None):
    """Ordinary least-squares linear regression."""
    model = LinearRegression(n_jobs=n_jobs)
    model.fit(X_train, y_train)
    print("Linear_Regression train score:", model.score(X_train, y_train))
    print("Linear_Regression test score:", model.score(X_test, y_test))
    return model

def KNeighbors_Regressor(X_train, X_test, y_train, y_test, n_neighbors=5, n_jobs=None):
    """K-nearest-neighbors regression."""
    model = KNeighborsRegressor(n_neighbors=n_neighbors, n_jobs=n_jobs)
    model.fit(X_train, y_train)
    print("KNeighbors_Regressor train score:", model.score(X_train, y_train))
    print("KNeighbors_Regressor test score:", model.score(X_test, y_test))
    return model

def Linear_SVR(X_train, X_test, y_train, y_test):
    """Linear support vector regression."""
    model = LinearSVR()
    model.fit(X_train, y_train)
    print("Linear_SVR train score:", model.score(X_train, y_train))
    print("Linear_SVR test score:", model.score(X_test, y_test))
    return model

def DecisionTree_Regressor(X_train, X_test, y_train, y_test):
    """Decision tree regression."""
    model = DecisionTreeRegressor()
    model.fit(X_train, y_train)
    print("DecisionTree_Regressor train score:", model.score(X_train, y_train))
    print("DecisionTree_Regressor test score:", model.score(X_test, y_test))
    return model

def RandomForest_Regressor(X_train, X_test, y_train, y_test, n_estimators=20):
    """Random forest regression."""
    model = RandomForestRegressor(n_estimators=n_estimators)
    model.fit(X_train, y_train)
    print("RandomForest_Regressor train score:", model.score(X_train, y_train))
    print("RandomForest_Regressor test score:", model.score(X_test, y_test))
    return model

def AdaBoost_Regressor(X_train, X_test, y_train, y_test, n_estimators=50):
    """AdaBoost regression."""
    model = AdaBoostRegressor(n_estimators=n_estimators)
    model.fit(X_train, y_train)
    print("AdaBoost_Regressor train score:", model.score(X_train, y_train))
    print("AdaBoost_Regressor test score:", model.score(X_test, y_test))
    return model

def GradientBoosting_Regressor(X_train, X_test, y_train, y_test, n_estimators=100):
    """Gradient-boosted tree regression."""
    model = GradientBoostingRegressor(n_estimators=n_estimators)
    model.fit(X_train, y_train)
    print("GradientBoosting_Regressor train score:", model.score(X_train, y_train))
    print("GradientBoosting_Regressor test score:", model.score(X_test, y_test))
    return model

def Bagging_Regressor(X_train, X_test, y_train, y_test):
    """Bagging regression."""
    model = BaggingRegressor()
    model.fit(X_train, y_train)
    print("Bagging_Regressor train score:", model.score(X_train, y_train))
    print("Bagging_Regressor test score:", model.score(X_test, y_test))
    return model

def ExtraTree_Regressor(X_train, X_test, y_train, y_test):
    """Extremely randomized tree regression."""
    model = ExtraTreeRegressor()
    model.fit(X_train, y_train)
    print("ExtraTree_Regressor train score:", model.score(X_train, y_train))
    print("ExtraTree_Regressor test score:", model.score(X_test, y_test))
    return model


if __name__ == '__main__':
    Linear_Regression(X_train, X_test, y_train, y_test)
    KNeighbors_Regressor(X_train, X_test, y_train, y_test)
    Linear_SVR(X_train, X_test, y_train, y_test)
    DecisionTree_Regressor(X_train, X_test, y_train, y_test)
    RandomForest_Regressor(X_train, X_test, y_train, y_test)
    AdaBoost_Regressor(X_train, X_test, y_train, y_test)
    GradientBoosting_Regressor(X_train, X_test, y_train, y_test)
    Bagging_Regressor(X_train, X_test, y_train, y_test)
    ExtraTree_Regressor(X_train, X_test, y_train, y_test)


  • Result:
Linear_Regression train score: 0.8793754817117719
Linear_Regression test score: 0.9024323511787031

KNeighbors_Regressor train score: 0.873760139567705
KNeighbors_Regressor test score: 0.8203560236180196

Linear_SVR train score: 0.8771322330424864
Linear_SVR test score: 0.9043250479026952

DecisionTree_Regressor train score: 1.0
DecisionTree_Regressor test score: 0.769991117350511

RandomForest_Regressor train score: 0.9769759090722859
RandomForest_Regressor test score: 0.8825221601691665

AdaBoost_Regressor train score: 0.857086400923137
AdaBoost_Regressor test score: 0.8474148168174068

GradientBoosting_Regressor train score: 0.9249193241979118
GradientBoosting_Regressor test score: 0.8904802410675052

Bagging_Regressor train score: 0.9711876503403293
Bagging_Regressor test score: 0.868885350012993

ExtraTree_Regressor train score: 1.0
ExtraTree_Regressor test score: 0.7707929178071851


  • Conclusion:
    Weighing the scores above, this post selects the random forest algorithm as the final prediction model.
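
Since the leaderboard metric is MSE rather than the R^2 that score() reports, it is worth checking the chosen model against MSE on the 20% hold-out split from section 2-3; a minimal sketch reusing the function defined above:

from sklearn.metrics import mean_squared_error

# Evaluate the selected model with the competition metric (MSE)
# on the hold-out split from section 2-3
rf_model = RandomForest_Regressor(X_train, X_test, y_train, y_test)
print("hold-out MSE:", mean_squared_error(y_test, rf_model.predict(X_test)))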


2-5. Simple Hyperparameter Tuning

  • Code:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from pandas import DataFrame

# Grid-search n_estimators with 5-fold cross-validation on the full
# training data, keeping the train scores for comparison
parameters = {'n_estimators': [20, 50, 100, 200, 350, 400, 450, 500, 600, 700]}
model = RandomForestRegressor()
model = GridSearchCV(estimator=model, param_grid=parameters, n_jobs=-1, cv=5, return_train_score=True, verbose=2)
model.fit(train_data_X, train_data_y)
result = DataFrame.from_dict(model.cv_results_)
result.to_csv('cv_result.csv')


  • Result (condensed from cv_result.csv; per-split scores omitted):

n_estimators  mean_fit_time(s)  mean_test_score  std_test_score  rank  mean_train_score
          20              1.27           0.8571          0.0166    10            0.9760
          50              3.24           0.8612          0.0187     9            0.9797
         100              6.14           0.8640          0.0169     8            0.9806
         200             12.38           0.8643          0.0173     6            0.9811
         350             21.63           0.8645          0.0165     3            0.9814
         400             25.03           0.8644          0.0171     5            0.9812
         450             27.64           0.8647          0.0172     2            0.9813
         500             30.87           0.8640          0.0175     7            0.9812
         600             37.01           0.8649          0.0170     1            0.9814
         700             39.33           0.8645          0.0175     4            0.9814


  • Conclusion:
    n_estimators=600 has the highest mean cross-validated test score (rank 1), so it is chosen as the model parameter.
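
Rather than scanning cv_result.csv by hand, the fitted GridSearchCV object also exposes the winning configuration directly; a short sketch:

# After model.fit(...), GridSearchCV stores the best parameter set and
# its mean cross-validated test score
print(model.best_params_)  # expected: {'n_estimators': 600}
print(model.best_score_)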


2-6. Prediction

  • Code:
# Refit on the full (reduced-feature) training set with the tuned
# parameter, then predict the test targets and write the submission file
predict_model = RandomForestRegressor(n_estimators=600, n_jobs=-1).fit(train_data_X, train_data_y)
predict_result = predict_model.predict(test_data_X)
result = DataFrame(predict_result)
result.to_csv("result.txt", index=False, header=False)
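
Before submitting, the expected leaderboard score can be estimated by cross-validating the final configuration on the full training set; a minimal sketch (scikit-learn's scorer returns the negated MSE, hence the sign flip):

from sklearn.model_selection import cross_val_score

# 5-fold estimate of the competition metric (MSE) for the final model;
# the 'neg_mean_squared_error' scorer returns negative MSE
scores = cross_val_score(RandomForestRegressor(n_estimators=600, n_jobs=-1),
                         train_data_X, train_data_y,
                         scoring='neg_mean_squared_error', cv=5)
print("estimated MSE:", -scores.mean())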

