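Dataset link: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
The dataset description is available on that page, so I won't repeat it here~
I predict the rental count with linear regression, a decision tree, and a random forest, and compare which model predicts most accurately.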
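When predicting with the random forest, if training time is not a big concern, n_estimators can be set somewhat larger, anywhere up to about 200. Accuracy as a function of the number of trees behaves roughly like a logarithm: it improves quickly at first and then flattens out, so extra trees mostly cost time rather than hurt accuracy.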
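To see the flattening effect for yourself, a small sweep over n_estimators can help. The sketch below is only an illustration: it uses synthetic data from make_regression rather than the bike data, and the helper name sweep_n_estimators and the chosen forest sizes are my own assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the bike data)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def sweep_n_estimators(values):
    """Return {n_estimators: test MSE} for each forest size in values."""
    results = {}
    for n in values:
        model = RandomForestRegressor(
            n_estimators=n, min_samples_leaf=2, random_state=0
        )
        model.fit(X_train, y_train)
        results[n] = mean_squared_error(y_test, model.predict(X_test))
    return results


mse_by_size = sweep_n_estimators([1, 10, 50, 200])
for n, mse in mse_by_size.items():
    print(n, round(mse, 1))
```

Running the same sweep on the real train/test split should show where additional trees stop paying off for this dataset.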
Code:
Read the CSV file
import pandas as pd
import matplotlib.pyplot as plt
bike_rentals=pd.read_csv('./data/hour.csv')
#plt.hist(bike_rentals['cnt'])
#plt.show()
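# Correlate numeric columns only: 'dteday' is a date string and makes corr() raise an error in pandas >= 2.0
cnt_correlations=bike_rentals.select_dtypes(include='number').corr()['cnt']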
print("\n Reading success! cnt-correlations:\n")
print(cnt_correlations)
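The printed correlations are easier to read when ranked by strength. Here is a minimal sketch of that idea; the rows are made up for illustration and only the column names mirror hour.csv.

```python
import pandas as pd

# Made-up rows for illustration; only the column names mirror hour.csv
sample = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-01", "2011-01-02", "2011-01-02"],
    "hr": [0, 12, 6, 18],
    "temp": [0.24, 0.40, 0.22, 0.46],
    "hum": [0.81, 0.50, 0.80, 0.45],
    "cnt": [16, 84, 2, 110],
})

# Keep numeric columns, correlate with 'cnt', drop the self-correlation,
# and order by absolute strength so strong negative features rank high too
ranked = (
    sample.select_dtypes(include="number")
    .corr()["cnt"]
    .drop("cnt")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranked)
```

Sorting by absolute value matters because a strongly negative correlation (humidity here) is just as informative as a positive one.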
Process the data, build the models, and predict
# Assumes the script above was saved as read_file.py next to this one
import read_file
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
bike_rentals=read_file.bike_rentals
# Bucket the 'hr' column (0-23) into four time-of-day labels
def assign_label(hour):
    if 0 <= hour < 6:
        return 4  # night
    elif 6 <= hour < 12:
        return 1  # morning
    elif 12 <= hour < 18:
        return 2  # afternoon
    elif 18 <= hour <= 24:
        return 3  # evening
bike_rentals['time_labels']=bike_rentals['hr'].apply(assign_label)
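# Split the data: a random 80% for training, the remaining 20% for testing
train=bike_rentals.sample(frac=.8)
# .copy() so a 'predictions' column can be added later without a SettingWithCopyWarning
test=bike_rentals.loc[~bike_rentals.index.isin(train.index)].copy()
# Drop the target, the columns that leak it ('casual' + 'registered' = 'cnt'), and the non-numeric 'dteday'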
columns=list(bike_rentals.columns)
columns.remove('cnt')
columns.remove('casual')
columns.remove('dteday')
columns.remove('registered')
print("\n===========>>>>>>Predicting:\n")
# Predict the target column, using MSE as the metric.
# LinearRegression
model=LinearRegression()
model.fit(train[columns],train['cnt'])
predictions=model.predict(test[columns])
mse=mean_squared_error(test['cnt'],predictions)
print("MSE using LinearRegression: ",end='')
print(mse,'\n')
#DecisionTreeRegression
model=DecisionTreeRegressor(min_samples_leaf=5)
model.fit(train[columns],train['cnt'])
predictions=model.predict(test[columns])
mse=mean_squared_error(test['cnt'],predictions)
print("MSE using DecisionTreeRegression: ",end='')
print(mse,'\n')
# RandomForestRegression
model=RandomForestRegressor(n_estimators=50,min_samples_leaf=2)
model.fit(train[columns],train['cnt'])
predictions=model.predict(test[columns])
mse=mean_squared_error(test['cnt'],predictions)
test['predictions']=predictions
print("MSE using RandomForestRegression: ",end='')
print(mse,'\n')
print(test.iloc[:10][['cnt','predictions']])
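As a side note, the hand-written assign_label function above could also be expressed with pandas.cut. This is only a sketch of an alternative, not what the script uses; the tiny hours frame is made up to show the bin edges, and it assumes hr stays in 0-23 as in the dataset (a value of 24 would fall outside the last bin).

```python
import pandas as pd

# Made-up hours covering both edges of every bucket
hours = pd.DataFrame({"hr": [0, 5, 6, 11, 12, 17, 18, 23]})

# right=False makes each bin closed on the left:
# [0, 6) -> 4, [6, 12) -> 1, [12, 18) -> 2, [18, 24) -> 3
hours["time_labels"] = pd.cut(
    hours["hr"],
    bins=[0, 6, 12, 18, 24],
    labels=[4, 1, 2, 3],
    right=False,
).astype(int)
print(hours)
```

The vectorized version avoids calling a Python function once per row, which may be noticeable on larger frames.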
Results: