Kaggle Diamond Price Prediction

Diamond price prediction: categorical features

No.  Field    Type    Description
1    carat    Float   Weight in carats
2    cut      String  Cut quality grade, 5 categories in increasing order of quality: Fair, Good, Very Good, Premium, Ideal
3    color    String  Color of the diamond, with D being the best and J the worst
4    clarity  String  Clarity grade (how obvious inclusions are within the diamond), 8 categories. The full scale from best to worst (FL = flawless, I3 = level 3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3
5    depth    Float   Depth percentage: diamond height divided by mean diameter, in %
6    table    Float   Table percentage: table width divided by mean diameter, in %
7    price    Int     Price in US dollars
8    x        Float   Length in mm
9    y        Float   Width in mm
10   z        Float   Depth in mm

# Load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import warnings
import warnings

# Suppress warnings
warnings.filterwarnings("ignore")

# Read the data from the CSV file
data = pd.read_csv('diamonds.csv')
print(plt.style.available)   # list all available plot styles
plt.style.use('ggplot')      # use the "ggplot" style
# Inspect the features and the target
data.head()

Exploratory Data Analysis & Data Preprocessing

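  1. The data has no missing values, so no special handling is needed
  2. Most feature values look reasonable
  3. The data contains int, float, and categorical variables; the categorical ones need further processing
  4. The dataset has 53,940 rows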
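
The further processing of the categorical variables mentioned in item 3 is not shown in the original. One possible approach, sketched here as an assumption rather than the author's method, is ordinal encoding that follows the quality order given in the field table. The sample rows are hypothetical, and the 8-grade clarity list is an assumption about which grades appear in the data.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from diamonds.csv
sample = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Fair"],
    "color": ["E", "J", "D"],
    "clarity": ["SI2", "VS1", "IF"],
})

# Grades listed from worst to best, so a larger code means higher quality.
# The clarity list is an assumption (8 grades, as stated in the field table).
orders = {
    "cut": ["Fair", "Good", "Very Good", "Premium", "Ideal"],
    "color": ["J", "I", "H", "G", "F", "E", "D"],
    "clarity": ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"],
}

encoded = sample.copy()
for col, order in orders.items():
    # An ordered Categorical maps each grade to its position in the list
    encoded[col] = pd.Categorical(sample[col], categories=order, ordered=True).codes

print(encoded)
```

A grade that is missing from the list is encoded as -1, which makes unexpected values easy to spot before fitting a model.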

KNN without a train/test split

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
# 'cut' and 'clarity' must already be numerically encoded at this point
x = data[['carat','cut','clarity','depth','table','x','y','z']]
y = data.loc[:,'price']
knn.fit(x,y)
prediction = knn.predict(x)
# format(): formatted output
print('Prediction: {}'.format(prediction))
# x_test and y_test do not exist yet, so score on the data the model was fit on
print('With KNN (K=3) accuracy is: ',knn.score(x,y))

Prediction: [ 326 326 327 ... 2039 2732 2489]

KNN with a 30% test split

from sklearn.model_selection import train_test_split
# Set the features and the target
x = data[['carat','cut','clarity','depth','table','x','y','z']]
y = data.loc[:,'price']
# Split into training and test sets with a fixed random seed (so the split is the same on every run)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3,random_state = 1)
# Set k to 3
knn = KNeighborsClassifier(n_neighbors = 3)
# Fit the model on the training set
knn.fit(x_train,y_train)
# Predict on the test set and report the accuracy
prediction = knn.predict(x_test)
print('Prediction: {}'.format(prediction))
print('With KNN (K=3) accuracy is: ',knn.score(x_test,y_test))

Prediction: [ 449 6321 2131 ... 625 730 4168]
With KNN (K=3) accuracy is: 0.013842541095043875

Hyperparameter tuning

# Model complexity
neig = np.arange(1, 25)
train_accuracy = []
test_accuracy = []
# Loop over K from 1 to 24
for i, k in enumerate(neig):
    # np.arange(1, 25) includes 1 and excludes 25
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit KNN
    knn.fit(x_train,y_train)
    # Accuracy on the training set
    train_accuracy.append(knn.score(x_train, y_train))
    # Accuracy on the test set
    test_accuracy.append(knn.score(x_test, y_test))

# Plot
plt.figure(figsize=[13,8])
plt.plot(neig, test_accuracy, label = 'Testing Accuracy')
plt.plot(neig, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.title('K-value VS Accuracy')
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.xticks(neig)
plt.savefig('graph.png')
plt.show()
print("Best accuracy is {} with K = {}".format(np.max(test_accuracy),1+test_accuracy.index(np.max(test_accuracy))))
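
The loop above picks K by looking at test-set accuracy, so the reported best score is slightly optimistic. A minimal sketch of an alternative, assuming scikit-learn's GridSearchCV and synthetic stand-in data from make_classification (neither is used in the original), selects K by cross-validation on the training set and touches the test set only once:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the diamonds set
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Try K = 1..24 with 5-fold cross-validation on the training set only
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(1, 25))},
    cv=5,
)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
# The test set is used once, after K has been chosen
test_acc = search.score(X_test, y_test)
print("best K:", best_k, "test accuracy:", test_acc)
```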

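As the results show, KNN performs poorly here. The target is a continuous variable, while KNeighborsClassifier treats every distinct price as its own class and is better suited to classification.
Let's try other methods.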
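
To illustrate why exact-match accuracy is so low on a continuous target, here is a small sketch on synthetic data. The carat-to-price formula is made up for illustration, and KNeighborsRegressor is not used in the original; it is scikit-learn's regression variant of KNN and is shown only for contrast.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Synthetic stand-in: integer prices generated from carat plus noise (made-up formula)
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, size=2000)
price = np.round(4000 * carat**2 + rng.normal(0, 300, size=2000)).astype(int)

X = carat.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.3, random_state=1
)

# Classifier: score() is the fraction of test prices predicted exactly
clf_acc = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).score(X_test, y_test)

# Regressor: score() is R^2, which rewards predictions that are close
reg_r2 = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train).score(X_test, y_test)

print("classifier accuracy:", clf_acc)
print("regressor R^2:", reg_r2)
```

The two scores measure different things, so they are not directly comparable; the point is only that accuracy counts a prediction that is off by one dollar as a miss.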

Linear regression

x = np.array(data.loc[:,'carat']).reshape(-1,1)
y = np.array(data.loc[:,'price']).reshape(-1,1)
# Linear regression
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
# Range of carat values to predict over
predict_space = np.linspace(x.min(), x.max()).reshape(-1,1)
# Fit the model to the data
reg.fit(x,y)
# Predict
predicted = reg.predict(predict_space)
# R^2
print('R^2 score: ',reg.score(x, y))
# Plot the regression line and the scatter points
plt.plot(predict_space, predicted, color='black', linewidth=3)
plt.scatter(x=x,y=y)
plt.xlabel('carat')
plt.ylabel('price')
plt.show()

R^2 score: 0.8493305264354857
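
The score reported above is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. As a minimal sketch, the following computes it by hand on synthetic data (the slope, intercept, and noise level are arbitrary illustration values, not fitted to the diamonds set) and compares it with reg.score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for carat and price; the coefficients are arbitrary
rng = np.random.default_rng(0)
x = rng.uniform(0.2, 3.0, size=(200, 1))
y = 7700 * x[:, 0] - 2200 + rng.normal(0, 500, size=200)

reg = LinearRegression().fit(x, y)

# Residual sum of squares: what the model fails to explain
ss_res = np.sum((y - reg.predict(x)) ** 2)
# Total sum of squares: variation around the mean of y
ss_tot = np.sum((y - y.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, reg.score(x, y))
```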

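# Ridge
from sklearn.linear_model import Ridge
# Fixed random seed; random_state=2 gives a different split than random_state=1
# x is still the single carat feature from the linear regression above
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 2, test_size = 0.3)
# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this line needs an older version
ridge = Ridge(alpha = 0.1, normalize = True)
ridge.fit(x_train,y_train)
ridge_predict = ridge.predict(x_test)
print('Ridge score: ',ridge.score(x_test,y_test))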

Ridge score: 0.8415434800632169

# Lasso
from sklearn.linear_model import Lasso
# 'cut' and 'clarity' must already be numerically encoded at this point
x =  data[['carat','cut','clarity','depth','table','x','y','z']]
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 2, test_size = 0.3)
# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this line needs an older version
lasso = Lasso(alpha = 0.1, normalize = True)
lasso.fit(x_train,y_train)
lasso_predict = lasso.predict(x_test)
print('Lasso score: ',lasso.score(x_test,y_test))
print('Lasso coefficients: ',lasso.coef_)

Lasso score: 0.8854378481948613
Lasso coefficients: [8434.64906964 -123.88370107 -350.45873524 -38.29023158 -14.11709282
-36.3572615 -0. -36.2808792 ]
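
The `-0.` entry in the coefficients shows Lasso driving one feature's weight to zero. The `normalize=True` argument used above is no longer available in recent scikit-learn releases. A rough sketch of the commonly suggested substitute, a pipeline with StandardScaler, is shown below on synthetic data. It is not numerically equivalent to the original call (the scaling differs, so `alpha` would need retuning), and the data and coefficients are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales; only the first two affect y
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 1.0])
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=300)

# Scale inside the pipeline so the same scaling is applied at predict time
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X, y)

# Coefficients are on the standardized scale; irrelevant features shrink to 0
coef = model.named_steps["lasso"].coef_
print(coef)
```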

# Random forest
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
print('RandomForest score: ',rf.score(x_test,y_test))

RandomForest score: 0.9365442332739605

# Gradient boosting
from sklearn.ensemble import GradientBoostingRegressor
# Note: loss='ls' was renamed to 'squared_error' in scikit-learn 1.0 and removed in 1.2
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1,
                                random_state=0, loss='ls', verbose=1).fit(x_train, y_train)
y_pred = gbr.predict(x_test)
print('GradientBoosting score: ',gbr.score(x_test,y_test))

  Iter       Train Loss   Remaining Time 
     1    14094461.6429            0.48s
     2    12496608.8546            0.67s
     3    11168569.3479            0.61s
     4     9986874.8068            0.58s
     5     9008825.0389            0.56s
     6     8133660.7414            0.56s
     7     7402916.6391            0.55s
     8     6762929.6866            0.53s
     9     6204219.0082            0.52s
    10     5728951.1243            0.51s
    20     3198851.7843            0.41s
    30     2385567.8147            0.35s
    40     2077353.7183            0.34s
    50     1886477.5667            0.30s
    60     1755080.7615            0.23s
    70     1660608.8724            0.17s
    80     1592460.5433            0.11s
    90     1541833.7987            0.06s
   100     1504583.3182            0.00s

GradientBoosting score: 0.9066881031052523

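Random forest performs best, with a test score of about 0.937, ahead of gradient boosting (0.907), Lasso (0.885), and Ridge on carat alone (0.842).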
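
Each score above comes from a single train/test split, and the random forest was created without a fixed random_state, so the exact numbers will vary between runs. A sketch of a steadier comparison, assuming 5-fold cross-validation and synthetic data from make_regression (neither appears in the original), could look like this:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, not the diamonds set
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Mean R^2 over 5 folds for each model
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

On this synthetic, purely linear data the ranking need not match the one found on the diamonds data; the sketch only shows the comparison procedure.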
