Kaggle Diamond Price Prediction

Diamond price prediction: categorical features

No. | Field | Type | Description
1 | carat | Float | Carat weight
2 | cut | String | Cut quality rating, 5 categories in increasing order of quality: Fair, Good, Very Good, Premium, Ideal
3 | color | String | Color of the diamond, with D being the best and J the worst
4 | clarity | String | Clarity rating (how obvious the inclusions are within the diamond), 8 categories in this data. The grading scale from best to worst is FL (flawless), IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3 (level 3 inclusions)
5 | depth | Float | Depth percentage: diamond height divided by average diameter, in %
6 | table | Float | Table percentage: table width divided by average diameter, in %
7 | price | Int | Price, in US dollars
8 | x | Float | Length, in mm
9 | y | Float | Width, in mm
10 | z | Float | Depth, in mm

# Load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the warnings module
import warnings

# Suppress warnings
warnings.filterwarnings("ignore")

# Read the data from the CSV file
data = pd.read_csv('diamonds.csv')
print(plt.style.available)   # list all available plot styles
plt.style.use('ggplot')      # use the "ggplot" style

# Inspect the features and the target
data.head()

Exploratory Data Analysis & Data Preprocessing

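The data has no missing values, so no special handling is needed.
Most of the features look reasonable.
The data contains int, float and categorical variables; the categorical ones need further processing.
The data set has 53940 rows.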
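The missing-value check and the further processing of the categorical columns could look like the sketch below. The tiny inline DataFrame only stands in for diamonds.csv, and the ordinal mappings (`cut_order`, `clarity_order`) are assumptions derived from the quality orderings in the field table; they are not necessarily the encoding the original notebook used.

```python
import pandas as pd

# A tiny stand-in for diamonds.csv (illustrative rows only)
sample = pd.DataFrame({
    "carat": [0.23, 0.31, 0.90],
    "cut": ["Ideal", "Good", "Premium"],
    "clarity": ["SI2", "VS1", "IF"],
    "price": [326, 335, 2757],
})

# Count missing values per column
missing = sample.isnull().sum()

# Quality scales ordered from worst to best (assumed ordinal encoding)
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
clarity_order = ["I3", "I2", "I1", "SI2", "SI1", "VS2",
                 "VS1", "VVS2", "VVS1", "IF", "FL"]

# Replace each category with its rank on the scale
encoded = sample.copy()
encoded["cut"] = encoded["cut"].map({name: i for i, name in enumerate(cut_order)})
encoded["clarity"] = encoded["clarity"].map({name: i for i, name in enumerate(clarity_order)})

print(missing)
print(encoded)
```

An ordinal encoding keeps the "better than" relationship between grades, which distance-based and linear models can use; one-hot encoding would be an alternative if that ordering should not be assumed.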

KNN without a train/test split

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
# Features and target ('cut' and 'clarity' must already be numerically encoded)
x = data[['carat','cut','clarity','depth','table','x','y','z']]
y = data.loc[:,'price']
knn.fit(x,y)
prediction = knn.predict(x)
# format(): formatted output
print('Prediction: {}'.format(prediction))
# No test set exists yet, so score on the same data the model was fitted on
print('With KNN (K=3) accuracy is: ',knn.score(x,y))

Prediction: [ 326 326 327 ... 2039 2732 2489]

KNN with a 30% test set

from sklearn.model_selection import train_test_split
# Set the features and the target
x = data[['carat','cut','clarity','depth','table','x','y','z']]
y = data.loc[:,'price']
# Split into training and test sets; a fixed random seed makes the split reproducible
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3,random_state = 1)
# Set k to 3
knn = KNeighborsClassifier(n_neighbors = 3)
# Fit the model on the training set
knn.fit(x_train,y_train)
# Predict on the test set and report accuracy
prediction = knn.predict(x_test)
print('Prediction: {}'.format(prediction))
print('With KNN (K=3) accuracy is: ',knn.score(x_test,y_test))

Prediction: [ 449 6321 2131 ... 625 730 4168]
With KNN (K=3) accuracy is: 0.013842541095043875

Hyperparameter tuning

# Model complexity
neig = np.arange(1, 25)
train_accuracy = []
test_accuracy = []
# Loop over k = 1, ..., 24 (np.arange includes 1 and excludes 25)
for k in neig:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit KNN on the training set
    knn.fit(x_train,y_train)
    # Accuracy on the training set
    train_accuracy.append(knn.score(x_train, y_train))
    # Accuracy on the test set
    test_accuracy.append(knn.score(x_test, y_test))

# Visualization
plt.figure(figsize=[13,8])
plt.plot(neig, test_accuracy, label = 'Testing Accuracy')
plt.plot(neig, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.title('k-value VS Accuracy')
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.xticks(neig)
plt.savefig('graph.png')
plt.show()
# Look up the k that gave the highest test accuracy
print("Best accuracy is {} with K = {}".format(np.max(test_accuracy),neig[np.argmax(test_accuracy)]))

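KNN does poorly here. The price we want to predict is a continuous quantity, but KNeighborsClassifier treats every distinct price as its own class and accuracy only counts exact matches; KNN classification is better suited to categorical targets.
So let's try some other methods.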
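One way to keep KNN while treating price as a continuous target is `KNeighborsRegressor`, which averages the neighbours' prices and reports R^2 from `score`. The sketch below runs on synthetic data with a made-up price formula rather than on the diamonds file, and the scaling step is an added assumption, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: price grows with carat, depth is irrelevant noise
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, size=500)
depth = rng.uniform(55, 70, size=500)
price = 4000 * carat ** 2 + rng.normal(0, 200, size=500)
X_demo = np.column_stack([carat, depth])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, price, test_size=0.3, random_state=1)

# Scale the features so no column dominates the distance, then average
# the prices of the 3 nearest training points
knn_reg = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
knn_reg.fit(X_tr, y_tr)

# For regressors, score() returns R^2 instead of accuracy
r2 = knn_reg.score(X_te, y_te)
print('KNN regressor R^2: ', r2)
```

Because R^2 rewards predictions that are close rather than exactly equal, it is a fairer yardstick for a price model than classification accuracy.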

Linear regression

# Use carat as the only feature
x = np.array(data.loc[:,'carat']).reshape(-1,1)
y = np.array(data.loc[:,'price']).reshape(-1,1)
# Linear regression
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
# Range of carat values to predict over
predict_space = np.linspace(x.min(), x.max()).reshape(-1,1)
# Fit the model to the data
reg.fit(x,y)
# Predict
predicted = reg.predict(predict_space)
# R^2 (computed on the same data the model was fitted on)
print('R^2 score: ',reg.score(x, y))
# Plot the regression line and the scatter points
plt.plot(predict_space, predicted, color='black', linewidth=3)
plt.scatter(x=x,y=y)
plt.xlabel('carat')
plt.ylabel('price')
plt.show()

R^2 score: 0.8493305264354857

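# Ridge (x is still the single carat feature from the linear regression above)
from sklearn.linear_model import Ridge
# Fixed random seed; random_state=2 gives a different split than random_state=1
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 2, test_size = 0.3)
# Note: the normalize argument was removed in scikit-learn 1.2, so this line needs an older release
ridge = Ridge(alpha = 0.1, normalize = True)
ridge.fit(x_train,y_train)
ridge_predict = ridge.predict(x_test)
print('Ridge score: ',ridge.score(x_test,y_test))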

Ridge score: 0.8415434800632169
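On scikit-learn releases where `Ridge` no longer accepts `normalize`, a common substitute is to scale the features in a pipeline. The sketch below shows that pattern on synthetic data with made-up coefficients. `StandardScaler` is not numerically identical to the old `normalize=True`, so the same `alpha` will not reproduce the score above exactly.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the carat -> price relationship
rng = np.random.default_rng(2)
carat_demo = rng.uniform(0.2, 3.0, size=(400, 1))
price_demo = 7700 * carat_demo[:, 0] - 2200 + rng.normal(0, 500, size=400)

Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(
    carat_demo, price_demo, test_size=0.3, random_state=2)

# The scaler is fitted on the training data only, then Ridge is fitted
# on the scaled features
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=0.1))
ridge_pipe.fit(Xr_tr, yr_tr)

ridge_r2 = ridge_pipe.score(Xr_te, yr_te)
print('Ridge (pipeline) score: ', ridge_r2)
```

Keeping the scaler inside the pipeline means the test set is transformed with statistics learned from the training set, which avoids leaking test information into the fit.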

# Lasso (this time with all eight features)
from sklearn.linear_model import Lasso
x =  data[['carat','cut','clarity','depth','table','x','y','z']]
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 2, test_size = 0.3)
# Note: the normalize argument was removed in scikit-learn 1.2, so this line needs an older release
lasso = Lasso(alpha = 0.1, normalize = True)
lasso.fit(x_train,y_train)
lasso_predict = lasso.predict(x_test)
print('Lasso score: ',lasso.score(x_test,y_test))
print('Lasso coefficients: ',lasso.coef_)

Lasso score: 0.8854378481948613
Lasso coefficients: [8434.64906964 -123.88370107 -350.45873524 -38.29023158 -14.11709282
-36.3572615 -0. -36.2808792 ]
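The coefficient array is easier to read when each value is paired with its feature name. The sketch below copies the printed coefficients above by hand, in the column order used for `x`; `coef_table` and `dropped` are names introduced here for illustration. A coefficient of exactly zero means Lasso left that feature out of the model.

```python
import pandas as pd

# Feature order as passed to the Lasso model
features = ['carat', 'cut', 'clarity', 'depth', 'table', 'x', 'y', 'z']
# Values copied from the printed Lasso coefficients
coefs = [8434.64906964, -123.88370107, -350.45873524, -38.29023158,
         -14.11709282, -36.3572615, -0.0, -36.2808792]

# Label each coefficient with its feature
coef_table = pd.Series(coefs, index=features)

# Features whose coefficient was shrunk to exactly zero
dropped = coef_table[coef_table == 0].index.tolist()

print(coef_table)
print('Features with zero coefficient: ', dropped)
```

In a live session the same table can be built directly with `pd.Series(lasso.coef_, index=x.columns)`.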

# Random forest (no random_state is set, so the score varies slightly between runs)
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
print('RandomForest score: ',rf.score(x_test,y_test))

RandomForest score: 0.9365442332739605

# Gradient boosting
from sklearn.ensemble import GradientBoostingRegressor
# loss='squared_error' is the least-squares loss (named 'ls' before scikit-learn 1.0)
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1,
                                random_state=0, loss='squared_error', verbose = 1)
gbr.fit(x_train, y_train)
y_pred = gbr.predict(x_test)
print('GradientBoosting score: ',gbr.score(x_test,y_test))

  Iter       Train Loss   Remaining Time 
     1    14094461.6429            0.48s
     2    12496608.8546            0.67s
     3    11168569.3479            0.61s
     4     9986874.8068            0.58s
     5     9008825.0389            0.56s
     6     8133660.7414            0.56s
     7     7402916.6391            0.55s
     8     6762929.6866            0.53s
     9     6204219.0082            0.52s
    10     5728951.1243            0.51s
    20     3198851.7843            0.41s
    30     2385567.8147            0.35s
    40     2077353.7183            0.34s
    50     1886477.5667            0.30s
    60     1755080.7615            0.23s
    70     1660608.8724            0.17s
    80     1592460.5433            0.11s
    90     1541833.7987            0.06s
   100     1504583.3182            0.00s

GradientBoosting score: 0.9066881031052523

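Of the models tried, the random forest performs best, with a test score (R^2) of about 0.937, ahead of gradient boosting (0.907) and Lasso (0.885).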
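A ranking based on a single train/test split can shift with the random seed. One way to check it, sketched below on synthetic data with a made-up target, is to compare the models with k-fold cross-validation; the data, the model settings and the `cv_scores` dictionary are illustrative assumptions, not results from the diamonds file.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with a nonlinear target
rng = np.random.default_rng(0)
X_cv = rng.uniform(0.2, 3.0, size=(300, 2))
y_cv = 4000 * X_cv[:, 0] ** 2 + 500 * X_cv[:, 1] + rng.normal(0, 300, size=300)

models = {
    'LinearRegression': LinearRegression(),
    'RandomForest': RandomForestRegressor(n_estimators=50, random_state=0),
    'GradientBoosting': GradientBoostingRegressor(random_state=0),
}

# Mean R^2 over 5 folds for each model
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_cv, y_cv, cv=5, scoring='r2')
    cv_scores[name] = scores.mean()
    print('{}: mean R^2 = {:.3f}'.format(name, scores.mean()))
```

Averaging over folds gives each model several different test sets, so a gap that survives cross-validation is more convincing than one seen on a single split.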
