Kaggle Data Analysis Project in Practice: Boston House Price Analysis and Prediction

Introduction to House Price Prediction

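This lab is based on house price prediction, an introductory challenge on Kaggle and one of the classic entry-level data analysis projects there. The lab uses this project to walk you through the basics of data analysis.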


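House prices generally depend on factors such as the size of the rooms, the city the house is in, and its layout. The prediction task is: given data on factors related to the price, predict the price of the house.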

Data Preview

This lab uses the data provided by the Kaggle house price challenge. The dataset is a CSV file, so it can be read directly with Pandas.
First, load the data.

import pandas as pd
import warnings
warnings.filterwarnings("ignore")

train = pd.read_csv(
    'https://labfile.oss.aliyuncs.com/courses/1363/HousePrice.csv')
train

Pandas returns the data as a DataFrame, its own tabular data structure. Use .head() to view the first 5 rows.

train.head()

Similarly, use .tail() to view the last 5 rows.

train.tail()

Use the .shape attribute to check the dimensions of the data.

train.shape

The output shows 1,460 rows, each with 81 columns. Next, list the columns the data contains.

train.columns

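The last column, SalePrice, is the price of the house. The 80 columns before it hold factors related to the price and are usually called feature columns. A few examples:
YearBuilt: year the house was built
GarageCars: garage capacity, in number of cars
HouseStyle: style of the house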

A First Look at the Data

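In the dataset, GrLivArea is the above-ground living area of the house. To judge how living area relates to price, plot one against the other. The plotting tools used here are Matplotlib and Seaborn, so import them first.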

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
color = sns.color_palette()
sns.set_style('darkgrid')

Note that the area is measured in square feet, not square meters, so values such as 2000 or 3000 are nothing to be surprised about.

fig, ax = plt.subplots()
# Draw a scatter plot
ax.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

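The plot shows a roughly linear relationship between living area and price: the larger the area, the higher the price. Look closely and you will also see two points in the lower-right corner that do not fit the pattern (very large houses sold at low prices). Points like these are called outliers, and they are removed here.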

# Remove the outliers
train_drop = train.drop(
    train[(train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000)].index)

# Redraw the plot
fig, ax = plt.subplots()
ax.scatter(train_drop['GrLivArea'], train_drop['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
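
The outlier filter above combines two boolean conditions into a single mask. As a minimal sketch of the same idea, the snippet below applies it to a tiny hand-made DataFrame; the values and the names toy, is_outlier, and toy_clean are invented for illustration and are not taken from the Kaggle file.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "GrLivArea": [1500, 2200, 4700, 5600, 4300],
    "SalePrice": [180000, 250000, 185000, 160000, 750000],
})

# True for rows that are very large but unusually cheap.
is_outlier = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 300000)

# ~ negates the mask, so only the non-outlier rows are kept.
toy_clean = toy[~is_outlier]
print(toy_clean)
```

Filtering with a negated mask is equivalent to calling drop with the index of the matching rows, as the lab code does. Note that the large but expensive house (last row) is kept, because it fails the price condition.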

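The plots above relate living area to price. Both are continuous values, so their relationship can be drawn directly as a scatter plot. The dataset also contains categorical features, and for those a box plot is a better choice. For example, OverallQual rates the overall material and finish quality of the house and takes a small set of discrete levels. Plot it against the price.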

var = 'OverallQual'
data = pd.concat([train_drop['SalePrice'], train_drop[var]], axis=1)
# Draw the box plot
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000)

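The plot shows that the higher the OverallQual level, that is, the better the materials and finish, the higher the price.
So far each feature has been compared with the price one at a time. A heatmap shows the correlations among features and with the price all at once. To keep it readable, only the 10 columns most correlated with the price are used.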

import numpy as np

k = 10
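# Correlation matrix of the numeric columns
# (recent pandas versions need numeric_only=True when text columns are present)
corrmat = train_drop.corr(numeric_only=True)
# The k columns most correlated with SalePrice (SalePrice itself included)
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
# Correlation matrix restricted to those k columns
cm = np.corrcoef(train_drop[cols].values.T)
# Draw the heatmap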
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={
                 'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
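
One detail of the selection above: nlargest ranks by the signed correlation, so a feature with a strong negative correlation would not be picked. A minimal sketch of ranking by absolute correlation is shown below on synthetic data; the column names Area, Quality, Noise, and Price and all values are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic data: Price depends on Area and Quality, but not on Noise.
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 400, n)
quality = rng.integers(1, 11, n)
noise_col = rng.normal(0, 1, n)
price = 100 * area + 20000 * quality + rng.normal(0, 20000, n)
toy = pd.DataFrame(
    {"Area": area, "Quality": quality, "Noise": noise_col, "Price": price})

# Correlation of every column with Price, ranked by absolute value.
corr_with_price = toy.corr()["Price"]
top3 = corr_with_price.abs().nlargest(3).index
print(list(top3))
```

The target column always comes first, because it correlates perfectly with itself, which is why the lab's top 10 really contains 9 features plus SalePrice.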

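The heatmap shows that the price is most strongly correlated with living area and overall quality, which matches intuition. Next, plot the pairwise relationships among these features.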

# Draw pairwise scatter plots
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea',
        'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
# Recent seaborn versions use height= in place of the old size= argument
sns.pairplot(train_drop[cols], height=2.5)
plt.show()

Data Preprocessing

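The visualizations above only give a first impression of the data. Now apply some simple preprocessing. The data preview showed that the first column is Id, an identifier that has no influence on the price, so it is dropped first. That leaves 80 columns.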

train_drop1 = train_drop.drop("Id", axis=1)
train_drop1.head()

The SalePrice column is the price, that is, the column to be predicted, so analyze it first. Use the describe method to get its basic statistics.

train_drop1['SalePrice'].describe()

Plot its distribution. SciPy is used here to fit a normal distribution to the data.

from scipy.stats import norm, skew

sns.distplot(train_drop1['SalePrice'], fit=norm)

# Fit a normal distribution: returns the mean and standard deviation
(mu, sigma) = norm.fit(train_drop1['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Label the plot (raw string so that \mu and \sigma are not treated as escapes)
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
           loc='best')
plt.ylabel('Frequency')
# Set the title
plt.title('SalePrice distribution')

The prices do not appear to follow a normal (Gaussian) distribution. Draw a Q-Q plot to check.

from scipy import stats

fig = plt.figure()
res = stats.probplot(train_drop1['SalePrice'], plot=plt)
plt.show()

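Prediction models are usually machine learning algorithms, and many of them are derived under the assumption that the data is Gaussian. The price is therefore transformed first so that it is closer to a Gaussian distribution. This is done with NumPy's np.log1p, which computes log(1 + x).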

# Log-transform the target: log(1 + x)
train_drop1["SalePrice"] = np.log1p(train_drop1["SalePrice"])

# Redraw the distribution
sns.distplot(train_drop1['SalePrice'], fit=norm)

# Recompute the mean and standard deviation after the transform
(mu, sigma) = norm.fit(train_drop1['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
           loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

# Draw the Q-Q plot

fig = plt.figure()
res = stats.probplot(train_drop1['SalePrice'], plot=plt)
plt.show()

After the log transform, the data is roughly Gaussian in shape.
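
Because the model is now trained on log(1 + price), its predictions are also on that scale and have to be converted back before they can be read as prices. A minimal sketch of the round trip, using np.expm1 as the inverse of np.log1p; the three prices are illustrative values, not rows quoted from the dataset.

```python
import numpy as np

# Illustrative prices (assumed values).
prices = np.array([34900.0, 163000.0, 755000.0])

log_prices = np.log1p(prices)    # log(1 + x), the transform used in the lab
restored = np.expm1(log_prices)  # exp(y) - 1, its exact inverse

print(log_prices)
print(restored)
```

The same np.expm1 call can be applied to model predictions to report them in the original currency units.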

Feature Engineering

The dataset may contain missing values. Use the isnull method to check.

train_drop1.isnull().sum().sort_values(ascending=False)[:20]  # top 20 columns

To make this easier to read, compute the percentage of missing values per column.

# Divide by the row count of the same DataFrame (train_drop1, after outlier removal)
train_na = (train_drop1.isnull().sum() / len(train_drop1)) * 100
train_na = train_na.drop(
    train_na[train_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio': train_na})
missing_data.head(20)
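
The missing-ratio computation can be wrapped in a small reusable helper. The sketch below is one possible way to write it; the function name missing_ratio and the toy DataFrame are assumptions for illustration. It relies on the fact that the mean of a boolean column is the fraction of True values.

```python
import numpy as np
import pandas as pd


def missing_ratio(df):
    """Return the percentage of missing values per column, largest first.

    Columns without any missing value are left out.
    """
    ratio = df.isnull().mean() * 100
    return ratio[ratio > 0].sort_values(ascending=False)


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "SalePrice": [200000, 150000, 180000, 300000],
})

ratios = missing_ratio(toy)
print(ratios)
```

Using .mean() on the boolean mask always divides by the row count of the DataFrame being inspected, which avoids mixing up denominators when rows have been dropped earlier.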

The result shows that more than 99% of the PoolQC column is missing, and about 96% of the MiscFeature column. Visualize the ratios to see them more clearly.

f, ax = plt.subplots(figsize=(15, 6))
plt.xticks(rotation=90)  # rotation is a number of degrees
sns.barplot(x=train_na.index, y=train_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)

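The analysis shows that roughly 20 columns contain missing values. They need to be filled before a prediction model is built.
According to the data description, PoolQC is the quality of the swimming pool, and a missing value means there is no pool. This column has the most missing values, which means most houses have no pool, and that matches reality.
Many other columns behave the same way, for example the masonry veneer type. All of these categorical columns are therefore filled with None.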

feature = ['PoolQC', 'MiscFeature', 'Alley', 'Fence',
           'FireplaceQu', 'GarageType', 'GarageFinish',
           'GarageQual', 'GarageCond', 'BsmtQual',
           'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
           'BsmtFinType2', 'MasVnrType', 'MSSubClass']
for col in feature:
    train_drop1[col] = train_drop1[col].fillna('None')

Numeric columns related to features such as garage area and basement area are filled with 0, meaning there is no garage or basement.

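feature = ['GarageYrBlt', 'GarageArea', 'GarageCars',
           'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
           'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath',
           'MasVnrArea']
for col in feature:
    train_drop1[col] = train_drop1[col].fillna(0)

# Electrical is a categorical column, so fill it with its mode instead of 0
train_drop1['Electrical'] = train_drop1['Electrical'].fillna(
    train_drop1['Electrical'].mode()[0])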

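LotFrontage is the length of street connected to the property, in feet. Houses in the same neighborhood are likely to have similar frontage, so missing values are filled with the median LotFrontage of the neighborhood.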

train_drop1["LotFrontage"] = train_drop1.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
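
To see what the group-wise fill does, here is a minimal sketch on a tiny hand-made DataFrame with two neighborhoods; all values are invented for illustration. Each missing value is replaced by the median of its own group, not by the global median.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 30.0, 40.0, np.nan],
})

# transform returns a result aligned with the original rows,
# so it can be assigned straight back to the column.
toy["LotFrontage"] = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median()))
print(toy)
```

If every value in a group is missing, the group median is itself NaN and the gap stays unfilled. In that case a second fillna with the overall median would be one possible fallback.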

MSZoning is the general zoning classification. Missing values are filled with the mode.

train_drop1['MSZoning'] = train_drop1['MSZoning'].fillna(
    train_drop1['MSZoning'].mode()[0])

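The Utilities column has little relation to the SalePrice column to be predicted (almost every house has the same value), so it is simply dropped.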

train_drop2 = train_drop1.drop(['Utilities'], axis=1)

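Functional describes the functionality of the home. According to the data description, a missing value means the house has typical functionality, so missing values are filled with the constant Typ.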

# Fill on train_drop2, the DataFrame that is used from here on
train_drop2["Functional"] = train_drop2["Functional"].fillna("Typ")

train_drop2.isnull().sum().sort_values(ascending=False)[:20]

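After filling, no missing values remain.
The features in the dataset fall into two groups: numeric and categorical. A numeric feature consists of continuous values, for example the area of the house. A categorical feature consists of two or more categories, for example whether the house has a swimming pool, which has the two categories yes and no.
Some features are categorical but stored as numbers, for example the month of sale. These need to be converted into categorical features.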

feature = ['MSSubClass', 'OverallCond', 'YrSold', 'MoSold']
for col in feature:
    train_drop2[col] = train_drop2[col].apply(str)

Next, encode some of the categorical columns, turning each category into an integer label.

from sklearn.preprocessing import LabelEncoder

cols = ['FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond', 'HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1',
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond',
        'YrSold', 'MoSold']
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(train_drop2[c].values))
    train_drop2[c] = lbl.transform(list(train_drop2[c].values))
train_drop2[cols].head()
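
One caveat about the encoding above: LabelEncoder assigns integers in sorted (alphabetical) order of the labels, which does not follow the quality ranking of codes such as Po, Fa, TA, Gd, Ex. If the order should be preserved, an explicit mapping is one alternative. The sketch below is an assumption-based illustration, not the approach used in the lab; the dictionary quality_order and the sample values are made up for the example, and the alphabetical mapping only mimics what LabelEncoder would produce.

```python
import pandas as pd

# Sample quality codes (illustrative values).
toy = pd.Series(["TA", "Gd", "None", "Ex", "Fa"], name="KitchenQual")

# What sorted-label encoding yields: Ex=0, Fa=1, Gd=2, None=3, TA=4.
alphabetical = {label: i for i, label in enumerate(sorted(toy.unique()))}

# Explicit mapping that follows the quality scale from worst to best.
quality_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
encoded = toy.map(quality_order)

print(alphabetical)
print(encoded.tolist())
```

With the alphabetical codes, the best grade (Ex) gets the smallest number, so a linear model cannot read the integer as "higher is better". The explicit mapping keeps that meaning.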

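The data does not give the total area of the house, that is, the combined area of the first floor, second floor, and basement. This feature can be derived from the existing columns.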

train_drop2['TotalSF'] = train_drop2['TotalBsmtSF'] + \
    train_drop2['1stFlrSF'] + train_drop2['2ndFlrSF']

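Earlier, SalePrice was found not to follow a normal distribution and was log-transformed to bring it closer to one. The numeric feature columns are now analyzed in the same way. First, measure their skewness with scipy.stats.skew from SciPy.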

numeric_feats = train_drop2.dtypes[train_drop2.dtypes != "object"].index

# Compute the skewness of each numeric feature
skewed_feats = train_drop2[numeric_feats].apply(
    lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew': skewed_feats})
skewness.head(10)

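The result shows that the MiscVal column has the largest skewness. The larger the skewness, the further the distribution of that column is from a Gaussian.
Now "correct" these feature columns with the Box-Cox transform.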

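from scipy.special import boxcox1p

# Keep only the rows whose absolute skewness exceeds 0.75
# (indexing with the 'Skew' column drops rows; masking the whole DataFrame would not)
skewness = skewness[abs(skewness['Skew']) > 0.75]
# SalePrice has already been log-transformed, so keep the target out of the list
skewed_features = skewness.index.drop('SalePrice', errors='ignore')
lam = 0.15
for feat in skewed_features:
    train_drop2[feat] = boxcox1p(train_drop2[feat], lam)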
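
For reference, boxcox1p(x, lam) computes ((1 + x)^lam - 1) / lam when lam is not 0, and log(1 + x) when lam is 0. The sketch below re-implements that formula with NumPy only, to show what the transform does to widely spread values; the function name boxcox1p_manual and the sample values are assumptions for illustration, and the lab itself uses the SciPy function.

```python
import numpy as np


def boxcox1p_manual(x, lam):
    """Box-Cox transform of 1 + x, written out from its formula."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log1p(x)
    return ((1.0 + x) ** lam - 1.0) / lam


# Values spanning several orders of magnitude (illustrative).
values = np.array([0.0, 10.0, 1000.0, 100000.0])
transformed = boxcox1p_manual(values, 0.15)
print(transformed)
```

The transform maps 0 to 0, keeps the ordering of the values, and compresses large values much more than small ones, which is what reduces right skew.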

The remaining categorical features, those still stored as text labels, are encoded with one-hot encoding.

data_y = train_drop2['SalePrice']
data_X = train_drop2.drop(['SalePrice'], axis=1)

# dtype=float keeps the one-hot columns numeric instead of boolean
data_X_oh = pd.get_dummies(data_X, dtype=float)
print(data_X_oh.shape)
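
As a small illustration of what one-hot encoding produces, the sketch below runs pd.get_dummies on a two-column toy DataFrame; the values are invented for the example. Each distinct text label becomes its own 0/1 column, while numeric columns pass through unchanged.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "HouseStyle": ["1Story", "2Story", "1Story"],
    "GrLivArea": [1200, 2100, 1500],
})

encoded = pd.get_dummies(toy, dtype=float)
print(encoded.columns.tolist())
print(encoded)
```

This is why the column count grows after encoding: a text column with many distinct labels expands into that many indicator columns.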

Prediction Model

The preprocessing and feature engineering are now complete. The next step is to build a model that makes predictions from the processed data.

from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

Split the data: the first 70% of the rows are used for training and the remaining 30% for testing.

data_y_v = data_y.values  # convert to NumPy arrays
data_X_v = data_X_oh.values
length = int(len(data_y)*0.7)

# Split the dataset
train_y = data_y_v[:length]
train_X = data_X_v[:length]
test_y = data_y_v[length:]
test_X = data_X_v[length:]
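
The split above takes rows in file order, which is fine only if the rows are not sorted in any meaningful way. A shuffled split is a common alternative. The sketch below is one way to do it with NumPy alone; the function name shuffled_split, its parameters, and the toy arrays are assumptions for illustration (scikit-learn's train_test_split offers the same functionality).

```python
import numpy as np


def shuffled_split(X, y, train_fraction=0.7, seed=0):
    """Shuffle row indices with a fixed seed, then cut into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    cut = int(len(y) * train_fraction)
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Toy arrays: row i of X is [2*i, 2*i + 1] and its target is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_tr, X_te, y_tr, y_te = shuffled_split(X, y)
print(y_tr, y_te)
```

Indexing X and y with the same index arrays keeps every feature row paired with its own target, and the fixed seed makes the split reproducible.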

Build the model and train it. The model used here is Lasso, a variant of linear regression that adds L1 regularization.

model = Lasso()
model.fit(train_X, train_y)

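Use the trained model to make predictions, and measure their quality with the mean squared error. Because SalePrice was log-transformed earlier, this error is on the log scale.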

y_pred = model.predict(test_X)
mean_squared_error(test_y, y_pred)
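
The mean squared error is often reported as its square root (RMSE), which is in the same units as the target. Computed on log-transformed prices, it corresponds to the kind of score the Kaggle competition uses. A minimal sketch with NumPy only; the helper name rmse and the example prices are assumptions for illustration, not results from the model above.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Illustrative true and predicted prices in original units (assumed values).
true_prices = np.array([200000.0, 150000.0, 320000.0])
pred_prices = np.array([210000.0, 140000.0, 300000.0])

# Compare on the log scale, matching the log1p transform applied to SalePrice.
score = rmse(np.log1p(true_prices), np.log1p(pred_prices))
print(score)
```

On the log scale the error reflects relative differences, so being off by 10,000 on a cheap house counts more than the same miss on an expensive one.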