The previous section, "Statistical Data Visualization Toolkit: Seaborn", introduced the Seaborn package. This section shows how to use Seaborn to explore the Boston housing price dataset.
Step 1: Download the Boston housing dataset to your local machine (with Xunlei or any other download tool). Download link: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data
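If you would rather fetch the file from a script than a download manager, here is a minimal sketch using only Python's standard library (it assumes the UCI server is directly reachable; the path ./housing.data matches the one used in step 2):
import os
import urllib.request
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
datafile = './housing.data'
# Download housing.data into the working directory if it is not already there
if not os.path.exists(datafile):
    urllib.request.urlretrieve(url, datafile)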
Step 2: Read the dataset into memory with NumPy. The code is as follows:
import numpy as np
# Load the data from the file
datafile = './housing.data'
housing_data = np.fromfile(datafile, sep=' ')
print(housing_data.shape)
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE','DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
feature_num = len(feature_names)
# Reshape the flat array into shape [N, 14]
housing_data = housing_data.reshape([housing_data.shape[0] // feature_num, feature_num])
print(housing_data.shape)
Output:
(7084,)
(506, 14)
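As a side note, because the later steps build a pandas DataFrame anyway, the same whitespace-delimited file can also be loaded with a single pandas call. This is only an equivalent sketch, not part of the original steps:
import pandas as pd
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# housing.data has no header row and is separated by runs of whitespace
df = pd.read_csv('./housing.data', sep=r'\s+', header=None, names=feature_names)
print(df.shape)  # expected: (506, 14)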
Step 3: Examine the relationships between features, mainly the pairwise relationships between variables (linear or nonlinear, and whether an obvious correlation exists).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Examine pairwise relationships between the variables (linear or nonlinear, obvious correlation or not)
features_np = np.array([x[:13] for x in housing_data], np.float32)
labels_np = np.array([x[-1] for x in housing_data], np.float32)
df = pd.DataFrame(housing_data, columns=feature_names)
# Plot MEDV (y) against each of the 13 attributes (x)
sns.pairplot(df.dropna(), y_vars=feature_names[-1], x_vars=feature_names[:-1], diag_kind='kde')
plt.show()
Output:
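The pairplot call above plots MEDV against each individual attribute. To look at every pairwise combination of columns at once, a full pair plot over the whole DataFrame can also be drawn; this is a supplementary sketch (it reuses the df built above, and with 14 columns the figure is quite large):
import matplotlib.pyplot as plt
import seaborn as sns
# One scatter plot for every pair of columns, with KDE curves on the diagonal
sns.pairplot(df.dropna(), diag_kind='kde')
plt.show()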
Step 4: Correlation analysis between the features
# Correlation analysis
fig, ax = plt.subplots(figsize=(15, 1))
corr_data = df.corr().iloc[-1]
corr_data = np.asarray(corr_data).reshape(1, 14)
ax = sns.heatmap(corr_data, cbar=True, annot=True)
plt.show()
Output:
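The heatmap above shows only the last row of the correlation matrix, i.e. how strongly each column correlates with MEDV. If the full attribute-to-attribute correlations are of interest, the complete matrix can be drawn the same way; this supplementary sketch reuses the df built in step 3:
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(12, 10))
# Full 14 x 14 correlation matrix, annotated with the coefficients
sns.heatmap(df.corr(), cbar=True, annot=True, fmt='.2f', ax=ax)
plt.show()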
Step 5: Examine the value distribution range of each feature
sns.boxplot(data=df.iloc[:, 0:13])
plt.show()
Output:
Conclusions:
- The correlations among the features in the Boston housing data are not large.
- The value ranges of the features differ so much that the maximum, minimum, and outliers of each attribute cannot all be shown clearly on a single canvas, so the data needs to be normalized (the formula used is explained below).
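The normalization used below subtracts each attribute's mean and divides by its range: x' = (x - mean) / (max - min). After this transform every attribute has mean 0 and a total span of exactly 1, so all 13 box plots fit comfortably on one canvas.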
Normalization code:
features_max = housing_data.max(axis=0)
features_min = housing_data.min(axis=0)
features_avg = housing_data.sum(axis=0) / housing_data.shape[0]
BATCH_SIZE = 20
# Normalize each attribute: subtract the column mean, divide by the column range
def feature_norm(input):
    f_size = input.shape
    output_features = np.zeros(f_size, np.float32)
    for batch_id in range(f_size[0]):
        for index in range(13):
            output_features[batch_id][index] = (input[batch_id][index] - features_avg[index]) / (features_max[index] - features_min[index])
    return output_features
# Normalize only the 13 attribute columns (the MEDV label is left unchanged)
housing_features = feature_norm(housing_data[:, :13])
# print(housing_features.shape)
housing_data = np.c_[housing_features, housing_data[:, -1]].astype(np.float32)
# print(housing_data[0])
# Check each attribute's distribution after normalization
features_np = np.array([x[:13] for x in housing_data], np.float32)
labels_np = np.array([x[-1] for x in housing_data], np.float32)
data_np = np.c_[features_np, labels_np]
df = pd.DataFrame(data_np, columns=feature_names)
sns.boxplot(data=df.iloc[:, 0:13])
plt.show()
Normalized result:
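For reference, the element-by-element loop in feature_norm can be replaced by one vectorized NumPy expression. This sketch assumes it is applied to the raw attribute columns from step 2 (i.e. before housing_data is overwritten with the normalized values) and produces the same result:
import numpy as np
raw = housing_data[:, :13]  # raw attribute columns from step 2
# Subtract each column's mean and divide by each column's range in one step
normalized = (raw - raw.mean(axis=0)) / (raw.max(axis=0) - raw.min(axis=0))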