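Chapter background
This chapter is based on Assignment 1 of the Coursera course python-machine-learning.
Chapter references
Chapter contents
- Pandas usage
- DataFrame usage
- Series usage
- K-nearest neighbors (KNN, k-NearestNeighbor)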
0. The breast cancer dataset
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.DESCR) # Print the data set description
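Besides printing the description, it can help to check the shape and the label names before building a DataFrame. A minimal sketch, assuming scikit-learn is installed; the variable names n_samples and n_features are illustrative only:

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# The returned object behaves like a dict with attribute access
n_samples, n_features = cancer.data.shape
print(n_samples, n_features)        # number of samples, number of features
print(list(cancer.target_names))    # label 0 is 'malignant', label 1 is 'benign'
print(cancer.feature_names[:3])     # first few feature (column) names
```

Knowing that label 0 means malignant and label 1 means benign is what makes the counts in the later sections readable.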
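1. Pandas.DataFrame
Creating a DataFrame:
dataFrame = pd.DataFrame(data=cancer.data, index=pd.RangeIndex(start=0, stop=569, step=1),
                         columns=cancer.feature_names)
# Append the labels so that the 'target' column used in the following snippets exists
dataFrame['target'] = cancer.target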
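The explicit RangeIndex in the call above matches the index pandas generates by default, so the hard-coded row count could probably be dropped. A small sketch on made-up data (the array and the column names f0, f1, f2 are hypothetical stand-ins for cancer.data and cancer.feature_names):

```python
import numpy as np
import pandas as pd

# Toy stand-in for cancer.data: 4 samples, 3 features (values are made up)
data = np.arange(12, dtype=float).reshape(4, 3)
names = ["f0", "f1", "f2"]  # hypothetical feature names

# Index given explicitly, as in the text
explicit = pd.DataFrame(data=data, index=pd.RangeIndex(start=0, stop=4, step=1), columns=names)
# Index omitted: pandas falls back to a RangeIndex starting at 0
implicit = pd.DataFrame(data=data, columns=names)

same = explicit.equals(implicit)
print(same)
```

Omitting the index keeps the code valid if the number of rows ever changes.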
Slicing a DataFrame:
# Select all rows of columns 0-29 (the first 30 columns)
X = dataFrame.iloc[:, :30]
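Positional slicing with iloc is one of several ways to separate the feature columns from a label column. The sketch below compares three options on a made-up frame; the column names a, b, and target are illustrative:

```python
import pandas as pd

# Toy frame: two feature columns plus a label column (made-up values)
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "target": [0, 1, 0]})

features_by_pos = df.iloc[:, :2]             # first two columns, selected by position
features_by_name = df.loc[:, "a":"b"]        # same columns by label; the end label is included
features_dropped = df.drop(columns="target") # everything except the label column

print(list(features_by_pos.columns))
```

Dropping the label column by name avoids having to know how many feature columns there are.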
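Counting how often a value occurs in a DataFrame column
One approach is to convert the column to a list and use list.count. This assumes the DataFrame has a target column holding cancer.target (0 = malignant, 1 = benign):
malignant_count = list(dataFrame['target']).count(0)
or
malignant_count = list(dataFrame.target).count(0)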
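Converting to a list works, but pandas can also count values directly. A minimal sketch on a made-up label column, using Series.value_counts and a boolean mask as alternatives to list.count:

```python
import pandas as pd

# Toy label column (made-up values); 0 = malignant, 1 = benign as in the dataset
target = pd.Series([0, 1, 1, 0, 1], name="target")

counts = target.value_counts()             # frequency of each distinct value
malignant_count = int(counts.loc[0])       # look up the count for label 0
benign_count = int((target == 1).sum())    # boolean mask, then count the True entries

print(malignant_count, benign_count)
```

value_counts returns all frequencies in one pass, which is convenient when more than one class count is needed.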
2. Pandas.Series
malignant_count = list(dataFrame['target']).count(0)
benign_count = list(dataFrame['target']).count(1)
series = pd.Series(data=[malignant_count, benign_count], index=["malignant", "benign"])
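Once the counts are stored in a labelled Series, they can be read by name and used in arithmetic. A short sketch, assuming the class counts listed in the dataset description (212 malignant, 357 benign) instead of recomputing them:

```python
import pandas as pd

# Class counts as listed in the dataset description: 212 malignant, 357 benign
series = pd.Series(data=[212, 357], index=["malignant", "benign"])

total = int(series.sum())
share = series / total                     # element-wise division keeps the index labels

print(series["malignant"], total)
print(share.round(3).to_dict())
```

The index labels travel with the values, so the class proportions stay readable without a separate lookup table.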
3. train_test_split()
<!--
test_size : float, int or None, optional (default=None)
    If float, should be between 0.0 and 1.0 and represent the proportion
    of the dataset to include in the test split. If int, represents the
    absolute number of test samples. If None, the value is set to the
    complement of the train size. If ``train_size`` is also None, it will
    be set to 0.25.
-->
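Here X holds the 30 feature columns and y the target column; test_size=143 holds out 143 of the 569 samples (about 25%) as the test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=143, random_state=0)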
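To see how the integer and float forms of test_size relate, the sketch below splits stand-in arrays that only share the dataset's dimensions (569 rows, 30 columns); the values themselves are made up:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the dataset's dimensions; the contents are made up
X = np.zeros((569, 30))
y = np.arange(569) % 2

# test_size as an absolute number of test samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=143, random_state=0)
print(X_train.shape, X_test.shape)

# test_size as a proportion of the dataset
Xf_train, Xf_test, yf_train, yf_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(Xf_train.shape, Xf_test.shape)
```

For 569 samples, both calls should produce 426 training rows and 143 test rows, because a 0.25 proportion is rounded up to a whole number of test samples.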
4. KNN
The complete code is shown below:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
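# Load the breast cancer dataset: 569 samples with 30 feature dimensions
cancer = load_breast_cancer()
# Convert the dataset to a DataFrame; once the target column is joined, the shape is (569, 31), with target (0/1) as the last column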
dataFrame = pd.DataFrame(data=cancer.data, index=pd.RangeIndex(start=0, stop=569, step=1),
columns=cancer.feature_names)
dataTarget = pd.DataFrame(data=cancer.target, index=pd.RangeIndex(start=0, stop=569, step=1), columns=['target'])
finalDataFrame = dataFrame.join(dataTarget)
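# Split the frame into the 30 feature columns (X) and the target labels (y)
X = finalDataFrame.iloc[:, :30]
y = pd.Series(data=finalDataFrame.target)
# Hold out 143 samples as the test set; random_state=0 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=143, random_state=0)
# Fit a k-nearest-neighbors classifier that uses a single neighbor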
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
# Try a prediction using the mean value of each feature
means = finalDataFrame.mean()[:-1].values.reshape(1, -1)
label = knn.predict(means)
print('label', label)
# Evaluate the accuracy on the test set
score = knn.score(X_test, y_test)
print(score)
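To make the idea behind n_neighbors=1 concrete, here is a from-scratch sketch of a 1-nearest-neighbor prediction using plain NumPy and Euclidean distance. The function predict_1nn and the toy data are hypothetical illustrations, not scikit-learn's implementation:

```python
import numpy as np

def predict_1nn(X_train, y_train, X_query):
    """Label each query row with the label of its closest training row."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    # Pairwise squared Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Index of the nearest training row for every query row
    nearest = dists.argmin(axis=1)
    return np.asarray(y_train)[nearest]

# Toy data (made up): two points of class 0 near the origin, one of class 1 far away
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_toy = np.array([0, 0, 1])

pred = predict_1nn(X_toy, y_toy, [[0.2, 0.1], [4.0, 4.5]])
print(pred)
```

The first query point lies next to the origin and receives label 0, while the second lies closest to (5, 5) and receives label 1. With a larger k, the prediction would instead be a majority vote over the k closest training rows.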