Code source first:
https://github.com/Jean-njoroge/Breast-cancer-risk-prediction

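The analysis has four parts, saved as Jupyter notebooks in the repository:
1. Identifying the problem and the data source
2. Exploratory data analysis
3. Preprocessing the data
4. Building a model to predict whether breast cell tissue is malignant or benign

The author split the work into 6 Jupyter notebooks.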

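Notebook 01: Loading the dataset and identifying the problem

Breast cancer is the most common malignancy among women, accounting for nearly one third of the cancers diagnosed among women in the United States, and it is the second leading cause of cancer death among women. Breast cancer results from abnormal growth of cells in the breast tissue, commonly called a tumor. A tumor does not necessarily mean cancer: tumors can be benign (non-cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammography, ultrasound, and biopsy are commonly used to diagnose breast cancer.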

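1.1 Understanding the background

How it works: breast cancer is identified with a fine needle aspiration (FNA) test of the breast. This is a quick and simple procedure that removes some fluid or cells from a breast lesion or cyst (a lump, sore, or swelling) using a fine needle similar to the one used for a blood sample.
A model is built from the test data and labels to classify breast tumors:

1 = malignant (cancerous)
0 = benign (non-cancerous)
This is clearly a binary classification problem.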

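1.2 Getting to know the data

The breast cancer dataset is available from the machine learning repository maintained by the University of California, Irvine (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29). It contains 569 samples of malignant and benign tumor cells.

The first two columns store each sample's unique ID number and the corresponding diagnosis (M = malignant, B = benign).
Columns 3-32 contain 30 real-valued features computed from digitized images of the cell nuclei, which can be used to build a model that predicts whether a tumor is benign or malignant.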

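Ten real-valued features are computed for each cell nucleus:
a) radius (mean of the distances from the center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of the concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)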
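
To make the compactness definition in item f) concrete, here is a minimal sketch that evaluates it for two idealized outlines. The helper name `compactness` and the circle and square shapes are assumptions chosen for illustration; they are not part of the dataset or the notebooks.

```python
import math

def compactness(perimeter: float, area: float) -> float:
    """Compactness as defined in the feature list: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# A circle is the most compact outline: perimeter^2 / area = 4*pi for any radius.
r = 3.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# A square with side s has perimeter 4s and area s^2, so the ratio is 16.
s = 2.0
square = compactness(4 * s, s ** 2)

print(f"circle: {circle:.3f}, square: {square:.3f}")
```

The value depends only on the shape of the outline, not on its size, which is why it can complement radius, perimeter, and area.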

#load libraries
import numpy as np         # linear algebra
import pandas as pd        # data processing, CSV file I/O (e.g. pd.read_csv)

data = pd.read_csv('data/data.csv')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), int64(1), object(1)
memory usage: 142.4+ KB

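There are 30 features in total: for each of the 10 real-valued measurements, the mean, the standard error (se), and the worst value (mean of the three largest values) are computed.
The diagnosis column is the label.
The data has no null values.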

# Show the first rows (head() returns five by default)
data.head()
# Count the samples per label
data.diagnosis.value_counts().plot(kind = "bar")

[Figure: bar chart of diagnosis counts]

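Benign to malignant is roughly 2:1. A 1:1 balance between positive and negative samples is ideal in machine learning, but classification still works normally at 2:1.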
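
As a quick check of the class balance, the sketch below computes the ratio and the accuracy of a trivial majority-class predictor. It assumes the published class counts for this dataset (357 benign, 212 malignant) and writes them out by hand so that it runs without the CSV file.

```python
from collections import Counter

# Class counts for the 569 samples (357 benign, 212 malignant), written out
# by hand so the sketch runs without data/data.csv.
labels = ["B"] * 357 + ["M"] * 212
counts = Counter(labels)

ratio = counts["B"] / counts["M"]
majority_baseline = max(counts.values()) / len(labels)

print(f"benign:malignant = {ratio:.2f}:1")
print(f"accuracy of always predicting the majority class = {majority_baseline:.3f}")
```

The majority-class accuracy is a useful floor: any model evaluated later should beat it clearly before its accuracy is considered meaningful.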

#check for missing variables
data.isnull().any()
data.isnull().any().sum()

0

No data is missing.

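Notebook 02: EDA (exploratory data analysis)

Exploratory data analysis (EDA) is a very important step that should be completed before any modeling, because it lets the data scientist understand the nature of the data without making assumptions. The exploration mainly covers the structure of the data, the distribution of values, whether the dataset contains outliers, and the relationships between features.
It mainly includes:

Descriptive statistics
Data visualization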

2.1 Descriptive statistics

%matplotlib inline
import matplotlib.pyplot as plt
#Load libraries for data processing
import pandas as pd 
from scipy.stats import norm
import seaborn as sns # visualization

plt.rcParams['figure.figsize'] = (15,8) 
plt.rcParams['axes.titlesize'] = 'large'

data = pd.read_csv('data/data.csv')
#basic descriptive statistics
data.iloc[:,2:32].describe()

    radius_mean texture_mean    perimeter_mean  area_mean   smoothness_mean compactness_mean    concavity_mean  concave points_mean symmetry_mean   fractal_dimension_mean  ... radius_worst    texture_worst   perimeter_worst area_worst  smoothness_worst    compactness_worst   concavity_worst concave points_worst    symmetry_worst  fractal_dimension_worst
count   569.000000  569.000000  569.000000  569.000000  569.000000  569.000000  569.000000  569.000000  569.000000  569.000000  ... 569.000000  569.000000  569.000000  569.000000  569.000000  569.000000  569.000000  569.000000  569.000000  569.000000
mean    14.127292   19.289649   91.969033   654.889104  0.096360    0.104341    0.088799    0.048919    0.181162    0.062798    ... 16.269190   25.677223   107.261213  880.583128  0.132369    0.254265    0.272188    0.114606    0.290076    0.083946
std 3.524049    4.301036    24.298981   351.914129  0.014064    0.052813    0.079720    0.038803    0.027414    0.007060    ... 4.833242    6.146258    33.602542   569.356993  0.022832    0.157336    0.208624    0.065732    0.061867    0.018061
min 6.981000    9.710000    43.790000   143.500000  0.052630    0.019380    0.000000    0.000000    0.106000    0.049960    ... 7.930000    12.020000   50.410000   185.200000  0.071170    0.027290    0.000000    0.000000    0.156500    0.055040
25% 11.700000   16.170000   75.170000   420.300000  0.086370    0.064920    0.029560    0.020310    0.161900    0.057700    ... 13.010000   21.080000   84.110000   515.300000  0.116600    0.147200    0.114500    0.064930    0.250400    0.071460
50% 13.370000   18.840000   86.240000   551.100000  0.095870    0.092630    0.061540    0.033500    0.179200    0.061540    ... 14.970000   25.410000   97.660000   686.500000  0.131300    0.211900    0.226700    0.099930    0.282200    0.080040
75% 15.780000   21.800000   104.100000  782.700000  0.105300    0.130400    0.130700    0.074000    0.195700    0.066120    ... 18.790000   29.720000   125.400000  1084.000000 0.146000    0.339100    0.382900    0.161400    0.317900    0.092080
max 28.110000   39.280000   188.500000  2501.000000 0.163400    0.345400    0.426800    0.201200    0.304000    0.097440    ... 36.040000   49.540000   251.200000  4254.000000 0.222600    1.058000    1.252000    0.291000    0.663800    0.207500

# Group by diagnosis and review the output.
# groupby is typically used for per-group aggregates, e.g. the mean or median of each group
diag_gr = data.groupby('diagnosis')
diag_gr.median()
diag_gr.size()  # equivalent to data.diagnosis.value_counts()

2.2 Data Visualizations

Histograms
Density plots
Box plots
Heat maps

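# Set the figure background style and figure size globally
sns.set_style("white")
sns.set_context({"figure.figsize": (10, 8)})

## Count plot of the label
# Pass the column as x=; in recent seaborn versions the first positional argument is `data`
sns.countplot(x=data['diagnosis'],label='Count',palette="Set3", order=["B","M"]) # order sets the order of the bars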

[Figure: count plot of diagnosis]

Split the features into three groups: mean, se, worst

#For a merge + slice:
data_mean=data.iloc[:,2:12]
data_se=data.iloc[:,12:22]
data_worst=data.iloc[:,22:]

print(data_mean.columns)
print(data_se.columns)
print(data_worst.columns)

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean'],
      dtype='object')
Index(['radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se'],
      dtype='object')
Index(['radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst',
       'smoothness_worst', 'compactness_worst', 'concavity_worst',
       'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

Visualizing each feature group: histograms

#Plot histograms of CUT1 variables
data_mean.hist(bins=10, figsize=(15, 10),grid=False,color = "pink")
data_se.hist(bins=10, figsize=(15, 10),grid=False,color = "orange")
data_worst.hist(bins=10, figsize=(15, 10),grid=False,color = "blue")

[Figures: histograms of data_mean, data_se, and data_worst]

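We can see that the concavity and concave points attributes may follow an exponential distribution, while texture, smoothness, and symmetry may be Gaussian or nearly Gaussian. Many machine learning techniques assume Gaussian univariate distributions for the input variables.

Density plots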

#Density Plots
# Keep the returned axes in their own name so the pyplot module `plt` is not overwritten
axes = data_mean.plot(kind= 'density', subplots=True, 
                     layout=(4,3), sharex=False, 
                     sharey=False, fontsize=15, figsize=(15,10))

axes = data_se.plot(kind= 'density', subplots=True, 
                     layout=(4,3), sharex=False, 
                     sharey=False, fontsize=15, figsize=(15,10))

axes = data_worst.plot(kind= 'density', subplots=True, 
                     layout=(4,3), sharex=False, 
                     sharey=False, fontsize=15, figsize=(15,10))

[Figures: density plots of data_mean, data_se, and data_worst]

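Perimeter, radius, area, concavity, and compactness may follow an exponential distribution; texture, smoothness, and symmetry may be Gaussian or nearly Gaussian.

The central limit theorem tells us that as the sample size tends to infinity, the distribution of the sample mean approaches a normal distribution. The variables themselves, however, need not be normally distributed, so for tests and estimated models that assume normality the variables should first be transformed.

In addition, a transformation shrinks the gap between extreme values and typical values, which reduces the influence of extremes on the model.

A feature with an exponential-like, right-skewed distribution can look much closer to Gaussian after a logarithmic (log()) transform: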

# transform exponential distribution to Gaussian univariate distribution
data_mean['area_mean'].plot(kind = "hist", figsize=(8,6))

np.log1p(data_mean['area_mean']).plot(kind = "hist", figsize=(8,6))

np.log10(data_mean['area_mean']).plot(kind = "hist", figsize=(8,6))

[Figures: histograms of area_mean, np.log1p(area_mean), and np.log10(area_mean)]
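
The effect of the log transform can also be expressed as a number rather than read off a histogram. The sketch below measures skewness before and after `log1p` on synthetic right-skewed data; the log-normal sample is an assumption standing in for a feature such as area_mean, and the `skewness` helper is written here for illustration.

```python
import math
import random

def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n

random.seed(0)
# Synthetic right-skewed data standing in for a feature such as area_mean.
raw = [random.lognormvariate(0.0, 1.0) for _ in range(2000)]
logged = [math.log1p(v) for v in raw]

skew_raw = skewness(raw)
skew_logged = skewness(logged)
print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_logged:.2f}")
```

On the real data, the same comparison could be made with the `skew()` method of a pandas Series, applied to the column before and after `np.log1p`.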

Box plots show the distribution of the data and its outliers

# box and whisker plots
# Keep the returned axes in their own name so the pyplot module `plt` is not overwritten
axes=data_mean.plot(kind= 'box' , subplots=True, layout=(4,4), 
                   sharex=False, sharey=False,fontsize=12)

axes=data_se.plot(kind= 'box' , subplots=True, layout=(4,4), 
                 sharex=False, sharey=False,fontsize=12)

axes=data_worst.plot(kind= 'box' , subplots=True, layout=(4,4), 
                 sharex=False, sharey=False,fontsize=12)

[Figures: box plots of data_mean, data_se, and data_worst]

2.3 Multimodal Data Visualizations

Scatter plots
Correlation matrix

# plot correlation matrix
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

plt.style.use('fivethirtyeight')
sns.set_style("white")
# Compute the correlation matrix
corr = data_mean.corr()
# Generate a mask for the upper triangle (np.bool was removed from NumPy; use the builtin bool)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure (named fig so the DataFrame `data` is not overwritten)
fig, ax = plt.subplots(figsize=(8, 8))
plt.title('Breast Cancer Feature Correlation')
# Generate a custom diverging colormap
cmap = sns.diverging_palette(260, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, vmax=1.2, square=True, cmap=cmap, mask=mask, 
            ax=ax,annot=True, fmt='.2g',linewidths=2)

[Figure: correlation heat map of the mean features]

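We can see strong positive correlations (r between 0.75 and 1) among the mean-value features.
The mean area of the nuclei is strongly positively correlated with the mean radius and mean perimeter.
Some pairs are moderately positively correlated (r between 0.5 and 0.75), such as concavity and area or concavity and perimeter. We also see some strong negative correlation between fractal_dimension and the mean radius, texture, and perimeter.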
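
The near-perfect correlation between radius, perimeter, and area is largely geometric. As a rough illustration, the sketch below computes Pearson's r for idealized circular nuclei; the circular shape and the `pearson_r` helper are assumptions made for this example, not part of the notebooks.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Idealized circular nuclei (an assumption for illustration only).
radii = [float(r) for r in range(1, 11)]
perimeters = [2 * math.pi * r for r in radii]
areas = [math.pi * r ** 2 for r in radii]

r_perimeter = pearson_r(radii, perimeters)
r_area = pearson_r(radii, areas)
print(f"radius vs perimeter: {r_perimeter:.3f}")
print(f"radius vs area:      {r_area:.3f}")
```

Perimeter is a linear function of radius, so r is 1 up to rounding; area grows with the square of the radius, so r stays very high but falls slightly below 1. Features this redundant are one motivation for the PCA step in Notebook 03.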

data = pd.read_csv("data/data.csv")
g = sns.PairGrid(data[data.columns.tolist()[1:12]],
                 hue ='diagnosis')
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter, s = 3)

[Figure: PairGrid of the mean features, colored by diagnosis]

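Most features separate malignant from benign tumors quite well.

Summary:

The mean values of cell radius, perimeter, area, compactness, concavity, and concave points can be used to classify the cancer. Larger values of these parameters tend to correlate with malignant tumors.
The mean values of texture, smoothness, symmetry, and fractal dimension do not show a clear preference for either diagnosis.
None of the histograms shows obvious outliers that need further cleaning.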

Notebook 03: Preprocessing and feature engineering

Loading the data

%matplotlib inline
import matplotlib.pyplot as plt
#Load libraries for data processing
import pandas as pd 
import numpy as np
from scipy.stats import norm
# visualization
import seaborn as sns 
plt.style.use('fivethirtyeight')
sns.set_style("white")
plt.rcParams['figure.figsize'] = (8,4) 
#plt.rcParams['axes.titlesize'] = 'large'
data = pd.read_csv('data/data.csv', index_col=False)

Splitting into training and test sets

#Assign predictors to a variable of ndarray (matrix) type
X = data.iloc[:,2:32]
y = data.iloc[:,1].apply(lambda x: 1 if x == "M" else 0)

from sklearn.model_selection import train_test_split
##Split data set in train 75% and test 25% (test_size=0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((426, 30), (426,), (143, 30), (143,))

Standardizing the data

from sklearn.preprocessing import StandardScaler
# Standardize the data (center each feature around 0 and scale it to unit variance).
scaler =StandardScaler()
Xs = scaler.fit_transform(X)
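
The snippet above fits the scaler on all of X, so the test rows contribute to the mean and standard deviation. A commonly recommended alternative is to learn the statistics from the training part only and reuse them for the test part. The pure-Python sketch below shows that pattern; the values and the `fit_scaler` and `transform` helper names are hypothetical.

```python
import math

def fit_scaler(column):
    """Return the mean and (population) standard deviation of a training column."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return mean, std

def transform(column, mean, std):
    """Apply z-score scaling with statistics learned elsewhere."""
    return [(v - mean) / std for v in column]

# Hypothetical area-like values, split by hand into train and test parts.
train = [420.3, 551.1, 654.9, 782.7, 1084.0]
test = [515.3, 880.6]

mean, std = fit_scaler(train)          # statistics come from the training part only
train_z = transform(train, mean, std)
test_z = transform(test, mean, std)    # the test part reuses them unchanged

print([round(v, 2) for v in train_z])
print([round(v, 2) for v in test_z])
```

With scikit-learn the same idea is usually written as `scaler.fit_transform(X_train)` followed by `scaler.transform(X_test)`, or by placing the scaler inside a pipeline.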

PCA dimensionality reduction

from sklearn.decomposition import PCA
# Reduce from 30 dimensions to 10
pca = PCA(n_components=10)
fit = pca.fit(Xs)

X_pca = pca.transform(Xs)

Plot the first two PCs to see how well the reduced features separate the classes

PCA_df = pd.DataFrame()
PCA_df['PCA_1'] = X_pca[:,0]
PCA_df['PCA_2'] = X_pca[:,1]
## Visualization
plt.figure(figsize=(8,6))
plt.plot(PCA_df['PCA_1'][data.diagnosis == 'M'],
         PCA_df['PCA_2'][data.diagnosis == 'M'],
         'o', alpha = 0.7, color = 'r')
plt.plot(PCA_df['PCA_1'][data.diagnosis == 'B'],
         PCA_df['PCA_2'][data.diagnosis == 'B'],
         'o', alpha = 0.7, color = 'b')

plt.xlabel('PCA_1')
plt.ylabel('PCA_2')
plt.legend(['Malignant','Benign'])
plt.show()

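[Figure: scatter plot of the first two principal components]

Use the elbow of the scree plot to decide how many principal components to keep for modeling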

#The amount of variance that each PC explains
var = pca.explained_variance_ratio_
### Use the elbow to decide how many PCs to keep
plt.plot(var)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Explained variance ratio')

leg = plt.legend(['Explained variance ratio from PCA'], 
                 loc='best', 
                 borderpad=0.3,
                 shadow=False,
                 markerscale=0.4)

leg.get_frame().set_alpha(0.4)
leg.set_draggable(True)  # Legend.draggable() was removed in newer matplotlib
plt.show()

[Figure: elbow (scree) plot]
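
Reading an elbow by eye is subjective, so a cumulative-variance threshold is a common complement. The sketch below picks the smallest number of components that reaches a target share of the variance. The ratios are made up for illustration and are not the values from the notebook; `components_for` is a hypothetical helper.

```python
def components_for(ratios, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    total = 0.0
    for k, ratio in enumerate(ratios, start=1):
        total += ratio
        if total >= threshold:
            return k
    return len(ratios)

# Made-up explained-variance ratios for 10 components (NOT the notebook's values).
ratios = [0.44, 0.19, 0.09, 0.07, 0.05, 0.04, 0.03, 0.02, 0.01, 0.01]

k80 = components_for(ratios, 0.80)
k90 = components_for(ratios, 0.90)
print(f"components for 80%: {k80}, for 90%: {k90}")
```

With the fitted `pca` object from the article, the same cumulative sums could be obtained from `np.cumsum(pca.explained_variance_ratio_)`.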

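Notebook 04: Modeling with SVM

The support vector machine (SVM) learning algorithm will be used to build the predictive model. SVM is one of the most popular classification algorithms and has an elegant way of transforming nonlinear data so that a linear algorithm can fit a linear model to it (Cortes and Vapnik 1995).

The kernel functions of support vector machines are very powerful and let the model perform well on a wide variety of datasets.

SVMs allow complex decision boundaries even when the data has only a few features.
They work well on both low-dimensional and high-dimensional data (few or many features) but do not scale well with the number of samples.

Running an SVM on data with up to 10,000 samples may work well, but datasets of 100,000 samples or more can be challenging in terms of runtime and memory usage.

SVMs require careful data preprocessing and parameter tuning. This is why many people today turn to tree-based models such as random forests or gradient boosted trees in many applications, since these need little or no preprocessing.
SVM models are hard to inspect: it can be difficult to understand why a particular prediction was made, so interpretability may be poor.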

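4.1 Important SVM parameters

The important parameters in SVM are:

Regularization coefficient: C
Choice of kernel: linear (linear), radial basis function (rbf), or polynomial (poly)
Kernel-specific parameter: gamma for the RBF kernel
gamma and C both control the complexity of the model, and larger values of either produce a more complex model. Good settings for the two parameters are therefore usually strongly correlated, and C and gamma should be tuned together.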
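
To see why a larger gamma makes the model more complex, it helps to look at the RBF kernel itself, k(x, z) = exp(-gamma * ||x - z||^2). The sketch below evaluates it for one pair of points at several gamma values; the points and the `rbf_kernel` helper are chosen only for illustration.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x = [0.0, 0.0]
z = [1.0, 1.0]          # squared distance to x is 2

for gamma in (0.01, 0.1, 1.0, 10.0):
    print(f"gamma={gamma:<5} similarity={rbf_kernel(x, z, gamma):.4f}")
```

As gamma grows, the similarity between the two points drops towards zero, so each training sample influences only its immediate neighborhood and the decision boundary can bend more sharply around individual points.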

4.2 Data preparation

Loading the modules and the dataset

# load package
%matplotlib inline
import matplotlib.pyplot as plt
#Load libraries for data processing
import pandas as pd 
import numpy as np
from scipy.stats import norm
## Supervised learning.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix
from sklearn import metrics, preprocessing
from sklearn.metrics import classification_report
# visualization
import seaborn as sns 
plt.style.use('fivethirtyeight')
sns.set_style("white")
plt.rcParams['figure.figsize'] = (8,4) 
# load dataset
data = pd.read_csv('data/data.csv')

Data preprocessing

# split features and label
X = data.iloc[:,2:32] # features
y = data.iloc[:,1] # label
# transform the class labels from their original string representation (M and B) into integers
le = LabelEncoder()
y = le.fit_transform(y)
# Standardize the data (center each feature around 0 and scale it to unit variance).
scaler =StandardScaler()
Xs = scaler.fit_transform(X)

4.2 Cross-validation

Training set : test set = 7 : 3

# 5. Divide records in training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(Xs, y, stratify=y,
                                                    test_size=0.3, 
                                                    random_state=33)

# 6. Create an SVM classifier and train it on 70% of the data set.
clf = SVC(probability=True)
clf.fit(X_train, y_train)
 #7. Analyze accuracy of predictions on 30% of the holdout test sample.
classifier_score = clf.score(X_test, y_test)*100
print ('The classifier accuracy score is {:03.2f}% \n'.format(classifier_score))

The classifier accuracy score is 96.49%

Cross-validation

n_folds = 5
cv_error = np.average(cross_val_score(SVC(), Xs, y, cv=n_folds)) * 100
print('The {}-fold cross-validation accuracy score for this classifier is {:.2f} % \n'.format(n_folds, cv_error))

The 5-fold cross-validation accuracy score for this classifier is 97.36 %

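The cross-validation score is slightly higher than the score from the single random split, which shows that the choice of data still matters to the model: a single split depends on which samples happen to land in the test set, while cross-validation averages over several splits.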
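
The mechanics of k-fold splitting can be sketched without any library. The helper below, `kfold_indices`, is a hypothetical and simplified stand-in: it produces contiguous, unshuffled folds, whereas scikit-learn's `cross_val_score` uses stratified folds for classifiers, which this sketch does not reproduce.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(569, 5))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)}")
```

Every sample lands in a test fold exactly once, so the averaged score uses all 569 samples for evaluation instead of a single 30% holdout.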

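SVM pipeline
Since we already know that the first 3 PCs predict malignancy well, feature selection and the model can be chained into a pipeline, which makes training and prediction more convenient. Note that the code below keeps the 3 best original features with SelectKBest rather than the first 3 PCs.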

from sklearn.feature_selection import SelectKBest, f_regression
# clf2 is a pipeline: keep the 3 best-scoring features, then fit an SVC
clf2 = make_pipeline(SelectKBest(f_regression, k=3),
                     SVC(probability=True))

scores = cross_val_score(clf2, Xs, y, cv=3)

# Get average of 3-fold cross-validation score using an SVC estimator.
# Note: the next lines score a plain SVC(), not the clf2 pipeline;
# np.average(scores) would report the pipeline's own score.
n_folds = 3
cv_error = np.average(cross_val_score(SVC(), Xs, y, cv=n_folds)) * 100
print('The {}-fold cross-validation accuracy score for this classifier is {:.2f} %\n'.format(n_folds, cv_error))

The 3-fold cross-validation accuracy score for this classifier is 97.36 %

4.3 Model evaluation

Accuracy: Overall, how often is the classifier correct?
- Accuracy = (TP+TN)/total
Misclassification Rate: Overall, how often is it wrong?
- Error Rate = (FP+FN)/total
True Positive Rate: When it's actually 1, how often does it predict 1?
- TPR = TP/actual yes, also known as "Sensitivity" or "Recall"
False Positive Rate: When it's actually 0, how often does it predict 1?
- FPR = FP/actual no
Specificity: When it's actually 0, how often does it predict 0? Also known as the true negative rate
- Specificity = TN/actual no = 1 - FPR
Precision: When it predicts 1, how often is it correct?
- Precision = TP/predicted yes
Prevalence: How often does the yes condition actually occur in our sample?
- Prevalence = actual yes/total
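
The definitions above translate directly into code. The following sketch computes them from the four confusion-matrix counts; the counts are hypothetical and are not results from the notebook, and `confusion_metrics` is a helper written for this example.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute the rates listed above from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "tpr": tp / (tp + fn),            # sensitivity / recall
        "fpr": fp / (fp + tn),
        "specificity": tn / (tn + fp),    # equals 1 - fpr
        "precision": tp / (tp + fp),
        "prevalence": (tp + fn) / total,
    }

# Hypothetical counts, not results from the notebook.
m = confusion_metrics(tp=58, tn=105, fp=2, fn=6)
for name, value in m.items():
    print(f"{name:12s} {value:.3f}")
```

For a cancer screen, recall (TPR) usually deserves the most attention, since a false negative means a missed malignant tumor.
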
ROC curve

def ROC_plot(y_true, y_score):
    # y_true: true 0/1 labels; y_score: predicted probability of class 1
    from sklearn.metrics import roc_curve, auc
    plt.figure(figsize=(10,8))
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=2, label='ROC fold (area = %0.2f)' % (roc_auc))
    plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Random')
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc='lower right')
    plt.gca().set_aspect(1)  # plt.axes() would create a new, empty axes in recent matplotlib

probas_ = clf.predict_proba(X_test)
# Pass the test labels and the probability of the positive class (column 1)
ROC_plot(y_test, probas_[:, 1])

[Figure: ROC curve]
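
The area under the ROC curve has a simple interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes AUC directly from that definition on a tiny hand-made example; the scores are hypothetical and `auc_by_ranking` is a helper written for illustration.

```python
def auc_by_ranking(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Tiny hand-made example (hypothetical scores).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc_value = auc_by_ranking(y_true, y_score)
print(auc_value)
```

Three of the four positive/negative pairs are ordered correctly here, giving 0.75. The pairwise loop is quadratic, so it is meant for understanding rather than for large test sets.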

Notebook 05: SVM parameter tuning

As before: read the data, split features and target, and standardize the data

data = pd.read_csv('data/data.csv', index_col=False)

X = data.iloc[:,2:32] # features
y = data.iloc[:,1] # label
# transform the class labels from their original string representation (M and B) into integers
le = LabelEncoder()
y = le.fit_transform(y)
# Standardize the data (center each feature around 0 and scale it to unit variance).
scaler =StandardScaler()
Xs = scaler.fit_transform(X)

Here the author uses PCA for dimensionality reduction.
The original data has 30 features; the first 10 principal components are kept.

from sklearn.decomposition import PCA
# feature extraction
pca = PCA(n_components=10)
fit = pca.fit(Xs)
X_pca = pca.transform(Xs)

Splitting into training and test sets, training the model, and evaluating it

# Divide records in training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, 
                                                    test_size=0.3, 
                                                    random_state=2, 
                                                    stratify=y)

# Create an SVM classifier and train it on 70% of the data set.
clf = SVC(probability=True)
clf.fit(X_train, y_train)
# The classifier is already fitted, so predict directly instead of fitting a second time
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred ))

[Figure: classification_report output]

Visualizing the confusion matrix of the test-set predictions

## plot confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(cm, cmap=plt.cm.Reds, alpha=0.3)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(x=j, y=i,
                s=cm[i, j], 
                va='center', ha='center')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()

[Figure: confusion matrix]

Choosing the parameter combination with GridSearchCV

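# GridSearchCV is not imported earlier in the article, so import it here
from sklearn.model_selection import GridSearchCV
# Train classifiers.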
kernel_values = ['linear','rbf']
param_grid = {'C': np.logspace(-3, 1, 100),
              'gamma': np.logspace(-3, 2, 100),
              'kernel': kernel_values}

grid = GridSearchCV(SVC(), scoring="roc_auc",
                    param_grid=param_grid, 
                    cv=5)
grid.fit(X_train, y_train)

print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))

Best parameter combination

The best parameters are {'C': 10.0, 'gamma': 0.01830738280295368, 'kernel': 'rbf'} with a score of 1.00
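
It is worth estimating how much work this grid implies before running it. The sketch below counts the candidates in the grid above and compares it with a split grid in which gamma is only varied for the RBF kernel, since a linear kernel ignores gamma. The `logspace` helper is a pure-Python stand-in for `np.logspace`, and the split-grid layout is a suggestion rather than what the notebook does.

```python
from itertools import product

def logspace(start_exp, stop_exp, num):
    """Pure-Python stand-in for np.logspace with base 10."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10 ** (start_exp + i * step) for i in range(num)]

C_values = logspace(-3, 1, 100)
gamma_values = logspace(-3, 2, 100)

# Single grid, as in the snippet above: every kernel is paired with every gamma.
full_grid = list(product(C_values, gamma_values, ["linear", "rbf"]))

# Split grid: gamma is only varied for the RBF kernel, since a linear kernel ignores it.
split_grid = ([(C, None, "linear") for C in C_values]
              + list(product(C_values, gamma_values, ["rbf"])))

n_folds = 5
print(f"full grid:  {len(full_grid)} candidates, {len(full_grid) * n_folds} fits")
print(f"split grid: {len(split_grid)} candidates, {len(split_grid) * n_folds} fits")
```

GridSearchCV accepts a list of dictionaries as `param_grid`, so the split grid could be expressed as one dictionary per kernel. A coarser grid, for example 10 values per parameter, would cut the cost much further.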

Evaluating model performance with the best parameters

clf = SVC(**grid.best_params_,
         probability=True,
         random_state=33)

y_pred = clf.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred ))

Visualizing different SVM kernels

def meshgrid(feat1,feat2):
    x_min, x_max = feat1.min() - 1, feat1.max() + 1
    y_min, y_max = feat2.min() - 1, feat2.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                    np.arange(y_min, y_max, 0.1))
    return xx, yy

Xtrain = X_train[:, :2]
xx,yy = meshgrid(Xtrain[:, 0],Xtrain[:, 1])

from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# title for the plots
titles = ['SVC with linear kernel',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel']

svm = SVC(kernel='linear',C=1,random_state=0).fit(Xtrain, y_train)
rbf_svc = SVC(kernel='rbf',gamma=0.7, C=1, random_state=0).fit(Xtrain, y_train)
poly_svc = SVC(kernel='poly',degree=3, C=1, random_state=0).fit(Xtrain, y_train)

for i, clf in enumerate((svm, rbf_svc, poly_svc)):
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.1, hspace=0.1)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
    # Plot also the training points
    plt.scatter(Xtrain[:, 0], Xtrain[:, 1], c=y_train, cmap=plt.cm.coolwarm)
    # X_train holds PCA components here, so the axes are the first two PCs
    plt.xlabel('PCA_1')
    plt.ylabel('PCA_2')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])

plt.show()

[Figure: decision boundaries of the three SVM kernels]

Notebook 06: Comparing different models

def bxplots(results,names):
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(111)
    plt.boxplot(results)
    ax.set_xticklabels(names)
    plt.show()

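from sklearn.model_selection import KFold, cross_val_score

def piplinecompare(models, X_train, y_train):
    results = []
    names = []
    for name, model in models:
        # Current scikit-learn API: n_splits replaces n/n_folds, and random_state requires shuffle=True
        kfold = KFold(n_splits=10, shuffle=True, random_state=7)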
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='roc_auc')
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
    return results,names

Comparing the performance of different models (raw data vs. standardized data)

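# Imports for the estimators compared below
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

models = []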
models.append(('LR', LogisticRegression()))
models.append(('LDA',LinearDiscriminantAnalysis()))
models.append(('KNN',KNeighborsClassifier()))
models.append(('CART',DecisionTreeClassifier()))
models.append(('NB',GaussianNB()))
models.append(('SVM',SVC()))

# Standardize the dataset
pipelines = []
pipelines.append(( 'ScaledLR' , Pipeline([( 'Scaler' , StandardScaler()),( 'LR' ,
    LogisticRegression())])))
pipelines.append(( 'ScaledLDA' , Pipeline([( 'Scaler' , StandardScaler()),( 'LDA' ,
    LinearDiscriminantAnalysis())])))
pipelines.append(( 'ScaledKNN' , Pipeline([( 'Scaler' , StandardScaler()),( 'KNN' ,
    KNeighborsClassifier())])))
pipelines.append(( 'ScaledCART' , Pipeline([( 'Scaler' , StandardScaler()),( 'CART' ,
    DecisionTreeClassifier())])))
pipelines.append(( 'ScaledNB' , Pipeline([( 'Scaler' , StandardScaler()),( 'NB' ,
    GaussianNB())])))
pipelines.append(( 'ScaledSVM' , Pipeline([( 'Scaler' , StandardScaler()),( 'SVM' , SVC())])))


results,names = piplinecompare(models, X_train, y_train)
bxplots(results,names)

results1,names1 = piplinecompare(pipelines, X_train, y_train)
bxplots(results1,names1)

[Figure: algorithm comparison on raw data]

[Figure: algorithm comparison on standardized data]

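Standardization has no effect on the tree model.
It has a slight effect on LDA and NB.
LR, KNN, and SVM need appropriate standardization before modeling, because it has a large effect on model training.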
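
A small distance calculation shows why KNN and SVM are so sensitive to scale. In the sketch below, the two sample points and the query are hypothetical; the two scale factors are the standard deviations of area_mean and smoothness_mean from the describe() table earlier in the article. Dividing by the standard deviation gives the same distances as full z-scoring, because the centering term cancels in differences.

```python
import math

def nearest(query, points, scales):
    """Name of the point closest to query after dividing each feature by its scale."""
    def dist(p):
        return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(query, p, scales)))
    return min(points, key=lambda name: dist(points[name]))

# Hypothetical samples as (area_mean, smoothness_mean).
points = {"A": (500.0, 0.05), "B": (520.0, 0.15)}
query = (505.0, 0.15)

raw_scales = (1.0, 1.0)
# Standard deviations taken from the describe() table earlier in the article.
std_scales = (351.914129, 0.014064)

print("nearest on raw features:   ", nearest(query, points, raw_scales))
print("nearest on scaled features:", nearest(query, points, std_scales))
```

On raw features the area difference dominates and A is nearest, even though the query matches B exactly in smoothness. After scaling, the smoothness difference carries weight and B becomes the nearest neighbor. Tree models split on one feature at a time, which is why they are unaffected.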

Learning from GitHub code: breast cancer classification and prediction
