Project Purpose
- Cirrhosis is scarring of the liver (fibrosis) caused by long-term liver damage
- Cirrhosis can eventually lead to liver failure, which is life-threatening
- Treatment may stop cirrhosis from getting worse; this project aims to predict cirrhosis from the available structured clinical data
Data Description
There are 418 patients in total: 312 with complete records and 106 with only some basic measurements. The dataset provides the following attributes:
- ID: unique identifier
- N_Days: number of days between registration and the earlier of death, transplantation, or study analysis time
- Status: status of the patient: C (censored), CL (censored due to liver transplant), or D (death)
- Drug: D-penicillamine or placebo
- Age: age in [days]
- Sex: M (male) or F (female)
- Ascites: presence of ascites, N (No) or Y (Yes)
- Hepatomegaly: presence of hepatomegaly, N (No) or Y (Yes)
- Spiders: presence of spider angiomas, N (No) or Y (Yes)
- Edema: N (no edema and no diuretic therapy for edema), S (edema present without diuretics, or edema resolved by diuretics), or Y (edema despite diuretic therapy)
- Bilirubin: serum bilirubin [mg/dl]
- Cholesterol: serum cholesterol [mg/dl]
- Albumin: albumin [g/dl]
- Copper: urine copper [µg/day]
- Alk_Phos: alkaline phosphatase [U/l]
- SGOT: SGOT [U/ml]
- Tryglicerides: triglycerides [mg/dl]
- Platelets: platelets per cubic [ml/1000]
- Prothrombin: prothrombin time in seconds [s]
- Stage: histologic stage of disease (1, 2, 3, or 4)
The four stages of liver disease: healthy, fatty liver, fibrosis, cirrhosis
Our goal: binary classification, predicting cirrhosis (1) vs. no cirrhosis (0)
Data Preprocessing
- Drop the "ID", "N_Days", "Hepatomegaly", and "Status" columns
- Replace missing values of numerical features with the median
- Replace missing values of categorical features with the most frequent or most plausible category
- Label-encode the categorical variables
- Binarize the target
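The post shows no imports or data loading; a minimal setup sketch covering the snippets below (the CSV filename cirrhosis.csv is an assumption, adjust it to your copy of the dataset):
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import svm
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

# Load the dataset (path assumed)
cirrhosis = pd.read_csv('cirrhosis.csv')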
# Drop ID as we will not need it for prediction
cirrhosis.drop('ID', axis=1, inplace=True)
# Drop N_Days and Status to prevent data leakage
cirrhosis.drop('N_Days', axis=1, inplace=True)
cirrhosis.drop('Status', axis=1, inplace=True)
Exploratory Data Analysis
Missing values
def get_percentage_missing(series):
    """Calculate the fraction of NaN values per column.
    :param series: pandas Series or DataFrame
    :return: float (or Series of floats), rounded to 2 decimals
    """
    num = series.isnull().sum()
    den = len(series)
    return round(num / den, 2)
# Only include columns that contain any NaN values
df_with_any_null_values = cirrhosis[cirrhosis.columns[cirrhosis.isnull().any()].tolist()]
get_percentage_missing(df_with_any_null_values)
The first 312 cases in the dataset participated in the randomized trial and have mostly complete data. A further 112 cases did not take part in the clinical trial but consented to have basic measurements recorded and to be followed for survival. Six of those cases were lost to follow-up shortly after diagnosis, so the data here concern the remaining 106 cases as well as the 312 randomized participants.
# Drop rows where target value is missing
cirrhosis=cirrhosis.dropna(subset=['Stage'])
# save the names of the variables with missing values (one for object and one for float columns)
df_with_any_null_values = cirrhosis[cirrhosis.columns[cirrhosis.isnull().any()].tolist()]
obj_cols = list(df_with_any_null_values.select_dtypes(include='object'))
float_cols = list(df_with_any_null_values.select_dtypes(include='float64'))
obj_cols_with_nan = get_percentage_missing(df_with_any_null_values[obj_cols]).keys()
float_cols_with_nan = get_percentage_missing(df_with_any_null_values[float_cols]).keys()
# explore data in categorical columns with missing values to decide how to replace the missing values
for col in obj_cols_with_nan:
    val_counts = cirrhosis[col].value_counts().to_dict()
    val_counts['NaN'] = cirrhosis[col].isna().sum()
    print('Value count for', col, ':', val_counts)
- Drug: the data were collected in a study of the drug D-penicillamine. The 100 missing values come from patients who did not take part in the study but agreed to have their data recorded. The code below fills these with 'Placebo', effectively treating "no drug" like the untreated arm.
- Ascites: ascites is a condition closely linked to impaired liver function, most often caused by cirrhosis. Since the proportion of Y to N is small, missing values are replaced with 'N'.
- Hepatomegaly: an enlarged liver is a non-specific medical sign with many possible causes, broadly grouped into infection, liver tumors, and metabolic disorders. It is a rather ambiguous condition here, with Y and N in nearly equal proportion, so we decided to drop this predictor.
- Spiders: spider angiomas are swollen, spider-like blood vessels on the skin. They are common and usually benign, appearing in roughly 10-15% of healthy adults and young children. More than three spider angiomas, however, is likely abnormal and may be a sign of liver disease and/or hepatitis C. The ratio of Y to N leans toward N, though not strongly. Since spider angiomas are common and, unless numerous, not a reliable indicator of liver disease, missing values are filled with 'N'.
cirrhosis['Drug'] = cirrhosis['Drug'].fillna('Placebo')
cirrhosis['Ascites'] = cirrhosis['Ascites'].fillna('N')
cirrhosis.drop('Hepatomegaly', axis=1, inplace=True)
cirrhosis['Spiders'] = cirrhosis['Spiders'].fillna('N')
# Missing float values are all filled with the median
cirrhosis['Cholesterol'] = cirrhosis['Cholesterol'].fillna(cirrhosis['Cholesterol'].median())
cirrhosis['Copper'] = cirrhosis['Copper'].fillna(cirrhosis['Copper'].median())
cirrhosis['Alk_Phos'] = cirrhosis['Alk_Phos'].fillna(cirrhosis['Alk_Phos'].median())
cirrhosis['SGOT'] = cirrhosis['SGOT'].fillna(cirrhosis['SGOT'].median())
cirrhosis['Tryglicerides'] = cirrhosis['Tryglicerides'].fillna(cirrhosis['Tryglicerides'].median())
cirrhosis['Platelets'] = cirrhosis['Platelets'].fillna(cirrhosis['Platelets'].median())
cirrhosis['Prothrombin'] = cirrhosis['Prothrombin'].fillna(cirrhosis['Prothrombin'].median())
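sklearn's SimpleImputer does the same median imputation in one step; an equivalent sketch:
from sklearn.impute import SimpleImputer

# Median-impute all numerical columns that still contain NaNs
num_cols = ['Cholesterol', 'Copper', 'Alk_Phos', 'SGOT', 'Tryglicerides', 'Platelets', 'Prothrombin']
imputer = SimpleImputer(strategy='median')
cirrhosis[num_cols] = imputer.fit_transform(cirrhosis[num_cols])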
Class balance (positive vs. negative samples)
# Proportion of cirrhosis/no cirrhosis
round(len(cirrhosis[cirrhosis['Stage']==4])/len(cirrhosis),2)
# Output: 0.35
Outlier detection
# Outlier detection
float_cols = list(cirrhosis.select_dtypes(include='float64'))
for col in float_cols:
    plt.figure()
    sns.boxplot(cirrhosis[col])
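Beyond eyeballing the boxplots, the usual 1.5×IQR rule can count potential outliers per feature; a quick sketch (the threshold is the common convention, not from the original post):
# Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numerical feature
for col in float_cols:
    q1, q3 = cirrhosis[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (cirrhosis[col] < q1 - 1.5 * iqr) | (cirrhosis[col] > q3 + 1.5 * iqr)
    print(col, '- potential outliers:', int(outside.sum()))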
Prepared Data
# Binary Classification
cirrhosis['Stage'] = np.where(cirrhosis['Stage'] == 4, 1, 0)
# Check that all missing values are gone
df_with_any_null_values = cirrhosis[cirrhosis.columns[cirrhosis.isnull().any()].tolist()]
get_percentage_missing(df_with_any_null_values)
# convert data to be usable in sklearn classifiers
le = LabelEncoder()
obj_cols = cirrhosis.select_dtypes(include='object').columns
for col in obj_cols:
    cirrhosis[col] = le.fit_transform(cirrhosis[col])
cirrhosis_y = cirrhosis.iloc[:,-1].values
cirrhosis_X = cirrhosis.iloc[:,:-1]
cirrhosis_X = cirrhosis_X.values
# Create dictionary to compare accuracies and computation time between classifiers
comparison_dict = {}
# Folds for Cross-validation
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
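The same 5-fold routine (fit a fresh model per fold, record train/test accuracy and per-fold time) recurs verbatim for every classifier below. A helper capturing that pattern (a refactoring sketch, not part of the original notebook):
def evaluate_cv(make_clf, X, y, kfold):
    """Fit a fresh classifier per fold; return mean train/test accuracy and seconds per fold."""
    acc_train, acc_test = [], []
    tik = time.time()
    for train_ids, test_ids in kfold.split(X, y):
        clf = make_clf()
        clf.fit(X[train_ids], y[train_ids])
        acc_train.append(clf.score(X[train_ids], y[train_ids]))
        acc_test.append(clf.score(X[test_ids], y[test_ids]))
    tok = time.time()
    return round(np.mean(acc_train), 2), round(np.mean(acc_test), 2), round((tok - tik) / kfold.n_splits, 2)

# Example use (parameters assumed), producing comparison-ready numbers:
# train_acc, test_acc, sec_per_fold = evaluate_cv(lambda: svm.SVC(C=1.0, kernel='rbf'), cirrhosis_X, cirrhosis_y, kfold)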
First Attempts
- Every model is evaluated with 5-fold cross-validation, comparing mean accuracy and the computation time of the 5-fold run
- Grid search is used to tune every model
- First simple model on the preprocessed data: a support vector classifier
- Average test accuracy over the 5 folds: 0.68
- Computation time per fold: 0.28 seconds
C_list = np.logspace(-5,3, num=10, base=10)
kernel_list = ['poly', 'rbf', 'sigmoid']
param_grid = dict(C=C_list, kernel=kernel_list)
clf = GridSearchCV(svm.SVC(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X, cirrhosis_y)
print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X, cirrhosis_y),2))
acc_train = []
acc_test = []
tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    clf_final = svm.SVC(C=clf.best_params_['C'], kernel=clf.best_params_['kernel'])
    clf_final.fit(X_train, Y_train)
    print('Train accuracy:', round(clf_final.score(X_train, Y_train), 2))
    print('Test accuracy:', round(clf_final.score(X_test, Y_test), 2))
    acc_train.append(clf_final.score(X_train, Y_train))
    acc_test.append(clf_final.score(X_test, Y_test))
tok = time.time()
final_acc_test = round(np.mean(acc_test), 2)
final_acc_train = round(np.mean(acc_train), 2)
print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok - tik, 2))
comparison_dict['SVC'] = [final_acc_test, round((tok-tik)/kfold.n_splits,2)]
Dimensionality Reduction
PCA
pca = PCA(n_components=3)
pca.fit(cirrhosis_X)
cirrhosis_X_pca = pca.transform(cirrhosis_X)
plt.figure(figsize=(25,8))
sns.scatterplot(x=cirrhosis_X_pca[:,0], y=cirrhosis_X_pca[:, 1], hue=cirrhosis_y, palette='terrain')
plt.title('Principal Components vs Class distribution', fontsize=16)
plt.ylabel('Principal Component 2', fontsize=16)
plt.xlabel('Principal Component 1', fontsize=16)
_=plt.xticks(rotation='vertical')
Kernel PCA
In general, Principal Component Analysis (PCA) performs linear dimensionality reduction. Kernel PCA (KPCA) extends it to nonlinear dimensionality reduction, for handling datasets that are not linearly separable.
The rough idea of KPCA: for a data matrix X in the input space, first map all samples in X with a nonlinear map Φ into a high-dimensional (possibly infinite-dimensional) feature space, where they may become linearly separable, and then perform ordinary PCA in that feature space. See the appendix for details.
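In symbols (a standard textbook sketch; the appendix mentioned above is not reproduced here): with kernel $k$ and implicit map $\Phi$,

$$K_{ij} = k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle, \qquad \tilde{K} = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n,$$

where $\mathbf{1}_n$ is the $n \times n$ matrix with every entry $1/n$. Solving the eigenproblem $\tilde{K} \alpha^{(m)} = n \lambda_m \alpha^{(m)}$ gives the $m$-th principal component of a point $x$ as

$$z_m(x) = \sum_{i=1}^{n} \alpha_i^{(m)} k(x_i, x),$$

so $\Phi$ is never evaluated explicitly, only the kernel.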
gamma_list = np.logspace(-5, 5, num=5, base=10)
alpha_list = np.logspace(-5, 2, num=5, base=10)
kernel_list = ['linear', 'rbf', 'cosine']
for gamma in gamma_list:
    for alpha in alpha_list:
        for kernel in kernel_list:
            kernel_pca = KernelPCA(
                n_components=2, kernel=kernel, gamma=gamma, fit_inverse_transform=True, alpha=alpha
            )
            kernel_pca.fit(cirrhosis_X)
            cirrhosis_X_kpca = kernel_pca.transform(cirrhosis_X)
            fig = plt.figure()
            plt.scatter(cirrhosis_X_kpca[:, 0], cirrhosis_X_kpca[:, 1], marker='o', c=cirrhosis_y, s=25, edgecolor='k')
            plt.title('2-component KPCA with {gamma}, {alpha}, {kernel}'.format(gamma=np.round(gamma, 4), alpha=np.round(alpha, 4), kernel=kernel))
            plt.xlabel('PC_1')
            plt.ylabel('PC_2')
            plt.show()
kernel_pca = KernelPCA(n_components=2, kernel='rbf', gamma=1, fit_inverse_transform=True, alpha=0.1)
kernel_pca.fit(cirrhosis_X)
cirrhosis_X_pca = kernel_pca.transform(cirrhosis_X)
plt.figure(figsize=(25,8))
sns.scatterplot(x=cirrhosis_X_pca[:,0], y=cirrhosis_X_pca[:, 1], hue=cirrhosis_y, palette='terrain')
plt.title('Principal Components vs Class distribution', fontsize=16)
plt.ylabel('Principal Component 2', fontsize=16)
plt.xlabel('Principal Component 1', fontsize=16)
_=plt.xticks(rotation='vertical')
From the results:
- No parameter combination makes the data separable
- Running SVM on top of PCA and KPCA produced the expected, unremarkable results:
- Test accuracy was 0.68 for both, with computation times of 0.06 s and 0.08 s respectively
- LDA and QDA performed better (0.71 each), but the data remain non-separable
PCA+SVM
C_list = np.logspace(-5, 3, num=10, base=10)
kernel_list = ['poly', 'rbf', 'sigmoid']
param_grid = dict(C=C_list, kernel=kernel_list)
pca = PCA(n_components=4)
cirrhosis_X_pca = pca.fit_transform(cirrhosis_X)
clf = GridSearchCV(svm.SVC(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X_pca, cirrhosis_y)
print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X_pca, cirrhosis_y),2))
acc_train = []
acc_test = []
tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    # Fit PCA on the training fold only, so no test information leaks in
    pca = PCA(n_components=4)
    pca.fit(X_train)
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    clf_final = svm.SVC(kernel=clf.best_params_['kernel'], C=clf.best_params_['C'])
    clf_final.fit(X_train_pca, Y_train)
    print('Train accuracy:', round(clf_final.score(X_train_pca, Y_train), 2))
    print('Test accuracy:', round(clf_final.score(X_test_pca, Y_test), 2))
    acc_train.append(clf_final.score(X_train_pca, Y_train))
    acc_test.append(clf_final.score(X_test_pca, Y_test))
tok = time.time()
final_acc_test = round(np.mean(acc_test), 2)
final_acc_train = round(np.mean(acc_train), 2)
print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok - tik, 2))
comparison_dict['PCA+SVC'] = [final_acc_test, round((tok-tik)/kfold.n_splits,2)]
Kernel PCA+SVM
C_list = np.logspace(-5, 3, num=10, base=10)
kernel_list = ['poly', 'rbf', 'sigmoid']
param_grid = dict(C=C_list, kernel=kernel_list)
kernel_pca = KernelPCA(n_components=4)
cirrhosis_X_pca = kernel_pca.fit_transform(cirrhosis_X)
clf = GridSearchCV(svm.SVC(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X_pca, cirrhosis_y)
print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X_pca, cirrhosis_y),2))
acc_train = []
acc_test = []
tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    kernel_pca = KernelPCA(n_components=4)
    kernel_pca.fit(X_train)
    X_train_pca = kernel_pca.transform(X_train)
    X_test_pca = kernel_pca.transform(X_test)
    clf_final = svm.SVC(kernel=clf.best_params_['kernel'], C=clf.best_params_['C'])
    clf_final.fit(X_train_pca, Y_train)
    print('Train accuracy:', round(clf_final.score(X_train_pca, Y_train), 2))
    print('Test accuracy:', round(clf_final.score(X_test_pca, Y_test), 2))
    acc_train.append(clf_final.score(X_train_pca, Y_train))
    acc_test.append(clf_final.score(X_test_pca, Y_test))
tok = time.time()
final_acc_test = round(np.mean(acc_test), 2)
final_acc_train = round(np.mean(acc_train), 2)
print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok - tik, 2))
comparison_dict['KPCA+SVC'] = [final_acc_test, round((tok-tik)/kfold.n_splits,2)]
LDA+SVM
C_list = np.logspace(-5, 3, num=10, base=10)
kernel_list = ['poly', 'rbf', 'sigmoid']
param_grid = dict(C=C_list, kernel=kernel_list)
lda = LinearDiscriminantAnalysis(n_components=1)
cirrhosis_X_lda=lda.fit_transform(cirrhosis_X, cirrhosis_y)
clf = GridSearchCV(svm.SVC(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X_lda, cirrhosis_y)
print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X_lda, cirrhosis_y),2))
acc_train = []
acc_test = []
tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    lda = LinearDiscriminantAnalysis(n_components=1)
    lda.fit(X_train, Y_train)
    X_lda_train = lda.transform(X_train)
    X_lda_test = lda.transform(X_test)
    clf_final = svm.SVC(kernel=clf.best_params_['kernel'], C=clf.best_params_['C'])
    clf_final.fit(X_lda_train, Y_train)
    print('Train accuracy:', round(clf_final.score(X_lda_train, Y_train), 2))
    print('Test accuracy:', round(clf_final.score(X_lda_test, Y_test), 2))
    acc_train.append(clf_final.score(X_lda_train, Y_train))
    acc_test.append(clf_final.score(X_lda_test, Y_test))
tok = time.time()
final_acc_test = round(np.mean(acc_test), 2)
final_acc_train = round(np.mean(acc_train), 2)
print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok - tik, 2))
comparison_dict['LDA+SVC'] = [final_acc_test, round((tok-tik)/kfold.n_splits,2)]
plt.figure()
# Plot the 1-D LDA projection of each class (against zero on the y-axis)
plt.scatter(X_lda_train[Y_train == 0, 0], np.zeros((Y_train == 0).sum()), label='no cirrhosis')
plt.scatter(X_lda_train[Y_train == 1, 0], np.zeros((Y_train == 1).sum()), label='cirrhosis')
plt.legend(loc="best", shadow=False, scatterpoints=1)
plt.title("LDA separation")
plt.show()
QDA
reg_list = np.logspace(-4, 1, num=5, base=10)
tol_list = np.logspace(-4, 1, num=5, base=10)
param_grid = dict(tol=tol_list, reg_param=reg_list)
clf = GridSearchCV(QuadraticDiscriminantAnalysis(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X, cirrhosis_y)
print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X, cirrhosis_y),2))
acc_train = []
acc_test = []
tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    clf_final = QuadraticDiscriminantAnalysis(tol=clf.best_params_['tol'], reg_param=clf.best_params_['reg_param'])
    clf_final.fit(X_train, Y_train)
    print('Train accuracy:', round(clf_final.score(X_train, Y_train), 2))
    print('Test accuracy:', round(clf_final.score(X_test, Y_test), 2))
    acc_train.append(clf_final.score(X_train, Y_train))
    acc_test.append(clf_final.score(X_test, Y_test))
tok = time.time()
final_acc_test = round(np.mean(acc_test), 2)
final_acc_train = round(np.mean(acc_train), 2)
print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok - tik, 2))
comparison_dict['QDA'] = [final_acc_test, round((tok-tik)/kfold.n_splits,2)]
Boosting
- AdaBoost: accuracy 0.72, computation time 0.12 seconds
- XGBoost: accuracy 0.76, computation time 0.24 seconds
n_estimator_list = [5, 10, 20, 50, 70, 100, 150, 200]
lr_list = [0.0001,0.001, 0.01, 0.1,0.5]
param_grid = dict(n_estimators=n_estimator_list, learning_rate=lr_list)
clf = GridSearchCV(AdaBoostClassifier(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X, cirrhosis_y)
print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X, cirrhosis_y),2))
acc_train = []
acc_test = []
tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    clf_final = AdaBoostClassifier(n_estimators=clf.best_params_['n_estimators'],
                                   learning_rate=clf.best_params_['learning_rate'])
    clf_final.fit(X_train, Y_train)
    print('Train accuracy:', round(clf_final.score(X_train, Y_train), 2))
    print('Test accuracy:', round(clf_final.score(X_test, Y_test), 2))
    acc_train.append(clf_final.score(X_train, Y_train))
    acc_test.append(clf_final.score(X_test, Y_test))
tok = time.time()
final_acc_test = round(np.mean(acc_test), 2)
final_acc_train = round(np.mean(acc_train), 2)
print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok - tik, 2))
comparison_dict['AdaBoost'] = [final_acc_test, round((tok-tik)/kfold.n_splits,2)]
max_depth_list = [2, 3, 4, 5, 10, 15]
lr_list = [0.0001,0.001, 0.01, 0.1,0.5]
param_grid = dict(max_depth=max_depth_list, learning_rate=lr_list)
clf = GridSearchCV( XGBClassifier(eval_metric='logloss'), param_grid=param_grid, n_jobs=-1,
refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X, cirrhosis_y)
print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X, cirrhosis_y),2))
acc_train = []
acc_test = []
tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    clf_final = XGBClassifier(max_depth=clf.best_params_['max_depth'],
                              learning_rate=clf.best_params_['learning_rate'],
                              eval_metric='logloss')
    clf_final.fit(X_train, Y_train)
    print('Train accuracy:', round(clf_final.score(X_train, Y_train), 2))
    print('Test accuracy:', round(clf_final.score(X_test, Y_test), 2))
    acc_train.append(clf_final.score(X_train, Y_train))
    acc_test.append(clf_final.score(X_test, Y_test))
tok = time.time()
final_acc_test = round(np.mean(acc_test), 2)
final_acc_train = round(np.mean(acc_train), 2)
print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok - tik, 2))
comparison_dict['XGBoost'] = [final_acc_test, round((tok-tik)/kfold.n_splits,2)]
Other Methods
KNN
How to choose the best value of k in KNN depends on the data.
In general, a larger k reduces the effect of noise on the classification but makes the boundaries between classes less distinct. A good k can be selected by various heuristic techniques (see hyperparameter optimization).
The accuracy of nearest-neighbor methods degrades severely when noisy or irrelevant features are present, or when the feature scales are not consistent with their importance. Much research has gone into selecting and scaling features to improve classification: one popular approach uses evolutionary algorithms to optimize feature scaling; another common one selects features by their mutual information with the training labels.
In a binary (two-class) problem, choosing an odd k helps avoid ties between the two classes. Bootstrapping is one way to pick a good empirical k here; a cross-validation sketch over odd k follows below.
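A minimal sketch of picking k by cross-validation over odd values only (the grid of odd k is an assumption; the notebook's own grid search below also includes even k):
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Odd k only, so the two classes can never tie in the vote
for k in [3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), cirrhosis_X, cirrhosis_y, cv=kfold)
    print('k =', k, '-> mean CV accuracy:', round(scores.mean(), 2))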
from sklearn.neighbors import KNeighborsClassifier
n_list = [2, 3, 4, 5, 6, 7, 8, 9, 10]
param_grid = dict(n_neighbors=n_list)
clf = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X, cirrhosis_y)
print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X, cirrhosis_y),2))
acc_train = []
acc_test = []
tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    clf_final = KNeighborsClassifier(n_neighbors=clf.best_params_['n_neighbors'])
    clf_final.fit(X_train, Y_train)
    print('Train accuracy:', round(clf_final.score(X_train, Y_train), 2))
    print('Test accuracy:', round(clf_final.score(X_test, Y_test), 2))
    acc_train.append(clf_final.score(X_train, Y_train))
    acc_test.append(clf_final.score(X_test, Y_test))
tok = time.time()
final_acc_test = round(np.mean(acc_test), 2)
final_acc_train = round(np.mean(acc_train), 2)
print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok - tik, 2))
comparison_dict['KNeighbors'] = [final_acc_test, round((tok-tik)/kfold.n_splits,2)]
Strengths and weaknesses of k-nearest neighbors: samples from different classes occupy different regions of feature space, and KNN classifies a point by the classes of its spatially nearest samples, so it is essentially a partitioning of the feature space.
- Strengths: high accuracy, insensitive to outliers, no assumptions about the input data.
- Weaknesses: high computational complexity, high memory complexity; also sensitive to feature scale (see the scaling sketch after this list).
- Applicable data types: numerical and nominal.
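Since KNN is scale-sensitive and the notebook never standardizes the features, a pipeline sketch that would address this (an addition for illustration, not one of the original experiments; n_neighbors=5 is an assumption):
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is fit inside each CV training fold, so test folds never influence it
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(scaled_knn, cirrhosis_X, cirrhosis_y, cv=kfold)
print('Scaled KNN mean CV accuracy:', round(scores.mean(), 2))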
Neural Network
from sklearn.neural_network import MLPClassifier
acc_train = []
acc_test = []
tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    clf = MLPClassifier(hidden_layer_sizes=(32, 32), activation="tanh", solver="sgd",
                        learning_rate='adaptive', learning_rate_init=0.001,
                        n_iter_no_change=10, max_iter=1000, alpha=0.0001)
    clf.fit(X_train, Y_train)
    print('Train accuracy:', round(clf.score(X_train, Y_train), 2))
    print('Test accuracy:', round(clf.score(X_test, Y_test), 2))
    acc_train.append(clf.score(X_train, Y_train))
    acc_test.append(clf.score(X_test, Y_test))
tok = time.time()
final_acc_test = round(np.mean(acc_test), 2)
final_acc_train = round(np.mean(acc_train), 2)
print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok - tik, 2))
comparison_dict['NN'] = [final_acc_test, round((tok-tik)/kfold.n_splits,2)]
Gaussian Naive Bayes
acc_train = []
acc_test = []
tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    clf = GaussianNB()
    clf.fit(X_train, Y_train)
    print('Train accuracy:', round(clf.score(X_train, Y_train), 2))
    print('Test accuracy:', round(clf.score(X_test, Y_test), 2))
    acc_train.append(clf.score(X_train, Y_train))
    acc_test.append(clf.score(X_test, Y_test))
tok = time.time()
final_acc_test = round(np.mean(acc_test), 2)
final_acc_train = round(np.mean(acc_train), 2)
print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok - tik, 2))
comparison_dict['Gaus_NaiveBayes'] = [final_acc_test, round((tok-tik)/kfold.n_splits,2)]
Random Forest
from sklearn.ensemble import RandomForestClassifier
max_depth_list = np.arange(start=2,stop=20,step=4)
n_estimators_list = np.arange(start=10,stop=500,step=90)
param_grid = dict(max_depth=max_depth_list,
n_estimators=n_estimators_list)
clf_rf = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf_rf.fit(cirrhosis_X, cirrhosis_y)
rf_params = clf_rf.best_params_
print('Best parameters:', clf_rf.best_params_)
print('Train accuracy:', round(clf_rf.score(cirrhosis_X, cirrhosis_y),2))
acc_train = []
acc_test = []
tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    clf_final = RandomForestClassifier(max_depth=rf_params['max_depth'],
                                       n_estimators=rf_params['n_estimators'])
    clf_final.fit(X_train, Y_train)
    print('Train accuracy:', round(clf_final.score(X_train, Y_train), 2))
    print('Test accuracy:', round(clf_final.score(X_test, Y_test), 2))
    acc_train.append(clf_final.score(X_train, Y_train))
    acc_test.append(clf_final.score(X_test, Y_test))
tok = time.time()
final_acc_test = round(np.mean(acc_test), 2)
final_acc_train = round(np.mean(acc_train), 2)
print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok - tik, 2))
comparison_dict['RF'] = [final_acc_test, round((tok-tik)/kfold.n_splits,2)]
Feature Selection
Feature importance from a random forest
How random-forest feature importance works
- Out-of-bag (OOB) error estimation: random forests draw bootstrap samples with replacement, which leaves out-of-bag data for each tree. The OOB error can therefore be used to assess feature importance, defined as the average drop in classification accuracy on the OOB data after slightly perturbing a feature's values, compared with the accuracy before the perturbation.
- Gini-based importance: the importance of a feature in a single tree is defined as the weighted impurity decrease at all non-leaf nodes that split on it; the larger the total decrease, the more important the feature.
How the random-forest (permutation) importance is computed
1. For each decision tree in the random forest, compute its out-of-bag error on the corresponding OOB data; denote it errOOB1.
2. Randomly add noise to feature X across all OOB samples (i.e., randomly permute the values of X) and recompute the OOB error; denote it errOOB2.
3. If the forest contains N trees, the importance of feature X is Σ(errOOB2 − errOOB1)/N. This expression is a valid importance measure because, if the OOB accuracy drops sharply after random noise is added to a feature, that feature strongly influences the classification result, i.e., it is highly important (see the sketch after this list).
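sklearn exposes the same permutation idea via permutation_importance (computed here on the full feature matrix rather than per-tree OOB sets); a sketch, assuming the clf_final random forest fitted in the section above:
from sklearn.inspection import permutation_importance

result = permutation_importance(clf_final, cirrhosis_X, cirrhosis_y, n_repeats=10, random_state=1)
# Print features from most to least important (mean accuracy drop under permutation)
for i in result.importances_mean.argsort()[::-1]:
    print('feature', i, '-> importance', round(result.importances_mean[i], 3))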
feature_names=cirrhosis.drop(columns="Stage").columns.values
importances = clf_final.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf_final.estimators_], axis=0)
forest_importances = pd.Series(importances, index=feature_names)
fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
ax.tick_params(axis="x",labelrotation=45)
fig.tight_layout()
plt.savefig("RF_selection")
acc_train = []
acc_test = []
thres=0.05
cirrhosis_X_red=cirrhosis_X[:,np.where(importances>=thres)[0]]
max_depth_list = [2, 5, 10, 15]
lr_list = [0.0001,0.001, 0.01, 0.1,0.5]
param_grid = dict(max_depth=max_depth_list, learning_rate=lr_list)
clf = GridSearchCV(XGBClassifier(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5)
clf.fit(cirrhosis_X_red, cirrhosis_y)
print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X_red, cirrhosis_y),2))
tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    # Re-select features on each training fold with a fresh random forest
    clf_rf = RandomForestClassifier(max_depth=rf_params['max_depth'],
                                    n_estimators=rf_params['n_estimators'])
    clf_rf.fit(X_train, Y_train)
    importances = clf_rf.feature_importances_
    X_train_rf = X_train[:, np.where(importances >= thres)[0]]
    X_test_rf = X_test[:, np.where(importances >= thres)[0]]
    print("Selected features in fold ", fold, ": ", np.where(importances >= thres)[0])
    clf_final = XGBClassifier(max_depth=clf.best_params_['max_depth'],
                              learning_rate=clf.best_params_['learning_rate'])
    clf_final.fit(X_train_rf, Y_train)
    print('Train accuracy:', round(clf_final.score(X_train_rf, Y_train), 2))
    print('Test accuracy:', round(clf_final.score(X_test_rf, Y_test), 2))
    acc_train.append(clf_final.score(X_train_rf, Y_train))
    acc_test.append(clf_final.score(X_test_rf, Y_test))
tok = time.time()
final_acc_test = round(np.mean(acc_test), 2)
final_acc_train = round(np.mean(acc_train), 2)
print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok - tik, 2))
comparison_dict['RF+XGBoost'] = [final_acc_test, round((tok-tik)/kfold.n_splits,2)]
Feature Selection with LASSO
alpha_list = [k*10**(-i) for i in reversed(range(6)) for k in [1,3,5,8]]
acc_train = []
acc_test = []
times = []
selected_features = []
for alpha in alpha_list:
    tik = time.time()
    print('alpha:', alpha)
    clf_lasso = Lasso(alpha=alpha)
    acc_train_alpha = []
    acc_test_alpha = []
    for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
        X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
        Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
        clf_lasso.fit(X_train, Y_train)
        support = (clf_lasso.coef_ != 0.0)
        print('New shape of train data for fold', fold, ':', X_train[:, support].shape)
        selected_features.append([i for i, val in enumerate(support) if val])
        preds_lasso_train = (clf_lasso.predict(X_train) > 0.5)
        acc_train_alpha.append(round((Y_train == preds_lasso_train).sum() / len(Y_train), 2))
        preds_lasso_test = (clf_lasso.predict(X_test) > 0.5)
        acc_test_alpha.append(round((Y_test == preds_lasso_test).sum() / len(Y_test), 2))
    tok = time.time()
    times.append(round((tok - tik) / kfold.n_splits, 2))
    acc_train.append(np.mean(acc_train_alpha))
    acc_test.append(np.mean(acc_test_alpha))
selected_features = sum(selected_features, [])
plt.plot(alpha_list, acc_test)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Accuracy')
plt.title('Test accuracy with Logistic Regression+LASSO Feature Selection')
![image.png](https://upload-images.jianshu.io/upload_images/24891076-ebc6bba1a412d955.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
print('Maximal test accuracy with Logistic Regression + LASSO:', round(np.max(acc_test),2), 'achieved at alpha =', alpha_list[np.argmax(acc_test)])
comparison_dict['Lasso']=[round(np.max(acc_test),2),times[np.argmax(acc_test)]]
# Maximal test accuracy with Logistic Regression + LASSO: 0.73 achieved at alpha = 0.0005
y, x, _ = plt.hist(selected_features, bins=range(16))
print('Most frequently selected features:', np.argpartition(y, -5)[-5:])
# Save before show(): show() clears the figure, so saving afterwards writes a blank image
plt.savefig("Sel_feat_lasso")
plt.show()
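Strictly speaking, the block above thresholds a Lasso regressor (squared loss) at 0.5 rather than fitting the penalized logistic regression the caption names; an actual L1-logistic sketch for comparison (solver and C are assumptions):
from sklearn.linear_model import LogisticRegression

log_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
log_l1.fit(cirrhosis_X, cirrhosis_y)
print('Selected features:', np.where(log_l1.coef_.flatten() != 0.0)[0])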
Feature Selection LASSO+SVC
C_list = [k*10**(i) for i in range(-2, 4) for k in [1, 3, 5, 8]]
acc_train = []
acc_test = []
times = []
selected_features = []
for C in C_list:
    tik = time.time()
    print('C:', C)
    clf_svc = LinearSVC(C=C, penalty='l1', dual=False)
    acc_train_C = []
    acc_test_C = []
    for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
        X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
        Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
        clf_svc.fit(X_train, Y_train)
        support = (clf_svc.coef_ != 0.0).flatten()
        print('New shape of train data for fold', fold, ':', X_train[:, support].shape)
        selected_features.append([i for i, val in enumerate(support) if val])
        preds_svc_train = (clf_svc.predict(X_train) > 0.5)
        acc_train_C.append(round((Y_train == preds_svc_train).sum() / len(Y_train), 2))
        preds_svc_test = (clf_svc.predict(X_test) > 0.5)
        acc_test_C.append(round((Y_test == preds_svc_test).sum() / len(Y_test), 2))
    acc_train.append(np.mean(acc_train_C))
    acc_test.append(np.mean(acc_test_C))
    tok = time.time()
    times.append(round((tok - tik) / kfold.n_splits, 2))
selected_features = sum(selected_features, [])
plt.plot(C_list, acc_test)
plt.xscale('log')
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title('Test accuracy with LASSO+SVC Feature Selection')
print('Maximal test accuracy with Linear SVM + LASSO:', np.max(acc_test), 'achieved at C =', C_list[np.argmax(acc_test)])
comparison_dict['Lasso+LinearSVC']=[round(np.max(acc_test),2), times[np.argmax(acc_test)]]
# Maximal test accuracy with Linear SVM + LASSO: 0.72392 achieved at C = 8
y, x, _ = plt.hist(selected_features, bins=range(16))
print('Most frequently selected features:', np.argpartition(y, -5)[-5:])
# Most frequently selected features: [13 10 1 9 7]
Feature Selection ElasticNet
alpha_list = [k*10**(-i) for i in reversed(range(6)) for k in [1, 3, 5, 8]]
acc_train = []
acc_test = []
times = []
selected_features = []
for alpha in alpha_list:
    tik = time.time()
    print('alpha:', alpha)
    clf_en = ElasticNet(alpha=alpha, l1_ratio=0.7)
    acc_train_alpha = []
    acc_test_alpha = []
    for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
        X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
        Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
        clf_en.fit(X_train, Y_train)
        support = (clf_en.coef_ != 0.0)
        print('New shape of train data for fold', fold, ':', X_train[:, support].shape)
        selected_features.append([i for i, val in enumerate(support) if val])
        preds_en_train = (clf_en.predict(X_train) > 0.5)
        acc_train_alpha.append(round((Y_train == preds_en_train).sum() / len(Y_train), 2))
        preds_en_test = (clf_en.predict(X_test) > 0.5)
        acc_test_alpha.append(round((Y_test == preds_en_test).sum() / len(Y_test), 2))
    tok = time.time()
    acc_train.append(np.mean(acc_train_alpha))
    acc_test.append(np.mean(acc_test_alpha))
    times.append(round((tok - tik) / kfold.n_splits, 2))
selected_features = sum(selected_features, [])
plt.plot(alpha_list, acc_test)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Accuracy')
plt.title('Test accuracy with ElasticNet Feature Selection')
print('Maximal test accuracy with Elastic Net:', np.max(acc_test), 'achieved at alpha =', alpha_list[np.argmax(acc_test)])
comparison_dict['Elasticnet']=[round(np.max(acc_test),2), times[np.argmax(acc_test)]]
# Maximal test accuracy with Elastic Net: 0.732 achieved at alpha = 0.008
y, x, _ = plt.hist(selected_features, bins=range(16))
print('Most frequently selected features:', np.argpartition(y, -5)[-5:])
# Save before show(), for the same reason as above
plt.savefig("Sel_feat_Elastic")
plt.show()
# Most frequently selected features: [ 7 13 1 9 10]
In summary:
- LASSO + logistic regression, LASSO + linear SVC, and elastic net all reach a maximum accuracy of 0.73
- Most frequently selected features: Age, Cholesterol, Copper, Alk_Phos, Platelets (the mapping from the printed indices is sketched below)
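A quick sketch mapping the printed indices back to column names (the indices refer to the feature matrix after the column drops above):
cols = cirrhosis.drop(columns='Stage').columns
print([cols[i] for i in [7, 13, 1, 9, 10]])
# expected: ['Cholesterol', 'Platelets', 'Age', 'Copper', 'Alk_Phos']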
Ensemble Methods
- Split the full dataset into 80% training and 20% test; run 5-fold validation on the training set and evaluate the improved method on the test set
- Train 5 different XGBoost classifiers, one per validation fold
- Use the best-performing parameters found earlier by grid search
- The final model predicts 1 if 3 or more of the classifiers predict cirrhosis, and 0 otherwise
- Best accuracy so far: 0.77
- Weighted majority voting is also available off the shelf (sklearn's VotingClassifier, or mlxtend's EnsembleVoteClassifier used below); giving the best-performing classifier the highest weight reached 80% accuracy
Majority Vote on XGBoost
To improve XGBoost's performance, we train XGBoost on five different training sets and keep all five classifiers. A majority vote over the five trained classifiers is then evaluated on a separate test set that none of the models saw during training.
# Train-Test-Split to be used in majority vote
cirrhosis_X_train, cirrhosis_X_test, cirrhosis_y_train, cirrhosis_y_test = train_test_split(cirrhosis_X, cirrhosis_y, test_size=0.2, random_state=1)
acc_train = []
acc_test = []
model_list = []
tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X_train, cirrhosis_y_train)):
    X_train, X_test = cirrhosis_X_train[train_ids], cirrhosis_X_train[test_ids]
    Y_train, Y_test = cirrhosis_y_train[train_ids], cirrhosis_y_train[test_ids]
    clf = XGBClassifier(max_depth=2, eta=0.1)
    clf.fit(X_train, Y_train)
    model_list.append(clf)
    print('Train accuracy:', round(clf.score(X_train, Y_train), 2))
    print('Test accuracy:', round(clf.score(X_test, Y_test), 2))
    acc_train.append(clf.score(X_train, Y_train))
    acc_test.append(clf.score(X_test, Y_test))
final_acc_test = round(np.mean(acc_test), 2)
final_acc_train = round(np.mean(acc_train), 2)
print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
# Majority vote: predict 1 when at least 3 of the 5 models predict cirrhosis
preds = np.stack([model.predict(cirrhosis_X_test) for model in model_list])
majority_vote = (preds.sum(axis=0) >= 3).astype(int)
test_acc = round((majority_vote == cirrhosis_y_test).sum() / len(cirrhosis_y_test), 2)
tok = time.time()
comparison_dict['MajorityVoteXGB'] = [test_acc, round(tok - tik, 2)]
print('Accuracy on separate test set of majority vote:', comparison_dict['MajorityVoteXGB'][0])
Ensemble Vote Classifier on XGBoost
This is an implementation of majority voting that allows the weight of each model to be tuned. It gave the best predictions so far.
https://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/
from mlxtend.classifier import EnsembleVoteClassifier
tik=time.time()
eclf = EnsembleVoteClassifier(clfs=model_list, weights=[1,1,2,1,0.5])
eclf.fit(cirrhosis_X_train, cirrhosis_y_train)
eclf.predict(cirrhosis_X_test)
eclf.score(cirrhosis_X_test, cirrhosis_y_test)
tok=time.time()
comparison_dict['ECLF_XGBoost'] = [round(eclf.score(cirrhosis_X_test, cirrhosis_y_test),2), round(tok-tik,2)]
print('Accuracy on separate test set of majority vote:', comparison_dict['ECLF_XGBoost'][0])
# Accuracy on seperate test set of majority vote: 0.8
Method Comparison
# Add a second y-axis so accuracy (bars) and computation time (line) share one chart
fig=plt.figure(figsize=(30, 15))
ax1 = fig.add_subplot(111)
ax2=ax1.twinx()
dict_sort=dict(sorted(comparison_dict.items(), key=lambda item: item[1][0]))
methods = list(dict_sort.keys())
accuracies = [item[0] for item in dict_sort.values()]
times = [item[1] for item in dict_sort.values()]
p1=ax1.bar(methods, accuracies,alpha=.5,label="Accuracy")
p2=ax2.plot(methods, times,lw=5,color="r",label="Computational time")[0]
ax1.set_ylabel("Accuracies",fontsize=25)
ax1.set_ylim(0.6)
ax2.set_ylabel("Computational times",fontsize=25)
ax1.set_xticklabels(methods, rotation=45, ha='right',fontsize=25,rotation_mode='anchor')
labs=[p1.get_label(),p2.get_label()]
fig.legend([p1,p2],
labels=labs,
loc="upper center",
prop={'size': 25}
)
plt.tight_layout()
plt.show()
fig.savefig("accuracy_comparison.png")
Conclusions
- Interpretable models such as penalized logistic regression performed poorly, but are very fast;
- The best models are ensembles (random forest, boosting, ensemble voting); they take longer to train and are not interpretable at all;
- Other papers report better accuracy on different, more complete datasets (see [1], the Indian liver patient dataset, which only detects "liver disease" vs. "no liver disease"); methods such as decision trees, random forests, and SVMs are common there, with accuracy reaching roughly 90%;
- With a larger, more complete dataset, ensemble methods could approach the accuracy of [1] for predicting cirrhosis. Ensemble methods are slow, which may matter on very large datasets.
[1] P M Dattatreya et al. “Machine Learning Techniques in Analysis and Prediction of Liver Disease”. In: IJIRT 8.2 (2021).