Cirrhosis Prediction with Ensemble Techniques

Project Goal

  • Cirrhosis is scarring of the liver (fibrosis) caused by long-term liver damage
  • Cirrhosis can eventually lead to liver failure, which is life-threatening
  • Treatment may stop cirrhosis from getting worse; this project aims to predict cirrhosis from structured clinical data

Data Description

A total of 418 patients; 312 of them have complete records, while 106 only had some basic measurements taken. The dataset contains the following 20 fields; excluding the ID and the Stage target, 18 attributes are usable as predictors:


Dataset fields
  1. ID: unique identifier
  2. N_Days: number of days between registration and the earlier of death, transplantation, or the study's analysis time
  3. Status: patient status, C (censored), CL (censored due to liver tx), or D (death)
  4. Drug: type of drug, D-penicillamine or placebo
  5. Age: age in [days]
  6. Sex: M (male) or F (female)
  7. Ascites: presence of ascites, N (no) or Y (yes)
  8. Hepatomegaly: presence of hepatomegaly, N (no) or Y (yes)
  9. Spiders: presence of spider angiomas, N (no) or Y (yes)
  10. Edema: N (no edema and no diuretic therapy for edema), S (edema present without diuretics, or edema resolved by diuretics), or Y (edema despite diuretic therapy)
  11. Bilirubin: serum bilirubin [mg/dl]
  12. Cholesterol: serum cholesterol [mg/dl]
  13. Albumin: albumin [gm/dl]
  14. Copper: urine copper [µg/day]
  15. Alk_Phos: alkaline phosphatase [U/l]
  16. SGOT: SGOT [U/ml]
  17. Tryglicerides: triglycerides [mg/dl]
  18. Platelets: platelets per cubic [ml/1000]
  19. Prothrombin: prothrombin time [s]
  20. Stage: histologic stage of disease (1, 2, 3, or 4)

Four liver stages: healthy, fatty liver, fibrosis, cirrhosis
Our goal: binary classification, predicting cirrhosis (1) vs. no cirrhosis (0)
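
For reference, a minimal way to load the data. The original post does not show this step, so the file name cirrhosis.csv is an assumption (it matches the Kaggle release of this dataset):

import pandas as pd

# Hypothetical load; the file name is assumed, not shown in the original post
cirrhosis = pd.read_csv('cirrhosis.csv')
print(cirrhosis.shape)  # expected: (418, 20)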

Data Preprocessing

  • Drop the "ID", "N_Days", "Hepatomegaly", and "Status" columns
  • Replace missing values of numerical features with the median
  • Replace missing values of categorical features with the most frequent or most plausible category
  • Label-encode the categorical variables
  • Binarize the target
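
The original post omits its import cell. The snippets below appear to rely on roughly the following imports (a reconstruction, so treat it as an assumption rather than the author's exact list):

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import svm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier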
# Drop ID as we will not need it for prediction
cirrhosis.drop('ID', axis=1, inplace=True)
# Drop N_Days and Status to prevent data leakage
cirrhosis.drop('N_Days', axis=1, inplace=True)
cirrhosis.drop('Status', axis=1, inplace=True)

Exploratory Data Analysis

Missing values

def get_percentage_missing(series):
    """ Calculate the fraction of NaN values per column.
    :param series: pandas DataFrame (or Series)
    :return: rounded fraction(s) of missing values
    """
    num = series.isnull().sum()
    den = len(series)
    return round(num/den, 2)

# Only include columns that contain any NaN values
df_with_any_null_values = cirrhosis[cirrhosis.columns[cirrhosis.isnull().any()].tolist()]

get_percentage_missing(df_with_any_null_values)
Proportion of missing values

The first 312 cases in the dataset participated in the randomized trial and contain mostly complete data. An additional 112 cases did not participate in the clinical trial but consented to having basic measurements recorded and to being followed for survival. Six of those cases were lost to follow-up shortly after diagnosis, so the data here concerns the remaining 106 cases as well as the 312 randomized participants.

# Drop rows where target value is missing
cirrhosis=cirrhosis.dropna(subset=['Stage'])
# save the names of the variables with missing values (one for object and one for float columns)
df_with_any_null_values = cirrhosis[cirrhosis.columns[cirrhosis.isnull().any()].tolist()]

obj_cols = list(df_with_any_null_values.select_dtypes(include='object'))
float_cols = list(df_with_any_null_values.select_dtypes(include='float64'))

obj_cols_with_nan = get_percentage_missing(df_with_any_null_values[obj_cols]).keys()
float_cols_with_nan = get_percentage_missing(df_with_any_null_values[float_cols]).keys()

# explore data in categorical columns with missing values to determine how to replace the missing values
for col in obj_cols_with_nan:
    val_counts = {}
    for value in cirrhosis[col].value_counts().keys():
        val_counts[value] = cirrhosis[col].value_counts()[value]
    val_counts['NaN'] = cirrhosis[col].isna().sum()
    print('Value count for', col, ':', val_counts)
  • Drug: the data was collected within a study of the drug D-penicillamine. The 100 missing values come from patients who did not take part in the study but agreed to have their data recorded. The missing values are therefore treated as an extra "no active drug" group and filled with the 'Placebo' label, as in the code below.
  • Ascites: ascites is a condition closely associated with liver dysfunction, most often caused by cirrhosis. Since the proportion of Y to N is small, the missing values are replaced with 'N'.
  • Hepatomegaly: an enlarged liver is a non-specific medical sign with many causes, which can broadly be broken down into infection, hepatic tumours, and metabolic disorder. It is a rather ambiguous condition here, with nearly equal proportions of Y and N, so this predictor is dropped.
  • Spiders: spider angiomas are swollen, spider-like blood vessels on the skin. They are common and often benign, appearing in roughly 10-15% of healthy adults and young children. Having more than three, however, is likely to be abnormal and can be a sign of liver disease and/or hepatitis C. The Y/N ratio leans towards N, though not strongly. Since spider angiomas are common and, unless numerous, not a reliable indicator of liver disease, the missing values are filled with 'N'.
cirrhosis['Drug'] = cirrhosis['Drug'].fillna('Placebo')
cirrhosis['Ascites'] = cirrhosis['Ascites'].fillna('N')  
cirrhosis.drop('Hepatomegaly', axis=1, inplace=True)
cirrhosis['Spiders'] = cirrhosis['Spiders'].fillna('N')

# Missing float values are all filled with the median
cirrhosis['Cholesterol'] = cirrhosis['Cholesterol'].fillna(cirrhosis['Cholesterol'].median())
cirrhosis['Copper'] = cirrhosis['Copper'].fillna(cirrhosis['Copper'].median())
cirrhosis['Alk_Phos'] = cirrhosis['Alk_Phos'].fillna(cirrhosis['Alk_Phos'].median())
cirrhosis['SGOT'] = cirrhosis['SGOT'].fillna(cirrhosis['SGOT'].median())
cirrhosis['Tryglicerides'] = cirrhosis['Tryglicerides'].fillna(cirrhosis['Tryglicerides'].median())
cirrhosis['Platelets'] = cirrhosis['Platelets'].fillna(cirrhosis['Platelets'].median())
cirrhosis['Prothrombin'] = cirrhosis['Prothrombin'].fillna(cirrhosis['Prothrombin'].median())
Missing values after imputation

Class balance

# Proportion of cirrhosis/no cirrhosis
round(len(cirrhosis[cirrhosis['Stage']==4])/len(cirrhosis),2)
# Result: 0.35

Outlier detection

# Outlier detection
float_cols = list(cirrhosis.select_dtypes(include='float64'))

for col in float_cols:
    plt.figure()    
    sns.boxplot(cirrhosis[col])

Prepared data

# Binary Classification
cirrhosis['Stage'] = np.where(cirrhosis['Stage'] == 4, 1, 0)

# Check that all missing values are gone
df_with_any_null_values = cirrhosis[cirrhosis.columns[cirrhosis.isnull().any()].tolist()]
get_percentage_missing(df_with_any_null_values)

# convert data to be usable in sklearn classifiers
le = LabelEncoder()
obj_cols = cirrhosis.select_dtypes(include=('object')).columns

for col in obj_cols:
    cirrhosis[col] = le.fit_transform(cirrhosis[col])

cirrhosis_y = cirrhosis.iloc[:,-1].values
cirrhosis_X = cirrhosis.iloc[:,:-1]
cirrhosis_X = cirrhosis_X.values

# Create dictionary to compare accuracies and computation time between classifiers
comparison_dict = {}

# Folds for Cross-validation
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
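
One caveat with the label-encoding loop above: a single LabelEncoder is refit for every column, so the per-column mappings are lost afterwards. A variant of that loop which keeps one fitted encoder per column (a sketch, not in the original) makes the encoding reversible:

# Keep one fitted encoder per column so the categories can be recovered later
encoders = {}
for col in obj_cols:
    enc = LabelEncoder()
    cirrhosis[col] = enc.fit_transform(cirrhosis[col])
    encoders[col] = enc

# e.g. recover the original category labels of a column afterwards:
# encoders['Drug'].inverse_transform([0, 1])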

First Attempts

  • All models are evaluated with 5-fold cross-validation, comparing the mean accuracy and the computation time of the 5-fold run
  • Grid search is used for every model
  • First simple model on the preprocessed data: a support vector classifier
  • Average test accuracy over the 5 folds: 0.68
  • Computation time per fold: 0.28 seconds
C_list = np.logspace(-5,3, num=10, base=10)
kernel_list = ['poly', 'rbf', 'sigmoid']
param_grid = dict(C=C_list, kernel=kernel_list)

clf = GridSearchCV(svm.SVC(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X, cirrhosis_y)

print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X, cirrhosis_y),2))

acc_train = []
acc_test = []

tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    
    clf_final = svm.SVC(C=clf.best_params_['C'],kernel=clf.best_params_['kernel'])
    clf_final.fit(X_train, Y_train)
    
    print('Train accuracy:', round(clf_final.score(X_train, Y_train),2))
    print('Test accuracy:', round(clf_final.score(X_test, Y_test),2))
    
    acc_train.append(clf_final.score(X_train, Y_train))
    acc_test.append(clf_final.score(X_test, Y_test))
tok = time.time()    

final_acc_test = round(np.mean(acc_test),2)
final_acc_train = round(np.mean(acc_train),2)

print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok-tik, 2))
comparison_dict['SVC'] = [final_acc_test, round((tok-tik)/kfold.n_splits,2)]
Initial SVM results

Dimensionality Reduction

PCA

pca = PCA(n_components=3)
pca.fit(cirrhosis_X)
cirrhosis_X_pca = pca.transform(cirrhosis_X)

plt.figure(figsize=(25,8))
sns.scatterplot(x=cirrhosis_X_pca[:,0], y=cirrhosis_X_pca[:, 1], hue=cirrhosis_y, palette='terrain')
plt.title('Principal Components vs Class distribution', fontsize=16)
plt.ylabel('Principal Component 2', fontsize=16)
plt.xlabel('Principal Component 1', fontsize=16)
_=plt.xticks(rotation='vertical')
PCA visualization

Kernel PCA

In general, Principal Component Analysis (PCA) performs linear dimensionality reduction, whereas Kernel PCA (KPCA) performs nonlinear dimensionality reduction and is suited to datasets that are not linearly separable.
The rough idea of KPCA: for a data matrix X in the input space, first map all samples through a nonlinear map into a high-dimensional (possibly infinite-dimensional) feature space, where they may become linearly separable, and then run ordinary PCA in that space. See the appendix for details.
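More concretely (a standard textbook formulation, stated here for context rather than taken from the original post): with a kernel $k$ such that $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, KPCA never computes the map $\phi$ explicitly. It builds the kernel matrix, centers it in feature space, and solves the eigenproblem there:

$$K_{ij} = k(x_i, x_j), \qquad \tilde{K} = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n, \qquad \tilde{K} \alpha^{(l)} = n \lambda_l \alpha^{(l)}$$

where $\mathbf{1}_n$ is the $n \times n$ matrix with all entries $1/n$. The $l$-th principal component of a sample $x$ is then the projection $\sum_i \alpha_i^{(l)} k(x_i, x)$.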

gamma_list = np.logspace(-5, 5, num=5, base=10)
alpha_list = np.logspace(-5, 2, num=5, base=10)
kernel_list = ['linear', 'rbf', 'cosine']

for gamma in gamma_list:
    for alpha in alpha_list:
        for kernel in kernel_list:    
            kernel_pca = KernelPCA(
                    n_components=2, kernel=kernel, gamma=gamma, fit_inverse_transform=True, alpha=alpha
                )

            kernel_pca.fit(cirrhosis_X)
            cirrhosis_X_kpca = kernel_pca.transform(cirrhosis_X)
            
            fig = plt.figure()
            plt.scatter(cirrhosis_X_kpca[:,0], cirrhosis_X_kpca[:, 1], marker='o', c=cirrhosis_y, s=25, edgecolor='k') 
            plt.title('2-component KPCA with gamma={gamma}, alpha={alpha}, kernel={kernel}'.format(gamma=np.round(gamma,4), alpha=np.round(alpha,4), kernel=kernel))
            plt.xlabel('PC_1')
            plt.ylabel('PC_2')
            plt.show()
kernel_pca = KernelPCA(n_components=2, kernel='rbf', gamma=1, fit_inverse_transform=True, alpha=0.1)
kernel_pca.fit(cirrhosis_X)
cirrhosis_X_pca = kernel_pca.transform(cirrhosis_X)

plt.figure(figsize=(25,8))
sns.scatterplot(x=cirrhosis_X_pca[:,0], y=cirrhosis_X_pca[:, 1], hue=cirrhosis_y, palette='terrain')
plt.title('Principal Components vs Class distribution', fontsize=16)
plt.ylabel('Principal Component 2', fontsize=16)
plt.xlabel('Principal Component 1', fontsize=16)
_=plt.xticks(rotation='vertical')
Kernel PCA visualization

Observations:

  • No parameter combination was found that makes the data separable
  • Running SVM on top of PCA and KPCA gave the expected, unremarkable results:
  • Test accuracy of 0.68 in both cases, with computation times of 0.06 s and 0.08 s respectively
  • LDA and QDA performed better (0.71 each), but the data remains inseparable

PCA+SVM

C_list = np.logspace(-5, 3, num=10, base=10)
kernel_list = ['poly', 'rbf', 'sigmoid']
param_grid = dict(C=C_list, kernel=kernel_list)

pca = PCA(n_components=4)
cirrhosis_X_pca = pca.fit_transform(cirrhosis_X)
clf = GridSearchCV(svm.SVC(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X_pca, cirrhosis_y)

print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X_pca, cirrhosis_y),2))

acc_train = []
acc_test = []

tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]

    pca = PCA(n_components=4)
    pca.fit(X_train)
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    
    clf_final = svm.SVC(kernel=clf.best_params_['kernel'], C=clf.best_params_['C'])
    clf_final.fit(X_train_pca, Y_train)
    
    print('Train accuracy:', round(clf_final.score(X_train_pca, Y_train),2))
    print('Test accuracy:', round(clf_final.score(X_test_pca, Y_test),2))
    
    acc_train.append(clf_final.score(X_train_pca, Y_train))
    acc_test.append(clf_final.score(X_test_pca, Y_test))
tok = time.time()  

final_acc_test = round(np.mean(acc_test),2)
final_acc_train = round(np.mean(acc_train),2)

print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok-tik, 2))
comparison_dict['PCA+SVC'] = [final_acc_test,  round((tok-tik)/kfold.n_splits,2)]

Kernel PCA+SVM

C_list = np.logspace(-5, 3, num=10, base=10)
kernel_list = ['poly', 'rbf', 'sigmoid']
param_grid = dict(C=C_list, kernel=kernel_list)

kernel_pca = KernelPCA(n_components=4)
cirrhosis_X_pca = kernel_pca.fit_transform(cirrhosis_X)
clf = GridSearchCV(svm.SVC(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X_pca, cirrhosis_y)

print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X_pca, cirrhosis_y),2))

acc_train = []
acc_test = []

tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]

    kernel_pca = KernelPCA(n_components=4)
    kernel_pca.fit(X_train)
    X_train_pca = kernel_pca.transform(X_train)
    X_test_pca = kernel_pca.transform(X_test)
    
    clf_final = svm.SVC(kernel=clf.best_params_['kernel'], C=clf.best_params_['C'])
    clf_final.fit(X_train_pca, Y_train)
    
    print('Train accuracy:', round(clf_final.score(X_train_pca, Y_train),2))
    print('Test accuracy:', round(clf_final.score(X_test_pca, Y_test),2))
    
    acc_train.append(clf_final.score(X_train_pca, Y_train))
    acc_test.append(clf_final.score(X_test_pca, Y_test))
tok = time.time()  

final_acc_test = round(np.mean(acc_test),2)
final_acc_train = round(np.mean(acc_train),2)

print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok-tik, 2))
comparison_dict['KPCA+SVC'] = [final_acc_test,  round((tok-tik)/kfold.n_splits,2)]

LDA+SVM

C_list = np.logspace(-5, 3, num=10, base=10)
kernel_list = ['poly', 'rbf', 'sigmoid']
param_grid = dict(C=C_list, kernel=kernel_list)

lda = LinearDiscriminantAnalysis(n_components=1)
cirrhosis_X_lda=lda.fit_transform(cirrhosis_X, cirrhosis_y)
clf = GridSearchCV(svm.SVC(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X_lda, cirrhosis_y)

print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X_lda, cirrhosis_y),2))

acc_train = []
acc_test = []

tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]

    lda = LinearDiscriminantAnalysis(n_components=1)
    lda.fit(X_train, Y_train)
    X_lda_train = lda.transform(X_train)
    X_lda_test = lda.transform(X_test)
    
    clf_final = svm.SVC(kernel=clf.best_params_['kernel'], C=clf.best_params_['C'])
    clf_final.fit(X_lda_train, Y_train)
    
    print('Train accuracy:', round(clf_final.score(X_lda_train, Y_train),2))
    print('Test accuracy:', round(clf_final.score(X_lda_test, Y_test),2))
    
    acc_train.append(clf_final.score(X_lda_train, Y_train))
    acc_test.append(clf_final.score(X_lda_test, Y_test))
tok = time.time()   

final_acc_test = round(np.mean(acc_test),2)
final_acc_train = round(np.mean(acc_train),2)

print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok-tik, 2))
comparison_dict['LDA+SVC'] = [final_acc_test,  round((tok-tik)/kfold.n_splits,2)]

plt.figure()
# Plot the 1-D LDA projection, offsetting the two classes vertically for readability
plt.scatter(X_lda_train[Y_train==0, 0], np.zeros(np.sum(Y_train==0)), label='No cirrhosis')
plt.scatter(X_lda_train[Y_train==1, 0], np.ones(np.sum(Y_train==1)), label='Cirrhosis')
plt.legend(loc="best", shadow=False, scatterpoints=1)
plt.title("LDA separation")
plt.show()

QDA

reg_list = np.logspace(-4, 1, num=5, base=10)
tol_list = np.logspace(-4, 1, num=5, base=10)
param_grid = dict(tol=tol_list, reg_param=reg_list)

clf = GridSearchCV(QuadraticDiscriminantAnalysis(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X, cirrhosis_y)

print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X, cirrhosis_y),2))

acc_train = []
acc_test = []

tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]

    clf_final = QuadraticDiscriminantAnalysis(tol=clf.best_params_['tol'],reg_param=clf.best_params_['reg_param'])
    clf_final.fit(X_train, Y_train)
    
    print('Train accuracy:', round(clf_final.score(X_train, Y_train),2))
    print('Test accuracy:', round(clf_final.score(X_test, Y_test),2))
    
    acc_train.append(clf_final.score(X_train, Y_train))
    acc_test.append(clf_final.score(X_test, Y_test))
tok = time.time()   

final_acc_test = round(np.mean(acc_test),2)
final_acc_train = round(np.mean(acc_train),2)

print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok-tik, 2))
comparison_dict['QDA'] = [final_acc_test,  round((tok-tik)/kfold.n_splits,2)]

Boosting

• AdaBoost: accuracy 0.72, computation time 0.12 seconds
• XGBoost: accuracy 0.76, computation time 0.24 seconds

n_estimator_list = [5, 10, 20, 50, 70, 100, 150, 200]
lr_list = [0.0001,0.001, 0.01, 0.1,0.5]
param_grid = dict(n_estimators=n_estimator_list, learning_rate=lr_list)

clf = GridSearchCV(AdaBoostClassifier(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X, cirrhosis_y)

print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X, cirrhosis_y),2))

acc_train = []
acc_test = []

tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    
    clf_final = AdaBoostClassifier( n_estimators=clf.best_params_['n_estimators'],
                                    learning_rate=clf.best_params_['learning_rate'])
    clf_final.fit(X_train, Y_train)
    
    print('Train accuracy:', round(clf_final.score(X_train, Y_train),2))
    print('Test accuracy:', round(clf_final.score(X_test, Y_test),2))
    
    acc_train.append(clf_final.score(X_train, Y_train))
    acc_test.append(clf_final.score(X_test, Y_test))
tok = time.time()    

final_acc_test = round(np.mean(acc_test),2)
final_acc_train = round(np.mean(acc_train),2)

print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok-tik, 2))
comparison_dict['AdaBoost'] = [final_acc_test,  round((tok-tik)/kfold.n_splits,2)]
max_depth_list = [2, 3, 4, 5, 10, 15]
lr_list = [0.0001,0.001, 0.01, 0.1,0.5]

param_grid = dict(max_depth=max_depth_list, learning_rate=lr_list)

clf = GridSearchCV( XGBClassifier(eval_metric='logloss'), param_grid=param_grid, n_jobs=-1, 
                    refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X, cirrhosis_y)

print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X, cirrhosis_y),2))

acc_train = []
acc_test = []

tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    
    clf_final = XGBClassifier(  max_depth=clf.best_params_['max_depth'],
                                learning_rate=clf.best_params_['learning_rate'],
                                eval_metric='logloss')
    clf_final.fit(X_train, Y_train)
    
    print('Train accuracy:', round(clf_final.score(X_train, Y_train),2))
    print('Test accuracy:', round(clf_final.score(X_test, Y_test),2))
    
    acc_train.append(clf_final.score(X_train, Y_train))
    acc_test.append(clf_final.score(X_test, Y_test))
tok = time.time()    

final_acc_test = round(np.mean(acc_test),2)
final_acc_train = round(np.mean(acc_train),2)

print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok-tik, 2))
comparison_dict['XGBoost'] = [final_acc_test,  round((tok-tik)/kfold.n_splits,2)]

Other Methods

Model comparison

KNN

How to choose the best K for KNN depends on the data.
In general, a larger K reduces the effect of noise on the classification, but makes the boundaries between classes less distinct. A good K can be found with various heuristic techniques (see hyperparameter optimization).
The accuracy of KNN can be severely degraded by the presence of noisy or irrelevant features, or when the feature scales are not consistent with their importance. Much research effort has gone into selecting and scaling features to improve classification: one common approach is to optimize feature scaling with evolutionary algorithms; another is to select features by the mutual information between the training samples and the classes.
In binary (two-class) problems, choosing an odd K helps avoid ties between the two classes. In this setting, the bootstrap is one way to pick an empirically optimal K. A scaled, odd-K variant of the grid search is sketched below.
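
The grid search further below runs on the raw encoded features. Reflecting the two points above, that feature scaling matters for KNN and that odd K avoids ties, one variant worth trying is a scaled pipeline restricted to odd K (a sketch, not part of the original analysis):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Scale features inside the CV folds, then fit KNN; search over odd K only
knn_pipe = Pipeline([('scale', StandardScaler()),
                     ('knn', KNeighborsClassifier())])
param_grid = {'knn__n_neighbors': [3, 5, 7, 9, 11]}
clf_scaled = GridSearchCV(knn_pipe, param_grid=param_grid, n_jobs=-1, refit=True, cv=5)
clf_scaled.fit(cirrhosis_X, cirrhosis_y)
print('Best parameters:', clf_scaled.best_params_)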

from sklearn.neighbors import KNeighborsClassifier

n_list = [2, 3, 4, 5, 6, 7, 8, 9, 10]
param_grid = dict(n_neighbors=n_list)

clf = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf.fit(cirrhosis_X, cirrhosis_y)

print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X, cirrhosis_y),2))

acc_train = []
acc_test = []

tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    
    clf_final = KNeighborsClassifier( n_neighbors=clf.best_params_['n_neighbors'])
    clf_final.fit(X_train, Y_train)
    
    print('Train accuracy:', round(clf_final.score(X_train, Y_train),2))
    print('Test accuracy:', round(clf_final.score(X_test, Y_test),2))
    
    acc_train.append(clf_final.score(X_train, Y_train))
    acc_test.append(clf_final.score(X_test, Y_test))
tok = time.time()    

final_acc_test = round(np.mean(acc_test),2)
final_acc_train = round(np.mean(acc_train),2)

print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok-tik, 2))
comparison_dict['KNeighbors'] = [final_acc_test,  round((tok-tik)/kfold.n_splits,2)]

Strengths and weaknesses of KNN: samples from different classes occupy different regions of the feature space, and nearest-neighbour classification assigns a label from the spatially closest training samples, which amounts to a partition of the feature space.

  • Pros: high accuracy, insensitive to outliers, no assumptions about the data.
  • Cons: high computational complexity and high space complexity.
  • Applicable data types: numerical and nominal.

Neural Network

from sklearn.neural_network import MLPClassifier

acc_train = []
acc_test = []

tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    
    clf=MLPClassifier(hidden_layer_sizes=(32,32),activation="tanh",solver="sgd",
                      learning_rate='adaptive', learning_rate_init=0.001,
                      n_iter_no_change=10,max_iter=1000,alpha=0.0001)
    clf.fit(X_train, Y_train)
    
    print('Train accuracy:', round(clf.score(X_train, Y_train),2))
    print('Test accuracy:', round(clf.score(X_test, Y_test),2))
    
    acc_train.append(clf.score(X_train, Y_train))
    acc_test.append(clf.score(X_test, Y_test))
tok = time.time()

final_acc_test = round(np.mean(acc_test),2)
final_acc_train = round(np.mean(acc_train),2)

print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok-tik, 2))
comparison_dict['NN'] = [final_acc_test,  round((tok-tik)/kfold.n_splits,2)]

Gaussian Naive Bayes

from sklearn.naive_bayes import GaussianNB

acc_train = []
acc_test = []
              
tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    
    clf = GaussianNB()
    clf.fit(X_train, Y_train)
    
    print('Train accuracy:', round(clf.score(X_train, Y_train),2))
    print('Test accuracy:', round(clf.score(X_test, Y_test),2))
    
    acc_train.append(clf.score(X_train, Y_train))
    acc_test.append(clf.score(X_test, Y_test))
tok = time.time()

final_acc_test = round(np.mean(acc_test),2)
final_acc_train = round(np.mean(acc_train),2)

print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok-tik, 2))
comparison_dict['Gaus_NaiveBayes'] = [final_acc_test,  round((tok-tik)/kfold.n_splits,2)]

Random Forest

from sklearn.ensemble import RandomForestClassifier

max_depth_list =  np.arange(start=2,stop=20,step=4)
n_estimators_list = np.arange(start=10,stop=500,step=90)
param_grid = dict(max_depth=max_depth_list,
                  n_estimators=n_estimators_list)

clf_rf = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5,verbose=1)
clf_rf.fit(cirrhosis_X, cirrhosis_y)
rf_params = clf_rf.best_params_
print('Best parameters:', clf_rf.best_params_)
print('Train accuracy:', round(clf_rf.score(cirrhosis_X, cirrhosis_y),2))

acc_train = []
acc_test = []

tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
    
    clf_final = RandomForestClassifier( max_depth=rf_params['max_depth'],
                                        n_estimators=rf_params['n_estimators'])
    clf_final.fit(X_train, Y_train)
    
    print('Train accuracy:', round(clf_final.score(X_train, Y_train),2))
    print('Test accuracy:', round(clf_final.score(X_test, Y_test),2))
    
    acc_train.append(clf_final.score(X_train, Y_train))
    acc_test.append(clf_final.score(X_test, Y_test))
tok = time.time()    

final_acc_test = round(np.mean(acc_test),2)
final_acc_train = round(np.mean(acc_train),2)

print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok-tik, 2))
comparison_dict['RF'] = [final_acc_test,  round((tok-tik)/kfold.n_splits,2)]

Feature Selection

Feature importance from random forests

How random-forest feature importance works

  • Out-of-bag (OOB) error estimation: random forests draw bootstrap samples with replacement, which leaves out-of-bag data for every tree. The OOB error can therefore be used to assess feature importance, defined as the average drop in classification accuracy after slightly perturbing the values of a feature in the OOB data.
  • Gini-based importance: the importance of a feature in a single tree is defined as the weighted decrease in impurity over all non-leaf nodes that split on that feature; the larger the total decrease, the more important the feature.

Computing the permutation-based importance:
1. For each decision tree in the forest, compute its error on the corresponding out-of-bag (OOB) data, denoted $error_{OOB1}$.
2. Randomly perturb feature X across all OOB samples (i.e., randomly permute the values of X) and recompute the OOB error, denoted $error_{OOB2}$.
3. With $N_{tree}$ trees in the forest, the importance of feature X is
$$\mathrm{importance}(X) = \frac{1}{N_{tree}} \sum_{t=1}^{N_{tree}} \left( error_{OOB2}^{(t)} - error_{OOB1}^{(t)} \right)$$
This expression works as an importance measure because, if randomly perturbing a feature sharply lowers the out-of-bag accuracy, that feature strongly influences the classification result, i.e., it is important.
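
sklearn ships a generic implementation of this permute-and-score idea in sklearn.inspection.permutation_importance; it scores on a supplied evaluation set rather than on true OOB samples. A minimal sketch, reusing the fitted clf_final and the last fold's X_test/Y_test from the random-forest section above:

from sklearn.inspection import permutation_importance

feature_names = cirrhosis.drop(columns="Stage").columns.values
# Permute each feature 10 times and measure the mean drop in accuracy
result = permutation_importance(clf_final, X_test, Y_test, n_repeats=10, random_state=1)
for i in result.importances_mean.argsort()[::-1]:
    print(feature_names[i], round(result.importances_mean[i], 3))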

feature_names=cirrhosis.drop(columns="Stage").columns.values
importances = clf_final.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf_final.estimators_], axis=0)
forest_importances = pd.Series(importances, index=feature_names)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
ax.tick_params(axis="x",labelrotation=45)
fig.tight_layout()
plt.savefig("RF_selection")

Random forest feature importances
acc_train = []
acc_test = []
thres=0.05
cirrhosis_X_red=cirrhosis_X[:,np.where(importances>=thres)[0]]

max_depth_list = [2, 5, 10, 15]
lr_list = [0.0001,0.001, 0.01, 0.1,0.5]
param_grid = dict(max_depth=max_depth_list, learning_rate=lr_list)

clf = GridSearchCV(XGBClassifier(), param_grid=param_grid, n_jobs=-1, refit=True, cv=5)
clf.fit(cirrhosis_X_red, cirrhosis_y)

print('Best parameters:', clf.best_params_)
print('Train accuracy:', round(clf.score(cirrhosis_X_red, cirrhosis_y),2))

tik = time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
    X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
    Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]

    clf_rf = RandomForestClassifier( max_depth=rf_params['max_depth'],
                                        n_estimators=rf_params['n_estimators'])
    clf_rf.fit(X_train, Y_train)
    importances = clf_rf.feature_importances_
    X_train_rf = X_train[:,np.where(importances>=thres)[0]]
    X_test_rf = X_test[:,np.where(importances>=thres)[0]]
    print("Selected features in fold ",fold,": ",np.where(importances>=thres)[0])
    clf_final = XGBClassifier(  max_depth=clf.best_params_['max_depth'],
                                learning_rate=clf.best_params_['learning_rate'])
    clf_final.fit(X_train_rf, Y_train)
    
    print('Train accuracy:', round(clf_final.score(X_train_rf, Y_train),2))
    print('Test accuracy:', round(clf_final.score(X_test_rf, Y_test),2))
    
    acc_train.append(clf_final.score(X_train_rf, Y_train))
    acc_test.append(clf_final.score(X_test_rf, Y_test))
tok = time.time()    

final_acc_test = round(np.mean(acc_test),2)
final_acc_train = round(np.mean(acc_train),2)

print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)
print('Total k-fold computation time:', round(tok-tik, 2))
comparison_dict['RF+XGBoost'] = [final_acc_test,  round((tok-tik)/kfold.n_splits,2)]

Feature Selection with LASSO
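
As a reminder of why LASSO performs feature selection (the standard sklearn objective, stated here for context): the L1 penalty drives some coefficients exactly to zero, and the corresponding features are dropped. sklearn's Lasso minimizes

$$\min_w \; \frac{1}{2 n_{samples}} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$$

so a larger alpha means stronger shrinkage and fewer surviving features, which is exactly what the support mask clf_lasso.coef_ != 0 reads off below.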

alpha_list = [k*10**(-i) for i in reversed(range(6)) for k in [1,3,5,8]]

acc_train = []
acc_test = []
times = []
selected_features = []

for alpha in alpha_list:
    tik = time.time()
    print('alpha:', alpha)
    clf_lasso = Lasso(alpha=alpha)
    acc_train_alpha = []
    acc_test_alpha = []
    for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
        X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
        Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
        
        clf_lasso.fit(X_train, Y_train)
        
        support = (clf_lasso.coef_!=0.0)
        print('New shape of train data for fold',fold, ':', X_train[:, support].shape)
        selected_features.append([i for i, val in enumerate(support) if val])
        
        preds_lasso_train = (clf_lasso.predict(X_train)>0.5)
        acc_train_alpha.append(round(((Y_train==preds_lasso_train).sum()/len(Y_train)),2))
        preds_lasso_test = (clf_lasso.predict(X_test)>0.5)
        acc_test_alpha.append(round(((Y_test==preds_lasso_test).sum()/len(Y_test)),2))
    tok = time.time()
    times.append( round((tok-tik)/kfold.n_splits,2))
    acc_train.append(np.mean(acc_train_alpha))
    acc_test.append(np.mean(acc_test_alpha))

selected_features=sum(selected_features, [])
plt.plot(alpha_list, acc_test)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Accuracy')
plt.title('Test accuracy with Logistic Regression+LASSO Feature Selection')
(Figure: test accuracy vs. alpha for Logistic Regression + LASSO feature selection)
print('Maximal test accuracy with Logistic Regression + LASSO:', round(np.max(acc_test),2), 'achieved at alpha =', alpha_list[np.argmax(acc_test)])
comparison_dict['Lasso']=[round(np.max(acc_test),2),times[np.argmax(acc_test)]]
# Maximal test accuracy with Logistic Regression + LASSO: 0.73 achieved at alpha = 0.0005

y, x, _ = plt.hist(selected_features, bins=range(16))
print('Most frequently selected features:', np.argpartition(y, -5)[-5:])
plt.savefig("Sel_feat_lasso")
plt.show()
#Most frequently selected features: [ 7 13  1  9 10]

Feature Selection LASSO+SVC

C_list = [k*10**(i) for i in range(-2,4) for k in [1,3,5,8]]

acc_train = []
acc_test = []
times = []
selected_features = []

for C in C_list:
    tik = time.time()   
    print('C:', C)
    clf_svc = LinearSVC(C=C, penalty='l1', dual=False)
    acc_train_C = []
    acc_test_C = []
    
    for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
        X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
        Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
        
        clf_svc.fit(X_train, Y_train)
        
        support = (clf_svc.coef_!=0.0).flatten()
        print('New shape of train data for fold',fold, ':', X_train[:, support].shape)
        selected_features.append([i for i, val in enumerate(support) if val])
        
        preds_svc_train = clf_svc.predict(X_train)
        acc_train_C.append(round(((Y_train==preds_svc_train).sum()/len(Y_train)),2))
        preds_svc_test = clf_svc.predict(X_test)
        acc_test_C.append(round(((Y_test==preds_svc_test).sum()/len(Y_test)),2))
        
    acc_train.append(np.mean(acc_train_C))
    acc_test.append(np.mean(acc_test_C))
    tok = time.time()
    times.append(round((tok-tik)/kfold.n_splits,2))
    
selected_features=sum(selected_features, [])
plt.plot(C_list, acc_test)
plt.xscale('log')
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title('Test accuracy with LASSO+SVC Feature Selection')
print('Maximal test accuracy with Linear SVM + LASSO:', round(np.max(acc_test),2), 'achieved at C =', C_list[np.argmax(acc_test)])
comparison_dict['Lasso+LinearSVC']=[round(np.max(acc_test),2), times[np.argmax(acc_test)]]
# Maximal test accuracy with Linear SVM + LASSO: 0.72392 achieved at C = 8
y, x, _ = plt.hist(selected_features, bins=range(16))
print('Most frequently selected features:', np.argpartition(y, -5)[-5:])
# Most frequently selected features: [13 10  1  9  7]

Feature Selection ElasticNet
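
Elastic Net mixes the L1 and L2 penalties. With l1_ratio=0.7 as in the code below, the penalty is mostly L1, so it still zeroes out coefficients, but the L2 part makes the selection more stable when features are correlated. sklearn's ElasticNet minimizes

$$\min_w \; \frac{1}{2 n_{samples}} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2$$

where $\rho$ is the l1_ratio parameter.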

alpha_list = [k*10**(-i) for i in reversed(range(6)) for k in [1,3,5,8]]

acc_train = []
acc_test = []
times = []
selected_features = []

for alpha in alpha_list:
    tik = time.time()
    print('alpha:', alpha)
    clf_en = ElasticNet(alpha=alpha, l1_ratio=0.7)
    acc_train_alpha = []
    acc_test_alpha = []
    for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X, cirrhosis_y)):
        X_train, X_test = cirrhosis_X[train_ids], cirrhosis_X[test_ids]
        Y_train, Y_test = cirrhosis_y[train_ids], cirrhosis_y[test_ids]
        
        clf_en.fit(X_train, Y_train)
        
        support = (clf_en.coef_!=0.0)
        print('New shape of train data for fold',fold, ':', X_train[:, support].shape)
        selected_features.append([i for i, val in enumerate(support) if val])
        
        preds_lasso_train = (clf_en.predict(X_train)>0.5)
        acc_train_alpha.append(round(((Y_train==preds_lasso_train).sum()/len(Y_train)),2))
        preds_lasso_test = (clf_en.predict(X_test)>0.5)
        acc_test_alpha.append(round(((Y_test==preds_lasso_test).sum()/len(Y_test)),2))
    tok = time.time() 
    acc_train.append(np.mean(acc_train_alpha))
    acc_test.append(np.mean(acc_test_alpha))
    times.append( round((tok-tik)/kfold.n_splits,2))

selected_features=sum(selected_features, [])
plt.plot(alpha_list, acc_test)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Accuracy')
plt.title('Test accuracy with ElasticNet Feature Selection')
print('Maximal test accuracy with Elastic Net:', np.max(acc_test), 'achieved at alpha =', alpha_list[np.argmax(acc_test)])
comparison_dict['Elasticnet']=[round(np.max(acc_test),2), times[np.argmax(acc_test)]]
# Maximal test accuracy with Elastic Net: 0.732 achieved at alpha = 0.008
y, x, _ = plt.hist(selected_features, bins=range(16))
print('Most frequently selected features:', np.argpartition(y, -5)[-5:])
plt.savefig("Sel_feat_Elastic")
plt.show()
# Most frequently selected features: [ 7 13  1  9 10]

Summing up:

  • LASSO + logistic regression, L1-penalized linear SVC, and Elastic Net all reach a maximum accuracy of 0.73
  • Most frequently selected features: age, cholesterol, copper, alkaline phosphatase, platelets
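
The runs above report the selected features as column indices. Mapping them back to names (a quick check; it assumes the column order that results from the ID/N_Days/Status/Hepatomegaly drops earlier in the post) confirms the list in the bullet:

feature_names = cirrhosis.drop(columns="Stage").columns.values
print(feature_names[[1, 7, 9, 10, 13]])
# ['Age' 'Cholesterol' 'Copper' 'Alk_Phos' 'Platelets']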

Ensemble Methods

  • Split the full dataset into 80% training and 20% test data; run 5-fold validation on the training set and evaluate the refined approach on the held-out test set;
  • Train 5 different XGBoost classifiers, one per fold of the 5-fold split;
  • Use the best-performing parameters determined earlier by grid search;
  • The final model predicts 1 if 3 or more of the classifiers predict cirrhosis, and 0 otherwise;
  • Best accuracy so far: 0.77;
  • Weighted majority voting (implemented below with mlxtend's EnsembleVoteClassifier) reaches 80% accuracy when the best-performing classifier is given the largest weight.

Majority Vote on XGBoost

To improve on plain XGBoost, five XGBoost models are trained on five different training folds and collected. A majority vote over the five trained classifiers is then evaluated on a separate test set that none of the models saw during training.

# Train-Test-Split to be used in majority vote
cirrhosis_X_train, cirrhosis_X_test, cirrhosis_y_train, cirrhosis_y_test = train_test_split(cirrhosis_X, cirrhosis_y, test_size=0.2, random_state=1)

acc_train = []
acc_test = []

max_depth_list = [2, 3, 4, 5, 10, 15, 20, 50, 100, 200]
lr_list = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]

model_list = []
tik=time.time()
for fold, (train_ids, test_ids) in enumerate(kfold.split(cirrhosis_X_train, cirrhosis_y_train)):
    X_train, X_test = cirrhosis_X_train[train_ids], cirrhosis_X_train[test_ids]
    Y_train, Y_test = cirrhosis_y_train[train_ids], cirrhosis_y_train[test_ids]
    
    clf = XGBClassifier(max_depth=2, eta=0.1)
    clf.fit(X_train, Y_train)
    model_list.append(clf)
    
    print('Train accuracy:', round(clf.score(X_train, Y_train),2))
    print('Test accuracy:', round(clf.score(X_test, Y_test),2))
    
    acc_train.append(clf.score(X_train, Y_train))
    acc_test.append(clf.score(X_test, Y_test))
final_acc_test = round(np.mean(acc_test),2)
final_acc_train = round(np.mean(acc_train),2)

print('Average train accuracy over 5 folds:', final_acc_train)
print('Average test accuracy over 5 folds:', final_acc_test)

# Majority vote: predict cirrhosis (1) if at least 3 of the 5 classifiers do
all_preds = np.stack([model.predict(cirrhosis_X_test) for model in model_list])
majority_vote = (all_preds.sum(axis=0) >= 3).astype(int)
test_acc=round(sum(majority_vote==cirrhosis_y_test)/len(cirrhosis_y_test),2)
tok=time.time()

comparison_dict['MajorityVoteXGB'] = [test_acc, round(tok-tik,2)]
print('Accuracy on separate test set of majority vote:', comparison_dict['MajorityVoteXGB'][0])

Ensemble Vote Classifier on XGBoost

This is an implementation of majority voting that allows the weight of each model to be tuned. It produced the best predictions so far.
https://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/

from mlxtend.classifier import EnsembleVoteClassifier

tik=time.time()
eclf = EnsembleVoteClassifier(clfs=model_list, weights=[1,1,2,1,0.5])
eclf.fit(cirrhosis_X_train, cirrhosis_y_train)
eclf.predict(cirrhosis_X_test)
eclf.score(cirrhosis_X_test, cirrhosis_y_test)
tok=time.time()
comparison_dict['ECLF_XGBoost'] = [round(eclf.score(cirrhosis_X_test, cirrhosis_y_test),2), round(tok-tik,2)]
print('Accuracy on separate test set of majority vote:', comparison_dict['ECLF_XGBoost'][0])
# Accuracy on separate test set of majority vote: 0.8

Method Comparison

# Comparison plot: accuracy bars with a second y-axis for computation time
fig=plt.figure(figsize=(30, 15))
ax1 = fig.add_subplot(111)
ax2=ax1.twinx()

dict_sort=dict(sorted(comparison_dict.items(), key=lambda item: item[1][0]))
methods = list(dict_sort.keys())
accuracies = [item[0] for item in dict_sort.values()]
times = [item[1] for item in dict_sort.values()]
p1=ax1.bar(methods, accuracies,alpha=.5,label="Accuracy")
p2=ax2.plot(methods, times,lw=5,color="r",label="Computational time")[0]
ax1.set_ylabel("Accuracies",fontsize=25)
ax1.set_ylim(0.6)
ax2.set_ylabel("Computational times",fontsize=25)
ax1.set_xticklabels(methods, rotation=45, ha='right',fontsize=25,rotation_mode='anchor')
labs=[p1.get_label(),p2.get_label()]
fig.legend([p1,p2],   
           labels=labs,   
           loc="upper center",   
           prop={'size': 25}
           )

plt.tight_layout() 
plt.show()
fig.savefig("accuracy_comparison.png")

Summary

  • Interpretable models such as penalized logistic regression perform worse, but are very fast;
  • The best models are the ensembles (random forest, boosting, ensemble voting); they take longer to train and are essentially black boxes;
  • Other papers report that better accuracy is achievable on different, more complete datasets (see [1], on the Indian liver-patient dataset, which only distinguishes "liver disease" from "no liver disease"); methods such as decision trees, random forests, and SVMs are typically used there and can reach accuracies around 90%;
  • With a larger and more complete dataset, ensemble methods could come close to the accuracy of [1] for predicting cirrhosis. Ensemble methods are slow, though, which may matter for very large datasets.

[1] P. M. Dattatreya et al. "Machine Learning Techniques in Analysis and Prediction of Liver Disease". In: IJIRT 8.2 (2021).

Appendix


KPCA

LDA and QDA

AdaBoost

GBDT

XGBoost

Parameter tuning references