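Most data mining enthusiasts have heard of the Kaggle competition platform. Compared with the Tianchi big data platform in China, Kaggle hosts more projects, and they tend to be more lightweight, which makes them better suited to beginners who lack powerful hardware.
Today I will walk through how to reach the top 10% of Kaggle's entry-level competition, Titanic survival prediction, with only a partial understanding of machine learning algorithms.
The Kernels section of Kaggle hosts many shared solutions, written in either R or Python, and they are worth browsing if you are interested. The vast majority of the Python ones are written as Jupyter Notebooks.
1. Data overview
The Titanic survival prediction competition provides two data files, train.csv and test.csv, which are the training set and the test set respectively.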
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
train_data = pd.read_csv('I://model/titanic/train.csv')
test_data = pd.read_csv('I://model/titanic/test.csv')
train_data.info()
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
Survival ratio:
train_data['Survived'].value_counts().plot.pie(autopct='%1.1f%%')
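The same ratio can also be read off as numbers rather than from the pie chart. A minimal sketch, assuming a small made-up `Survived` column in place of train.csv (the `toy` frame is illustrative, not the competition data):

```python
import pandas as pd

# Hypothetical stand-in for train_data['Survived'] (not the real data)
toy = pd.DataFrame({'Survived': [0, 0, 0, 1, 1]})

# normalize=True returns fractions of the total instead of raw counts
ratio = toy['Survived'].value_counts(normalize=True)
print(ratio)
```

On the real data the same call would be made on `train_data['Survived']`.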
2. Relationship analysis
(1) Sex and survival
train_data.groupby(['Sex','Survived'])['Survived'].count()
Sex Survived
female 0 81
1 233
male 0 468
1 109
Name: Survived, dtype: int64
train_data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar()
(2) Passenger class and survival
train_data[['Pclass','Survived']].groupby(['Pclass']).mean().plot.bar(color=['r','g','b'])
train_data[['Sex','Pclass','Survived']].groupby(['Pclass','Sex']).mean().plot.bar()
train_data.groupby(['Sex','Pclass','Survived'])['Survived'].count()
Sex Pclass Survived
female 1 0 3
1 91
2 0 6
1 70
3 0 72
1 72
male 1 0 77
1 45
2 0 91
1 17
3 0 300
1 47
Name: Survived, dtype: int64
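The chart and table above show clearly that, although the evacuation of the Titanic broadly followed "women first", the outcome still differed by passenger class, and some first-class men used their social standing to force their way into lifeboats. For example, Bruce Ismay, chairman of the White Star Line (who had rejected the idea of fitting 48 lifeboats on the grounds that fewer would do), abandoned his passengers, his crew, and his ship, and at the last moment jumped into collapsible lifeboat C (which carried 39 passengers).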
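The survival rates behind that observation can be worked out from the counts in the table. A small sketch that re-enters those twelve counts by hand (the variable names are my own) and divides survivors by group size:

```python
import pandas as pd

# Counts copied from the groupby output above, in the same row order:
# (Sex, Pclass, Survived) -> number of passengers
counts = pd.Series(
    [3, 91, 6, 70, 72, 72, 77, 45, 91, 17, 300, 47],
    index=pd.MultiIndex.from_product(
        [['female', 'male'], [1, 2, 3], [0, 1]],
        names=['Sex', 'Pclass', 'Survived']),
)

# Survivors divided by all passengers in each (Sex, Pclass) group
survived = counts.xs(1, level='Survived')
total = counts.groupby(level=['Sex', 'Pclass']).sum()
survival_rate = (survived / total).round(3)
print(survival_rate.unstack('Pclass'))
```

With these counts, first-class women survive at about 97% and third-class women at 50%, while first-class men (about 37%) fare better than second-class and third-class men (about 16% and 14%).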
(3) Age and survival
f, ax = plt.subplots(1, 2, figsize=(18, 8))
# Recent seaborn releases require x and y to be passed as keyword arguments
sns.violinplot(x="Pclass", y="Age", hue="Survived", data=train_data, split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0, 110, 10))
sns.violinplot(x="Sex", y="Age", hue="Survived", data=train_data, split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0, 110, 10))
plt.show()
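(4) Title and survival
The Name field contains each passenger's title, such as Mr, Miss, or Mrs. These titles carry information about age and sex, and titles such as Dr, Lady, Major, and Master may also reflect social status.
This feature is awkward to show in a chart, but it is added to the feature set during feature engineering.
(5) Port of embarkation and survival
The Titanic sailed from Southampton in England and called at Cherbourg in France and Queenstown in Ireland; some passengers who disembarked at Cherbourg or Queenstown escaped the disaster.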
sns.countplot(x='Embarked', hue='Survived', data=train_data)  # x must be a keyword argument in recent seaborn
plt.title('Embarked and Survived')
(6) Family members aboard and survival
f,ax=plt.subplots(1,2,figsize=(18,8))
train_data[['Parch','Survived']].groupby(['Parch']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Parch and Survived')
train_data[['SibSp','Survived']].groupby(['SibSp']).mean().plot.bar(ax=ax[1])
ax[1].set_title('SibSp and Survived')
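The charts show that passengers travelling alone had a low survival rate, but having too many relatives aboard was also dangerous, because it was hard to look after everyone.
(7) Other factors
The remaining fields are the fare, the cabin number, and the ticket number. All three may affect a passenger's location on the ship and therefore the order of escape, but none of them shows an obvious relationship with survival, so they are left for the models to weigh during the model ensembling stage.
3. Feature engineering
First, combine train and test so that feature engineering is applied to both together: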
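train_data_org = pd.read_csv('train.csv')
test_data_org = pd.read_csv('test.csv')
test_data_org['Survived'] = 0  # placeholder so that both frames share the same columns
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the original row labels, as append did
combined_train_test = pd.concat([train_data_org, test_data_org])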
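Feature engineering means extracting, from the raw fields, the features that may influence the final outcome, so that the model can base its predictions on them. It is usually best to start with the fields that contain missing values (NaN).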
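To decide where to start, it helps to list the fields that actually contain NaN. A minimal sketch on a made-up frame (`toy` and its values are illustrative only); on the real data the same two lines would be applied to `combined_train_test`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the combined data
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 8.05],
    'Sex': ['male', 'female', 'female', 'male'],
})

# Count NaN per column and keep only the columns that have any
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```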
(1) Embarked
Fill the missing values first, using the mode of Embarked:
if combined_train_test['Embarked'].isnull().sum() != 0:
    combined_train_test['Embarked'] = combined_train_test['Embarked'].fillna(combined_train_test['Embarked'].mode().iloc[0])
Then split the three embarkation ports into three columns, each containing only 0 and 1:
emb_dummies_df = pd.get_dummies(combined_train_test['Embarked'],prefix=combined_train_test[['Embarked']].columns[0])
combined_train_test = pd.concat([combined_train_test, emb_dummies_df], axis=1)
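To see what these two steps produce, here is a small sketch on five made-up Embarked values (the `toy` frame is an assumption for illustration, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical Embarked column with one missing value
toy = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', 'Q']})

# Fill the gap with the most frequent port
most_common = toy['Embarked'].mode().iloc[0]
toy['Embarked'] = toy['Embarked'].fillna(most_common)

# One indicator column per port; each row has exactly one of them set
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked')
print(dummies.astype(int))
```

Depending on the pandas version, `get_dummies` returns 0/1 integers or booleans; the `astype(int)` call is only there to make the printout uniform.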
(2) Sex
There are no missing values, so it is one-hot encoded directly:
sex_dummies_df = pd.get_dummies(combined_train_test['Sex'], prefix=combined_train_test[['Sex']].columns[0])
combined_train_test = pd.concat([combined_train_test, sex_dummies_df], axis=1)
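(3) Name
Extract the title from the name:
# The pattern captures the text between the comma and the first period; expand=False returns a Series so .str.strip() can be chained
combined_train_test['Title'] = combined_train_test['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
Unify the various titles: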
title_Dict = {}
title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
title_Dict.update(dict.fromkeys(['Jonkheer', 'Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
title_Dict.update(dict.fromkeys(['Master'], 'Master'))
combined_train_test['Title'] = combined_train_test['Title'].map(title_Dict)
One-hot encode:
title_dummies_df = pd.get_dummies(combined_train_test['Title'], prefix=combined_train_test[['Title']].columns[0])
combined_train_test = pd.concat([combined_train_test, title_dummies_df], axis=1)
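The title extraction and grouping can be tried on a few names in the dataset's "Surname, Title. Given names" format. This is only a sketch: the regular expression is one possible way to capture the title, and `title_map` is a three-entry subset of the dictionary above.

```python
import pandas as pd

# Sample names in the same format as the Name column
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture everything between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# Subset of the grouping dictionary, enough for these three names
title_map = {'Mr': 'Mr', 'Miss': 'Miss', 'the Countess': 'Royalty'}
grouped = titles.map(title_map)
print(pd.DataFrame({'title': titles, 'group': grouped}))
```

Any title that is missing from the dictionary maps to NaN, so it is worth checking `grouped.isnull().sum()` after mapping the full data.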
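(4) Fare
Fill the NaN values with the mean fare of the corresponding passenger class (first, second, or third).
if combined_train_test['Fare'].isnull().sum() != 0:
    # Restrict the group mean to the Fare column so that text columns are not averaged
    combined_train_test['Fare'] = combined_train_test['Fare'].fillna(combined_train_test.groupby('Pclass')['Fare'].transform('mean'))
The Titanic sold family or group tickets (this can be seen by analysing the Ticket numbers), so the fare of a group ticket needs to be divided among the passengers who share it.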
combined_train_test['Group_Ticket'] = combined_train_test['Fare'].groupby(by=combined_train_test['Ticket']).transform('count')
combined_train_test['Fare'] = combined_train_test['Fare'] / combined_train_test['Group_Ticket']
combined_train_test.drop(['Group_Ticket'], axis=1, inplace=True)
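The effect of the group-ticket split is easiest to see on a tiny example. A sketch with invented ticket numbers and fares (`toy` and `Fare_Per_Person` are illustrative names, not columns from the original code):

```python
import pandas as pd

# Three passengers share ticket A1; one passenger travels alone on B2
toy = pd.DataFrame({
    'Ticket': ['A1', 'A1', 'A1', 'B2'],
    'Fare': [30.0, 30.0, 30.0, 8.0],
})

# transform('count') repeats each ticket's group size on every row of that group
group_size = toy.groupby('Ticket')['Fare'].transform('count')
toy['Fare_Per_Person'] = toy['Fare'] / group_size
print(toy)
```

This assumes, as the code above does, that the recorded Fare is the price of the whole ticket rather than a per-person price.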
Bin the fare into categories:
def fare_category(fare):
    if fare <= 4:
        return 0
    elif fare <= 10:
        return 1
    elif fare <= 30:
        return 2
    elif fare <= 45:
        return 3
    else:
        return 4
combined_train_test['Fare_Category'] = combined_train_test['Fare'].map(fare_category)
One-hot encode (this step is optional for this feature):
fare_cat_dummies_df = pd.get_dummies(combined_train_test['Fare_Category'],prefix=combined_train_test[['Fare_Category']].columns[0])
combined_train_test = pd.concat([combined_train_test, fare_cat_dummies_df], axis=1)
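As an aside, the same binning can be expressed with `pd.cut`, which some readers may find easier to adjust. This is an alternative sketch rather than the code used for the result in this post; the sample fares are made up, and the bin edges mirror the thresholds in `fare_category`.

```python
import numpy as np
import pandas as pd

def fare_category(fare):
    # Same thresholds as the function above
    if fare <= 4:
        return 0
    elif fare <= 10:
        return 1
    elif fare <= 30:
        return 2
    elif fare <= 45:
        return 3
    else:
        return 4

# Made-up fares, chosen to sit on and around the bin edges
fares = pd.Series([0.0, 4.0, 4.5, 10.0, 29.9, 30.0, 45.0, 45.1, 512.33])

by_function = fares.map(fare_category)
# Right-closed intervals (-inf, 4], (4, 10], ... reproduce the <= comparisons
by_cut = pd.cut(fares, bins=[-np.inf, 4, 10, 30, 45, np.inf],
                labels=[0, 1, 2, 3, 4]).astype(int)
print(pd.DataFrame({'fare': fares, 'function': by_function, 'cut': by_cut}))
```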
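(5) Pclass
Pclass itself needs no further processing. To make better use of it, we assume that the fare within each class is also related to how passengers escaped, which yields categories such as high-fare first class, low-fare first class, and so on.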
Pclass_1_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([1]).values[0]
Pclass_2_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([2]).values[0]
Pclass_3_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([3]).values[0]
# Build Pclass_Fare_Category
# Note: pclass_fare_category is a helper that is not defined in this post; see the full script linked at the end
from sklearn.preprocessing import LabelEncoder
combined_train_test['Pclass_Fare_Category'] = combined_train_test.apply(pclass_fare_category, args=(Pclass_1_mean_fare, Pclass_2_mean_fare, Pclass_3_mean_fare), axis=1)
p_fare = LabelEncoder()
p_fare.fit(np.array(['Pclass_1_Low_Fare', 'Pclass_1_High_Fare', 'Pclass_2_Low_Fare', 'Pclass_2_High_Fare', 'Pclass_3_Low_Fare', 'Pclass_3_High_Fare']))  # register every label
combined_train_test['Pclass_Fare_Category'] = p_fare.transform(combined_train_test['Pclass_Fare_Category'])  # convert labels to integer codes
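The helper `pclass_fare_category` is called above but not shown. One plausible version is sketched below; it is my own reconstruction under the assumption that a fare at or below the class mean counts as "Low" and anything above it as "High", and it may differ from the helper in the full script. The `toy` frame is invented for the demonstration.

```python
import pandas as pd

def pclass_fare_category(row, pclass1_mean_fare, pclass2_mean_fare, pclass3_mean_fare):
    """Hypothetical helper: label a row by class and by fare relative to the class mean."""
    pclass = int(row['Pclass'])  # cast in case the row was upcast to float
    mean_fare = {1: pclass1_mean_fare, 2: pclass2_mean_fare, 3: pclass3_mean_fare}[pclass]
    level = 'Low' if row['Fare'] <= mean_fare else 'High'  # assumed rule
    return 'Pclass_{}_{}_Fare'.format(pclass, level)

# Made-up rows: two first-class and two third-class passengers
toy = pd.DataFrame({'Pclass': [1, 1, 3, 3], 'Fare': [20.0, 90.0, 5.0, 12.0]})
means = toy.groupby('Pclass')['Fare'].mean()

# There is no second-class row in the toy data, so 0.0 is passed as a placeholder
toy['Pclass_Fare_Category'] = toy.apply(
    pclass_fare_category, args=(means.loc[1], 0.0, means.loc[3]), axis=1)
print(toy)
```

The returned strings match the six labels that the LabelEncoder above is fitted on, which is the only constraint the surrounding code places on the helper.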
(6) Parch and SibSp
Both fields clearly affect Survived, but not in exactly the same way, so they are combined into a Family_Size feature while the two original fields are kept.
combined_train_test['Family_Size'] = combined_train_test['Parch'] + combined_train_test['SibSp'] + 1
# Note: family_size_category is a helper that is not defined in this post; it must return 'Single', 'Small_Family' or 'Large_Family'
combined_train_test['Family_Size_Category'] = combined_train_test['Family_Size'].map(family_size_category)
le_family = LabelEncoder()
le_family.fit(np.array(['Single', 'Small_Family', 'Large_Family']))
combined_train_test['Family_Size_Category'] = le_family.transform(combined_train_test['Family_Size_Category'])
fam_size_cat_dummies_df = pd.get_dummies(combined_train_test['Family_Size_Category'],
prefix=combined_train_test[['Family_Size_Category']].columns[0])
combined_train_test = pd.concat([combined_train_test, fam_size_cat_dummies_df], axis=1)
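The `family_size_category` helper is also not shown in this post. A minimal sketch follows; the cut-offs (1 person, 2 to 4 people, 5 or more) are my own assumption and should be checked against the full script before relying on them.

```python
import pandas as pd

def family_size_category(family_size):
    """Hypothetical helper: the thresholds below are assumed, not taken from the post."""
    if family_size <= 1:
        return 'Single'
    elif family_size <= 4:
        return 'Small_Family'
    else:
        return 'Large_Family'

# Family_Size = Parch + SibSp + 1, so a passenger travelling alone has size 1
sizes = pd.Series([1, 2, 4, 5, 11])
categories = sizes.map(family_size_category)
print(categories.tolist())
```

Three categories fit the observation from the charts earlier: travelling alone and travelling in a very large family both look worse than travelling in a small family.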
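(7) Age
Because Age has many missing values, it should not simply be filled with the mode or the mean. Two approaches are common. The first fills each missing value with the mean age of passengers sharing the same title (Mr, Master, Miss, and so on) in the Title field, or with the mean Age over a combination of fields (Sex, Title, Pclass). The second uses the other features to predict Age with a machine learning algorithm. This example uses the second approach.
The rows with a known Age serve as the training set, and the rows with a missing Age serve as the test set.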
missing_age_df = pd.DataFrame(combined_train_test[['Age', 'Parch', 'Sex', 'SibSp', 'Family_Size', 'Family_Size_Category',
'Title', 'Fare', 'Fare_Category', 'Pclass', 'Embarked']])
missing_age_df = pd.get_dummies(missing_age_df,columns=['Title', 'Family_Size_Category', 'Fare_Category', 'Sex', 'Pclass' ,'Embarked'])
missing_age_train = missing_age_df[missing_age_df['Age'].notnull()]
missing_age_test = missing_age_df[missing_age_df['Age'].isnull()]
Build the ensemble model:
from sklearn import ensemble, model_selection
from sklearn.linear_model import LinearRegression
def fill_missing_age(missing_age_train, missing_age_test):
    missing_age_test = missing_age_test.copy()  # work on a copy rather than a slice of the caller's frame
    missing_age_X_train = missing_age_train.drop(['Age'], axis=1)
    missing_age_Y_train = missing_age_train['Age']
    missing_age_X_test = missing_age_test.drop(['Age'], axis=1)
    # Model 1: gradient boosting
    gbm_reg = ensemble.GradientBoostingRegressor(random_state=42)
    gbm_reg_param_grid = {'n_estimators': [2000], 'max_depth': [3], 'learning_rate': [0.01], 'max_features': [3]}
    gbm_reg_grid = model_selection.GridSearchCV(gbm_reg, gbm_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
    gbm_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best GB Params:' + str(gbm_reg_grid.best_params_))
    print('Age feature Best GB Score:' + str(gbm_reg_grid.best_score_))
    print('GB Train Error for "Age" Feature Regressor:' + str(gbm_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test['Age_GB'] = gbm_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_GB'][:4])
    # Model 2: linear regression
    lrf_reg = LinearRegression()
    # The 'normalize' option was removed in scikit-learn 1.2, so only fit_intercept is searched
    lrf_reg_param_grid = {'fit_intercept': [True]}
    lrf_reg_grid = model_selection.GridSearchCV(lrf_reg, lrf_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
    lrf_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best LR Params:' + str(lrf_reg_grid.best_params_))
    print('Age feature Best LR Score:' + str(lrf_reg_grid.best_score_))
    print('LR Train Error for "Age" Feature Regressor' + str(lrf_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test['Age_LRF'] = lrf_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_LRF'][:4])
    # Use the row-wise mean of the two models' predictions as the final estimate
    # (np.mean without an axis would collapse everything into a single number)
    missing_age_test['Age'] = missing_age_test[['Age_GB', 'Age_LRF']].mean(axis=1)
    print(missing_age_test['Age'][:4])
    missing_age_test = missing_age_test.drop(columns=['Age_GB', 'Age_LRF'])
    return missing_age_test
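Fill in Age:
# Assign only the predicted Age column; .values bypasses index alignment, because the combined frame has repeated row labels
combined_train_test.loc[combined_train_test['Age'].isnull(), 'Age'] = fill_missing_age(missing_age_train, missing_age_test)['Age'].values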
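For comparison, the first approach mentioned at the start of this section (filling by the mean age of each title) takes only a couple of lines. This is a sketch of that alternative, not the method used for the result in this post; the `toy` data and the `Age_Filled` column name are made up.

```python
import numpy as np
import pandas as pd

# Made-up passengers; the only Miss has no recorded age at all
toy = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Mr', 'Master', 'Master', 'Miss'],
    'Age': [30.0, 40.0, np.nan, 4.0, np.nan, np.nan],
})

# Mean age of each title, repeated on every row of that title
title_mean_age = toy.groupby('Title')['Age'].transform('mean')

# Fill by title first, then fall back to the overall mean for titles with no known age
toy['Age_Filled'] = toy['Age'].fillna(title_mean_age).fillna(toy['Age'].mean())
print(toy)
```

The fallback matters for rare titles: a group with no known ages has a NaN mean, so the first `fillna` alone would leave those rows empty.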
(8) Ticket
Split the letters and the digits in Ticket into two fields, Ticket_Letter and Ticket_Number.
combined_train_test['Ticket_Letter'] = combined_train_test['Ticket'].str.split().str[0]
combined_train_test['Ticket_Letter'] = combined_train_test['Ticket_Letter'].apply(lambda x:np.nan if x.isnumeric() else x)
# Tickets that are not purely numeric become NaN and are then set to 0
combined_train_test['Ticket_Number'] = pd.to_numeric(combined_train_test['Ticket'], errors='coerce').fillna(0)
combined_train_test = pd.get_dummies(combined_train_test,columns=['Ticket','Ticket_Letter'])
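Here is what the split does to a few ticket strings. The sample values are written in the style of the Ticket column, and the variable names are my own; treat it as a sketch of the idea rather than a copy of the code above.

```python
import pandas as pd

tickets = pd.Series(['A/5 21171', 'PC 17599', '113803', 'STON/O2. 3101282'])

# First whitespace-separated token; keep it only when it is not purely numeric
first_token = tickets.str.split().str[0]
ticket_letter = first_token.where(~first_token.str.isnumeric())

# Whole-string conversion: any ticket that carries a prefix becomes NaN, then 0
ticket_number = pd.to_numeric(tickets, errors='coerce').fillna(0)

print(pd.DataFrame({'ticket': tickets, 'letter': ticket_letter, 'number': ticket_number}))
```

Note that under this scheme Ticket_Number is 0 for every ticket that has a prefix, so the field only carries information for purely numeric tickets.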
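(9) Cabin
Cabin has too many missing values, so it can only be modelled through whether a cabin is recorded; in the encoding below, passengers without a Cabin get 0 in every Cabin column.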
combined_train_test['Cabin_Letter'] = combined_train_test['Cabin'].apply(lambda x:str(x)[0] if pd.notnull(x) else x)
combined_train_test = pd.get_dummies(combined_train_test,columns=['Cabin','Cabin_Letter'])
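If you want the "has a cabin or not" signal as a single explicit column, it can be sketched as follows. `Has_Cabin` is a column name I made up for illustration; the code above does not create it, and the cabin values are examples.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Cabin': ['C85', np.nan, 'E46', np.nan]})

# 1 when a cabin is recorded, 0 when it is missing
toy['Has_Cabin'] = toy['Cabin'].notnull().astype(int)

# Deck letter; stays NaN where the cabin is missing
toy['Cabin_Letter'] = toy['Cabin'].str[0]
print(toy)
```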
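Once this is done, split the data back into train and test:
# Assumption: the raw text columns are dropped here because scikit-learn estimators need numeric input; their encoded versions were added above
combined_train_test = combined_train_test.drop(['Name', 'Sex', 'Embarked', 'Title'], axis=1)
train_data = combined_train_test[:891].copy()
test_data = combined_train_test[891:].copy()
titanic_train_data_X = train_data.drop(['Survived'], axis=1)
titanic_train_data_Y = train_data['Survived']
titanic_test_data_X = test_data.drop(['Survived'], axis=1)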
4. Model ensembling
Model ensembling proceeds in the following steps:
(1) Use several models to select the more important features:
from sklearn import ensemble, model_selection
def get_top_n_features(titanic_train_data_X, titanic_train_data_Y, top_n_features):
    # Random forest
    rf_est = ensemble.RandomForestClassifier(random_state=42)
    rf_param_grid = {'n_estimators': [500], 'min_samples_split': [2, 3], 'max_depth': [20]}
    rf_grid = model_selection.GridSearchCV(rf_est, rf_param_grid, n_jobs=25, cv=10, verbose=1)
    rf_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    # Sort the features by importance
    feature_imp_sorted_rf = pd.DataFrame({'feature': list(titanic_train_data_X), 'importance': rf_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
    print('Sample 25 Features from RF Classifier')
    print(str(features_top_n_rf[:25]))
    # AdaBoost
    ada_est = ensemble.AdaBoostClassifier(random_state=42)
    ada_param_grid = {'n_estimators': [500], 'learning_rate': [0.5, 0.6]}
    ada_grid = model_selection.GridSearchCV(ada_est, ada_param_grid, n_jobs=25, cv=10, verbose=1)
    ada_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    # Sort the features by importance
    feature_imp_sorted_ada = pd.DataFrame({'feature': list(titanic_train_data_X), 'importance': ada_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']
    # Extra trees
    et_est = ensemble.ExtraTreesClassifier(random_state=42)
    et_param_grid = {'n_estimators': [500], 'min_samples_split': [3, 4], 'max_depth': [15]}
    et_grid = model_selection.GridSearchCV(et_est, et_param_grid, n_jobs=25, cv=10, verbose=1)
    et_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    # Sort the features by importance
    feature_imp_sorted_et = pd.DataFrame({'feature': list(titanic_train_data_X), 'importance': et_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_et = feature_imp_sorted_et.head(top_n_features)['feature']
    print('Sample 25 Features from ET Classifier:')
    print(str(features_top_n_et[:25]))
    # Merge the top features selected by the three models and drop duplicates
    features_top_n = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_et], ignore_index=True).drop_duplicates()
    return features_top_n
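The idea behind `get_top_n_features` can be tried quickly on synthetic data without any grid search. The sketch below is a simplified illustration: the data, the `top_n` helper, and the small forest sizes are all my own choices, and only two of the three model types are used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic data: only f0 and f1 determine the label, f2..f5 are noise
rng = np.random.RandomState(42)
X = pd.DataFrame(rng.rand(200, 6), columns=['f0', 'f1', 'f2', 'f3', 'f4', 'f5'])
y = (X['f0'] + X['f1'] > 1.0).astype(int)

def top_n(model, X, y, n):
    """Fit the model and return the names of its n most important features."""
    model.fit(X, y)
    ranked = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
    return pd.Series(ranked.head(n).index)

top_rf = top_n(RandomForestClassifier(n_estimators=50, random_state=42), X, y, 3)
top_et = top_n(ExtraTreesClassifier(n_estimators=50, random_state=42), X, y, 3)

# Union of both selections, in order of first appearance
features_top_n = pd.concat([top_rf, top_et], ignore_index=True).drop_duplicates()
print(features_top_n.tolist())
```

Because the result is a union, it can contain more than n names when the models disagree, which is why the merged list is deduplicated before it is used to index the columns.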
(2) Select the training and test columns from the chosen features
feature_to_pick = 250
feature_top_n = get_top_n_features(titanic_train_data_X, titanic_train_data_Y, feature_to_pick)
# Dropping Ticket_Number turned out to improve the result; errors='ignore' covers the case where it was not selected
titanic_train_data_X = titanic_train_data_X[feature_top_n].drop(columns=['Ticket_Number'], errors='ignore')
titanic_test_data_X = titanic_test_data_X[feature_top_n].drop(columns=['Ticket_Number'], errors='ignore')
(3) Build the final prediction model with VotingClassifier
rf_est = ensemble.RandomForestClassifier(n_estimators = 750, criterion = 'gini', max_features = 'sqrt',
max_depth = 3, min_samples_split = 4, min_samples_leaf = 2,
n_jobs = 50, random_state = 42, verbose = 1)
gbm_est = ensemble.GradientBoostingClassifier(n_estimators=900, learning_rate=0.0008, loss='exponential',
min_samples_split=3, min_samples_leaf=2, max_features='sqrt',
max_depth=3, random_state=42, verbose=1)
et_est = ensemble.ExtraTreesClassifier(n_estimators=750, max_features='sqrt', max_depth=35, n_jobs=50,
criterion='entropy', random_state=42, verbose=1)
voting_est = ensemble.VotingClassifier(estimators = [('rf', rf_est),('gbm', gbm_est),('et', et_est)],
voting = 'soft', weights = [3,5,2],
n_jobs = 50)
voting_est.fit(titanic_train_data_X,titanic_train_data_Y)
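P.S. If you prefer not to use VotingClassifier, you can assign your own weights to the models' outputs based on their test accuracy and use the weighted average as the final prediction. In my own tests, custom weights performed no worse than VotingClassifier.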
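A hand-rolled weighted average might look like the sketch below. The probabilities are invented for four passengers, the weights reuse the 3, 5, 2 from the VotingClassifier above, and the 0.5 threshold is my own assumption.

```python
import numpy as np

# Hypothetical P(survived) from three models for four passengers
proba_rf = np.array([0.9, 0.4, 0.2, 0.6])
proba_gbm = np.array([0.8, 0.6, 0.1, 0.4])
proba_et = np.array([0.7, 0.3, 0.4, 0.7])

weights = np.array([3, 5, 2])  # same weights as the VotingClassifier above

# Weighted mean across models (axis 0), one value per passenger
stacked = np.vstack([proba_rf, proba_gbm, proba_et])
weighted = np.average(stacked, axis=0, weights=weights)

# Assumed decision rule: predict survival when the blended probability reaches 0.5
prediction = (weighted >= 0.5).astype(int)
print(weighted, prediction)
```

In practice the three arrays would come from calling `predict_proba(...)[:, 1]` on each fitted classifier.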
(4) Predict and generate the submission file
titanic_test_data_X['Survived'] = voting_est.predict(titanic_test_data_X)
submission = pd.DataFrame({'PassengerId':test_data_org.loc[:,'PassengerId'],
'Survived':titanic_test_data_X.loc[:,'Survived']})
submission.to_csv('submission_result.csv',index=False,sep=',')
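That completes the walkthrough.
The code above draws in part on solutions shared in Kernels. Its best result was 80.8% accuracy, which ranked in the top 8%. The code is fairly long, at over 400 lines, but it considers the various factors thoroughly and its functions are written in a clean, conventional style, so it is well suited to beginners.
The code I had hacked together myself earlier was rather ugly and only reached 80.3%, ranking in the top 12%, so I will not share it here.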
Full code: https://github.com/Arctanxy/Titanic_Voting_Classifier/blob/master/VotingClassifier.py