Using the Titanic survival-prediction project on Kaggle as an example, this article shares my understanding of machine learning. Machine learning covers a lot of ground, broadly split into supervised and unsupervised learning: supervised learning typically includes classification and regression problems, while unsupervised learning includes clustering, dimensionality reduction, and so on. This article deals with a classification problem in supervised learning.
The article is organized around two topics: data preprocessing, and model selection and tuning.
(1) Data preprocessing (feature engineering)
Here, data preprocessing means the cleanup work done before the data is fed to a model. In larger projects this step is usually called feature engineering, of which preprocessing is one part. In this article it mainly covers handling missing and abnormal values, plus normalizing and discretizing the data. The main steps are shown in the code below.
1. Read the files
import numpy as np
import pandas as pd

train = pd.read_csv('desktop/titanic/train.csv')
test = pd.read_csv('desktop/titanic/test.csv')
data_full = [train, test]
2. Handle missing values
for dataset in data_full:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')  # fill with the mode
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())
    dataset.drop(['Cabin'], axis=1, inplace=True)  # too many missing values to be useful
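The fill strategy above hard-codes 'S' because it happens to be the most common port in the training data. A minimal sketch on a made-up frame (hypothetical values, not the real dataset) shows how to derive the same fill from the column's mode instead, which keeps working even if the data changes:

```python
import pandas as pd

# Toy frame mimicking two Titanic columns (hypothetical values).
df = pd.DataFrame({
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, None, 8.05, 53.1],
})

# Fill the categorical column with its mode and the numeric one with its median.
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

print(df['Embarked'].tolist())  # ['S', 'C', 'S', 'S']
print(df['Fare'].isna().sum())  # 0
```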
3. Discretize the data: bucket Age into bands
for dataset in data_full:
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4
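The chain of `.loc` assignments above can be expressed more compactly with `pd.cut`; a sketch on toy ages using the same five band boundaries:

```python
import pandas as pd

ages = pd.Series([5, 22, 40, 60, 70])
# Bins (-1,16], (16,32], (32,48], (48,64], (64,200] match the .loc chain above.
bands = pd.cut(ages, bins=[-1, 16, 32, 48, 64, 200], labels=[0, 1, 2, 3, 4]).astype(int)
print(bands.tolist())  # [0, 1, 2, 3, 4]
```

Note that `pd.cut` leaves NaN ages as NaN, just as the `.loc` comparisons do, so missing ages still need a separate fill step.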
4. Convert string columns to numeric
for dataset in data_full:
    dataset['Name'] = dataset['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)  # keep only the title
for dataset in data_full:
    dataset['Name'].replace(['Lady', 'Countess', 'Capt', 'Mlle', 'Ms', 'Mme', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare', inplace=True)
for dataset in data_full:
    dataset['Sex'] = dataset['Sex'].map({"male": 0, "female": 1})
    dataset['Embarked'] = dataset['Embarked'].map({"S": 0, "C": 1, "Q": 2})
    dataset['Fare'] = dataset['Fare'].astype(int)
for dataset in data_full:
    dataset['Name'] = dataset['Name'].map({"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5})
5. Fill the remaining empty titles
for dataset in data_full:
    dataset['Name'] = dataset['Name'].fillna(5)  # titles outside the map became NaN; treat them as 'Rare'
6. Convert the data type
for dataset in data_full:
    dataset['Name'] = dataset['Name'].astype(int)
7. Drop unused columns and fill the remaining Age gaps
for dataset in data_full:
    dataset.drop(['Ticket'], axis=1, inplace=True)
train['Age'] = train['Age'].fillna(value=train['Age'].median())
test['Age'] = test['Age'].fillna(value=train['Age'].median())  # use the train median to avoid leaking test-set statistics
8. Merge features
for dataset in data_full:
    dataset['SibSp'] = dataset['SibSp'] + dataset['Parch']  # combined family size aboard
for dataset in data_full:
    dataset.drop('Parch', axis=1, inplace=True)  # must be in-place; `dataset = dataset.drop(...)` only rebinds the loop variable
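A common pitfall in this last loop: writing `dataset = dataset.drop('Parch', axis=1)` inside the `for` rebinds the loop variable to a new DataFrame and leaves `train` and `test` untouched. A toy sketch (with hypothetical frames named `toy_train`/`toy_test` to keep them separate from the article's variables) showing the in-place version that actually persists:

```python
import pandas as pd

toy_train = pd.DataFrame({'SibSp': [1, 0], 'Parch': [0, 2]})
toy_test = pd.DataFrame({'SibSp': [2, 1], 'Parch': [1, 1]})

for dataset in [toy_train, toy_test]:
    dataset['SibSp'] = dataset['SibSp'] + dataset['Parch']
    # inplace=True mutates the original frame; plain reassignment would not.
    dataset.drop('Parch', axis=1, inplace=True)

print(toy_train.columns.tolist())  # ['SibSp']
print(toy_train['SibSp'].tolist())  # [1, 2]
```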
(2) Model selection and tuning
1. Import the toolkits
from sklearn.linear_model import LogisticRegression  # logistic regression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC  # support vector machine
from sklearn.naive_bayes import MultinomialNB  # naive Bayes
from sklearn.ensemble import RandomForestClassifier  # random forest
from sklearn.tree import DecisionTreeClassifier  # decision tree
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier  # gradient-boosted decision trees
from xgboost import XGBClassifier  # XGBoost, used as the meta-learner below (requires the xgboost package)
2. Implementing the stacking-fusion function
The initial training set (selected_features is the list of feature columns kept after preprocessing):
X_train = train[selected_features].values
y_train = train['Survived'].ravel()
The initial test set:
X_test = test[selected_features].values
The stacking model-fusion code:
from sklearn.model_selection import KFold
ntrain = train.shape[0]
ntest = test.shape[0]
kf = KFold(n_splits=5)
def get_oof(clf, X_train, y_train, X_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((5, ntest))
    for i, (train_index, test_index) in enumerate(kf.split(X_train)):
        kf_X_train = X_train[train_index]
        kf_y_train = y_train[train_index]
        kf_X_test = X_train[test_index]
        clf.fit(kf_X_train, kf_y_train)
        oof_train[test_index] = clf.predict(kf_X_test)  # out-of-fold predictions for the held-out train rows
        oof_test_skf[i, :] = clf.predict(X_test)  # this fold's predictions for the full test set
    oof_test[:] = oof_test_skf.mean(axis=0)  # average the five test-set predictions
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
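The key property of get_oof is that every training row is predicted exactly once, by the fold model that did not see it. A self-contained sketch on a tiny synthetic problem (random data, a single decision tree as the base model) that demonstrates the mechanics and the output shapes:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic problem: 20 train rows, 8 test rows, 3 features.
rng = np.random.RandomState(0)
X_tr, y_tr = rng.rand(20, 3), rng.randint(0, 2, 20)
X_te = rng.rand(8, 3)

kf = KFold(n_splits=5)
oof_train = np.zeros(20)
oof_test_skf = np.empty((5, 8))
for i, (tr_idx, val_idx) in enumerate(kf.split(X_tr)):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_tr[tr_idx], y_tr[tr_idx])
    oof_train[val_idx] = clf.predict(X_tr[val_idx])  # each train row predicted exactly once
    oof_test_skf[i, :] = clf.predict(X_te)
oof_test = oof_test_skf.mean(axis=0)  # average the five fold models' test predictions

print(oof_train.shape, oof_test.shape)  # (20,) (8,)
```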
#Instantiate the models
lsvc = LinearSVC()  # 1. support vector machine
lgre = LogisticRegression(max_iter=10000)  # logistic regression (instantiated but not used as a base model below)
xgbc = XGBClassifier()  # XGBoost (likewise unused here; XGBoost serves as the meta-learner instead)
dtr = ExtraTreeClassifier()  # 2. extremely randomized tree
ran = RandomForestClassifier()  # 3. random forest
ada = AdaBoostClassifier()  # 4. AdaBoost
grad = GradientBoostingClassifier()  # 5. gradient boosting
#Call the fusion function for each base model
lsvc_oof_train, lsvc_oof_test = get_oof(lsvc, X_train, y_train, X_test)
dtr_oof_train,dtr_oof_test = get_oof(dtr,X_train,y_train,X_test)
ran_oof_train,ran_oof_test = get_oof(ran,X_train,y_train,X_test)
ada_oof_train,ada_oof_test = get_oof(ada,X_train,y_train,X_test)
grad_oof_train,grad_oof_test = get_oof(grad,X_train,y_train,X_test)
The new training and test sets, built from the stacked out-of-fold predictions:
x_train1 = np.concatenate(( lsvc_oof_train,dtr_oof_train, ran_oof_train, ada_oof_train,grad_oof_train), axis=1)
x_test1 = np.concatenate((lsvc_oof_test,dtr_oof_test,ran_oof_test,ada_oof_test,grad_oof_test ), axis=1)
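Each base model contributes one column of out-of-fold predictions, and the Titanic training set has 891 rows, so the meta-learner sees an (891, 5) matrix. A quick shape check with placeholder columns:

```python
import numpy as np

# Five (n_samples, 1) columns, one per base model, stacked side by side.
cols = [np.zeros((891, 1)) for _ in range(5)]
x_train1 = np.concatenate(cols, axis=1)
print(x_train1.shape)  # (891, 5)
```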
#Parameter settings for the XGBoost meta-model
gbm = XGBClassifier(
#learning_rate = 0.02,
n_estimators= 2000,
max_depth= 4,
min_child_weight= 2,
gamma=0.9,
subsample=0.8,
colsample_bytree=0.8,
objective= 'binary:logistic',
nthread= -1,
scale_pos_weight=1)
#Fit the meta-model and predict
gbm.fit(x_train1, y_train)
pre = gbm.predict(x_test1)
# Write the submission file
pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': pre}).set_index('PassengerId').to_csv('desktop/titanic/202038a.csv')