Titanic — a Kaggle competition

The most important takeaway from this article is how to approach feature engineering. Other topics covered:

  • Filling missing values with the median
  • Converting string data into numeric types
  • The modeling workflow

Import the required libraries

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set() # setting seaborn default for plots
train = pd.read_csv("/Users/peter/data-visualization/train.csv")
test = pd.read_csv("/Users/peter/data-visualization/test.csv")

Inspect the data and check for missing values

train.head(3)
image
train.info()  # Age has many missing values (only 714 non-null)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
test.head()
image
print(train.shape)
print(test.shape)  # the test set lacks the Survived target column
(891, 12)
(418, 11)
test.info()   # Age also has missing values here
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
train.isnull().sum()  # total missing values per column
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Bar Chart for Categorical Features

  • Pclass
  • Sex
  • SibSp ( # of siblings and spouse)
  • Parch ( # of parents and children)
  • Embarked
  • Cabin
def bar_chart(feature):
    # count survived vs. dead passengers for each value of the feature
    survived = train[train['Survived'] == 1][feature].value_counts()
    dead = train[train['Survived'] == 0][feature].value_counts()
    df = pd.DataFrame([survived, dead])
    df.index = ['Survived', 'Dead']
    df.plot(kind='bar', stacked=True, figsize=(10,5))
bar_chart("Sex")
image
bar_chart('Pclass')
image
bar_chart('SibSp')
image
bar_chart('Parch')
image
# first select all survivors, then count how many of them fall into each Pclass value (1, 2, 3)
train[train['Survived']==1]['Pclass'].value_counts()
1    136
3    119
2     87
Name: Pclass, dtype: int64
train[train['Survived']==1]['Pclass']
1      1
2      3
3      1
8      3
9      2
      ..
875    3
879    1
880    2
887    1
889    1
Name: Pclass, Length: 342, dtype: int64

Feature Engineering

Feature engineering is the process of using domain knowledge of the data
to create features (feature vectors) that make machine learning algorithms work.

The core of the feature engineering here: converting the string columns in the raw data into numeric types.
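As a warm-up, here is a minimal sketch of the map-based string-to-numeric pattern that the rest of this section applies to Title, Sex and Embarked (the toy column and values below are invented purely for illustration):

# toy DataFrame, invented for illustration only
demo = pd.DataFrame({"Color": ["red", "blue", "red", "green"]})

# build a dict from string value to numeric code, then apply it with Series.map
color_mapping = {"red": 0, "blue": 1, "green": 2}
demo["Color"] = demo["Color"].map(color_mapping)
print(demo["Color"].tolist())  # [0, 1, 0, 2]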

Name

train_test_data = [train, test]  # put train and test in a list so both can be processed in one loop
for dataset in train_test_data:
    dataset["Title"] = dataset["Name"].str.extract(r'([A-Za-z]+)\.', expand=False)  # str.extract returns the first regex group match (the title before the period)
train["Title"].value_counts()  # count each title
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Major         2
Col           2
Mlle          2
Lady          1
Capt          1
Sir           1
Countess      1
Mme           1
Don           1
Jonkheer      1
Ms            1
Name: Title, dtype: int64
test["Title"].value_counts()  # 统计个数
Mr        240
Miss       78
Mrs        72
Master     21
Rev         2
Col         2
Dr          1
Ms          1
Dona        1
Name: Title, dtype: int64

Title map

  • Mr : 0
  • Miss : 1
  • Mrs: 2
  • Others: 3
title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3, "Dr": 3, 
                 "Rev": 3, "Col": 3, "Major": 3, "Mlle": 3,"Countess": 3,
                 "Ms": 3, "Lady": 3, "Jonkheer": 3, "Don": 3, "Dona" : 3,
                 "Mme": 3,"Capt": 3,"Sir": 3 }
dataset['Title']  # the Title extracted from each name (dataset still refers to test, the last item of the loop above)
0          Mr
1         Mrs
2          Mr
3          Mr
4         Mrs
        ...  
413        Mr
414      Dona
415        Mr
416        Mr
417    Master
Name: Title, Length: 418, dtype: object
for dataset in train_test_data:
    # replace each title with its numeric code from title_mapping
    dataset['Title'] = dataset['Title'].map(title_mapping)  
dataset.head(3)  # a new Title column has been appended at the end
image

image
bar_chart("Title")
image
# drop a column that is no longer needed
train.drop('Name', axis=1, inplace=True)
test.drop('Name', axis=1, inplace=True)
train.head()
image

Sex

  • male:0
  • female:1
sex_mapping = {"male": 0, "female": 1}
for dataset in train_test_data:
    dataset['Sex'] = dataset['Sex'].map(sex_mapping)  # look up each Sex value in the sex_mapping dict and write the numeric code back to the Sex column
bar_chart('Sex')
image

Age

The Age column has many missing values; fill them with the median.

Fill with a grouped median via fillna:
# fill a column's missing values with the median (fillna)
# transform is applied to the column selected after the groupby (Age) and returns a
# per-row statistic (median, max, variance, ...) aligned to the original index
train['Age'].fillna(train.groupby('Title')['Age'].transform("median"), inplace=True)  

test['Age'].fillna(test.groupby('Title')['Age'].transform("median"), inplace=True)  
train.groupby("Title")["Age"].transform("median")
0      30.0
1      35.0
2      21.0
3      35.0
4      30.0
       ... 
886     9.0
887    21.0
888    21.0
889    30.0
890    30.0
Name: Age, Length: 891, dtype: float64
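To make the behaviour of groupby(...).transform("median") concrete, here is a tiny sketch on invented data: transform returns a Series aligned to the original index in which every row carries the median of its group, which is exactly the shape fillna expects.

# invented example: two Title groups, each with one missing Age
demo = pd.DataFrame({"Title": [0, 0, 1, 1],
                     "Age":   [20.0, None, 40.0, None]})
medians = demo.groupby("Title")["Age"].transform("median")
print(medians.tolist())                      # [20.0, 20.0, 40.0, 40.0] -- aligned row by row
print(demo["Age"].fillna(medians).tolist())  # [20.0, 20.0, 40.0, 40.0]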
facet = sns.FacetGrid(train, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Age',shade= True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
 
plt.show() 
image
facet = sns.FacetGrid(train, hue="Survived", aspect=4)
facet.map(sns.kdeplot, "Age", shade=True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
plt.xlim(0, 20)   # plot the distribution segment by segment; repeat with plt.xlim(20, 30), (30, 40), (40, 60)
(0, 20)
image
train.info()   # the Age column is now fully filled
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
Title          891 non-null int64
dtypes: float64(2), int64(7), object(3)
memory usage: 83.7+ KB
How to turn a numeric attribute into a categorical variable:
# turn Age into a categorical variable
# assign a different code to each age band
for dataset in train_test_data:
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 26), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 26) & (dataset['Age'] <= 36), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 36) & (dataset['Age'] <= 62), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 62, 'Age'] = 4
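The same age bands can also be expressed with pd.cut; this is only an alternative sketch of the loop above (do not run it in addition to the loop, since Age has already been replaced by the band codes):

# alternative sketch: identical age bands via pd.cut, labelled 0-4
for dataset in train_test_data:
    dataset['Age'] = pd.cut(dataset['Age'],
                            bins=[-1, 16, 26, 36, 62, 200],
                            labels=[0, 1, 2, 3, 4]).astype(int)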
train.head()
image
bar_chart("Age")
image

Embarked

Plot the distribution of Embarked across the values of another attribute:
train[train['Pclass']==1]['Embarked']  # Embarked values for passengers with Pclass == 1
1      C
3      S
6      S
11     S
23     S
      ..
871    S
872    S
879    C
887    S
889    C
Name: Embarked, Length: 216, dtype: object
# count the Embarked values among passengers of each Pclass
Pclass1 = train[train['Pclass']==1]['Embarked'].value_counts()
Pclass2 = train[train['Pclass']==2]['Embarked'].value_counts()
Pclass3 = train[train['Pclass']==3]['Embarked'].value_counts()

df = pd.DataFrame([Pclass1, Pclass2, Pclass3])  # combine the three Series into one DataFrame
df.index = ['1st class','2nd class', '3rd class']  # label the rows
df.plot(kind='bar',stacked=True, figsize=(10,5))   # stacked bar chart
image
df
image
# fill missing Embarked values with 'S'
for dataset in train_test_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')  # fill missing values with 'S'

How do we convert the strings in this column to numbers?
embarked_mapping = {"S": 0, "C": 1, "Q": 2}
for dataset in train_test_data:
    dataset['Embarked'] = dataset['Embarked'].map(embarked_mapping)  # map the strings to their numeric codes

Fare

Fill missing values with the median.
# fill missing Fare with the median fare for each Pclass

train["Fare"].fillna(train.groupby("Pclass")["Fare"].transform("median"), inplace=True)   # fill missing Fare values with the median fare of the passenger's Pclass group
test["Fare"].fillna(test.groupby("Pclass")["Fare"].transform("median"), inplace=True)
Plot the distribution:
facet = sns.FacetGrid(train, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Fare',shade= True)
facet.set(xlim=(0, train['Fare'].max()))
facet.add_legend()
 
plt.show()  
image
Bin the Fare attribute:
for dataset in train_test_data:
    dataset.loc[ dataset['Fare'] <= 17, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 17) & (dataset['Fare'] <= 30), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 30) & (dataset['Fare'] <= 100), 'Fare'] = 2
    dataset.loc[ dataset['Fare'] > 100, 'Fare'] = 3

Cabin

train.Cabin.value_counts()
B96 B98        4
C23 C25 C27    4
G6             4
E101           3
F33            3
              ..
C62 C64        1
D28            1
D46            1
B41            1
E17            1
Name: Cabin, Length: 147, dtype: int64
for dataset in train_test_data:
    # keep only the first letter of each Cabin value
    dataset['Cabin'] = dataset['Cabin'].str[:1]   # Cabin is now a single letter
train[train['Pclass']==1]['Cabin'].value_counts()
C    59
B    47
D    29
E    25
A    15
T     1
Name: Cabin, dtype: int64
# count how often each cabin letter appears within each Pclass
Pclass1 = train[train['Pclass']==1]['Cabin'].value_counts()
Pclass2 = train[train['Pclass']==2]['Cabin'].value_counts()
Pclass3 = train[train['Pclass']==3]['Cabin'].value_counts()

# build a DataFrame with row labels and plot
df = pd.DataFrame([Pclass1, Pclass2, Pclass3])
df.index = ['1st class','2nd class', '3rd class']
df.plot(kind='bar',stacked=True, figsize=(10,5))
image
cabin_mapping = {"A": 0, "B": 0.4, 
                 "C": 0.8, "D": 1.2, 
                 "E": 1.6, "F": 2, 
                 "G": 2.4, "T": 2.8}  # 每个字母匹配不同的数字
for dataset in train_test_data:
    dataset['Cabin'] = dataset['Cabin'].map(cabin_mapping)  # 将Cabin中的字母变成数字
# fill missing Fare with median fare for each Pclass
train["Cabin"].fillna(train.groupby("Pclass")["Cabin"].transform("median"), inplace=True)
test["Cabin"].fillna(test.groupby("Pclass")["Cabin"].transform("median"), inplace=True)

FamilySize

Add a FamilySize attribute:
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1
test["FamilySize"] = test["SibSp"] + test["Parch"] + 1
facet = sns.FacetGrid(train, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'FamilySize',shade= True)
facet.set(xlim=(0, train['FamilySize'].max()))
facet.add_legend()
plt.xlim(0)
(0, 11.0)
image
train.head()  # the FamilySize column has been appended
image
family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2, 7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}
for dataset in train_test_data:
    dataset['FamilySize'] = dataset['FamilySize'].map(family_mapping)   
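Both cabin_mapping and family_mapping are evenly spaced steps of 0.4, so the same dictionaries could also be generated rather than typed out; a small equivalent sketch:

# family sizes 1..11 -> 0, 0.4, ..., 4.0 (same values as family_mapping above)
family_mapping = {size: round((size - 1) * 0.4, 1) for size in range(1, 12)}

# cabin letters A..G and T -> 0, 0.4, ..., 2.8 (same values as cabin_mapping above)
cabin_mapping = {letter: round(i * 0.4, 1) for i, letter in enumerate("ABCDEFGT")}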
Drop attributes that are no longer needed:
# drop unneeded columns
features_drop = ['Ticket', 'SibSp', 'Parch']
train = train.drop(features_drop, axis=1)
test = test.drop(features_drop, axis=1)
train = train.drop(['PassengerId'], axis=1)
train_data = train.drop('Survived', axis=1)  # the feature matrix (Survived removed)
target = train['Survived']  # the label to predict

train_data.shape, target.shape
((891, 8), (891,))
train_data.head(10)
image

Modeling

Import the models

# Importing Classifier Modules
from sklearn.neighbors import KNeighborsClassifier  # K-nearest neighbors
from sklearn.tree import DecisionTreeClassifier   # decision tree
from sklearn.ensemble import RandomForestClassifier   # random forest
from sklearn.naive_bayes import GaussianNB   # naive Bayes classifier
from sklearn.svm import SVC  # support vector machine
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Sex           891 non-null int64
Age           891 non-null float64
Fare          891 non-null float64
Cabin         891 non-null float64
Embarked      891 non-null int64
Title         891 non-null int64
FamilySize    891 non-null float64
dtypes: float64(4), int64(5)
memory usage: 62.8 KB

Cross-validation

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
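To see what the 10-fold split actually produces (a quick illustrative sketch, not part of the original notebook): the 891 rows yield one validation fold of 90 rows and nine folds of 89, with the remaining rows used for training in each iteration.

# peek at the fold sizes produced by the KFold object defined above
for fold, (train_idx, valid_idx) in enumerate(k_fold.split(train_data)):
    print(fold, len(train_idx), len(valid_idx))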
# KNN 
clf = KNeighborsClassifier(n_neighbors = 13)
scoring = 'accuracy'
score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
[0.82222222 0.76404494 0.80898876 0.83146067 0.87640449 0.82022472
 0.85393258 0.79775281 0.84269663 0.84269663]
round(np.mean(score)*100, 2)  # KNN score

82.6
# Decision Tree
clf = DecisionTreeClassifier()
scoring = 'accuracy'
score = cross_val_score(clf, train_data, 
                        target, cv=k_fold, 
                        n_jobs=1, scoring=scoring)
print(score)
[0.76666667 0.82022472 0.76404494 0.7752809  0.88764045 0.76404494
 0.83146067 0.82022472 0.74157303 0.79775281]
round(np.mean(score)*100, 2)

79.69
# Random Forest
clf = RandomForestClassifier(n_estimators=13)  # 13 estimators
scoring = 'accuracy'
score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
[0.8        0.82022472 0.79775281 0.76404494 0.86516854 0.82022472
 0.80898876 0.80898876 0.75280899 0.80898876]
round(np.mean(score)*100, 2)  # Random Forest Score

80.47
# Naive Bayes
clf = GaussianNB()
scoring = 'accuracy'
score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
[0.85555556 0.73033708 0.75280899 0.75280899 0.70786517 0.80898876
 0.76404494 0.80898876 0.86516854 0.83146067]
round(np.mean(score)*100, 2)

78.78
# SVM
clf = SVC()
scoring = 'accuracy'
score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
[0.83333333 0.80898876 0.83146067 0.82022472 0.84269663 0.82022472
 0.84269663 0.85393258 0.83146067 0.86516854]
round(np.mean(score)*100,2)

83.5
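The five blocks above repeat the same cross_val_score call with a different classifier each time; as an optional, more compact sketch, the whole comparison can be written as one loop over the models already imported:

# optional sketch: evaluate all five models with the same 10-fold setup
models = {
    "KNN": KNeighborsClassifier(n_neighbors=13),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=13),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
}
for name, model in models.items():
    scores = cross_val_score(model, train_data, target, cv=k_fold,
                             n_jobs=1, scoring='accuracy')
    print(name, round(scores.mean() * 100, 2))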

Testing

From the cross-validation results above, the support vector machine performs best.

clf = SVC()
clf.fit(train_data, target)

test_data = test.drop("PassengerId", axis=1).copy()
prediction = clf.predict(test_data)
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": prediction
    })

submission.to_csv('submission.csv', index=False)  # write the final result to a CSV file
submission = pd.read_csv('submission.csv')
submission.head()  # preview the first 5 rows of the file
image