Background on the Titanic

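The Titanic was an Olympic-class ocean liner that sank after striking an iceberg on her maiden voyage in April 1912. Built by the Harland and Wolff shipyard in Belfast, Northern Ireland, she was the largest passenger ship of her time; because her size was comparable to that of a modern aircraft carrier, she was billed as a giant liner that "even God could not sink". On her maiden voyage she left Southampton, England, called at Cherbourg-Octeville in France and Queenstown in Ireland, and was due to cross the Atlantic to New York City. Owing to human error, however, she struck an iceberg at 11:40 p.m. ship's time on 14 April 1912; 2 hours and 40 minutes later, at 2:20 a.m. on 15 April, she broke in two and sank into the Atlantic. More than 1,500 people died, making it one of the worst maritime disasters of the 20th century and one of the best known.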

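This post analyzes the Titanic passenger data to find which factors are associated with a higher survival rate.

1. Posing the question: which factors affect passenger survival?

Many factors affect passenger survival; here we only examine whether sex, age, and cabin class affect the survival rate.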

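1. Effect of sex on survival rate
2. Effect of age on survival rate
3. Effect of cabin class on survival rate
4. Combined effect of age and sex on survival rate
5. Combined effect of age and cabin class on survival rate
6. Combined effect of sex and cabin class on survival rate
7. Combined effect of age, sex, and cabin class on survival rate
Here age, sex, and cabin class are the independent variables and survival rate is the dependent variable.
(Age and cabin class are stored as numbers, although cabin class is really an ordinal category coded 1 to 3; sex is a categorical variable.)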

2. Importing packages

import pandas as pd
import numpy as np
from pandas import Series,DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
# Display figures inline in the notebook
%matplotlib inline
# Render figures at higher (retina) resolution
%config InlineBackend.figure_format = 'retina'

3. Loading the data

titanic_df = pd.read_csv('titanic.csv')  # titanic.csv sits in the same folder as the notebook
titanic_df.head()  # show the first 5 rows

First 5 rows of the data

4. Getting to know the data

Field descriptions

PassengerId: passenger ID
Survived: survival, 0 = No, 1 = Yes
Pclass: cabin class, 1 = 1st, 2 = 2nd, 3 = 3rd
Name: name
Sex: sex
Age: age
SibSp: number of siblings and spouses aboard
Parch: number of parents and children aboard
Ticket: ticket number
Fare: fare
Cabin: cabin number
Embarked: port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton
Inspecting the data

titanic_df.describe()

Summary of the numeric variables

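Finding: the summary shows that about 38% of passengers survived. Passengers were fairly young, with a mean age of about 30. Third class was the largest group, with more than half of all passengers.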

titanic_df.describe(include=[object])

Summary of the categorical variables

Finding: most passengers were male, about 65%.
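
The shares quoted in these findings can also be computed directly instead of being read off the describe() output. The sketch below uses a small made-up table (toy_df is a hypothetical stand-in, not the real titanic.csv) to show how value_counts(normalize=True) returns the fraction of rows in each category.

```python
import pandas as pd

# Hypothetical toy data standing in for titanic_df (not the real dataset).
toy_df = pd.DataFrame({
    "Sex": ["male", "male", "female", "male", "female"],
    "Pclass": [3, 3, 1, 2, 3],
})

# normalize=True returns fractions that sum to 1 instead of raw counts.
sex_share = toy_df["Sex"].value_counts(normalize=True)
pclass_share = toy_df["Pclass"].value_counts(normalize=True).sort_index()

print(sex_share)
print(pclass_share)
```

On the real data the same two calls would be applied to titanic_df['Sex'] and titanic_df['Pclass'].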

5. Data cleaning

titanic_df.info()  # check for missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

The output above shows that age (Age), cabin (Cabin), and port of embarkation (Embarked) have missing values.
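
The info() output lists non-null counts, so the number of missing values has to be worked out by subtraction. As a minimal sketch on made-up data (toy_df and its values are hypothetical), isnull() can report the missing counts and shares per column directly:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps similar in kind to the real columns.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

missing_count = toy_df.isnull().sum()    # number of missing values per column
missing_share = toy_df.isnull().mean()   # fraction of missing values per column

# Keep only the columns that actually have gaps, worst first.
summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```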

Handling missing values

# Embarked has only two missing values; fill them with the mode, 'S'.
# With so few missing values, the effect on the results is negligible.
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('S')
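
The fill value 'S' is hard-coded in the line above. A small variation, sketched here on made-up data (toy_df is hypothetical), derives the mode from the column itself so the fill stays correct if the data changes:

```python
import pandas as pd

# Hypothetical toy column with two missing ports.
toy_df = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None]})

# mode() ignores missing values and returns a Series (ties are possible),
# so take its first entry.
most_common_port = toy_df["Embarked"].mode().iloc[0]
toy_df["Embarked"] = toy_df["Embarked"].fillna(most_common_port)
```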

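# Cabin has too many missing values, so it is not used in the analysis; dropping the column is still not recommended.

# Age is used in the analysis below, so its missing values need handling.
# Here they are filled with the median age, which narrows the spread of ages.
titanic_df.Age.describe()  # Age statistics before filling, for comparison with the filled data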

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64


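# Reload the original data (this also discards the Embarked fill above)
titanic_df = pd.read_csv("titanic.csv")

# Compute the median age of all passengers
age_median1 = titanic_df.Age.median()

# Fill the missing values and assign the result back to the Age column
titanic_df['Age'] = titanic_df['Age'].fillna(age_median1)

# Age statistics after filling
titanic_df.Age.describe()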

count    891.000000
mean      29.361582  # the mean age decreases slightly
std       13.019697
min        0.420000
25%       22.000000
50%       28.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64
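
Comparing the two describe() outputs, the standard deviation drops from about 14.5 to about 13.0 while the median stays at 28. The sketch below reproduces that effect on a tiny made-up series (the values are hypothetical): every filled value sits exactly at the median, so the spread shrinks.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([2.0, 20.0, np.nan, 28.0, np.nan, 40.0, 70.0])
filled = ages.fillna(ages.median())

std_before = ages.std()    # computed over the 5 known values
std_after = filled.std()   # computed over all 7 values after filling

print(std_before, std_after)
```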

* Taking sex into account: fill with the median age of male and female passengers respectively

# Reload the original data
titanic_df = pd.read_csv("titanic.csv")

# Median age by sex; the result is a Series indexed by Sex
age_median2 = titanic_df.groupby('Sex').Age.median()

# Use Sex as the index
titanic_df.set_index('Sex', inplace=True)
# Fill the missing values, matched on the index
titanic_df['Age'] = titanic_df['Age'].fillna(age_median2)
# Reset the index, turning Sex back into a column
titanic_df.reset_index(inplace=True)

# Age statistics after filling
titanic_df.Age.describe()

count    891.000000
mean      29.441268
std       13.018747
min        0.420000
25%       22.000000
50%       29.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64

* Taking both sex and cabin class into account

# Reload the original data
titanic_df = pd.read_csv("titanic.csv")

# Median age by cabin class and sex; the result is a Series indexed by Pclass, Sex
age_median3 = titanic_df.groupby(['Pclass', 'Sex']).Age.median()

# Use Pclass and Sex as the index; inplace=True modifies titanic_df directly
titanic_df.set_index(['Pclass','Sex'], inplace=True)
# Fill the missing values, matched on the index
titanic_df['Age'] = titanic_df['Age'].fillna(age_median3)
# Reset the index, turning Pclass and Sex back into columns
titanic_df.reset_index(inplace=True)

# Age statistics after filling
titanic_df.Age.describe()

count    891.000000
mean      29.112424
std       13.304424
min        0.420000
25%       21.500000
50%       26.000000
75%       36.000000
max       80.000000
Name: Age, dtype: float64
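
The group-wise fills above work by temporarily moving the grouping columns into the index. An alternative that avoids the set_index / reset_index round trip is groupby(...).transform('median'), which returns one median per row, aligned with the original rows. This is a sketch on made-up data (toy_df is hypothetical), not the method used in this post:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two (Pclass, Sex) groups, each with one missing age.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# transform returns a Series with the same index as toy_df, holding
# the median age of each row's own (Pclass, Sex) group.
group_median = toy_df.groupby(["Pclass", "Sex"])["Age"].transform("median")
toy_df["Age"] = toy_df["Age"].fillna(group_median)

print(toy_df)
```

Because the result is aligned row by row, the column order and index of the table stay unchanged.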

6. Data exploration

# Data for the surviving passengers
survived_passenger_df = titanic_df[titanic_df.Survived == 1]
survived_passenger_df.head(10)

# A few commonly used helper functions


# Print summary statistics for all passengers and for survivors
def print_describe(name, label):
    print('All passengers, ' + label + ':')
    print('Mean: ' + str(titanic_df[name].mean()))
    print('Min: ' + str(titanic_df[name].min()))
    print('Max: ' + str(titanic_df[name].max()))
    print('')
    print('Surviving passengers, ' + label + ':')
    print('Mean: ' + str(survived_passenger_df[name].mean()))
    print('Min: ' + str(survived_passenger_df[name].min()))
    print('Max: ' + str(survived_passenger_df[name].max()))

    
# Number of passengers in each group after grouping by name
def group_passenger_count(data, name):
    return data.groupby(name)['PassengerId'].count()


# Survival rate of each group
def group_passenger_survival_rate(name):
    # Group sizes among all passengers
    group_all_passenger_count = group_passenger_count(titanic_df, name)
    # Group sizes among surviving passengers
    group_survived_passenger_count = group_passenger_count(survived_passenger_df, name)
    # Survival rate of each group
    return group_survived_passenger_count / group_all_passenger_count
    
    
# Draw a pie chart of the grouped data
def print_pie(group_data, title):
    group_data.plot.pie(title=title, figsize=(6, 6), autopct='%3.1f%%', startangle=90, legend=True)


# Draw a bar chart with a percentage label above each bar
def print_bar(data, title):
    bar = data.plot.bar(title=title)
    for p in bar.patches:
        bar.annotate('%3.1f%%' % (p.get_height() * 100), (p.get_x() * 1.005, p.get_height() * 1.005))
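
The survival-rate helper above divides the per-group survivor count by the per-group passenger count. Because Survived is coded 0/1, the same number is the mean of Survived within each group. The sketch below checks this on made-up data (toy_df is hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for titanic_df.
toy_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Ratio of counts, as in the helper functions.
survivors = toy_df[toy_df["Survived"] == 1]
rate_from_counts = (
    survivors.groupby("Sex")["PassengerId"].count()
    / toy_df.groupby("Sex")["PassengerId"].count()
)

# Mean of the 0/1 column within each group.
rate_from_mean = toy_df.groupby("Sex")["Survived"].mean()

print(rate_from_counts)
print(rate_from_mean)
```

One difference is worth knowing: a group with no survivors is missing from the survivor counts, so the ratio gives NaN for it, while the mean gives 0.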

Effect of sex on survival rate

titanic_df[['Survived','Sex']].groupby('Sex').mean()
titanic_df.pivot_table(values='Survived',index='Sex',aggfunc='mean')

# Sex composition of all passengers
by_Sex=titanic_df.groupby('Sex')['Sex'].count()
plt.pie(by_Sex,labels=by_Sex.index,autopct='%1.2f%%')
plt.axis('equal')
plt.show()

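# Sex composition of the surviving passengers

titanic_df1=titanic_df[titanic_df.Survived==1]
titanic_survived=titanic_df1.groupby('Sex')['Survived'].count()
# Take the labels from the index so they match the group order (female, male)
plt.pie(titanic_survived,labels=titanic_survived.index,autopct='%.2f%%')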
plt.axis('equal')
plt.show()


# Survival rate by sex
sns.barplot(data=titanic_df,x='Sex',y='Survived',ci=None)

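Only 35.24% of all passengers were women, yet women made up 68.13% of the survivors.
The survival rate was 74.2% for women but only 18.9% for men.
Women clearly had the higher survival rate.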
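
The two kinds of percentages quoted here answer different questions: the share of women among survivors is a column-wise proportion, while the survival rate of women is a row-wise proportion. As a sketch on made-up data (toy_df is hypothetical), pd.crosstab with the normalize argument produces both tables:

```python
import pandas as pd

# Hypothetical toy data standing in for titanic_df.
toy_df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "male"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# Each row sums to 1: survival rate within each sex.
rate_by_sex = pd.crosstab(toy_df["Sex"], toy_df["Survived"], normalize="index")

# Each column sums to 1: sex composition among the dead (0) and the survivors (1).
composition = pd.crosstab(toy_df["Sex"], toy_df["Survived"], normalize="columns")

print(rate_by_sex)
print(composition)
```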

Effect of age on survival rate

titanic_sex1=titanic_df['Age'][titanic_df['Survived']==1]
titanic_sex0=titanic_df['Age'][titanic_df['Survived']==0]
plt.hist([titanic_sex1,titanic_sex0],
          stacked=True,  # stack the two histograms
          label=['Rescued','Not Saved'])  # legend labels
plt.legend()  # show the legend
plt.title('Age-Survived')  # set the title


# Helper that prints summary statistics for all passengers and for survivors

titanic_survived=titanic_df[titanic_df['Survived']==1]
def describe_value(data,label):
    print('All passengers: '+label)
    print('Max:', titanic_df[data].max())
    print('Min:', titanic_df[data].min())
    print('Mean:', titanic_df[data].mean())
    print()
    print('Surviving passengers: '+label)
    print('Max:', titanic_survived[data].max())
    print('Min:', titanic_survived[data].min())
    print('Mean:', titanic_survived[data].mean())

describe_value('Age','age')

All passengers: age
Max: 80.0
Min: 0.42
Mean: 29.11242424242424

Surviving passengers: age
Max: 80.0
Min: 0.42
Mean: 28.108684210526317

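The two mean ages are very close.
The histograms show that the two distributions are also very similar.
Both have a spike in the middle because the missing ages were filled with median values.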

# Split age into equal-width bands of 10 years

bins=np.arange(0,90,10)
titanic_df['Age_band']=pd.cut(titanic_df.Age,bins)

# Number of survivors and non-survivors in each age band
titanic_ageband=titanic_df.groupby(['Age_band','Survived'])['Age_band'].count()

# Survival rate of each age band
titanic_ageband_survived=titanic_df.groupby('Age_band')['Survived'].mean()

titanic_ageband_survived.plot.bar(title='Survival rate by age')
plt.ylabel('Survival rate')
# Reference lines at 0.415 and 0.351; they appear to bracket the overall survival rate (about 0.38)
plt.axhline(y=0.415,color='r',linestyle='--')
plt.axhline(y=0.351,color='g',linestyle='--')


Finding: several age bands show a clear effect on survival rate, for example 0-10 and 30-40.
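
When reading the age bands it helps to know exactly which ages fall into which band. By default pd.cut builds intervals that are open on the left and closed on the right, so "0-10" means ages greater than 0 and up to and including 10. A small sketch with made-up ages (the values are hypothetical):

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.42, 10.0, 10.5, 28.0, 80.0])
bins = np.arange(0, 90, 10)  # edges 0, 10, 20, ..., 80

# Default pd.cut intervals are right-closed: (0, 10], (10, 20], ...
age_band = pd.cut(ages, bins)
print(age_band)

# An age of exactly 0 lies outside the first interval and becomes NaN.
zero_band = pd.cut(pd.Series([0.0]), bins)
print(zero_band)
```

In this dataset the minimum age is 0.42 and the maximum is 80, so every passenger falls inside one of the bands.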

# Plot the number of survivors and non-survivors in each age band
titanic_ageband.unstack().plot(kind='bar',stacked=True)
plt.title('Survived count by age')
plt.ylabel('Survived count')


sns.barplot(data=titanic_df,x='Age_band',y='Survived',ci=None)
plt.axhline(y=0.415,color='r',linestyle='--')
plt.axhline(y=0.351,color='g',linestyle='--')


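The chart shows that the 0-10 and 30-40 age bands lie above the average level, while the 20-30 and 60-70 bands lie below it.

Conclusion: the survival rate for ages 0-10 and 30-40 is above average, and for ages 20-30 and 60-70 it is below average.

The 0-10 band has the highest survival rate. Filling the missing ages with median values may have narrowed the differences between age bands.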
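
The post does not say where the two reference lines at 0.415 and 0.351 come from. One plausible reading, and it is only an assumption, is that they bound an approximate 95% confidence interval around the overall survival rate. The sketch below computes that interval with the normal approximation, assuming 342 survivors among the 891 passengers in the data:

```python
import math

# Assumed counts for the dataset used in this post.
n_passengers = 891
n_survived = 342

p_hat = n_survived / n_passengers                         # overall survival rate
std_err = math.sqrt(p_hat * (1 - p_hat) / n_passengers)   # standard error of a proportion
z = 1.96                                                  # normal quantile for a 95% interval

lower = p_hat - z * std_err
upper = p_hat + z * std_err
print(p_hat, lower, upper)
```

Under this reading, an age band above the upper line or below the lower line differs from the overall rate by more than sampling noise in the overall rate alone would explain. The uncertainty of each band's own rate is not accounted for.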

Cabin class and survival rate

# Cabin class composition of all passengers
by_pclass=titanic_df.groupby('Pclass')['Pclass'].count()
plt.pie(by_pclass,labels=(1,2,3),autopct='%1.1f%%')
plt.axis('equal')
plt.show()


Third class accounts for more than half of all passengers.

# Cabin class composition of the surviving passengers
titanic_df3=titanic_df[titanic_df.Survived==1]
titanic_survived=titanic_df3.groupby('Pclass')['Survived'].count()
plt.pie(titanic_survived,labels=(1,2,3),autopct='%1.1f%%')
plt.axis('equal')
plt.show()


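Third class made up more than half of all passengers, yet among the survivors its share is smaller than that of first class.

Conclusion: cabin class affects the survival rate.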

# Survival rate by cabin class
titanic_df[['Pclass','Survived']].groupby('Pclass').mean()
titanic_df.pivot_table(values='Survived',index='Pclass',aggfunc='mean')

sns.barplot(data=titanic_df,x='Pclass',y='Survived',ci=None)


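First class has a clearly higher survival rate than second and third class.

Conclusion: the survival rate is ordered first class > second class > third class; first class has the highest survival rate.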

Age, sex, and survival rate

# Passenger counts by age band and sex
titanic_df4=titanic_df.pivot_table(values='Survived',index='Age_band',columns='Sex',aggfunc='count')
titanic_df4.plot(kind='bar')


# Survivor counts by age band and sex
titanic_df5=titanic_df[titanic_df.Survived==1]
titanic_survived=titanic_df5.pivot_table(values='Survived',index='Age_band',columns='Sex',aggfunc='count')
titanic_survived.plot(kind='bar')


# Survival rate by age band and sex
sns.barplot(data=titanic_df,x='Age_band',y='Survived',hue='Sex',ci=None)
plt.axhline(y=0.4,color='r',linestyle='--')


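The charts show that the 20-40 age range has the most passengers, yet its survival rate is not the highest; the youngest (0-10) and the older (50-70) passengers have the highest survival rates.
Men clearly outnumber women, but women have a clearly higher survival rate than men, above 40% in every age band.
Overall, sex has a larger effect on survival rate than age.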
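
The comparison between sex and age can also be read from a table rather than from the bar chart. As a sketch on made-up data (toy_df, its age groups, and its values are all hypothetical), a pivot table of mean Survived gives the rate for each age group and sex, and the difference between its two columns gives the female-male gap within each age group:

```python
import pandas as pd

# Hypothetical toy data standing in for titanic_df.
toy_df = pd.DataFrame({
    "Age": [4, 8, 25, 27, 33, 35, 36, 62],
    "Sex": ["male", "female", "male", "female", "male", "female", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Coarse age groups with plain string labels to keep the table easy to index.
toy_df["Age_group"] = pd.cut(
    toy_df["Age"], bins=[0, 20, 40, 80], labels=["0-20", "20-40", "40-80"]
).astype(str)

# Survival rate for each (age group, sex) cell.
rate_table = toy_df.pivot_table(
    values="Survived", index="Age_group", columns="Sex", aggfunc="mean"
)

# Female-minus-male gap within each age group; NaN where one sex is absent.
gap = rate_table["female"] - rate_table["male"]
print(rate_table)
print(gap)
```

A gap that stays large in most age groups is what the claim "sex matters more than age" amounts to in table form.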

Age, cabin class, and survival rate

# Passenger counts by age band and cabin class
titanic_df6=titanic_df.pivot_table(values='Survived',index='Age_band',columns='Pclass',aggfunc='count')
titanic_df6.plot(kind='bar')


# Survivor counts by age band and cabin class
titanic_df7=titanic_df[titanic_df.Survived==1]
titanic_survived=titanic_df7.pivot_table(values='Survived',index='Age_band',columns='Pclass',aggfunc='count')
titanic_survived.plot(kind='bar')


# Survival rate by age band and cabin class
sns.barplot(data=titanic_df,x='Age_band',y='Survived',hue='Pclass',ci=None)
plt.axhline(y=0.4,color='g',linestyle='--')


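The charts show that third class has the most passengers but the lowest survival rate, while for ages 0-50 the survival rates in first and second class are all above 40%.
Within each age band, except for 0-10 and 60-70, first class has the highest survival rate.
Across age bands, the ordering is likewise first class > second class > third class.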

Sex, cabin class, and survival rate

# Passenger counts by sex and cabin class
titanic_df8=titanic_df.pivot_table(values='Survived',index='Pclass',columns='Sex',aggfunc='count')
titanic_df8.plot(kind='bar')


# Survivor counts by sex and cabin class
titanic_df9=titanic_df[titanic_df.Survived==1]
titanic_survived=titanic_df9.pivot_table(values='Survived',index='Pclass',columns='Sex',aggfunc='count')
titanic_survived.plot(kind='bar')


# Survival rate by sex and cabin class
sns.barplot(data=titanic_df,x='Pclass',y='Survived',hue='Sex',ci=None)


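The charts show that first and second class have similar numbers of passengers, both fewer than third class, and that men outnumber women in all three classes.
Among the survivors, women outnumber men, and first class has more survivors than second or third class.
In terms of survival rate, women in first and second class have the highest rates, reaching about 90%, while women in third class are at about 50%. Men are below 40% across the board, though first-class men have a higher rate than men in second and third class.
So sex and cabin class both affect the survival rate.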

Age, sex, cabin class, and survival rate

# Passenger counts by age band, sex, and cabin class
titanic_df10=titanic_df.pivot_table(values='Survived',index=['Age_band','Pclass'],columns='Sex',aggfunc='count')
titanic_df10.plot(kind='bar')


# Survivor counts by age band, sex, and cabin class
titanic_df11=titanic_df[titanic_df.Survived==1]
titanic_survived=titanic_df11.pivot_table(values='Survived',index=['Age_band','Pclass'],columns='Sex',aggfunc='count')
titanic_survived.plot(kind='bar')


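# Survival rate by age band, sex, and cabin class: one panel per age band
grid = sns.FacetGrid(titanic_df, row='Age_band', aspect=1.5)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', order=[1, 2, 3], hue_order=['male', 'female'], palette='deep', ci=None)
grid.add_legend()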
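
Splitting by three variables at once produces many small groups, and a survival rate computed from one or two passengers says little. As a sketch on made-up data (toy_df, the Age_group column, and the min_group_size threshold are all hypothetical), the group size can be reported next to each rate and used to set aside the sparse groups:

```python
import pandas as pd

# Hypothetical toy data standing in for titanic_df.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3, 3, 3],
    "Sex": ["female", "female", "male", "male", "male", "male", "female", "female"],
    "Age_group": ["adult", "adult", "adult", "adult", "adult", "adult", "child", "adult"],
    "Survived": [1, 1, 0, 0, 0, 1, 1, 0],
})

# One row per (age group, class, sex) with its survival rate and group size.
summary = (
    toy_df.groupby(["Age_group", "Pclass", "Sex"])["Survived"]
    .agg(rate="mean", n="size")
    .reset_index()
)

# Keep only groups with enough passengers for the rate to be meaningful.
min_group_size = 2  # hypothetical threshold chosen for this toy example
reliable = summary[summary["n"] >= min_group_size]

print(summary)
print(reliable)
```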