1. Distributions: making distributions easier to read
1.1 Univariate distributions
%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats, integrate
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
np.random.seed(sum(map(ord, "distributions")))
Histogram
x = np.random.normal(size=100)
sns.distplot(x, kde=False)
sns.distplot(x, kde=False, bins=20)
sns.distplot(x, kde=False, bins=20, rug=True)
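Kernel density estimation
Kernel density estimation recovers the shape of the probability density function from the observations. What is it useful for? Once the shape is known, the density function itself can be found by the method of undetermined coefficients.
Steps of kernel density estimation:
- Approximate the neighborhood of each observation with a normal distribution curve
- Sum the normal curves of all observations
- Normalize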
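The three steps above can be sketched without seaborn. This is a minimal illustration using only the standard library; the function name `kde_sketch`, the sample values, and the bandwidth of 0.5 are assumptions made for the example, and seaborn's own implementation differs in detail.

```python
import math

def kde_sketch(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    n = len(samples)
    # Height factor of a normal curve with standard deviation `bandwidth`
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    density = []
    for g in grid:
        # Steps 1 and 2: one normal curve per observation, summed at g
        total = sum(norm * math.exp(-0.5 * ((g - s) / bandwidth) ** 2)
                    for s in samples)
        # Step 3: divide by the number of observations so the area is 1
        density.append(total / n)
    return density

samples = [-1.2, -0.4, 0.0, 0.3, 1.5]        # made-up observations
step = 0.01
grid = [-6 + i * step for i in range(1201)]  # covers [-6, 6]
density = kde_sketch(samples, grid, bandwidth=0.5)

# Riemann-sum check that the estimate integrates to roughly 1
area = sum(d * step for d in density)
print(round(area, 3))
```

Dividing by the number of observations is what turns the stacked curves into a proper density; the area check at the end is a quick way to confirm the normalization.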
How do we draw this in seaborn?
sns.kdeplot(x)
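The bandwidth concept: the width of the normal curve used for the approximation. A small bandwidth follows the data closely and gives a bumpy curve; a large one smooths detail away.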
sns.kdeplot(x)
sns.kdeplot(x, bw=.2, label="bw: 0.2")
sns.kdeplot(x, bw=2, label="bw: 2")
plt.legend()
Fitting a parametric model
x = np.random.gamma(6, size=200)
sns.distplot(x, kde=False, fit=stats.gamma)
1.2 Bivariate distributions
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
Scatter plot
sns.jointplot(x="x", y="y", data=df)
Hexbin plot
x, y = np.random.multivariate_normal(mean, cov, 1000).T
with sns.axes_style("ticks"):
    sns.jointplot(x=x, y=y, kind="hex")
Kernel density estimation
sns.jointplot(x="x", y="y", data=df, kind="kde")
f, ax = plt.subplots(figsize=(6, 6))
sns.kdeplot(df.x, df.y, ax=ax)
sns.rugplot(df.x, color="g", ax=ax)
sns.rugplot(df.y, vertical=True, ax=ax)
f, ax = plt.subplots(figsize=(6, 6))
cmap = sns.cubehelix_palette(as_cmap=True, dark=1, light=0)
sns.kdeplot(df.x, df.y, cmap=cmap, n_levels=60, shade=True)
g = sns.jointplot(x="x", y="y", data=df, kind="kde", color="m")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("$X$", "$Y$")
1.3 Pairwise relationships in a dataset
iris = sns.load_dataset("iris")
iris.head()
sns.pairplot(iris);
Pairwise relationships between attributes + a histogram of each attribute
g = sns.PairGrid(iris)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, cmap="Blues_d", n_levels=20)
2. Regression: exploring relationships between variables
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
np.random.seed(sum(map(ord, "regression")))
tips = sns.load_dataset("tips")
tips.head()
tips[tips['size']==1]
2.1 Plotting linear regression models
The simplest approach: scatter plot + linear regression + 95% confidence interval
sns.lmplot(x="total_bill", y="tip", data=tips)
When a variable takes discrete values, the scatter plot becomes a bit awkward...
sns.lmplot(x="size", y="tip", data=tips)
Method 1: add a small amount of jitter
sns.lmplot(x="size", y="tip", data=tips, x_jitter=.05)
Method 2: at each discrete value, replace the scatter with the mean and its confidence interval
sns.lmplot(x="size", y="tip", data=tips, x_estimator=np.mean)
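What method 2 draws can be sketched by hand: group the observations by their x value, take the mean of each group, and attach a confidence interval. The sketch below uses a bootstrap percentile interval and only the standard library. The helper name `mean_with_bootstrap_ci` and the toy (size, tip) pairs are made up for illustration, and seaborn's exact resampling procedure may differ.

```python
import random
from collections import defaultdict

def mean_with_bootstrap_ci(values, n_boot=1000, ci=95, seed=0):
    """Return (mean, lower, upper) using a bootstrap percentile interval."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample
    boot_means = sorted(sum(rng.choices(values, k=n)) / n
                        for _ in range(n_boot))
    lo_idx = int(n_boot * (100 - ci) / 200)
    hi_idx = int(n_boot * (100 + ci) / 200) - 1
    return sum(values) / n, boot_means[lo_idx], boot_means[hi_idx]

# Made-up (size, tip) pairs, not taken from the tips dataset
observations = [(1, 1.0), (1, 1.9), (2, 2.0), (2, 3.1), (2, 2.5),
                (3, 3.0), (3, 4.2), (3, 3.5)]

groups = defaultdict(list)
for size, tip in observations:
    groups[size].append(tip)

summary = {size: mean_with_bootstrap_ci(tip_values)
           for size, tip_values in sorted(groups.items())}
for size, (mean, low, high) in summary.items():
    print(size, round(mean, 2), round(low, 2), round(high, 2))
```

Each discrete x value ends up with one point and one error bar instead of a column of overlapping dots.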
2.2 Fitting different models
Sometimes a linear fit works well; sometimes it is less than satisfactory.
anscombe = sns.load_dataset("anscombe")
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'I'"), ci=None, scatter_kws={"s": 80})
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"), ci=None, scatter_kws={"s": 80})
Try a higher-order fit
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"), order=2, ci=None, scatter_kws={"s": 80})
What about outliers?
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"), robust=True, ci=None, scatter_kws={"s": 80})
How do we fit a binary variable?
tips["big_tip"] = (tips.tip / tips.total_bill) > .15
sns.lmplot(x="total_bill", y="big_tip", data=tips, y_jitter=.05)
sns.lmplot(x="total_bill", y="big_tip", data=tips, logistic=True, y_jitter=.03, ci=None)
How do we evaluate the fit? With a residual plot.
sns.residplot(x="x", y="y", data=anscombe.query("dataset == 'I'"), scatter_kws={"s": 80})
If the fit is good, the residuals look like white noise distributed as N(0, σ²); if the fit is poor, a pattern shows up in them.
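A residual is the observed y minus the value predicted by the fitted line. The sketch below fits a least-squares line with plain Python and shows the kind of pattern a poor fit leaves behind. The function names and the quadratic toy data are assumptions for illustration; this is not how `sns.residplot` is implemented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def residuals(xs, ys):
    """Observed y minus the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

xs = [0, 1, 2, 3, 4]
ys_curved = [x ** 2 for x in xs]  # made-up data with a quadratic shape

res = residuals(xs, ys_curved)
print(res)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals sum to zero, as they always do for a least-squares line with an intercept, but they trace a clear U shape rather than scattering randomly. That shape is the signal that a straight line is the wrong model here.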
2.3 Exploring conditional relationships between variables
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips)
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips, markers=["o", "x"])
Try adding more conditioning categories
sns.lmplot(x="total_bill", y="tip", hue="smoker", col="time", data=tips)
sns.lmplot(x="total_bill", y="tip", hue="smoker", col="time", row="sex", data=tips)
Controlling the size and shape of the figure
sns.lmplot(x="total_bill", y="tip", col="day", data=tips, col_wrap=2, size=5)
sns.lmplot(x="total_bill", y="tip", col="day", data=tips, aspect=0.5)
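3. Visualizing categorical data
- Showing the observations directly: swarmplot, stripplot
- Showing an approximate distribution of the observations: boxplot, violinplot
- Showing the mean and confidence interval: barplot, pointplot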
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
np.random.seed(sum(map(ord, "categorical")))
titanic = sns.load_dataset("titanic")
tips = sns.load_dataset("tips")
iris = sns.load_dataset("iris")
3.1 Categorical scatter plots
When one dimension of the data is categorical, the scatter plot turns into strips.
sns.stripplot(x="day", y="total_bill", data=tips)
Points piled on top of each other and hard to read? Remember the jitter trick.
sns.stripplot(x="day", y="total_bill", data=tips, jitter=True)
Another approach is a swarm plot, which avoids overlapping points.
sns.swarmplot(x="day", y="total_bill", data=tips)
Within each first-level category there may be a second-level category
sns.swarmplot(x="day", y="total_bill", hue="sex", data=tips)
3.2 Distributions within categories
Box plot
Upper whisker, upper quartile, median, lower quartile, lower whisker
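The five values a box plot encodes can be computed directly. The sketch below assumes the common convention in which whiskers reach the furthest observation within 1.5 times the interquartile range of the box, with anything beyond that drawn as an outlier. The function name `box_summary` and the data are made up, and the quartile interpolation used by a plotting library may differ slightly.

```python
import statistics

def box_summary(values, whis=1.5):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    # Whiskers stop at the most extreme observations inside the fences
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "lower_whisker": inside[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_whisker": inside[-1],
        "outliers": outliers,
    }

summary = box_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # made-up values
print(summary)
```

For this input the quartiles are 3.25, 5.5, and 7.75, the whiskers end at 1 and 9, and 100 is reported as an outlier instead of stretching the upper whisker.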
sns.boxplot(x="day", y="total_bill", hue="time", data=tips)
Violin plot
Box plot + KDE (kernel density estimation)
sns.violinplot(x="total_bill", y="day", hue="time", data=tips)
sns.violinplot(x="day", y="total_bill", hue="time", data=tips)
sns.violinplot(x="total_bill", y="day", hue="time", data=tips, bw=.1, scale="count", scale_hue=False)
Split (asymmetric) violin plot
sns.violinplot(x="day", y="total_bill", hue="sex", data=tips, split=True, inner="stick")
3.3 Statistical estimates within categories
Bar plot of a statistic
sns.barplot(x="sex", y="survived", hue="class", data=titanic)
Count plot (a histogram over a categorical variable)
sns.countplot(x="deck", data=titanic, palette="Greens_d")
Point plot
sns.pointplot(x="sex", y="survived", hue="class", data=titanic)
Changing colors, markers, and line styles
sns.pointplot(x="class", y="survived", hue="sex", data=titanic,
              palette={"male": "g", "female": "m"},
              markers=["^", "o"], linestyles=["-", "--"])
3.4 Categorical subplots
sns.factorplot(x="day", y="total_bill", hue="smoker", col="time", data=tips, kind="swarm")
Subplots across several categorical variables
g = sns.PairGrid(tips,
                 x_vars=["smoker", "time", "sex"],
                 y_vars=["total_bill", "tip"],
                 aspect=.75, size=3.5)
g.map(sns.violinplot, palette="pastel");
4. Titanic Project
#Now let's open it with pandas
import pandas as pd
from pandas import Series,DataFrame
# Set up the Titanic csv file as a DataFrame
titanic_df = pd.read_csv('train.csv')
# Let's see a preview of the data
titanic_df.head()
# We could also get overall info for the dataset
titanic_df.info()
All good data analysis projects begin with trying to answer questions. Now that we know which columns and categories we have, let's think of some questions or insights we would like to obtain from the data. So here's a list of questions we'll try to answer using our new data analysis skills!
First some basic questions:
1.) Who were the passengers on the Titanic? (Ages, Gender, Class, etc.)
2.) What deck were the passengers on and how does that relate to their class?
3.) Where did the passengers come from?
4.) Who was alone and who was with family?
Then we'll dig deeper, with a broader question:
5.) What factors helped someone survive the sinking?
So let's start with the first question: Who were the passengers on the titanic?
# Let's import what we'll need for the analysis and visualization
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Let's first check gender
sns.factorplot('Sex',data=titanic_df,kind="count")
# Now let's separate the genders by classes, remember we can use the 'hue' argument here!
sns.factorplot('Pclass',data=titanic_df,hue='Sex',kind="count")
# Now the other way round: separate the classes by gender by swapping the x and 'hue' variables
sns.factorplot('Sex',data=titanic_df,hue='Pclass',kind="count")
# We'll treat anyone under 16 as a child, and then use the apply technique with a function to create a new column
# Revisit Lecture 45 for a refresher on how to do this.
# First let's make a function to sort passengers into male, female, or child
def male_female_child(passenger):
    # Take the Age and Sex
    age, sex = passenger
    # Return 'child' for anyone under 16, otherwise leave the sex
    # (a missing Age compares as False, so the sex is kept)
    if age < 16:
        return 'child'
    else:
        return sex
# We'll define a new column called 'person', remember to specify axis=1 for columns and not index
titanic_df['person'] = titanic_df[['Age','Sex']].apply(male_female_child,axis=1)
# Let's see if this worked, check out the first ten rows
titanic_df[0:10]
# Let's try the factorplot again!
sns.factorplot('Pclass',data=titanic_df,hue='person',kind="count")
# Quick way to create a histogram using pandas
titanic_df['Age'].hist(bins=70)
# We could also get a quick overall comparison of male,female,child
titanic_df['person'].value_counts()
# Another way to visualize the data is to use FacetGrid to plot multiple kdeplots on one plot
# Set the figure equal to a facetgrid with the pandas dataframe as its data source, set the hue, and change the aspect ratio.
fig = sns.FacetGrid(titanic_df, hue="Sex",aspect=5)
# Next use map to plot all the possible kdeplots for the 'Age' column by the hue choice
fig.map(sns.kdeplot,'Age',shade= True)
# Set the x max limit by the oldest passenger
oldest = titanic_df['Age'].max()
#Since we know no one can be negative years old set the x lower limit at 0
fig.set(xlim=(0,oldest))
#Finally add a legend
fig.add_legend()
#We could have done the same thing for the 'person' column to include children:
fig = sns.FacetGrid(titanic_df, hue="person",aspect=4)
fig.map(sns.kdeplot,'Age',shade= True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()
# Let's do the same for class by changing the hue argument:
fig = sns.FacetGrid(titanic_df, hue="Pclass",aspect=4)
fig.map(sns.kdeplot,'Age',shade= True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()
#We've gotten a pretty good picture of who the passengers were based on Sex, Age, and Class. So let's move on to our 2nd question: What deck were the passengers on and how does that relate to their class?
# Let's get a quick look at our dataset again
titanic_df.head()
# First we'll drop the NaN values and create a new object, deck
deck = titanic_df['Cabin'].dropna()
# So let's grab that letter for the deck level with a simple for loop
# Set empty list
levels = []
# Loop to grab first letter
for level in deck:
    levels.append(level[0])
# Reset DataFrame and use factor plot
cabin_df = DataFrame(levels)
cabin_df.columns = ['Cabin']
sns.factorplot('Cabin',data=cabin_df,palette='winter_d',kind="count")
# Redefine cabin_df as everything but where the row was equal to 'T'
cabin_df = cabin_df[cabin_df.Cabin != 'T']
#Replot
sns.factorplot('Cabin',data=cabin_df,palette='cool',kind="count")
#Where did the passengers come from?
# Let's take another look at our original data
titanic_df.head()
Note here that the Embarked column has C, Q, and S values. Reading about the project on Kaggle you'll note that these stand for Cherbourg, Queenstown, and Southampton.
# Now we can make a quick factorplot to check out the results, note the order argument, used to deal with NaN values
sns.factorplot('Embarked',data=titanic_df,hue='Pclass',order=['C','Q','S'],kind="count")
An interesting find here is that almost all the passengers who boarded at Queenstown were 3rd class. It would be interesting to look at the economics of that town in that time period for further investigation.
Now let's take a look at the 4th question:
4.) Who was alone and who was with family?
# Let's start by adding a new column to define alone
# We'll add the parent/child column with the sibsp column
titanic_df['Alone'] = titanic_df.Parch + titanic_df.SibSp
titanic_df['Alone']
Now we know that if the Alone column is anything but 0, then the passenger had family aboard and wasn't alone. So let's change the column now so that if the value is greater than 0, we know the passenger was with his/her family, otherwise they were alone.
# Look for >0 or ==0 to set alone status
# Build the labels in one assignment; the chained form titanic_df['Alone'].loc[...] = ...
# can trigger a pandas SettingWithCopyWarning. For more info check out this link
with_family = titanic_df['Alone'] > 0
titanic_df['Alone'] = with_family.map({True: 'With Family', False: 'Alone'})
url_info = 'http://stackoverflow.com/questions/20625582/how-to-deal-with-this-pandas-warning'
# Let's check to make sure it worked
titanic_df.head()
# Now let's get a simple visualization!
sns.factorplot('Alone',data=titanic_df,palette='Blues',kind="count")
Great work! Now that we've thoroughly analyzed the data, let's go ahead and take a look at the most interesting (and open-ended) question: What factors helped someone survive the sinking?
# Let's start by creating a new column for legibility purposes through mapping (Lec 36)
titanic_df["Survivor"] = titanic_df.Survived.map({0: "no", 1: "yes"})
# Let's just get a quick overall view of survived vs died.
sns.factorplot('Survivor',data=titanic_df,palette='Set1',kind="count")
So quite a few more people died than those who survived. Let's see if the class of the passengers had an effect on their survival rate, since the movie Titanic popularized the notion that the 3rd class passengers did not do as well as their 1st and 2nd class counterparts.
# Let's use a factor plot again, but now considering class
sns.factorplot('Pclass','Survived',data=titanic_df)
Looks like survival rates for the 3rd class are substantially lower! But maybe this effect is being caused by the large number of men in the 3rd class in combination with the women and children first policy. Let's use 'hue' to get a clearer picture on this.
# Let's use a factor plot again, but now considering class and gender
sns.factorplot('Pclass','Survived',hue='person',data=titanic_df)
From this data it looks like being male or being in 3rd class were both unfavourable for survival. Regardless of class, being male dramatically decreases your chances of survival.
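The rates read off these plots are group means of the 0/1 Survived column: the share of passengers in each group who survived. A minimal sketch of that calculation in plain Python follows. The rows are made up for illustration and are not taken from train.csv, and the helper name `survival_rate_by` is hypothetical.

```python
from collections import defaultdict

def survival_rate_by(rows, key):
    """Mean of the 0/1 'Survived' flag for each distinct value of `key`."""
    totals = defaultdict(lambda: [0, 0])  # value -> [survivors, passengers]
    for row in rows:
        totals[row[key]][0] += row["Survived"]
        totals[row[key]][1] += 1
    return {value: survivors / count
            for value, (survivors, count) in sorted(totals.items())}

# Made-up passengers, only to show the arithmetic
rows = [
    {"Pclass": 1, "Survived": 1},
    {"Pclass": 1, "Survived": 1},
    {"Pclass": 1, "Survived": 0},
    {"Pclass": 3, "Survived": 0},
    {"Pclass": 3, "Survived": 0},
    {"Pclass": 3, "Survived": 1},
    {"Pclass": 3, "Survived": 0},
]

rates = survival_rate_by(rows, "Pclass")
print(rates)
```

With the real DataFrame, the equivalent one-liner would be a pandas groupby on the grouping column followed by the mean of Survived.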
But what about age? Did being younger or older have an effect on survival rate?
# Let's use a linear plot on age versus survival
sns.lmplot('Age','Survived',data=titanic_df)
Looks like there is a general trend that the older the passenger was, the less likely they survived. Let's go ahead and use hue to take a look at the effect of class and age.
# Let's use a linear plot on age versus survival using hue for class separation
sns.lmplot('Age','Survived',hue='Pclass',data=titanic_df,palette='winter')
# Same plots again, this time grouping ages into bins with x_bins to make the trend easier to read
generations=[10,20,40,60,80]
sns.lmplot('Age','Survived',hue='Pclass',data=titanic_df,palette='winter',x_bins=generations)
sns.lmplot('Age','Survived',hue='Sex',data=titanic_df,palette='winter',x_bins=generations)
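Passing `x_bins` replaces the cloud of individual points with one estimate per age group. The sketch below shows the idea under the assumption that each age is assigned to the nearest of the listed centers; check the seaborn documentation for the exact rule. The helper names and the ages and outcomes are made up for illustration.

```python
def nearest_bin(value, centers):
    """Return the bin center closest to value."""
    return min(centers, key=lambda c: abs(value - c))

def binned_rate(ages, survived, centers):
    """Survival rate per bin, skipping passengers with a missing age."""
    sums = {c: [0, 0] for c in centers}  # center -> [survivors, passengers]
    for age, flag in zip(ages, survived):
        if age is None:
            continue
        center = nearest_bin(age, centers)
        sums[center][0] += flag
        sums[center][1] += 1
    return {c: s / n for c, (s, n) in sums.items() if n}

generations = [10, 20, 40, 60, 80]
ages = [4, 12, 22, 25, 38, 45, 58, 71, None]  # made-up ages
survived = [1, 1, 0, 1, 0, 1, 0, 0, 1]        # made-up outcomes

rates = binned_rate(ages, survived, generations)
print(rates)  # {10: 1.0, 20: 0.5, 40: 0.5, 60: 0.0, 80: 0.0}
```

Binning only changes how the points are drawn; the regression line in `lmplot` is still fitted to the original, unbinned data.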
1.) Did the deck have an effect on the passengers' survival rate? Did this answer match up with your intuition?
2.) Did having a family member aboard increase the odds of surviving the sinking?
titanic_df.head()
cabin_df
# Keep only the rows with a known cabin; copy so new columns can be added safely
titanic_new = titanic_df[titanic_df.Cabin.notnull()].copy()
titanic_new
# Take the first letter of each cabin as the deck level
titanic_new['level'] = titanic_new['Cabin'].str[0]
titanic_new.head()
sns.factorplot('level',data=titanic_new,palette='Set1',kind="count",order=['A','B','C','D','E','F','G','T'])
# Let's use a factor plot again, but now considering deck level
sns.factorplot('level','Survived',order=['A','B','C','D','E','F','G','T'],data=titanic_new)
# And once more, now considering whether the passenger was alone or with family
sns.factorplot('Alone','Survived',data=titanic_df)