visualization-seaborn

1. Distribution: making distributions easier to read

1.1 Univariate distributions

%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats, integrate
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
np.random.seed(sum(map(ord, "distributions")))

Histogram

x = np.random.normal(size=100)
sns.distplot(x, kde=False)

image.png

sns.distplot(x, kde=False, bins=20)

image.png

sns.distplot(x, kde=False, bins=20, rug=True)

image.png

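Kernel density estimation
Kernel density estimation (KDE) estimates the shape of the probability density function from the observations. What is it good for? Once you can see the shape, you can pick a parametric form and solve for its coefficients to get the density function.
The steps of kernel density estimation:

Approximate the neighborhood of each observation with a normal curve
Sum the normal curves of all observations
Normalize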
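
The three steps above can be sketched by hand in NumPy. This is only an illustration of the idea, not what seaborn runs internally; the function name `gaussian_kde_1d`, the grid, and the width `h` are my own choices for the example.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, h):
    """Evaluate a Gaussian kernel density estimate on `grid`.

    h is the standard deviation of the normal curve placed on each sample.
    """
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Step 1: one normal curve per observation, evaluated at every grid point.
    z = (grid[:, None] - samples[None, :]) / h
    curves = np.exp(-0.5 * z ** 2) / (h * np.sqrt(2.0 * np.pi))
    # Steps 2 and 3: sum the curves, then divide by the number of
    # observations so that the total area is 1.
    return curves.sum(axis=1) / samples.size

rng = np.random.default_rng(0)
samples = rng.normal(size=100)
grid = np.linspace(-6.0, 6.0, 601)
density = gaussian_kde_1d(samples, grid, h=0.4)

# Riemann-sum check that the estimate integrates to (almost) 1.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller `h` gives a bumpier curve and a larger `h` a smoother one.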

How do you draw it in seaborn?

sns.kdeplot(x)

image.png

Bandwidth: the width of the normal curve used to approximate each observation.

sns.kdeplot(x)
sns.kdeplot(x, bw=.2, label="bw: 0.2")
sns.kdeplot(x, bw=2, label="bw: 2")
plt.legend()

image.png

Fitting a parametric model

x = np.random.gamma(6, size=200)
sns.distplot(x, kde=False, fit=stats.gamma)

image.png
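
For intuition about what "fitting" a gamma distribution means, here is a small sketch using the method of moments. Note that this is a simpler technique than the one behind `fit=stats.gamma` (which, as far as I know, calls the distribution's own `fit` method); the sample size is also larger than in the plot above, just to make the estimates stable.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=6.0, scale=1.0, size=20000)

# For a gamma distribution: mean = k * theta and variance = k * theta^2,
# so both parameters can be recovered from the sample mean and variance.
mean, var = x.mean(), x.var()
shape_hat = mean ** 2 / var   # k = mean^2 / variance
scale_hat = var / mean        # theta = variance / mean
print(shape_hat, scale_hat)
```

Both estimates should land close to the true values of 6 and 1 used to generate the data.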

1.2 Bivariate distributions

mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])

Scatter plot

sns.jointplot(x="x", y="y", data=df)

image.png

Hexbin plot

x, y = np.random.multivariate_normal(mean, cov, 1000).T
with sns.axes_style("ticks"):
    sns.jointplot(x=x, y=y, kind="hex")

image.png

Kernel density estimation

sns.jointplot(x="x", y="y", data=df, kind="kde")

image.png

f, ax = plt.subplots(figsize=(6, 6))
sns.kdeplot(df.x, df.y, ax=ax)
sns.rugplot(df.x, color="g", ax=ax)
sns.rugplot(df.y, vertical=True, ax=ax)

image.png

f, ax = plt.subplots(figsize=(6, 6))
cmap = sns.cubehelix_palette(as_cmap=True, dark=1, light=0)
sns.kdeplot(df.x, df.y, cmap=cmap, n_levels=60, shade=True)

image.png

g = sns.jointplot(x="x", y="y", data=df, kind="kde", color="m")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("$X$", "$Y$")

image.png

1.3 Pairwise relationships in a dataset

iris = sns.load_dataset("iris")
iris.head()

sns.pairplot(iris);

image.png

Pairwise relationships between the attributes, plus a histogram of each attribute on the diagonal

g = sns.PairGrid(iris)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, cmap="Blues_d", n_levels=20)

image.png

2. Regression: exploring relationships between variables

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
np.random.seed(sum(map(ord, "regression")))
tips = sns.load_dataset("tips")

tips.head()
tips[tips['size']==1]

2.1 Plotting linear regression models

The simplest approach: a scatter plot, a linear regression line, and a 95% confidence interval

sns.lmplot(x="total_bill", y="tip", data=tips)

image.png
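
To see roughly what goes into that plot, here is a NumPy sketch of a least-squares line plus a bootstrap confidence interval for its slope. The data are made up to stand in for `total_bill` and `tip`, and the resampling loop is my own simplified version of the idea, not seaborn's code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data standing in for total_bill (x) and tip (y).
x = rng.uniform(5.0, 50.0, size=200)
y = 1.0 + 0.1 * x + rng.normal(scale=1.0, size=200)

# Ordinary least-squares line.
slope, intercept = np.polyfit(x, y, deg=1)

# Bootstrap: refit on resampled (x, y) pairs and collect the slopes.
n_boot = 1000
boot_slopes = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, x.size, size=x.size)
    boot_slopes[i] = np.polyfit(x[idx], y[idx], deg=1)[0]

# The middle 95% of the bootstrap slopes is the confidence interval.
ci_low, ci_high = np.percentile(boot_slopes, [2.5, 97.5])
print(slope, (ci_low, ci_high))
```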

When a variable takes discrete values, the scatter plot becomes awkward to read.

sns.lmplot(x="size", y="tip", data=tips)

image.png

Option 1: add a little jitter

sns.lmplot(x="size", y="tip", data=tips, x_jitter=.05)

image.png
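
Jitter is nothing more than small random noise added to the plotted positions. A minimal sketch, assuming uniform noise of the same width as `x_jitter=.05` and a made-up array of party sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
size = np.array([1, 2, 2, 2, 3, 3, 4, 4, 4, 6])  # hypothetical party sizes
jitter = 0.05

# Shift every point by a random amount in [-jitter, jitter].
# Only the plotted positions change; a regression would still use `size`.
size_jittered = size + rng.uniform(-jitter, jitter, size=size.size)
print(size_jittered)
```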

Option 2: at each discrete value, replace the points with their mean and a confidence interval

sns.lmplot(x="size", y="tip", data=tips, x_estimator=np.mean)

image.png
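
The numbers behind such a plot can be approximated with a pandas `groupby`. This sketch uses made-up data and a normal-approximation interval (mean ± 1.96 standard errors); seaborn, as far as I know, bootstraps its intervals instead, so the values will differ slightly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: party size 1..4 and a tip that grows with it.
df = pd.DataFrame({"size": rng.integers(1, 5, size=300)})
df["tip"] = 1.0 + 0.6 * df["size"] + rng.normal(scale=1.0, size=len(df))

# One row per discrete x value: mean, spread, and group size.
summary = df.groupby("size")["tip"].agg(["mean", "std", "count"])

# Normal-approximation 95% interval for each group mean.
half_width = 1.96 * summary["std"] / np.sqrt(summary["count"])
summary["ci_low"] = summary["mean"] - half_width
summary["ci_high"] = summary["mean"] + half_width
print(summary)
```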

2.2 Fitting different kinds of models

Sometimes a linear fit works well, and sometimes it falls short.

anscombe = sns.load_dataset("anscombe")
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'I'"), ci=None, scatter_kws={"s": 80})

image.png

sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"), ci=None, scatter_kws={"s": 80})

image.png

Try a higher-order fit:

sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"), order=2, ci=None, scatter_kws={"s": 80})

image.png
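
The effect of `order=2` can be checked numerically with `np.polyfit`. The data below are a made-up, exactly quadratic curve (not the Anscombe values), so a second-order fit should reproduce them while a straight line cannot.

```python
import numpy as np

x = np.linspace(4.0, 14.0, 11)
y = -0.1 * (x - 11.0) ** 2 + 9.0   # hypothetical, exactly quadratic

def sum_sq_residuals(order):
    """Sum of squared residuals of a polynomial fit of the given order."""
    coeffs = np.polyfit(x, y, deg=order)
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

sse_linear = sum_sq_residuals(1)
sse_quadratic = sum_sq_residuals(2)
print(sse_linear, sse_quadratic)
```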

What about outliers?

sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"), robust=True, ci=None, scatter_kws={"s": 80})

image.png
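
To get a feel for why a robust fit ignores a single outlier, here is a sketch using the Theil-Sen estimator (the median of all pairwise slopes). This is a different robust method from the one `robust=True` uses, which relies on statsmodels; I picked it only because it fits in a few lines of NumPy. The data are made up.

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[7] = 60.0   # a single outlier (the clean value would be 15)

# Ordinary least squares gets pulled towards the outlier.
ols_slope = np.polyfit(x, y, deg=1)[0]

# Theil-Sen: median of the slopes over all pairs of points.
i, j = np.triu_indices(x.size, k=1)
pair_slopes = (y[j] - y[i]) / (x[j] - x[i])
ts_slope = np.median(pair_slopes)
ts_intercept = np.median(y - ts_slope * x)
print(ols_slope, ts_slope, ts_intercept)
```

Only 9 of the 45 pairs involve the outlier, so the median slope stays at the slope of the clean points.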

How do you fit a binary variable?

tips["big_tip"] = (tips.tip / tips.total_bill) > .15
sns.lmplot(x="total_bill", y="big_tip", data=tips, y_jitter=.05)

image.png

sns.lmplot(x="total_bill", y="big_tip", data=tips, logistic=True, y_jitter=.03, ci=None)

image.png
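
What `logistic=True` fits is an S-shaped curve p = 1 / (1 + exp(-(a + b·x))) that stays between 0 and 1. Here is a bare-bones sketch that fits such a curve by gradient descent on made-up data; seaborn delegates this to statsmodels, so treat the loop below (step size, iteration count) as my own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: the probability of y == 1 rises with x.
x = rng.normal(size=500)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y = (rng.uniform(size=500) < p_true).astype(float)

# Design matrix with an intercept column; w = [a, b].
X = np.column_stack([np.ones_like(x), x])
w = np.zeros(2)
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (p - y) / y.size   # gradient of the mean log-loss
    w -= 0.5 * grad                 # fixed step size

p_hat = 1.0 / (1.0 + np.exp(-X @ w))
print(w)
```

Unlike the straight line in the previous plot, the fitted probabilities `p_hat` can never leave the interval (0, 1).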

How do you judge the quality of a fit? Plot the residuals.

sns.residplot(x="x", y="y", data=anscombe.query("dataset == 'I'"), scatter_kws={"s": 80})

image.png

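If the fit is good, the residuals look like white noise, distributed as N(0, σ²). If the fit is poor, the residuals show a visible pattern.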
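
The same check can be done numerically. In this sketch (made-up data, helper name `linear_residuals` is mine) a straight line is fitted to a truly linear dataset and to a curved one, and the residuals are correlated with a curvature term that a line cannot capture.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
noise = rng.normal(scale=0.5, size=x.size)
y_linear = 1.0 + 2.0 * x + noise
y_curved = 1.0 + 2.0 * x + 0.5 * (x - 5.0) ** 2 + noise

def linear_residuals(x, y):
    """Residuals of a least-squares straight-line fit."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# Correlate the residuals with a curvature term the line cannot capture.
curvature = (x - x.mean()) ** 2
r_good = np.corrcoef(linear_residuals(x, y_linear), curvature)[0, 1]
r_bad = np.corrcoef(linear_residuals(x, y_curved), curvature)[0, 1]
print(r_good, r_bad)
```

A correlation near 0 means the residuals look like noise; a correlation near 1 means the line has missed a systematic curve.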

2.3 Exploring conditional relationships between variables

sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips)

image.png

sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips, markers=["o", "x"])

image.png

Try conditioning on more categorical variables:

sns.lmplot(x="total_bill", y="tip", hue="smoker", col="time", data=tips)

image.png

sns.lmplot(x="total_bill", y="tip", hue="smoker", col="time", row="sex", data=tips)

image.png

Controlling the size and shape of the figure

sns.lmplot(x="total_bill", y="tip", col="day", data=tips, col_wrap=2, size=5)

image.png

sns.lmplot(x="total_bill", y="tip", col="day", data=tips, aspect=0.5)

image.png

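3. Visualizing categorical data

Showing each observation directly: swarmplot, stripplot
Showing an approximate distribution of the observations: boxplot, violinplot
Showing means and confidence intervals: barplot, pointplot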

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
np.random.seed(sum(map(ord, "categorical")))
titanic = sns.load_dataset("titanic")
tips = sns.load_dataset("tips")
iris = sns.load_dataset("iris")

3.1 Categorical scatter plots

When one of the two variables is categorical, the scatter plot collapses into strips.

sns.stripplot(x="day", y="total_bill", data=tips)

image.png

Points piled on top of each other and hard to read? Remember the jitter trick:

sns.stripplot(x="day", y="total_bill", data=tips, jitter=True)

image.png

Another option is a swarm plot, which positions the points so that they do not overlap.

sns.swarmplot(x="day", y="total_bill", data=tips)

image.png

Each top-level category can be split further by a second categorical variable:

sns.swarmplot(x="day", y="total_bill", hue="sex", data=tips)

image.png

3.2 Distributions within categories

Box plot
Upper whisker, upper quartile, median, lower quartile, lower whisker

sns.boxplot(x="day", y="total_bill", hue="time", data=tips)

image.png
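
The five numbers a box plot draws can be computed directly. This sketch uses made-up values and assumes the common convention that whiskers extend to the most extreme points within 1.5 × IQR of the box, with anything beyond drawn as an outlier.

```python
import numpy as np

# Hypothetical values, already sorted for readability.
values = np.array([3.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 30.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box.
inside = values[(values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Everything beyond the whiskers is drawn as an individual outlier point.
outliers = values[(values < lower_whisker) | (values > upper_whisker)]
print(lower_whisker, q1, median, q3, upper_whisker, outliers)
```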

Violin plot
A box plot combined with a KDE (kernel density estimate)

sns.violinplot(x="total_bill", y="day", hue="time", data=tips)

image.png

sns.violinplot(x="day", y="total_bill", hue="time", data=tips)

image.png

sns.violinplot(x="total_bill", y="day", hue="time", data=tips, bw=.1, scale="count", scale_hue=False)

image.png

Split (asymmetric) violin plot

sns.violinplot(x="day", y="total_bill", hue="sex", data=tips, split=True, inner="stick")

image.png

3.3 Statistical estimates within categories

Bar plot of an estimate

sns.barplot(x="sex", y="survived", hue="class", data=titanic)

image.png

Count plot (a histogram over categories)

sns.countplot(x="deck", data=titanic, palette="Greens_d")

image.png

Point plot

sns.pointplot(x="sex", y="survived", hue="class", data=titanic)

image.png

Customizing colors, markers, and line styles

sns.pointplot(x="class", y="survived", hue="sex", data=titanic,
              palette={"male": "g", "female": "m"},
              markers=["^", "o"], linestyles=["-", "--"])

image.png

3.4 Faceted categorical plots

sns.factorplot(x="day", y="total_bill", hue="smoker", col="time", data=tips, kind="swarm")

image.png

Subplots across several categorical variables

g = sns.PairGrid(tips,
                 x_vars=["smoker", "time", "sex"],
                 y_vars=["total_bill", "tip"],
                 aspect=.75, size=3.5)
g.map(sns.violinplot, palette="pastel");

image.png

4. Titanic Project

#Now let's open it with pandas
import pandas as pd
from pandas import Series,DataFrame

# Set up the Titanic csv file as a DataFrame
titanic_df = pd.read_csv('train.csv')

# Let's see a preview of the data
titanic_df.head()

# We could also get overall info for the dataset
titanic_df.info()

All good data analysis projects begin with trying to answer questions. Now that we know what columns and categories of data we have, let's think of some questions or insights we would like to obtain from it. Here's a list of questions we'll try to answer using our new data analysis skills!
First some basic questions:
1.) Who were the passengers on the Titanic? (Age, gender, class, etc.)
2.) What deck were the passengers on and how does that relate to their class?
3.) Where did the passengers come from?
4.) Who was alone and who was with family?
Then we'll dig deeper, with a broader question:
5.) What factors helped someone survive the sinking?
So let's start with the first question: Who were the passengers on the Titanic?

# Let's import what we'll need for the analysis and visualization
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Let's first check gender
sns.factorplot('Sex',data=titanic_df,kind="count")

image.png

# Now let's separate the genders by classes, remember we can use the 'hue' argument here!
sns.factorplot('Pclass',data=titanic_df,hue='Sex',kind="count")

image.png

# Now the other way around: put gender on the x axis and use 'hue' for the classes
sns.factorplot('Sex',data=titanic_df,hue='Pclass',kind="count")

image.png

# We'll treat anyone under 16 as a child, and then use the apply technique with a function to create a new column

# Revisit Lecture 45 for a refresher on how to do this.

# First let's make a function that classifies each passenger by age and sex
def male_female_child(passenger):
    # Take the Age and Sex
    age,sex = passenger
    # Compare the age, otherwise leave the sex
    if age < 16:
        return 'child'
    else:
        return sex
    

# We'll define a new column called 'person', remember to specify axis=1 for columns and not index
titanic_df['person'] = titanic_df[['Age','Sex']].apply(male_female_child,axis=1)
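
The same column can also be built without a row-by-row `apply`. A sketch on a few made-up rows that only mimic the `Age` and `Sex` columns of train.csv (the `passengers` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the same 'Age' and 'Sex' columns as train.csv.
passengers = pd.DataFrame({
    "Age": [22.0, 4.0, np.nan, 15.0, 16.0],
    "Sex": ["male", "female", "female", "male", "female"],
})

# Keep 'Sex' except where Age < 16. A missing age compares as False,
# so those passengers keep their sex, just as in male_female_child.
passengers["person"] = passengers["Sex"].mask(passengers["Age"] < 16, "child")
print(passengers)
```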

# Let's see if this worked, check out the first ten rows
titanic_df[0:10]

# Let's try the factorplot again!
sns.factorplot('Pclass',data=titanic_df,hue='person',kind="count")

image.png

# Quick way to create a histogram using pandas
titanic_df['Age'].hist(bins=70)

image.png

# We could also get a quick overall comparison of male,female,child
titanic_df['person'].value_counts()

# Another way to visualize the data is to use FacetGrid to plot multiple kdeplots on one plot

# Set the figure equal to a facetgrid with the pandas dataframe as its data source, set the hue, and change the aspect ratio.
fig = sns.FacetGrid(titanic_df, hue="Sex",aspect=5)

# Next use map to plot all the possible kdeplots for the 'Age' column by the hue choice
fig.map(sns.kdeplot,'Age',shade= True)

# Set the x max limit by the oldest passenger
oldest = titanic_df['Age'].max()

#Since we know no one can be negative years old set the x lower limit at 0
fig.set(xlim=(0,oldest))

#Finally add a legend
fig.add_legend()

image.png

#We could have done the same thing for the 'person' column to include children:

fig = sns.FacetGrid(titanic_df, hue="person",aspect=4)
fig.map(sns.kdeplot,'Age',shade= True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()

image.png

# Let's do the same for class by changing the hue argument:
fig = sns.FacetGrid(titanic_df, hue="Pclass",aspect=4)
fig.map(sns.kdeplot,'Age',shade= True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()

image.png

#We've gotten a pretty good picture of who the passengers were based on Sex, Age, and Class. So let's move on to our 2nd question: What deck were the passengers on and how does that relate to their class?

# Let's get a quick look at our dataset again
titanic_df.head()

# First we'll drop the NaN values and create a new object, deck
deck = titanic_df['Cabin'].dropna()

# So let's grab that letter for the deck level with a simple for loop

# Set empty list
levels = []

# Loop to grab first letter
for level in deck:
    levels.append(level[0])    

# Reset DataFrame and use factor plot
cabin_df = DataFrame(levels)
cabin_df.columns = ['Cabin']
sns.factorplot('Cabin',data=cabin_df,palette='winter_d',kind="count")

image.png
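
The for loop above can also be written with the pandas string accessor. A sketch on a made-up `Cabin` column (the values are hypothetical but follow the same "letter plus number" pattern):

```python
import pandas as pd

# Hypothetical cabin values, including missing ones.
cabins = pd.Series(["C85", None, "E46", "B57 B59 B63", None, "G6"], name="Cabin")

# Drop the missing cabins, then take the first character of each string.
levels = cabins.dropna().str[0]

# Count passengers per deck letter, in alphabetical order.
counts = levels.value_counts().sort_index()
print(counts)
```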

# Redefine cabin_df as everything but where the row was equal to 'T'
cabin_df = cabin_df[cabin_df.Cabin != 'T']
#Replot
sns.factorplot('Cabin',data=cabin_df,palette='cool',kind="count")

image.png

#Where did the passengers come from?

# Let's take another look at our original data
titanic_df.head()

Note here that the Embarked column has C, Q, and S values. Reading about the project on Kaggle you'll note that these stand for Cherbourg, Queenstown, and Southampton.

# Now we can make a quick factorplot to check out the results, note the order argument, used to deal with NaN values
sns.factorplot('Embarked',data=titanic_df,hue='Pclass',order=['C','Q','S'],kind="count")

image.png

An interesting find here is that almost all the passengers who boarded in Queenstown were 3rd class. It would be interesting to look at the economics of that town in that time period for further investigation.
Now let's take a look at the 4th question:
4.) Who was alone and who was with family?

# Let's start by adding a new column to define alone

# We'll add the parent/child column with the sibsp column
titanic_df['Alone'] =  titanic_df.Parch + titanic_df.SibSp
titanic_df['Alone']

Now we know that if the Alone column is anything but 0, then the passenger had family aboard and wasn't alone. So let's change the column now so that if the value is greater than 0, we know the passenger was with his/her family, otherwise they were alone.

# Look for >0 or ==0 to set alone status
# np.where builds the whole column in one assignment, which avoids the pandas
# SettingWithCopyWarning that chained indexing such as df['Alone'].loc[...] = ... can raise
titanic_df['Alone'] = np.where(titanic_df['Alone'] > 0, 'With Family', 'Alone')

# For more info on that warning check out this link
url_info = 'http://stackoverflow.com/questions/20625582/how-to-deal-with-this-pandas-warning'

# Let's check to make sure it worked
titanic_df.head()

# Now let's get a simple visualization!
sns.factorplot('Alone',data=titanic_df,palette='Blues',kind="count")

image.png

Great work! Now that we've thoroughly analyzed the data let's go ahead and take a look at the most interesting (and open-ended) question: What factors helped someone survive the sinking?

# Let's start by creating a new column for legibility purposes through mapping (Lec 36)
titanic_df["Survivor"] = titanic_df.Survived.map({0: "no", 1: "yes"})

# Let's just get a quick overall view of survived vs died.
sns.factorplot('Survivor',data=titanic_df,palette='Set1',kind="count")

image.png

So quite a few more people died than those who survived. Let's see if the class of the passengers had an effect on their survival rate, since the movie Titanic popularized the notion that the 3rd class passengers did not do as well as their 1st and 2nd class counterparts.

# Let's use a factor plot again, but now considering class
sns.factorplot('Pclass','Survived',data=titanic_df)

image.png
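
Since Survived is coded 0/1, the height of each point in that plot should simply be the mean of Survived within the class, i.e. the survival rate. A sketch on a tiny made-up table (the `toy` frame is hypothetical, not the Kaggle data):

```python
import pandas as pd

# Hypothetical passengers: class and whether they survived (1) or not (0).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of ones: the survival rate.
survival_rate = toy.groupby("Pclass")["Survived"].mean()
print(survival_rate)
```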

Looks like survival rates for the 3rd class are substantially lower! But maybe this effect is being caused by the large number of men in the 3rd class in combination with the women and children first policy. Let's use 'hue' to get a clearer picture on this.

# Let's use a factor plot again, but now considering class and gender
sns.factorplot('Pclass','Survived',hue='person',data=titanic_df)

image.png

From this data it looks like being male or being in 3rd class were both unfavourable for survival. Regardless of class, being male dramatically decreased the chances of survival.
But what about age? Did being younger or older have an effect on survival rate?

# Let's use a linear plot on age versus survival
sns.lmplot('Age','Survived',data=titanic_df)

image.png

Looks like there is a general trend that the older the passenger was, the less likely they survived. Let's go ahead and use hue to take a look at the effect of class and age.

# Let's use a linear plot on age versus survival using hue for class separation
sns.lmplot('Age','Survived',hue='Pclass',data=titanic_df,palette='winter')

image.png

# Same plot again, but with the ages grouped into bins (x_bins) to clean up the scatter
generations=[10,20,40,60,80]
sns.lmplot('Age','Survived',hue='Pclass',data=titanic_df,palette='winter',x_bins=generations)

image.png

sns.lmplot('Age','Survived',hue='Sex',data=titanic_df,palette='winter',x_bins=generations)

image.png

1.) Did the deck have an effect on the passengers' survival rate? Did this answer match up with your intuition?
2.) Did having a family member increase the odds of surviving the crash?

titanic_df.head()
cabin_df

# .copy() gives an independent DataFrame, so adding a column below does not raise SettingWithCopyWarning
titanic_new=titanic_df[titanic_df.Cabin.notnull()].copy()
titanic_new

# Take the first letter of each cabin as the deck level
titanic_new['level'] = titanic_new['Cabin'].str[0]

titanic_new.head()

sns.factorplot('level',data=titanic_new,palette='Set1',kind="count",order=['A','B','C','D','E','F','G','T'])

image.png

# Let's use a factor plot again, but now looking at survival rate by deck level
sns.factorplot('level','Survived',order=['A','B','C','D','E','F','G','T'],data=titanic_new)

image.png

# Let's use a factor plot again, but now comparing passengers who were alone with those who had family aboard
sns.factorplot('Alone','Survived',data=titanic_df)

image.png

Last edited: 2017.12.09 21:54:10
