描述数据分布01

Part1 1.1~1.4

1.1 绘制简单直方图
1.2 基于分组数据绘制分组直方图
1.3 绘制密度曲线
1.4 基于分组数据绘制分组密度曲线
1.5 绘制频数多边形
1.6 绘制基本箱线图
1.7 向箱线图添加槽口
1.8 向箱线图添加均值
1.9 绘制小提琴图
1.10 绘制Wikinson点图
1.11 基于分组数据绘制分组点图
1.12 绘制二维数据的密度图

1.1 绘制简单直方图

这里faithful是记录了“老忠实”喷泉的喷发时长和两次喷发间隔，是一个连续性变量，绘制直方图可以使得在一个等待区间内喷发次数（历史统计了272组数据）

library(ggplot2)
faithful
# eruptions waiting
# 1     3.600      79
# 2     1.800      54
# 3     3.333      74
# 4     2.283      62
# 5     4.533      85
# 6     2.883      55
ggplot(faithful, aes(x=waiting)) +
  geom_histogram()

fig-1.1.1.png

上面是默认情况下
（1）可以使用binwidth参数来修改组距
（2）将数据切分为指定的分组数目
（3）直方图默认的填充颜色是黑色，且没有边框线，这里面fill修改填充颜色，colour修改边框颜色

# 设定组距为5，（fig-1.1.2.png）
ggplot(faithful, aes(x=waiting)) +
  geom_histogram(binwidth=5, fill="white", colour="black")

# 将x的取值切分为15组（fig-1.1.3.png）
binsize = diff(range(faithful$waiting))/15
ggplot(faithful, aes(x=waiting)) +
  geom_histogram(binwidth=binsize, fill="grey", colour="blue")

fig-1.1.2.png

fig-1.1.3.png

基本绘图结果可以保存为变量以便于重复使用，同时boundary参数可以设定分组原点，当boundary=31时，组边界为31,39,47等（fig-1.1.4.png），而在fig-1.1.5.png中分组原点为35，则组边界为35,43,51等。

h = ggplot(faithful, aes(x=waiting))
h + geom_histogram(binwidth=8, fill="white", colour="black", boundary=31)
h + geom_histogram(binwidth=8, fill="white", colour="black", boundary=35)

fig-1.1.4.png

fig-1.1.5.png

1.2 基于分组数据绘制分组直方图

绘制分组直方图使用geom_histogram()函数+分面绘图

# 载入需要使用的数据
library(MASS)
head(birthwt)
# low age lwt race smoke ptl ht ui ftv  bwt
# 85   0  19 182    2     0   0  0  1   0 2523
# 86   0  33 155    3     0   0  0  0   3 2551
# 87   0  20 105    1     1   0  0  0   1 2557
# 88   0  21 108    1     1   0  0  1   2 2594
# 89   0  18 107    1     1   0  0  1   0 2600
# 91   0  21 124    3     0   0  0  0   0 2622

# 使用smoke作为分面变量（fig-1.2.1.png）
ggplot(birthwt, aes(x=bwt)) +
  geom_histogram(fill="white", colour="black") + 
  facet_grid(smoke ~ .)

fig-1.2.1.png

上面的分面标签为0和1，不看代码不清楚是smoke的取值，最好修改分面标签，修改需要修改因子水平的名称
（1）首先列出现有的因子水平
（2）按照相同的顺序赋予新的名字

# 复制一个数据副本，避免错误操作丢失原始数据
birthwt1 = birthwt
# 将smoke转化为因子
birthwt1$smoke = factor(birthwt1$smoke)
levels(birthwt1$smoke)

# 导入plyr包，调用revalue()函数，将0和1
# 替换为No Smoke和Smoke
library(plyr)
birthwt1$smoke = revalue(birthwt1$smoke, c("0"="No Smoke", "1"="Smoke"))
ggplot(birthwt1, aes(x=bwt)) +
  geom_histogram(fill="white", colour="black") + 
  facet_grid(smoke ~ .)

fig-1.2.2.png

分面绘图时，个分面的y轴标度是相同的。当按照出生体重race进行分组时（fig-1.2.3.png）

ggplot(birthwt, aes(x=bwt)) +
  geom_histogram(fill="white", colour="black") + 
  facet_grid(race ~ .)

fig-1.2.3.png

如果不同分组想使用不同的标度，可以使用scales="free"，这个设置只适合y轴，x轴不会改变（fig-1.2.4.png）

ggplot(birthwt, aes(x=bwt)) +
  geom_histogram(fill="white", colour="black") + 
  facet_grid(race ~ ., scales="free")

fig-1.2.4.png

分组绘图的另一种做法是把分组变量映射给fill，但是分组变量必须是因子型或者字符型（这里使用上面的birthwt1变量，smoke已经是因子型）（fig-1.2.5.png）

birthwt1$smoke = factor(birthwt1$smoke)
# 把smoke映射给fill，取消条形堆叠，并使图形半透明，alpha为1的不透明
ggplot(birthwt1, aes(x=bwt, fill=smoke)) +
  geom_histogram(position="identity", alpha=0.4)

fig-1.2.5.png

1.3 绘制密度曲线

使用geom_density()函数，并映射一个连续型变量到x（fig-1.3.1.png）

ggplot(faithful, aes(x=waiting)) +
  geom_density()

fig-1.3.1.png

核密度曲线是基于样本数据对总体分布做出的一个估计。曲线的光滑程度取决于核函数的带宽：带宽越大，曲线越光滑。带宽可以通过adjust参数进行设置，其默认值为1（fig-1.3.2.png）

ggplot(faithful, aes(x=waiting)) +
  geom_line(stat="density", adjust=.25, colour="red") +
  geom_line(stat="density") +
  geom_line(stat="density", adjust=2, colour="blue")

fig-1.3.2.png

可以使用xlim(a, b)设置x轴范围（fig-1.3.3.png）

ggplot(faithful, aes(x=waiting)) +
  geom_density(fill="blue", alpha=.2) +
  xlim(35, 105)

fig-1.3.3.png

可以将密度曲线绘制到更高一层的图层上，通过设置y=..density..可以减小直方图的标度以使其余密度曲线相匹配。

ggplot(faithful, aes(x=waiting, y=..density..)) +
  geom_histogram(fill="cornsilk", colour="grey60", alpha=.2) +
  geom_dentisy() +
  xlim(35, 105)

fig-1.3.4.png

1.4 基于分组数据绘制分组密度曲线

使用geom_density()函数，将分组变量映射给colour或fill等图形属性即可
分组变量必须是因子型或字符串向量！

library(MASS)
birthwt1 = birthwt
birthwt1$smoke = factor(birthwt1$smoke)
# 把变量smoke映射给colour
ggplot(birthwt1, aes(x=bwt, colour=smoke)) +
  geom_density()

fig-1.4.1.png

# 把变量smoke映射给fill，设置alpha使填充色半透明
ggplot(birthwt1, aes(x=bwt, fill=smoke)) +
  geom_density(alpha=.3)

fig-1.4.2.png

birthwt包含婴儿出生体重及一系列导致出生体重过低的危险因子的数据

head(birthwt)
# low age lwt race smoke ptl ht ui ftv  bwt
# 85   0  19 182    2     0   0  0  1   0 2523
# 86   0  33 155    3     0   0  0  0   3 2551
# 87   0  20 105    1     1   0  0  0   1 2557
# 88   0  21 108    1     1   0  0  1   2 2594
# 89   0  18 107    1     1   0  0  1   0 2600
# 91   0  21 124    3     0   0  0  0   0 2622

下面观察变量smoke（抽烟与否）与变量bwt（出生体重，单位是克）的关系。

library(plyr)
birthwt1$smoke = revalue(birthwt1$smoke, c("0"="No Smoke", "1"="Smoke"))
png("fig-1.4.3.png", width=500, height=500)
ggplot(birthwt1, aes(x=bwt)) +
  geom_density(fill="white", colour="black") + 
  facet_grid(smoke ~ .)

fig-1.4.3.png

添加了核密度曲线的分面

ggplot(birthwt1, aes(x=bwt, y=..density..)) +
  geom_histogram(fill="cornsilk", colour="grey60", alpha=.2) +
  geom_density() +
  facet_grid(smoke ~ .)

fig-1.4.4.png

描述数据分布01

目录

1.1 绘制简单直方图

1.2 基于分组数据绘制分组直方图

1.3 绘制密度曲线

1.4 基于分组数据绘制分组密度曲线