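Data visualization runs through the whole of data analysis
Data will always end up being presented in some form.
Data visualization is the study of the visual representation of data. That visual representation is defined as information abstracted into some schematic form, including the attributes and variables of the underlying units of information. It lets us look at data along different dimensions and so observe and analyze it in more depth. Whether in early data preprocessing, exploratory data analysis, or later statistical modeling, visualization gives a more direct view of the state of the data.
Mastering a data visualization tool is the chloroplast on a data analyst's skill tree: you cannot grow into a tree without it.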
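What is ggplot2?
Created by Hadley Wickham in 2005, ggplot2 is a graphics package with a theoretical foundation: it is based on The Grammar of Graphics (Wilkinson, 2005), which is also where its name comes from (the "gg" stands for "grammar of graphics"). Its output rivals that of commercial data visualization software. It is built around the concept of layers, is easy to pick up, and makes complex statistical graphics simple to draw.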
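- The core idea of ggplot2 is to separate the plot's aesthetic attributes from the data.
- ggplot2 builds a plot layer by layer.
- ggplot2 keeps imperative-style adjustment functions, which makes it more flexible.
- ggplot2 builds common statistical transformations into plotting.
ggplot2 is arguably a new language within the R ecosystem, with a grammar of its own.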
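ggplot2 defines a statistical graphic as a mapping from data to the aesthetic attributes (aes for short: colour, shape, size, and so on) of geometric objects (geom for short: points, lines, bars, and so on). A plot may also contain statistical transformations of the data (stats for short), and it is drawn in a specific coordinate system (coord for short). Faceting (facet: splitting the plotting window into several sub-windows) can be used to draw the same plot for different subsets of the data. In short, a statistical graphic is composed of these independent components. (From the Chinese edition of "ggplot2: Elegant Graphics for Data Analysis".)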
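As a rough illustration of the definition above, the sketch below puts one instance of each component (data, aesthetic mapping, geom, stat, coord, facet) into a single plot. The particular choices (a linear smoother, the y-axis limits, faceting by year) are assumptions made for the example, not something the book prescribes.

```r
library(ggplot2)

# Each argument or added function corresponds to one component of the grammar.
p <- ggplot(data = mpg,                               # data
            mapping = aes(x = displ, y = hwy)) +      # aesthetic mapping
  geom_point(aes(colour = drv)) +                     # geom: points
  stat_smooth(method = "lm", formula = y ~ x) +       # stat: fitted linear model
  coord_cartesian(ylim = c(10, 45)) +                 # coordinate system
  facet_wrap(~ year)                                  # facets: one panel per year

p
```

Removing or swapping any one line changes only that component, which is what "independent components" means in practice.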
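Why use ggplot2?
- It inherits the strengths of R itself: it is open source.
- Learning one grammar lets you handle almost any kind of data.
- It is quick to pick up.
- It has a full set of plot components: facets, colour, size, and so on.
- Sensible defaults let the author focus on the data.
- The results look good.
- It takes little code (and is still far better than software that needs no code, because code can be reused).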
- The documentation is great
How do you use ggplot2?
We will walk through the process using the examples from the book R for Data Science (r4ds).
library(tidyverse)
mpg
# A tibble: 234 x 11
   manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
   <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
 1 audi         a4         1.8  1999     4 auto(l~ f        18    29 p     comp~
 2 audi         a4         1.8  1999     4 manual~ f        21    29 p     comp~
 3 audi         a4         2    2008     4 manual~ f        20    31 p     comp~
 4 audi         a4         2    2008     4 auto(a~ f        21    30 p     comp~
 5 audi         a4         2.8  1999     6 auto(l~ f        16    26 p     comp~
 6 audi         a4         2.8  1999     6 manual~ f        18    26 p     comp~
 7 audi         a4         3.1  2008     6 auto(a~ f        18    27 p     comp~
 8 audi         a4 qua~    1.8  1999     4 manual~ 4        18    26 p     comp~
 9 audi         a4 qua~    1.8  1999     4 auto(l~ 4        16    25 p     comp~
10 audi         a4 qua~    2    2008     4 manual~ 4        20    28 p     comp~
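Note the data format that ggplot2 expects: you cannot plot just any table as it comes. Each column has to be a separate variable, which can then be mapped to a graphical element.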
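To make the point about data format concrete, here is a minimal sketch that reshapes a wide table into one column per variable before plotting. The `wide` table and its values are hypothetical, made up for this example, and `tidyr::pivot_longer()` is one common way to do the reshaping, not the only one.

```r
library(tidyr)
library(ggplot2)

# Hypothetical wide table: one column per year, so "year" is not a variable yet.
wide <- data.frame(
  city = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(12, 18),
  check.names = FALSE
)

# Reshape so that year and value each get their own column.
long <- pivot_longer(wide, cols = c("2019", "2020"),
                     names_to = "year", values_to = "value")

# Every aesthetic now maps to exactly one column.
ggplot(long, aes(x = year, y = value, fill = city)) +
  geom_col(position = "dodge")
```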
displ: a car’s engine size, in litres.
hwy: a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
#3.2.2 Creating a ggplot
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
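Broadly, the plot shows a negative relationship: cars with larger engines tend to have lower highway fuel efficiency.
Plot template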
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
Aesthetic mappings
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
p1<-ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
p2<-ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
library(gridExtra)
grid.arrange(p1,p2,ncol = 2, nrow = 1)
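Unfortunately, the SUV class has lost its shape in the legend. By default ggplot2 uses only six shapes at a time, and class has seven values, so the seventh (suv) is left without one.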
mpg$class<-factor(mpg$class)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))+
scale_shape_manual(values=1:nlevels(mpg$class))
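# Reference chart of point shapes: codes 0-25 are the built-in symbols,
# and codes 32-127 draw the corresponding ASCII character.
d <- data.frame(p = c(0:25, 32:127))
ggplot() +
  scale_y_continuous(name = "") +
  scale_x_continuous(name = "") +
  scale_shape_identity() +
  geom_point(data = d, mapping = aes(x = p %% 16, y = p %/% 16, shape = p), size = 5, fill = "red") +
  geom_text(data = d, mapping = aes(x = p %% 16, y = p %/% 16 + 0.25, label = p), size = 3)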
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
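The plot just above sets the colour outside `aes()`. As a small sketch of why that placement matters, the example below compares setting a colour with (mistakenly) mapping the string "blue" inside `aes()`, and uses `layer_data()` to look at the colour each layer actually ends up with. The names `p_set` and `p_map` are introduced here only for illustration.

```r
library(ggplot2)

# Setting: colour is a fixed value outside aes(), so every point is blue.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), colour = "blue")

# Mapping: inside aes(), "blue" is treated as a one-value variable, so the
# points get the first colour of the default scale and a legend appears.
p_map <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = "blue"))

unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```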
Facets
p1<-ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
p2<-ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
grid.arrange(p1,p2,ncol = 2, nrow = 1)
Give each facet a different background colour.
#Each panel
ggplot(mpg,aes(x=displ, y = hwy)) +
geom_rect(data = mpg,aes(fill = class),xmin = -Inf,xmax = Inf,
ymin = -Inf,ymax = Inf,alpha = 0.3) +
geom_point(shape=1) +
facet_grid(~ class)
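Geometric objects
A geom is the geometric object that a plot uses to represent data. People often describe plots by the type of geom they use: bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. The same data can be drawn with different geoms. Below, the plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.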
# left
p1<-ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
# right
p2<-ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
grid.arrange(p1,p2,ncol = 2, nrow = 1)
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
p1<-ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
p2<-ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
p3<-ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, color = drv),
show.legend = FALSE
)
grid.arrange(p1,p2,p3,ncol = 3, nrow = 1)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
Or, equivalently:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
Specifying a different dataset for a layer
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
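Statistical transformations
"Bin" refers to binning. In statistics, binning splits the values of a continuous variable into a series of intervals, and each interval is called a bin (or bucket). Each bin therefore defines a numeric range, and every value falls into the bin whose range contains it.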
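As a small sketch of binning, the example below splits a continuous variable into fixed-width bins and counts the values in each. The variable (`carat`) and the bin width of 0.5 are arbitrary choices for illustration; `cut_width()` is a helper that ships with ggplot2.

```r
library(ggplot2)

# Split carat (a continuous variable) into bins of width 0.5, starting at 0,
# then count how many diamonds fall into each bin.
bins <- cut_width(diamonds$carat, width = 0.5, boundary = 0)
bin_counts <- table(bins)
bin_counts

# geom_histogram() bins the data in a similar way internally (via stat_bin())
# and draws one bar per bin.
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5, boundary = 0)
```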
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
?geom_bar
geom_bar(mapping = NULL, data = NULL, stat = "count",
position = "stack", ..., width = NULL, binwidth = NULL,
na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
Or, equivalently:
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
- Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
- Smoothers fit a model to your data and then plot predictions from the model.
- Boxplots compute a robust summary of the distribution and then display a specially formatted box.
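One way to see what a stat computes is to pull the transformed data back out of a plot. The sketch below uses ggplot2's `layer_data()` on the bar chart of `cut` and compares the computed counts with a plain `table()`; the variable names `p` and `computed` are just for this example.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Data frame that stat_count() produced for the bar layer: one row per bar.
computed <- layer_data(p)
computed[, c("x", "count", "prop")]

# The same counts, computed directly from the raw data.
table(diamonds$cut)
```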
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
"Good", 4906,
"Very Good", 12082,
"Premium", 13791,
"Ideal", 21551
)
p1<-ggplot(data = demo) +
geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
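p2 <- ggplot(data = diamonds) +
  # after_stat(prop) replaces the deprecated ..prop.. notation (ggplot2 3.3.0 or later)
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
p3 <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    # fun.min, fun.max and fun replace the deprecated fun.ymin, fun.ymax and fun.y
    fun.min = min,
    fun.max = max,
    fun = median
  )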
grid.arrange(p1,p2,p3,ncol = 3, nrow = 1)
Position adjustments
p1<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
p2<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
p3<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
grid.arrange(p1,p2,p3,ncol = 3, nrow = 1)
p1<-ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")
p2<-ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
geom_bar(fill = NA, position = "identity")
p3<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
p4<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
grid.arrange(p1,p2,p3,p4,ncol = 4, nrow = 1)
Avoiding overplotting
p1<-ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "identity")
p2<-ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
grid.arrange(p1,p2,ncol = 2, nrow = 1)
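To see why jittering helps here, the sketch below counts how many distinct (displ, hwy) positions exist compared with the number of rows: points that share a position are drawn on top of each other. The jitter amounts and the `seed` value are arbitrary choices for illustration; fixing the seed only makes the random noise reproducible.

```r
library(ggplot2)

# Rows in the data versus distinct plotting positions.
n_points   <- nrow(mpg)
n_distinct <- nrow(unique(mpg[, c("displ", "hwy")]))
c(rows = n_points, distinct_positions = n_distinct)

# position_jitter() adds a small amount of random noise to each point.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = position_jitter(width = 0.1, height = 0.3, seed = 1))
```

Jitter trades a little positional accuracy for visibility, so it suits exploration better than plots where exact values matter.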
Coordinate systems
p1<-ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
p2<-ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
grid.arrange(p1,p2,ncol = 2, nrow = 1)
nz <- map_data("nz")  # requires the maps package to be installed
p1<-ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
p2<-ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
grid.arrange(p1,p2,ncol = 2, nrow = 1)
bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar1<-bar + coord_flip()
bar2<-bar + coord_polar()
grid.arrange(bar,bar1,bar2,ncol = 3, nrow = 1)
The layered grammar of graphics
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
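As one possible way to fill in every slot of the template above, the sketch below uses the diamonds data. The specific choices (dodged bars, flipped coordinates, faceting by `color`) are assumptions made for the example; any valid geom, stat, position, coord, and facet combination fits the same pattern.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +                 # <DATA>
  geom_bar(                                    # <GEOM_FUNCTION>
    mapping = aes(x = cut, fill = clarity),    # <MAPPINGS>
    stat = "count",                            # <STAT>
    position = "dodge"                         # <POSITION>
  ) +
  coord_flip() +                               # <COORDINATE_FUNCTION>
  facet_wrap(~ color)                          # <FACET_FUNCTION>

p
```

In everyday use most of these slots can be left out, because ggplot2 supplies defaults for the stat, position, coordinate system, and faceting.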
r4ds
R visualization: an overview of the principles
https://ggplot2.tidyverse.org/reference/
How to use ggplot2?
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
Data visualization basics: the visualization workflow
http://varianceexplained.org/r/why-I-use-ggplot2/
https://mandymejia.com/2013/11/13/10-reasons-to-switch-to-ggplot-7/
Why I don't use ggplot2
http://sape.inf.usi.ch/quick-reference/ggplot2/shape
R plotting, part 11: statistical transformations, position adjustments, scales, and guides (ggplot2)