[R] Data visualisation with the ggplot2 package: R for Data Science, part 1

Notes collected while working through chapters 2 and 3 (Data visualisation) of R for Data Science

References

R for Data Science

R数据科学 (the Chinese edition of R for Data Science)

The Layered Grammar of Graphics.

ggplot2: Points

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey

A graphing template

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
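
As a minimal sketch, the template can be filled in with the mpg data set that ships with ggplot2. The particular choices below (geom_point(), coord_cartesian(), facet_wrap() on drv) are illustrative assumptions, not something the book prescribes; stat and position are written out with their default values for geom_point() only so that every slot of the template is visible.

```r
library(ggplot2)

# Every slot of the template filled in explicitly.
# "identity" is the default stat and the default position for geom_point().
p_template <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    stat = "identity",
    position = "identity"
  ) +
  coord_cartesian() +
  facet_wrap(~ drv)

p_template
```

In everyday use most of these slots are left out, because ggplot2 supplies sensible defaults for everything except the data, the mappings and the geom function.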

Aesthetic mappings

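library(ggplot2)
library(dplyr)      # filter(), group_by(), summarise() and %>% are used further down
library(patchwork)  # lets p1 + p2 place plots side by side

# Left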
p1 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

# Right
p2 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))  

p1 + p2
# Warning messages:
# 1: Using alpha for a discrete variable is not advised. 
# 2: The shape palette can deal with a maximum of 6 discrete values
# because more than 6 becomes difficult to discriminate; you have
# 7. Consider specifying shapes manually if you must have them. 
# 3: Removed 62 rows containing missing values (geom_point).

ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.
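
The warning suggests specifying shapes manually. A minimal sketch of that, assuming the shape codes 0 to 6 (an arbitrary choice, any seven valid codes would do):

```r
library(ggplot2)

# mpg$class has 7 levels, one more than the default shape palette covers.
# Supplying 7 shape codes by hand gives every class its own shape,
# so no rows are dropped.
p_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)

p_shapes
```

Even with all seven classes drawn, this many shapes is hard to tell apart, so colour or facets may still be the better choice.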

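- How do these aesthetics behave differently for categorical vs. continuous variables?

'''
color (an ordered aesthetic)
1. Mapped to a categorical variable: each level gets its own distinct colour
2. Mapped to a continuous variable: a colour gradient over a fixed range is built, and each value takes its colour from that gradient

size (an ordered aesthetic)
1. Mapped to a categorical variable: each level gets its own point size, but the sizes say nothing meaningful about the levels, and ggplot2 issues a warning
2. Mapped to a continuous variable: point size increases with the value of the variable

shape (an unordered aesthetic)
1. Mapped to a categorical variable: each level gets its own shape, with at most 6 shapes at a time; extra levels are not drawn and a warning is issued
2. Mapped to a continuous variable: not possible, ggplot2 raises an error
'''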
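
These differences can be tried out directly. The sketch below is only an illustration on mpg: it maps colour to a categorical variable (class) and to a continuous one (cty), then checks that mapping shape to a continuous variable fails when the plot is built. The object names are made up for this example.

```r
library(ggplot2)

# colour mapped to a categorical variable: one distinct colour per class
p_discrete <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = class))

# colour mapped to a continuous variable: a gradient over the range of cty
p_continuous <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = cty))

# shape mapped to a continuous variable: building the plot raises an error
p_bad <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, shape = cty))
shape_result <- try(ggplot_build(p_bad), silent = TRUE)

# TRUE when the build failed
inherits(shape_result, "try-error")
```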

- Variable types in mpg
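
One quick way to list the variable types, sketched here with base R's sapply() (the name var_types is just for this example):

```r
library(ggplot2)

# Class of every column in mpg: character columns are categorical,
# integer and numeric (double) columns hold numbers
var_types <- sapply(mpg, class)
var_types

# Names of the categorical (character) columns
names(var_types)[var_types == "character"]
```

Note that a numeric column is not automatically continuous in meaning: cyl and year are stored as integers but take only a handful of distinct values.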

The stroke aesthetic (border width of a point)

p1 <- ggplot(mpg,aes(x = displ, y = hwy)) +
  geom_point(shape = 1)

p2 <- ggplot(mpg,aes(x = displ, y = hwy)) +
  geom_point(shape = 1,stroke = 2)

p1 + p2

Facets

- Wrapped facets: facet_wrap()

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

Among the parameters of facet_wrap():

# strip.position sets the side of each panel on which the strip label is drawn
p1 <- ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2, strip.position = 'bottom')

p2 <- ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2, strip.position = 'right')

p1 + p2

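- Showing the full data set in every facet

ggplot(mpg, aes(displ, hwy)) +
  # dropping class from this layer's data means it cannot be split by facet,
  # so the grey points are repeated in every panel
  geom_point(data = transform(mpg, class = NULL), 
             colour = "grey85") +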
  geom_point() +
  facet_wrap(~ class)

- Grid facets: facet_grid()

# A . means "do not facet along this dimension" (rows or columns)
p1 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .) # rows ~ columns

p2 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

p1 + p2
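
For completeness, a small sketch that facets on both dimensions at once, combining the two variables used above (the name p_grid is made up for this example):

```r
library(ggplot2)

# drv defines the rows, cyl defines the columns
p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

p_grid
```

facet_grid() lays out one panel for every row/column combination, so combinations that do not occur in the data show up as empty panels.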

Geometric objects

- Hiding the legend and the confidence interval

p1 <- ggplot(mpg) +
  geom_smooth(aes(x = displ, y = hwy))

p2 <- ggplot(mpg,aes(x = displ, y = hwy, group = drv)) +
  geom_smooth(se = FALSE)

p3 <- ggplot(mpg) +
  geom_smooth(
    aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE)

p1 + p2 + p3

- Combining with filter() from dplyr

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

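- Fine-tuning how points are drawn

Both plots draw coloured points with a white outline. In the first, the outline belongs to each point, so white remains visible where points overlap. In the second, all white points are drawn in a layer underneath, so overlapping coloured points show no white between them.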

p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(fill=drv),shape=21,color='white',size=2.5,stroke=1.5)

p2 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color='white',size=3.5)+
  geom_point(aes(color=drv),shape=16,size=2.3)

p1 + p2

Statistical transformations

Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
Smoothers fit a model to your data and then plot predictions from the model.
Boxplots compute a robust summary of the distribution and then display a specially formatted box.

- Common geom/stat equivalences

You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar().

ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut))
# is equivalent to
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut), stat = "count") # "count" is the default stat, so it can be omitted

# fun.ymin, fun.ymax and fun.y were renamed fun.min, fun.max and fun in ggplot2 3.3.0
ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )
# is equivalent to
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

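# The same plot can also be built by hand
# (needs dplyr for %>%, group_by() and summarise())
ggplot(diamonds, aes(cut, depth)) + 
  # one vertical line per cut, spanning the minimum to the maximum depth;
  # linewidth replaces size for lines since ggplot2 3.4.0
  geom_line(linewidth = 1) + 
  # a layer that uses different data must pass it explicitly as data = ...
  geom_point(data = diamonds %>%   
               group_by(cut) %>% 
               summarise(median_depth = median(depth)),
             aes(cut, median_depth), size = 2)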
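
To check that a geom and its paired stat really produce the same thing, the computed layer data can be compared. This is only a sketch; it relies on layer_data(), which returns the data frame a layer draws after its stat has run, and the object names are made up for this example.

```r
library(ggplot2)

p_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
p_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Counts per cut as computed by each layer
counts_stat <- layer_data(p_stat)$count
counts_geom <- layer_data(p_geom)$count

# TRUE when both layers computed the same bar heights
identical(counts_stat, counts_geom)
```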

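- Overriding the default mapping

# after_stat(prop) refers to the proportion computed by stat_count().
# Older code writes the same thing as stat(prop) or ..prop..;
# both spellings are deprecated since ggplot2 3.4.0.
ggplot(diamonds) + 
  geom_bar(aes(x = cut, y = after_stat(prop), group = 1, fill = after_stat(prop)))

p1 <- ggplot(diamonds) + 
  geom_bar(aes(x = cut, y = after_stat(prop), group = 1, fill = after_stat(prop)))

p2 <- ggplot(diamonds) + 
  geom_bar(aes(x = cut, y = after_stat(prop), group = color, fill = color))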

p1 + p2

- What does geom_col() do? How is it different to geom_bar()?

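geom_col() also draws a bar chart, but it uses stat = "identity": no statistical transformation is applied, and the bar heights are taken directly from the y values in the data.
geom_bar() uses stat = "count" by default: it counts the rows in each x group and uses the count as the bar height.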
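
A small sketch of the difference, assuming the counts are pre-computed with base R's table() (the names cut_counts, p_bar and p_col are made up for this example). Both plots end up with the same bar heights:

```r
library(ggplot2)

# Pre-compute the counts that geom_bar() would otherwise compute internally
cut_counts <- as.data.frame(table(cut = diamonds$cut), responseName = "n")

# geom_bar(): ggplot2 counts the rows for us
p_bar <- ggplot(diamonds) +
  geom_bar(aes(x = cut))

# geom_col(): the heights are taken as-is from the n column
p_col <- ggplot(cut_counts) +
  geom_col(aes(x = cut, y = n))

# Compare the bar heights of the two plots
all(layer_data(p_bar)$y == layer_data(p_col)$y)
```

So geom_col() is the natural choice when the data already contains one row per bar, and geom_bar() when it contains one row per observation.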

- Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

Position adjustments

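position = "identity": draws each object exactly where it falls, so objects overlap one another; rarely suitable for presenting bar charts
position = "fill": stacked bars scaled to the same height, showing proportions
position = "dodge": bars placed side by side
position = "stack": bars stacked on top of each other
position = "jitter": adds random noise to each point; mostly used for scatterplots

An example borrowed from Dr. Liu: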

library(ggplot2)
library(patchwork)

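set.seed(1)  # make the random y values reproducible

v <- data.frame(x = rep(1:20, times = 2),  # x runs from 1 to 20 within each group
                y = runif(40, min = 10, max = 20),
                z = rep(c("A", "B"), each = 20))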
                
p1 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_dodge(), alpha = 0.5) +
  labs(title = "position_dodge()")

p2 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_fill(), alpha = 0.5) +
  labs(title = "position_fill()")

p3 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_stack(), alpha = 0.5) +
  labs(title = "position_stack()")

p4 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_identity(), alpha = 0.5) +
  labs(title = "position_identity()")

p5 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_jitter(), alpha = 0.5) +
  labs(title = "position_jitter(), usually used with points")

(p1 + p2 + p3)/(p4 + p5)

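Jittering with geom_jitter()

geom_jitter() adds a small amount of random noise to each point
geom_count() counts the observations at each location and maps the count to point size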

p1 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()
# is equivalent to
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = position_jitter())
# is equivalent to
p2 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = 'jitter')

# geom_count()
p3 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

p1 + p2 + p3
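
The amount of jitter can also be controlled. A minimal sketch, assuming a shift of at most 0.3 units in each direction and a seed of 42 (both values are arbitrary choices for illustration):

```r
library(ggplot2)

# width and height bound the random shift in each direction (in data units);
# seed makes the random jitter reproducible from one build to the next
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3, seed = 42))

# Building the layer twice gives the same jittered positions
jitter_a <- layer_data(p_jitter)
jitter_b <- layer_data(p_jitter)
identical(jitter_a$x, jitter_b$x)
```

Without a seed the points land in slightly different places every time the plot is drawn, which can be confusing when comparing versions of a figure.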

Coordinate systems

- coord_flip()

coord_flip() switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.

p1 <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

p2 <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

p1 + p2

- coord_quickmap()

coord_quickmap() sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2. Note that map_data() needs the maps package to be installed.

nz <- map_data("nz")

p1 <- ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

p2 <- ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

p1 + p2

- coord_polar()

bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut), 
    show.legend = FALSE,
    width = 1
  ) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

p1 <- bar + coord_flip()
p2 <- bar + coord_polar()

p1 + p2

Going further:

- Turn a stacked bar chart into a pie chart using coord_polar()

p1 <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity)) + 
  coord_polar()

p2 <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity),
           position = 'fill') + 
  coord_polar()

# theta is the variable to map angle to (x or y)
# with theta = "y", each segment's share of the bar is what gets mapped to the angle
p3 <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity),
           position = 'fill') + 
  coord_polar(theta = "y")

p1 + p2 + p3
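
The three plots above keep one bar per cut, so each comes out as rings or wedges rather than a single pie. A classic pie chart needs a single stacked bar. The sketch below is one way to do that; using a constant x of "" and filling by cut are choices made for this example.

```r
library(ggplot2)

# One stacked bar: a constant x puts all cuts into the same bar
p_stack <- ggplot(diamonds) +
  geom_bar(aes(x = "", fill = cut), width = 1)

# Mapping y (the stacked counts) to the angle turns that bar into a pie
p_pie <- p_stack + coord_polar(theta = "y")

p_pie
```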

- What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

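'''
City and highway fuel efficiency are positively correlated.
coord_fixed() fixes the ratio between the units of the x and y axes (1:1 by default), so the reference line is drawn at a true 45 degrees.
geom_abline() draws a straight line from a slope and an intercept; the defaults are slope = 1 and intercept = 0.
Both intercept and slope can be set explicitly.
'''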

p1 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

p2 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(intercept=-5,slope=1) +
  coord_fixed()

p1 + p2
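
What the plot shows can also be checked numerically. This is just a sketch in base R, not part of the book's answer:

```r
library(ggplot2)

# Correlation between city and highway mileage
cty_hwy_cor <- cor(mpg$cty, mpg$hwy)
cty_hwy_cor

# Share of cars whose highway mileage exceeds their city mileage,
# i.e. the share of points lying above the line y = x
share_above <- mean(mpg$hwy > mpg$cty)
share_above
```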
