tidyverse
是一组处理与可视化R包的集合,其中ggplot2
与dplyr
最广为人知。
核心包有以下一些:
- ggplot2 - 可视化数据
- dplyr - 数据操作语法,可以用它解决大部分数据处理问题
- tidyr - 清理数据
- readr - 读入表格数据
- purrr - 提供一个完整一致的工具集增强R的函数编程
- tibble - 新一代数据框
- stringr - 提供函数集用来处理字符数据
- forcats - 提供有用工具用来处理因子问题
有几个包没接触过,R包太多了,这些强力包还是有必要接触和学习下使用,碰到问题事半功倍。
安装tidyverse
:
install.packages("tidyverse")
导入:
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## √ ggplot2 2.2.1 √ purrr 0.2.4
## √ tibble 1.4.2 √ dplyr 0.7.4
## √ tidyr 0.8.0 √ stringr 1.3.0
## √ readr 1.1.1 √ forcats 0.3.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
有用的函数
# tidyverse与其他包的冲突
tidyverse_conflicts()
# 列出所有tidyverse的依赖包
tidyverse_deps()
#获取tidyverse的logo
tidyverse_logo()
# 列出所有tidyverse包
tidyverse_packages()
# 更新tidyverse包
tidyverse_update()
载入数据
library(datasets)
#install.packages("gapminder")
library(gapminder)
attach(iris)
dplyr
过滤
filter()
函数可以用来取数据子集。
iris %>%
filter(Species == "virginica") # 指定满足的行
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 6.3 3.3 6.0 2.5 virginica
## 2 5.8 2.7 5.1 1.9 virginica
## 3 7.1 3.0 5.9 2.1 virginica
## 4 6.3 2.9 5.6 1.8 virginica
## 5 6.5 3.0 5.8 2.2 virginica
## 6 7.6 3.0 6.6 2.1 virginica
## 7 4.9 2.5 4.5 1.7 virginica
## 8 7.3 2.9 6.3 1.8 virginica
## 9 6.7 2.5 5.8 1.8 virginica
## 10 7.2 3.6 6.1 2.5 virginica
## 11 6.5 3.2 5.1 2.0 virginica
## 12 6.4 2.7 5.3 1.9 virginica
## 13 6.8 3.0 5.5 2.1 virginica
## 14 5.7 2.5 5.0 2.0 virginica
## 15 5.8 2.8 5.1 2.4 virginica
## 16 6.4 3.2 5.3 2.3 virginica
## 17 6.5 3.0 5.5 1.8 virginica
## 18 7.7 3.8 6.7 2.2 virginica
## 19 7.7 2.6 6.9 2.3 virginica
## 20 6.0 2.2 5.0 1.5 virginica
## 21 6.9 3.2 5.7 2.3 virginica
## 22 5.6 2.8 4.9 2.0 virginica
## 23 7.7 2.8 6.7 2.0 virginica
## 24 6.3 2.7 4.9 1.8 virginica
## 25 6.7 3.3 5.7 2.1 virginica
## 26 7.2 3.2 6.0 1.8 virginica
## 27 6.2 2.8 4.8 1.8 virginica
## 28 6.1 3.0 4.9 1.8 virginica
## 29 6.4 2.8 5.6 2.1 virginica
## 30 7.2 3.0 5.8 1.6 virginica
## 31 7.4 2.8 6.1 1.9 virginica
## 32 7.9 3.8 6.4 2.0 virginica
## 33 6.4 2.8 5.6 2.2 virginica
## 34 6.3 2.8 5.1 1.5 virginica
## 35 6.1 2.6 5.6 1.4 virginica
## 36 7.7 3.0 6.1 2.3 virginica
## 37 6.3 3.4 5.6 2.4 virginica
## 38 6.4 3.1 5.5 1.8 virginica
## 39 6.0 3.0 4.8 1.8 virginica
## 40 6.9 3.1 5.4 2.1 virginica
## [到达getOption("max.print") -- 略过10行]]
iris %>%
filter(Species == "virginica", Sepal.Length > 6) # 多个条件用,分隔
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 6.3 3.3 6.0 2.5 virginica
## 2 7.1 3.0 5.9 2.1 virginica
## 3 6.3 2.9 5.6 1.8 virginica
## 4 6.5 3.0 5.8 2.2 virginica
## 5 7.6 3.0 6.6 2.1 virginica
## 6 7.3 2.9 6.3 1.8 virginica
## 7 6.7 2.5 5.8 1.8 virginica
## 8 7.2 3.6 6.1 2.5 virginica
## 9 6.5 3.2 5.1 2.0 virginica
## 10 6.4 2.7 5.3 1.9 virginica
## 11 6.8 3.0 5.5 2.1 virginica
## 12 6.4 3.2 5.3 2.3 virginica
## 13 6.5 3.0 5.5 1.8 virginica
## 14 7.7 3.8 6.7 2.2 virginica
## 15 7.7 2.6 6.9 2.3 virginica
## 16 6.9 3.2 5.7 2.3 virginica
## 17 7.7 2.8 6.7 2.0 virginica
## 18 6.3 2.7 4.9 1.8 virginica
## 19 6.7 3.3 5.7 2.1 virginica
## 20 7.2 3.2 6.0 1.8 virginica
## 21 6.2 2.8 4.8 1.8 virginica
## 22 6.1 3.0 4.9 1.8 virginica
## 23 6.4 2.8 5.6 2.1 virginica
## 24 7.2 3.0 5.8 1.6 virginica
## 25 7.4 2.8 6.1 1.9 virginica
## 26 7.9 3.8 6.4 2.0 virginica
## 27 6.4 2.8 5.6 2.2 virginica
## 28 6.3 2.8 5.1 1.5 virginica
## 29 6.1 2.6 5.6 1.4 virginica
## 30 7.7 3.0 6.1 2.3 virginica
## 31 6.3 3.4 5.6 2.4 virginica
## 32 6.4 3.1 5.5 1.8 virginica
## 33 6.9 3.1 5.4 2.1 virginica
## 34 6.7 3.1 5.6 2.4 virginica
## 35 6.9 3.1 5.1 2.3 virginica
## 36 6.8 3.2 5.9 2.3 virginica
## 37 6.7 3.3 5.7 2.5 virginica
## 38 6.7 3.0 5.2 2.3 virginica
## 39 6.3 2.5 5.0 1.9 virginica
## 40 6.5 3.0 5.2 2.0 virginica
## [到达getOption("max.print") -- 略过1行]]
排序
arrange()
函数用来对观察值排序,默认是升序。
iris %>%
arrange(Sepal.Length)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 4.3 3.0 1.1 0.1 setosa
## 2 4.4 2.9 1.4 0.2 setosa
## 3 4.4 3.0 1.3 0.2 setosa
## 4 4.4 3.2 1.3 0.2 setosa
## 5 4.5 2.3 1.3 0.3 setosa
## 6 4.6 3.1 1.5 0.2 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 4.6 3.6 1.0 0.2 setosa
## 9 4.6 3.2 1.4 0.2 setosa
## 10 4.7 3.2 1.3 0.2 setosa
## 11 4.7 3.2 1.6 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.8 3.4 1.9 0.2 setosa
## 15 4.8 3.1 1.6 0.2 setosa
## 16 4.8 3.0 1.4 0.3 setosa
## 17 4.9 3.0 1.4 0.2 setosa
## 18 4.9 3.1 1.5 0.1 setosa
## 19 4.9 3.1 1.5 0.2 setosa
## 20 4.9 3.6 1.4 0.1 setosa
## 21 4.9 2.4 3.3 1.0 versicolor
## 22 4.9 2.5 4.5 1.7 virginica
## 23 5.0 3.6 1.4 0.2 setosa
## 24 5.0 3.4 1.5 0.2 setosa
## 25 5.0 3.0 1.6 0.2 setosa
## 26 5.0 3.4 1.6 0.4 setosa
## 27 5.0 3.2 1.2 0.2 setosa
## 28 5.0 3.5 1.3 0.3 setosa
## 29 5.0 3.5 1.6 0.6 setosa
## 30 5.0 3.3 1.4 0.2 setosa
## 31 5.0 2.0 3.5 1.0 versicolor
## 32 5.0 2.3 3.3 1.0 versicolor
## 33 5.1 3.5 1.4 0.2 setosa
## 34 5.1 3.5 1.4 0.3 setosa
## 35 5.1 3.8 1.5 0.3 setosa
## 36 5.1 3.7 1.5 0.4 setosa
## 37 5.1 3.3 1.7 0.5 setosa
## 38 5.1 3.4 1.5 0.2 setosa
## 39 5.1 3.8 1.9 0.4 setosa
## 40 5.1 3.8 1.6 0.2 setosa
## [到达getOption("max.print") -- 略过110行]]
iris %>%
arrange(desc(Sepal.Length)) # 降序
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 7.9 3.8 6.4 2.0 virginica
## 2 7.7 3.8 6.7 2.2 virginica
## 3 7.7 2.6 6.9 2.3 virginica
## 4 7.7 2.8 6.7 2.0 virginica
## 5 7.7 3.0 6.1 2.3 virginica
## 6 7.6 3.0 6.6 2.1 virginica
## 7 7.4 2.8 6.1 1.9 virginica
## 8 7.3 2.9 6.3 1.8 virginica
## 9 7.2 3.6 6.1 2.5 virginica
## 10 7.2 3.2 6.0 1.8 virginica
## 11 7.2 3.0 5.8 1.6 virginica
## 12 7.1 3.0 5.9 2.1 virginica
## 13 7.0 3.2 4.7 1.4 versicolor
## 14 6.9 3.1 4.9 1.5 versicolor
## 15 6.9 3.2 5.7 2.3 virginica
## 16 6.9 3.1 5.4 2.1 virginica
## 17 6.9 3.1 5.1 2.3 virginica
## 18 6.8 2.8 4.8 1.4 versicolor
## 19 6.8 3.0 5.5 2.1 virginica
## 20 6.8 3.2 5.9 2.3 virginica
## 21 6.7 3.1 4.4 1.4 versicolor
## 22 6.7 3.0 5.0 1.7 versicolor
## 23 6.7 3.1 4.7 1.5 versicolor
## 24 6.7 2.5 5.8 1.8 virginica
## 25 6.7 3.3 5.7 2.1 virginica
## 26 6.7 3.1 5.6 2.4 virginica
## 27 6.7 3.3 5.7 2.5 virginica
## 28 6.7 3.0 5.2 2.3 virginica
## 29 6.6 2.9 4.6 1.3 versicolor
## 30 6.6 3.0 4.4 1.4 versicolor
## 31 6.5 2.8 4.6 1.5 versicolor
## 32 6.5 3.0 5.8 2.2 virginica
## 33 6.5 3.2 5.1 2.0 virginica
## 34 6.5 3.0 5.5 1.8 virginica
## 35 6.5 3.0 5.2 2.0 virginica
## 36 6.4 3.2 4.5 1.5 versicolor
## 37 6.4 2.9 4.3 1.3 versicolor
## 38 6.4 2.7 5.3 1.9 virginica
## 39 6.4 3.2 5.3 2.3 virginica
## 40 6.4 2.8 5.6 2.1 virginica
## [到达getOption("max.print") -- 略过110行]]
新增变量
mutate()
可以更新或者新增数据框一列。
iris %>%
mutate(Sepal.Length = Sepal.Length * 10) # 将该列数值变成以mm为单位
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 51 3.5 1.4 0.2 setosa
## 2 49 3.0 1.4 0.2 setosa
## 3 47 3.2 1.3 0.2 setosa
## 4 46 3.1 1.5 0.2 setosa
## 5 50 3.6 1.4 0.2 setosa
## 6 54 3.9 1.7 0.4 setosa
## 7 46 3.4 1.4 0.3 setosa
## 8 50 3.4 1.5 0.2 setosa
## 9 44 2.9 1.4 0.2 setosa
## 10 49 3.1 1.5 0.1 setosa
## 11 54 3.7 1.5 0.2 setosa
## 12 48 3.4 1.6 0.2 setosa
## 13 48 3.0 1.4 0.1 setosa
## 14 43 3.0 1.1 0.1 setosa
## 15 58 4.0 1.2 0.2 setosa
## 16 57 4.4 1.5 0.4 setosa
## 17 54 3.9 1.3 0.4 setosa
## 18 51 3.5 1.4 0.3 setosa
## 19 57 3.8 1.7 0.3 setosa
## 20 51 3.8 1.5 0.3 setosa
## 21 54 3.4 1.7 0.2 setosa
## 22 51 3.7 1.5 0.4 setosa
## 23 46 3.6 1.0 0.2 setosa
## 24 51 3.3 1.7 0.5 setosa
## 25 48 3.4 1.9 0.2 setosa
## 26 50 3.0 1.6 0.2 setosa
## 27 50 3.4 1.6 0.4 setosa
## 28 52 3.5 1.5 0.2 setosa
## 29 52 3.4 1.4 0.2 setosa
## 30 47 3.2 1.6 0.2 setosa
## 31 48 3.1 1.6 0.2 setosa
## 32 54 3.4 1.5 0.4 setosa
## 33 52 4.1 1.5 0.1 setosa
## 34 55 4.2 1.4 0.2 setosa
## 35 49 3.1 1.5 0.2 setosa
## 36 50 3.2 1.2 0.2 setosa
## 37 55 3.5 1.3 0.2 setosa
## 38 49 3.6 1.4 0.1 setosa
## 39 44 3.0 1.3 0.2 setosa
## 40 51 3.4 1.5 0.2 setosa
## [到达getOption("max.print") -- 略过110行]]
iris %>%
mutate(SLMn = Sepal.Length * 10) # 创建新的一列
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species SLMn
## 1 5.1 3.5 1.4 0.2 setosa 51
## 2 4.9 3.0 1.4 0.2 setosa 49
## 3 4.7 3.2 1.3 0.2 setosa 47
## 4 4.6 3.1 1.5 0.2 setosa 46
## 5 5.0 3.6 1.4 0.2 setosa 50
## 6 5.4 3.9 1.7 0.4 setosa 54
## 7 4.6 3.4 1.4 0.3 setosa 46
## 8 5.0 3.4 1.5 0.2 setosa 50
## 9 4.4 2.9 1.4 0.2 setosa 44
## 10 4.9 3.1 1.5 0.1 setosa 49
## 11 5.4 3.7 1.5 0.2 setosa 54
## 12 4.8 3.4 1.6 0.2 setosa 48
## 13 4.8 3.0 1.4 0.1 setosa 48
## 14 4.3 3.0 1.1 0.1 setosa 43
## 15 5.8 4.0 1.2 0.2 setosa 58
## 16 5.7 4.4 1.5 0.4 setosa 57
## 17 5.4 3.9 1.3 0.4 setosa 54
## 18 5.1 3.5 1.4 0.3 setosa 51
## 19 5.7 3.8 1.7 0.3 setosa 57
## 20 5.1 3.8 1.5 0.3 setosa 51
## 21 5.4 3.4 1.7 0.2 setosa 54
## 22 5.1 3.7 1.5 0.4 setosa 51
## 23 4.6 3.6 1.0 0.2 setosa 46
## 24 5.1 3.3 1.7 0.5 setosa 51
## 25 4.8 3.4 1.9 0.2 setosa 48
## 26 5.0 3.0 1.6 0.2 setosa 50
## 27 5.0 3.4 1.6 0.4 setosa 50
## 28 5.2 3.5 1.5 0.2 setosa 52
## 29 5.2 3.4 1.4 0.2 setosa 52
## 30 4.7 3.2 1.6 0.2 setosa 47
## 31 4.8 3.1 1.6 0.2 setosa 48
## 32 5.4 3.4 1.5 0.4 setosa 54
## 33 5.2 4.1 1.5 0.1 setosa 52
## [到达getOption("max.print") -- 略过117行]]
整合函数流:
iris %>%
filter(Species == "Virginica") %>%
mutate(SLMm = Sepal.Length) %>%
arrange(desc(SLMm))
## [1] Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## [6] SLMm
## <0 行> (或0-长度的row.names)
汇总
summarize()
函数可以让我们将很多变量汇总为单个的数据点。
iris %>%
summarize(medianSL = median(Sepal.Length))
## medianSL
## 1 5.8
iris %>%
filter(Species == "virginica") %>%
summarize(medianSL=median(Sepal.Length))
## medianSL
## 1 6.5
还可以一次性汇总多个变量
iris %>%
filter(Species == "virginica") %>%
summarize(medianSL = median(Sepal.Length),
maxSL = max(Sepal.Length))
## medianSL maxSL
## 1 6.5 7.9
group_by()
可以让我们安装指定的组别进行汇总数据,而不是针对整个数据框
iris %>%
group_by(Species) %>%
summarize(medianSL = median(Sepal.Length),
maxSL = max(Sepal.Length))
## # A tibble: 3 x 3
## Species medianSL maxSL
## <fct> <dbl> <dbl>
## 1 setosa 5.00 5.80
## 2 versicolor 5.90 7.00
## 3 virginica 6.50 7.90
iris %>%
filter(Sepal.Length>6) %>%
group_by(Species) %>%
summarize(medianPL = median(Petal.Length),
maxPL = max(Petal.Length))
## # A tibble: 2 x 3
## Species medianPL maxPL
## <fct> <dbl> <dbl>
## 1 versicolor 4.60 5.00
## 2 virginica 5.60 6.90
ggplot2
散点图
散点图可以帮助我们理解两个变量的数据关系,使用geom_point()
可以绘制散点图:
iris_small <- iris %>%
filter(Sepal.Length > 5)
ggplot(iris_small, aes(x = Petal.Length,
y = Petal.Width)) +
geom_point()
额外的美学映射
- 颜色
ggplot(iris_small, aes(x = Petal.Length,
y = Petal.Width,
color = Species)) +
geom_point()
- 大小
ggplot(iris_small, aes(x = Petal.Length,
y = Petal.Width,
color = Species,
size = Sepal.Length)) +
geom_point()
- 分面
ggplot(iris_small, aes(x = Petal.Length,
y = Petal.Width)) +
geom_point() +
facet_wrap(~Species)
线图
by_year <- gapminder %>%
group_by(year) %>%
summarize(medianGdpPerCap = median(gdpPercap))
ggplot(by_year, aes(x = year,
y = medianGdpPerCap)) +
geom_line() +
expand_limits(y=0)
条形图
by_species <- iris %>%
filter(Sepal.Length > 6) %>%
group_by(Species) %>%
summarize(medianPL=median(Petal.Length))
ggplot(by_species, aes(x = Species, y=medianPL)) +
geom_col()
直方图
ggplot(iris_small, aes(x = Petal.Length)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
箱线图
ggplot(iris_small, aes(x=Species, y=Sepal.Length)) +
geom_boxplot()
资料来源:DataCamp