The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.
In short, it is a complete suite of data-processing packages.
Data-processing workflow:
- Data import
- Data tidying
- Data exploration (visualization, statistical analysis)
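The three stages above can be sketched end to end in a few lines. This is only an illustration, assuming readr, tidyr, and dplyr are installed; the temporary CSV file and its columns (`id`, `height`, `weight`) are made up for the example.

```r
library(readr)
library(tidyr)
library(dplyr)

# Hypothetical input: write a tiny CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,height,weight",
             "1,150,50",
             "2,160,60",
             "3,170,70"), tmp)

# 1. Import: read the file into a tibble (col_types = cols() guesses types quietly)
raw <- read_csv(tmp, col_types = cols())

# 2. Tidy: reshape to one row per (id, measure) pair
long <- raw %>%
  pivot_longer(cols = c(height, weight),
               names_to = "measure",
               values_to = "value")

# 3. Explore: compute a summary for each measure
summary_tbl <- long %>%
  group_by(measure) %>%
  summarize(mean_value = mean(value))
```

Each stage hands a data frame to the next, which is why the packages can be chained with `%>%`.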
If you’d like to learn how to use the tidyverse effectively, the best place to start is R for Data Science.
Installation
# Install from CRAN
install.packages("tidyverse")
# Or the development version from GitHub
# install.packages("devtools")
devtools::install_github("tidyverse/tidyverse")
Usage
library(tidyverse) will load the core tidyverse packages:
- ggplot2, for data visualisation.
- dplyr, for data manipulation.
- tidyr, for data tidying.
- readr, for data import.
- purrr, for functional programming.
- tibble, for tibbles, a modern re-imagining of data frames.
- stringr, for strings.
- forcats, for factors.
library(tidyverse)
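# Load the data
library(datasets) # provides iris
# install.packages("gapminder") # run once if gapminder is not installed
library(gapminder)
# attach(iris) is not needed: every call below passes iris explicitly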
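# Filtering data with dplyr
# filter() returns the subset of rows that satisfy a condition.
iris %>%
  filter(Species == "virginica") # keep only the rows that meet the condition
iris %>%
  filter(Species == "virginica", Sepal.Length > 6) # separate multiple conditions with commas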
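Comma-separated conditions in filter() are combined with AND. As a small sketch of the other common cases, the block below uses `&`, `|`, and `%in%`; the result names (`both`, `either`, `two_species`) are arbitrary.

```r
library(dplyr)

# AND: both conditions must hold (equivalent to separating them with a comma)
both <- iris %>%
  filter(Species == "virginica" & Sepal.Length > 6)

# OR: a row is kept if either condition holds
either <- iris %>%
  filter(Species == "virginica" | Sepal.Length > 6)

# %in%: keep rows whose value matches any element of a vector
two_species <- iris %>%
  filter(Species %in% c("setosa", "virginica"))
```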
# Sorting
# arrange() sorts the observations; the default order is ascending.
iris %>%
  arrange(Sepal.Length)
iris %>%
  arrange(desc(Sepal.Length)) # descending order
# Adding variables
# mutate() updates an existing column of a data frame or adds a new one.
iris %>%
  mutate(Sepal.Length = Sepal.Length * 10) # convert the column from cm to mm
iris %>%
  mutate(SLMm = Sepal.Length * 10) # create a new column
# Combining the verbs in one pipeline:
iris %>%
  filter(Species == "virginica") %>% # level names are lowercase; "Virginica" matches no rows
  mutate(SLMm = Sepal.Length * 10) %>% # sepal length in mm
  arrange(desc(SLMm))
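String comparison inside filter() is case-sensitive, so a label must match the factor level exactly or the result has zero rows. A minimal sketch of checking the spelling first; the names `n_wrong` and `n_right` are only for illustration.

```r
library(dplyr)

# Inspect the exact level names of Species (all lowercase)
levels(iris$Species)

# A capitalized label matches no rows, because == compares the text exactly
n_wrong <- iris %>%
  filter(Species == "Virginica") %>%
  nrow()

# The exact level name matches the virginica rows
n_right <- iris %>%
  filter(Species == "virginica") %>%
  nrow()
```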
# Summarizing
# summarize() collapses many values into a single data point.
iris %>%
  summarize(medianSL = median(Sepal.Length))
## medianSL
## 1 5.8
iris %>%
  filter(Species == "virginica") %>%
  summarize(medianSL = median(Sepal.Length))
# Summarize several variables at once
iris %>%
  filter(Species == "virginica") %>%
  summarize(medianSL = median(Sepal.Length),
            maxSL = max(Sepal.Length))
# group_by() lets us summarize within the specified groups rather than over the whole data frame
iris %>%
  group_by(Species) %>%
  summarize(medianSL = median(Sepal.Length),
            maxSL = max(Sepal.Length))
iris %>%
  filter(Sepal.Length > 6) %>%
  group_by(Species) %>%
  summarize(medianPL = median(Petal.Length),
            maxPL = max(Petal.Length))
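When a summary is computed per group, it often helps to report the group size next to it. A small sketch using dplyr's n(), which counts the rows in the current group; the column name `n` and the object name `group_sizes` are arbitrary choices.

```r
library(dplyr)

group_sizes <- iris %>%
  group_by(Species) %>%
  summarize(n = n(),                            # number of rows in each group
            medianSL = median(Sepal.Length))    # median sepal length per group
```

A median based on a handful of rows is less trustworthy than one based on many, so the count gives the reader a way to judge each figure.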
# ggplot2
# Scatter plot
# A scatter plot helps us understand the relationship between two variables; draw one with geom_point():
iris_small <- iris %>%
  filter(Sepal.Length > 5)
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width)) +
  geom_point()
# Color
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width,
                       color = Species)) +
  geom_point()
# Size
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width,
                       color = Species,
                       size = Sepal.Length)) +
  geom_point()
# Faceting
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width)) +
  geom_point() +
  facet_wrap(~Species)
# Line chart
by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianGdpPerCap = median(gdpPercap))
ggplot(by_year, aes(x = year,
                    y = medianGdpPerCap)) +
  geom_line() +
  expand_limits(y = 0) # make the y-axis start at zero
# Bar chart
by_species <- iris %>%
  filter(Sepal.Length > 6) %>%
  group_by(Species) %>%
  summarize(medianPL = median(Petal.Length))
ggplot(by_species, aes(x = Species, y = medianPL)) +
  geom_col()
# Histogram
ggplot(iris_small, aes(x = Petal.Length)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
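The message above suggests choosing the bin width explicitly instead of relying on the default 30 bins. A minimal sketch, where the width of 0.25 cm is an arbitrary value picked for illustration and the subset is rebuilt so the block runs on its own:

```r
library(ggplot2)

# Same subset as iris_small above, rebuilt with base R
iris_small <- subset(iris, Sepal.Length > 5)

p <- ggplot(iris_small, aes(x = Petal.Length)) +
  geom_histogram(binwidth = 0.25) # each bar covers 0.25 cm of petal length
p
```

Setting binwidth also silences the message, since ggplot2 no longer has to fall back to its default.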
# Box plot
ggplot(iris_small, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()
References:
https://www.jianshu.com/p/f3c21a5ad10a
https://tidyverse.tidyverse.org/
https://zhuanlan.zhihu.com/p/88947457