转载请注明来源:
https://www.jianshu.com/p/fbe99705789d
简介
可以编写dplyr的人性化函数语句,来调用高效的data.table进行底层数据运算,使高可读性与高计算效率融为一体
github 项目地址:
https://github.com/tidyverse/dtplyr
安装
install.packages("dtplyr")
或
install.packages("devtools")
devtools::install_github("tidyverse/dtplyr")
调用
使用注意:包作者推荐使用时同时加载以下包:
library(tidyverse)
library(dtplyr)
笔者习惯以最小化加载更为轻量:
library(data.table)
library(dtplyr)
library(dplyr)
示例
示例数据集使用base包的mtcars
#使用dtplyr包时,将目标数据集惰性处理
data <- lazy_dt(mtcars)
#查看数据集全貌
data
Source: local data table [32 x 11]
Call: `_DT1`
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# Use as.data.table()/as.data.frame()/as_tibble() to access results
lazy_data可以通过as.data.frame();as.data.table();as_tibble()函数转换成相应形式的数据框
1.筛选:filter()
#筛选出vs = 1,同时wt > 3的数据
data %>%
filter(vs == 1 & wt > 3)
#
Source: local data table [?? x 11]
Call: `_DT1`[vs == 1 & wt > 3]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
2 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
3 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
4 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
5 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
6 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
这里我们可以看到转义后的data.table语句同时出现在返回的对象中
Call: `_DT1`[vs == 1 & wt > 3]
%>%:管道操作符可以将上一语句执行的结果传入下一个语句中
#筛选data中前16行
data %>%
filter(row_number() == 1L:16L)
#
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
2.排序:arrange()
#以hp升序、drat降序排序
data %>%
arrange(hp, desc(drat))
#
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
3 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
5 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
6 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
3.列操作:select()
#提取cyl和disp列
data %>%
select(cyl, disp)
#
cyl disp
<dbl> <dbl>
1 6 160
2 6 160
3 4 108
4 6 258
5 8 360
6 6 225
#删除mpg列
data %>%
select(-mpg)
#
cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6 160 110 3.9 2.62 16.5 0 1 4 4
2 6 160 110 3.9 2.88 17.0 0 1 4 4
3 4 108 93 3.85 2.32 18.6 1 1 4 1
4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 8 360 175 3.15 3.44 17.0 0 0 3 2
6 6 225 105 2.76 3.46 20.2 1 0 3 1
#调整vs列至第一列
data %>%
select(vs, everything())
#
vs mpg cyl disp hp drat wt qsec am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 21 6 160 110 3.9 2.62 16.5 1 4 4
2 0 21 6 160 110 3.9 2.88 17.0 1 4 4
3 1 22.8 4 108 93 3.85 2.32 18.6 1 4 1
4 1 21.4 6 258 110 3.08 3.22 19.4 0 3 1
5 0 18.7 8 360 175 3.15 3.44 17.0 0 3 2
6 1 18.1 6 225 105 2.76 3.46 20.2 0 3 1
4.添加列:mutate()
#计算wt与drat的积添加到列wt_drat
#增加新列type赋值"old"
data %>%
mutate(wt_drat = wt * drat) %>%
mutate(type = "old")
#
mpg cyl disp hp drat wt qsec vs am gear carb wt_drat type
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 10.2 old
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 11.2 old
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 8.93 old
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 9.90 old
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 10.8 old
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 9.55 old
5.去重:distinct()
#以变量cyl去重,保留所有变量
data %>%
distinct(cyl, .keep_all = T)
#
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
3 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6.数据分组、汇总:group_by() summarise()
#根据cyl分组,计算disp均值
data %>%
group_by(cyl) %>%
summarise(disp_mean = mean(disp))
#
cyl disp_mean
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
7.数据关联:join()
准备数据集:
#给data增加标签列id按序排号,调整id列值至第一列,将数据集赋值为data1
data1 <- data %>%
mutate(id = 1:nrow(data)) %>%
select(id, everything())
#
id mpg cyl disp hp drat wt qsec vs am gear carb
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#提取data1中17到32行数据赋值为data2
data2 <- data1 %>%
filter(row_number() == 17L:32L)
#
id mpg cyl disp hp drat wt qsec vs am gear carb
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 17 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
2 18 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 19 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
4 20 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
5 21 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
6 22 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
内连接:inner_join()
#合并数据仅保留匹配的记录 *data1 ∩ data2
inner_join(data2, data1, by = "id")
#
id mpg.x cyl.x disp.x hp.x drat.x wt.x qsec.x vs.x am.x gear.x carb.x mpg.y cyl.y disp.y hp.y drat.y
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 17 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4 14.7 8 440 230 3.23
2 18 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 32.4 4 78.7 66 4.08
3 19 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 30.4 4 75.7 52 4.93
4 20 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 33.9 4 71.1 65 4.22
5 21 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 21.5 4 120. 97 3.7
6 22 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 15.5 8 318 150 2.76
# ... with 6 more variables: wt.y <dbl>, qsec.y <dbl>, vs.y <dbl>, am.y <dbl>, gear.y <dbl>, carb.y <dbl>
左连接:left_join()
#将data1,data2以data2中id列为基准进行匹配连接
left_join(data2, data1, by = "id")
#
id mpg.x cyl.x disp.x hp.x drat.x wt.x qsec.x vs.x am.x gear.x carb.x mpg.y cyl.y disp.y hp.y drat.y
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 17 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4 14.7 8 440 230 3.23
2 18 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 32.4 4 78.7 66 4.08
3 19 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 30.4 4 75.7 52 4.93
4 20 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 33.9 4 71.1 65 4.22
5 21 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 21.5 4 120. 97 3.7
6 22 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 15.5 8 318 150 2.76
# ... with 6 more variables: wt.y <dbl>, qsec.y <dbl>, vs.y <dbl>, am.y <dbl>, gear.y <dbl>, carb.y <dbl>
全连接:full_join()
#合并数据保留所有记录 *data1 ∪ data2
full_join(data1, data2, by = "id")
#
id mpg.x cyl.x disp.x hp.x drat.x wt.x qsec.x vs.x am.x gear.x carb.x mpg.y cyl.y disp.y hp.y drat.y
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 NA NA NA NA NA
2 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 NA NA NA NA NA
3 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 NA NA NA NA NA
4 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 NA NA NA NA NA
5 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 NA NA NA NA NA
6 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 NA NA NA NA NA
# ... with 6 more variables: wt.y <dbl>, qsec.y <dbl>, vs.y <dbl>, am.y <dbl>, gear.y <dbl>, carb.y <dbl>
除连接:anti_join()
#返回无法与data2匹配的data1的所有记录 *data1 - (data1 ∩ data2)
anti_join(data1, data2, by = "id")
#
id mpg cyl disp hp drat wt qsec vs am gear carb
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
保留全数据左连接:semi_join()
#返回能够与data2匹配的data1的所有记录
semi_join(data1, data2, by = "id")
#
id mpg cyl disp hp drat wt qsec vs am gear carb
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 17 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
2 18 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 19 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
4 20 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
5 21 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
6 22 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
例如bind_rows数据集合并函数暂时还不适用于lazy_dt()格式,希望包作者能尽快搞定吧~