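Grouping by multiple variables
When you group by multiple variables, each summary peels off one level of the grouping (the last grouping variable), which makes it easy to roll up a dataset progressively: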
library(dplyr)
library(nycflights13)
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))
# A tibble: 365 x 4
# Groups: year, month [12]
year month day flights
<int> <int> <int> <int>
1 2013 1 1 842
2 2013 1 2 943
3 2013 1 3 914
4 2013 1 4 915
5 2013 1 5 720
6 2013 1 6 832
7 2013 1 7 933
8 2013 1 8 899
9 2013 1 9 902
10 2013 1 10 932
# ... with 355 more rows
(per_month <- summarise(per_day, flights = sum(flights)))
# A tibble: 12 x 3
# Groups: year [1]
year month flights
<int> <int> <int>
1 2013 1 27004
2 2013 2 24951
3 2013 3 28834
4 2013 4 28330
5 2013 5 28796
6 2013 6 28243
7 2013 7 29425
8 2013 8 29327
9 2013 9 27574
10 2013 10 28889
11 2013 11 27268
12 2013 12 28135
(per_year <- summarise(per_month, flights = sum(flights)))
# A tibble: 1 x 2
year flights
<int> <int>
1 2013 336776
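One caveat when rolling up this way: sums and counts can be combined level by level, but rank-based statistics such as the median generally cannot. The sketch below uses a small made-up tibble (`toy`, with hypothetical `month`, `day`, and `delay` columns that are not part of nycflights13) to show both the peeled-off grouping level and the difference between a sum of sums and a median of medians.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
toy <- tibble(
  month = c(1, 1, 1, 1, 1),
  day   = c(1, 1, 1, 2, 2),
  delay = c(1, 2, 30, 4, 5)
)

# First summary: one row per (month, day); the `day` level is peeled off.
per_day <- toy %>%
  group_by(month, day) %>%
  summarise(n = n(), total = sum(delay), med = median(delay))

group_vars(per_day)   # "month"

# Second summary: roll the daily rows up to one row per month.
per_month <- per_day %>%
  summarise(n = sum(n), total = sum(total), med_of_meds = median(med))

per_month$total         # 42, same as sum(toy$delay)
per_month$med_of_meds   # 3.25, but median(toy$delay) is 4
```

Recent dplyr versions may print a message about the resulting grouping after the first `summarise()`; it is informational and does not change the result.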
Ungrouping
To remove the grouping and go back to operating on ungrouped data, use ungroup():
daily %>% ungroup() %>% summarise(flights = n())
# A tibble: 1 x 1
flights
<int>
1 336776
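To see what ungroup() changes, it can help to inspect the grouping variables before and after the call. The following is a minimal sketch on a made-up tibble (`toy`, with hypothetical columns `g` and `x`); `group_vars()` is the dplyr helper that reports the current grouping variables.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
toy <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3))

grouped <- group_by(toy, g)
group_vars(grouped)                              # "g"
by_group <- summarise(grouped, n = n())          # one row per group

ungrouped <- ungroup(grouped)
group_vars(ungrouped)                            # character(0)
overall <- summarise(ungrouped, n = n())         # a single row for all data
```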
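Grouped mutates and filters
group_by() is most often combined with summarize(), but grouping also works with mutate() and filter().
- Find the worst members of each group (here, the flights whose arrival-delay rank within their day is below 10):
flights %>% group_by(year, month, day) %>% filter(rank(desc(arr_delay)) < 10)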
# A tibble: 3,306 x 19
# Groups: year, month, day [365]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 1 848 1835 853 1001 1950 851
2 2013 1 1 1815 1325 290 2120 1542 338
3 2013 1 1 1842 1422 260 1958 1535 263
4 2013 1 1 1942 1705 157 2124 1830 174
5 2013 1 1 2006 1630 216 2230 1848 222
6 2013 1 1 2115 1700 255 2330 1920 250
7 2013 1 1 2205 1720 285 46 2040 246
8 2013 1 1 2312 2000 192 21 2110 191
9 2013 1 1 2343 1724 379 314 1938 456
10 2013 1 2 1244 900 224 1431 1104 207
# ... with 3,296 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
- Find all groups bigger than a threshold (here, destinations with more than 365 flights):
(popular_dests <- flights %>% group_by(dest) %>% filter(n() > 365))
# A tibble: 332,577 x 19
# Groups: dest [77]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 1 517 515 2 830 819 11
2 2013 1 1 533 529 4 850 830 20
3 2013 1 1 542 540 2 923 850 33
4 2013 1 1 544 545 -1 1004 1022 -18
5 2013 1 1 554 600 -6 812 837 -25
6 2013 1 1 554 558 -4 740 728 12
7 2013 1 1 555 600 -5 913 854 19
8 2013 1 1 557 600 -3 709 723 -14
9 2013 1 1 557 600 -3 838 846 -8
10 2013 1 1 558 600 -2 753 745 8
# ... with 332,567 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
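- Standardize within each group to compute per-group metrics (here, each delayed flight's share of its destination's total arrival delay):
popular_dests %>%
  filter(arr_delay > 0) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  select(year:day, dest, arr_delay, prop_delay) %>%
  head()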
# A tibble: 6 x 6
# Groups: dest [4]
year month day dest arr_delay prop_delay
<int> <int> <int> <chr> <dbl> <dbl>
1 2013 1 1 IAH 11 0.000111
2 2013 1 1 IAH 20 0.000201
3 2013 1 1 MIA 33 0.000235
4 2013 1 1 ORD 12 0.0000424
5 2013 1 1 FLL 19 0.0000938
6 2013 1 1 ORD 8 0.0000283
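The three grouped operations above are easier to verify by hand on a tiny dataset. The sketch below repeats them on a made-up tibble (`toy`, with hypothetical values; only the column names `dest` and `arr_delay` are borrowed from flights), using smaller thresholds so every result can be checked by eye.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
toy <- tibble(
  dest      = c("IAH", "IAH", "IAH", "MIA", "MIA", "ORD"),
  arr_delay = c(10, 30, 60, 5, 15, 100)
)

# Grouped filter by rank: keep the two largest delays per destination.
worst <- toy %>%
  group_by(dest) %>%
  filter(rank(desc(arr_delay)) < 3)

# Grouped filter by group size: keep destinations with at least 2 rows.
big <- toy %>%
  group_by(dest) %>%
  filter(n() >= 2)

# Grouped mutate: each row's share of its destination's total delay.
props <- big %>%
  mutate(prop_delay = arr_delay / sum(arr_delay))

# Within every destination the shares add up to 1.
check <- props %>% summarise(total = sum(prop_delay))
```

Because sum() is evaluated per group inside mutate(), the shares are relative to each destination rather than to the whole table.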