- 合并连接 加入新变量
- 筛选连接 筛选原有观测
- 集合操作 相同变量进行集合
由于连接每对数据表的变量称为键(key)
主键:唯一能识别其所在数据表中的观测
外键:唯一标识另一个数据表的观测
可通过count()函数验证主键
> planes %>% count(tailnum) %>%
+ filter(n > 1)
# A tibble: 0 x 2
# … with 2 variables: tailnum <chr>, n <int>
1. 合并连接
left_join(x, y, by = NULL)
e.g.
> flight2 <- flights %>%
+ select(year:day, hour, origin, dest, tailnum, carrier)
> flight2
# A tibble: 336,776 x 8
year month day hour origin dest tailnum carrier
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
1 2013 1 1 5 EWR IAH N14228 UA
2 2013 1 1 5 LGA IAH N24211 UA
3 2013 1 1 5 JFK MIA N619AA AA
4 2013 1 1 5 JFK BQN N804JB B6
5 2013 1 1 6 LGA ATL N668DN DL
6 2013 1 1 5 EWR ORD N39463 UA
7 2013 1 1 6 EWR FLL N516JB B6
8 2013 1 1 6 LGA IAD N829AS EV
9 2013 1 1 6 JFK MCO N593JB B6
10 2013 1 1 6 LGA ORD N3ALAA AA
# … with 336,766 more rows
> flight2 %>% select(-origin, -dest) %>%
+ left_join(airlines, by = "carrier")
# A tibble: 336,776 x 7
year month day hour tailnum carrier name
<int> <int> <int> <dbl> <chr> <chr> <chr>
1 2013 1 1 5 N14228 UA United Air Lines Inc.
2 2013 1 1 5 N24211 UA United Air Lines Inc.
3 2013 1 1 5 N619AA AA American Airlines Inc.
4 2013 1 1 5 N804JB B6 JetBlue Airways
5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
6 2013 1 1 5 N39463 UA United Air Lines Inc.
7 2013 1 1 6 N516JB B6 JetBlue Airways
8 2013 1 1 6 N829AS EV ExpressJet Airlines In…
9 2013 1 1 6 N593JB B6 JetBlue Airways
10 2013 1 1 6 N3ALAA AA American Airlines Inc.
# … with 336,766 more rows
当存在重复键时,会得倒所有可能的组合
定义键列:
-
key
默认值为NULL
flight2 %>% left_join(weather)
by = "x"
flights2 %>% left_join(planes, by = "tailnum")
- 命名字符变量
by = c("a" = "b")
匹配x表中的a
变量与y表中的b
变量
flight2 %>% left_join(airports, c("dest" = "faa"))
2. 筛选连接
semi_join(x, y)
保留x中与y中观测想匹配的观测
anti_join(x, y)
丢弃x中与y中观测想匹配的观测
e.g. 找出飞往最受欢迎的10个目的地的航班
> top_dest <- flights %>%
+ count(dest, sort = TRUE) %>%
+ head(10)
> flights %>%
+ semi_join(top_dest)
Joining, by = "dest"
# A tibble: 141,145 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 542 540 2 923
2 2013 1 1 554 600 -6 812
3 2013 1 1 554 558 -4 740
4 2013 1 1 555 600 -5 913
5 2013 1 1 557 600 -3 838
6 2013 1 1 558 600 -2 753
7 2013 1 1 558 600 -2 924
8 2013 1 1 558 600 -2 923
9 2013 1 1 559 559 0 702
10 2013 1 1 600 600 0 851
# … with 141,135 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
3. 集合
集合操作基于整行,需要 x 和 y 具有相同的变量