在处理数据的时候,很少遇到那种只对一张表进行处理的情况。经常的状况是关于某个主题有几张表,这些表以某种关键字联系在一起。
这时候:
• 合并连接:向数据框中加入新变量,新变量的值是另一个数据框中的匹配观测。
• 筛选连接:根据是否匹配另一个数据框中的观测,筛选数据框中的观测。
• 集合操作:将观测作为集合元素来处理。
> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
√ ggplot2 3.2.0 √ purrr 0.3.2
√ tibble 2.1.1 √ dplyr 0.8.0.1
√ tidyr 0.8.3 √ stringr 1.4.0
√ readr 1.3.1 √ forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
Warning messages:
1: 程辑包‘tidyverse’是用R版本3.5.3 来建造的
2: 程辑包‘ggplot2’是用R版本3.5.3 来建造的
3: 程辑包‘tibble’是用R版本3.5.3 来建造的
4: 程辑包‘tidyr’是用R版本3.5.3 来建造的
5: 程辑包‘readr’是用R版本3.5.2 来建造的
6: 程辑包‘purrr’是用R版本3.5.3 来建造的
7: 程辑包‘dplyr’是用R版本3.5.2 来建造的
8: 程辑包‘stringr’是用R版本3.5.2 来建造的
> setwd("D:\\Users\\Administrator\\Desktop\\RStudio\\R-Programming")
描述航空公司的四个tibble数据:
- airlines:可以根据航空公司的缩写码查到公司全名。
- airports:给出了每个机场的信息,通过 faa 机场编码进行标识。
- planes:给出了每架飞机的信息,通过 tailnum 进行标识。
- weather:给出了纽约机场每小时的天气状况。
> library(nycflights13)
Warning message:
程辑包‘nycflights13’是用R版本3.5.3 来建造的
> airlines
# A tibble: 16 x 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
7 F9 Frontier Airlines Inc.
8 FL AirTran Airways Corporation
9 HA Hawaiian Airlines Inc.
10 MQ Envoy Air
11 OO SkyWest Airlines Inc.
12 UA United Air Lines Inc.
13 US US Airways Inc.
14 VX Virgin America
15 WN Southwest Airlines Co.
16 YV Mesa Airlines Inc.
> airports
# A tibble: 1,458 x 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_York
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Chicago
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chicago
4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_York
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_York
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/New_York
7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/New_York
8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/New_York
9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/New_York
10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_Angeles
# ... with 1,448 more rows
> planes
# A tibble: 3,322 x 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NA Turbo-fan
2 N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
3 N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
4 N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
5 N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
6 N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
7 N107US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
8 N108UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
9 N109UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
10 N110UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
# ... with 3,312 more rows
> weather
# A tibble: 26,115 x 15
origin year month day hour temp dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour
<chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dttm>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA 0 1012 10 2013-01-01 01:00:00
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA 0 1012. 10 2013-01-01 02:00:00
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA 0 1012. 10 2013-01-01 03:00:00
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA 0 1012. 10 2013-01-01 04:00:00
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA 0 1012. 10 2013-01-01 05:00:00
6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA 0 1012. 10 2013-01-01 06:00:00
7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0 NA 0 1012. 10 2013-01-01 07:00:00
8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4 NA 0 1012. 10 2013-01-01 08:00:00
9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0 NA 0 1013. 10 2013-01-01 09:00:00
10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8 NA 0 1012. 10 2013-01-01 10:00:00
# ... with 26,105 more rows
- flights connects to planes via a single variable, tailnum.
- flights connects to airlines through the carrier variable.
- flights connects to airports in two ways: via the origin and dest variables.
- flights connects to weather via origin (the location), and year, month, day and hour (the time).
用于连接每对数据表的变量称为键。键是能唯一标识观测的变量(或变量集合)。
• 主键:唯一标识其所在数据表中的观测。例如, planes
$tailnum 是一个主键,因为它可以唯一表示planes的每架飞机。
- 外键:唯一表示另一个数据框的观测数据。如filghts$tailnum 是一个外键,因为其出现在 flights 表中,并可以将每次航班与唯一一架飞机匹配。
> # 对主键进行 count() 操作,然后查看是否有 n 大于 1 的记录,即查看是否唯一值
> planes %>%
+ count(tailnum) %>%
+ filter(n > 1)
# A tibble: 0 x 2
# ... with 2 variables: tailnum <chr>, n <int>
> weather %>%
+ count(year, month, day, hour, origin) %>%
+ filter(n > 1)
# A tibble: 3 x 6
year month day hour origin n
<dbl> <dbl> <int> <int> <chr> <int>
1 2013 11 3 1 EWR 2
2 2013 11 3 1 JFK 2
3 2013 11 3 1 LGA 2
> #flights 表中的主键是什么?
> flights %>%
+ count(year, month, day, flight) %>%
+ filter(n > 1)
# A tibble: 29,768 x 5
year month day flight n
<int> <int> <int> <int> <int>
1 2013 1 1 1 2
2 2013 1 1 3 2
3 2013 1 1 4 2
4 2013 1 1 11 3
5 2013 1 1 15 2
6 2013 1 1 21 2
7 2013 1 1 27 4
8 2013 1 1 31 2
9 2013 1 1 32 2
10 2013 1 1 35 2
# ... with 29,758 more rows
> flights %>%
+ count(year, month, day, tailnum) %>%
+ filter(n > 1)
# A tibble: 64,928 x 5
year month day tailnum n
<int> <int> <int> <chr> <int>
1 2013 1 1 N0EGMQ 2
2 2013 1 1 N11189 2
3 2013 1 1 N11536 2
4 2013 1 1 N11544 3
5 2013 1 1 N11551 2
6 2013 1 1 N12540 2
7 2013 1 1 N12567 2
8 2013 1 1 N13123 2
9 2013 1 1 N13538 3
10 2013 1 1 N13566 3
# ... with 64,918 more rows
>
如果一张表没有主键,有时就需要使用 mutate() 函数和 row_number() 函数为表加上一个主键。这样一来,
如果你完成了一些筛选工作,并想要使用原始数据检查的话,就可以更容易地匹配观测。这种主键称为代理键。
理解链接
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)
Inner join
x %>%
inner_join(y, by = "key")
#> # A tibble: 2 x 3
#> key val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2
Outer joins
A left join keeps all observations in x.
A right join keeps all observations in y.
A full join keeps all observations in x and y.
Another way to depict the different types of joins is with a Venn diagram:
Duplicate keys
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
1, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2"
)
left_join(x, y, by = "key")
#> # A tibble: 4 x 3
#> key val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2
#> 3 2 x3 y2
#> 4 1 x4 y1
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
3, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
2, "y3",
3, "y4"
)
left_join(x, y, by = "key")
#> # A tibble: 6 x 3
#> key val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2
#> 3 2 x2 y3
#> 4 2 x3 y2
#> 5 2 x3 y3
#> 6 3 x4 y4
Defining the key columns
flights2 %>%
left_join(weather)
#> Joining, by = c("year", "month", "day", "hour", "origin")
#> # A tibble: 336,776 x 18
#> year month day hour origin dest tailnum carrier temp dewp humid
#> <dbl> <dbl> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
#> 2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
#> 3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
#> 4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
#> 5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
#> 6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
#> # … with 3.368e+05 more rows, and 7 more variables: wind_dir <dbl>,
#> # wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#> # visib <dbl>, time_hour <dttm>
flights2 %>%
left_join(planes, by = "tailnum")
#> # A tibble: 336,776 x 16
#> year.x month day hour origin dest tailnum carrier year.y type
#> <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixe…
#> 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixe…
#> 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixe…
#> 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixe…
#> 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixe…
#> 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixe…
#> # … with 3.368e+05 more rows, and 6 more variables: manufacturer <chr>,
#> # model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
flights2 %>%
left_join(airports, c("dest" = "faa"))
#> # A tibble: 336,776 x 15
#> year month day hour origin dest tailnum carrier name lat lon
#> <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 2013 1 1 5 EWR IAH N14228 UA Geor… 30.0 -95.3
#> 2 2013 1 1 5 LGA IAH N24211 UA Geor… 30.0 -95.3
#> 3 2013 1 1 5 JFK MIA N619AA AA Miam… 25.8 -80.3
#> 4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA
#> 5 2013 1 1 6 LGA ATL N668DN DL Hart… 33.6 -84.4
#> 6 2013 1 1 5 EWR ORD N39463 UA Chic… 42.0 -87.9
#> # … with 3.368e+05 more rows, and 4 more variables: alt <int>, tz <dbl>,
#> # dst <chr>, tzone <chr>
flights2 %>%
left_join(airports, c("origin" = "faa"))
#> # A tibble: 336,776 x 15
#> year month day hour origin dest tailnum carrier name lat lon
#> <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 2013 1 1 5 EWR IAH N14228 UA Newa… 40.7 -74.2
#> 2 2013 1 1 5 LGA IAH N24211 UA La G… 40.8 -73.9
#> 3 2013 1 1 5 JFK MIA N619AA AA John… 40.6 -73.8
#> 4 2013 1 1 5 JFK BQN N804JB B6 John… 40.6 -73.8
#> 5 2013 1 1 6 LGA ATL N668DN DL La G… 40.8 -73.9
#> 6 2013 1 1 5 EWR ORD N39463 UA Newa… 40.7 -74.2
#> # … with 3.368e+05 more rows, and 4 more variables: alt <int>, tz <dbl>,
#> # dst <chr>, tzone <chr>
Other implementations
dplyr | merge |
---|---|
inner_join(x, y) | merge(x, y) |
left_join(x, y) | merge(x, y, all.x = TRUE) |
right_join(x, y) | merge(x, y, all.y = TRUE) |
full_join(x, y) | merge(x, y, all.x = TRUE, all.y = TRUE) |
Filtering joins
semi_join(x, y): 保留 x 表中与 y 表中的观测相匹配的所有观测。
• anti_join(x, y): 丢弃 x 表中与 y 表中的观测相匹配的所有观测。
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
# 筛选出前10个
flights %>%
filter(dest %in% top_dest$dest)
# 半连接,只保留两个数据框都有的行的数据
flights %>%
semi_join(top_dest)
# 反连接保留 x 表中那些没有匹配 y 表的行。
# 反连接可以用于诊断连接中的不匹配。例如,在连接 flights 和 planes 时,你可能想知道flights 中是否有很多行在 planes 中没有匹配记录:
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
Join problems
Set operations
The final type of two-table verb are the set operations. Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. These expect the x and y inputs to have the same variables, and treat the observations like sets:
intersect(x, y): return only observations in both x and y.
union(x, y): return unique observations in x and y.
setdiff(x, y): return observations in x, but not in y.
Given this simple data:
df1 <- tribble(
~x, ~y,
1, 1,
2, 1
)
df2 <- tribble(
~x, ~y,
1, 1,
1, 2
)
The four possibilities are:
intersect(df1, df2)
#> # A tibble: 1 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 1
# Note that we get 3 rows, not 4
union(df1, df2)
#> # A tibble: 3 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 2
#> 2 2 1
#> 3 1 1
setdiff(df1, df2)
#> # A tibble: 1 x 2
#> x y
#> <dbl> <dbl>
#> 1 2 1
setdiff(df2, df1)
#> # A tibble: 1 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 2