Joining Data with dplyr
Chapter 1. Joining tables
Chapter 2. Left and right joins
Chapter 3. Full and semi joins
Chapter 4. Practice problems
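Chapter 2. Left and right joins
left_join
Left join
A left join (left_join), as the name suggests, aligns the result to the left-hand data set: every row of the first table is kept, and columns coming from the second table are filled with NA where there is no match.
For example, join millennium_falcon and star_destroyer with a left join on the two variables part_num and color_id, and use suffix to rename the variables that have the same name in both tables.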
# Combine the star_destroyer and millennium_falcon tables
millennium_falcon %>%
  left_join(star_destroyer, by = c("part_num", "color_id"),
            suffix = c("_falcon", "_star_destroyer"))
# A tibble: 263 x 6
   set_num_falcon part_num color_id quantity_falcon set_num_star_de~
   <chr>          <chr>       <dbl>           <dbl> <chr>
 1 7965-1         63868          71              62 <NA>
 2 7965-1         3023            0              60 <NA>
 3 7965-1         3021           72              46 75190-1
 4 7965-1         2780            0              37 75190-1
 5 7965-1         60478          72              36 <NA>
 6 7965-1         6636           71              34 75190-1
 7 7965-1         3009           71              28 75190-1
 8 7965-1         3665           71              22 <NA>
 9 7965-1         2412b          72              20 75190-1
10 7965-1         3010           71              19 <NA>
# ... with 253 more rows, and 1 more variable: quantity_star_destroyer <dbl>
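To see the behavior on something small enough to check by eye, here is a minimal sketch of the same join. The tables falcon_toy and destroyer_toy and all of their values are made up for illustration; they are not the course data.

```r
library(dplyr)

# Hypothetical miniature versions of the two part tables (made-up values)
falcon_toy <- tibble(
  set_num  = "7965-1",
  part_num = c("3021", "3023", "63868"),
  color_id = c(72, 0, 71),
  quantity = c(46, 60, 62)
)
destroyer_toy <- tibble(
  set_num  = "75190-1",
  part_num = c("3021", "2780"),
  color_id = c(72, 0),
  quantity = c(6, 10)
)

# Join on both keys; non-key columns that share a name get the suffixes
joined_toy <- falcon_toy %>%
  left_join(destroyer_toy,
            by = c("part_num", "color_id"),
            suffix = c("_falcon", "_star_destroyer"))

# All three left-hand rows are kept; only part 3021 in color 72 has a match,
# so the other two rows carry NA in the *_star_destroyer columns
joined_toy
```

Part 2780 exists only in the right-hand toy table, so it does not appear in the result at all: a left join never adds rows for keys that are missing from the first table.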
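The next example is slightly more involved and draws on material from other courses.
- Compute summary statistics for each of the two data sets, grouped by a variable (using group_by and summarize)
- Join the two sets of summary statistics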
# Aggregate Millennium Falcon for the total quantity in each color
millennium_falcon_colors <- millennium_falcon %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))
# Aggregate Star Destroyer for the total quantity in each color
star_destroyer_colors <- star_destroyer %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))
# Left join the Millennium Falcon colors to the Star Destroyer colors
millennium_falcon_colors %>%
  left_join(star_destroyer_colors, by = "color_id",
            suffix = c("_falcon", "_star_destroyer"))
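The aggregate-then-join pattern can be sketched on toy data as follows. The tables falcon_toy and destroyer_toy and their quantities are invented for illustration only.

```r
library(dplyr)

# Hypothetical part lists: one row per part, several parts can share a color
falcon_toy    <- tibble(color_id = c(0, 0, 71, 72), quantity = c(60, 37, 62, 46))
destroyer_toy <- tibble(color_id = c(0, 72, 72),    quantity = c(10, 6, 4))

# Step 1: collapse each table to one row per color
falcon_colors_toy <- falcon_toy %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))
destroyer_colors_toy <- destroyer_toy %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))

# Step 2: left join the two summaries on color_id
colors_joined_toy <- falcon_colors_toy %>%
  left_join(destroyer_colors_toy, by = "color_id",
            suffix = c("_falcon", "_star_destroyer"))

# Color 71 appears only in the left table, so its right-hand total is NA
colors_joined_toy
```

Because each summary has exactly one row per color_id, the join cannot multiply rows: the result has as many rows as the left-hand summary.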
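The next example uses filter, covered earlier. First keep the rows of inventories where version is 1, then left-join sets with the filtered inventories on the shared variable set_num. Finally, use is.na() to keep the sets that have no match in inventories, that is, the rows where version is NA after the join.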
inventory_version_1 <- inventories %>%
  filter(version == 1)
# Join versions to sets
sets %>%
  left_join(inventory_version_1, by = "set_num") %>%
  # Filter for where version is na
  filter(is.na(version))
# A tibble: 1 x 6
  set_num name       year theme_id    id version
  <chr>   <chr>     <dbl>    <dbl> <dbl>   <dbl>
1 40198-1 Ludo game  2018      598    NA      NA
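The same "join, then look for NA" idea can be sketched on toy data. The tables sets_toy and inventories_toy below are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical sets and inventories (made-up values)
sets_toy <- tibble(
  set_num = c("A-1", "B-1", "C-1"),
  name    = c("Set A", "Set B", "Set C")
)
inventories_toy <- tibble(
  id      = c(1, 2, 3),
  version = c(1, 2, 1),
  set_num = c("A-1", "A-1", "B-1")
)

# Keep only version 1, so each set matches at most one inventory row
inventory_version_1_toy <- inventories_toy %>%
  filter(version == 1)

# Sets without a version-1 inventory come back with NA in id and version
unmatched_toy <- sets_toy %>%
  left_join(inventory_version_1_toy, by = "set_num") %>%
  filter(is.na(version))

unmatched_toy
```

This reading of NA as "no match" assumes that version is never NA in the inventories table itself; if it could be, an NA after the join would be ambiguous.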
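right_join
Right join
A right join is the mirror image of a left join: the result keeps every row of the second table. As an example, use count to tabulate how often each value of part_cat_id occurs (this creates a count column named n by default), then right-join the result to part_categories and keep the rows where n is NA, that is, the categories that have no parts. The join uses the by = c("A" = "B") syntax introduced earlier for matching key columns that have different names in the two tables.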
parts %>%
  count(part_cat_id) %>%
  right_join(part_categories, by = c("part_cat_id" = "id")) %>%
  # Filter for NA
  filter(is.na(n))
# A tibble: 1 x 3
  part_cat_id     n name
        <dbl> <int> <chr>
1          66    NA Modulex
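A minimal sketch of the count-then-right-join step, using two hypothetical tables (parts_toy and categories_toy, with made-up rows) whose key columns have different names:

```r
library(dplyr)

# Hypothetical parts and categories; the key is part_cat_id in one table and id in the other
parts_toy <- tibble(
  part_num    = c("p1", "p2", "p3"),
  part_cat_id = c(1, 1, 3)
)
categories_toy <- tibble(
  id   = c(1, 3, 66),
  name = c("Baseplates", "Bricks Sloped", "Modulex")
)

# count() adds the column n; the right join keeps every category,
# and the key column takes its name from the left-hand table (part_cat_id)
counted_toy <- parts_toy %>%
  count(part_cat_id) %>%
  right_join(categories_toy, by = c("part_cat_id" = "id"))

# Categories that no part belongs to have n = NA
empty_categories_toy <- counted_toy %>%
  filter(is.na(n))

empty_categories_toy
```

Swapping the two tables and using left_join would keep the same rows; the right join is convenient here because it lets the pipeline start from the table being counted.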
The course also shows how to replace NA values: replace_na (from the tidyr package) is used here to replace NA with 0.
parts %>%
  count(part_cat_id) %>%
  right_join(part_categories, by = c("part_cat_id" = "id")) %>%
  # Use replace_na to replace missing values in the n column
  replace_na(list(n = 0))
# A tibble: 64 x 3
   part_cat_id     n name
         <dbl> <dbl> <chr>
 1           1   135 Baseplates
 2           3   303 Bricks Sloped
 3           4  1900 Duplo, Quatro and Primo
 4           5   107 Bricks Special
 5           6   128 Bricks Wedged
 6           7    97 Containers
 7           8    24 Technic Bricks
 8           9   167 Plates Special
 9          11   490 Bricks
10          12    85 Technic Connectors
# ... with 54 more rows
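As a small, self-contained sketch of the replacement step, the example below applies replace_na to a toy result. The tables parts_toy and categories_toy are hypothetical and their values are made up.

```r
library(dplyr)
library(tidyr)  # replace_na() lives in tidyr, not dplyr

# Hypothetical parts and categories (made-up values)
parts_toy <- tibble(
  part_num    = c("p1", "p2", "p3"),
  part_cat_id = c(1, 1, 3)
)
categories_toy <- tibble(
  id   = c(1, 3, 66),
  name = c("Baseplates", "Bricks Sloped", "Modulex")
)

# On a data frame, replace_na() takes a named list: column name = replacement value
counts_filled_toy <- parts_toy %>%
  count(part_cat_id) %>%
  right_join(categories_toy, by = c("part_cat_id" = "id")) %>%
  replace_na(list(n = 0))

# The category without parts now has a count of 0 instead of NA
counts_filled_toy
```

Only the columns named in the list are touched, so an NA in any other column would be left as it is.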