Data Manipulation with dplyr: course outline
Chapter 1. Transforming data
Chapter 2. Aggregating data
Chapter 3. Selecting and transforming data
Chapter 4. Case study
It is time to review what we have learned so far with a hands-on case study. The case study uses the babynames data, which has three variables: year, name, and number.
Filtering and sorting the data
Select the data for 1990 and sort it by number in descending order.
library(dplyr)

babynames %>%
  # Filter for the year 1990
  filter(year == 1990) %>%
  # Sort the number column in descending order
  arrange(desc(number))
# A tibble: 21,223 x 3
    year name        number
   <dbl> <chr>        <int>
 1  1990 Michael      65560
 2  1990 Christopher  52520
 3  1990 Jessica      46615
 4  1990 Ashley       45797
 5  1990 Matthew      44925
 6  1990 Joshua       43382
 7  1990 Brittany     36650
 8  1990 Amanda       34504
 9  1990 Daniel       33963
10  1990 David        33862
# ... with 21,213 more rows
Show the most common name in each year. This uses group_by() to group the data by year and top_n() to keep the top-ranked row within each group.
# Find the most common name in each year
babynames %>%
  group_by(year) %>%
  top_n(1, number)
# A tibble: 28 x 3
# Groups: year [28]
    year name  number
   <dbl> <chr>  <int>
 1  1880 John    9701
 2  1885 Mary    9166
 3  1890 Mary   12113
 4  1895 Mary   13493
 5  1900 Mary   16781
 6  1905 Mary   16135
 7  1910 Mary   22947
 8  1915 Mary   58346
 9  1920 Mary   71175
10  1925 Mary   70857
# ... with 18 more rows
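In dplyr 1.0.0 and later, top_n() is superseded by slice_max(). As a minimal sketch, the same per-year query could be written as follows. The tibble toy_names and its values are hypothetical (made up for illustration, not taken from babynames) so that the example runs on its own.

```r
library(dplyr)

# Hypothetical toy data with the same three columns as babynames
toy_names <- tibble(
  year   = c(1880, 1880, 1880, 1885, 1885, 1885),
  name   = c("John", "Mary", "Anna", "John", "Mary", "Anna"),
  number = c(90L, 70L, 20L, 80L, 95L, 30L)
)

# Keep the row with the largest number within each year
top_per_year <- toy_names %>%
  group_by(year) %>%
  slice_max(number, n = 1) %>%
  ungroup()

print(top_per_year)
```

Like top_n(), slice_max() keeps ties by default, so a year can return more than one row if two names share the top count. Passing with_ties = FALSE would keep exactly one row per year.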
Select the data for the names Steven, Thomas, and Matthew, then plot year on the x-axis and number on the y-axis to compare how number changes over the years for each name.
library(ggplot2)

# Filter for the names Steven, Thomas, and Matthew
selected_names <- babynames %>%
  filter(name %in% c("Steven", "Thomas", "Matthew"))

# Plot the names using a different color for each name
ggplot(selected_names, aes(x = year, y = number, color = name)) +
  geom_line()
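First, group by year and compute the sum of number within each year, naming it year_total. Then use year_total to compute the fraction of that year's total that each name accounts for, naming it fraction. Finally, for each name, find the row where fraction is highest.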
# Calculate the fraction of people born each year with the same name
babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total)
# A tibble: 332,595 x 5
    year name    number year_total  fraction
   <dbl> <chr>    <int>      <int>     <dbl>
 1  1880 Aaron      102     201478 0.000506
 2  1880 Ab           5     201478 0.0000248
 3  1880 Abbie       71     201478 0.000352
 4  1880 Abbott       5     201478 0.0000248
 5  1880 Abby         6     201478 0.0000298
 6  1880 Abe         50     201478 0.000248
 7  1880 Abel         9     201478 0.0000447
 8  1880 Abner       27     201478 0.000134
 9  1880 Abigail     12     201478 0.0000596
10  1880 Abraham     81     201478 0.000402
# ... with 332,585 more rows
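The pipeline above stops after computing fraction. One possible way to finish the task, finding the year in which each name reached its highest fraction, is to regroup by name and take the top row. This is a sketch under assumptions: toy_names and peak_fraction are hypothetical names, and the values are made up so that the example runs without the babynames data.

```r
library(dplyr)

# Hypothetical toy data with the same three columns as babynames
toy_names <- tibble(
  year   = c(1880, 1880, 1885, 1885),
  name   = c("John", "Mary", "John", "Mary"),
  number = c(60L, 40L, 30L, 70L)
)

peak_fraction <- toy_names %>%
  # Total number of babies recorded in each year
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  # Share of each year's total that goes to each name
  mutate(fraction = number / year_total) %>%
  # For each name, keep the row where that share is largest
  group_by(name) %>%
  top_n(1, fraction) %>%
  ungroup()

print(peak_fraction)
```

The ungroup() between the two steps matters: without it, the second mutate() would still run per year, which happens to give the same fraction here but leaves the data grouped by year when group_by(name) is expected to take over.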