Data Manipulation with dplyr: course outline
Chapter 1. Transforming data
Chapter 2. Aggregating data
Chapter 3. Selecting and transforming data
Chapter 4. Hands-on case study
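The first half of this chapter overlaps with the earlier <Tidyverse> post, so the repeated material is only covered briefly. The second half introduces some new content, which is explained in a bit more detail. I hope it is useful.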
Selecting variables with select()
Use select() to choose variables, then arrange() to sort the rows by one of them.
library(dplyr)
# Assumes the course's counties data frame is already loaded
counties %>%
  # Select state, county, population, and industry-related columns
  select(state, county, population, professional, service, office, construction, production) %>%
  # Arrange service in descending order
  arrange(desc(service))
# A tibble: 3,138 x 8
state county population professional service office construction production
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Missis~ Tunica 10477 23.9 36.6 21.5 3.5 14.5
2 Texas Kinney 3577 30 36.5 11.6 20.5 1.3
3 Texas Kenedy 565 24.9 34.1 20.5 20.5 0
4 New Yo~ Bronx 1428357 24.3 33.3 24.2 7.1 11
5 Texas Brooks 7221 19.6 32.4 25.3 11.1 11.5
6 Colora~ Fremo~ 46809 26.6 32.2 22.8 10.7 7.6
7 Texas Culbe~ 2296 20.1 32.2 24.2 15.7 7.8
8 Califo~ Del N~ 27788 33.9 31.5 18.8 8.9 6.8
9 Minnes~ Mahno~ 5496 26.8 31.5 18.7 13.1 9.9
10 Virgin~ Lanca~ 11129 30.3 31.2 22.8 8.1 7.6
# ... with 3,128 more rows
Use filter() to keep only the rows that meet a condition.
counties %>%
  # Select the state, county, population, and those ending with "work"
  select(state, county, population, ends_with("work")) %>%
  # Filter for counties that have at least 50% of people engaged in public work
  filter(public_work >= 50)
# A tibble: 7 x 6
state county population private_work public_work family_work
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Alaska Lake and Peninsula~ 1474 42.2 51.6 0.2
2 Alaska Yukon-Koyukuk Cens~ 5644 33.3 61.7 0
3 California Lassen 32645 42.6 50.5 0.1
4 Hawaii Kalawao 85 25 64.1 0
5 North Dak~ Sioux 4380 32.9 56.8 0.1
6 South Dak~ Todd 9942 34.4 55 0.8
7 Wisconsin Menominee 4451 36.8 59.1 0.4
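Other uses of select()
When a dataset has many variables, typing them out one by one clearly hurts productivity. select() supports selecting variables in bulk, for example with a column range written as first:last.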
counties %>%
select(state, county, drive:work_at_home)
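To make the range syntax concrete, here is a minimal sketch on a made-up table. The tibble `toy` and its values are hypothetical; only the column names are borrowed from the counties data, and the real dataset may hold different columns between `drive` and `work_at_home`.

```r
library(dplyr)

# Hypothetical toy data: names mimic the counties dataset, values are invented
toy <- tibble(
  state        = c("A", "B"),
  county       = c("X", "Y"),
  population   = c(1000, 2000),
  drive        = c(80, 70),
  carpool      = c(10, 12),
  transit      = c(2, 8),
  work_at_home = c(3, 5),
  poverty      = c(12, 20)
)

# first:last keeps both endpoints and every column that sits between them
range_cols <- toy %>%
  select(state, county, drive:work_at_home)

names(range_cols)
# "state" "county" "drive" "carpool" "transit" "work_at_home"
```

The range follows the column order of the table, so `population` and `poverty`, which lie outside `drive:work_at_home`, are left out.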
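Helper functions match variables by name:
- contains(): variables whose names contain xx
- starts_with(): variables whose names start with xx
- ends_with(): variables whose names end with xx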
For example:
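A minimal sketch of the three helpers, run on a one-row hypothetical table. The tibble `toy` and its values are made up for illustration; the `*_work` names echo the counties columns shown earlier, and the `income*` names are assumptions.

```r
library(dplyr)

# Hypothetical one-row table; values are invented
toy <- tibble(
  state = "A", county = "X",
  private_work = 70, public_work = 20, family_work = 0.5,
  income = 50000, income_err = 1200, income_per_cap = 25000
)

# Names that end with "work"
work_cols <- toy %>% select(state, county, ends_with("work"))

# Names that start with "income"
income_cols <- toy %>% select(state, county, starts_with("income"))

# Names that contain "per" anywhere
per_cols <- toy %>% select(state, county, contains("per"))

names(work_cols)    # "state" "county" "private_work" "public_work" "family_work"
names(income_cols)  # "state" "county" "income" "income_err" "income_per_cap"
names(per_cols)     # "state" "county" "income_per_cap"
```

The helpers can be mixed with plain column names in the same select() call, as done here with `state` and `county`.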
select() can also drop a variable by putting a minus sign in front of its name.
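Dropping a column can be sketched as follows. The tibble `toy` is hypothetical and its values are invented; only the column names come from the counties data.

```r
library(dplyr)

# Hypothetical toy data; values are invented
toy <- tibble(
  state      = c("A", "B"),
  county     = c("X", "Y"),
  population = c(1000, 2000),
  poverty    = c(12, 20)
)

# A leading minus removes the named column and keeps everything else
no_poverty <- toy %>%
  select(-poverty)

names(no_poverty)
# "state" "county" "population"
```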
Renaming variables with rename()
This is the first time rename() appears; the code below shows how it is used.
counties %>%
count(state)
# A tibble: 50 x 2
state n
<chr> <int>
1 Alabama 67
2 Alaska 28
3 Arizona 15
4 Arkansas 75
5 California 58
6 Colorado 64
7 Connecticut 8
8 Delaware 3
9 Florida 67
10 Georgia 159
# ... with 40 more rows
# Rename the n column to num_counties
counties %>%
  count(state) %>%
  rename(num_counties = n)
# A tibble: 50 x 2
state num_counties
<chr> <int>
1 Alabama 67
2 Alaska 28
3 Arizona 15
4 Arkansas 75
5 California 58
6 Colorado 64
7 Connecticut 8
8 Delaware 3
9 Florida 67
10 Georgia 159
# ... with 40 more rows
You can also skip rename() and take the blunt approach: rename the column directly inside select().
# Select state, county, and poverty as poverty_rate
counties %>%
  select(state, county, poverty_rate = poverty)
# A tibble: 3,138 x 3
state county poverty_rate
<chr> <chr> <dbl>
1 Alabama Autauga 12.9
2 Alabama Baldwin 13.4
3 Alabama Barbour 26.7
4 Alabama Bibb 16.8
5 Alabama Blount 16.7
6 Alabama Bullock 24.6
7 Alabama Butler 25.4
8 Alabama Calhoun 20.5
9 Alabama Chambers 21.6
10 Alabama Cherokee 19.2
# ... with 3,128 more rows
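The two ways of renaming differ in which columns survive. A minimal sketch on a hypothetical table (the tibble `toy` and its values are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data; values are invented
toy <- tibble(
  state      = c("A", "B"),
  county     = c("X", "Y"),
  population = c(1000, 2000),
  poverty    = c(12, 20)
)

# rename() changes the name and keeps every other column
renamed <- toy %>%
  rename(poverty_rate = poverty)

# select() with new_name = old_name renames and keeps only the listed columns
selected <- toy %>%
  select(state, county, poverty_rate = poverty)

names(renamed)   # "state" "county" "population" "poverty_rate"
names(selected)  # "state" "county" "poverty_rate"
```

Which one to use depends on whether the remaining columns, here `population`, are still needed later in the pipeline.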
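Transforming and creating variables with transmute()
Characteristics of transmute():
- It selects and transforms variables in a single step
- The result contains only the variables named in the call; all other variables are dropped
Suppose we want to create a new variable, density, from population / land_area. With transmute() there is no need to call select() and mutate() separately.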
counties %>%
  # Keep the state, county, and populations columns, and add a density column
  transmute(state, county, population, density = population / land_area) %>%
  # Filter for counties with a population greater than one million
  filter(population > 1000000) %>%
  # Sort density in ascending order
  arrange(density)
# A tibble: 41 x 4
state county population density
<chr> <chr> <dbl> <dbl>
1 California San Bernardino 2094769 104.
2 Nevada Clark 2035572 258.
3 California Riverside 2298032 319.
4 Arizona Maricopa 4018143 437.
5 Florida Palm Beach 1378806 700.
6 California San Diego 3223096 766.
7 Washington King 2045756 967.
8 Texas Travis 1121645 1133.
9 Florida Hillsborough 1302884 1277.
10 Florida Orange 1229039 1360.
# ... with 31 more rows
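To check that transmute() really stands in for a mutate() plus select() pair, the sketch below runs both versions on a hypothetical table and compares the results. The tibble `toy` and its values are invented; only the column names follow the counties data.

```r
library(dplyr)

# Hypothetical toy data; values are invented
toy <- tibble(
  state      = c("A", "B", "C"),
  county     = c("X", "Y", "Z"),
  population = c(1000, 4000, 900),
  land_area  = c(10, 20, 30),
  poverty    = c(12, 20, 15)
)

# One step: keep three columns and compute density
one_step <- toy %>%
  transmute(state, county, population, density = population / land_area)

# Two steps: add density first, then keep the same four columns
two_step <- toy %>%
  mutate(density = population / land_area) %>%
  select(state, county, population, density)

identical(names(one_step), names(two_step))    # TRUE
all.equal(one_step$density, two_step$density)  # TRUE
one_step$density                               # 100 200 30
```

Both pipelines drop `land_area` and `poverty`; transmute() simply does so without a second verb.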
Syntax summary
| | Keeps only the specified variables | Also keeps the other variables |
|---|---|---|
| Values unchanged | select() | rename() |
| Values changed | transmute() | mutate() |