前面的2节介绍了数据的格式转换宽表转长表函数pivot_longer以及行过滤函数filter( )今天来继续介绍列选择函数select( )的使用
选择列:基础
要选择几列,只需在select函数中添加其名称即可。添加它们的顺序将确定它们在输出中出现的顺序
msleep %>%
select(name, genus, sleep_total, awake)
# A tibble: 83 x 4
name genus sleep_total awake
<chr> <chr> <dbl> <dbl>
1 Cheetah Acinonyx 12.1 11.9
2 Owl monkey Aotus 17 7
3 Mountain beaver Aplodontia 14.4 9.6
如果要添加大量的列,可以使用start_col:end_col的语句:
msleep %>%
select(name:order,sleep_cycle:brainwt)
# A tibble: 83 x 7
name genus vore order sleep_cycle awake brainwt
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Cheetah Acinonyx carni Carnivora NA 11.9 NA
2 Owl monkey Aotus omni Primates NA 7 0.0155
3 Mountain beaver Aplodontia herbi Rodentia NA 9.6 NA
还可以通过在列名称前面添加减号来取消列
msleep %>%
select(-conservation, -(sleep_total:awake))
# A tibble: 83 x 6
name genus vore order brainwt bodywt
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 Cheetah Acinonyx carni Carnivora NA 50
2 Owl monkey Aotus omni Primates 0.0155 0.48
3 Mountain beaver Aplodontia herbi Rodentia NA 1.35
根据部分列名选择列
如果有很多的列具有相似的结构,可以通过starts_with(),ends_with()或contains( )来进行选择
msleep %>% select(name, starts_with("sleep"))
# A tibble: 83 x 4
name sleep_total sleep_rem sleep_cycle
<chr> <dbl> <dbl> <dbl>
1 Cheetah 12.1 NA NA
2 Owl monkey 17 1.8 NA
3 Mountain beaver 14.4 2.4 NA
msleep %>%
select(contains("eep"), ends_with("wt"))
# A tibble: 83 x 5
sleep_total sleep_rem sleep_cycle brainwt bodywt
<dbl> <dbl> <dbl> <dbl> <dbl>
1 12.1 NA NA NA 50
2 17 1.8 NA 0.0155 0.48
3 14.4 2.4 NA NA 1.35
根据正则表达式选择列
如果列名没有相似性,则可以使用matches()来进行选择;
以下示例代码将添加包含“ o”,后跟一个或多个其他字母和“ er”的列
msleep %>% select(matches("o.+er"))
# A tibble: 83 x 2
order conservation
<chr> <chr>
1 Carnivora lc
2 Primates NA
3 Rodentia nt
根据数据集来选择列
class <- c("name", "genus", "vore", "order", "conservation")
msleep %>% select(!!class)
# A tibble: 83 x 5
name genus vore order conservation
<chr> <chr> <chr> <chr> <chr>
1 Cheetah Acinonyx carni Carnivora lc
2 Owl monkey Aotus omni Primates NA
3 Mountain beaver Aplodontia herbi Rodentia nt
按数据类型选择列 (重点)
select_if( )函数来判断数据类型,可以使用其来选择所有字符串列select_if(is.character)。同样也可以添加is.numeric, is.integer,is.double,is.logical,is.factor等列;如果有日期列,则可以加载lubridate包,然后使用 is.POSIXt或is.Date
msleep %>% select_if(is.numeric)
# A tibble: 83 x 6
sleep_total sleep_rem sleep_cycle awake brainwt bodywt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12.1 NA NA 11.9 NA 50
2 17 1.8 NA 7 0.0155 0.48
3 14.4 2.4 NA 9.6 NA 1.35
同样也可以取反,选择不需要那种数据类型的列
msleep %>% select_if(~!is.numeric(.))
# A tibble: 83 x 5
name genus vore order conservation
<chr> <chr> <chr> <chr> <chr>
1 Cheetah Acinonyx carni Carnivora lc
2 Owl monkey Aotus omni Primates NA
3 Mountain beaver Aplodontia herbi Rodentia nt
通过逻辑表达式选择列 (重点)
select_if( )不仅仅是基于数据类型来进行选择。还可以选择所有列平均值大于10的列。
mean > 10它本身不是函数,因此需要在前面添加波浪号,或使用funs()将语句转换为函数
msleep %>% select_if(is.numeric) %>%
select_if(~mean(., na.rm=TRUE) > 10)
# A tibble: 83 x 3
sleep_total awake bodywt
<dbl> <dbl> <dbl>
1 12.1 11.9 50
2 17 7 0.48
3 14.4 9.6 1.35
也可以这样写
msleep %>%
select_if(~is.numeric(.) & mean(., na.rm=TRUE) > 10)
另一个有用的select_if参数是n_distinct(),它可以在列中找到的不同值出现的数量
msleep %>% select_if(~n_distinct(.) < 10)
vore conservation
<chr> <chr>
1 carni lc
2 omni NA
3 herbi nt
4 omni lc
对列进行重新排序
everything()函数可将选择的列移至表格最前
msleep %>%
select(conservation, sleep_total, everything())
conservation sleep_total name genus vore order sleep_rem sleep_cycle awake brainwt
<chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 lc 12.1 Cheetah Acin… carni Carni… NA NA 11.9 NA
2 NA 17 Owl monk… Aotus omni Prima… 1.8 NA 7 0.0155
3 nt 14.4 Mountain… Aplo… herbi Roden… 2.4 NA 9.6 NA
更改列名
msleep %>%
select(animal = name, sleep_total, extinction_threat = conservation)
# A tibble: 83 x 3
animal sleep_total extinction_threat
<chr> <dbl> <chr>
1 Cheetah 12.1 lc
2 Owl monkey 17 NA
3 Mountain beaver 14.4 nt
也可以通过rename()函数来重命名
msleep %>%
rename(animal = name, extinction_threat = conservation)
# A tibble: 83 x 11
animal genus vore order extinction_thre… sleep_total sleep_rem sleep_cycle awake brainwt
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Cheet… Acin… carni Carn… lc 12.1 NA NA 11.9 NA
2 Owl m… Aotus omni Prim… NA 17 1.8 NA 7 0.0155
3 Mount… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6 NA
重新格式化所有列名
select_all()函数允许更改所有列,并以一个函数作为参数。
要以大写形式获取所有列名称,可以使用toupper(),也可以使用tolower()将其全部转化为小写
msleep %>% select_all(toupper)
NAME GENUS VORE ORDER CONSERVATION SLEEP_TOTAL SLEEP_REM SLEEP_CYCLE AWAKE BRAINWT
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Cheetah Acin… carni Carni… lc 12.1 NA NA 11.9 NA
2 Owl monk… Aotus omni Prima… NA 17 1.8 NA 7 0.0155
3 Mountain… Aplo… herbi Roden… nt 14.4 2.4 NA 9.6 NA
自主创建函数(重点)
将列名中的空格替换为下划线
msleep2 <- select(msleep, name, sleep_total, brainwt)
colnames(msleep2) <- c("name", "sleep total", "brain weight")
msleep2 %>%
select_all(~str_replace(., " ", "_"))
name sleep_total brain_weight
<chr> <dbl> <dbl>
1 Cheetah 12.1 NA
2 Owl monkey 17 0.0155
3 Mountain beaver 14.4 NA
4 Greater short-tailed shrew 14.9 0.00029
还可以使用select_all与str_replace来消除多余的字符
msleep2 <- select(msleep, name, sleep_total, brainwt)
colnames(msleep2) <- c("Q1 name", "Q2 sleep total", "Q3 brain weight")
msleep2[1:3,]
`Q1 name` `Q2 sleep total` `Q3 brain weight`
<chr> <dbl> <dbl>
1 Cheetah 12.1 NA
2 Owl monkey 17 0.0155
3 Mountain beaver 14.4 NA
msleep2 %>%
select_all(~str_replace(., "Q[0-9]+", "")) %>%
select_all(~str_replace(., " ", "_"))
`_name` `_sleep total` `_brain weight`
<chr> <dbl> <dbl>
1 Cheetah 12.1 NA
2 Owl monkey 17 0.0155
3 Mountain beaver 14.4 NA
行名称到列
某些数据框的行名本身实际上并不是一列,例如mtcars数据集
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
如果希望此列为实际列,则可以使用该 rownames_to_column()函数,并指定新的列名称
mtcars %>%
tibble::rownames_to_column("car_model") %>% head
car_model mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
本节介绍了select( )函数的绝大部分使用方法,在以后的数据处理中希望多多查阅一定能大大提高数据处理的效率,下一节将介绍mutate( )函数,敬请期待