14. About select()

[Previous: 13. About arrange()]
[Next: 15. About mutate()]

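select() picks columns from a data frame. To recap: filter() picks rows based on column values, and arrange() sorts rows based on column values.
The title shown by ?select is "Subset columns using their names and types", i.e. select() takes a subset of the columns based on their names and types.
How did you select columns back when you first learned about data frames?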

flights[,c("year","month","day")]
flights[,1:3]

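select() can fully replace both forms above. So what is the difference? (I am still a little hazy on this. For now, the main benefits seem to be that select() takes bare column names without quotes and works together with many helper functions, which saves you from writing your own column-filtering code.)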
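
One concrete difference can be sketched as follows. This is only an illustration on a hypothetical toy data frame `df` (a plain data.frame standing in for flights, not part of the original post): with base R indexing, a single column of a plain data.frame collapses to a vector, whereas select() always returns a data frame.

```r
library(dplyr)

# Hypothetical toy data frame standing in for flights
df <- data.frame(year = c(2013, 2013), month = c(1, 2), day = c(1, 5))

# Base R: one column taken from a plain data.frame drops to a vector
v <- df[, "year"]
is.data.frame(v)

# select(): the result is still a data frame, even with one column
s <- select(df, year)
is.data.frame(s)

# select() also slots into a pipe without extra quoting
df %>% select(year, month)
```

Note that flights itself is a tibble, and tibbles do not drop to a vector under `[`, so this difference only shows up with plain data frames.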

# Select by column name: a single column
> select(flights,year)
# A column name repeated by mistake is still returned only once, unlike flights[,c("year","year")]
> select(flights,year,year)
# A tibble: 336,776 x 1
    year
   <int>
 1  2013
 2  2013
 3  2013
 4  2013
 5  2013
 6  2013
 7  2013
 8  2013
 9  2013
10  2013
# ... with 336,766 more rows
> flights[,c("year","year")]
# A tibble: 336,776 x 2
    year  year
   <int> <int>
 1  2013  2013
 2  2013  2013
 3  2013  2013
 4  2013  2013
 5  2013  2013
 6  2013  2013
 7  2013  2013
 8  2013  2013
 9  2013  2013
10  2013  2013
# ... with 336,766 more rows

# The order of the names determines the order of the columns in the output
> select(flights,year,month)
> select(flights,month,year)
# This is the same as with flights[,c("year","month")] and flights[,c("month","year")]

# Select by column name: several columns
> select(flights,year,month,day)
> select(flights,year|month|day)
> select(flights,1:3)
> select(flights,year:day)
> select(flights,c(year,month,day))
> select(flights,c("year","month","day"))
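### Summary
# Comma-separated arguments (from the second argument on) mean different things in filter(), arrange() and select()
# filter(): the conditions are combined with "and", so only rows that satisfy all of them are kept
# arrange(): rows are sorted by the second argument first; the third only breaks ties left by it, the fourth breaks ties left by the first two, and so on
# select(): the names or expressions are combined with "or" (a union), so a column is kept if any of them selects it
# Two operators usable inside select() appeared here: ":" and "c()"
# They work much like they do with vectors
# No need to memorize every form, just what you need: for a few columns list them with commas, for a long consecutive run use ":"
# Column numbers work best when there are few columns and you know exactly which positions you want.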
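
# The "and" versus "or" behaviour of the comma can be checked on a small example.
# This is a sketch on a hypothetical toy data frame `df`, not on flights.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(a = 1:4, b = c(10, 20, 30, 40), c = letters[1:4])

# filter(): comma-separated conditions are ANDed together
f <- filter(df, a > 1, b < 40)   # keeps the rows where both hold

# select(): comma-separated selections are unioned
s1 <- select(df, a, b)
s2 <- select(df, a | b)          # explicit union, same columns
s3 <- select(df, a:b)            # consecutive range, same columns here

names(s1)
names(s2)
names(s3)
```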

# Inverse selection (dropping columns)
> select(flights,-year)
> select(flights,!year)
> select(flights,-year,-month,-day)
> select(flights,-(year:day))
> select(flights,!(year:day))
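### Summary:
# Two more operators usable inside select() appeared here: "!" and "-"
# Both negate a selection (take its complement): write the selection as before and put the negation operator in front of it.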
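
# One caveat, as far as I understand the selection rules: "-" and "!" stop being interchangeable
# once several negated arguments are separated by commas. Each "!x" is a full complement, and the comma
# takes the union of those complements, so nothing ends up dropped. A sketch on a hypothetical toy
# data frame `df`:

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(a = 1, b = 2, c = 3)

# "-" repeated: columns are removed one after another
minus <- select(df, -a, -b)

# "!" repeated: union of (everything but a) and (everything but b)
bang_each <- select(df, !a, !b)

# "!" applied once to the whole group behaves like the "-" version
bang_group <- select(df, !c(a, b))

names(minus)
names(bang_each)
names(bang_group)
```

# So when dropping several columns with "!", wrap them in c() or a range, as in !(year:day).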

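# Select the columns whose names both contain "time" and start with "time"
> select(flights,contains("time") & starts_with("time"))
### Summary:
# "&" is another operator usable inside select(); it means "and" (the intersection of the two selections)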

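# Select several columns with starts_with()
> select(flights,starts_with("dep"))
> select(flights,starts_with("dep"),starts_with("arr"))
# Select several columns with ends_with()
> select(flights,ends_with("time"))
# Select several columns with contains()
> select(flights,contains("time"))
# Select several columns with matches(), which takes a regular expression
> select(flights,matches("time"))
# Select several columns with num_range(). It is meant for names made of a prefix plus a number (e.g. x1, x2, x3); here the prefix is pasted onto each element of the second argument, giving c("dep_time","dep_delay")
> select(flights,num_range("dep_",c("time","delay")))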

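starts_with(), ends_with(), contains(), matches() and num_range() are selection helpers that pick columns by matching a pattern in the column names. They only work inside a selecting function such as select(); called on their own they raise an error, because there is no set of column names to match against. Their usage, from the help page: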

starts_with(match, ignore.case = TRUE, vars = NULL)
ends_with(match, ignore.case = TRUE, vars = NULL)
contains(match, ignore.case = TRUE, vars = NULL)
matches(match, ignore.case = TRUE, perl = FALSE, vars = NULL)
num_range(prefix, range, width = NULL, vars = NULL)

# Note: ignore.case defaults to TRUE, so matching ignores case unless you change it!
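
The effect of ignore.case, and what happens when a helper is called outside select(), can be sketched like this. The toy data frame `df` and its mixed-case column name are hypothetical, chosen only to make the difference visible.

```r
library(dplyr)

# Hypothetical toy data with one capitalised column name
df <- data.frame(Dep_time = 1, dep_delay = 2, arr_time = 3)

# Default: case is ignored, so both "Dep_time" and "dep_delay" match
loose <- names(select(df, starts_with("dep")))

# Case-sensitive matching keeps only "dep_delay"
strict <- names(select(df, starts_with("dep", ignore.case = FALSE)))

# Outside a selecting function the helper has no column names to look at
outside <- tryCatch(starts_with("dep"), error = function(e) "error")

loose
strict
outside
```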

A second group of helpers picks specific columns and likewise has to be used inside select(): everything() and last_col(). From the help page:

# Select all variables
everything(vars = NULL)
# Select the last variable
last_col(offset = 0L, vars = NULL)

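vars is a character vector of variable names; offset is the number of positions to count back from the last column.
I have not fully worked out how these two functions operate. On its own, select(flights,everything()) just returns every column unchanged, so it only becomes useful in combination with other selections.
Here are their classic use cases:
# Use everything() to move some variables to the front of the data frame
# This is especially handy for reordering columns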
> select(flights, time_hour, air_time, everything())
# A tibble: 336,776 x 19
   time_hour           air_time  year month   day dep_time sched_dep_time
   <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
 1 2013-01-01 05:00:00      227  2013     1     1      517            515
 2 2013-01-01 05:00:00      227  2013     1     1      533            529
 3 2013-01-01 05:00:00      160  2013     1     1      542            540
 4 2013-01-01 05:00:00      183  2013     1     1      544            545
 5 2013-01-01 06:00:00      116  2013     1     1      554            600
 6 2013-01-01 05:00:00      150  2013     1     1      554            558
 7 2013-01-01 06:00:00      158  2013     1     1      555            600
 8 2013-01-01 06:00:00       53  2013     1     1      557            600
 9 2013-01-01 06:00:00      140  2013     1     1      557            600
10 2013-01-01 06:00:00      138  2013     1     1      558            600
# ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
#   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
#   hour <dbl>, minute <dbl>

# Select the last column of the data frame
# This one is easier to understand and more obviously useful
> select(flights,last_col())
# A tibble: 336,776 x 1
   time_hour          
   <dttm>             
 1 2013-01-01 05:00:00
 2 2013-01-01 05:00:00
 3 2013-01-01 05:00:00
 4 2013-01-01 05:00:00
 5 2013-01-01 06:00:00
 6 2013-01-01 05:00:00
 7 2013-01-01 06:00:00
 8 2013-01-01 06:00:00
 9 2013-01-01 06:00:00
10 2013-01-01 06:00:00
# ... with 336,766 more rows

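The following two forms have the same effect:
> select(flights,2:last_col(5))
> select(flights,month:last_col(5))
# Here 2:last_col(5) runs from the second variable (month) to the variable 5 positions before the last one, i.e. the sixth from the end (dest)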
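
How the offset counts is easier to see on a narrow table. A minimal sketch, assuming a hypothetical five-column data frame `df`:

```r
library(dplyr)

# Hypothetical toy data with columns a..e
df <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5)

last0 <- names(select(df, last_col()))       # offset 0: the last column
last1 <- names(select(df, last_col(1)))      # offset 1: one before the last
by_pos <- names(select(df, 2:last_col(1)))   # second column up to one before the last
by_name <- names(select(df, b:last_col(1)))  # same range, starting from a name

last0
last1
by_pos
by_name
```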

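The third group selects columns from a character vector: all_of() and any_of(). A related helper, where(), selects columns with a predicate function instead and is covered further down. From the help page:

# Select the columns named in x; if any name is missing from the data frame, an error is raised. This is the strict variant
all_of(x)
# Same as all_of(), but names missing from the data frame are skipped without an error. This is the lenient variant
any_of(x, ..., vars = NULL)
For example:
x<-c("month","day")
select(flights,all_of(x))
select(flights,any_of(x))
# x is a character vector of column names or a numeric vector of column positions
# vars is a vector of variable names; if not supplied, it is taken from the current selection context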
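
The strict versus lenient behaviour only shows when a requested name is missing. A small sketch on a hypothetical toy data frame `df`, where the name "week" is deliberately absent:

```r
library(dplyr)

# Hypothetical toy data; there is no "week" column
df <- data.frame(month = 1:2, day = 3:4)
wanted <- c("month", "day", "week")

# any_of(): the missing name is skipped
lenient <- names(select(df, any_of(wanted)))

# all_of(): the missing name is an error, caught here for illustration
strict <- tryCatch(names(select(df, all_of(wanted))),
                   error = function(e) "error")

lenient
strict
```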

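where() was a revelation: its argument is a function, where(fn). fn is applied to each column and must return a single TRUE or FALSE; the columns for which it returns TRUE are kept. For example:
# In the command below, each column of the data frame (the data itself, not its name) is passed to is.numeric(), and the columns for which it returns TRUE are kept.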
> select(flights,where(is.numeric))
# A tibble: 336,776 x 14
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with 336,766 more rows, and 6 more variables: arr_delay <dbl>,
#   flight <int>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>
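
Because where() accepts any function that returns a single TRUE or FALSE per column, it is not limited to type checks. A minimal sketch, assuming a hypothetical toy data frame `df` and an arbitrary threshold chosen for illustration:

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(id = 1:3, score = c(0.5, 1.5, 2.5), name = c("a", "b", "c"))

# What where(is.numeric) effectively asks of each column
per_column <- sapply(df, is.numeric)

# Keep the numeric columns
numeric_cols <- names(select(df, where(is.numeric)))

# A custom predicate: numeric columns whose mean is above 1.5
# (&& short-circuits, so mean() is never called on the character column)
high_mean <- names(select(df, where(function(col) is.numeric(col) && mean(col) > 1.5)))

per_column
numeric_cols
high_mean
```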

select() can also rename variables, using arguments of the form new_name = old_name:

> flights
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>
> select(flights,month_new = month)
# A tibble: 336,776 x 1
   month_new
       <int>
 1         1
 2         1
 3         1
 4         1
 5         1
 6         1
 7         1
 8         1
 9         1
10         1
# ... with 336,766 more rows

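## After running this you will find that it does not just rename the month column as you might expect; it also drops every other column.
## So rename() is the better choice: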
> rename(flights,month_new = month)
# A tibble: 336,776 x 19
    year month_new   day dep_time sched_dep_time dep_delay arr_time
   <int>     <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013         1     1      517            515         2      830
 2  2013         1     1      533            529         4      850
 3  2013         1     1      542            540         2      923
 4  2013         1     1      544            545        -1     1004
 5  2013         1     1      554            600        -6      812
 6  2013         1     1      554            558        -4      740
 7  2013         1     1      555            600        -5      913
 8  2013         1     1      557            600        -3      709
 9  2013         1     1      557            600        -3      838
10  2013         1     1      558            600        -2      753
# ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
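
The three options can be compared side by side on a small example. This is a sketch on a hypothetical toy data frame `df`; the third form combines the renaming syntax with everything(), which keeps the remaining columns but moves the renamed one to the front.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(year = 2013, month = 1, day = 1)

# select() with a rename: only the renamed column survives
only_renamed <- names(select(df, month_new = month))

# rename(): all columns kept, original order preserved
renamed <- names(rename(df, month_new = month))

# select() with everything(): all columns kept, renamed column first
renamed_first <- names(select(df, month_new = month, everything()))

only_renamed
renamed
renamed_first
```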
