An efficient data wrangling tool: dplyr (Part 1)

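Earlier notes on data processing covered several simple data-manipulation functions (R's built-in transform, aggregate, by, summary and others) and packages (mainly reshape and reshape2). These tools are convenient but limited; for example, after grouping, both aggregate and reshape2 can only return a single value per group. This post introduces a more powerful and more systematic package for working with data frames: dplyr. Notably, reshape, reshape2, plyr, dplyr and ggplot2 all share the same author, Hadley Wickham. Below we walk through examples from the dplyr website to get a feel for his work.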

Overview


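dplyr's functionality falls into three main areas: single-table verbs, two-table verbs, and databases.
This post focuses on how dplyr handles a single table (that is, a data frame). The example dataset is `flights` from the nycflights13 package. Note that `flights` is a tibble, a data type developed by RStudio that is regarded as a future replacement for data.frame; as_tibble() converts a data.frame to a tibble. For a brief introduction, see the article "tibble, a new data type for data science in R". The most commonly used dplyr functions are: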

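filter() selects the records (rows) that meet a condition
arrange() sorts the data
select(), rename() select or rename variables by column name
mutate() and transmute() create (compute and assign) new columns from existing ones
summarise() aggregates data, usually after grouping with group_by(), returning one summary per group
sample_n(), sample_frac() draw a random sample of rows
group_by() groups the data
%>% the pipe, which chains several operations together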
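
As a small aside on the tibble type mentioned above, the conversion from data.frame can be sketched as follows. The toy data frame `df_plain` is made up for illustration and is not part of nycflights13.

```r
library(tibble)

# A plain data.frame built for illustration (hypothetical toy data)
df_plain <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Convert to a tibble; the data are unchanged, only the class differs
tb <- as_tibble(df_plain)

class(df_plain)  # "data.frame"
class(tb)        # "tbl_df" "tbl" "data.frame"
```

Because a tibble still inherits from data.frame, base R indexing such as `tb[tb$id > 1, ]` keeps working.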

1. Basic operations

1.1 Filtering: `filter()`

filter() returns the subset of rows that satisfy one or more logical conditions. For example:

### Inspect the data: flights is a tibble, and printing it shows only the first rows rather than the whole table
> library(nycflights13)
> dim(flights)
[1] 336776     19
> flights
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515        2.      830            819
 2  2013     1     1      533            529        4.      850            830
 3  2013     1     1      542            540        2.      923            850
 4  2013     1     1      544            545       -1.     1004           1022
 5  2013     1     1      554            600       -6.      812            837
 6  2013     1     1      554            558       -4.      740            728
 7  2013     1     1      555            600       -5.      913            854
 8  2013     1     1      557            600       -3.      709            723
 9  2013     1     1      557            600       -3.      838            846
10  2013     1     1      558            600       -2.      753            745
# ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

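A call has the form filter(data, condition, ...), where each condition is a logical expression. Conditions can be built from comparison operators (==, >, >= and so on), the logical operators &, |, ! and xor(), and helpers such as is.na(), between() and near().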
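
The operators listed above can be tried on a small table. This is a minimal sketch: the tibble `toy` and its columns are hypothetical, chosen so that each condition keeps a different set of rows.

```r
library(dplyr)

# Hypothetical toy data, not part of nycflights13
toy <- tibble(
  id    = 1:6,
  month = c(1, 1, 2, 3, 12, NA),
  delay = c(5, NA, -3, 40, 0.1 + 0.2, 7)
)

# Comma-separated conditions are combined with AND
jan_known <- dplyr::filter(toy, month == 1, !is.na(delay))

# between(x, left, right) is inclusive on both ends
first_quarter <- dplyr::filter(toy, between(month, 1, 3))

# near() compares doubles with a tolerance; 0.1 + 0.2 == 0.3 is FALSE
near_03 <- dplyr::filter(toy, near(delay, 0.3))

# Rows where the condition evaluates to NA are dropped,
# so missing values have to be kept explicitly
jan_or_missing <- dplyr::filter(toy, month == 1 | is.na(month))
```

Writing `dplyr::filter()` in full sidesteps any clash with other functions named `filter()`.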

### Select the rows with month == 1 and day == 1
> filter(flights, month == 1, day == 1)
Error in match.arg(method) : object 'day' not found
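### The error occurs because more than one loaded package provides a filter() function and a different one
### was picked up (the match.arg(method) in the message suggests stats::filter); call dplyr's version explicitly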
> dplyr::filter(flights, month==1, day==1)
# A tibble: 842 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      517            515        2.      830            819       11.
 2  2013     1     1      533            529        4.      850            830       20.
 3  2013     1     1      542            540        2.      923            850       33.
 4  2013     1     1      544            545       -1.     1004           1022      -18.
 5  2013     1     1      554            600       -6.      812            837      -25.
 6  2013     1     1      554            558       -4.      740            728       12.
 7  2013     1     1      555            600       -5.      913            854       19.
 8  2013     1     1      557            600       -3.      709            723      -14.
 9  2013     1     1      557            600       -3.      838            846       -8.
10  2013     1     1      558            600       -2.      753            745        8.
# ... with 832 more rows, and 10 more variables: carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

### The same selection with base R indexing
> flights[flights$month==1 & flights$day==1,]
# A tibble: 842 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      517            515        2.      830            819       11.
 2  2013     1     1      533            529        4.      850            830       20.
 3  2013     1     1      542            540        2.      923            850       33.
 4  2013     1     1      544            545       -1.     1004           1022      -18.
 5  2013     1     1      554            600       -6.      812            837      -25.
 6  2013     1     1      554            558       -4.      740            728       12.
 7  2013     1     1      555            600       -5.      913            854       19.
 8  2013     1     1      557            600       -3.      709            723      -14.
 9  2013     1     1      557            600       -3.      838            846       -8.
10  2013     1     1      558            600       -2.      753            745        8.
# ... with 832 more rows, and 10 more variables: carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Next, let us compare the speed of different ways of selecting a subset.

df <- expand.grid(A = 1:100, B = 1:100, C = 1:100)
df$value <- 1:nrow(df)

library(dplyr); library(microbenchmark)
f1 <- function() subset(df, A == 1 & B == 3 | A == 3 & B == 2)
f2 <- function() filter(df, A == 1 & B == 3 | A == 3 & B == 2)
f3 <- function() df[with(df, A == 1 & B == 3 | A == 3 & B == 2), ]
f4 <- function() df[(df$A == 1 & df$B == 3) | (df$A == 3 & df$B == 2),]

microbenchmark(subset = f1(), filter = f2(), with = f3(), "$" = f4())
# Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval
#  subset 47.42671 49.99802 75.95385 92.24430 96.05960 141.2964   100
#  filter 36.94019 38.77325 60.22831 42.64112 84.35896 155.0145   100
#    with 38.90918 44.36299 71.29214 86.39629 88.89008 134.7670   100
#       $ 40.22723 44.08606 71.32186 86.71372 89.59275 133.1132   100

1.2 Sorting: `arrange()`

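arrange() sorts rows by one or more columns. The form is arrange(data, colnames, ...); the default order is ascending, and wrapping a column in desc() sorts it in descending order.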

### Ascending order
> dplyr::arrange(flights, month, day)
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      517            515        2.      830            819       11.
 2  2013     1     1      533            529        4.      850            830       20.
 3  2013     1     1      542            540        2.      923            850       33.
 4  2013     1     1      544            545       -1.     1004           1022      -18.
 5  2013     1     1      554            600       -6.      812            837      -25.
 6  2013     1     1      554            558       -4.      740            728       12.
 7  2013     1     1      555            600       -5.      913            854       19.
 8  2013     1     1      557            600       -3.      709            723      -14.
 9  2013     1     1      557            600       -3.      838            846       -8.
10  2013     1     1      558            600       -2.      753            745        8.
# ... with 336,766 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
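### Descending order by month. desc() wraps a single column, so only month is reversed here;
### to reverse day as well, write desc(month), desc(day)
> dplyr::arrange(flights, desc(month))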
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013    12     1       13           2359       14.      446            445        1.
 2  2013    12     1       17           2359       18.      443            437        6.
 3  2013    12     1      453            500       -7.      636            651      -15.
 4  2013    12     1      520            515        5.      749            808      -19.
 5  2013    12     1      536            540       -4.      845            850       -5.
 6  2013    12     1      540            550      -10.     1005           1027      -22.
 7  2013    12     1      541            545       -4.      734            755      -21.
 8  2013    12     1      546            545        1.      826            835       -9.
 9  2013    12     1      549            600      -11.      648            659      -11.
10  2013    12     1      550            600      -10.      825            854      -29.
# ... with 336,766 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

### Sorting with base R's order()
> flights[order(flights$month,flights$day),]
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      517            515        2.      830            819       11.
 2  2013     1     1      533            529        4.      850            830       20.
 3  2013     1     1      542            540        2.      923            850       33.
 4  2013     1     1      544            545       -1.     1004           1022      -18.
 5  2013     1     1      554            600       -6.      812            837      -25.
 6  2013     1     1      554            558       -4.      740            728       12.
 7  2013     1     1      555            600       -5.      913            854       19.
 8  2013     1     1      557            600       -3.      709            723      -14.
 9  2013     1     1      557            600       -3.      838            846       -8.
10  2013     1     1      558            600       -2.      753            745        8.
# ... with 336,766 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
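
The sorting rules in this section can be checked on a small table. This is a sketch on a hypothetical tibble `toy`; it also shows that arrange() places missing values last, whether or not desc() is used.

```r
library(dplyr)

# Hypothetical toy data with one missing month
toy <- tibble(
  month = c(1, 12, 12, 1, NA),
  day   = c(2, 1, 31, 15, 5)
)

# Ascending by month, ties broken by day
asc <- arrange(toy, month, day)

# Descending on both columns: wrap each column in its own desc()
both_desc <- arrange(toy, desc(month), desc(day))

# Directions can be mixed: month descending, day ascending
mixed <- arrange(toy, desc(month), day)
```

In all three results the row with the missing month comes last.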

1.3 Selecting and renaming: `select()`, `rename()`

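select() picks a subset of columns by name. The form is select(data, colnames, ...), and a leading - excludes a column. select() can also rename columns, but it returns only the selected columns, whereas rename() renames the given columns and returns all of them. select() also accepts c(colnames...), ranges such as year:day, and several other forms.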

### Select the three columns year, month and day, renaming day to DAY
> df<-dplyr::select(flights,year,month,DAY=day);df
# A tibble: 336,776 x 3
    year month   DAY
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ... with 336,766 more rows
> dplyr::rename(flights,DAY=day)
# A tibble: 336,776 x 19
    year month   DAY dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      517            515        2.      830            819       11.
 2  2013     1     1      533            529        4.      850            830       20.
 3  2013     1     1      542            540        2.      923            850       33.
 4  2013     1     1      544            545       -1.     1004           1022      -18.
 5  2013     1     1      554            600       -6.      812            837      -25.
 6  2013     1     1      554            558       -4.      740            728       12.
 7  2013     1     1      555            600       -5.      913            854       19.
 8  2013     1     1      557            600       -3.      709            723      -14.
 9  2013     1     1      557            600       -3.      838            846       -8.
10  2013     1     1      558            600       -2.      753            745        8.
# ... with 336,766 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

### Column selection with base R
> flights[c('year','month','day')]
# A tibble: 336,776 x 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ... with 336,766 more rows
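
The selection forms described in this section can be sketched on a small table. The tibble `toy` is hypothetical; its column names merely mimic those of flights.

```r
library(dplyr)

# Hypothetical toy data with flights-like column names
toy <- tibble(
  year      = 2013,
  month     = 1:3,
  day       = 4:6,
  dep_delay = c(2, 4, -1),
  arr_delay = c(11, 20, 33)
)

# A range of adjacent columns
range_cols <- select(toy, year:day)

# Exclude columns with a leading minus
no_delays <- select(toy, -dep_delay, -arr_delay)

# Helper functions match columns by name pattern
delays <- select(toy, ends_with("delay"))

# select() with renaming keeps only the selected columns ...
renamed_subset <- select(toy, year, DAY = day)

# ... while rename() keeps every column
renamed_all <- rename(toy, DAY = day)
```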

1.4 Transforming: `mutate()`, `transmute()`

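mutate() adds columns, much like cbind() and transform(), but it improves on transform(): a column created in a mutate() call can be used right away to compute later columns in the same call. transmute() keeps only the newly created columns.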

> dplyr::mutate(flights,gain = arr_delay - dep_delay,gain_per_hour = gain / (air_time / 60))
# A tibble: 336,776 x 21
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      517            515        2.      830            819       11.
 2  2013     1     1      533            529        4.      850            830       20.
 3  2013     1     1      542            540        2.      923            850       33.
 4  2013     1     1      544            545       -1.     1004           1022      -18.
 5  2013     1     1      554            600       -6.      812            837      -25.
 6  2013     1     1      554            558       -4.      740            728       12.
 7  2013     1     1      555            600       -5.      913            854       19.
 8  2013     1     1      557            600       -3.      709            723      -14.
 9  2013     1     1      557            600       -3.      838            846       -8.
10  2013     1     1      558            600       -2.      753            745        8.
# ... with 336,766 more rows, and 12 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, gain <dbl>, gain_per_hour <dbl>
### flights gains two columns, and the new column gain_per_hour is computed from gain, which was created in the same call

> dplyr::transmute(flights,gain = arr_delay - dep_delay,gain_per_hour = gain / (air_time / 60))
# A tibble: 336,776 x 2
    gain gain_per_hour
   <dbl>         <dbl>
 1    9.          2.38
 2   16.          4.23
 3   31.         11.6 
 4  -17.         -5.57
 5  -19.         -9.83
 6   16.          6.40
 7   24.          9.11
 8  -11.        -12.5 
 9   -5.         -2.14
10   10.          4.35
# ... with 336,766 more rows
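
The difference between the two functions can be verified on a few rows. This is a sketch on a hypothetical tibble `toy` that reuses the column names from the example above.

```r
library(dplyr)

# Hypothetical toy data; delays and air time are in minutes
toy <- tibble(
  dep_delay = c(2, 4, -6),
  arr_delay = c(11, 20, -25),
  air_time  = c(120, 60, 90)
)

# mutate() keeps the existing columns and appends the new ones;
# gain_per_hour refers to gain, defined earlier in the same call
with_gain <- mutate(
  toy,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

# transmute() evaluates the same expressions but returns only the new columns
only_gain <- transmute(
  toy,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
```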

1.5 Aggregating: `summarize()`

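summarise() applies summary functions to a data frame and returns the result; it is most often used after grouping. summarise() and summarize() are two spellings of the same function.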

> dplyr::summarise(flights,delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 x 1
  delay
  <dbl>
1  12.6

1.6 Sampling: `sample_n()`, `sample_frac()`

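Both functions draw random rows from a dataset. sample_n() takes the number of rows to draw, while sample_frac() takes the fraction of rows to draw. Their forms are sample_n(tbl, size, replace = FALSE, weight = NULL, .env = NULL) and sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = NULL). Setting replace = TRUE samples with replacement (as in bootstrap sampling), and weight supplies sampling weights.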

> sample_n(flights, 10);sample_frac(flights,0.1)
# A tibble: 10 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     2    12      819            825       -6.      937            945       -8.
 2  2013     5     5     2158           2159       -1.     2258           2337      -39.
 3  2013     5     3     1535           1540       -5.     1745           1650       55.
 4  2013     5     2     1824           1830       -6.     2115           2200      -45.
 5  2013     5     1     1610           1610        0.     1719           1751      -32.
 6  2013     9     8     1954           1859       55.     2134           2127        7.
 7  2013     5    27      537            540       -3.      828            840      -12.
 8  2013     1    24      806            810       -4.     1022           1044      -22.
 9  2013     5    13     1551           1555       -4.     1659           1727      -28.
10  2013     8    12       NA            920       NA        NA           1210       NA 
# ... with 10 more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 33,678 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013    11    16     1950           1930       20.     2055           2047        8.
 2  2013     2     8     1724           1655       29.     2043           2009       34.
 3  2013     9    12      652            700       -8.      931            949      -18.
 4  2013     4     8     1830           1831       -1.     2157           2203       -6.
 5  2013    10    17     1022           1025       -3.     1136           1140       -4.
 6  2013     5    14     1736           1745       -9.     1942           2021      -39.
 7  2013    11    28      745            736        9.      924            920        4.
 8  2013    12    17     1034           1035       -1.     1418           1405       13.
 9  2013    12     6     1953           2000       -7.     2114           2115       -1.
10  2013    10    22     1057           1100       -3.     1416           1415        1.
# ... with 33,668 more rows, and 10 more variables: carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>
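
The replace and weight arguments can be sketched on a small table. The tibble `toy`, its weight column `w` and the seed are arbitrary choices for illustration. Recent dplyr releases mark sample_n() and sample_frac() as superseded by slice_sample(), but the functions remain available.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the draws are reproducible

# Hypothetical toy data: ids 11 to 20 carry three times the weight of ids 1 to 10
toy <- tibble(id = 1:20, w = rep(c(1, 3), each = 10))

# Five distinct rows
five_rows <- sample_n(toy, 5)

# A quarter of the rows: 0.25 * 20 = 5 rows
quarter <- sample_frac(toy, 0.25)

# Sampling with replacement, as in a bootstrap resample of the full table
boot <- sample_n(toy, 20, replace = TRUE)

# Weighted sampling: rows with a larger w are more likely to be drawn
weighted <- sample_n(toy, 5, weight = w)
```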

2. Grouped operations

The six functions above handle most data-cleaning tasks, and they become more powerful when combined with grouping.
In the following example, flights is grouped by tailnum.

> dim(flights)
[1] 336776     19
> length(levels(factor(flights$tailnum)))
[1] 4043
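### flights has 336,776 rows, and tailnum holds 4,043 distinct non-missing tail numbers (aircraft identifiers);
### group_by() reports 4,044 groups below because rows with a missing tailnum form a group of their own
### Grouping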
> df1<-group_by(flights, tailnum);df1
# A tibble: 336,776 x 19
# Groups:   tailnum [4,044]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest 
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>
 1  2013     1     1      517            515        2.      830            819       11. UA        1545 N14228  EWR    IAH  
 2  2013     1     1      533            529        4.      850            830       20. UA        1714 N24211  LGA    IAH  
 3  2013     1     1      542            540        2.      923            850       33. AA        1141 N619AA  JFK    MIA  
 4  2013     1     1      544            545       -1.     1004           1022      -18. B6         725 N804JB  JFK    BQN  
 5  2013     1     1      554            600       -6.      812            837      -25. DL         461 N668DN  LGA    ATL  
 6  2013     1     1      554            558       -4.      740            728       12. UA        1696 N39463  EWR    ORD  
 7  2013     1     1      555            600       -5.      913            854       19. B6         507 N516JB  EWR    FLL  
 8  2013     1     1      557            600       -3.      709            723      -14. EV        5708 N829AS  LGA    IAD  
 9  2013     1     1      557            600       -3.      838            846       -8. B6          79 N593JB  JFK    MCO  
10  2013     1     1      558            600       -2.      753            745        8. AA         301 N3ALAA  LGA    ORD  
# ... with 336,766 more rows, and 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
### Aggregation
### Three summaries are computed per group: (1) the number of rows, via n(); (2) the mean distance, mean(distance); (3) the mean arrival delay, mean(arr_delay)
> df2<-summarise(df1,count=n(),dist=mean(distance, na.rm=TRUE),delay=mean(arr_delay,na.rm=TRUE));df2
# A tibble: 4,044 x 4
   tailnum count  dist   delay
   <chr>   <int> <dbl>   <dbl>
 1 D942DN      4  854.  31.5  
 2 N0EGMQ    371  676.   9.98 
 3 N10156    153  758.  12.7  
 4 N102UW     48  536.   2.94 
 5 N103US     46  535.  -6.93 
 6 N104UW     47  535.   1.80 
 7 N10575    289  520.  20.7  
 8 N105UW     45  525.  -0.267
 9 N107US     41  529.  -5.73 
10 N108UW     60  534.  -1.25 
# ... with 4,034 more rows
### Finally, filter the summarised data
> df3<-filter(df2, count>=20 & dist<2000);df3
# A tibble: 2,986 x 4
   tailnum count  dist   delay
   <chr>   <int> <dbl>   <dbl>
 1 N0EGMQ    371  676.   9.98 
 2 N10156    153  758.  12.7  
 3 N102UW     48  536.   2.94 
 4 N103US     46  535.  -6.93 
 5 N104UW     47  535.   1.80 
 6 N10575    289  520.  20.7  
 7 N105UW     45  525.  -0.267
 8 N107US     41  529.  -5.73 
 9 N108UW     60  534.  -1.25 
10 N109UW     48  536.  -2.52 
# ... with 2,976 more rows

Next, we plot the relationship between average delay and flight distance:

> library(ggplot2)
> ggplot(data=df3) +
+     geom_point(aes(x=dist, y=delay, size=count)) +
+     geom_smooth(aes(x=dist,y=delay))

The plot shows little correlation between average delay and flight distance.
Some functions that are useful when aggregating with dplyr:

n() counts the rows in each group
n_distinct() counts the unique values in each group
first(x), last(x) and nth(x, n) return the value at the given position, like base R's x[1], x[length(x)] and x[n]
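
The helpers listed above can be sketched inside a grouped summarise(). The tibble `toy` and the names of the summary columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two aircraft with a few flights each
toy <- tibble(
  tailnum  = c("A", "A", "A", "B", "B"),
  dest     = c("IAH", "MIA", "IAH", "ORD", "ORD"),
  distance = c(1400, 1089, 1400, 733, 719)
)

per_plane <- summarise(
  group_by(toy, tailnum),
  count       = n(),               # rows per group
  n_dest      = n_distinct(dest),  # distinct destinations per group
  first_dest  = first(dest),       # like dest[1]
  last_dest   = last(dest),        # like dest[length(dest)]
  second_dist = nth(distance, 2)   # like distance[2]
)
```

Here `per_plane` has one row per tail number: aircraft A has 3 flights to 2 distinct destinations, and aircraft B has 2 flights to 1 destination.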

3. The pipe: `%>%`

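The pipe is one of the most practical features available in dplyr (the operator comes from the magrittr package, which dplyr re-exports). It lets us write all the steps of an operation as one readable chain, without storing intermediate results. Below, the pipe reproduces the data processing from the previous section: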

library(nycflights13)  # provides the flights dataset used below
library(dplyr)

df<-flights %>%
  group_by(tailnum) %>%
  summarise(count=n(),
            dist=mean(distance, na.rm=TRUE),
            delay=mean(arr_delay,na.rm=TRUE)) %>%
  filter(count>=20 & dist<2000)

### Output
> df
# A tibble: 2,986 x 4
   tailnum count  dist   delay
   <chr>   <int> <dbl>   <dbl>
 1 N0EGMQ    371  676.   9.98 
 2 N10156    153  758.  12.7  
 3 N102UW     48  536.   2.94 
 4 N103US     46  535.  -6.93 
 5 N104UW     47  535.   1.80 
 6 N10575    289  520.  20.7  
 7 N105UW     45  525.  -0.267
 8 N107US     41  529.  -5.73 
 9 N108UW     60  534.  -1.25 
10 N109UW     48  536.  -2.52 
# ... with 2,976 more rows
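
To see what the pipe saves, the same chain can be written as nested calls and compared. This is a sketch on a hypothetical tibble `toy`; `x %>% f(y)` is evaluated as `f(x, y)`, so both forms return the same result.

```r
library(dplyr)

# Hypothetical toy data: group "a" has 2 rows, group "b" has 3
toy <- tibble(
  g = c("a", "a", "b", "b", "b"),
  x = c(1, 3, 10, 20, 30)
)

# Nested calls: the steps have to be read from the inside out
nested <- dplyr::filter(
  summarise(group_by(toy, g), count = n(), avg = mean(x)),
  count >= 3
)

# Piped: the same steps, read from top to bottom
piped <- toy %>%
  group_by(g) %>%
  summarise(count = n(), avg = mean(x)) %>%
  dplyr::filter(count >= 3)

# Both forms keep only group "b", with count 3 and avg 20
same_result <- isTRUE(all.equal(nested, piped))
```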

4. Summary

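dplyr's operations are convenient and concise, and it removes the limitation that grouped aggregation in reshape2 cannot return a multi-column result. Its pipe is very similar in spirit to shell pipes, and in style it resembles R's built-in with() and within(): there is no need to repeat the data variable's name at every step, and the code reads well. It is also easier to pick up than reshape2.
dplyr also provides operations across multiple datasets, such as joins and intersections.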

References:
dplyr official documentation
dplyr official introduction
The dplyr package in R

Last edited: 2018.07.17 15:45:02
