R for Data Science Notes — Data Transformation (2)

The exercises in these two sections raised a lot of questions for me.

6. Grouped summaries with summarise()

6.7 Exercises

  1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

    • A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
    • A flight is always 10 minutes late.
    • A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
    • 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

Approach 1

> not_cancelled%>%
+ group_by(flight)%>%
+ filter((mean(arr_delay==-15)==0.5)&(mean(arr_delay==15)==0.5))
> not_cancelled%>%
+ group_by(flight)%>%
+ filter(mean(arr_delay==10)==1)
> not_cancelled%>%
+ group_by(flight)%>%
+ filter((mean(arr_delay==-30)==0.5)&(mean(arr_delay==30)==0.5))
> not_cancelled%>%
+ group_by(flight)%>%
+ filter((mean(arr_delay==0)==0.99)&(mean(arr_delay==120)==0.01))
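The filters above all rely on one base-R idiom: taking `mean()` of a logical vector gives the proportion of `TRUE`s. A tiny illustration on a made-up vector (not the flights data):

```r
# Toy delay vector: early by 15 half the time, late by 15 half the time
x <- c(-15, 15, -15, 15)

mean(x == -15)   # proportion of flights exactly 15 min early: 0.5
mean(x == 15)    # proportion of flights exactly 15 min late:  0.5
```

So `mean(arr_delay == -15) == 0.5` reads as "exactly half the group's flights arrived 15 minutes early".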

Approach 2

> percent<-not_cancelled%>%
+      group_by(flight)%>%
+     summarise(n=n(),percent_15ahead=sum(arr_delay==-15)/n,
+                  percent_15late=sum(arr_delay==15)/n,
+                  percent_10late=sum(arr_delay==10)/n,
+                  percent_30ahead=sum(arr_delay==-30)/n,
+                  percent_30late=sum(arr_delay==30)/n,
+                  percent_ontime=sum(arr_delay==0)/n,
+                  percent_2hlate=sum(arr_delay==120)/n)
> percent
# A tibble: 3,835 x 9
   flight     n percent_15ahead percent_15late percent_10late percent_30ahead percent_30late
    <int> <int>           <dbl>          <dbl>          <dbl>           <dbl>          <dbl>
 1      1   697         0.0215         0.0100         0.00574         0.00574        0.00574
 2      2    51         0.0392         0.0196         0               0              0      
 3      3   628         0.00955        0.00637        0.0159          0.0159         0.00318
 4      4   391         0.0358         0.0102         0.00767         0.0128         0.00256
 5      5   324         0.0123         0.00617        0.00926         0.0216         0      
 6      6   206         0.0291         0.00485        0.00485         0.0291         0      
 7      7   236         0.0169         0.00424        0               0.00847        0.00424
 8      8   234         0.0556         0.00855        0.0214          0              0      
 9      9   152         0.0132         0.0132         0.0197          0              0      
10     10    61         0.0164         0.0164         0.0328          0              0      
# ... with 3,825 more rows, and 2 more variables: percent_ontime <dbl>, percent_2hlate <dbl>
> percent%>%
+ filter(percent_15ahead==0.5&percent_15late==0.5)
# A tibble: 0 x 9
# ... with 9 variables: flight <int>, n <int>, percent_15ahead <dbl>, percent_15late <dbl>,
#   percent_10late <dbl>, percent_30ahead <dbl>, percent_30late <dbl>, percent_ontime <dbl>,
#   percent_2hlate <dbl>
> percent%>%
+ filter(percent_10late==1)
# A tibble: 5 x 9
  flight     n percent_15ahead percent_15late percent_10late percent_30ahead percent_30late
   <int> <int>           <dbl>          <dbl>          <dbl>           <dbl>          <dbl>
1   2254     1               0              0              1               0              0
2   3656     1               0              0              1               0              0
3   3785     1               0              0              1               0              0
4   3880     1               0              0              1               0              0
5   5854     1               0              0              1               0              0
# ... with 2 more variables: percent_ontime <dbl>, percent_2hlate <dbl>
> percent%>%
+ filter(percent_30ahead==0.5&percent_30late==0.5)
# A tibble: 0 x 9
# ... with 9 variables: flight <int>, n <int>, percent_15ahead <dbl>, percent_15late <dbl>,
#   percent_10late <dbl>, percent_30ahead <dbl>, percent_30late <dbl>, percent_ontime <dbl>,
#   percent_2hlate <dbl>
> percent%>%
+ filter(percent_ontime==0.99&percent_2hlate==0.01)
# A tibble: 0 x 9
# ... with 9 variables: flight <int>, n <int>, percent_15ahead <dbl>, percent_15late <dbl>,
#   percent_10late <dbl>, percent_30ahead <dbl>, percent_30late <dbl>, percent_ontime <dbl>,
#   percent_2hlate <dbl>

One more question: what if these proportions are supposed to include cancelled flights?
Which is more important: arrival delay or departure delay?
I'd say arr_delay, since there really are flights that depart late and then return or divert (arr_delay = NA).

  2. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).

    > not_cancelled%>%
    + group_by(dest)%>%
    + summarise(n=n())
    
    > not_cancelled%>%
    + group_by(tailnum)%>%
    + summarise(n=sum(distance))# silly me -- I overcomplicated this at first
    # wt = distance weights each row by its distance, i.e. count() returns sum(distance) per group, so grouping and then sum(distance) gives the same output
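Assuming dplyr is available, the equivalence is easy to check on a hypothetical toy tibble (standing in for not_cancelled):

```r
library(dplyr)

# Hypothetical mini data set: two planes, three legs
toy <- tibble(tailnum = c("N1", "N1", "N2"),
              distance = c(100, 200, 50))

# count() with wt sums the weight column within each group...
with_count <- toy %>% count(tailnum, wt = distance)

# ...which is exactly group_by() + summarise(sum(...))
with_sum <- toy %>% group_by(tailnum) %>% summarise(n = sum(distance))

with_count$n   # 300 50
with_sum$n     # 300 50
```

Both pipelines produce a `tailnum`/`n` tibble with `n = 300` for N1 and `n = 50` for N2.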
    
  3. Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?

    >summarise(flights,exist_dep_delay=sum(!is.na(dep_delay)),exist_arr_delay=sum(!is.na(arr_delay)),not_cancelled=sum(!(is.na(dep_delay)|is.na(arr_delay))))
    # A tibble: 1 x 3
      exist_dep_delay exist_arr_delay not_cancelled
                <int>           <int>         <int>
    1          328521          327346        327346
    > # So there are rows with a dep_delay but no arr_delay -- flights that departed and then returned? Should the definition be widened?
    > # Should any flight with a dep_delay not count as cancelled?
    > # Got it: a flight cannot arrive without departing first, so dep_delay is the only column that matters
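The pattern in the counts above can be reproduced on a hypothetical three-row data frame (base R, made-up numbers): a flight can have a dep_delay with a missing arr_delay, but never the reverse.

```r
# Hypothetical mini data: row 2 never departed (truly cancelled);
# row 3 departed late but has no arrival record (returned or diverted)
toy <- data.frame(dep_delay = c(2, NA, 10),
                  arr_delay = c(5, NA, NA))

sum(is.na(toy$dep_delay))                          # 1 flight never departed
sum(is.na(toy$dep_delay) | is.na(toy$arr_delay))   # 2 by the book's definition
```

Using only `is.na(dep_delay)` counts one cancellation; the book's OR definition also sweeps in the diverted flight.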
    
  4. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

    > flights%>%# what does "pattern" in the question mean?
    + group_by(year,month,day)%>%
    + summarise(cancel=sum(is.na(dep_delay)|is.na(arr_delay)),cancel_prop=mean(is.na(dep_delay)|is.na(arr_delay)),avg_delay=mean(arr_delay,na.rm=TRUE))%>%
    + ggplot(mapping=aes(x=avg_delay,y=cancel_prop))+geom_point()+geom_smooth(se=FALSE)
    `geom_smooth()` using method = 'loess' and formula 'y ~ x'
    

I'm still not sure what "pattern" means in the question. Judging from the plot, the two look positively correlated: days with a higher average delay also tend to have a higher proportion of cancellations.

  5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))
    > not_cancelled%>%
    + group_by(carrier)%>%
    + summarise(mean_delay=mean(arr_delay[arr_delay>0]))%>%
    + ggplot(mapping=aes(x=carrier,y=mean_delay))+geom_bar(stat="identity")
    

How should "worst delay" be defined? If it means the highest mean delay among delayed flights, the answer is OO.
I can't do the second half of the question. I don't think the effects can be fully disentangled: some airports are served by only one or two carriers, and some carriers fly only a single route, so a bad carrier can't be separated from a bad airport.
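Where carriers do share routes, one hedged partial fix (my own sketch on made-up data, not the book's answer) is to compare each flight's delay to the average on its route, then average that relative delay per carrier:

```r
library(dplyr)

# Hypothetical toy data: two carriers flying the same two routes,
# so route mix can be adjusted for
toy <- tibble(carrier   = c("AA", "AA", "B6", "B6"),
              dest      = c("ATL", "BOS", "ATL", "BOS"),
              arr_delay = c(20, 10, 5, 15))

res <- toy %>%
  group_by(dest) %>%
  mutate(rel_delay = arr_delay - mean(arr_delay)) %>%  # delay vs route average
  group_by(carrier) %>%
  summarise(mean_rel = mean(rel_delay))

res$mean_rel   # 2.5 -2.5: AA runs 2.5 min worse than the route average here
```

This only helps on overlapping routes; for a destination with a single carrier (the situation noted above), the route average *is* that carrier's average and the adjustment tells you nothing.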

  6. What does the sort argument to count() do? When might you use it?

    If sort = TRUE, count() sorts the output in descending order of n.

> not_cancelled%>%
+ count(dest,sort=TRUE)# handy when you want to see which value of a variable occurs most often
# A tibble: 104 x 2
   dest      n
   <chr> <int>
 1 ATL   16837
 2 ORD   16566
 3 LAX   16026
 4 BOS   15022
 5 MCO   13967
 6 CLT   13674
 7 SFO   13173
 8 FLL   11897
 9 MIA   11593
10 DCA    9111
# ... with 94 more rows

Equivalent to:

not_cancelled%>%
group_by(dest)%>%
summarise(n=n())%>%
arrange(desc(n))
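That equivalence is easy to verify on a hypothetical toy tibble (assuming dplyr is installed):

```r
library(dplyr)

# Made-up data: three flights to ATL, one to BOS
toy <- tibble(dest = c("ATL", "ATL", "ATL", "BOS"))

a <- toy %>% count(dest, sort = TRUE)
b <- toy %>% group_by(dest) %>% summarise(n = n()) %>% arrange(desc(n))

a$dest   # "ATL" "BOS"
a$n      # 3 1
all(a$dest == b$dest) && all(a$n == b$n)   # same rows, same order
```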

7. Grouped mutates (and filters)

Find the worst members of each group:
flights_sml %>% 
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)
# Note: this finds the nine flights with the highest arr_delay on each day (ranks 1-9), not the days with the highest arr_delay
# 9 per day, so the result has 3,300-odd rows
Find all groups bigger than a threshold:
> popular_dest<-flights%>%
+ group_by(dest)%>%
+ filter(n()>365)
> popular_dest
Standardise to compute per group metrics:
> popular_dest%>%# this data set is already grouped by dest
+ filter(arr_delay>0)%>%# filter() here acts row by row within each group
+ mutate(prop_delay=arr_delay/sum(arr_delay))%>%# sum(arr_delay) is the group total (each dest's total delay over the whole data set)
+ select(year:day,arr_delay,prop_delay)
Adding missing grouping variables: `dest`
# prop_delay is each individual leg's share of its dest's total delay for the year
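The per-group behavior of `sum()` inside a grouped `mutate()` can be seen on a hypothetical three-row tibble (assuming dplyr is installed):

```r
library(dplyr)

# Made-up data: two legs to ATL, one to BOS
toy <- tibble(dest = c("ATL", "ATL", "BOS"),
              arr_delay = c(30, 10, 20))

res <- toy %>%
  group_by(dest) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay))  # sum() is taken per group

res$prop_delay   # 0.75 0.25 1.00  (30/40, 10/40, 20/20)
```

Without the `group_by()`, `sum(arr_delay)` would be 60 and every proportion would be computed against the grand total instead.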

Exercises

  1. Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
    Not much changes conceptually, I think -- these functions simply operate within each group rather than over the whole data frame (this matters for ranking, offset, and aggregating functions; plain arithmetic is unaffected).
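A small dplyr sketch of the difference, on made-up data -- ranking over the whole tibble vs restarting within each group:

```r
library(dplyr)

# Hypothetical data: two groups of two rows
df <- tibble(g = c("a", "a", "b", "b"),
             x = c(10, 20, 30, 40))

ungrouped <- df %>% mutate(r = min_rank(x))
grouped   <- df %>% group_by(g) %>% mutate(r = min_rank(x))

ungrouped$r   # 1 2 3 4  -- ranked over all four rows
grouped$r     # 1 2 1 2  -- ranking restarts inside each group
```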

  2. Which plane (tailnum) has the worst on-time record?

# Same question as before: how to define "worst" -- the proportion of delayed flights, or the mean delay?
# After group_by(), filter()/select() operate within each group
# To describe each group as a whole, summarise() is needed
# So after group_by(tailnum), summarise() is the verb to use here
> tailnum<-not_cancelled%>%
+ group_by(tailnum)%>%
+ summarise(prop_delay=sum(arr_delay>0)/n(),mean_delay=sum(arr_delay)/n())
> arrange(tailnum,desc(prop_delay))
# A tibble: 4,037 x 3
   tailnum prop_delay mean_delay
   <chr>        <dbl>      <dbl>
 1 N121DE           1       15  
 2 N136DL           1      146  
 3 N143DA           1       24  
 4 N17627           1       15.5
 5 N240AT           1       53.6
 6 N26906           1       19  
 7 N295AT           1       30.8
 8 N302AS           1       29  
 9 N303AS           1       46  
10 N32626           1        5  
# ... with 4,027 more rows
> arrange(tailnum,desc(mean_delay))
# A tibble: 4,037 x 3
   tailnum prop_delay mean_delay
   <chr>        <dbl>      <dbl>
 1 N844MH       1           320 
 2 N911DA       1           294 
 3 N922EV       1           276 
 4 N587NW       1           264 
 5 N851NW       1           219 
 6 N928DN       1           201 
 7 N7715E       1           188 
 8 N654UA       1           185 
 9 N665MQ       0.667       175.
10 N427SW       1           157 
# ... with 4,027 more rows
# But filter(rank(desc(prop_delay))<10) kept returning an empty tibble, no matter what

How could this code be made more concise?
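One likely reason the rank()-based filter above returns an empty tibble (my guess, based on base R's tie handling): `rank()` gives tied values their *average* rank, and with hundreds of planes tied at prop_delay == 1 every rank is far above 10. `dplyr::min_rank()` gives ties the smallest rank instead. A toy demonstration:

```r
# 100 values all tied -- e.g. 100 planes with prop_delay == 1
x <- rep(1, 100)

rank(-x)[1]              # 50.5 -- average rank of the tie, never < 10
dplyr::min_rank(-x)[1]   # 1   -- smallest rank of the tie
```

So `filter(min_rank(desc(prop_delay)) < 10)` keeps all planes whose minimum rank is in the top nine, ties included.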

  3. What time of day should you fly if you want to avoid delays as much as possible?

  4. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

> not_cancelled%>%
+ group_by(dest)%>%
+ summarise(total_by_dest=sum(arr_delay))

I'm not sure whether "flight" here means a flight number (flown every day, many rows, i.e. a group) or a single leg (a single row).
I'll read it as flight number.
Plan: group by dest first, use mutate() to compute each individual leg's proportion, then group by flight and sum.

> by_dest<-not_cancelled%>%
+ group_by(dest)
> by_dest%>%
+ mutate(prop_each=arr_delay/sum(arr_delay))%>%
+ group_by(flight)%>%
+ mutate(prop_flight=sum(prop_each))%>%
+ select(year:day,flight,prop_each,prop_flight)
# A tibble: 327,346 x 6
# Groups:   flight [3,835]
    year month   day flight  prop_each prop_flight
   <int> <int> <int>  <int>      <dbl>       <dbl>
 1  2013     1     1   1545  0.000366      0.0116 
 2  2013     1     1   1714  0.000666      0.0647 
 3  2013     1     1   1141  0.00952      -0.119  
 4  2013     1     1    725 -0.00246       0.0240 
 5  2013     1     1    461 -0.000131     -0.0243 
 6  2013     1     1   1696  0.000123      0.00471
 7  2013     1     1    507  0.000198      0.0508 
 8  2013     1     1   5708 -0.000188      0.00154
 9  2013     1     1     79 -0.000105     -0.00491
10  2013     1     1    301  0.0000822    -0.177  
# ... with 327,336 more rows
  5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.
    Plan: group by origin, sort by scheduled departure time, then plot x = lag(dep_delay) against y = dep_delay to see how the previous flight's delay relates to the current one's.
> flights%>%
+ group_by(origin)%>%
+ arrange(year,month,day,sched_dep_time)%>%
+ mutate(prev_delay=lag(dep_delay))%>%
+ ggplot(mapping=aes(x=prev_delay,y=dep_delay))+geom_point()
So what information does this plot convey?

So the plot seems to say the correlation is weak?
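For reference, `dplyr::lag()` just shifts a vector down by one position and pads with NA -- that is what pairs each flight with the one scheduled before it:

```r
# Made-up delay sequence for one origin, in scheduled order
delays <- c(10, -3, 25, 0)

dplyr::lag(delays)   # NA 10 -3 25
```

A useful detail: inside a grouped `mutate()`, `lag()` is applied within each group, so the first flight of each origin gets NA rather than the last flight of the previous origin.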

  6. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
    Plan: define "suspiciously fast" as air_time <= 30 min.
    Group by dest, drop the missing values and the suspiciously fast flights, find the shortest air time, then compute each flight's ratio to it.
> not_cancelled%>%
+ filter(air_time<=30)
> not_cancelled%>%
+ filter(air_time>30)%>%
+ group_by(dest)%>%
+ mutate(rel_airtime=air_time/min(air_time))%>%
+ filter(rank(desc(rel_airtime))<10)
# A tibble: 940 x 20
# Groups:   dest [104]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
 1  2013     1     1      646            645         1     1023           1030        -7 UA     
 2  2013     1     1      656            705        -9     1007            940        27 MQ     
 3  2013     1     1      805            815       -10     1006           1010        -4 MQ     
 4  2013     1     1      817            810         7     1005            948        17 B6     
 5  2013     1     1      848            851        -3     1155           1136        19 UA     
 6  2013     1     1      926            928        -2     1233           1220        13 UA     
 7  2013     1     1     1032           1035        -3     1305           1250        15 EV     
 8  2013     1     1     1044           1045        -1     1231           1212        19 EV     
 9  2013     1     1     1147           1155        -8     1335           1327         8 FL     
10  2013     1     1     1330           1321         9     1613           1536        37 EV     
  7. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
    I didn't really understand what the question wants; the code below is from GitHub.
> flights%>%
+ group_by(dest)%>%
+ filter(n_distinct(carrier)>=2)%>%
+ group_by(carrier)%>%
+ mutate(possible_transfers=n_distinct(dest))%>%
+ arrange(desc(possible_transfers))%>%
+ select(dest,year:day,possible_transfers)
Adding missing grouping variables: `carrier`
# A tibble: 325,397 x 6
# Groups:   carrier [16]
   carrier dest   year month   day possible_transfers
   <chr>   <chr> <int> <int> <int>              <int>
 1 EV      IAD    2013     1     1                 51
 2 EV      MSP    2013     1     1                 51
 3 EV      IAD    2013     1     1                 51
 4 EV      JAX    2013     1     1                 51
 5 EV      CHS    2013     1     1                 51
 6 EV      MEM    2013     1     1                 51
 7 EV      CLE    2013     1     1                 51
 8 EV      RDU    2013     1     1                 51
 9 EV      DCA    2013     1     1                 51
10 EV      AVL    2013     1     1                 51
# ... with 325,387 more rows

A question: when group_by() is called the second time, does it group within the existing dest groups or regroup the original tibble? Judging from the output (and from dest still being usable in later verbs) it regroups from scratch -- a new group_by() replaces any previous grouping unless .add = TRUE is given. I was briefly confused about what the pipeline was doing here.
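This can be checked directly with `group_vars()` on a toy tibble (behavior of dplyr >= 1.0, where the argument is spelled `.add`):

```r
library(dplyr)

# Hypothetical data with the same columns as the pipeline above
df <- tibble(dest = c("IAD", "MSP"), carrier = c("EV", "EV"))

g1 <- df %>% group_by(dest) %>% group_by(carrier) %>% group_vars()
g2 <- df %>% group_by(dest) %>% group_by(carrier, .add = TRUE) %>% group_vars()

g1   # "carrier"          -- the second group_by() replaced the first
g2   # "dest" "carrier"   -- .add = TRUE stacks the groupings
```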

  8. For each plane, count the number of flights before the first delay of greater than 1 hour.
    Plan: treat this like the threshold example above.
    To get a tibble of the rows before each plane's first delay of over an hour, the data has to be sorted.
    I couldn't solve this one either; the answer below is from StackOverflow.
    Approach 1: use first() to get each plane's first delay of more than an hour
> filghts_del<-flights%>%
+ group_by(tailnum)%>%
+ arrange(year,month,day,dep_time)%>%
+ filter(arr_delay>60)%>%
+ summarise(first_month=first(month),first_day=first(day),first_time=first(dep_time))
> filghts_del
# A tibble: 3,371 x 4
   tailnum first_month first_day first_time
   <chr>         <int>     <int>      <int>
 1 D942DN            2        11       1508
 2 N0EGMQ            1         1       1604
 3 N10156            1        13       2221
 4 N102UW            6        22       2141
 5 N104UW            2        13       1331
 6 N10575            1         2       1548
 7 N105UW            7        11       1411
 8 N107US            7        27       1624
 9 N108UW            8        13       1329
10 N109UW            9        13       1643
# ... with 3,361 more rows
> not_cancelled%>%
+ group_by(tailnum)%>%
+ summarise(n=sum(month<filghts_del$first_month)+sum(month==filghts_del$first_month&day<filghts_del$first_day)+sum(month==filghts_del$first_month&day==filghts_del$first_day&dep_time<filghts_del$first_time))
# A tibble: 4,037 x 2
   tailnum     n
   <chr>   <int>
 1 D942DN   1390
 2 N0EGMQ    913
 3 N10156   1150
 4 N102UW    671
 5 N103US    693
 6 N104UW    632
 7 N10575    930
 8 N105UW    476
 9 N107US    583
10 N108UW    578
# ... with 4,027 more rows
There were 50 or more warnings (use warnings() to see the first 50)

Very convoluted, and I'm not even sure it's correct -- the 50+ warnings come from comparing each group's month/day vectors against the full-length columns of filghts_del (length recycling), so the counts are not to be trusted.
Approach 2: use cumall()

> not_cancelled%>%
+ group_by(tailnum)%>%
+ arrange(year,month,day,dep_time)%>%
+ filter(cumall(arr_delay<60))%>%
+ tally()
# A tibble: 3,799 x 2
   tailnum     n
   <chr>   <int>
 1 N10156      9
 2 N102UW     25
 3 N103US     46
 4 N104UW      3
 5 N105UW     22
 6 N107US     20
 7 N108UW     36
 8 N109UW     28
 9 N110UW     15
10 N11107      7
# ... with 3,789 more rows
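Why this works: `dplyr::cumall()` is a cumulative "all so far" -- it stays TRUE up to (but not including) the first FALSE, so `filter(cumall(arr_delay < 60))` keeps each plane's flights strictly before its first hour-plus delay. A one-line demonstration:

```r
# TRUE while every element so far has been TRUE, FALSE from the first FALSE on
dplyr::cumall(c(TRUE, TRUE, FALSE, TRUE))   # TRUE TRUE FALSE FALSE
```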