R for Data Science Notes — Data Transformation (2)

The exercises in these two sections raised a lot of questions for me.

6. Grouped summaries with summarise()

6.7 Exercises

  1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

    • A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
    • A flight is always 10 minutes late.
    • A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
    • 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

Approach 1

> not_cancelled%>%
+ group_by(flight)%>%
+ filter((mean(arr_delay==-15)==0.5)&(mean(arr_delay==15)==0.5))
> not_cancelled%>%
+ group_by(flight)%>%
+ filter(mean(arr_delay==10)==1)
> not_cancelled%>%
+ group_by(flight)%>%
+ filter((mean(arr_delay==-30)==0.5)&(mean(arr_delay==30)==0.5))
> not_cancelled%>%
+ group_by(flight)%>%
+ filter((mean(arr_delay==0)==0.99)&(mean(arr_delay==120)==0.01))
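The filters above all rely on one base-R idiom: taking `mean()` of a logical vector gives the proportion of `TRUE`s. A tiny illustration on a made-up vector (not the flights data):

```r
# Toy delay vector: early by 15 half the time, late by 15 half the time
x <- c(-15, 15, -15, 15)

mean(x == -15)   # proportion of flights exactly 15 min early: 0.5
mean(x == 15)    # proportion of flights exactly 15 min late:  0.5
```

So `mean(arr_delay == -15) == 0.5` reads as "exactly half the group's flights arrived 15 minutes early".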

Approach 2

> percent<-not_cancelled%>%
+      group_by(flight)%>%
+     summarise(n=n(),percent_15ahead=sum(arr_delay==-15)/n,
+                  percent_15late=sum(arr_delay==15)/n,
+                  percent_10late=sum(arr_delay==10)/n,
+                  percent_30ahead=sum(arr_delay==-30)/n,
+                  percent_30late=sum(arr_delay==30)/n,
+                  percent_ontime=sum(arr_delay==0)/n,
+                  percent_2hlate=sum(arr_delay==120)/n)
> percent
# A tibble: 3,835 x 9
   flight     n percent_15ahead percent_15late percent_10late percent_30ahead percent_30late
    <int> <int>           <dbl>          <dbl>          <dbl>           <dbl>          <dbl>
 1      1   697         0.0215         0.0100         0.00574         0.00574        0.00574
 2      2    51         0.0392         0.0196         0               0              0      
 3      3   628         0.00955        0.00637        0.0159          0.0159         0.00318
 4      4   391         0.0358         0.0102         0.00767         0.0128         0.00256
 5      5   324         0.0123         0.00617        0.00926         0.0216         0      
 6      6   206         0.0291         0.00485        0.00485         0.0291         0      
 7      7   236         0.0169         0.00424        0               0.00847        0.00424
 8      8   234         0.0556         0.00855        0.0214          0              0      
 9      9   152         0.0132         0.0132         0.0197          0              0      
10     10    61         0.0164         0.0164         0.0328          0              0      
# ... with 3,825 more rows, and 2 more variables: percent_ontime <dbl>, percent_2hlate <dbl>
> percent%>%
+ filter(percent_15ahead==0.5&percent_15late==0.5)
# A tibble: 0 x 9
# ... with 9 variables: flight <int>, n <int>, percent_15ahead <dbl>, percent_15late <dbl>,
#   percent_10late <dbl>, percent_30ahead <dbl>, percent_30late <dbl>, percent_ontime <dbl>,
#   percent_2hlate <dbl>
> percent%>%
+ filter(percent_10late==1)
# A tibble: 5 x 9
  flight     n percent_15ahead percent_15late percent_10late percent_30ahead percent_30late
   <int> <int>           <dbl>          <dbl>          <dbl>           <dbl>          <dbl>
1   2254     1               0              0              1               0              0
2   3656     1               0              0              1               0              0
3   3785     1               0              0              1               0              0
4   3880     1               0              0              1               0              0
5   5854     1               0              0              1               0              0
# ... with 2 more variables: percent_ontime <dbl>, percent_2hlate <dbl>
> percent%>%
+ filter(percent_30ahead==0.5&percent_30late==0.5)
# A tibble: 0 x 9
# ... with 9 variables: flight <int>, n <int>, percent_15ahead <dbl>, percent_15late <dbl>,
#   percent_10late <dbl>, percent_30ahead <dbl>, percent_30late <dbl>, percent_ontime <dbl>,
#   percent_2hlate <dbl>
> percent%>%
+ filter(percent_ontime==0.99&percent_2hlate==0.01)
# A tibble: 0 x 9
# ... with 9 variables: flight <int>, n <int>, percent_15ahead <dbl>, percent_15late <dbl>,
#   percent_10late <dbl>, percent_30ahead <dbl>, percent_30late <dbl>, percent_ontime <dbl>,
#   percent_2hlate <dbl>

One more question: what if these proportions are supposed to include cancelled flights?
Which is more important: arrival delay or departure delay?
I'd say arr_delay, since there really are flights that depart late and then return or divert (arr_delay = NA).

  2. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).

    > not_cancelled%>%
    + group_by(dest)%>%
    + summarise(n=n())
    
    > not_cancelled%>%
    + group_by(tailnum)%>%
    + summarise(n=sum(distance))# silly me -- I overcomplicated this at first
    # wt = distance weights each row by its distance, i.e. count() returns sum(distance) per group, so grouping and then sum(distance) gives the same output
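Assuming dplyr is available, the equivalence is easy to check on a hypothetical toy tibble (standing in for not_cancelled):

```r
library(dplyr)

# Hypothetical mini data set: two planes, three legs
toy <- tibble(tailnum = c("N1", "N1", "N2"),
              distance = c(100, 200, 50))

# count() with wt sums the weight column within each group...
with_count <- toy %>% count(tailnum, wt = distance)

# ...which is exactly group_by() + summarise(sum(...))
with_sum <- toy %>% group_by(tailnum) %>% summarise(n = sum(distance))

with_count$n   # 300 50
with_sum$n     # 300 50
```

Both pipelines produce a `tailnum`/`n` tibble with `n = 300` for N1 and `n = 50` for N2.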
    
  3. Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?

    >summarise(flights,exist_dep_delay=sum(!is.na(dep_delay)),exist_arr_delay=sum(!is.na(arr_delay)),not_cancelled=sum(!(is.na(dep_delay)|is.na(arr_delay))))
    # A tibble: 1 x 3
      exist_dep_delay exist_arr_delay not_cancelled
                <int>           <int>         <int>
    1          328521          327346        327346
    > # So there are rows with a dep_delay but no arr_delay -- flights that departed and then returned? Should the definition be widened?
    > # Should any flight with a dep_delay not count as cancelled?
    > # Got it: a flight cannot arrive without departing first, so dep_delay is the only column that matters
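The pattern in the counts above can be reproduced on a hypothetical three-row data frame (base R, made-up numbers): a flight can have a dep_delay with a missing arr_delay, but never the reverse.

```r
# Hypothetical mini data: row 2 never departed (truly cancelled);
# row 3 departed late but has no arrival record (returned or diverted)
toy <- data.frame(dep_delay = c(2, NA, 10),
                  arr_delay = c(5, NA, NA))

sum(is.na(toy$dep_delay))                          # 1 flight never departed
sum(is.na(toy$dep_delay) | is.na(toy$arr_delay))   # 2 by the book's definition
```

Using only `is.na(dep_delay)` counts one cancellation; the book's OR definition also sweeps in the diverted flight.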
    
  4. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

    > flights%>%# what does "pattern" in the question mean?
    + group_by(year,month,day)%>%
    + summarise(cancel=sum(is.na(dep_delay)|is.na(arr_delay)),cancel_prop=mean(is.na(dep_delay)|is.na(arr_delay)),avg_delay=mean(arr_delay,na.rm=TRUE))%>%
    + ggplot(mapping=aes(x=avg_delay,y=cancel_prop))+geom_point()+geom_smooth(se=FALSE)
    `geom_smooth()` using method = 'loess' and formula 'y ~ x'
    

I'm still not sure what "pattern" means in the question. Judging from the plot, the two look positively correlated: days with a higher average delay also tend to have a higher proportion of cancellations.

  5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))
    > not_cancelled%>%
    + group_by(carrier)%>%
    + summarise(mean_delay=mean(arr_delay[arr_delay>0]))%>%
    + ggplot(mapping=aes(x=carrier,y=mean_delay))+geom_bar(stat="identity")
    

How should "worst delay" be defined? If it means the highest mean delay among delayed flights, the answer is OO.
I can't do the second half of the question. I don't think the effects can be fully disentangled: some airports are served by only one or two carriers, and some carriers fly only a single route, so a bad carrier can't be separated from a bad airport.
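Where carriers do share routes, one hedged partial fix (my own sketch on made-up data, not the book's answer) is to compare each flight's delay to the average on its route, then average that relative delay per carrier:

```r
library(dplyr)

# Hypothetical toy data: two carriers flying the same two routes,
# so route mix can be adjusted for
toy <- tibble(carrier   = c("AA", "AA", "B6", "B6"),
              dest      = c("ATL", "BOS", "ATL", "BOS"),
              arr_delay = c(20, 10, 5, 15))

res <- toy %>%
  group_by(dest) %>%
  mutate(rel_delay = arr_delay - mean(arr_delay)) %>%  # delay vs route average
  group_by(carrier) %>%
  summarise(mean_rel = mean(rel_delay))

res$mean_rel   # 2.5 -2.5: AA runs 2.5 min worse than the route average here
```

This only helps on overlapping routes; for a destination with a single carrier (the situation noted above), the route average *is* that carrier's average and the adjustment tells you nothing.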

  6. What does the sort argument to count() do? When might you use it?

    If sort = TRUE, count() sorts the output in descending order of n.

> not_cancelled%>%
+ count(dest,sort=TRUE)# handy when you want to see which value of a variable occurs most often
# A tibble: 104 x 2
   dest      n
   <chr> <int>
 1 ATL   16837
 2 ORD   16566
 3 LAX   16026
 4 BOS   15022
 5 MCO   13967
 6 CLT   13674
 7 SFO   13173
 8 FLL   11897
 9 MIA   11593
10 DCA    9111
# ... with 94 more rows

Equivalent to:

not_cancelled%>%
group_by(dest)%>%
summarise(n=n())%>%
arrange(desc(n))
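That equivalence is easy to verify on a hypothetical toy tibble (assuming dplyr is installed):

```r
library(dplyr)

# Made-up data: three flights to ATL, one to BOS
toy <- tibble(dest = c("ATL", "ATL", "ATL", "BOS"))

a <- toy %>% count(dest, sort = TRUE)
b <- toy %>% group_by(dest) %>% summarise(n = n()) %>% arrange(desc(n))

a$dest   # "ATL" "BOS"
a$n      # 3 1
all(a$dest == b$dest) && all(a$n == b$n)   # same rows, same order
```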

7. Grouped mutates (and filters)

Find the worst members of each group:
flights_sml %>% 
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)
# Note: this finds the nine flights with the highest arr_delay on each day (ranks 1-9), not the days with the highest arr_delay
# 9 per day, so the result has 3,300-odd rows
Find all groups bigger than a threshold:
> popular_dest<-flights%>%
+ group_by(dest)%>%
+ filter(n()>365)
> popular_dest
Standardise to compute per group metrics:
> popular_dest%>%# this data set is already grouped by dest
+ filter(arr_delay>0)%>%# filter() here acts row by row within each group
+ mutate(prop_delay=arr_delay/sum(arr_delay))%>%# sum(arr_delay) is the group total (each dest's total delay over the whole data set)
+ select(year:day,arr_delay,prop_delay)
Adding missing grouping variables: `dest`
# prop_delay is each individual leg's share of its dest's total delay for the year
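The per-group behavior of `sum()` inside a grouped `mutate()` can be seen on a hypothetical three-row tibble (assuming dplyr is installed):

```r
library(dplyr)

# Made-up data: two legs to ATL, one to BOS
toy <- tibble(dest = c("ATL", "ATL", "BOS"),
              arr_delay = c(30, 10, 20))

res <- toy %>%
  group_by(dest) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay))  # sum() is taken per group

res$prop_delay   # 0.75 0.25 1.00  (30/40, 10/40, 20/20)
```

Without the `group_by()`, `sum(arr_delay)` would be 60 and every proportion would be computed against the grand total instead.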

Exercises

  1. Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
    Not much changes conceptually, I think -- these functions simply operate within each group rather than over the whole data frame (this matters for ranking, offset, and aggregating functions; plain arithmetic is unaffected).
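A small dplyr sketch of the difference, on made-up data -- ranking over the whole tibble vs restarting within each group:

```r
library(dplyr)

# Hypothetical data: two groups of two rows
df <- tibble(g = c("a", "a", "b", "b"),
             x = c(10, 20, 30, 40))

ungrouped <- df %>% mutate(r = min_rank(x))
grouped   <- df %>% group_by(g) %>% mutate(r = min_rank(x))

ungrouped$r   # 1 2 3 4  -- ranked over all four rows
grouped$r     # 1 2 1 2  -- ranking restarts inside each group
```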

  2. Which plane (tailnum) has the worst on-time record?

# Same question as before: how to define "worst" -- the proportion of delayed flights, or the mean delay?
# After group_by(), filter()/select() operate within each group
# To describe each group as a whole, summarise() is needed
# So after group_by(tailnum), summarise() is the verb to use here
> tailnum<-not_cancelled%>%
+ group_by(tailnum)%>%
+ summarise(prop_delay=sum(arr_delay>0)/n(),mean_delay=sum(arr_delay)/n())
> arrange(tailnum,desc(prop_delay))
# A tibble: 4,037 x 3
   tailnum prop_delay mean_delay
   <chr>        <dbl>      <dbl>
 1 N121DE           1       15  
 2 N136DL           1      146  
 3 N143DA           1       24  
 4 N17627           1       15.5
 5 N240AT           1       53.6
 6 N26906           1       19  
 7 N295AT           1       30.8
 8 N302AS           1       29  
 9 N303AS           1       46  
10 N32626           1        5  
# ... with 4,027 more rows
> arrange(tailnum,desc(mean_delay))
# A tibble: 4,037 x 3
   tailnum prop_delay mean_delay
   <chr>        <dbl>      <dbl>
 1 N844MH       1           320 
 2 N911DA       1           294 
 3 N922EV       1           276 
 4 N587NW       1           264 
 5 N851NW       1           219 
 6 N928DN       1           201 
 7 N7715E       1           188 
 8 N654UA       1           185 
 9 N665MQ       0.667       175.
10 N427SW       1           157 
# ... with 4,027 more rows
# But filter(rank(desc(prop_delay))<10) kept returning an empty tibble, no matter what

How could this code be made more concise?
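One likely reason the rank()-based filter above returns an empty tibble (my guess, based on base R's tie handling): `rank()` gives tied values their *average* rank, and with hundreds of planes tied at prop_delay == 1 every rank is far above 10. `dplyr::min_rank()` gives ties the smallest rank instead. A toy demonstration:

```r
# 100 values all tied -- e.g. 100 planes with prop_delay == 1
x <- rep(1, 100)

rank(-x)[1]              # 50.5 -- average rank of the tie, never < 10
dplyr::min_rank(-x)[1]   # 1   -- smallest rank of the tie
```

So `filter(min_rank(desc(prop_delay)) < 10)` keeps all planes whose minimum rank is in the top nine, ties included.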

  3. What time of day should you fly if you want to avoid delays as much as possible?

  4. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

> not_cancelled%>%
+ group_by(dest)%>%
+ summarise(total_by_dest=sum(arr_delay))

I'm not sure whether "flight" here means a flight number (flown every day, many rows, i.e. a group) or a single leg (a single row).
I'll read it as flight number.
Plan: group by dest first, use mutate() to compute each individual leg's proportion, then group by flight and sum.

> by_dest<-not_cancelled%>%
+ group_by(dest)
> by_dest%>%
+ mutate(prop_each=arr_delay/sum(arr_delay))%>%
+ group_by(flight)%>%
+ mutate(prop_flight=sum(prop_each))%>%
+ select(year:day,flight,prop_each,prop_flight)
# A tibble: 327,346 x 6
# Groups:   flight [3,835]
    year month   day flight  prop_each prop_flight
   <int> <int> <int>  <int>      <dbl>       <dbl>
 1  2013     1     1   1545  0.000366      0.0116 
 2  2013     1     1   1714  0.000666      0.0647 
 3  2013     1     1   1141  0.00952      -0.119  
 4  2013     1     1    725 -0.00246       0.0240 
 5  2013     1     1    461 -0.000131     -0.0243 
 6  2013     1     1   1696  0.000123      0.00471
 7  2013     1     1    507  0.000198      0.0508 
 8  2013     1     1   5708 -0.000188      0.00154
 9  2013     1     1     79 -0.000105     -0.00491
10  2013     1     1    301  0.0000822    -0.177  
# ... with 327,336 more rows
  5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.
    Plan: group by origin, sort by scheduled departure time, then plot x = lag(dep_delay) against y = dep_delay to see how the previous flight's delay relates to the current one's.
> flights%>%
+ group_by(origin)%>%
+ arrange(year,month,day,sched_dep_time)%>%
+ mutate(prev_delay=lag(dep_delay))%>%
+ ggplot(mapping=aes(x=prev_delay,y=dep_delay))+geom_point()
So what information does this plot convey?

So the plot seems to say the correlation is weak?
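For reference, `dplyr::lag()` just shifts a vector down by one position and pads with NA -- that is what pairs each flight with the one scheduled before it:

```r
# Made-up delay sequence for one origin, in scheduled order
delays <- c(10, -3, 25, 0)

dplyr::lag(delays)   # NA 10 -3 25
```

A useful detail: inside a grouped `mutate()`, `lag()` is applied within each group, so the first flight of each origin gets NA rather than the last flight of the previous origin.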

  6. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
    Plan: define "suspiciously fast" as air_time <= 30 min.
    Group by dest, drop the missing values and the suspiciously fast flights, find the shortest air time, then compute each flight's ratio to it.
> not_cancelled%>%
+ filter(air_time<=30)
> not_cancelled%>%
+ filter(air_time>30)%>%
+ group_by(dest)%>%
+ mutate(rel_airtime=air_time/min(air_time))%>%
+ filter(rank(desc(rel_airtime))<10)
# A tibble: 940 x 20
# Groups:   dest [104]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
 1  2013     1     1      646            645         1     1023           1030        -7 UA     
 2  2013     1     1      656            705        -9     1007            940        27 MQ     
 3  2013     1     1      805            815       -10     1006           1010        -4 MQ     
 4  2013     1     1      817            810         7     1005            948        17 B6     
 5  2013     1     1      848            851        -3     1155           1136        19 UA     
 6  2013     1     1      926            928        -2     1233           1220        13 UA     
 7  2013     1     1     1032           1035        -3     1305           1250        15 EV     
 8  2013     1     1     1044           1045        -1     1231           1212        19 EV     
 9  2013     1     1     1147           1155        -8     1335           1327         8 FL     
10  2013     1     1     1330           1321         9     1613           1536        37 EV     
  7. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
    I didn't really understand what the question wants; the code below is from GitHub.
> flights%>%
+ group_by(dest)%>%
+ filter(n_distinct(carrier)>=2)%>%
+ group_by(carrier)%>%
+ mutate(possible_transfers=n_distinct(dest))%>%
+ arrange(desc(possible_transfers))%>%
+ select(dest,year:day,possible_transfers)
Adding missing grouping variables: `carrier`
# A tibble: 325,397 x 6
# Groups:   carrier [16]
   carrier dest   year month   day possible_transfers
   <chr>   <chr> <int> <int> <int>              <int>
 1 EV      IAD    2013     1     1                 51
 2 EV      MSP    2013     1     1                 51
 3 EV      IAD    2013     1     1                 51
 4 EV      JAX    2013     1     1                 51
 5 EV      CHS    2013     1     1                 51
 6 EV      MEM    2013     1     1                 51
 7 EV      CLE    2013     1     1                 51
 8 EV      RDU    2013     1     1                 51
 9 EV      DCA    2013     1     1                 51
10 EV      AVL    2013     1     1                 51
# ... with 325,387 more rows

A question: when group_by() is called the second time, does it group within the existing dest groups or regroup the original tibble? Judging from the output (and from dest still being usable in later verbs) it regroups from scratch -- a new group_by() replaces any previous grouping unless .add = TRUE is given. I was briefly confused about what the pipeline was doing here.
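This can be checked directly with `group_vars()` on a toy tibble (behavior of dplyr >= 1.0, where the argument is spelled `.add`):

```r
library(dplyr)

# Hypothetical data with the same columns as the pipeline above
df <- tibble(dest = c("IAD", "MSP"), carrier = c("EV", "EV"))

g1 <- df %>% group_by(dest) %>% group_by(carrier) %>% group_vars()
g2 <- df %>% group_by(dest) %>% group_by(carrier, .add = TRUE) %>% group_vars()

g1   # "carrier"          -- the second group_by() replaced the first
g2   # "dest" "carrier"   -- .add = TRUE stacks the groupings
```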

  8. For each plane, count the number of flights before the first delay of greater than 1 hour.
    Plan: treat this like the threshold example above.
    To get a tibble of the rows before each plane's first delay of over an hour, the data has to be sorted.
    I couldn't solve this one either; the answer below is from StackOverflow.
    Approach 1: use first() to get each plane's first delay of more than an hour
> filghts_del<-flights%>%
+ group_by(tailnum)%>%
+ arrange(year,month,day,dep_time)%>%
+ filter(arr_delay>60)%>%
+ summarise(first_month=first(month),first_day=first(day),first_time=first(dep_time))
> filghts_del
# A tibble: 3,371 x 4
   tailnum first_month first_day first_time
   <chr>         <int>     <int>      <int>
 1 D942DN            2        11       1508
 2 N0EGMQ            1         1       1604
 3 N10156            1        13       2221
 4 N102UW            6        22       2141
 5 N104UW            2        13       1331
 6 N10575            1         2       1548
 7 N105UW            7        11       1411
 8 N107US            7        27       1624
 9 N108UW            8        13       1329
10 N109UW            9        13       1643
# ... with 3,361 more rows
> not_cancelled%>%
+ group_by(tailnum)%>%
+ summarise(n=sum(month<filghts_del$first_month)+sum(month==filghts_del$first_month&day<filghts_del$first_day)+sum(month==filghts_del$first_month&day==filghts_del$first_day&dep_time<filghts_del$first_time))
# A tibble: 4,037 x 2
   tailnum     n
   <chr>   <int>
 1 D942DN   1390
 2 N0EGMQ    913
 3 N10156   1150
 4 N102UW    671
 5 N103US    693
 6 N104UW    632
 7 N10575    930
 8 N105UW    476
 9 N107US    583
10 N108UW    578
# ... with 4,027 more rows
There were 50 or more warnings (use warnings() to see the first 50)

Very convoluted, and I'm not even sure it's correct -- the 50+ warnings come from comparing each group's month/day vectors against the full-length columns of filghts_del (length recycling), so the counts are not to be trusted.
Approach 2: use cumall()

> not_cancelled%>%
+ group_by(tailnum)%>%
+ arrange(year,month,day,dep_time)%>%
+ filter(cumall(arr_delay<60))%>%
+ tally()
# A tibble: 3,799 x 2
   tailnum     n
   <chr>   <int>
 1 N10156      9
 2 N102UW     25
 3 N103US     46
 4 N104UW      3
 5 N105UW     22
 6 N107US     20
 7 N108UW     36
 8 N109UW     28
 9 N110UW     15
10 N11107      7
# ... with 3,789 more rows
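Why this works: `dplyr::cumall()` is a cumulative "all so far" -- it stays TRUE up to (but not including) the first FALSE, so `filter(cumall(arr_delay < 60))` keeps each plane's flights strictly before its first hour-plus delay. A one-line demonstration:

```r
# TRUE while every element so far has been TRUE, FALSE from the first FALSE on
dplyr::cumall(c(TRUE, TRUE, FALSE, TRUE))   # TRUE TRUE FALSE FALSE
```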