这两节的习题好多问题啊
6.grouped summaries with summarise()
6.7 Exercises
-
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
- A flight is always 10 minutes late.
- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
- 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
方法一
> not_cancelled%>%
+ group_by(flight)%>%
+ filter((mean(arr_delay==-15)==0.5)&(mean(arr_delay==15)==0.5))
>not_cancelled%>%
+ group_by(flight)%>%
+ filter((mean(arr_delay==-15)==0.5)&(mean(arr_delay==15)==0.5))
> not_cancelled%>%
+ group_by(flight)%>%
+ filter(mean(arr_delay==10)==1)
> not_cancelled%>%
+ group_by(flight)%>%
+ filter((mean(arr_delay==-30)==0.5)&(mean(arr_delay==30)==0.5))
> not_cancelled%>%
+ group_by(flight)%>%
+ filter((mean(arr_delay==0)==0.99)&(mean(arr_delay==120)==0.01))
方法二
> percent<-not_cancelled%>%
+ group_by(flight)%>%
+ summarise(n=n(),percent_15ahead=sum(arr_delay==-15)/n,
+ percent_15late=sum(arr_delay==15)/n,
+ percent_10late=sum(arr_delay==10)/n,
+ percent_30ahead=sum(arr_delay==-30)/n,
+ percent_30late=sum(arr_delay==30)/n,
+ percent_ontime=sum(arr_delay==0)/n,
+ percent_2hlate=sum(arr_delay==120)/n)
> percent
# A tibble: 3,835 x 9
flight n percent_15ahead percent_15late percent_10late percent_30ahead percent_30late
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 697 0.0215 0.0100 0.00574 0.00574 0.00574
2 2 51 0.0392 0.0196 0 0 0
3 3 628 0.00955 0.00637 0.0159 0.0159 0.00318
4 4 391 0.0358 0.0102 0.00767 0.0128 0.00256
5 5 324 0.0123 0.00617 0.00926 0.0216 0
6 6 206 0.0291 0.00485 0.00485 0.0291 0
7 7 236 0.0169 0.00424 0 0.00847 0.00424
8 8 234 0.0556 0.00855 0.0214 0 0
9 9 152 0.0132 0.0132 0.0197 0 0
10 10 61 0.0164 0.0164 0.0328 0 0
# ... with 3,825 more rows, and 2 more variables: percent_ontime <dbl>, percent_2hlate <dbl>
> percent%>%
+ filter(percent_15ahead==0.5&percent_15late==0.5)
# A tibble: 0 x 9
# ... with 9 variables: flight <int>, n <int>, percent_15ahead <dbl>, percent_15late <dbl>,
# percent_10late <dbl>, percent_30ahead <dbl>, percent_30late <dbl>, percent_ontime <dbl>,
# percent_2hlate <dbl>
> percent%>%
+ filter(percent_10late==1)
# A tibble: 5 x 9
flight n percent_15ahead percent_15late percent_10late percent_30ahead percent_30late
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2254 1 0 0 1 0 0
2 3656 1 0 0 1 0 0
3 3785 1 0 0 1 0 0
4 3880 1 0 0 1 0 0
5 5854 1 0 0 1 0 0
# ... with 2 more variables: percent_ontime <dbl>, percent_2hlate <dbl>
> percent%>%
+ filter(percent_30ahead==0.5&percent_30late==0.5)
# A tibble: 0 x 9
# ... with 9 variables: flight <int>, n <int>, percent_15ahead <dbl>, percent_15late <dbl>,
# percent_10late <dbl>, percent_30ahead <dbl>, percent_30late <dbl>, percent_ontime <dbl>,
# percent_2hlate <dbl>
> percent%>%
+ filter(percent_ontime==0.99&percent_2hlate==0.01)
# A tibble: 0 x 9
# ... with 9 variables: flight <int>, n <int>, percent_15ahead <dbl>, percent_15late <dbl>,
# percent_10late <dbl>, percent_30ahead <dbl>, percent_30late <dbl>, percent_ontime <dbl>,
# percent_2hlate <dbl>
还有个问题,如果这里的比例是算上cancel的怎么办?
Which is more important: arrival delay or departure delay?
我觉得应该是arr_delay吧,因为的确有起飞延误然后返航或者备降的航班?(dep_delay=na)
-
Come up with another approach that will give you the same output as
not_cancelled %>% count(dest)
andnot_cancelled %>% count(tailnum, wt = distance)
(without usingcount()
).> not_cancelled%>% + group_by(dest)%>% + summarise(n=n()) > not_cancelled%>% + group_by(tailnum)%>% + summarise(n=sum(distance))#被自己蠢哭了.. #wt=distance就是count=sum(n(x)*distance(x)),那直接分组后sum(distance)就好了啊
-
Our definition of cancelled flights (
is.na(dep_delay) | is.na(arr_delay)
) is slightly suboptimal. Why? Which is the most important column?>summarise(flights,exist_dep_delay=sum(!is.na(dep_delay)),exist_arr_delay=sum(!is.na(arr_delay)),not_cancelled=sum(!(is.na(dep_delay)|is.na(arr_delay)))) # A tibble: 1 x 3 exist_dep_delay exist_arr_delay not_cancelled <int> <int> <int> 1 328521 327346 327346 > #也就是说,存在只有dep_delay的数据而无arr_delay的数据,那意思是返航了?所以应该扩大点范围? > #只要存在dep_delay就应该不算取消吗? >#我明白了,不可能有到达了但没有起飞的航班,so只需关注dep_delay
-
Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
> flights%>%#题目里的pattern啥意思? +group_by(year,month,day)%>% +summarise(cancel=sum(is.na(dep_delay)|is.na(arr_delay)),cancel_prop=mean(is.na(dep_delay)|is.na(arr_delay)),avg_delay=mean(arr_delay,na.rm=TRUE))%>% +ggplot(mapping=aes(x=avg_delay,y=cancel_prop))+geom_point()+geom_smooth(se=FALSE) `geom_smooth()` using method = 'loess' and formula 'y ~ x'
不太明白题目里pattern啥意思。看图的话两个应该是正相关的
-
Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about
flights %>% group_by(carrier, dest) %>% summarise(n())
)> not_cancelled%>% + group_by(carrier)%>% + summarise(mean_delay=mean(arr_delay[arr_delay>0]))%>% + ggplot(mapping=aes(x=carrier,y=mean_delay))+geom_bar(stat="identity")
怎么去定义worst delay?如果是指mean_delay最高的话,那应该是OO。
题目的后半部分不会做.我觉得没办法,因为有些地方的机场只有一个或两个carrier,或者说有的carrier只有一个航线。无法区分是bad carrier还是bad airport
-
What does the
sort
argument tocount()
do. When might you use it?sort if TRUE will sort output in descending order of n
> not_cancelled%>%
+ count(dest,sort=TRUE)#当想看哪个变量的数量最多时用很方便
# A tibble: 104 x 2
dest n
<chr> <int>
1 ATL 16837
2 ORD 16566
3 LAX 16026
4 BOS 15022
5 MCO 13967
6 CLT 13674
7 SFO 13173
8 FLL 11897
9 MIA 11593
10 DCA 9111
# ... with 94 more rows
等价于
not_cancelled%>%
group_by(dest)%>%
summarise(n=n())%>%
arrange(desc(n))
7.Grouped mutates (and filters)
Find the worst members of each group:
flights_sml %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
#注意这里找的是每天arr_delay最高的九个航班,不是arr_delay最高的日子
#每天9个所以有3300多行哦
Find all groups bigger than a threshold:
> popular_dest<-flights%>%
+ group_by(dest)%>%
+ filter(n()>365)
> popular_dest
Standardise to compute per group metrics:
> popular_dest%>%#这个数据集已经按dest分组了
+ filter(arr_delay>0)%>%#这里的filter是对每行数据
+ mutate(prop_delay=arr_delay/sum(arr_delay))%>%#sum_delay是整组的sum(每个dest在整个数据集中的总delay)
#得到的prop是每一个单独的航程,占该dest全年所有航班总delay的比例
+ select(year:day,arr_delay,prop_delay)
Adding missing grouping variables: `dest`
Exercise
Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
?我觉得没啥变化,只不过这些函数都是在group内进行啊Which plane (tailnum) has the worst on-time record?
#还是那个问题,怎么定义worst?是发生delay的比例吗还是平均delay时间?
#如果先group再做filter/select,都是针对组内进行操作
#如果要对整组的情况做一个总体的描述,得用summarise
#这里如果先group_by(tailnum),后面得用summarise吧?
> tailnum<-not_cancelled%>%
+ group_by(tailnum)%>%
+ summarise(prop_delay=sum(arr_delay>0)/n(),mean_delay=sum(arr_delay)/n())
> arrange(tailnum,desc(prop_delay))
# A tibble: 4,037 x 3
tailnum prop_delay mean_delay
<chr> <dbl> <dbl>
1 N121DE 1 15
2 N136DL 1 146
3 N143DA 1 24
4 N17627 1 15.5
5 N240AT 1 53.6
6 N26906 1 19
7 N295AT 1 30.8
8 N302AS 1 29
9 N303AS 1 46
10 N32626 1 5
# ... with 4,027 more rows
> arrange(tailnum,desc(mean_delay))
# A tibble: 4,037 x 3
tailnum prop_delay mean_delay
<chr> <dbl> <dbl>
1 N844MH 1 320
2 N911DA 1 294
3 N922EV 1 276
4 N587NW 1 264
5 N851NW 1 219
6 N928DN 1 201
7 N7715E 1 188
8 N654UA 1 185
9 N665MQ 0.667 175.
10 N427SW 1 157
# ... with 4,027 more rows
#可是我用filter(rank(desc(prop_delay))<10)就怎么都返回空表诶
怎样精简代码呢?
What time of day should you fly if you want to avoid delays as much as possible?
For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
> not_cancelled%>%
+ group_by(dest)%>%
+ summarise(total_by_dest=sum(arr_delay))
我不明白这里的flight,是指一个航班号(每天都飞,总共有很多行数据,是一个组)还是每一个单独的航程(就是单行数据)
先按航班号理解好了
思路:先按dest分组,用mutate计算每一个单独航程的比例,再按flight分组,sum
> by_flight<-not_cancelled%>%
+ group_by(flight)
> by_dest%>%
+ mutate(prop_each=arr_delay/sum(arr_delay))%>%
+ group_by(flight)%>%
+ mutate(prop_flight=sum(prop_each))%>%
+ select(year:day,flight,prop_each,prop_flight)
# A tibble: 327,346 x 6
# Groups: flight [3,835]
year month day flight prop_each prop_flight
<int> <int> <int> <int> <dbl> <dbl>
1 2013 1 1 1545 0.000366 0.0116
2 2013 1 1 1714 0.000666 0.0647
3 2013 1 1 1141 0.00952 -0.119
4 2013 1 1 725 -0.00246 0.0240
5 2013 1 1 461 -0.000131 -0.0243
6 2013 1 1 1696 0.000123 0.00471
7 2013 1 1 507 0.000198 0.0508
8 2013 1 1 5708 -0.000188 0.00154
9 2013 1 1 79 -0.000105 -0.00491
10 2013 1 1 301 0.0000822 -0.177
# ... with 327,336 more rows
- Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.
思路:按origin分组,将数据按预计起飞时间(sched)排序,x=lag(dep_delay),y=dep_delay作图,得到前一个航班的delay时间与当前航班delay时间的相关关系
> flights%>%
+ group_by(origin)%>%
+ arrange(year,month,day,sched_dep_time)%>%
+ mutate(prev_delay=lag(dep_delay))%>%
+ ggplot(mapping=aes(x=prev_delay,y=dep_delay))+geom_point()
所以这图是说,相关性不大吗orz
- Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
思路:定义suspciously fast为air_time<=30min
按dest分组,去掉缺失值,去掉suspciously fast航班,找出最短飞行时间再计算
> not_cancelled%>%
+ filter(air_time<=30)
> not_cancelled%>%
+ filter(air_time>30)%>%
+ group_by(dest)%>%
+ mutate(rel_airtime=air_time/min(air_time))%>%
+ filter(rank(desc(rel_airtime))<10)
# A tibble: 940 x 20
# Groups: dest [104]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 646 645 1 1023 1030 -7 UA
2 2013 1 1 656 705 -9 1007 940 27 MQ
3 2013 1 1 805 815 -10 1006 1010 -4 MQ
4 2013 1 1 817 810 7 1005 948 17 B6
5 2013 1 1 848 851 -3 1155 1136 19 UA
6 2013 1 1 926 928 -2 1233 1220 13 UA
7 2013 1 1 1032 1035 -3 1305 1250 15 EV
8 2013 1 1 1044 1045 -1 1231 1212 19 EV
9 2013 1 1 1147 1155 -8 1335 1327 8 FL
10 2013 1 1 1330 1321 9 1613 1536 37 EV
- Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
其实没懂题目啥意思,以下代码是在github看见的
> flights%>%
+ group_by(dest)%>%
+ filter(n_distinct(carrier)>=2)%>%
+ group_by(carrier)%>%
+ mutate(possible_transfers=n_distinct(dest))%>%
+ arrange(desc(possible_transfers))%>%
+ select(dest,year:day,possible_transfers)
Adding missing grouping variables: `carrier`
# A tibble: 325,397 x 6
# Groups: carrier [16]
carrier dest year month day possible_transfers
<chr> <chr> <int> <int> <int> <int>
1 EV IAD 2013 1 1 51
2 EV MSP 2013 1 1 51
3 EV IAD 2013 1 1 51
4 EV JAX 2013 1 1 51
5 EV CHS 2013 1 1 51
6 EV MEM 2013 1 1 51
7 EV CLE 2013 1 1 51
8 EV RDU 2013 1 1 51
9 EV DCA 2013 1 1 51
10 EV AVL 2013 1 1 51
# ... with 325,387 more rows
我有一个疑问,这里第二次group_by的时候是在先按dest分组的情况下分还是直接在原tibble重新分组?按最后输出结果以及可以用dest当运算参数看应该是重新分组的?我有点不懂这个管道操作是什么个意思了
- For each plane, count the number of flights before the first delay of greater than 1 hour.
思路:应该是按示例中的threshold类型处理。
想要得到一个第一次delay超过1小时之前的数据的tibble,得排序
哭泣了这题也不会做是在StackOverflow上找的答案
方法一:用first()函数得到第一个delay超过一小时的数据
> filghts_del<-flights%>%
+ group_by(tailnum)%>%
+ arrange(year,month,day,dep_time)%>%
+ filter(arr_delay>60)%>%
+ summarise(first_month=first(month),first_day=first(day),first_time=first(dep_time))
> filghts_del
# A tibble: 3,371 x 4
tailnum first_month first_day first_time
<chr> <int> <int> <int>
1 D942DN 2 11 1508
2 N0EGMQ 1 1 1604
3 N10156 1 13 2221
4 N102UW 6 22 2141
5 N104UW 2 13 1331
6 N10575 1 2 1548
7 N105UW 7 11 1411
8 N107US 7 27 1624
9 N108UW 8 13 1329
10 N109UW 9 13 1643
# ... with 3,361 more rows
> not_cancelled%>%
+ group_by(tailnum)%>%
+ summarise(n=sum(month<filghts_del$first_month)+sum(month==filghts_del$first_month&day<filghts_del$first_day)+sum(month==filghts_del$first_month&day==filghts_del$first_day&dep_time<filghts_del$first_time))
# A tibble: 4,037 x 2
tailnum n
<chr> <int>
1 D942DN 1390
2 N0EGMQ 913
3 N10156 1150
4 N102UW 671
5 N103US 693
6 N104UW 632
7 N10575 930
8 N105UW 476
9 N107US 583
10 N108UW 578
# ... with 4,027 more rows
There were 50 or more warnings (use warnings() to see the first 50)
特别复杂还不知道对不对,哭泣了
方法二:用cumall
> not_cancelled%>%
+ group_by(tailnum)%>%
+ arrange(year,month,day,dep_time)%>%
+ filter(cumall(arr_delay<60))%>%
+ tally()
# A tibble: 3,799 x 2
tailnum n
<chr> <int>
1 N10156 9
2 N102UW 25
3 N103US 46
4 N104UW 3
5 N105UW 22
6 N107US 20
7 N108UW 36
8 N109UW 28
9 N110UW 15
10 N11107 7
# ... with 3,789 more rows