R for Data Science notes: data transformation 1

Workflow: Basics

4.Practice

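1.The "i"? (In the book's snippet the second line spells the name my_varıable with a dotless "ı", so it does not match my_variable.)

2.??? I did not understand this one.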

`ggplot(data=mpg)+geom_point(mapping=aes(x=displ,y=hwy),data=filter(mpg,cyl==8))`

`ggplot(data=diamonds)+geom_bar(mapping=aes(x=cut),data=filter(diamonds,carat>3))`

3.Press Alt + Shift + K. What happens? How can you get to the same place
using the menus?

It opens the keyboard shortcut reference.

Menu path: Tools -> Keyboard Shortcuts Help

Data: Transformation

1.Introduction

-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

This message tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: `stats::filter()` and `stats::lag()`.

2.Filter rows with filter()

2.4 Exercises

  1. Find all flights that

    1. Had an arrival delay of two or more hours

    2. Flew to Houston (IAH or HOU)

    3. Were operated by United, American, or Delta

    4. Departed in summer (July, August, and September)

    5. Arrived more than two hours late, but didn’t leave late

    6. Were delayed by at least an hour, but made up over 30 minutes in flight

    7. Departed between midnight and 6am (inclusive)

      filter(flights,arr_delay>=120)
      filter(flights,dest %in% c("IAH","HOU"))
      filter(flights,dest=="IAH"|dest=="HOU")#same
      filter(flights,carrier %in% c("UA","AA","DL"))
      filter(flights,month %in% 7:9)#month is an integer column, so compare it with numbers rather than strings
      filter(flights,arr_delay>120&dep_delay<=0)
      filter(flights,dep_delay>=60&dep_delay-arr_delay>30)#time made up in flight = dep_delay - arr_delay
      midnight1<-filter(flights,dep_time<=600|dep_time==2400)#uses the actual dep_time; midnight is recorded as 2400
      

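Is something wrong with the dataset itself? Why don't hour and minute match up with the time in time_hour?
Got it: hour and minute are the scheduled departure time split into parts, which is why they look so regular.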

  1. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

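    between(x, left, right)
    #This is a shortcut for x >= left & x <= right
    #For non-missing whole numbers it gives the same result as
    x %in% c(left:right)
    filter(flights,between(month,7,9))#the summer filter from the previous exercise, simplified
    

    %in% applies more broadly: the vector on its right side does not have to be numeric.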
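The difference between the two forms only shows up with non-integer or missing values. The following is a small sketch on a made-up vector (the values are invented for illustration, not taken from flights):

```r
library(dplyr)

# Made-up values: one non-integer and one missing value on purpose.
x <- c(1, 2.5, 7, 9, NA)

# between() is inclusive on both ends and works for any numeric value;
# a missing input stays missing.
in_range <- between(x, 2, 7)

# %in% only matches the exact values 2, 3, ..., 7, so 2.5 is not matched,
# and a missing input gives FALSE instead of NA.
in_set <- x %in% 2:7

print(in_range)
print(in_set)
```

Inside filter() both FALSE and NA drop the row, so the two forms select the same flights for integer columns such as month.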

  2. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

    1. dep_delay, arr_time, arr_delay, and air_time are missing as well.
      These rows probably represent cancelled flights (they never took off).
  3. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

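Is it about the order of operations? Not quite. The general rule: the result is not missing whenever it would be the same no matter what value the NA stands for. Any number to the power 0 is 1; | returns TRUE as soon as either side is TRUE; & returns FALSE as soon as either side is FALSE. NA * 0 is the counterexample because the unknown value could be Inf or NaN, and Inf * 0 is NaN rather than 0, so the result stays NA.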
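The four cases from the exercise can be checked directly in base R. This is only a quick sketch; the variable names are made up for the example:

```r
# Each of these is known regardless of what value the NA stands for.
power_zero <- NA^0        # anything to the power 0 is 1
or_true    <- NA | TRUE   # TRUE on either side decides the result
and_false  <- FALSE & NA  # FALSE on either side decides the result

# NA * 0 stays missing: if the hidden value were Inf, the product
# would be NaN rather than 0, so R cannot commit to an answer.
times_zero <- NA * 0
inf_zero   <- Inf * 0

print(c(power_zero, times_zero, inf_zero))
print(c(or_true, and_false))
```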

3.Arrange rows with arrange()

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

arrange(flights, year, month, day)#sort by year first, then by month within the same year, then by day within the same month
arrange(flights, desc(dep_delay))#sort by dep_delay in descending order

Use desc() to re-order by a column in descending order

Missing values are always sorted at the end:
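A three-row tibble is enough to see this. The sketch below uses a toy data frame rather than flights; the names asc and dsc are just for the example:

```r
library(dplyr)

# Toy data with one missing value.
df <- tibble(x = c(5, 2, NA))

# Ascending order: the NA row comes last.
asc <- arrange(df, x)

# Descending order: the NA row still comes last.
dsc <- arrange(df, desc(x))

print(asc)
print(dsc)
```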

Exercises:

  1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

    arrange(df,desc(is.na(x)))#is.na() returns TRUE (1) or FALSE (0); missing values give 1, so sorting in descending order puts the NA rows first
    
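  2. Sort flights to find the most delayed flights. Find the flights that left earliest.

    arrange(flights,desc(dep_delay),desc(arr_delay))#which column defines "most delayed"? The top flight here happens to be the largest on both
    arrange(flights,dep_time)#earliest clock time; not sure, arrange(flights,dep_delay) would instead give the flights that left furthest ahead of schedule
    
  3. Sort flights to find the fastest flights.

    arrange(flights,desc(distance/air_time))
    
  4. Which flights travelled the longest? Which travelled the shortest?

    arrange(flights,is.na(dep_time),desc(distance))
    arrange(flights,is.na(dep_time),distance)#without is.na(dep_time), flights that never actually took off would show up among the results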

4.Select columns with select()

select(flights,year,month,day)
select(flights,year:day)
select(flights,-(year:day))
select(flights,starts_with("dep"))
select(flights,ends_with("delay"))
select(flights,matches("(.)\\1"))
rename(flights,tail_num=tailnum)#how to undo this? rename() returns a new tibble and leaves flights unchanged unless the result is assigned back
select(flights,time_hour,air_time,everything())#moves the selected columns to the front and keeps all the other columns
#Exercises
select(flights,starts_with("dep"),starts_with("arr"))
select(flights,dep_time,dep_delay,arr_time,arr_delay)
select(flights,year,year)#the column appears only once; the duplicate is ignored
vars<-c("year","month","day","dep_delay","arr_delay")
select(flights,one_of(vars))#all five columns come back, so this is equivalent to the next call
#one_of(): select variables in character vector.
select(flights,year,month,day,dep_delay,arr_delay)
#contains(match, ignore.case = TRUE, vars = peek_vars())
select(flights,contains("TIME",ignore.case=FALSE))#overrides the default ignore.case = TRUE

Exercises

  1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
    See the code above.

  2. What happens if you include the name of a variable multiple times in a select() call?

  3. What does the one_of() function do? Why might it be helpful in conjunction with this vector?

    vars <- c("year", "month", "day", "dep_delay", "arr_delay")
    
  4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

    select(flights, contains("TIME"))
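The effect of ignore.case is easier to see on a table that mixes upper- and lower-case names. Here is a minimal sketch on a toy tibble; the column DEP_TIME_UTC is invented for illustration and does not exist in flights:

```r
library(dplyr)

# Toy tibble; DEP_TIME_UTC is a made-up column name.
toy <- tibble(dep_time = 517L, DEP_TIME_UTC = 917L, carrier = "UA")

# Default: matching ignores case, so both time columns are selected.
default_match <- names(select(toy, contains("TIME")))

# Case-sensitive: only the upper-case column matches.
strict_match <- names(select(toy, contains("TIME", ignore.case = FALSE)))

print(default_match)
print(strict_match)
```

On flights itself every column name is lower case, so the case-sensitive call selects no columns at all.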
    

5.Add new variables with mutate()

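flights_sml<-select(flights,year:day,ends_with("delay"),distance,air_time)#narrower table, as defined in the book
mutate(flights_sml,gain=dep_delay-arr_delay,hours=air_time/60,gain_per_hour=gain/hours)
transmute(flights_sml,gain=dep_delay-arr_delay,hours=air_time/60,gain_per_hour=gain/hours)#the output keeps only the variables mentioned explicitly and the newly created ones
transmute(flights,dep_time,hour=dep_time%/%100,minute=dep_time%%100)#%/% integer quotient, %% remainder
#What do lead() and lag() do? Not clear yet; see the Offsets item below.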

1.Useful creation functions

These functions operate on a vector and return a vector of the same length.

1.Arithmetic operators: +, -, *, /, ^.

2.Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y).
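The identity can be checked on a few clock-style values. A small sketch, with made-up HHMM numbers of the kind found in dep_time:

```r
# Made-up HHMM values, split into hours and minutes with base 100.
x <- c(517L, 1004L, 2359L)
y <- 100L

hours   <- x %/% y   # integer division: 5, 10, 23
minutes <- x %% y    # remainder: 17, 4, 59

# The two pieces always rebuild the original value.
rebuilt <- y * hours + minutes
print(rebuilt == x)
```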

3.Logs: log(), log2(), log10().

4.Offsets: lead() and lag() allow you to refer to leading or lagging values.

Find the "next" or "previous" values in a vector. Useful for comparing values ahead of or behind the current values.

x<-runif(5)
> cbind(ahead=lead(x),x,behind=lag(x))
         ahead          x     behind
[1,] 0.3001377 0.01974997         NA
[2,] 0.2235623 0.30013771 0.01974997
[3,] 0.2873173 0.22356229 0.30013771
[4,] 0.2258159 0.28731729 0.22356229
[5,]        NA 0.22581594 0.28731729
>#Roughly: at each position lead() gives the next value in the vector and lag() gives the previous one
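With fixed numbers instead of random ones the shift is easier to follow. A short sketch, assuming nothing beyond dplyr; the vector is made up:

```r
library(dplyr)

# Made-up readings taken at consecutive time points.
x <- c(10, 20, 35, 40)

previous_value <- dplyr::lag(x)    # NA 10 20 35
next_value     <- dplyr::lead(x)   # 20 35 40 NA

# Running difference: how much each value changed from the one before.
change <- x - dplyr::lag(x)        # NA 10 15  5

print(cbind(x, previous_value, next_value, change))
```

Writing dplyr::lag() explicitly avoids picking up stats::lag() by accident if dplyr is not attached.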

5.Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means.

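> library(RcppRoll)#roll_mean() and roll_sum() are not in dplyr; presumably they come from RcppRoll
> x<-c(1:10)
> roll_mean(x)#the default window is n = 1, so the input comes back unchanged
 [1]  1  2  3  4  5  6  7  8  9 10
> roll_sum(x)
 [1]  1  2  3  4  5  6  7  8  9 10
> cumsum(x)
 [1]  1  3  6 10 15 21 28 36 45 55
> cummean(x)
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
#Keep rolling and cumulative apart: roll_*() aggregates over a moving window of n values, cum*() aggregates everything up to the current position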
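To make the contrast visible, the window has to be wider than one value. The sketch below computes a rolling mean over three values with plain base R (a hand-written loop, not the RcppRoll implementation) next to the cumulative versions:

```r
library(dplyr)

x <- 1:10

# Cumulative: everything from the first element up to the current one.
running_total <- cumsum(x)
running_mean  <- cummean(x)

# Rolling: only the last `window` values. Hand-rolled with sapply(),
# so the result has length(x) - window + 1 elements.
window <- 3
rolling_mean <- sapply(window:length(x), function(i) mean(x[(i - window + 1):i]))

print(running_total)
print(running_mean)
print(rolling_mean)
```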

6.Logical comparisons, <, <=, >, >=, !=, and ==

7.Ranking:

> y<-c(1,2,2,NA,4,5)
> min_rank(y)
[1]  1  2  2 NA  4  5#returns the rank of the value at each position (1st, 2nd, ...)
> min_rank(desc(y))
[1]  5  3  3 NA  2  1
> z<-c(5,4,NA,2,2,1)
> min_rank(z)
[1]  5  4 NA  2  2  1#desc() does not simply reverse the vector; it negates the values, so the underlying ordering is flipped
> desc(y)
[1] -1 -2 -2 NA -4 -5
> min_rank(desc(y))
[1]  5  3  3 NA  2  1

If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile().

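> y<-c(1,1,3,NA,5,5,7)
> min_rank(y)
[1]  1  1  3 NA  4  4  6#equal values share a rank and the following rank is skipped (1,1,3)
> min_rank(desc(y))
[1]  5  5  4 NA  2  2  1
> row_number(y)
[1]  1  2  3 NA  4  5  6#equal values get different ranks; no rank is shared
> dense_rank(y)
[1]  1  1  2 NA  3  3  4#"dense" ranking: equal values share a rank and the next rank follows immediately (1,1,2)
> percent_rank(y)
[1] 0.0 0.0 0.4  NA 0.6 0.6 1.0#same ranking rule as min_rank(), rescaled to [0, 1]: (min_rank - 1) / (n - 1)
> cume_dist(y)
[1] 0.3333333 0.3333333 0.5000000        NA 0.8333333 0.8333333 1.0000000
#proportion of non-missing values less than or equal to the current value (2/6 for the two 1s), not a rescaled dense_rank()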
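ntile() is the one variant not shown in the transcript. A brief sketch on the same vector, splitting the six non-missing values into three equally sized buckets:

```r
library(dplyr)

y <- c(1, 1, 3, NA, 5, 5, 7)

# ntile() orders the values like row_number() and then cuts that order
# into n buckets, so tied values can land in different buckets.
buckets <- ntile(y, 3)

print(buckets)
```

Here the two 5s end up in buckets 2 and 3, which is worth keeping in mind when ties matter.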

2.Exercises

  1. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

    transmute(flights,dep_time,deptime=dep_time%/%100*60+dep_time%%100,arr_time,arrtime=arr_time%/%100*60+arr_time%%100)
    # A tibble: 336,776 x 4
       dep_time deptime arr_time arrtime
          <int>   <dbl>    <int>   <dbl>
     1      517     317      830     510
     2      533     333      850     530
     3      542     342      923     563
     4      544     344     1004     604
     5      554     354      812     492
     6      554     354      740     460
     7      555     355      913     553
     8      557     357      709     429
     9      557     357      838     518
    10      558     358      753     473
    # ... with 336,766 more rows
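The same conversion can be wrapped in a small function and applied to sched_dep_time as well, which is the column the exercise asks about. This is only a sketch: hhmm_to_minutes is a hypothetical helper name, the toy rows are invented, and the extra %% 1440 is an assumption about how to treat a midnight value stored as 2400 (it maps it to 0 instead of 1440).

```r
library(dplyr)

# Hypothetical helper: HHMM clock value -> minutes since midnight.
# The final %% 1440 folds 2400 (midnight) back to 0.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100 * 60 + hhmm %% 100) %% 1440
}

# Invented rows in the same HHMM format as flights.
toy <- tibble(
  dep_time       = c(517L, 1004L, 2400L, NA),
  sched_dep_time = c(515L, 1000L, 2359L, 600L)
)

converted <- mutate(
  toy,
  dep_minutes       = hhmm_to_minutes(dep_time),
  sched_dep_minutes = hhmm_to_minutes(sched_dep_time)
)

print(converted)
```

Missing departure times stay NA, so cancelled flights pass through unchanged.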
    
  2. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

    >transmute(flights,dep_time,deptime=dep_time%/%100*60+dep_time%%100,arr_time,arrtime=arr_time%/%100*60+arr_time%%100,air_time,airtime=arrtime-deptime)
    # A tibble: 336,776 x 6
       dep_time deptime arr_time arrtime air_time airtime
          <int>   <dbl>    <int>   <dbl>    <dbl>   <dbl>
     1      517     317      830     510      227     193
     2      533     333      850     530      227     197
     3      542     342      923     563      160     221
     4      544     344     1004     604      183     260
     5      554     354      812     492      116     138
     6      554     354      740     460      150     106
     7      555     355      913     553      158     198
     8      557     357      709     429       53      72
     9      557     357      838     518      140     161
    10      558     358      753     473      138     115
    # ... with 336,766 more rows
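    #They still do not match. dep_time and arr_time are local clock times (the destination can be in another time zone), overnight flights wrap past midnight, and air_time covers only the time in the air, not taxiing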
    
  3. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

    >transmute(flights,dep_time,deptime=dep_time%/%100*60+dep_time%%100,sched_dep_time,schedtime=sched_dep_time%/%100*60+sched_dep_time%%100,dep_delay,pseudo=dep_time-sched_dep_time,delay=deptime-schedtime)#subtracting the raw HHMM values directly (pseudo) is wrong
    # A tibble: 336,776 x 7
       dep_time deptime sched_dep_time schedtime dep_delay pseudo delay
          <int>   <dbl>          <int>     <dbl>     <dbl>  <int> <dbl>
     1      517     317            515       315         2      2     2
     2      533     333            529       329         4      4     4
     3      542     342            540       340         2      2     2
     4      544     344            545       345        -1     -1    -1
     5      554     354            600       360        -6    -46    -6
     6      554     354            558       358        -4     -4    -4
     7      555     355            600       360        -5    -45    -5
     8      557     357            600       360        -3    -43    -3
     9      557     357            600       360        -3    -43    -3
    10      558     358            600       360        -2    -42    -2
    # ... with 336,766 more rows
    
  4. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

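    filter(flights,min_rank(desc(dep_delay))<=10)#keeps the 10 largest dep_delay values; flights tied at the cutoff are all kept
    arrange(filter(flights,min_rank(desc(arr_delay))<=10),desc(arr_delay))#same idea for arr_delay, sorted from most delayed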
    
  5. What does 1:3 + 1:10 return? Why?

    > 1:3+1:10
     [1]  2  4  6  5  7  9  8 10 12 11
    Warning message:
    In 1:3 + 1:10 :
      longer object length is not a multiple of shorter object length
    #=(1,2,3,1,2,3,1,2,3,1)+(1:10): the shorter vector is recycled to length 10, and R warns because 10 is not a multiple of 3
    
  6. What trigonometric functions does R provide?

    cos(x) sin(x) tan(x)

    acos(x) asin(x) atan(x)
    atan2(y, x)

    cospi(x) sinpi(x) tanpi(x) (compute cos(pi*x), sin(pi*x), and tan(pi*x))
