R数据科学(三)dplyr

CHAPTER 3 Data Transformation with dplyr

library(nycflights13)
library(tidyverse)

查看冲突信息发现dplyr与基本包的函数冲突,如想用基本包的函数,可以这样写:stats::filter(),stats::lag()。
本次演示数据为nycflights13::flights,包括336,776 flights that departed from New York City in 2013,数据来自US Bureau of Transportation Statistics。
查看具体信息:

?flights
class(flights)
dim(flights)
head(flights)
View(flights)

可以发现该数据是一个Tibbles,属于数据框,但是在tidyverse里更方便。

dplyr包的核心函数:

  • filter 按行筛选
  • arrange 给行排序
  • select 按列筛选
  • mutate 根据原有列生成新列
  • summarize 摘要统计
    以上函数都可以与group_by()连用,用于分组执行。
    执行模板为:第一个参数为数据框,后面的参数为要做的事情,然后返回新数据框。

Filter Rows with filter()

filter(flights, month == 1, day == 1)
# 如果想保存结果,需要另外赋值给一个变量
jan1 <- filter(flights, month == 1, day == 1)
# 如果想保存同时打印,需要在外层用括号包裹起来。
(jan1 <- filter(flights, month == 1, day == 1))

Comparisons

比较运算符:>, >=, <, <=, != (not equal), and == (equal)。
几个易犯的错误:

  • 判断相等是“==” 而不是“=”;
  • 浮点型和整型数据不相等,用near判断。
sqrt(2) ^ 2 == 2
#> [1] FALSE
1/49 * 49 == 1
#> [1] FALSE
near(sqrt(2) ^ 2, 2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE

Logical Operators

或 | ,与 & ,非 !

# 找出11月或12月出发的航班
filter(flights, month == 11 | month == 12)

用管道符简化选择 x%in%y,指x是y中的一个

(nov_dec <- filter(flights, month %in% c(11,12)))

其他简化操作:!(x & y)等价于!x | !y,!(x | y)等价于!x & !y

# 这两个结果相同
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)

Missing Values 缺失值

缺失值用NAs表示 (“not availables”)
NULL表示空值,无

NA > 5
#> [1] NA
10 == NA
#> [1] NA
NA + 10
#> [1] NA

NA == NA
#> [1] NA

is.na()检查是否是缺失值
注意filter只选择TRUE的值,FALSE 和 NA 的会排除掉,
如果想保留缺失值

df <- tibble(x=c(1,NA,3))
filter(df,x > 1)
#> # A tibble: 1 × 1
#> x
#> <dbl>
#> 1 3
filter(df, is.na(x) | x > 1)
#> # A tibble: 2 × 1
#> x
#> <dbl>
#> 1 NA
#> 2 3

练习题:
1.a选择延误到达大于等于两个小时的航班

View(flights)
filter(flights, arr_delay >= 120)

b.The flights that flew to Houston were are those flights where the destination (dest) is either “IAH” or “HOU”.

filter(flights,dest=='IAH' | dest=='HOU')
# 或者用%in% 选择
filter(flights,dest %in% c('IAH','HOU'))

c. Were operated by United, American, or Delta

filter(flights, carrier %in% c("AA", "DL", "UA"))

d.Departed in summer (July, August, and September)

filter(flights, month >= 7, month <= 9)

e. Arrived more than two hours late, but didn’t leave late

filter(flights, dep_delay <= 0, arr_delay > 120)

f. Were delayed by at least an hour, but made up over 30 minutes in flight

filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)

g. Departed between midnight and 6 a.m. (inclusive)

filter(flights, dep_time <= 600 | dep_time == 2400)
  1. between()的作用
    between(x, left, right) 与 x >= left & x <= right 相同
filter(flights, between(month, 7, 9))
# month >= 7 & month <= 9
  1. missing dep_time 缺失值
filter(flights, is.na(dep_time))
  1. Why is NA ^ 0 not missing? Why is NA | TRUE not missing?
    Why is FALSE & NA not missing? Can you figure out the general
    rule? (NA * 0 is a tricky counterexample!)
NA ^ 0
#> [1] 1
NA | TRUE
#> [1] TRUE
NA & FALSE
#> [1] FALSE
NA | FALSE
#> [1] NA
NA & TRUE
#> [1] NA
NA * 0
#> [1] NA
Inf * 0
#> [1] NaN
-Inf * 0
#> [1] NaN

Arrange Rows with arrange() 对列排序

arrange(flights, year, month, day)
  • desc()函数逆序排序
arrange(flights, desc(arr_delay))
df <- tibble(x = c(5, 2, NA))
arrange(df, x) #缺失值排到最后
arrange(df, desc(x))

Select Columns with select() 按列选择

select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))
select(flights, ends_with("y"))
  • select支持正则表达式:
    starts_with("abc"),ends_with("xyz") ,contains("ijk"),matches("(.)\1"),num_range("x", 1:3)

rename() 重命名变量

rename(flights, tail_num = tailnum)
select(flights, time_hour, air_time, everything())

练习题:
Exercise 5.4.1.1 Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, 4, 5, 6, 9)
select(flights, one_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))

variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, one_of(variables))
select(flights, starts_with("dep_"), starts_with("arr_"))
select(flights, matches("^(dep|arr)_(time|delay)$"))
select(flights, ends_with("arr_time"), ends_with("dep_time"))
select(flights, contains("_time"), contains("arr_"))

Exercise 5.4.1.2 What happens if you include the name of a variable multiple times in a select() call?

# select忽略重复项,只选第一个。
select(flights, year, month, day, year, year)
# everything
select(flights, arr_delay, everything())

Exercise 5.4.1.3 What does the one_of() function do? Why might it be helpful in conjunction with this vector?

# one_of 可以将一个向量传入
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))

Exercise 5.4.1.4 Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))
# contains忽略了大小写,有个参数可以改变
select(flights, contains("TIME", ignore.case = FALSE))

Add New Variables with mutate() 添加新列

flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)

mutate(flights_sml,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60)

mutate(flights_sml,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
# transmute
transmute(flights,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)

该函数支持输入函数,注意输入和输出必须为向量。常用创建函数运算为:

  • 算数运算符:+ - * / ^
  • 模运算符:%/% 取整,%% 取余
  • 对数函数:log(),log2(),log10()
  • 偏移函数lead(),lag(),帮助向左或者向右延伸一个变量
(x <- 1:10)
lead(x)
lag(x)
  • 累加和积等 cumsum(), cumprod(), cummin(), cummax()
cumsum(x)
cummean(x)
  • 逻辑比较:<, <=, >, >=, !=
  • 排序:min_rank(),row_number(), dense_rank(), percent_rank(), cume_dist(),
    and ntile()

5.6 Grouped summaries with summarise() 折叠数据框,一般与group_by()连用

by_day <- group_by(flights, year, month, day) # 设置分组
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE)) #summarize设置函数

Combining Multiple Operations with the Pipe

head(flights)
# 1. 对dest进行分组
by_dest <- group_by(flights, dest)
# 2.计算距离,平均延误时间,飞机数量
delay <- summarize(by_dest,
count = n(),dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
# 3.对数据进行过滤
delay <- filter(delay, count > 20, dest != "HNL")

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)

# 用管道符 %>% 连接前后数据
delays <- flights %>%
group_by(dest) %>%
summarize(
count = n(),dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")

Missing Values 的处理 na.rm 参数

flights %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay)) # 如果有一个NA,那么结果就为NA,需要先去掉

flights %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay, na.rm = TRUE))

# 可以先把数据中的空值去掉
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay))

Counts

delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(delay=mean(arr_delay))

ggplot(delays,aes(delay))+geom_freqpoly(binwidth=10)


delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)

delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)

library(Lahman)
batting <- as_tibble(Lahman::Batting)

batters <- batting %>% group_by(playerID) %>% 
  summarize(ba=sum(H,na.rm=TRUE)/sum(AB,na.rm = TRUE),
            ab = sum(AB,na.rm = TRUE))

batters %>% filter(ab>100) %>% ggplot(aes(ab,ba)) + geom_point() +geom_smooth(se=F)

batters %>% arrange(desc(ba))
# 常用的分组函数:mean(x),median(x)
not_cancelled %>% group_by(year,month,day) %>% summarize(
  avg_delay1 = mean(arr_delay),
  avg_delay2 = mean(arr_delay[arr_delay > 0])
)
# sd(x), 四分位数IQR(x), 中位值偏差mad(x)
not_cancelled %>%
group_by(dest) %>%
summarize(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))

# min(x), quantile(x, 0.25), max(x)
# When do the first and last flights leave each day?
not_cancelled %>%
group_by(year, month, day) %>%
summarize(
first = min(dep_time),
last = max(dep_time)
)

not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
# 计算非空值sum(!is.na(x)),计算唯一值:n_distinct(x)
# Which destinations have the most carriers?
not_cancelled %>%
group_by(dest) %>%
summarize(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))

# 计数:n(),和count()
not_cancelled %>%
count(dest)

not_cancelled %>%
count(tailnum, wt = distance)

# How many flights left before 5am? (these usually
# indicate delayed flights from the previous day)
not_cancelled %>%
group_by(year, month, day) %>%
summarize(n_early = sum(dep_time < 500)) 
# 也可以对逻辑值进行计数:如sum(x>10)代表数多少个TRUE,mean(x)计算其比例。
not_cancelled %>%
group_by(year, month, day) %>%
summarize(hour_perc = mean(arr_delay > 60))

按多个变量分组

daily <- group_by(flights, year, month, day)
(per_day <- summarize(daily, flights = n()))
(per_month <- summarize(per_day, flights = sum(flights)))
(per_year <- summarize(per_month, flights = sum(flights)))
flights %>% group_by(day) %>% summarize(mean(dep_time,na.rm = T))

# ungrouping 取消分组
daily %>%
ungroup() %>% # no longer grouped by date
summarize(flights = n()) # all flights

3.6.7 练习

(2) 找出另外一种方法,这种方法要可以给出与 not_cancelled %>% count(dest) 和 not_
cancelled %>% count(tailnum, wt = distance) 同样的输出(不能使用 count())。

not_canceled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>% count(dest)
not_cancelled %>% count(tailnum, wt = distance)
# 可以先分组再求每组的长度。
not_cancelled %>%
  group_by(dest) %>%
  summarise(n = length(dest))

not_cancelled %>%
  group_by(dest) %>%
  summarise(n = n())

not_cancelled %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))

3.7 分组新变量(和筛选器)

# 找出每个分组中最差的成员
flights_sml %>% group_by(year,month,day) %>% filter(rank(desc(arr_delay))<10)

#找出大于某个阈值的所有分组:
popular_dests <- flights %>% group_by(dest) %>% filter(n()>365)
popular_dests

#对数据进行标准化以计算分组指标
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
head(flights)
filter(flights,origin %>% c('IAH'))

阅读推荐:
生信技能树公益视频合辑:学习顺序是linux,r,软件安装,geo,小技巧,ngs组学!
B站链接:https://m.bilibili.com/space/338686099
YouTube链接:https://m.youtube.com/channel/UC67sImqK7V8tSWHMG8azIVA/playlists
生信工程师入门最佳指南:https://mp.weixin.qq.com/s/vaX4ttaLIa19MefD86WfUA

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 212,294评论 6 493
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,493评论 3 385
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 157,790评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,595评论 1 284
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,718评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,906评论 1 290
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,053评论 3 410
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,797评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,250评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,570评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,711评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,388评论 4 332
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,018评论 3 316
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,796评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,023评论 1 266
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,461评论 2 360
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,595评论 2 350

推荐阅读更多精彩内容

  • rljs by sennchi Timeline of History Part One The Cognitiv...
    sennchi阅读 7,312评论 0 10
  • 少吃一顿饭 首先感谢Four建立这个社群还有各位战友的陪伴,没有你们,我不可能持续的学习英语这么长时间。并且接下来...
    胡晓军阅读 403评论 0 1