[R] The forcats package: working with factors, R for Data Science 9

Notes on Chapter 15, Factors, of R for Data Science
Reference: R for Data Science

Creating factors

x1 <- c("Dec", "Apr", "Jan", "Mar")

Recording months in a plain character vector has two drawbacks:

  1. Nothing guards against typos
x2 <- c("Dec", "Apr", "Jam", "Mar")
  2. Sorting is alphabetical, not chronological
sort(x1)
#> [1] "Apr" "Dec" "Jan" "Mar"

Strategy: create a factor
First, define the levels

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

Then create the factor

y1 <- factor(x1, levels = month_levels)

sort(y1)
#> [1] Jan Mar Apr Dec
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • Values in the vector that are not among the levels are silently converted to NA
    readr::parse_factor() raises a warning instead
x2 <- c("Dec", "Apr", "Jam", "Mar")

y2 <- factor(x2, levels = month_levels)
y2
#> [1] Dec  Apr  <NA> Mar 
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

y2 <- readr::parse_factor(x2, levels = month_levels)
#> Warning: 1 parsing failure.
#> row col           expected actual
#>   3  -- value in level set    Jam
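
Typos can also be caught before the factor is built by comparing the raw values against the level set. The sketch below uses base R only; as an assumption for self-containment it takes the levels from month.abb, R's built-in vector of abbreviated English month names, rather than month_levels.

```r
# Base R only; month.abb is the built-in c("Jan", ..., "Dec")
month_levels <- month.abb
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Values that are not valid levels
bad_values <- setdiff(x2, month_levels)
bad_values
#> [1] "Jam"

# Positions that factor() would silently turn into NA
y2 <- factor(x2, levels = month_levels)
bad_positions <- which(is.na(y2) & !is.na(x2))
bad_positions
#> [1] 3
```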
  • If levels is omitted, the levels are taken from the data in alphabetical order
factor(x1)
#> [1] Dec Apr Jan Mar
#> Levels: Apr Dec Jan Mar
  • Setting the levels in order of first appearance

Method 1: use unique() when creating the factor

f1 <- factor(x1, levels = unique(x1))
f1
#> [1] Dec Apr Jan Mar
#> Levels: Dec Apr Jan Mar

Method 2: use fct_inorder() after creating the factor

f2 <- x1 %>% factor() %>% fct_inorder()
f2
#> [1] Dec Apr Jan Mar
#> Levels: Dec Apr Jan Mar
  • levels() returns the levels of a factor directly
levels(f2)
#> [1] "Dec" "Apr" "Jan" "Mar"
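
It may help to see why the level order matters. A factor stores each value as the integer position of its level, and sort() works on those positions. A minimal base R sketch (the names f_alpha and f_seen are introduced here for illustration):

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")

f_alpha <- factor(x1)                       # levels: Apr Dec Jan Mar
f_seen  <- factor(x1, levels = unique(x1))  # levels: Dec Apr Jan Mar

# Each value is stored as the position of its level
codes_alpha <- as.integer(f_alpha)
codes_seen  <- as.integer(f_seen)
codes_alpha
#> [1] 2 1 3 4
codes_seen
#> [1] 1 2 3 4

# sort() follows the level order, not the alphabet
sorted_seen <- as.character(sort(f_seen))
sorted_seen
#> [1] "Dec" "Apr" "Jan" "Mar"
```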

General Social Survey

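?forcats::gss_cat
  • Mapping a categorical variable to the x axis in ggplot2

ggplot2 converts the variable to a factor automatically and drops levels that have no observations; use drop = FALSE to force them to be displayed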

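library(tidyverse)  # loads ggplot2, dplyr, forcats, purrr, stringr, readr
library(patchwork)  # lets p1 + p2 place two plots side by side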

p1 <- ggplot(gss_cat, aes(race)) +
  geom_bar() 

p2 <- ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)

p1 + p2
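
The same distinction between declared levels and levels that actually occur can be checked without plotting. The sketch below uses a small hypothetical vector shaped like gss_cat$race rather than the real data, and base R only.

```r
# Toy data (hypothetical), with one declared level that never occurs
race <- factor(
  c("White", "Black", "White", "Other"),
  levels = c("Other", "Black", "White", "Not applicable")
)

# table() keeps empty levels, like scale_x_discrete(drop = FALSE)
counts_all <- table(race)
counts_all

# droplevels() removes them, like the default plot
counts_used <- table(droplevels(race))
counts_used
```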

- Exercises

gss_cat %>%
  # drop respondents for whom income is not applicable
  filter(!rincome %in% c("Not applicable")) %>%
  # rename one level
  mutate(rincome = fct_recode(rincome,
                              "Less than $1000" = "Lt $1000"
  )) %>%
  # flag non-answers so they can get a different fill colour
  mutate(rincome_na = rincome %in% c("Refused", "Don't know", "No answer")) %>%
  ggplot(aes(x = rincome, fill = rincome_na)) +
  geom_bar() +
  coord_flip() +
  scale_y_continuous("Number of Respondents", labels = scales::comma) +
  scale_x_discrete("Respondent's Income") +
  # black for real answers, gray for non-answers
  scale_fill_manual(values = c("FALSE" = "black", "TRUE" = "gray")) +
  theme(legend.position = "none")

Modifying factor order

It’s often useful to change the order of the factor levels in a visualisation.

- Reordering by a numeric value: fct_reorder

relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

p1 <- ggplot(relig_summary, aes(tvhours, relig)) + 
  geom_point()

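# fct_reorder() sorts levels in ascending order of tvhours by default (.desc = FALSE)
p2 <- ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()

# base R's reorder(), mentioned in the EDA chapter, also works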
ggplot(relig_summary, aes(tvhours, reorder(relig, tvhours))) +
  geom_point()

p1 + p2
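
When a level has several values, fct_reorder() summarises them with .fun (median by default) before sorting. A small sketch on hypothetical toy data, where group "a" contains one outlier so that the median and the mean disagree:

```r
library(forcats)

# Toy data (hypothetical): group "a" has one large outlier
f <- factor(c("a", "a", "a", "b", "b", "b"))
x <- c(1, 2, 30, 4, 5, 6)

# Default summary is the median: a = 2, b = 5
by_median <- levels(fct_reorder(f, x))
by_median
#> [1] "a" "b"

# With the mean: a = 11, b = 5
by_mean <- levels(fct_reorder(f, x, .fun = mean))
by_mean
#> [1] "b" "a"

# .desc = TRUE reverses the default ascending order
by_desc <- levels(fct_reorder(f, x, .desc = TRUE))
by_desc
#> [1] "b" "a"
```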

- Custom reordering: fct_relevel

It takes a factor, f, and then any number of levels that you want to move to the front of the line.

rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

p1 <- ggplot(rincome_summary, aes(age, rincome)) + 
  geom_point()

p2 <- ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()

p1 + p2
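
fct_relevel() can also place levels somewhere other than the front through its after argument. A minimal sketch on a hypothetical four-level factor:

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"))

# Move "c" to the front
to_front <- levels(fct_relevel(f, "c"))
to_front
#> [1] "c" "a" "b" "d"

# Move "a" to the end
to_end <- levels(fct_relevel(f, "a", after = Inf))
to_end
#> [1] "b" "c" "d" "a"
```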

- Aligning the legend order: fct_reorder2()

fct_reorder2() reorders the factor by the y values associated with the largest x values. This makes the plot easier to read because the line colours line up with the legend.

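Its main purpose is to put the legend entries in the same order as the line ends, which makes the plot easier to read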

by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)

ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
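
The rule "order by the y value at the largest x" is easier to verify on a tiny example. The data frame below is hypothetical; each group has two points, and only the y value at x = 2 decides the order.

```r
library(forcats)

# Toy data (hypothetical): three groups observed at x = 1 and x = 2
df <- data.frame(
  g = factor(rep(c("a", "b", "c"), each = 2)),
  x = rep(c(1, 2), times = 3),
  y = c(9, 1,   0, 3,   5, 2)
)

# y at the largest x: a = 1, b = 3, c = 2, sorted in descending order
legend_order <- levels(fct_reorder2(df$g, df$x, df$y))
legend_order
#> [1] "b" "c" "a"
```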
  • Another example: how each party's share changes from year to year
p1 <- gss_cat %>%
  mutate(
    partyid =
      fct_collapse(
         partyid,
         Others = c("No answer", "Don't know", "Other party"),
         Republican = c("Strong republican", "Not str republican"),
         Independent = c("Ind,near rep", "Independent", "Ind,near dem"),
         Democrat = c("Not str democrat", "Strong democrat")
      )
  ) %>%
  count(year, partyid) %>% 
  group_by(year) %>%
  mutate(proportions = n / sum(n)) %>% 
  ggplot(aes(year, proportions,
    colour = partyid
  )) +
  geom_point() +
  geom_line(linewidth = 1)  # ggplot2 >= 3.4; use size = 1 on older versions

p2 <- gss_cat %>%
  mutate(
    partyid =
      fct_collapse(
        partyid,
        Others = c("No answer", "Don't know", "Other party"),
        Republican = c("Strong republican", "Not str republican"),
        Independent = c("Ind,near rep", "Independent", "Ind,near dem"),
        Democrat = c("Not str democrat", "Strong democrat")
      )
  ) %>%
  count(year, partyid) %>% 
  group_by(year) %>%
  mutate(proportions = n / sum(n)) %>% 
  ggplot(aes(year, proportions,
             colour = fct_reorder2(partyid, year, proportions)
  )) +
  geom_point() +
  geom_line(linewidth = 1) +  # ggplot2 >= 3.4; use size = 1 on older versions
  labs(colour = "Party ID")    

p1 + p2

- Quick reordering of bar charts

Use fct_infreq() and fct_rev()

# fct_infreq() orders levels by decreasing frequency
p1 <- gss_cat %>%
  mutate(marital = marital %>% fct_infreq()) %>%
  ggplot(aes(marital)) +
  geom_bar()

# adding fct_rev() reverses that, giving increasing frequency
p2 <- gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
  geom_bar()

p1 + p2
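
The direction of each function can be confirmed on a toy factor (hypothetical data) where the frequencies are obvious:

```r
library(forcats)

# "z" occurs three times, "y" twice, "x" once
f <- factor(c("x", "y", "y", "z", "z", "z"))

most_common_first <- levels(fct_infreq(f))
most_common_first
#> [1] "z" "y" "x"

least_common_first <- levels(fct_rev(fct_infreq(f)))
least_common_first
#> [1] "x" "y" "z"
```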
  • Finding which variables in a dataset are factors
str(gss_cat)
# or, more concisely, with purrr::keep()
keep(gss_cat,is.factor) %>% 
  names(.)
# [1] "marital" "race"    "rincome" "partyid" "relig"   "denom"  
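
If purrr is not loaded, the same check can be done in base R. A sketch on a small hypothetical data frame:

```r
# Toy data frame (hypothetical) with one factor column
df <- data.frame(
  id    = 1:3,
  group = factor(c("a", "b", "a")),
  label = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# vapply() returns one TRUE/FALSE per column
factor_cols <- names(df)[vapply(df, is.factor, logical(1))]
factor_cols
#> [1] "group"
```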

Modifying factor levels

More powerful than changing the orders of the levels is changing their values.

- Changing level values: fct_recode()

gss_cat %>% 
  count(partyid)
#> # A tibble: 10 x 2
#>   partyid                n
#>   <fct>              <int>
#> 1 No answer            154
#> 2 Don't know             1
#> 3 Other party          393
#> 4 Strong republican   2314
#> 5 Not str republican  3032
#> 6 Ind,near rep        1791
#> # … with 4 more rows

gss_cat %>%
  mutate(partyid = fct_recode(
    partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
#> # A tibble: 10 x 2
#>   partyid                   n
#>   <fct>                 <int>
#> 1 No answer               154
#> 2 Don't know                1
#> 3 Other party             393
#> 4 Republican, strong     2314
#> 5 Republican, weak       3032
#> 6 Independent, near rep  1791
#> # … with 4 more rows

fct_recode() will leave levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.
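
Both behaviours can be seen on a toy factor (hypothetical data). The second call deliberately refers to a level that does not exist and captures the resulting warning message instead of printing it; the exact wording of the message may vary between forcats versions.

```r
library(forcats)

f <- factor(c("a", "b", "c"))

# Only "a" is mentioned; "b" and "c" are left as is
recoded <- fct_recode(f, "A" = "a")
levels(recoded)
#> [1] "A" "b" "c"

# "zzz" is not a level of f, so fct_recode() warns
msg <- tryCatch(
  {
    fct_recode(f, "Z" = "zzz")
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)
msg
```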

  • Several old levels can be assigned to the same new level, which merges them into one group
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)

- Collapsing many levels at once: fct_collapse()

gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)
#> # A tibble: 4 x 2
#>   partyid     n
#>   <fct>   <int>
#> 1 other     548
#> 2 rep      5346
#> 3 ind      8409
#> 4 dem      7180
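
fct_collapse() is essentially shorthand for repeating the same new name in fct_recode(). A sketch on a hypothetical toy factor comparing the two:

```r
library(forcats)

f <- factor(c("a", "b", "c", "d", "a"))

# One vector of old levels per new level
collapsed <- fct_collapse(f, ab = c("a", "b"))

# The same mapping written out with fct_recode()
recoded <- fct_recode(f, ab = "a", ab = "b")

levels(collapsed)
#> [1] "ab" "c"  "d"
same_values <- identical(as.character(collapsed), as.character(recoded))
same_values
#> [1] TRUE
```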
  • A worked example: collapsing income levels before plotting
gss_cat %>%
  mutate(
    rincome =
      fct_collapse(
        rincome,
        `Unknown` = c("No answer", "Don't know", "Refused", "Not applicable"),
        `Less than $5000` = c("Lt $1000", str_c(
          "$", c("1000", "3000", "4000"),
          " to ", c("2999", "3999", "4999")
        )),
        `$5000 to 10000` = str_c(
          "$", c("5000", "6000", "7000", "8000"),
          " to ", c("5999", "6999", "7999", "9999")
        )
      )
  ) %>%
  ggplot(aes(x = rincome)) +
  geom_bar() +
  coord_flip()

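- Lumping small groups together automatically: fct_lump()

By default it starts with the smallest groups and progressively merges them, as long as the combined group is still the smallest one

It is generally used for unordered (nominal) categories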

gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
#> # A tibble: 2 x 2
#>   relig          n
#>   <fct>      <int>
#> 1 Protestant 10846
#> 2 Other      10637
  • The n argument sets how many of the most frequent levels to keep; everything else is lumped into Other
gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print(n = Inf)
#> # A tibble: 10 x 2
#>    relig                       n
#>    <fct>                   <int>
#>  1 Protestant              10846
#>  2 Catholic                 5124
#>  3 None                     3523
#>  4 Christian                 689
#>  5 Other                     458
#>  6 Jewish                    388
#>  7 Buddhism                  147
#>  8 Inter-nondenominational   109
#>  9 Moslem/islam              104
#> 10 Orthodox-christian         95


gss_cat %>%
  mutate(relig = fct_lump(relig, n = 5)) %>%
  count(relig, sort = TRUE)
#> # A tibble: 6 x 2
#>   relig          n
#>   <fct>      <int>
#> 1 Protestant 10846
#> 2 Catholic    5124
#> 3 None        3523
#> 4 Other        913
#> 5 Christian    689
#> 6 Jewish       388
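
Newer forcats releases also provide more explicit variants of fct_lump(). The sketch below assumes forcats 0.5.0 or later, where fct_lump_n(), fct_lump_min() and fct_lump_prop() are available, and uses a hypothetical toy factor with counts a = 5, b = 3, c = 1, d = 1.

```r
library(forcats)  # assumes forcats >= 0.5.0

f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep the 2 most frequent levels
keep_top2 <- levels(fct_lump_n(f, n = 2))
keep_top2
#> [1] "a"     "b"     "Other"

# Keep levels that occur at least 3 times
keep_min3 <- levels(fct_lump_min(f, min = 3))
keep_min3
#> [1] "a"     "b"     "Other"

# Keep levels that make up more than 20% of the values
keep_prop <- levels(fct_lump_prop(f, prop = 0.2))
keep_prop
#> [1] "a"     "b"     "Other"
```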