[R] Regex matching with the stringr package, R for Data Science, 8-2

Notes from working through Chapter 14 (Strings) of R for Data Science
Reference: R for Data Science

Scope: the regex-related parts of stringr

Introduction

When you first look at a regexp, you’ll think a cat walked across your keyboard

Matching patterns with regular expressions

- Basic matches

Use str_view() to visualize matches

library(tidyverse)  # attaches stringr, dplyr, tidyr and tibble, all used below

x <- c("apple", "banana", "pear")

str_view(x, ".a.")
str_view(x, ".a.*")
str_view(x, ".*a.*")

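Escaping in a regex is layered on top of the string's own escaping.
The regex escape character \ must itself be escaped inside a string literal, so it is written as \\.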

dot <- "\\."
dot
# [1] "\\." # print() still shows \\; use writeLines() to see the real contents

writeLines(dot) # the string itself holds only one \
# \.

str_view(c("abc", "a.c", "bef"), "a\\.c")
# a.c

# To match a literal \, the regex needs \\, and each of those \ must be escaped again in the string literal, hence four \
x <- "a\\b"
str_view(x, "\\\\")
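One way to check the two layers of escaping is to count characters. The sketch below (the variable name `pattern` is only illustrative) shows that the four-backslash literal is a two-character string, which the regex engine then reads as one literal backslash.

```r
library(stringr)

# Four backslashes in source code produce a two-character string: \\
pattern <- "\\\\"
writeLines(pattern)
#> \\
str_length(pattern)
#> [1] 2

# Read as a regex, \\ matches one literal backslash
str_detect(c("a\\b", "ab"), pattern)
#> [1]  TRUE FALSE
```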

Exercises

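Q: Explain why each of these strings don’t match a \: "\", "\\", "\\\".

\: inside a string literal, \ escapes the next character, here the closing quote, so the string is left unterminated
\\: the string evaluates to a single \, which the regex engine reads as an escape with nothing left to escape
\\\: the first two \ evaluate to one \; the third escapes the next character, again the closing quote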

- Anchors

stringr::words, a corpus of common English words, is handy for practice

^ to match the start of the string.
$ to match the end of the string.

If you begin with power (^), you end up with money ($).

\b matches a word boundary, while ^ and $ match the boundaries of the whole string

x <- c("apple pie", "apple", "apple cake")

str_view(x, "apple")
str_view(x, "^apple$")
str_view(x, "\\bapple\\b")
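To see the difference as logical values rather than highlighted output, here is a small sketch using str_detect() on an illustrative vector (the input strings are made up for this example).

```r
library(stringr)

x <- c("apple pie", "apple", "pineapple")

# ^...$ requires the whole string to be exactly "apple"
str_detect(x, "^apple$")
#> [1] FALSE  TRUE FALSE

# \b only requires "apple" to stand alone as a word somewhere in the string
str_detect(x, "\\bapple\\b")
#> [1]  TRUE  TRUE FALSE
```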

- Character classes and alternatives

Sometimes a character class [] is more readable than a backslash escape

str_extract(c("abc", "a.c", "a*c", "a c"), ".[*]c")
# [1] NA    NA    "a*c" NA  

str_extract(c("abc", "a.c", "a*c", "a c"), ".\\*c")
# [1] NA    NA    "a*c" NA  

str_extract(c("abc", "a.c", "a*c", "a c"), ".[.]c")
# [1] NA    "a.c" NA    NA  

str_extract(c("abc", "a.c", "a*c", "a c"), ".\\.c")
# [1] NA    "a.c" NA    NA  

# str_extract() keeps NA for non-matches; str_subset() drops the non-matching elements

About |
| has very low precedence

abc|xyz matches abc or xyz, not abcyz or abxyz.

str_extract(c("grey", "gray", "ggap"), "gr(e|a)y")
# [1] "grey" "gray" NA
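The low precedence of | is easiest to see by dropping the parentheses. In this sketch the pattern "gre|ay" is a deliberately wrong variant added for contrast; it is not from the book.

```r
library(stringr)

x <- c("grey", "gray")

# Without parentheses, | splits the whole pattern: "gre" OR "ay"
str_extract(x, "gre|ay")
#> [1] "gre" "ay"

# Parentheses limit the alternation to a single letter
str_extract(x, "gr(e|a)y")
#> [1] "grey" "gray"
```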

Q: Create regular expressions to find all words that:
(1) Start with a vowel.

str_subset(words, '^[aeiou].*')

(2) That only contain consonants. (Hint: thinking about matching “not”-vowels.)

str_subset(words, '^[^aeiou]+$')

(3) End with ed, but not with eed.

# Allow for the word being just the two characters "ed"
str_subset(words, '(^|[^e])ed$')

(4) End with ing or ise.

str_subset(words, 'i(ng|se)$')

- Repetition

?: 0 or 1
+: 1 or more
*: 0 or more
{n}: exactly n
{n,}: n or more
{,m}: at most m (safer written as {0,m}; the ICU engine behind stringr may not accept an empty lower bound)
{n,m}: between n and m

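{n,m} is greedy by default: it matches as many repetitions as it can, up to m. Adding ? after it makes it lazy (non-greedy), so it matches as few as possible.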
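A minimal sketch of greedy versus lazy repetition, using the Roman-numeral string from the book's repetition section as input:

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# Greedy: takes as many C's as allowed (up to 3)
str_extract(x, "C{2,3}")
#> [1] "CCC"

# Lazy: stops as soon as the minimum (2) is satisfied
str_extract(x, "C{2,3}?")
#> [1] "CC"
```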

https://regexcrossword.com/

This site offers regex puzzles at various difficulty levels and themes; it is worth a try.

- Grouping and backreferences

Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with backreferences, like \1, \2 etc.

str_subset(fruit, "(..)\\1")
# [1] "banana"      "coconut"     "cucumber"    "jujube"      "papaya"      "salal berry"
str_subset(fruit, "(.)(.)l\\2\\b")
# [1] "chili pepper"
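To see what a backreference actually captured, str_match() can be used on a small illustrative vector; the first column is the full match and the second is group 1.

```r
library(stringr)

x <- c("banana", "apple")

# (..)\1 means: any two characters, immediately followed by the same two
str_detect(x, "(..)\\1")
#> [1]  TRUE FALSE

m <- str_match(x, "(..)\\1")
m
#>      [,1]   [,2]
#> [1,] "anan" "an"
#> [2,] NA     NA
```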

Tools

- Detect matches

(1) str_detect: returns a logical vector telling whether each element matches

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1]  TRUE FALSE  TRUE

FALSE becomes 0 and TRUE becomes 1.

sum(str_detect(words, '^(.).*\\1$'))
# [1] 36

mean(str_detect(words, '^(.).*\\1$'))
# [1] 0.03673469

str_detect can simplify a problem.
For example: find all words that contain no vowels.

# The first approach is simpler
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")

# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")

identical(no_vowels_1, no_vowels_2)
#> [1] TRUE

Subsetting

words[!str_detect(words, "[aeiou]")]
# equivalent to
str_subset(words,'^[^aeiou]+$')
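A third spelling of the same idea, assuming a stringr version that provides the negate argument (1.4.0 or later, to my knowledge). The input vector is made up for illustration.

```r
library(stringr)

x <- c("dry", "apple", "why", "pear")

# Keep the elements that do NOT match the pattern
str_subset(x, "[aeiou]", negate = TRUE)
#> [1] "dry" "why"

# Same result as negating str_detect() by hand
identical(str_subset(x, "[aeiou]", negate = TRUE), x[!str_detect(x, "[aeiou]")])
#> [1] TRUE
```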

str_detect pairs with dplyr::filter to filter the rows of a data frame by a column

tibble(
  word = words, 
  i = seq_along(word)
) %>% 
  filter(str_detect(word, "x$"))
#> # A tibble: 4 x 2
#>   word      i
#>   <chr> <int>
#> 1 box     108
#> 2 sex     747
#> 3 six     772
#> 4 tax     841

(2) str_count: returns the number of regex matches in each string (word or phrase)

str_count(c('chenxi','hello','world','apple'),'[aeiou]')
# [1] 2 2 1 2
str_count(c('chenxi','hello','world','apple'),'^[^aeiou]')
# [1] 1 1 1 0

# Average number of vowels per word in the words dataset
mean(str_count(words, '[aeiou]'))
# [1] 1.991837

str_count pairs with dplyr::mutate

df <- tibble(
  id = seq_along(words),  # index the words vector: the word column does not exist yet
  word = words
)

df %>% 
  mutate(
    vowels = str_count(word,'[aeiou]'),
    consonants = str_count(word,'[^aeiou]')
  )

Regex matches never overlap

str_count("abababa", "aba")
#> [1] 2

str_view_all("abababa", "aba")
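If overlapping occurrences do need to be counted, one common workaround is a zero-width lookahead, which tests each position without consuming characters. This is an addition to the notes, sketched under the assumption that the ICU engine counts zero-width matches position by position.

```r
library(stringr)

# Ordinary matching consumes "aba", so the middle occurrence is skipped
str_count("abababa", "aba")
#> [1] 2

# (?=aba) matches the empty position before every "aba", overlapping or not
str_count("abababa", "(?=aba)")
#> [1] 3
```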

Order-independent matching: lookarounds

# Match the words in the words dataset that contain all five vowels
str_subset(words,
           '^(?=.*?a)(?=.*?e)(?=.*?i)(?=.*?o)(?=.*?u).*$')

# Breaking the problem into simple checks also works
words[str_detect(words,'a') &
      str_detect(words,'e') &
      str_detect(words,'i') &
      str_detect(words,'o') &
      str_detect(words,'u')]

- Extract matches

Example data: stringr::sentences

length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks." 
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."     
#> [4] "These days a chicken leg is a rare dish."   
#> [5] "Rice is often served in round bowls."       
#> [6] "The juice of lemons makes fine punch."

str_extract

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|");colour_match

has_colour <- str_subset(sentences, colour_match)
# extract the colour to figure out which one it is
matches <- str_extract(has_colour, colour_match)
# has_colour holds only sentences that match; for a string with no match, str_extract() would return NA
head(matches)

Note that str_extract() only extracts the first match.

Even when a string contains several matches, only the first is returned.

str_extract_all
Returns every match, as a list

more <- sentences[str_count(sentences, colour_match) > 1]

str_extract(more, colour_match)
# [1] "blue"   "green"  "orange"

str_extract_all(more, colour_match)
# [[1]]
# [1] "blue" "red" 
# 
# [[2]]
# [1] "green" "red"  
# 
# [[3]]
# [1] "orange" "red"  

x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
#    [,1] [,2] [,3]
# [1,] "a"  ""   ""  
# [2,] "a"  "b"  ""  
# [3,] "a"  "b"  "c" 

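# The number of columns is set by the string with the most matches;
# shorter rows are padded with the empty string ""

unlist(str_extract_all(sentences,'\\b[A-Za-z]*ing\\b'))
# unlist() drops the empty results from sentences with no match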

- Grouped matches

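Comparing str_subset, str_extract, and str_match

# A run of non-space characters after the article approximates the next word
noun <- "(a|the) ([^ ]+)"

# Keep every whole string that matches; returns a character vector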
results <- sentences %>%
  str_subset(noun)
# [1] "The birch canoe slid on the smooth planks."               
# [2] "Glue the sheet to the dark blue background."              
# [3] "It's easy to tell the depth of a well."                   
# [4] "These days a chicken leg is a rare dish." 
# ....

# Extract every match; returns a list (or a matrix with simplify = TRUE) that can be flattened
results %>% # sentences would work here as well
  str_extract_all(noun) %>% 
  unlist() %>% 
  .[1:5]
# [1] "the smooth" "the sheet"  "the dark"   "the depth"  "a well"  


# Besides the full match, also returns each parenthesized group (submatch), as a matrix
results %>% 
  str_match(noun) %>% 
  head(3)
#       [,1]         [,2]  [,3]    
# [1,] "the smooth" "the" "smooth"
# [2,] "the sheet"  "the" "sheet" 
# [3,] "the depth"  "the" "depth" 
# Like str_extract, only the first match per string is returned

# Use str_match_all to get every match, with its groups, for each string
results %>% 
  str_match_all(noun) %>% 
  .[[2]]
#       [,1]        [,2]  [,3]   
# [1,] "the sheet" "the" "sheet"
# [2,] "the dark"  "the" "dark"

tidyr::extract, the data-frame counterpart of str_match

If your data is in a tibble, it’s often easier to use tidyr::extract(). It works like str_match() but requires you to name the matches, which are then placed in new columns

tibble(sentence = sentences) %>% 
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)", 
    remove = FALSE # whether to drop the original sentence column
  ) %>% 
  filter(!is.na(article))

- Replacing matches

str_replace() and str_replace_all()

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-"  "p--r"   "b-n-n-"

str_replace_all supports several replacements at once

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house"    "two cars"     "three people"

Backreferences can also be used in the replacement.
The example below swaps the second and third words.

sentences %>% 
  head(3)
# [1] "The birch canoe slid on the smooth planks." 
# [2] "Glue the sheet to the dark blue background."
# [3] "It's easy to tell the depth of a well." 

sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(3)
# [1] "The canoe birch slid on the smooth planks."
# [2] "Glue sheet the to the dark blue background."
# [3] "It's to easy tell the depth of a well."

sentences %>% 
  str_replace_all("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(3)
# [1] "The canoe birch slid the on smooth planks." 
# [2] "Glue sheet the to dark the blue background."
# [3] "It's to easy tell depth the of well. a"
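The same technique works for reordering fields. A small sketch on made-up date strings, where \\1, \\2 and \\3 in the replacement refer to the year, month and day groups:

```r
library(stringr)

dates <- c("2024-01-15", "1999-12-31")

# Capture year, month, day, then emit them as day/month/year
str_replace(dates, "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
#> [1] "15/01/2024" "31/12/1999"
```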

Q: Swap the first and last letters of every word in words. Which of the resulting strings are still words in the dataset?

req <- '^([A-Za-z])(.*)([A-Za-z])$'

results <- str_replace_all(words, req, '\\3\\2\\1')

results[results %in% words]
# Or take the intersection
intersect(words,results)
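It helps to try the pattern on a few short inputs first. In this sketch (the vector x is illustrative), note that the pattern needs at least two letters, so a one-letter word is left unchanged and will trivially show up in the result.

```r
library(stringr)

req <- "^([A-Za-z])(.*)([A-Za-z])$"
x <- c("dog", "on", "a")

# Group 1 = first letter, group 2 = middle (possibly empty), group 3 = last letter
str_replace(x, req, "\\3\\2\\1")
#> [1] "god" "no"  "a"
```

Words whose first and last letters are identical are also unchanged by the swap, so they appear in the result as well.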

- Splitting

str_split returns a list

sentences %>%
  head(3) %>% 
  str_split(" ")
#> [[1]]
#> [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
#> [8] "planks."
#> 
#> [[2]]
#> [1] "Glue"        "the"         "sheet"       "to"          "the"        
#> [6] "dark"        "blue"        "background."
#> 
#> [[3]]
#> [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."

If you’re working with a length-1 vector, the easiest thing is to just extract the first element of the list

"a|b|c|d" %>% 
  str_split("\\|") %>% 
  .[[1]]

#> [1] "a" "b" "c" "d"

Other arguments of str_split

# simplify = TRUE returns a matrix
sentences %>%
  head(5) %>% 
  str_split(" ", simplify = TRUE)
#>      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]         
#> [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."    
#> [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background."
#> [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"          
#> [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"       
#> [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""           
#>      [,9]   
#> [1,] ""     
#> [2,] ""     
#> [3,] "well."
#> [4,] "dish."
#> [5,] ""


# n sets the maximum number of pieces returned
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
#>      [,1]      [,2]    
#> [1,] "Name"    "Hadley"
#> [2,] "Country" "NZ"    
#> [3,] "Age"     "35"
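The effect of n is clearer when the delimiter also appears inside the value. A sketch with a made-up field:

```r
library(stringr)

field <- "Time: 12: 30"

# Without n, every ": " is a split point
str_split(field, ": ")[[1]]
#> [1] "Time" "12"   "30"

# With n = 2, splitting stops after the first delimiter
str_split(field, ": ", n = 2)[[1]]
#> [1] "Time"   "12: 30"
```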

Splitting into words with boundary()

x <- "This is a sentence.  This is another sentence."

str_split(x, " ")[[1]]
#> [1] "This"      "is"        "a"         "sentence." ""          "This"     
#> [7] "is"        "another"   "sentence."
str_split(x, boundary("word"))[[1]]
#> [1] "This"     "is"       "a"        "sentence" "This"     "is"       "another" 
#> [8] "sentence"

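# boundary("word") recognizes words and drops spaces and punctuation

Splitting on the empty string ""
"" splits a string into its individual characters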

x <- c("apples, pears, and bananas")
str_split(x, "")[[1]]
# [1] "a" "p" "p" "l" "e" "s" "," " " "p" "e" "a" "r" "s" "," " " "a" "n" "d" " " "b" "a"
# [22] "n" "a" "n" "a" "s"

- Find matches

str_locate
Returns the start and end positions of the match; the _all variant returns every match

?str_locate

fruit <- c("apple", "banana", "pear", "pineapple")
str_locate(fruit, "$")
str_locate(fruit, "a")
str_locate(fruit, "e")
str_locate(fruit, c("a", "b", "p", "p"))

str_locate_all(fruit, "a")
str_locate_all(fruit, "e")
str_locate_all(fruit, c("a", "b", "p", "p"))

# Find location of every character
str_locate_all(fruit, "")
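For reference, this is what the returned positions look like on a single string, and how they can be passed back to str_sub(), which accepts a two-column start/end matrix. The input "banana" is just an illustration.

```r
library(stringr)

# One row per match, with start and end columns
loc <- str_locate_all("banana", "an")[[1]]
loc
#>      start end
#> [1,]     2   3
#> [2,]     4   5

# A start/end matrix can be handed straight to str_sub()
str_sub("banana", loc)
#> [1] "an" "an"
```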

str_sub
Extracts substrings by position

hw <- "Hadley Wickham"

str_sub(hw, 1, 6)
str_sub(hw, end = 6)
# [1] "Hadley"

str_sub(hw, -7)
# [1] "Wickham"
str_sub(hw, end = -9)
# [1] "Hadley"

str_sub('XiChen', seq_len(str_length('XiChen')))
# [1] "XiChen" "iChen"  "Chen"   "hen"    "en"     "n"
str_sub('XiChen', end = seq_len(str_length('XiChen')))
# [1] "X"      "Xi"     "XiC"    "XiCh"   "XiChe"  "XiChen"

# The assignment form replaces a substring in place
x <- "XiChen"
str_sub(x, 1, 1) <- "A"; x
# [1] "AiChen"
str_sub(x, -1, -1) <- "K"; x
# [1] "AiCheK"
str_sub(x, -2, -2) <- "GHIJ"; x
# [1] "AiChGHIJK"
str_sub(x, 2, -2) <- ""; x
# [1] "AK"

# See the help page for more

Other types of pattern

When you use a pattern that’s a string, it’s automatically wrapped into a call to regex()

# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))

regex() can set ignore_case to ignore letter case

bananas <- c("banana", "Banana", "BANANA")
str_subset(bananas,'banana')
# [1] "banana"
str_subset(bananas, regex("banana", ignore_case = TRUE))
# [1] "banana" "Banana" "BANANA"

multiline = TRUE allows ^ and $ to match the start and end of each line

x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
#> [1] "Line"

str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
#> [1] "Line" "Line" "Line"

comments = TRUE lets you use comments and whitespace to make a complex regex easier to read

phone <- regex("
  \\(?     # optional opening parens
  (\\d{3}) # area code
  [) -]?   # optional closing parens, space, or dash
  (\\d{3}) # another three numbers
  [ -]?    # optional space or dash
  (\\d{3}) # three more numbers
  ", comments = TRUE)

str_match("514-791-8141", phone)
#>      [,1]          [,2]  [,3]  [,4] 
#> [1,] "514-791-814" "514" "791" "814"
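Note that the last group above captures only three digits, so the final digit of the number is left out of the match. If the whole number is wanted, one possible variant (named phone_full here purely for illustration) widens the last group to four digits:

```r
library(stringr)

phone_full <- regex("
  \\(?     # optional opening parens
  (\\d{3}) # area code
  [) -]?   # optional closing parens, space, or dash
  (\\d{3}) # another three numbers
  [ -]?    # optional space or dash
  (\\d{4}) # final four numbers
  ", comments = TRUE)

m <- str_match("514-791-8141", phone_full)
m[1, 1]
#> [1] "514-791-8141"
m[1, 4]
#> [1] "8141"
```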

dotall = TRUE
Lets . match everything, including \n

x <- "Line 1\nLine 2\nLine 3"

str_extract(x, '.*')
# [1] "Line 1"
str_extract(x, regex('.*',dotall = TRUE))
# [1] "Line 1\nLine 2\nLine 3"

fixed

fixed(): matches exactly the specified sequence of bytes. It ignores all special regular expressions and operates at a very low level.

str_subset(c("a\\b", "ab"), "\\\\")
# [1] "a\\b"

# fixed() avoids regex-level escaping and is faster
str_subset(c("a\\b", "ab"), fixed("\\"))
# [1] "a\\b"
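The difference shows up with any regex metacharacter, not only the backslash. A short sketch with a literal dot on an illustrative vector:

```r
library(stringr)

x <- c("abc", "a.c")

# As a regex, "." matches any character, so both strings match
str_detect(x, ".")
#> [1] TRUE TRUE

# fixed(".") looks for a literal dot only
str_detect(x, fixed("."))
#> [1] FALSE  TRUE
```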

Other uses of regular expressions
