[R] The stringr package: regular-expression matching, from R for Data Science (8-2)

Study notes from working through Chapter 14, Strings, of R for Data Science
Reference: R for Data Science

Scope: the regular-expression features of stringr

Introduction

When you first look at a regexp, you’ll think a cat walked across your keyboard

Matching patterns with regular expressions

- Basic matches

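Use str_view() to visualise what a pattern matches

library(tidyverse) # attaches stringr, plus dplyr, tibble and tidyr used later
x <- c("apple", "banana", "pear")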

str_view(x, ".a.")
str_view(x, ".a.*")
str_view(x, ".*a.*")
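  • Regex escapes sit on top of string escapes
    A regex escape starts with \, and a \ inside a string literal must itself be escaped, so write \\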
dot <- "\\."
dot
# [1] "\\."  # print() still shows the escaped form \\; use writeLines() to see the raw contents

writeLines(dot) # the string itself contains a single \
# \.

str_view(c("abc", "a.c", "bef"), "a\\.c")
# a.c

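# To match a literal \, the regex needs \\, and each of those must be escaped again in the string literal, so you write four backslashes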
x <- "a\\b"
str_view(x, "\\\\")
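To make the two layers of escaping concrete, here is a minimal sketch that compares what R stores in the string with what the regex engine then matches. The variable names (pattern, inputs) are illustrative and not from the book.

```r
library(stringr)

# The literal "\\." is stored as two characters: a backslash and a dot.
pattern <- "\\."
nchar(pattern)                               # 2

inputs <- c("a.c", "abc")

# Read as a regex, backslash + dot means "a literal dot".
escaped_hits <- str_detect(inputs, "a\\.c")  # TRUE FALSE

# An unescaped dot matches any character, so both inputs match.
plain_hits <- str_detect(inputs, "a.c")      # TRUE TRUE

# Four backslashes in source -> two in the string -> one literal backslash for the regex.
slash_hits <- str_detect(c("a\\b", "ab"), "\\\\")  # TRUE FALSE
```

Counting characters with nchar() is a quick way to check how many backslashes actually survive string parsing before the regex engine sees them.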
  • Exercises

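Q: Explain why each of these strings doesn't match a \: "\", "\\", "\\\".

  1. \: in a string literal a single backslash escapes the next character, so "\" escapes the closing quote and the string is never terminated.
  2. \\: the string holds one backslash, which the regex engine reads as an escape with nothing after it to escape, so the pattern is invalid.
  3. \\\: the first two backslashes yield one backslash in the string, and the third escapes the closing quote, so the string is again unterminated.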

- Anchors

The stringr::words vector of common English words is handy for practice

  • ^ to match the start of the string.
  • $ to match the end of the string.

If you begin with power (^), you end up with money ($).

\b matches a word boundary, whereas ^ and $ match the start and end of the whole string

x <- c("apple pie", "apple", "apple cake")

str_view(x, "apple")
str_view(x, "^apple$")
str_view(x, "\\bapple\\b")
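The difference between the three patterns is easier to see as logical vectors. The following sketch uses str_detect() on a slightly different, made-up input vector that includes "pineapple" to show where \b refuses to match.

```r
library(stringr)

# Hypothetical inputs chosen to separate the three cases.
phrases <- c("apple pie", "apple", "pineapple")

anywhere   <- str_detect(phrases, "apple")         # TRUE  TRUE TRUE
whole      <- str_detect(phrases, "^apple$")       # FALSE TRUE FALSE
whole_word <- str_detect(phrases, "\\bapple\\b")   # TRUE  TRUE FALSE
```

"pineapple" fails the \b version because the character before "apple" is a letter, so there is no word boundary at that position.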

- Character classes and alternatives

  • Wrapping a metacharacter in [] instead of escaping it with backslashes is sometimes more readable
str_extract(c("abc", "a.c", "a*c", "a c"), ".[*]c")
# [1] NA    NA    "a*c" NA  

str_extract(c("abc", "a.c", "a*c", "a c"), ".\\*c")
# [1] NA    NA    "a*c" NA  

str_extract(c("abc", "a.c", "a*c", "a c"), ".[.]c")
# [1] NA    "a.c" NA    NA  

str_extract(c("abc", "a.c", "a*c", "a c"), ".\\.c")
# [1] NA    "a.c" NA    NA  

# str_extract() returns NA for non-matching elements; str_subset() drops them
  • About |
    | has very low precedence

abc|xyz matches abc or xyz, not abcyz or abxyz.

str_extract(c("grey", "gray", "ggap"), "gr(e|a)y")
# [1] "grey" "gray" NA  
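As a quick check of the precedence rule, this sketch runs the same made-up inputs through an ungrouped and a grouped alternation. The input vector is an assumption for illustration only.

```r
library(stringr)

candidates <- c("abc", "xyz", "abxyz", "abcyz")

# Without parentheses, | splits the whole pattern: "abc" OR "xyz".
ungrouped <- str_extract(candidates, "abc|xyz")
# "abc" "xyz" "xyz" "abc"

# Parentheses restrict the alternation to a single position: c or x.
grouped <- str_extract(candidates, "ab(c|x)yz")
# NA NA "abxyz" "abcyz"
```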
  • Q: Create regular expressions to find all words that:
    (1) Start with a vowel.
str_subset(words, '^[aeiou].*')

(2) That only contain consonants. (Hint: thinking about matching “not”-vowels.)

str_subset(words, '^[^aeiou]+$')

(3) End with ed, but not with eed.

# (^|[^e]) also covers a word that is only the two characters "ed"
str_subset(words, '(^|[^e])ed$')

(4) End with ing or ise.

str_subset(words, 'i(ng|se)$')

- Repetition

  • ?: 0 or 1
  • +: 1 or more
  • *: 0 or more
  • {n}: exactly n
  • {n,}: n or more
  • {0,m}: at most m (the ICU engine behind stringr rejects the {,m} shorthand)
  • {n,m}: between n and m

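Quantifiers such as {n,m} are greedy by default: they match as many repetitions as they can, up to m. Appending ? makes them lazy, so they match as few as possible.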
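A small sketch of greedy versus lazy matching, using made-up inputs. str_extract() returns the first match, which makes the difference in match length easy to see.

```r
library(stringr)

letters_run <- "aaaaaa"
greedy_run <- str_extract(letters_run, "a{2,4}")    # "aaaa": takes up to 4
lazy_run   <- str_extract(letters_run, "a{2,4}?")   # "aa":   stops at the minimum

tagged <- "<b>bold</b>"
greedy_tag <- str_extract(tagged, "<.+>")    # "<b>bold</b>": runs to the last >
lazy_tag   <- str_extract(tagged, "<.+?>")   # "<b>":         stops at the first >
```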

https://regexcrossword.com/

This site offers regex puzzles at a range of difficulty levels and is worth a try

- Grouping and backreferences

Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with backreferences, like \1, \2 etc.

str_subset(fruit, "(..)\\1")
# [1] "banana"      "coconut"     "cucumber"    "jujube"      "papaya"      "salal berry"
str_subset(fruit, "(.)(.)l\\2\\b")
# [1] "chili pepper"
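Backreferences are also a common way to spot accidentally doubled words. The sketch below is an illustration on made-up input, not an example from the book.

```r
library(stringr)

lines_in <- c("the the cat", "a big dog", "is is it")

# \\1 must repeat exactly what group 1 captured; \\b keeps the match on whole words.
doubled <- str_extract(lines_in, "\\b(\\w+) \\1\\b")
# "the the" NA "is is"

# The same backreference works in the replacement to keep a single copy.
cleaned <- str_replace(lines_in, "\\b(\\w+) \\1\\b", "\\1")
# "the cat" "a big dog" "is it"
```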

Tools

- Detect matches

(1) str_detect: returns a logical vector saying whether each element matches

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1]  TRUE FALSE  TRUE

In a numeric context FALSE becomes 0 and TRUE becomes 1, so sum() and mean() give the number and the proportion of matches.

sum(str_detect(words, '^(.).*\\1$'))
# [1] 36

mean(str_detect(words, '^(.).*\\1$'))
# [1] 0.03673469
  • str_detect can simplify a problem
    Example: find all words that contain no vowels
# The first approach is simpler
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")

# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")

identical(no_vowels_1, no_vowels_2)
#> [1] TRUE
  • Subsetting
words[!str_detect(words, "[aeiou]")]
# is equivalent to
str_subset(words,'^[^aeiou]+$')
  • str_detect combines with dplyr::filter to filter the rows of a data frame by a column
tibble(
  word = words, 
  i = seq_along(word)
) %>% 
  filter(str_detect(word, "x$"))
#> # A tibble: 4 x 2
#>   word      i
#>   <chr> <int>
#> 1 box     108
#> 2 sex     747
#> 3 six     772
#> 4 tax     841

(2) str_count: returns the number of matches in each string (word or phrase)

str_count(c('chenxi','hello','world','apple'),'[aeiou]')
# [1] 2 2 1 2
str_count(c('chenxi','hello','world','apple'),'^[^aeiou]')
# [1] 1 1 1 0

# Average number of vowels per word in words
mean(str_count(words, '[aeiou]'))
# [1] 1.991837
  • str_count pairs well with dplyr::mutate
df <- tibble(
  id = seq_along(words), # use words here: the word column does not exist yet
  word = words
)

df %>% 
  mutate(
    vowels = str_count(word,'[aeiou]'),
    consonants = str_count(word,'[^aeiou]')
  )
  • Regex matches never overlap
str_count("abababa", "aba")
#> [1] 2

str_view_all("abababa", "aba")
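If overlapping occurrences do need to be counted, one common workaround is a zero-width lookahead: the match itself consumes no characters, so the engine can report every starting position. This is a sketch of that idea rather than something from the book.

```r
library(stringr)

text <- "abababa"

# Ordinary matching consumes characters, so the middle "aba" is skipped.
non_overlapping <- str_count(text, "aba")      # 2

# (?=aba) asserts that "aba" starts here without consuming it.
overlapping <- str_count(text, "(?=aba)")      # 3
```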
  • Order-independent matching with lookarounds
# Find the words in words that contain all five vowels
str_subset(words,
           '^(?=.*?a)(?=.*?e)(?=.*?i)(?=.*?o)(?=.*?u).*$')

# Breaking the problem into simpler tests also works
words[str_detect(words,'a') &
      str_detect(words,'e') &
      str_detect(words,'i') &
      str_detect(words,'o') &
      str_detect(words,'u')]

- Extract matches

Example data: stringr::sentences

length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks." 
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."     
#> [4] "These days a chicken leg is a rare dish."   
#> [5] "Rice is often served in round bowls."       
#> [6] "The juice of lemons makes fine punch."
  • str_extract
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|");colour_match

has_colour <- str_subset(sentences, colour_match)
# extract the colour to figure out which one it is
matches <- str_extract(has_colour, colour_match)
# has_colour contains only matching sentences; for a non-matching string str_extract() would return NA
head(matches)

Note that str_extract() only extracts the first match.

Even when a string contains several matches, only the first is returned

  • str_extract_all
    Returns every match, as a list
more <- sentences[str_count(sentences, colour_match) > 1]

str_extract(more, colour_match)
# [1] "blue"   "green"  "orange"

str_extract_all(more, colour_match)
# [[1]]
# [1] "blue" "red" 
# 
# [[2]]
# [1] "green" "red"  
# 
# [[3]]
# [1] "orange" "red"  

x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
#    [,1] [,2] [,3]
# [1,] "a"  ""   ""  
# [2,] "a"  "b"  ""  
# [3,] "a"  "b"  "c" 

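# The number of columns equals the largest number of matches in any single string
# Rows with fewer matches are padded with empty strings ("")
unlist(str_extract_all(sentences,'\\b[A-Za-z]*ing\\b'))
# unlist() drops the empty results from sentences without a match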

- Grouped matches

  • Comparing str_subset, str_extract and str_match
# [^ ]+ is a run of non-space characters, i.e. the next word
noun <- "(a|the) ([^ ]+)"

# Keep the whole strings that match; returns a character vector
results <- sentences %>%
  str_subset(noun)
# [1] "The birch canoe slid on the smooth planks."               
# [2] "Glue the sheet to the dark blue background."              
# [3] "It's easy to tell the depth of a well."                   
# [4] "These days a chicken leg is a rare dish." 
# ....

# Extract every match; returns a list (or a matrix with simplify = TRUE), which unlist() flattens
results %>% # sentences works here too
  str_extract_all(noun) %>% 
  unlist() %>% 
  .[1:5]
# [1] "the smooth" "the sheet"  "the dark"   "the depth"  "a well."


# Besides the full match, str_match() also returns each parenthesised group, as a matrix
results %>% 
  str_match(noun) %>% 
  head(3)
#       [,1]         [,2]  [,3]    
# [1,] "the smooth" "the" "smooth"
# [2,] "the sheet"  "the" "sheet" 
# [3,] "the depth"  "the" "depth" 
# Like str_extract(), it returns only the first match in each string

# Use str_match_all() to get every match and its groups for each string
results %>% 
  str_match_all(noun) %>% 
  .[[2]]
#       [,1]        [,2]  [,3]   
# [1,] "the sheet" "the" "sheet"
# [2,] "the dark"  "the" "dark" 
  • tidyr::extract, a data-frame analogue of str_match

If your data is in a tibble, it’s often easier to use tidyr::extract(). It works like str_match() but requires you to name the matches, which are then placed in new columns

tibble(sentence = sentences) %>% 
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)", 
    remove = FALSE # keep the original sentence column (TRUE would drop it)
  ) %>% 
  filter(!is.na(article))

- Replacing matches

str_replace() and str_replace_all()

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-"  "p--r"   "b-n-n-"

str_replace_all() can apply several replacements at once through a named vector

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house"    "two cars"     "three people"
  • Backreferences can also be used in the replacement
    The example below swaps the second and third words
sentences %>% 
  head(3)
# [1] "The birch canoe slid on the smooth planks." 
# [2] "Glue the sheet to the dark blue background."
# [3] "It's easy to tell the depth of a well." 

sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(3)
# [1] "The canoe birch slid on the smooth planks."
# [2] "Glue sheet the to the dark blue background."
# [3] "It's to easy tell the depth of a well."

sentences %>% 
  str_replace_all("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(3)
# [1] "The canoe birch slid the on smooth planks." 
# [2] "Glue sheet the to dark the blue background."
# [3] "It's to easy tell depth the of well. a"   
  • Q: Swap the first and last letters of every word in words. Which of the resulting strings are still in words?
req <- '^([A-Za-z])(.*)([A-Za-z])$'

results <- str_replace_all(words, req, '\\3\\2\\1')

results[results %in% words]
# or take the intersection
intersect(words,results)

- Splitting

str_split() returns a list

sentences %>%
  head(3) %>% 
  str_split(" ")
#> [[1]]
#> [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
#> [8] "planks."
#> 
#> [[2]]
#> [1] "Glue"        "the"         "sheet"       "to"          "the"        
#> [6] "dark"        "blue"        "background."
#> 
#> [[3]]
#> [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."

If you’re working with a length-1 vector, the easiest thing is to just extract the first element of the list

"a|b|c|d" %>% 
  str_split("\\|") %>% 
  .[[1]]

#> [1] "a" "b" "c" "d"
  • Other arguments of str_split
# simplify = TRUE returns a matrix
sentences %>%
  head(5) %>% 
  str_split(" ", simplify = TRUE)
#>      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]         
#> [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."    
#> [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background."
#> [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"          
#> [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"       
#> [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""           
#>      [,9]   
#> [1,] ""     
#> [2,] ""     
#> [3,] "well."
#> [4,] "dish."
#> [5,] ""


# n sets the maximum number of pieces to return
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
#>      [,1]      [,2]    
#> [1,] "Name"    "Hadley"
#> [2,] "Country" "NZ"    
#> [3,] "Age"     "35"
  • Splitting into words with boundary()
x <- "This is a sentence.  This is another sentence."

str_split(x, " ")[[1]]
#> [1] "This"      "is"        "a"         "sentence." ""          "This"     
#> [7] "is"        "another"   "sentence."
str_split(x, boundary("word"))[[1]]
#> [1] "This"     "is"       "a"        "sentence" "This"     "is"       "another" 
#> [8] "sentence"

# boundary("word") identifies the words and leaves out spaces and punctuation
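boundary() is not limited to str_split(). As a short sketch on the same sentence, it can be passed to str_count() and str_extract_all() to count or pull out words directly.

```r
library(stringr)

text <- "This is a sentence.  This is another sentence."

# Count the words without splitting first.
n_words <- str_count(text, boundary("word"))              # 8

# Extract the words themselves; punctuation and spaces are left out.
found_words <- str_extract_all(text, boundary("word"))[[1]]
```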
  • Splitting on the empty string ""
    "" splits a string into its individual characters
x <- c("apples, pears, and bananas")
str_split(x, "")[[1]]
# [1] "a" "p" "p" "l" "e" "s" "," " " "p" "e" "a" "r" "s" "," " " "a" "n" "d" " " "b" "a"
# [22] "n" "a" "n" "a" "s"

- Find matches

  • str_locate
    Returns the start and end positions of each match; the _all variant returns every match
?str_locate

fruit <- c("apple", "banana", "pear", "pineapple")
str_locate(fruit, "$")
str_locate(fruit, "a")
str_locate(fruit, "e")
str_locate(fruit, c("a", "b", "p", "p"))

str_locate_all(fruit, "a")
str_locate_all(fruit, "e")
str_locate_all(fruit, c("a", "b", "p", "p"))

# Find location of every character
str_locate_all(fruit, "")
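The positions returned by str_locate() can be fed straight into str_sub(). The sketch below uses the first run of vowels as an arbitrary pattern and checks that locating and then slicing gives the same result as str_extract().

```r
library(stringr)

fruit_names <- c("apple", "banana", "pear", "pineapple")

# A two-column integer matrix with columns "start" and "end".
loc <- str_locate(fruit_names, "[aeiou]+")

# Slice each string at the located positions.
pieces <- str_sub(fruit_names, loc[, "start"], loc[, "end"])
# "a" "a" "ea" "i"

same_as_extract <- identical(pieces, str_extract(fruit_names, "[aeiou]+"))
```

Keeping the positions is useful when the surrounding text matters, for example to take a few characters of context on either side of the match.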
  • str_sub
    Extracts substrings by position
hw <- "Hadley Wickham"

str_sub(hw, 1, 6)
str_sub(hw, end = 6)
# [1] "Hadley"

str_sub(hw, -7)
# [1] "Wickham"
str_sub(hw, end = -9)
# [1] "Hadley"

str_sub('XiChen', seq_len(str_length('XiChen')))
# [1] "XiChen" "iChen"  "Chen"   "hen"    "en"     "n"
str_sub('XiChen', end = seq_len(str_length('XiChen')))
# [1] "X"      "Xi"     "XiC"    "XiCh"   "XiChe"  "XiChen"

# str_sub() can also be used on the left of <- to replace a substring
x <- "XiChen"
str_sub(x, 1, 1) <- "A"; x
# [1] "AiChen"
str_sub(x, -1, -1) <- "K"; x
# [1] "AiCheK"
str_sub(x, -2, -2) <- "GHIJ"; x
# [1] "AiChGHIJK"
str_sub(x, 2, -2) <- ""; x
# [1] "AK"

# See the help page for further options

Other types of pattern

When you use a pattern that’s a string, it’s automatically wrapped into a call to regex()

# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))
  • regex() accepts ignore_case = TRUE for case-insensitive matching
bananas <- c("banana", "Banana", "BANANA")
str_subset(bananas,'banana')
# [1] "banana"
str_subset(bananas, regex("banana", ignore_case = TRUE))
# [1] "banana" "Banana" "BANANA"
  • multiline = TRUE allows ^ and $ to match the start and end of each line
x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
#> [1] "Line"

str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
#> [1] "Line" "Line" "Line"
  • comments = TRUE lets the pattern contain whitespace and # comments
phone <- regex("
  \\(?     # optional opening parens
  (\\d{3}) # area code
  [) -]?   # optional closing parens, space, or dash
  (\\d{3}) # another three numbers
  [ -]?    # optional space or dash
  (\\d{3}) # three more numbers
  ", comments = TRUE)

str_match("514-791-8141", phone)
#>      [,1]          [,2]  [,3]  [,4] 
#> [1,] "514-791-814" "514" "791" "814"
  • dotall = TRUE
    Lets . match every character, including \n
x <- "Line 1\nLine 2\nLine 3"

str_extract(x, '.*')
# [1] "Line 1"
str_extract(x, regex('.*',dotall = TRUE))
# [1] "Line 1\nLine 2\nLine 3"
  • fixed

fixed(): matches exactly the specified sequence of bytes. It ignores all special regular expressions and operates at a very low level.

str_subset(c("a\\b", "ab"), "\\\\")
# [1] "a\\b"

# fixed() avoids the regex-level escaping and is also faster
str_subset(c("a\\b", "ab"), fixed("\\"))
# [1] "a\\b"
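The practical effect of fixed() is clearest with metacharacters. In this sketch, on made-up inputs, the same pattern text is interpreted first as a regex and then as a literal string.

```r
library(stringr)

samples <- c("a.c", "abc", "a+c")

# As a regex, the dot matches any single character.
as_regex <- str_detect(samples, "a.c")            # TRUE TRUE  TRUE

# As a fixed string, the dot is just a dot.
as_fixed <- str_detect(samples, fixed("a.c"))     # TRUE FALSE FALSE

# Metacharacters such as + need no escaping inside fixed().
plus_hits <- str_detect(samples, fixed("+"))      # FALSE FALSE TRUE
```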

Other uses of regular expressions
