Working with strings and regular expressions in stringr

1. Installing and loading the R packages

if(!require("tidyverse"))install.packages("tidyverse")
if(!require("stringr"))install.packages("stringr")
library(tidyverse)
library(stringr)

2. String basics

2.1 String length

str_length()

e.g.

str_length(c("a", "R for data science", NA))
#> [1]  1 18 NA

2.2 Combining strings

str_c(..., sep = "", collapse = NULL)

2.2.1 Combining two or more strings

e.g.1

str_c("x", "y")
#> [1] "xy"
str_c("x", "y", "z")
#> [1] "xyz"

e.g.2

str_c("x", "y", sep = ", ")
#> [1] "x, y"

e.g.3 Note that str_c() is vectorised, so length-1 arguments are recycled to match the longest vector

str_c("prefix-", c("a", "b", "c"), "-suffix")
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

2.2.2 Collapsing a character vector into a single string

str_c(c("x", "y", "z"), collapse = ". ")
#> [1] "x. y. z"

2.3 Subsetting strings

str_sub(string, start = 1L, end = -1L)

1. Extracting part of a string

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
#> [1] "App" "Ban" "Pea"
# negative numbers count backwards from the end
str_sub(x, -3, -1)
#> [1] "ple" "ana" "ear"

2. Assigning to the extracted part

str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
#> [1] "apple"  "banana" "pear"

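2.4 Exercises

str_trim(string, side = c("both", "left", "right"))

Purpose: removes whitespace from the start and/or end of a string

str_pad(string, width, side = c("left", "right", "both"), pad = " ")

Purpose: pads a string to a given width by adding whitespace (or another pad character)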
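
As a quick illustration of the two functions above, here is a minimal sketch; the input strings are made up for the example.

```r
library(stringr)

# str_trim() removes whitespace only from the ends, never from the middle
str_trim("  a b  ")
#> [1] "a b"
str_trim("  a b  ", side = "left")
#> [1] "a b  "

# str_pad() pads up to `width`; strings already that long are left unchanged
str_pad("a", width = 3)
#> [1] "  a"
str_pad("a", width = 3, side = "both", pad = "-")
#> [1] "-a-"
str_pad("abcd", width = 3)
#> [1] "abcd"
```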

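3. Regular expressions

3.1 Basic matches

Key points:

  1. . matches any character (except a newline)
  2. To match a literal ., escape it with \: the regular expression is \. and, written as an R string, "\\."
  3. A literal \ is \\ as a regular expression and "\\\\" as an R string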

e.g.

str_view(c("abc", "a.c", "bef"), "a\\.c")
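
The double escaping is easier to see with a small experiment. The sketch below uses made-up strings; writeLines() prints the string as the regex engine will see it.

```r
library(stringr)

# In R source, "a\\b" is the three-character string a\b
x <- "a\\b"
writeLines(x)
#> a\b
str_length(x)
#> [1] 3

# The regex \\ (one literal backslash) has to be typed as "\\\\" in R
str_detect(x, "\\\\")
#> [1] TRUE

# An unescaped . matches any character; an escaped one matches only a dot
str_detect(c("abc", "a.c"), "a.c")
#> [1] TRUE TRUE
str_detect(c("abc", "a.c"), "a\\.c")
#> [1] FALSE  TRUE
```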

3.2 Anchors

Key points:

  1. ^ matches the start of the string
  2. $ matches the end of the string
  3. \b matches the boundary between words
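
A short sketch of the three anchors, using a made-up vector:

```r
library(stringr)

x <- c("apple pie", "pineapple", "apple")

str_detect(x, "^apple")        # starts with "apple"
#> [1]  TRUE FALSE  TRUE
str_detect(x, "apple$")        # ends with "apple"
#> [1] FALSE  TRUE  TRUE
str_detect(x, "^apple$")       # is exactly "apple"
#> [1] FALSE FALSE  TRUE
str_detect(x, "\\bapple\\b")   # contains "apple" as a whole word
#> [1]  TRUE FALSE  TRUE
```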

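3.3 Exercise

How would you match the literal string "$^$"?

Approach:

  1. You need escape characters to tell the regex engine that you want the literal characters, not their special meanings, so the regular expression is \$\^\$
  2. Regular expressions are written as strings, and \ is also the escape character in strings, so each backslash needs one more escape: "\\$\\^\\$"
  3. Finally, add anchors so that only the whole string matches: "^\\$\\^\\$$"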
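
The result can be checked directly. In this sketch the second test string is made up to show what the anchors exclude.

```r
library(stringr)

pattern <- "^\\$\\^\\$$"
writeLines(pattern)   # the regular expression as the engine sees it
#> ^\$\^\$$

# With anchors, only the exact string "$^$" matches
str_detect(c("$^$", "ab$^$sfas"), pattern)
#> [1]  TRUE FALSE

# Without anchors, "$^$" is also found inside a longer string
str_detect("ab$^$sfas", "\\$\\^\\$")
#> [1] TRUE
```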

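3.4 Character classes and alternatives

Key points:

  • \d matches any digit
  • \s matches any whitespace character (such as a space, tab, or newline)
  • [abc] matches a, b, or c
  • [^abc] matches any character except a, b, or c

Remember! Because \ is also the escape character in strings, the regular expression \d has to be typed with an escaped backslash, that is, as "\\d"

You can also use alternation to choose between several alternative patterns

Note: | has very low precedence!

abc|xyz matches abc or xyz, not abcyz or abxyz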
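
The points above can be tried out as follows. This is a small sketch with made-up inputs; parentheses are used to limit how far | reaches.

```r
library(stringr)

str_extract("Room 42", "\\d")          # first digit
#> [1] "4"
str_detect(c("a b", "ab"), "\\s")      # contains whitespace?
#> [1]  TRUE FALSE
str_extract(c("cabd", "xyz"), "[abc]") # first of a, b, or c
#> [1] "c" NA
str_extract("cabd", "[^abc]")          # first character that is not a, b, or c
#> [1] "d"

# | applies to everything on either side unless parentheses limit it
x <- c("abc", "xyz", "abcyz", "abxyz")
str_detect(x, "^(abc|xyz)$")
#> [1]  TRUE  TRUE FALSE FALSE
str_detect(x, "ab(c|x)yz")
#> [1] FALSE FALSE  TRUE  TRUE
```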

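3.5 Repetition

A powerful feature of regular expressions is controlling how many times a pattern may match

Key point 1:

  • ?: 0 or 1 time
  • +: 1 or more times
  • *: 0 or more times

Note:

1. Each one repeats only the single item immediately before it (one character, unless you group with parentheses)!

2. These operators have very high precedence

Key point 2:

  • {n}: exactly n times
  • {n,}: n or more times
  • {,m}: at most m times (if the regex engine rejects this form, {0,m} is equivalent)
  • {n,m}: between n and m times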
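
To make the quantifiers concrete, here is a minimal sketch with made-up strings. It also shows why parentheses are needed to repeat more than one character.

```r
library(stringr)

# ? makes the preceding character optional
str_detect(c("color", "colour", "colouur"), "^colou?r$")
#> [1]  TRUE  TRUE FALSE

# + repeats only the "a"; parentheses make it repeat the whole "na"
str_extract("banana", "na+")
#> [1] "na"
str_extract("banana", "(na)+")
#> [1] "nana"

# Explicit counts
str_extract("aaaa", "a{2}")
#> [1] "aa"
str_extract("aaaa", "a{2,}")
#> [1] "aaaa"
str_extract("aaaa", "a{2,3}")
#> [1] "aaa"
```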

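Key point 3:

By default, matching is "greedy": the regular expression matches the longest string possible.

Adding a ? after a quantifier makes the match "lazy", so that it matches the shortest string possible.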

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, 'C{2,3}?')
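
Since str_view() only highlights the match, the difference is easier to compare with str_extract(). The sketch below reuses the same string x.

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# Greedy: take as many C's as allowed
str_extract(x, "C{2,3}")
#> [1] "CCC"
# Lazy: stop as soon as the minimum is satisfied
str_extract(x, "C{2,3}?")
#> [1] "CC"

# The same contrast with +
str_extract(x, "C[LX]+")
#> [1] "CLXXX"
str_extract(x, "C[LX]+?")
#> [1] "CL"
```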

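3.6 Grouping and backreferences

Parentheses serve two purposes:

1. They remove ambiguity in complex expressions by making precedence explicit

2. They define "groups", which can then be referred to with backreferences such as \1, \2, and so on.

Note:

Each pair of parentheses is one group, numbered in the order in which the groups open: \1 refers back to the text matched by the first group, \2 to the second, and \3 to the third.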

str_view(fruit, "(..)\\1", match = TRUE)
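
To see what the backreference actually captures, the sketch below runs the same pattern on a small made-up vector instead of the fruit dataset.

```r
library(stringr)

x <- c("banana", "coconut", "apple", "papaya")

# (..)\1 : any two characters immediately followed by the same two characters
str_subset(x, "(..)\\1")
#> [1] "banana"  "coconut" "papaya"
str_extract(x, "(..)\\1")
#> [1] "anan" "coco" NA     "papa"

# Two groups referenced in reverse order: a pattern like "abba"
str_subset(c("abba", "abab"), "(.)(.)\\2\\1")
#> [1] "abba"
```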

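4. Tools

Several stringr functions let you apply regular expressions to real problems:

  • Determine which strings match a pattern: str_detect
  • Find the positions of matches: str_locate
  • Extract the content of matches: str_extract
  • Replace matches with new values: str_replace
  • Split a string based on a match: str_split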
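
Finding match positions is not covered in the sections below, so here is a minimal sketch using str_locate() from stringr on a made-up vector. It returns an integer matrix of start and end positions, with NA where there is no match.

```r
library(stringr)

x <- c("apple", "banana", "pear")

loc <- str_locate(x, "an")   # position of the first match in each string
loc
#>      start end
#> [1,]    NA  NA
#> [2,]     2   3
#> [3,]    NA  NA

# The positions can be passed straight to str_sub()
str_sub(x[2], loc[2, "start"], loc[2, "end"])
#> [1] "an"
```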

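4.1 Detecting matches

To determine whether each element of a character vector matches a pattern, use str_detect(). It returns a logical vector the same length as the input.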

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1]  TRUE FALSE  TRUE

Logical subsetting with str_detect() gives the same result as str_subset()

words[str_detect(words, "x$")]
#> [1] "box" "sex" "six" "tax"

The str_subset() function

str_subset(words, "x$")
#> [1] "box" "sex" "six" "tax"

Key point:

In practice, however, strings are usually a column of a data frame, in which case you can use filter()

df <- tibble(
  word = words, 
  i = seq_along(word)
)
df %>% 
  filter(str_detect(word, "x$"))
#> # A tibble: 4 x 2
#>   word      i
#>   <chr> <int>
#> 1 box     108
#> 2 sex     747
#> 3 six     772
#> 4 tax     841

A variation on str_detect() is str_count(): rather than a simple yes or no, it tells you how many matches there are in each string

str_count()

x <- c("apple", "banana", "pear")
str_count(x, "a")
#> [1] 1 3 1

str_count() works naturally with mutate():

Count the vowels and consonants in each word of the word column

df %>% 
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )
#> # A tibble: 980 x 4
#>   word         i vowels consonants
#>   <chr>    <int>  <int>      <int>
#> 1 a            1      1          0
#> 2 able         2      2          2
#> 3 about        3      3          2
#> 4 absolute     4      4          4
#> 5 accept       5      2          4
#> 6 account      6      3          4
#> # … with 974 more rows
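
Because a logical vector is treated as 0s and 1s in arithmetic, sum() and mean() of str_detect() give the number and proportion of matching strings. The following sketch uses a made-up vector and also shows that the matches counted by str_count() never overlap.

```r
library(stringr)

x <- c("apple", "banana", "pear")

# How many strings start with a vowel?
sum(str_detect(x, "^[aeiou]"))
#> [1] 1

# What proportion of strings end with "a"?
mean(str_detect(x, "a$"))
#> [1] 0.3333333

# Matches do not overlap: "aba" is counted twice here, not three times
str_count("abababa", "aba")
#> [1] 2
```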

4.2 Extracting matches

We practise on sentences, a dataset built into stringr

length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks." 
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."     
#> [4] "These days a chicken leg is a rare dish."   
#> [5] "Rice is often served in round bowls."       
#> [6] "The juice of lemons makes fine punch."

1. Create a vector of colour names

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
#> [1] "red|orange|yellow|green|blue|purple"

2. Select the sentences that contain a colour

has_colour <- str_subset(sentences, colour_match)

3. Extract the colour from each of those sentences to see which colours appear

matches <- str_extract(has_colour, colour_match)
head(matches)
#> [1] "blue" "blue" "red"  "red"  "red"  "blue"

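Note:

str_extract() extracts only the first match in each sentence! (A sentence may contain two or more colour words.)

This is a common pattern for stringr functions, because working with a single match allows much simpler data structures. To get all matches, use str_extract_all(), which returns a list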

str_extract_all(has_colour, colour_match) %>%
    .[1:3]
#> [[1]]
#> [1] "blue"
#> 
#> [[2]]
#> [1] "blue"
#> 
#> [[3]]
#> [1] "red"

If you set simplify = TRUE, the function returns a matrix in which shorter matches are padded with empty strings to the length of the longest

x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
#>      [,1] [,2] [,3]
#> [1,] "a"  ""   ""  
#> [2,] "a"  "b"  ""  
#> [3,] "a"  "b"  "c"
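
The difference between the first match and all matches shows up most clearly on strings that contain more than one colour. The sketch below uses three made-up strings rather than the sentences dataset.

```r
library(stringr)

colour_match <- "red|orange|yellow|green|blue|purple"
x <- c("a red and blue flag", "a green hill", "no colour here")

# One result per input string: the first match, or NA if there is none
str_extract(x, colour_match)
#> [1] "red"   "green" NA

# One list element per input string, holding every match
all_matches <- str_extract_all(x, colour_match)
all_matches[[1]]
#> [1] "red"  "blue"
lengths(all_matches)
#> [1] 2 1 0
```

Note that the pattern has no word boundaries, so it would also match colour names embedded in longer words; wrapping it as "\\b(red|...)\\b" is one way to avoid that.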

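4.3 Grouped matches

Continuing with the built-in sentences dataset, suppose we want to extract nouns from the sentences. As a heuristic, a noun usually follows a or the, so we look for a or the, then a space, then one or more non-space characters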

noun <- "(a|the) ([^ ]+)"

has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>% 
  str_extract(noun)
#>  [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked"
#>  [6] "the sun"    "the huge"   "the ball"   "the woman"  "a helps"

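str_extract() gives the complete match, whereas str_match() gives each individual group (the content of each pair of parentheses). Instead of a vector it returns a matrix: the first column holds the complete match, followed by one column for each group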

str_match()

has_noun %>% 
  str_match(noun)
#>       [,1]         [,2]  [,3]     
#>  [1,] "the smooth" "the" "smooth" 
#>  [2,] "the sheet"  "the" "sheet"  
#>  [3,] "the depth"  "the" "depth"  
#>  [4,] "a chicken"  "a"   "chicken"
#>  [5,] "the parked" "the" "parked" 
#>  [6,] "the sun"    "the" "sun"    
#>  [7,] "the huge"   "the" "huge"   
#>  [8,] "the ball"   "the" "ball"   
#>  [9,] "the woman"  "the" "woman"  
#> [10,] "a helps"    "a"   "helps"

As with str_extract(), if you want all the matches in each string, you need str_match_all()
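
A minimal sketch of str_match_all() with the same pattern, applied to a made-up sentence that contains two article-plus-word pairs. It returns a list with one matrix per input string and one row per match.

```r
library(stringr)

noun <- "(a|the) ([^ ]+)"
x <- "the cat chased a mouse"

m <- str_match_all(x, noun)[[1]]
m
#>      [,1]      [,2]  [,3]   
#> [1,] "the cat" "the" "cat"  
#> [2,] "a mouse" "a"   "mouse"

# The third column holds the second group, i.e. the word after the article
m[, 3]
#> [1] "cat"   "mouse"
```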

tidyr::extract( data, col, into, regex = "([[:alnum:]]+)", remove = TRUE, convert = FALSE, ...)

If your data is in a tibble, tidyr::extract() is often easier. It works like str_match(), but requires you to name each group, and it places the groups in new columns of the tibble

tibble(sentence = sentences) %>%
  extract(sentence, 
  into = c("article", "noun"), 
  "(a|the) ([^ ]+)",
  remove = FALSE)
#> # A tibble: 720 x 3
#>   sentence                                    article noun   
#>   <chr>                                       <chr>   <chr>  
#> 1 The birch canoe slid on the smooth planks.  the     smooth 
#> 2 Glue the sheet to the dark blue background. the     sheet  
#> 3 It's easy to tell the depth of a well.      the     depth  
#> 4 These days a chicken leg is a rare dish.    a       chicken
#> 5 Rice is often served in round bowls.        <NA>    <NA>   
#> 6 The juice of lemons makes fine punch.       <NA>    <NA>   
#> # … with 714 more rows

4.4 Replacing matches

str_replace() and str_replace_all() replace matches with new strings

str_replace(string, pattern, replacement)

1. Replace matches with a fixed string

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-"  "p--r"   "b-n-n-"

2. With a named vector, str_replace_all() can perform multiple replacements at once:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house"    "two cars"     "three people"

3. Use backreferences to insert components of the match. The code below swaps the second and third words:

sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(5)
#> [1] "The canoe birch slid on the smooth planks." 
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."     
#> [4] "These a days chicken leg is a rare dish."   
#> [5] "Rice often is served in round bowls."
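
The same technique on a smaller, made-up input may be easier to follow: two groups are captured and written back in the opposite order.

```r
library(stringr)

x <- c("Doe, John", "Smith, Jane")

# \\1 is the surname (first group), \\2 the given name (second group)
str_replace(x, "(\\w+), (\\w+)", "\\2 \\1")
#> [1] "John Doe"   "Jane Smith"
```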

4.5 Splitting

str_split() splits a string into pieces.

str_split(string, pattern, n = Inf, simplify = FALSE)

Example:

sentences %>%
  head(5) %>% 
  str_split(" ")
#> [[1]]
#> [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
#> [8] "planks."
#> 
#> [[2]]
#> [1] "Glue"        "the"         "sheet"       "to"          "the"        
#> [6] "dark"        "blue"        "background."
#> 
#> [[3]]
#> [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."
#> 
#> [[4]]
#> [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
#> [8] "rare"    "dish."  
#> 
#> [[5]]
#> [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

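Because each element of the character vector may contain a different number of pieces, str_split() returns a list. If you are splitting a vector of length 1, simply extract the first element of the list: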

"a|b|c|d" %>% 
  str_split("\\|") %>% 
  .[[1]]
#> [1] "a" "b" "c" "d"

Alternatively, set simplify = TRUE to return a matrix

sentences %>%
  head(5) %>% 
  str_split(" ", simplify = TRUE)
#>      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]         
#> [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."    
#> [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background."
#> [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"          
#> [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"       
#> [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""           
#>      [,9]   
#> [1,] ""     
#> [2,] ""     
#> [3,] "well."
#> [4,] "dish."
#> [5,] ""

You can also set a maximum number of pieces:

The code below splits each string into at most 2 pieces

fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
#>      [,1]      [,2]    
#> [1,] "Name"    "Hadley"
#> [2,] "Country" "NZ"    
#> [3,] "Age"     "35"

Instead of a pattern, you can use boundary() to split strings at character, line, sentence, and word boundaries.

boundary(type = c("character", "line_break", "sentence", "word"),  skip_word_none = NA, ...)

e.g.

x <- "This is a sentence.  This is another sentence."

str_split(x, " ")[[1]]
#> [1] "This"      "is"        "a"         "sentence." ""          "This"     
#> [7] "is"        "another"   "sentence."
str_split(x, boundary("word"))[[1]]
#> [1] "This"     "is"       "a"        "sentence" "This"     "is"       "another" 
#> [8] "sentence"

Comparing the two results, splitting with boundary("word") is cleaner: it produces no empty strings and drops the punctuation.

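5. Other uses of regular expressions

Two commonly used base R functions also accept regular expressions:

  • apropos() searches all objects available from the global environment (that is, everything on the search path). It is handy when you cannot quite remember a function's name. For example: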
apropos("replace")
#> [1] "%+replace%"       "replace"          "replace_na"       "setReplaceMethod"
#> [5] "str_replace"      "str_replace_all"  "str_replace_na"   "theme_replace"
  • dir() lists all the files in a directory.

The pattern argument of dir() takes a regular expression, in which case it returns only the file names that match. For example:

dir(pattern = "\\.csv$")
#> [1] "clinical.csv"       "raw_ntnbr.csv"      "raw_tnbr.csv"      
#> [4] "total_clinical.csv"
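
The listing above depends on the author's working directory. A reproducible sketch is given below; the temporary directory and the file names in it are made up for the illustration. It also shows why the escaped dot and the $ anchor matter.

```r
# Create a throwaway directory with a few empty files (hypothetical names)
tmp <- tempfile("regex-demo-")
dir.create(tmp)
invisible(file.create(file.path(tmp, c("a.csv", "b.csv", "notes.txt", "csv_readme.md"))))

# Only names that end in ".csv"
csv_only <- dir(tmp, pattern = "\\.csv$")
csv_only
#> [1] "a.csv" "b.csv"

# A looser pattern also picks up names that merely contain "csv"
csv_anywhere <- dir(tmp, pattern = "csv")
csv_anywhere
#> [1] "a.csv"         "b.csv"         "csv_readme.md"

unlink(tmp, recursive = TRUE)   # clean up
```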

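Everything in this chapter is important. For reference, see the English edition of R for Data Science: https://r4ds.had.co.nz/strings.html

and the solutions to its exercises: https://jrnold.github.io/r4ds-exercise-solutions/strings.html#splitting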
