I. Installing and loading the R packages
if(!require("tidyverse"))install.packages("tidyverse")
if(!require("stringr"))install.packages("stringr")
library(tidyverse)
library(stringr)
II. String basics
2.1 String length
str_length()
e.g.
str_length(c("a", "R for data science", NA))
#> [1] 1 18 NA
2.2 Combining strings
str_c(..., sep = "", collapse = NULL)
2.2.1 Combining two or more strings
e.g.1
str_c("x", "y")
#> [1] "xy"
str_c("x", "y", "z")
#> [1] "xyz"
e.g.2
str_c("x", "y", sep = ", ")
#> [1] "x, y"
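e.g.3 Note that str_c() is vectorised, so shorter arguments (here the length-one strings) are recycled to the length of the longest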
str_c("prefix-", c("a", "b", "c"), "-suffix")
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
2.2.2 Collapsing a character vector into a single string
str_c(c("x", "y", "z"), collapse = ". ")
#> [1] "x. y. z"
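As a small supplementary sketch (the input vectors here are invented for illustration), sep and collapse can be combined in one call, and a missing value propagates through str_c():

```r
library(stringr)

# sep joins the arguments element by element; collapse then joins the results
pairs <- str_c(c("a", "b"), c("x", "y"), sep = "-", collapse = ", ")
pairs
#> [1] "a-x, b-y"

# NA is contagious: one missing piece makes the combined string NA
with_na <- str_c("x", NA)
with_na
#> [1] NA

# str_replace_na() converts NA to the literal string "NA" first
no_na <- str_c("x", str_replace_na(NA))
no_na
#> [1] "xNA"
```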
2.3 Subsetting strings
str_sub(string, start = 1L, end = -1L)
1. Extract part of a string
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
#> [1] "App" "Ban" "Pea"
# negative numbers count backwards from the end of the string
str_sub(x, -3, -1)
#> [1] "ple" "ana" "ear"
2. Modify part of a string by assigning to str_sub()
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
#> [1] "apple" "banana" "pear"
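2.4 Exercises
str_trim(string, side = c("both", "left", "right"))
Purpose: removes whitespace from the start and/or end of a string
str_pad(string, width, side = c("left", "right", "both"), pad = " ")
Purpose: pads a string to a given width by adding whitespace (or another pad character)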
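The notes give only the signatures of these two functions, so here is a minimal sketch of how they behave. The input strings are made up for illustration.

```r
library(stringr)

# str_trim() strips whitespace from both ends by default
trimmed <- str_trim("  abc  ")
trimmed
#> [1] "abc"

# side = "left" leaves the trailing whitespace alone
left_only <- str_trim("  abc  ", side = "left")
left_only
#> [1] "abc  "

# str_pad() pads on the left by default; pad can be any single character
zero_padded <- str_pad("7", width = 3, pad = "0")
zero_padded
#> [1] "007"

# side = "both" centres the string within the requested width
centred <- str_pad("abc", width = 7, side = "both")
centred
#> [1] "  abc  "
```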
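III. Regular expressions
3.1 Basic matches
Key points:
. matches any character (except a newline)
- To match a literal ., escape it with a backslash \. The regular expression is therefore
\.
which, written as a string, becomes "\\."
- A literal \ is matched by the regular expression
\\
which, written as a string, becomes "\\\\"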
e.g.
str_view(c("abc", "a.c", "bef"), "a\\.c")
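To see the two layers of escaping at work, the following sketch uses str_detect() instead of str_view() so that the results are plain logical vectors. writeLines() prints the regular expression that the regex engine actually receives.

```r
library(stringr)

x <- c("abc", "a.c", "bef")

# An unescaped . matches any character, so "abc" matches too
dot_any <- str_detect(x, "a.c")
dot_any
#> [1]  TRUE  TRUE FALSE

# The escaped form matches only a literal dot
dot_literal <- str_detect(x, "a\\.c")
dot_literal
#> [1] FALSE  TRUE FALSE

# The string "a\\.c" holds the regular expression a\.c
writeLines("a\\.c")
#> a\.c

# Four backslashes in the string are needed to match one literal backslash
backslash <- str_detect("a\\b", "\\\\")
backslash
#> [1] TRUE
```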
3.2 Anchors
Key points:
- ^ matches at the start of the string
- $ matches at the end of the string
- \b matches a word boundary
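A short sketch of the three anchors, using small invented vectors:

```r
library(stringr)

x <- c("apple", "banana", "pear")

starts_a <- str_detect(x, "^a")   # begins with a
starts_a
#> [1]  TRUE FALSE FALSE

ends_a <- str_detect(x, "a$")     # ends with a
ends_a
#> [1] FALSE  TRUE FALSE

# Using both anchors forces the pattern to match the complete string
whole <- str_detect(c("apple pie", "apple"), "^apple$")
whole
#> [1] FALSE  TRUE

# \b (written "\\b" in a string) keeps "sum" from matching inside "summary"
word_sum <- str_detect(c("summary", "the sum is"), "\\bsum\\b")
word_sum
#> [1] FALSE  TRUE
```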
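3.3 Exercise
How would you match the literal string "$^$"?
Approach:
- You need an escape character to tell the regular expression that you mean the literal characters $ and ^
rather than their special meanings, so the regular expression is
\$\^\$
- The regular expression is written as a string, and \ is also the escape character in strings, so each backslash needs one more escape:
"\\$\\^\\$"
- Finally, add the anchors so that only the complete string matches: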
"^\\$\\^\\$$"
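3.4 Character classes and alternatives
Key points:
- \d matches any digit
- \s matches any whitespace character (such as a space, tab, or newline)
- [abc] matches a, b, or c
- [^abc] matches any character except a, b, or c
Remember: because \ is also the escape character in strings, the \ in the regular expression \d must itself be escaped, so you type "\\d"
You can also use alternation (|) to choose between several alternative patterns
Note: | has very low precedence!
abc|xyz matches abc or xyz, not abcyz or abxyz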
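The character classes and the low precedence of | can be checked with a few one-liners. The strings below are invented for illustration.

```r
library(stringr)

# \d+ extracts the first run of digits
digits <- str_extract("order 66 shipped", "\\d+")
digits
#> [1] "66"

# \s matches the space and the tab alike
no_space <- str_replace_all("a b\tc", "\\s", "")
no_space
#> [1] "abc"

# [^abc] finds the first character that is not a, b or c
not_abc <- str_extract("abcd", "[^abc]")
not_abc
#> [1] "d"

# Parentheses limit the alternation to the single letter
grouped <- str_detect(c("grey", "day"), "gr(e|a)y")
grouped
#> [1]  TRUE FALSE

# Without them, the pattern means "gre" OR "ay", so "day" matches as well
ungrouped <- str_detect(c("grey", "day"), "gre|ay")
ungrouped
#> [1] TRUE TRUE
```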
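3.5 Repetition
A powerful feature of regular expressions is the ability to control how many times a pattern matches
Key point 1:
- ?: 0 or 1 time
- +: 1 or more times
- *: 0 or more times
Note:
1. A quantifier repeats only the single element immediately before it (one character, character class, or parenthesised group)!
2. These operators have very high precedence, so use parentheses to repeat more than one character
Key point 2:
- {n}: exactly n times
- {n,}: n or more times
- {,m}: at most m times (equivalently {0,m})
- {n,m}: between n and m times
Key point 3:
By default matching is "greedy": the regular expression matches the longest string possible.
Adding a ? after a quantifier makes it "lazy", so that it matches the shortest string possible.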
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, 'C{2,3}?')
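The same string can be used to compare the quantifiers side by side. This sketch uses str_extract(), which returns the first match as plain text, rather than str_view().

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

str_extract(x, "CC?")      # C followed by an optional second C
#> [1] "CC"
str_extract(x, "CC+")      # C followed by one or more C
#> [1] "CCC"
str_extract(x, "C[LX]+")   # C followed by one or more of L or X
#> [1] "CLXXX"

# Greedy: take as many C as the bounds allow
greedy <- str_extract(x, "C{2,3}")
greedy
#> [1] "CCC"

# Lazy: stop as soon as the lower bound is satisfied
lazy <- str_extract(x, "C{2,3}?")
lazy
#> [1] "CC"
```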
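3.6 Grouping and backreferences
Parentheses serve two purposes:
1. They remove ambiguity in complex expressions and make precedence explicit
2. They define "groups", which can then be referred to with backreferences such as \1, \2, and so on.
Note:
Each pair of parentheses defines one group. \1 refers back to the text matched by the first group (the first pair of parentheses), \2 to the second group, and \3 to the third group.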
str_view(fruit, "(..)\\1", match = TRUE)
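A backreference matches the same text that the group matched, not the same pattern. The sketch below uses a small hand-picked vector (named fruits here so it does not clash with stringr's built-in fruit dataset).

```r
library(stringr)

fruits <- c("banana", "coconut", "apple")

# (..)\1 : any two characters followed by the same two characters again
repeated_pair <- str_subset(fruits, "(..)\\1")
repeated_pair
#> [1] "banana"  "coconut"

# The matched text itself: "anan" in banana, "coco" in coconut
first_pair <- str_extract(fruits, "(..)\\1")
first_pair
#> [1] "anan" "coco" NA

# (.)\1 : a single character repeated, such as the "pp" in apple
doubled <- str_subset(fruits, "(.)\\1")
doubled
#> [1] "apple"
```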
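IV. Tools
This part covers several stringr functions that apply regular expressions to real problems:
- Determine which strings match a pattern
str_detect
- Find the positions of matches
str_locate
- Extract the content of matches
str_extract
- Replace matches with new values
str_replace
- Split a string based on a match
str_split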
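Finding match positions is not demonstrated later in these notes, so here is a minimal sketch with str_locate() and str_locate_all(). The input words are chosen only for illustration.

```r
library(stringr)

# str_locate() returns the start and end of the first match in each string
loc <- str_locate(c("banana", "pear"), "an")
loc
#>      start end
#> [1,]     2   3
#> [2,]    NA  NA

# str_locate_all() returns a list with one matrix of all matches per string
all_loc <- str_locate_all("banana", "an")[[1]]
nrow(all_loc)
#> [1] 2
```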
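4.1 Detecting matches
To determine whether each element of a character vector matches a pattern, use str_detect(). It returns a logical vector of the same length as the input.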
x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1] TRUE FALSE TRUE
Logical subsetting with str_detect() gives the same result as str_subset():
words[str_detect(words, "x$")]
#> [1] "box" "sex" "six" "tax"
The str_subset() function:
str_subset(words, "x$")
#> [1] "box" "sex" "six" "tax"
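The equivalence can be confirmed directly. Because TRUE counts as 1 in numeric contexts, sum() and mean() of the logical vector also give the number and the proportion of matches. The vector w below is a small invented sample rather than the built-in words dataset.

```r
library(stringr)

w <- c("box", "apple", "six", "tax", "exit")

by_index  <- w[str_detect(w, "x$")]   # logical subsetting
by_subset <- str_subset(w, "x$")      # convenience wrapper
identical(by_index, by_subset)
#> [1] TRUE

n_match <- sum(str_detect(w, "x$"))   # how many words end in x
n_match
#> [1] 3

share <- mean(str_detect(w, "x$"))    # what proportion end in x
share
#> [1] 0.6
```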
Key point:
In practice, strings are usually a column of a data frame, in which case you can use filter():
df <- tibble(
  word = words,
  i = seq_along(word)
)
df %>%
  filter(str_detect(word, "x$"))
#> # A tibble: 4 x 2
#> word i
#> <chr> <int>
#> 1 box 108
#> 2 sex 747
#> 3 six 772
#> 4 tax 841
A variation on str_detect() is str_count(): rather than a simple yes or no, it returns the number of matches in each string
str_count()
x <- c("apple", "banana", "pear")
str_count(x, "a")
#> [1] 1 3 1
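One detail worth knowing when counting: matches never overlap. A minimal sketch:

```r
library(stringr)

# "aba" could start at positions 1, 3 and 5, but once a match is consumed
# the search resumes after it, so only two matches are counted
overlap_count <- str_count("abababa", "aba")
overlap_count
#> [1] 2

overlap_matches <- str_extract_all("abababa", "aba")[[1]]
overlap_matches
#> [1] "aba" "aba"
```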
str_count() pairs naturally with mutate():
Count the number of vowels and consonants in each word
df %>%
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )
#> # A tibble: 980 x 4
#> word i vowels consonants
#> <chr> <int> <int> <int>
#> 1 a 1 1 0
#> 2 able 2 2 2
#> 3 about 3 3 2
#> 4 absolute 4 4 4
#> 5 accept 5 2 4
#> 6 account 6 3 4
#> # … with 974 more rows
4.2 Extracting matches
We practise on sentences, a dataset built into stringr
length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks."
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."
#> [4] "These days a chicken leg is a rare dish."
#> [5] "Rice is often served in round bowls."
#> [6] "The juice of lemons makes fine punch."
1. Create a vector of colour names and collapse it into a single regular expression
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
#> [1] "red|orange|yellow|green|blue|purple"
2. Select the sentences that contain a colour
has_colour <- str_subset(sentences, colour_match)
3. Extract the colour from each of those sentences to see which colours they contain
matches <- str_extract(has_colour, colour_match)
head(matches)
#> [1] "blue" "blue" "red" "red" "red" "blue"
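Note:
str_extract() extracts only the first match in each sentence! (A sentence may contain two or more colour words.)
This is a common pattern for stringr functions, because working with a single match allows a much simpler data structure. To get all matches, use str_extract_all(),
which returns a list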
str_extract_all(has_colour, colour_match) %>%
  .[1:3]
#> [[1]]
#> [1] "blue"
#>
#> [[2]]
#> [1] "blue"
#>
#> [[3]]
#> [1] "red"
If you set simplify = TRUE, a matrix is returned in which shorter matches are padded with empty strings to the length of the longest
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
#> [,1] [,2] [,3]
#> [1,] "a" "" ""
#> [2,] "a" "b" ""
#> [3,] "a" "b" "c"
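4.3 Grouped matches
Continuing with the built-in sentences dataset, suppose we want to extract nouns from the sentences. As a heuristic, a noun usually follows "a" or "the", so we look for a or the, then a space, then one or more non-space characters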
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>%
  str_extract(noun)
#> [1] "the smooth" "the sheet" "the depth" "a chicken" "the parked"
#> [6] "the sun" "the huge" "the ball" "the woman" "a helps"
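str_extract() gives the complete match, while str_match() gives each individual group (the content of each pair of parentheses). It returns a matrix rather than a character vector: the first column is the complete match, followed by one column for each group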
str_match()
has_noun %>%
str_match(noun)
#> [,1] [,2] [,3]
#> [1,] "the smooth" "the" "smooth"
#> [2,] "the sheet" "the" "sheet"
#> [3,] "the depth" "the" "depth"
#> [4,] "a chicken" "a" "chicken"
#> [5,] "the parked" "the" "parked"
#> [6,] "the sun" "the" "sun"
#> [7,] "the huge" "the" "huge"
#> [8,] "the ball" "the" "ball"
#> [9,] "the woman" "the" "woman"
#> [10,] "a helps" "a" "helps"
As with str_extract(), if you want all matches in each string, you need str_match_all()
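A minimal sketch of str_match_all() on a single made-up sentence that contains two article + word pairs. The result is a list with one matrix per input string, and each row of the matrix is one match.

```r
library(stringr)

noun <- "(a|the) ([^ ]+)"
all_pairs <- str_match_all("the cat and a dog", noun)[[1]]
all_pairs
#>      [,1]      [,2]  [,3] 
#> [1,] "the cat" "the" "cat"
#> [2,] "a dog"   "a"   "dog"
```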
tidyr::extract( data, col, into, regex = "([[:alnum:]]+)", remove = TRUE, convert = FALSE, ...)
If your data is in a tibble, tidyr::extract() is often easier to use. It works like str_match() but requires a name for each group, which it then places in new columns of the tibble
tibble(sentence = sentences) %>%
  extract(sentence,
          into = c("article", "noun"),
          "(a|the) ([^ ]+)",
          remove = FALSE)
#> # A tibble: 720 x 3
#> sentence article noun
#> <chr> <chr> <chr>
#> 1 The birch canoe slid on the smooth planks. the smooth
#> 2 Glue the sheet to the dark blue background. the sheet
#> 3 It's easy to tell the depth of a well. the depth
#> 4 These days a chicken leg is a rare dish. a chicken
#> 5 Rice is often served in round bowls. <NA> <NA>
#> 6 The juice of lemons makes fine punch. <NA> <NA>
#> # … with 714 more rows
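The behaviour for rows without a match is easier to see on a tiny tibble. The two sentences below are invented; the second contains neither "a " nor "the ", so both new columns are NA.

```r
library(tibble)
library(tidyr)

demo <- tibble(sentence = c("the cat sat", "dogs bark"))

result <- tidyr::extract(
  demo, sentence,
  into = c("article", "noun"),
  regex = "(a|the) ([^ ]+)",
  remove = FALSE            # keep the original sentence column
)

result$article
#> [1] "the" NA
result$noun
#> [1] "cat" NA
```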
4.4 Replacing matches
str_replace() and str_replace_all() replace matches with new strings.
str_replace(string, pattern, replacement)
1. Replace matches with a fixed string
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-" "p--r" "b-n-n-"
2. By supplying a named vector, str_replace_all() can perform several replacements at once:
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house" "two cars" "three people"
3. Use backreferences to insert groups from the match. The code below swaps the second and third words:
sentences %>%
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
  head(5)
#> [1] "The canoe birch slid on the smooth planks."
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."
#> [4] "These a days chicken leg is a rare dish."
#> [5] "Rice often is served in round bowls."
4.5 Splitting
str_split() splits a string into pieces.
str_split(string, pattern, n = Inf, simplify = FALSE)
Example:
sentences %>%
head(5) %>%
str_split(" ")
#> [[1]]
#> [1] "The" "birch" "canoe" "slid" "on" "the" "smooth"
#> [8] "planks."
#>
#> [[2]]
#> [1] "Glue" "the" "sheet" "to" "the"
#> [6] "dark" "blue" "background."
#>
#> [[3]]
#> [1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
#>
#> [[4]]
#> [1] "These" "days" "a" "chicken" "leg" "is" "a"
#> [8] "rare" "dish."
#>
#> [[5]]
#> [1] "Rice" "is" "often" "served" "in" "round" "bowls."
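Because each element of the character vector may contain a different number of pieces, str_split() returns a list. If you are splitting a vector of length 1, simply extract the first element of the list: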
"a|b|c|d" %>%
str_split("\\|") %>%
.[[1]]
#> [1] "a" "b" "c" "d"
Alternatively, set simplify = TRUE to get a matrix back
sentences %>%
head(5) %>%
str_split(" ", simplify = TRUE)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
#> [2,] "Glue" "the" "sheet" "to" "the" "dark" "blue" "background."
#> [3,] "It's" "easy" "to" "tell" "the" "depth" "of" "a"
#> [4,] "These" "days" "a" "chicken" "leg" "is" "a" "rare"
#> [5,] "Rice" "is" "often" "served" "in" "round" "bowls." ""
#> [,9]
#> [1,] ""
#> [2,] ""
#> [3,] "well."
#> [4,] "dish."
#> [5,] ""
You can also set a maximum number of pieces:
The code below splits each string into at most 2 pieces
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
#> [,1] [,2]
#> [1,] "Name" "Hadley"
#> [2,] "Country" "NZ"
#> [3,] "Age" "35"
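The n argument matters most when the separator can also appear inside the value. In this sketch (the input line is made up), limiting the split to two pieces keeps the remainder intact:

```r
library(stringr)

line <- "Note: see section 2: details"

# Without a limit, every ": " is a split point
parts_all <- str_split(line, ": ")[[1]]
parts_all
#> [1] "Note"          "see section 2" "details"

# With n = 2, only the first ": " splits; the rest stays in the last piece
parts_two <- str_split(line, ": ", n = 2)[[1]]
parts_two
#> [1] "Note"                   "see section 2: details"
```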
The boundary() function lets you split strings by character, line, sentence, or word boundaries.
boundary(type = c("character", "line_break", "sentence", "word"), skip_word_none = NA, ...)
e.g.
x <- "This is a sentence.  This is another sentence."
str_split(x, " ")[[1]]
#> [1] "This" "is" "a" "sentence." "" "This"
#> [7] "is" "another" "sentence."
str_split(x, boundary("word"))[[1]]
#> [1] "This" "is" "a" "sentence" "This" "is" "another"
#> [8] "sentence"
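Comparing the two results, splitting with boundary("word") is cleaner: it returns neither the empty string nor the trailing full stops that splitting on " " leaves behind.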
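boundary() is not limited to str_split(); it can be passed as the pattern to other stringr functions as well. A minimal sketch that counts and extracts words, assuming the example string with two spaces after the first full stop:

```r
library(stringr)

x <- "This is a sentence.  This is another sentence."

# Count the words directly instead of splitting and measuring the length
n_words <- str_count(x, boundary("word"))
n_words
#> [1] 8

# Extract the words themselves, without punctuation
words_only <- str_extract_all(x, boundary("word"))[[1]]
words_only
#> [1] "This"     "is"       "a"        "sentence" "This"     "is"       "another" 
#> [8] "sentence"
```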
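V. Other uses of regular expressions
Two commonly used base R functions also accept regular expressions
- apropos() searches all objects available from the global environment. It is handy when you cannot quite remember the name of a function, for example: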
apropos("replace")
#> [1] "%+replace%" "replace" "replace_na" "setReplaceMethod"
#> [5] "str_replace" "str_replace_all" "str_replace_na" "theme_replace"
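- dir() lists all the files in a directory.
The pattern argument of dir() takes a regular expression, in which case only file names matching the pattern are returned. For example: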
dir(pattern = "\\.csv$")
#> [1] "clinical.csv" "raw_ntnbr.csv" "raw_tnbr.csv"
#> [4] "total_clinical.csv"
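The output above depends on the files in the author's working directory. For a reproducible sketch, the code below creates a throwaway directory with three hypothetical file names and lists only those ending in .csv. It also shows glob2rx() from the utils package, which converts a shell-style wildcard into a regular expression.

```r
# Create a temporary directory with a few empty, made-up files
demo_dir <- tempfile("regex-demo-")
dir.create(demo_dir)
invisible(file.create(file.path(
  demo_dir, c("clinical.csv", "notes.txt", "raw.csv.bak")
)))

# The $ anchor excludes "raw.csv.bak", where ".csv" is not at the end
csv_files <- dir(demo_dir, pattern = "\\.csv$")
csv_files
#> [1] "clinical.csv"

# glob2rx() writes the regular expression for you
rx <- glob2rx("*.csv")
rx
#> [1] "^.*\\.csv$"

unlink(demo_dir, recursive = TRUE)  # clean up
```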
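Everything in this chapter is important. For reference, see the English edition of R for Data Science: https://r4ds.had.co.nz/strings.html
and its exercise solutions: https://jrnold.github.io/r4ds-exercise-solutions/strings.html#splitting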