R for Data Science, Chapter 14 "Strings": study notes
Reference: R for Data Science. Topics covered:
stringr
regular expressions
Introduction
When you first look at a regexp, you’ll think a cat walked across your keyboard
Matching patterns with regular expressions
- Basic matches
Use str_view() to visualize the matches
x <- c("apple", "banana", "pear")
str_view(x, ".a.")
str_view(x, ".a.*")
str_view(x, ".*a.*")
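- Regex escapes must be kept apart from the string's own escapes
A regex escape is written with \, but \ is also the escape character in string literals, so the backslash itself must be escaped: write \\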
dot <- "\\."
dot
# [1] "\\." # print still shows \\; use writeLines() to see the actual contents
writeLines(dot) # the string itself contains only one \
# \.
str_view(c("abc", "a.c", "bef"), "a\\.c")
# a.c
# To match a literal \, the regex needs \\, and each of those must itself be escaped in the string literal, so four \ are needed
x <- "a\\b"
str_view(x, "\\\\")
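To make the two escaping layers concrete, here is a minimal sketch (the variable names are illustrative, not from the book) that compares how a pattern is typed with the characters the string actually holds.

```r
library(stringr)

# Typed with two backslashes, but the string holds only two characters: \ and .
dot <- "\\."
dot_length <- nchar(dot)

# As a regex, \. means a literal dot, so only "a.c" matches
dot_hits <- str_detect(c("abc", "a.c"), "a\\.c")

# Typed with four backslashes, the string holds two: the regex \\ (one literal \)
backslash <- "\\\\"
backslash_length <- nchar(backslash)

# "a\\b" holds the three characters a, \, b, so the pattern finds the backslash
backslash_hit <- str_detect("a\\b", backslash)
```

Counting characters with nchar() is a quick way to check what the regex engine will actually receive.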
- Exercises
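Q: Explain why each of these strings don't match a \: "\", "\\", "\\\".
- "\": in a string literal \ escapes the next character, here the closing quote, so the string is never terminated
- "\\": the string holds a single \, which the regex reads as the start of an escape with nothing after it
- "\\\": the first two \ form one literal \ in the string, and the third escapes the next character (the closing quote), so the string is again unterminated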
- Anchors
The stringr::words word list can be used for practice
- ^ to match the start of the string.
- $ to match the end of the string.
If you begin with power (^), you end up with money ($).
\b matches a word boundary, whereas ^ and $ match the boundaries of the entire string
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
str_view(x, "\\bapple\\b")
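The difference between the two kinds of anchor is easier to see with logical results than with the viewer. The sketch below adds "pineapple" to the example vector (an addition for illustration, not part of the original example).

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake", "pineapple")

# ^ and $ anchor to the whole string: only the element that is exactly "apple"
whole <- str_detect(x, "^apple$")

# \b anchors to word boundaries: "apple" as a separate word, anywhere in the string
word <- str_detect(x, "\\bapple\\b")

x[whole]  # "apple"
x[word]   # "apple pie" "apple" "apple cake"
```

"pineapple" fails both tests because there is no word boundary between "e" and "a".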
- Character classes and alternatives
- Sometimes using [] instead of a backslash escape improves readability
str_extract(c("abc", "a.c", "a*c", "a c"), ".[*]c")
# [1] NA NA "a*c" NA
str_extract(c("abc", "a.c", "a*c", "a c"), ".\\*c")
# [1] NA NA "a*c" NA
str_extract(c("abc", "a.c", "a*c", "a c"), ".[.]c")
# [1] NA "a.c" NA NA
str_extract(c("abc", "a.c", "a*c", "a c"), ".\\.c")
# [1] NA "a.c" NA NA
# str_extract keeps NA for non-matching elements; str_subset drops them
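A small sketch of that difference, reusing the vector from the examples above (the names extracted and kept are illustrative):

```r
library(stringr)

x <- c("abc", "a.c", "a*c", "a c")

# str_extract returns one result per input element, NA where nothing matched
extracted <- str_extract(x, "a[.]c")

# str_subset returns only the elements that matched
kept <- str_subset(x, "a[.]c")

length(extracted)  # 4
length(kept)       # 1
```

Because str_extract preserves length, it is the safer choice when the result has to line up with the input, for example inside a data frame column.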
- About |
| has low precedence: abc|xyz matches abc or xyz, not abcyz or abxyz.
str_extract(c("grey", "gray", "ggap"), "gr(e|a)y")
# [1] "grey" "gray" NA
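The precedence rule can be checked directly. This is a small sketch with made-up inputs that shows what each pattern actually extracts.

```r
library(stringr)

x <- c("abcyz", "abxyz")

# Without parentheses the alternatives are the whole sequences "abc" and "xyz"
loose <- str_extract(x, "abc|xyz")

# Parentheses restrict the alternation to the single character c or x
grouped <- str_extract(x, "ab(c|x)yz")

loose    # "abc" "xyz"
grouped  # "abcyz" "abxyz"
```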
- Q: Create regular expressions to find all words that:
(1) Start with a vowel.
str_subset(words, '^[aeiou].*')
(2) That only contain consonants. (Hint: thinking about matching “not”-vowels.)
str_subset(words, '^[^aeiou]+$')
(3) End with ed, but not with eed.
# The word may consist of just the two characters "ed"
str_subset(words, '(^|[^e])ed$')
(4) End with ing or ise.
str_subset(words, 'i(ng|se)$')
- Repetition
- ?: 0 or 1
- +: 1 or more
- *: 0 or more
- {n}: exactly n
- {n,}: n or more
- {,m}: at most m
- {n,m}: between n and m
By default {n,m} is greedy: it matches as many repetitions as possible, up to m. Appending ? makes it lazy (non-greedy).
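A minimal sketch of greedy versus lazy matching, using inputs invented for illustration:

```r
library(stringr)

x <- "aaaaa"
greedy <- str_extract(x, "a{2,4}")   # takes as many repetitions as allowed
lazy <- str_extract(x, "a{2,4}?")    # takes as few repetitions as allowed

tags <- "<a><b>"
greedy_tag <- str_extract(tags, "<.+>")   # runs on to the last >
lazy_tag <- str_extract(tags, "<.+?>")    # stops at the first >

c(greedy, lazy)          # "aaaa" "aa"
c(greedy_tag, lazy_tag)  # "<a><b>" "<a>"
```

The same ? suffix works after the other quantifiers as well, for example +? and *?.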
There is a website with regex exercises at various difficulty levels and in various scenarios; it is worth a try
- Grouping and backreferences
Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with backreferences, like \1, \2 etc.
str_subset(fruit, "(..)\\1")
# [1] "banana" "coconut" "cucumber" "jujube" "papaya" "salal berry"
str_subset(fruit, "(.)(.)l\\2\\b")
# [1] "chili pepper"
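To see what a capturing group stores, str_match can be used alongside the backreference. This is a sketch on a few hand-picked strings rather than the fruit data set.

```r
library(stringr)

x <- c("banana", "apple", "abab")

# (..)\1 needs any two characters immediately followed by the same two characters
repeated <- str_detect(x, "(..)\\1")

# Column 1 is the full match, column 2 is what group 1 captured
m <- str_match("banana", "(..)\\1")

repeated  # TRUE FALSE TRUE
m[1, ]    # "anan" "an"
```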
Tools
- Detect matches
(1) str_detect: returns a logical vector indicating whether each element matches
x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1] TRUE FALSE TRUE
In a numeric context, FALSE becomes 0 and TRUE becomes 1, so sum() and mean() can be applied to the result.
sum(str_detect(words, '^(.).*\\1$'))
# [1] 36
mean(str_detect(words, '^(.).*\\1$'))
# [1] 0.03673469
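The same idea on the three-element vector from above, as a small sketch where the arithmetic can be checked by hand:

```r
library(stringr)

x <- c("apple", "banana", "pear")
has_e <- str_detect(x, "e")  # TRUE FALSE TRUE

n_matching <- sum(has_e)      # number of elements that match: 2
share_matching <- mean(has_e) # proportion of elements that match: 2/3
```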
- str_detect can simplify a problem
Example: find all words that contain no vowels
# The first approach is simpler
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
#> [1] TRUE
- Subsetting
words[!str_detect(words, "[aeiou]")]
# Equivalent to
str_subset(words,'^[^aeiou]+$')
- str_detect pairs well with dplyr::filter to filter data frame rows by a column
tibble(
word = words,
i = seq_along(word)
) %>%
filter(str_detect(word, "x$"))
#> # A tibble: 4 x 2
#> word i
#> <chr> <int>
#> 1 box 108
#> 2 sex 747
#> 3 six 772
#> 4 tax 841
(2) str_count: returns the number of matches in each string (word or phrase)
str_count(c('chenxi','hello','world','apple'),'[aeiou]')
# [1] 2 2 1 2
str_count(c('chenxi','hello','world','apple'),'^[^aeiou]')
# [1] 1 1 1 0
# Average number of vowels per word in words
mean(str_count(words, '[aeiou]'))
# [1] 1.991837
- str_count pairs well with dplyr::mutate
df <- tibble(
id = seq_along(words), # index the words vector; the word column does not exist yet at this point
word = words
)
df %>%
mutate(
vowels = str_count(word,'[aeiou]'),
consonants = str_count(word,'[^aeiou]')
)
- Regex matches never overlap
str_count("abababa", "aba")
#> [1] 2
str_view_all("abababa", "aba")
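The positions of the two non-overlapping matches can be inspected with str_locate_all. If overlapping occurrences are wanted, one common workaround, shown here as a sketch rather than something taken from the book, is to count zero-width lookahead matches instead.

```r
library(stringr)

x <- "abababa"

# The two non-overlapping matches start at positions 1 and 5
loc <- str_locate_all(x, "aba")[[1]]

# A lookahead consumes no characters, so every start position is tested
overlapping <- str_count(x, "(?=aba)")

loc[, "start"]  # 1 5
overlapping     # 3
```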
- Order-independent matching with lookarounds
# Find the words in words that contain all five vowels
str_subset(words,
'^(?=.*?a)(?=.*?e)(?=.*?i)(?=.*?o)(?=.*?u).*$')
# Simplifying the problem also works
words[str_detect(words,'a') &
str_detect(words,'e') &
str_detect(words,'i') &
str_detect(words,'o') &
str_detect(words,'u')]
- Extract matches
Example data: stringr::sentences
length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks."
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."
#> [4] "These days a chicken leg is a rare dish."
#> [5] "Rice is often served in round bowls."
#> [6] "The juice of lemons makes fine punch."
str_extract
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|");colour_match
has_colour <- str_subset(sentences, colour_match)
# extract the colour to figure out which one it is
matches <- str_extract(has_colour, colour_match)
# has_colour holds only sentences that matched; str_extract returns NA where nothing matches
head(matches)
Note that str_extract() only extracts the first match: when a string contains several matches, only the first is returned.
- str_extract_all returns every match, as a list
more <- sentences[str_count(sentences, colour_match) > 1]
str_extract(more, colour_match)
# [1] "blue" "green" "orange"
str_extract_all(more, colour_match)
# [[1]]
# [1] "blue" "red"
#
# [[2]]
# [1] "green" "red"
#
# [[3]]
# [1] "orange" "red"
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
# [,1] [,2] [,3]
# [1,] "a" "" ""
# [2,] "a" "b" ""
# [3,] "a" "b" "c"
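# The number of columns equals the largest number of matches found in any one string
# Shorter rows are padded with "" (empty strings)
unlist(str_extract_all(sentences,'\\b[A-Za-z]*ing\\b'))
# unlist() flattens the list, which drops the empty results from sentences with no match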
- Grouped matches
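- Comparing str_subset, str_extract, and str_match
# The article is followed by a run of non-space characters, i.e. the next word
noun <- "(a|the) ([^ ]+)"
# Return the entire strings that match, as a character vector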
results <- sentences %>%
str_subset(noun)
# [1] "The birch canoe slid on the smooth planks."
# [2] "Glue the sheet to the dark blue background."
# [3] "It's easy to tell the depth of a well."
# [4] "These days a chicken leg is a rare dish."
# ....
# Extract every match; returns a list (or a matrix), which can be flattened
results %>% # sentences would also work here
str_extract_all(noun) %>%
unlist() %>%
.[1:5]
# [1] "the smooth" "the sheet" "the dark" "the depth" "a well"
# Besides the full match, str_match also returns each parenthesized submatch, as a matrix
results %>%
str_match(noun) %>%
head(3)
# [,1] [,2] [,3]
# [1,] "the smooth" "the" "smooth"
# [2,] "the sheet" "the" "sheet"
# [3,] "the depth" "the" "depth"
# Like str_extract, str_match returns only the first match
# Use str_match_all to get every match and submatch in each string
results %>%
str_match_all(noun) %>%
.[[2]]
# [,1] [,2] [,3]
# [1,] "the sheet" "the" "sheet"
# [2,] "the dark" "the" "dark"
- tidyr::extract, which is similar to str_match
If your data is in a tibble, it's often easier to use tidyr::extract(). It works like str_match() but requires you to name the matches, which are then placed in new columns.
tibble(sentence = sentences) %>%
tidyr::extract(
sentence, c("article", "noun"), "(a|the) ([^ ]+)",
remove = FALSE # controls whether the sentence column is removed
) %>%
filter(!is.na(article))
- Replacing matches
str_replace() and str_replace_all()
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-" "p--r" "b-n-n-"
str_replace_all supports multiple replacements through a named vector
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house" "two cars" "three people"
- Backreferences can also be used in the replacement
The example below swaps the second and third words
sentences %>%
head(3)
# [1] "The birch canoe slid on the smooth planks."
# [2] "Glue the sheet to the dark blue background."
# [3] "It's easy to tell the depth of a well."
sentences %>%
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
head(3)
# [1] "The canoe birch slid on the smooth planks."
# [2] "Glue sheet the to the dark blue background."
# [3] "It's to easy tell the depth of a well."
sentences %>%
str_replace_all("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
head(3)
# [1] "The canoe birch slid the on smooth planks."
# [2] "Glue sheet the to dark the blue background."
# [3] "It's to easy tell depth the of well. a"
- Q: Swap the first and last letters of every word in words. Which of the resulting strings are still in words?
req <- '^([A-Za-z])(.*)([A-Za-z])$'
results <- str_replace_all(words, req, '\\3\\2\\1')
results[results %in% words]
# Or take the intersection
intersect(words,results)
- Splitting
str_split returns a list
sentences %>%
head(3) %>%
str_split(" ")
#> [[1]]
#> [1] "The" "birch" "canoe" "slid" "on" "the" "smooth"
#> [8] "planks."
#>
#> [[2]]
#> [1] "Glue" "the" "sheet" "to" "the"
#> [6] "dark" "blue" "background."
#>
#> [[3]]
#> [1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
If you’re working with a length-1 vector, the easiest thing is to just extract the first element of the list
"a|b|c|d" %>%
str_split("\\|") %>%
.[[1]]
#> [1] "a" "b" "c" "d"
- Other arguments of str_split
# simplify = TRUE returns a matrix
sentences %>%
head(5) %>%
str_split(" ", simplify = TRUE)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
#> [2,] "Glue" "the" "sheet" "to" "the" "dark" "blue" "background."
#> [3,] "It's" "easy" "to" "tell" "the" "depth" "of" "a"
#> [4,] "These" "days" "a" "chicken" "leg" "is" "a" "rare"
#> [5,] "Rice" "is" "often" "served" "in" "round" "bowls." ""
#> [,9]
#> [1,] ""
#> [2,] ""
#> [3,] "well."
#> [4,] "dish."
#> [5,] ""
# n sets the maximum number of pieces to return
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
#> [,1] [,2]
#> [1,] "Name" "Hadley"
#> [2,] "Country" "NZ"
#> [3,] "Age" "35"
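In the fields example each string contains the separator only once, so n = 2 changes nothing visible. The sketch below uses a made-up field whose value itself contains ": " to show what n actually controls.

```r
library(stringr)

field <- "Time: 12: 30"

# Without n, every occurrence of the separator produces a split
all_pieces <- str_split(field, ": ")[[1]]

# With n = 2, splitting stops after the first separator and the rest stays intact
two_pieces <- str_split(field, ": ", n = 2)[[1]]

all_pieces  # "Time" "12" "30"
two_pieces  # "Time" "12: 30"
```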
- Splitting into words with boundary
x <- "This is a sentence. This is another sentence."
str_split(x, " ")[[1]]
#> [1] "This" "is" "a" "sentence." "" "This"
#> [7] "is" "another" "sentence."
str_split(x, boundary("word"))[[1]]
#> [1] "This" "is" "a" "sentence" "This" "is" "another"
#> [8] "sentence"
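# boundary("word") detects words and drops spaces and punctuation
- Splitting on the empty string ""
Using "" as the pattern splits a string into its individual characters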
x <- c("apples, pears, and bananas")
str_split(x, "")[[1]]
# [1] "a" "p" "p" "l" "e" "s" "," " " "p" "e" "a" "r" "s" "," " " "a" "n" "d" " " "b" "a"
# [22] "n" "a" "n" "a" "s"
- Find matches
- str_locate returns the start and end positions of the first match; str_locate_all returns them for every match
?str_locate()
fruit <- c("apple", "banana", "pear", "pineapple")
str_locate(fruit, "$")
str_locate(fruit, "a")
str_locate(fruit, "e")
str_locate(fruit, c("a", "b", "p", "p"))
str_locate_all(fruit, "a")
str_locate_all(fruit, "e")
str_locate_all(fruit, c("a", "b", "p", "p"))
# Find location of every character
str_locate_all(fruit, "")
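The positions returned by str_locate are the same kind of indices that str_sub takes, so the two can be combined. A minimal sketch on a single string:

```r
library(stringr)

x <- "pineapple"

# One row per input string, with columns "start" and "end"
loc <- str_locate(x, "apple")

# Feed the positions back into str_sub to recover the matched text
piece <- str_sub(x, loc[, "start"], loc[, "end"])

loc    # start 5, end 9
piece  # "apple"
```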
- str_sub extracts substrings by position
hw <- "Hadley Wickham"
str_sub(hw, 1, 6)
str_sub(hw, end = 6)
# [1] "Hadley"
str_sub(hw, -7)
# [1] "Wickham"
str_sub(hw, end = -9)
# [1] "Hadley"
str_sub('XiChen', seq_len(str_length('XiChen')))
# [1] "XiChen" "iChen" "Chen" "hen" "en" "n"
str_sub('XiChen', end = seq_len(str_length('XiChen')))
# [1] "X" "Xi" "XiC" "XiCh" "XiChe" "XiChen"
# str_sub can also replace substrings
x <- "XiChen"
str_sub(x, 1, 1) <- "A"; x
# [1] "AiChen"
str_sub(x, -1, -1) <- "K"; x
# [1] "AiCheK"
str_sub(x, -2, -2) <- "GHIJ"; x
# [1] "AiChGHIJK"
str_sub(x, 2, -2) <- ""; x
# [1] "AK"
# See the help page for more
Other types of pattern
When you use a pattern that's a string, it's automatically wrapped into a call to regex():
# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))
- regex can set ignore_case = TRUE to ignore case
bananas <- c("banana", "Banana", "BANANA")
str_subset(bananas,'banana')
# [1] "banana"
str_subset(bananas, regex("banana", ignore_case = TRUE))
# [1] "banana" "Banana" "BANANA"
- multiline = TRUE allows ^ and $ to match the start and end of each line
x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
#> [1] "Line"
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
#> [1] "Line" "Line" "Line"
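The example above only exercises ^. The same flag affects $, as this sketch with a made-up three-line string shows:

```r
library(stringr)

x <- "a1\nb2\nc3"

# By default $ matches only at the end of the whole string
last_only <- str_extract_all(x, "\\d$")[[1]]

# With multiline = TRUE, $ also matches at the end of every line
each_line <- str_extract_all(x, regex("\\d$", multiline = TRUE))[[1]]

last_only  # "3"
each_line  # "1" "2" "3"
```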
- comments = TRUE lets you add comments and whitespace to make a complex regex more readable
phone <- regex("
\\(? # optional opening parens
(\\d{3}) # area code
[) -]? # optional closing parens, space, or dash
(\\d{3}) # another three numbers
[ -]? # optional space or dash
(\\d{4}) # four more numbers
", comments = TRUE)
str_match("514-791-8141", phone)
#> [,1] [,2] [,3] [,4]
#> [1,] "514-791-8141" "514" "791" "8141"
- dotall = TRUE lets . match everything, including \n
x <- "Line 1\nLine 2\nLine 3"
str_extract(x, '.*')
# [1] "Line 1"
str_extract(x, regex('.*',dotall = TRUE))
# [1] "Line 1\nLine 2\nLine 3"
- fixed(): matches exactly the specified sequence of bytes. It ignores all special regular expressions and operates at a very low level.
str_subset(c("a\\b", "ab"), "\\\\")
# [1] "a\\b"
# fixed() avoids regex-level escaping and is faster
str_subset(c("a\\b", "ab"), fixed("\\"))
# [1] "a\\b"
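The effect of fixed() is perhaps clearest with a metacharacter such as the dot. A minimal sketch with invented inputs:

```r
library(stringr)

x <- c("a.c", "abc")

# As a regex, . matches any character, so both strings match
as_regex <- str_detect(x, ".")

# Wrapped in fixed(), . is an ordinary dot, so only "a.c" matches
as_fixed <- str_detect(x, fixed("."))

# Counting literal dots needs no escaping at all
n_dots <- str_count("a.b.c", fixed("."))

as_regex  # TRUE TRUE
as_fixed  # TRUE FALSE
n_dots    # 2
```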