String processing in R (2/3): splitting, extraction, and replacement

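This post is adapted from the article "R中字符串处理：函数实现" (String processing in R: the functions), published by the WeChat official account 一遇之见. The original is long, so it is studied and digested here in three parts.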

String splitting functions: `strsplit`, `str_split`, and `str_split_fixed`

strsplit (base R) and str_split and str_split_fixed (stringr) all split strings. strsplit and str_split return a list, whereas str_split_fixed returns a character matrix.

fruits = c("Small Yellow Banana", " Red Apple", "Big Sweet Pear  ", "Sour PineApple")
strsplit(fruits, " ")
# [[1]]
# [1] "Small"  "Yellow" "Banana"

# [[2]]
# [1] ""      "Red"   "Apple"

# [[3]]
# [1] "Big"   "Sweet" "Pear"  ""     # the input ends with two spaces, but strsplit returns only one empty string

# [[4]]
# [1] "Sour"      "PineApple"

library(stringr)
str_split(fruits, " ")
# [[1]]
# [1] "Small"  "Yellow" "Banana"

# [[2]]
# [1] ""      "Red"   "Apple"

# [[3]]
# [1] "Big"   "Sweet" "Pear"  ""      ""     # str_split keeps both trailing empty strings

# [[4]]
# [1] "Sour"      "PineApple"

str_split_fixed(fruits, " ", n = 3)
#      [,1]    [,2]        [,3]    
# [1,] "Small" "Yellow"    "Banana"
# [2,] ""      "Red"       "Apple" 
# [3,] "Big"   "Sweet"     "Pear  "
# [4,] "Sour"  "PineApple" ""

The function unlist flattens the list returned by strsplit or str_split into a single vector.

unlist(strsplit(fruits, " "))
# [1] "Small"     "Yellow"    "Banana"    ""          "Red"       "Apple"    
# [7] "Big"       "Sweet"     "Pear"      ""          "Sour"      "PineApple"

unlist(str_split(fruits, " "))
# [1] "Small"     "Yellow"    "Banana"    ""          "Red"       "Apple"    
# [7] "Big"       "Sweet"     "Pear"      ""          ""          "Sour"     
# [13] "PineApple"
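
The empty strings in the results above come from the leading and trailing spaces in the input. A minimal base R sketch, assuming the whitespace around each string carries no meaning, trims the strings first and then splits on runs of spaces. The variable names `cleaned`, `pieces`, and `words` are illustrative only.

```r
fruits <- c("Small Yellow Banana", " Red Apple", "Big Sweet Pear  ", "Sour PineApple")

# Remove leading and trailing whitespace so no empty pieces are produced
cleaned <- trimws(fruits)

# Split on one or more spaces (the pattern is a regular expression)
pieces <- strsplit(cleaned, " +")

# Flatten the list into a single character vector
words <- unlist(pieces)
print(words)
```

Whether to trim first depends on the data: if empty fields are meaningful (for example in delimited records), they should be kept.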

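Of the three splitting functions, str_split_fixed returns a character matrix, which makes it convenient to index into the result later. Both str_split and str_split_fixed take a parameter n. In str_split it is optional (the default is Inf) and the result is still a list; in str_split_fixed it must be supplied. When n is smaller than the number of pieces a string would otherwise yield, the string is split into n pieces and the remainder is left intact in the last piece. When n is larger than that number, str_split simply returns the pieces it found, while str_split_fixed pads the extra columns with empty strings.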

str_split(fruits, " ", n = 2)
# [[1]]
# [1] "Small"         "Yellow Banana"

# [[2]]
# [1] ""          "Red Apple"

# [[3]]
# [1] "Big"          "Sweet Pear  "

# [[4]]
# [1] "Sour"      "PineApple"

str_split(fruits, " ", n = 5)
# [[1]]
# [1] "Small"  "Yellow" "Banana"

# [[2]]
# [1] ""      "Red"   "Apple"

# [[3]]
# [1] "Big"   "Sweet" "Pear"  ""      ""     

# [[4]]
# [1] "Sour"      "PineApple"

str_split_fixed(fruits, " ", n = 3)
#      [,1]    [,2]        [,3]    
# [1,] "Small" "Yellow"    "Banana"
# [2,] ""      "Red"       "Apple" 
# [3,] "Big"   "Sweet"     "Pear  "
# [4,] "Sour"  "PineApple" ""      

str_split_fixed(fruits, " ", n = 2)
#      [,1]    [,2]           
# [1,] "Small" "Yellow Banana"
# [2,] ""      "Red Apple"    
# [3,] "Big"   "Sweet Pear  " 
# [4,] "Sour"  "PineApple" 

str_split_fixed(fruits, " ", n = 5)
#      [,1]    [,2]        [,3]     [,4] [,5]
# [1,] "Small" "Yellow"    "Banana" ""   ""  
# [2,] ""      "Red"       "Apple"  ""   ""  
# [3,] "Big"   "Sweet"     "Pear"   ""   ""  
# [4,] "Sour"  "PineApple" ""       ""   ""
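
For comparison, a rectangular result similar to the one from str_split_fixed can be built in base R. This is only a sketch: the helper name `split_to_matrix` is hypothetical, and because it relies on strsplit it drops a final trailing empty piece, so it does not reproduce str_split_fixed exactly.

```r
# Hypothetical helper: split each string and pad the pieces to equal length
split_to_matrix <- function(x, sep = " ") {
  parts <- strsplit(x, sep, fixed = TRUE)
  n <- max(lengths(parts))
  # Pad every vector of pieces with "" up to n elements
  padded <- lapply(parts, function(p) c(p, rep("", n - length(p))))
  # One row per input string
  matrix(unlist(padded), ncol = n, byrow = TRUE)
}

fruits <- c("Small Yellow Banana", " Red Apple", "Big Sweet Pear  ", "Sour PineApple")
m <- split_to_matrix(fruits)

# Columns can then be referenced directly
print(m[, 1])  # first piece of every string
```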

String extraction

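substr(x, start, stop): extracts the substring of x from position start to position stop.
substring(text, first, last = 1000000L): extracts the substring of text from first to last; last defaults to 1000000, so it can be omitted.
str_sub(string, start = 1L, end = -1L): extracts the substring of string from start to end; both start and end have defaults and can be omitted.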

txt <- c("Hello, the World!","I'm Chinese", "I love China.", "I come from China!")
substr(txt, 1, 5)
# [1] "Hello" "I'm C" "I lov" "I com"
substring(txt, 1, 5)
# [1] "Hello" "I'm C" "I lov" "I com"
str_sub(txt, 1, 3)
# [1] "Hel" "I'm" "I l" "I c"

substr(txt[1], c(1,2,3,4), c(2,3,4,5)) # only the first start/stop pair is used
# [1] "He"
substr(txt, c(1,2,3,4), c(2,3,4,5))
# [1] "He" "'m" "lo" "om"

substring(txt[1], c(1,2,3,4), c(2,3,4,5)) # the shorter argument is recycled and matched position by position
# [1] "He" "el" "ll" "lo"
substring(txt, c(1,2,3,4), c(2,3,4,5))
# [1] "He" "'m" "lo" "om"

str_sub(txt[1],  c(1,2,3,4), c(2,3,4,5)) # the shorter argument is recycled and matched position by position
# [1] "He" "el" "ll" "lo"
str_sub(txt,  c(1,2,3,4), c(2,3,4,5))
# [1] "He" "'m" "lo" "om"
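
The recycling behaviour of substring makes it easy to take several slices of one string in a single call. A small base R sketch (the variable names are illustrative) extracts every character and every overlapping pair of characters from one word, and contrasts this with substr:

```r
s <- "Hello"
n <- nchar(s)

# One start and one stop per character: the string is recycled n times
chars <- substring(s, seq_len(n), seq_len(n))

# Overlapping two-character windows: starts 1..(n-1), stops 2..n
pairs <- substring(s, seq_len(n - 1), seq_len(n - 1) + 1)

# substr, by contrast, returns one result per element of its first argument
first_only <- substr(s, seq_len(n), seq_len(n))

print(chars)
print(pairs)
print(first_only)
```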

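strtrim(x, width): truncates each string in x, starting from its beginning, to the given display width; width is recycled to the length of x. Width is measured in display columns, and a Chinese character usually occupies two columns, so width must be set to twice the number of Chinese characters to keep.
word(string, start = 1L, end = start, sep = fixed(" ")) from the stringr package extracts words from a sentence. string is a character vector; start is an integer vector giving the position of the first word to extract; end is an integer vector giving the position of the last word; sep is the separator between words, a single space by default.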

txt <- c("Hello, the World!","I'm Chinese", "I love China.", "I come from China!")

strtrim(txt, 7)
# [1] "Hello, " "I'm Chi" "I love " "I come "
strtrim(txt,  c(1,2,3,4))  # one width per element, matched position by position
# [1] "H"    "I'"   "I l"  "I co"

word(txt, 2)
# [1] "the"     "Chinese" "love"    "come"   
word(txt, c(1,2))  # the shorter argument is recycled to c(1,2,1,2) and matched position by position
# [1] "Hello,"  "Chinese" "I"       "come"
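
When stringr is not available, the behaviour of word for a single position can be approximated in base R. The sketch below uses a hypothetical helper `nth_word` and assumes words are separated by exactly one space; it returns NA when a string has fewer words than requested.

```r
# Hypothetical helper: return the n-th space-separated word of each string
nth_word <- function(x, n) {
  parts <- strsplit(x, " ", fixed = TRUE)
  # Indexing past the end of a character vector yields NA
  vapply(parts, function(p) p[n], character(1))
}

txt <- c("Hello, the World!", "I'm Chinese", "I love China.", "I come from China!")
second <- nth_word(txt, 2)
fourth <- nth_word(txt, 4)
print(second)
print(fourth)
```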

String replacement

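Although sub and gsub, and str_replace and str_replace_all, are used for string replacement, strictly speaking R has no function that replaces text in place: R passes arguments by value rather than by reference, so these functions leave their input unchanged and return a modified copy.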

text = c("Hellow, Adam Adam!", "Hi, Paul Adam !", "How are you, Adam, Ava.")

sub(pattern = "Adam", replacement = "world", text)
# [1] "Hellow, world Adam!"      "Hi, Paul world !"         "How are you, world, Ava."

gsub(pattern = "Adam", replacement = "world", text)
# [1] "Hellow, world world!"     "Hi, Paul world !"         "How are you, world, Ava."

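As the output shows, although this is called "replacement", the original strings are unchanged; to change the original variable, the result has to be assigned back to it. The difference between sub and gsub is that sub replaces only the first match in each string, no matter how many matches there are, whereas gsub replaces every match.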
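
A short base R sketch of the reassignment idiom, using the same example strings. The names `original`, `replaced`, and `unchanged` are illustrative, and fixed = TRUE is used here so that the pattern is treated as a literal string.

```r
text <- c("Hellow, Adam Adam!", "Hi, Paul Adam !", "How are you, Adam, Ava.")
original <- text

# gsub returns a new vector; `text` itself is untouched
replaced <- gsub("Adam", "world", text, fixed = TRUE)
unchanged <- identical(text, original)

# To "modify" the variable, assign the result back to it
text <- gsub("Adam", "world", text, fixed = TRUE)

print(unchanged)
print(text)
```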

The stringr package offers str_replace, which like sub replaces only the first match, and str_replace_all, which like gsub replaces every match.

sub(c("H"), c("I"),c("HacHgd", "aeHfgH", "defg"))
# [1] "IacHgd" "aeIfgH" "defg"
str_replace(c("HacHgd", "aeHfgH", "defg"), c("H"), c("I"))
# [1] "IacHgd" "aeIfgH" "defg"

gsub(c("H"), c("I"),c("HacHgd", "aeHfgH", "defg"))
# [1] "IacIgd" "aeIfgI" "defg"
str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H"), c("I"))
# [1] "IacIgd" "aeIfgI" "defg"

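Unlike sub and gsub, str_replace and str_replace_all are vectorised over pattern and replacement: besides applying one pattern to every string, they can pair the i-th pattern and replacement with the i-th string. In the output below, which comes from an older stringr release, shorter vectors are recycled to the length of the longest one, with a warning when the lengths are not multiples of each other. (stringr 1.5.0 and later follow the tidyverse recycling rules, under which only length-1 vectors are recycled, so the calls below with mismatched lengths raise an error there.)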

sub(c("H","a"), c("I", "b"),c("HacHgd", "aeHfgH", "defg")) # only "H" takes part in the replacement
## [1] "IacHgd" "aeIfgH" "defg"
## Warning in sub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'pattern' has length > 1 and only the first element will be used
## Warning in sub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'replacement' has length > 1 and only the first element will be used

str_replace(c("HacHgd", "aeHfgH", "defg"), c("H","a"), c("I", "b"))
## [1] "IacHgd" "beHfgH" "defg"
## Warning in stri_replace_first_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length

gsub(c("H","a"), c("I", "b"),c("HacHgd", "aeHfgH", "defg"))
## [1] "IacIgd" "aeIfgI" "defg"
## Warning in gsub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'pattern' has length > 1 and only the first element will be used
## Warning in gsub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'replacement' has length > 1 and only the first element will be used

str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H","a"), c("I", "b"))
## [1] "IacIgd" "beHfgH" "defg"
## Warning in stri_replace_all_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length

str_replace(c("HacHgd", "aeHfgH", "defg"), c("H","a","g", "d"), c("I", "b","H","e")) # the result now has length 4
## [1] "IacHgd" "beHfgH" "defH"   "HacHge"
## Warning in stri_replace_first_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length

str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H","a","g", "d"), c("I", "b","H","e")) # the result now has length 4
## [1] "IacIgd" "beHfgH" "defH"   "HacHge"
## Warning in stri_replace_all_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length
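
Base R can also pair the i-th pattern and replacement with the i-th string, by looping over the three vectors together. The following is a sketch using mapply, not the way stringr implements it. The vectors are chosen to have equal length so that no recycling is involved, and fixed = TRUE treats each pattern as a literal string.

```r
strings      <- c("HacHgd", "aeHfgH", "defg")
patterns     <- c("H", "a", "g")
replacements <- c("I", "b", "H")

# Apply sub() to each (string, pattern, replacement) triple in turn
paired <- mapply(
  function(s, p, r) sub(p, r, s, fixed = TRUE),
  strings, patterns, replacements,
  USE.NAMES = FALSE
)
print(paired)
```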

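In addition, str_replace_all can apply several replacements in a single call when it is given a named character vector whose names are the patterns and whose values are the replacements (str_replace does not offer this).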

y = c(c("I", "b"))
names(y) = c("H","a")
str_replace_all(c("HacHgd", "aeHfgH", "defg"),y)
## [1] "IbcIgd" "beIfgI" "defg"

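This multiple-replacement feature of str_replace_all can give surprising results, because the replacements are applied one after another, so a later rule also rewrites text produced by an earlier one. mgsub::mgsub applies the replacements simultaneously and therefore produces a different result.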

y = c(c("a", "H"))
names(y) = c("H","a")
str_replace_all(c("HacHgd", "aeHfgH", "defg"),y)
## [1] "HHcHgd" "HeHfgH" "defg"

mgsub::mgsub(c("HacHgd", "aeHfgH", "defg"), c("H","a"),c(c("a", "H")))
## [1] "aHcagd" "Heafga" "defg"
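
The difference between the two results comes from sequential versus simultaneous application of the rules, and it can be reproduced in base R as a sketch. A loop over gsub applies the rules one after another, while chartr, which only handles single-character substitutions, translates every character in one pass. Neither is meant to show how str_replace_all or mgsub is implemented internally; the names `rules`, `sequential`, and `simultaneous` are illustrative.

```r
x <- c("HacHgd", "aeHfgH", "defg")
rules <- c(H = "a", a = "H")  # names are patterns, values are replacements

# Sequential: the second rule also rewrites the "a" produced by the first rule
sequential <- x
for (p in names(rules)) {
  sequential <- gsub(p, rules[[p]], sequential, fixed = TRUE)
}

# Simultaneous: each character is translated once, so H and a are swapped
simultaneous <- chartr("Ha", "aH", x)

print(sequential)
print(simultaneous)
```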
