String handling in R (2/3): splitting, extracting, replacing

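This post is reposted from the WeChat official account 一遇之见, from the article "R中字符串处理:函数实现". The original is long, so I am studying and digesting it in three parts.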

String splitting functions: strsplit, str_split, str_split_fixed

The functions strsplit, str_split and str_split_fixed can all split strings, but strsplit and str_split return a list, whereas str_split_fixed returns a character matrix.

fruits = c("Small Yellow Banana", " Red Apple", "Big Sweet Pear  ", "Sour PineApple")
strsplit(fruits, " ")
# [[1]]
# [1] "Small"  "Yellow" "Banana"

# [[2]]
# [1] ""      "Red"   "Apple"

# [[3]]
# [1] "Big"   "Sweet" "Pear"  ""     # the input actually ends with two spaces

# [[4]]
# [1] "Sour"      "PineApple"

library(stringr)
str_split(fruits, " ")
# [[1]]
# [1] "Small"  "Yellow" "Banana"

# [[2]]
# [1] ""      "Red"   "Apple"

# [[3]]
# [1] "Big"   "Sweet" "Pear"  ""      ""     # str_split picks up both trailing spaces

# [[4]]
# [1] "Sour"      "PineApple"

str_split_fixed(fruits, " ", n = 3)
#      [,1]    [,2]        [,3]    
# [1,] "Small" "Yellow"    "Banana"
# [2,] ""      "Red"       "Apple" 
# [3,] "Big"   "Sweet"     "Pear  "
# [4,] "Sour"  "PineApple" "" 

The function unlist converts the list returned by strsplit or str_split into a vector.

unlist(strsplit(fruits, " "))
# [1] "Small"     "Yellow"    "Banana"    ""          "Red"       "Apple"    
# [7] "Big"       "Sweet"     "Pear"      ""          "Sour"      "PineApple"

unlist(str_split(fruits, " "))
# [1] "Small"     "Yellow"    "Banana"    ""          "Red"       "Apple"    
# [7] "Big"       "Sweet"     "Pear"      ""          ""          "Sour"     
# [13] "PineApple"
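
The empty strings in the results above come from the leading and trailing spaces in the input. As a minimal base-R sketch, assuming only the words themselves are wanted, the strings can be trimmed with trimws before splitting. The variable names words_list and words are illustrative only.

```r
fruits <- c("Small Yellow Banana", " Red Apple", "Big Sweet Pear  ", "Sour PineApple")

# Trim leading/trailing whitespace first, then split on a single space
words_list <- strsplit(trimws(fruits), " ")

# Flatten the list into one character vector; no empty strings remain
words <- unlist(words_list)
print(words)
```

This only removes empty pieces caused by spaces at the ends; repeated spaces inside a string would still produce empty strings.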

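Of the three splitting functions, str_split_fixed returns a matrix, which makes it easy to refer to individual pieces later by row and column. In addition, both str_split and str_split_fixed take a parameter n. In str_split it is optional, and the result is a list either way; in str_split_fixed it must be set. When n is smaller than the number of pieces a string would split into, splitting stops after n pieces and the remainder is left unsplit in the last piece. When n is larger than the number of pieces, str_split_fixed pads the extra columns with empty strings, while str_split simply returns the pieces that exist.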

str_split(fruits, " ", n = 2)
# [[1]]
# [1] "Small"         "Yellow Banana"

# [[2]]
# [1] ""          "Red Apple"

# [[3]]
# [1] "Big"          "Sweet Pear  "

# [[4]]
# [1] "Sour"      "PineApple"

str_split(fruits, " ", n = 5)
# [[1]]
# [1] "Small"  "Yellow" "Banana"

# [[2]]
# [1] ""      "Red"   "Apple"

# [[3]]
# [1] "Big"   "Sweet" "Pear"  ""      ""     

# [[4]]
# [1] "Sour"      "PineApple"

str_split_fixed(fruits, " ", n = 3)
#      [,1]    [,2]        [,3]    
# [1,] "Small" "Yellow"    "Banana"
# [2,] ""      "Red"       "Apple" 
# [3,] "Big"   "Sweet"     "Pear  "
# [4,] "Sour"  "PineApple" ""      

str_split_fixed(fruits, " ", n = 2)
#      [,1]    [,2]           
# [1,] "Small" "Yellow Banana"
# [2,] ""      "Red Apple"    
# [3,] "Big"   "Sweet Pear  " 
# [4,] "Sour"  "PineApple" 

str_split_fixed(fruits, " ", n = 5)
#      [,1]    [,2]        [,3]     [,4] [,5]
# [1,] "Small" "Yellow"    "Banana" ""   ""  
# [2,] ""      "Red"       "Apple"  ""   ""  
# [3,] "Big"   "Sweet"     "Pear"   ""   ""  
# [4,] "Sour"  "PineApple" ""       ""   ""  
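
As a small sketch of why the matrix form is convenient for later reference, the result of str_split_fixed can be indexed by column, or converted to a data frame with named columns. The column names first, second and third are made up for this example.

```r
library(stringr)

fruits <- c("Small Yellow Banana", " Red Apple", "Big Sweet Pear  ", "Sour PineApple")

# str_split_fixed returns a character matrix with exactly n columns
parts <- str_split_fixed(fruits, " ", n = 3)
print(is.matrix(parts))

# Columns can be referenced by position ...
print(parts[, 2])

# ... or, after conversion to a data frame, by name (names are illustrative)
parts_df <- as.data.frame(parts, stringsAsFactors = FALSE)
names(parts_df) <- c("first", "second", "third")
print(parts_df$second)
```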
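
String extraction
  • substr(x, start, stop): extracts the substring of x from position start to position stop.

  • substring(text, first, last = 1000000L): extracts the substring of text from position first to position last; last defaults to 1000000 and can be omitted.

  • str_sub(string, start = 1L, end = -1L): extracts the substring of string from position start to position end; both start and end have defaults and can be omitted.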

txt <- c("Hello, the World!","I'm Chinese", "I love China.", "I come from China!")
substr(txt, 1, 5)

# [1] "Hello" "I'm C" "I lov" "I com"
substring(txt, 1, 5)
# [1] "Hello" "I'm C" "I lov" "I com"
str_sub(txt, 1, 3)
# [1] "Hel" "I'm" "I l" "I c"

substr(txt[1], c(1,2,3,4), c(2,3,4,5)) # x has length 1, so only the first start/stop pair is used
# [1] "He"
substr(txt, c(1,2,3,4), c(2,3,4,5))
# [1] "He" "'m" "lo" "om"

substring(txt[1], c(1,2,3,4), c(2,3,4,5)) # the single string is recycled against each start/stop pair
# [1] "He" "el" "ll" "lo"
substring(txt, c(1,2,3,4), c(2,3,4,5))
# [1] "He" "'m" "lo" "om"

str_sub(txt[1],  c(1,2,3,4), c(2,3,4,5)) # the single string is recycled against each start/stop pair
# [1] "He" "el" "ll" "lo"
str_sub(txt,  c(1,2,3,4), c(2,3,4,5))
# [1] "He" "'m" "lo" "om"
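
The difference in recycling between substr and substring can be put to use. The following base-R sketch, with an illustrative variable word, omits last to take the rest of a string and recycles one string against several end positions to list all of its prefixes.

```r
word <- "Hello"

# Omitting `last` takes everything from `first` to the end of the string
rest <- substring(word, 3)
print(rest)        # "llo"

# Recycling one string against a vector of end positions yields all prefixes
prefixes <- substring(word, 1, seq_len(nchar(word)))
print(prefixes)    # "H" "He" "Hel" "Hell" "Hello"

# substr() returns one result per element of x, so only the first stop is used
first_only <- substr(word, 1, seq_len(nchar(word)))
print(first_only)  # "H"
```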
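  • strtrim(x, width): truncates the string x from the start to the given display width; width is recycled along x. A Chinese character has a display width of 2, so width has to be set to twice the number of characters to keep.

  • word(string, start = 1L, end = start, sep = fixed(" ")) from the stringr package: extracts words (substrings) from a sentence. string is a string or a character vector; start is a numeric vector giving the position of the first word to extract; end is a numeric vector giving the position of the last word to extract; sep is the separator between words, a single space by default.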

txt <- c("Hello, the World!","I'm Chinese", "I love China.", "I come from China!")

strtrim(txt, 7)
# [1] "Hello, " "I'm Chi" "I love " "I come "
strtrim(txt,  c(1,2,3,4))  # widths are matched to elements by position
# [1] "H"    "I'"   "I l"  "I co"

word(txt, 2)
# [1] "the"     "Chinese" "love"    "come"   
word(txt, c(1,2))  # start is recycled to c(1,2,1,2)
# [1] "Hello,"  "Chinese" "I"       "come" 
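
The point about Chinese characters and width can be checked with nchar. This is a minimal sketch that assumes a UTF-8 locale; the variable zh is illustrative and holds four Chinese characters.

```r
# Four Chinese characters, written with \u escapes
zh <- "\u4e2d\u6587\u5b57\u7b26"

n_chars <- nchar(zh, type = "chars")  # number of characters
n_width <- nchar(zh, type = "width")  # number of display columns
print(c(n_chars, n_width))            # 4 8

# Keeping the first two characters therefore needs width = 4, not 2
two_chars <- strtrim(zh, 4)
print(two_chars)
```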
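
String replacement

Although sub, gsub, str_replace and str_replace_all can be used to replace text in strings, strictly speaking R has no function that replaces a string in place, because R passes arguments by value rather than by reference: these functions return a modified copy.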

text = c("Hellow, Adam Adam!", "Hi, Paul Adam !", "How are you, Adam, Ava.")

sub(pattern = "Adam", replacement = "world", text)
# [1] "Hellow, world Adam!"      "Hi, Paul world !"         "How are you, world, Ava."

gsub(pattern = "Adam", replacement = "world", text)
# [1] "Hellow, world world!"     "Hi, Paul world !"         "How are you, world, Ava."

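As you can see, despite the name "replacement", the original strings are not changed; to change the original variable, the result has to be assigned back to it. The difference between sub and gsub is that sub replaces only the first match in each string (no matter how many matches there are), whereas gsub replaces every match.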
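
A short base-R sketch of the reassignment step described above; the variable names greeting and replaced are illustrative.

```r
greeting <- "Hi, Adam and Adam"

# gsub() returns a new string; `greeting` itself is untouched
replaced <- gsub("Adam", "world", greeting)
print(greeting)   # "Hi, Adam and Adam"
print(replaced)   # "Hi, world and world"

# To "modify" the variable, assign the result back to the same name
greeting <- sub("Adam", "world", greeting)
print(greeting)   # "Hi, world and Adam"
```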

The stringr package has analogous functions: str_replace, like sub, replaces only the first match, and str_replace_all, like gsub, replaces every match.

sub(c("H"), c("I"),c("HacHgd", "aeHfgH", "defg"))
# [1] "IacHgd" "aeIfgH" "defg"
str_replace(c("HacHgd", "aeHfgH", "defg"), c("H"), c("I"))
# [1] "IacHgd" "aeIfgH" "defg"

gsub(c("H"), c("I"),c("HacHgd", "aeHfgH", "defg"))
# [1] "IacIgd" "aeIfgI" "defg"
str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H"), c("I"))
# [1] "IacIgd" "aeIfgI" "defg"

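Unlike sub and gsub, the stringr functions str_replace and str_replace_all can not only search and replace a single pattern, but can also pair several patterns and replacements with the strings position by position. (Underneath it is the same mechanism: the shorter character vector is recycled to complete the matching.) Note that the outputs below were produced with an older stringr; stringr 1.5.0 and later follow the tidyverse recycling rules, under which only length-1 vectors are recycled, so calls with mismatched lengths like these are expected to fail with an error rather than a warning.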

sub(c("H","a"), c("I", "b"),c("HacHgd", "aeHfgH", "defg")) # only "H" takes part in the search and replace
## [1] "IacHgd" "aeIfgH" "defg"
## 1.Warning in sub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'pattern' has length > 1 and only the first element will be used
## 2.Warning in sub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'replacement' has length > 1 and only the first element will be used

str_replace(c("HacHgd", "aeHfgH", "defg"), c("H","a"), c("I", "b"))
## [1] "IacHgd" "beHfgH" "defg"
## Warning in stri_replace_first_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length

gsub(c("H","a"), c("I", "b"),c("HacHgd", "aeHfgH", "defg"))
## [1] "IacIgd" "aeIfgI" "defg"
## Warning in gsub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'pattern' has length > 1 and only the first element will be used
## Warning in gsub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'replacement' has length > 1 and only the first element will be used

str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H","a"), c("I", "b"))
## [1] "IacIgd" "beHfgH" "defg"
## Warning in stri_replace_all_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length

str_replace(c("HacHgd", "aeHfgH", "defg"), c("H","a","g", "d"), c("I", "b","H","e")) # the result now has length 4
## [1] "IacHgd" "beHfgH" "defH"   "HacHge"
## Warning in stri_replace_first_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length

str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H","a","g", "d"), c("I", "b","H","e")) # the result now has length 4
## [1] "IacIgd" "beHfgH" "defH"   "HacHge"
## Warning in stri_replace_all_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length
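
The position-by-position pairing can also be written out explicitly in base R. The sketch below is not how stringr implements it; it assumes the three vectors have equal length and uses mapply to apply sub with pattern i and replacement i to string i. The variable names are illustrative.

```r
strings      <- c("HacHgd", "aeHfgH", "defg")
patterns     <- c("H", "a", "g")
replacements <- c("I", "b", "H")

# Apply sub() pairwise: pattern i and replacement i act on string i only
pairwise <- mapply(
  function(p, r, s) sub(p, r, s, fixed = TRUE),
  patterns, replacements, strings,
  USE.NAMES = FALSE
)
print(pairwise)   # "IacHgd" "beHfgH" "defH"
```

Keeping the vectors the same length avoids relying on recycling at all, which sidesteps the warnings shown above.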

In addition, str_replace_all can apply several replacements to every string at once when given a named vector (str_replace does not have this feature).

y = c(c("I", "b"))
names(y) = c("H","a")
str_replace_all(c("HacHgd", "aeHfgH", "defg"),y)
## [1] "IbcIgd" "beIfgI" "defg"

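The multiple-replacement feature of str_replace_all can sometimes produce unexpected results, because the output of one replacement can be matched again by a later one; mgsub::mgsub produces a different result.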

y = c(c("a", "H"))
names(y) = c("H","a")
str_replace_all(c("HacHgd", "aeHfgH", "defg"),y)
## [1] "HHcHgd" "HeHfgH" "defg"

mgsub::mgsub(c("HacHgd", "aeHfgH", "defg"), c("H","a"),c(c("a", "H")))
## [1] "aHcagd" "Heafga" "defg"
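
The two results above are consistent with applying the replacements one after another versus applying them simultaneously. The following base-R sketch reproduces both behaviours for this example without either package; it is an illustration, not the implementation of str_replace_all or mgsub. chartr only works here because each pattern is a single character.

```r
x <- c("HacHgd", "aeHfgH", "defg")

# Sequential: first H -> a, then a -> H; the second step also hits the new "a"s
step1      <- gsub("H", "a", x, fixed = TRUE)
sequential <- gsub("a", "H", step1, fixed = TRUE)
print(sequential)    # "HHcHgd" "HeHfgH" "defg"

# Simultaneous: chartr() maps each character exactly once
simultaneous <- chartr("Ha", "aH", x)
print(simultaneous)  # "aHcagd" "Heafga" "defg"
```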