R Study Notes (7): Working with Strings Using stringr (2)

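Goal: combine stringr with regular expressions to

Determine which strings match a pattern
Find the positions of matches
Extract the matched text
Replace the matched text
Split strings based on a match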

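1. Detecting matches

1.1 str_detect()
> library(tidyverse) # loads stringr, dplyr, tibble and the %>% pipe
# returns a logical vector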
> str_detect(c("huang","si","yuan"),"a")
[1]  TRUE FALSE  TRUE
# number of elements that match
> sum(str_detect(c("huang","si","yuan"),"a"))
[1] 2
# proportion of elements that match
> mean(str_detect(c("huang","si","yuan"),"a"))
[1] 0.6666667

str_detect() can also be used to select the elements that match a pattern.

Prerequisite: logical subsetting, for example

> hsy <- c("huang","si","yuan")
> hsy[c(TRUE,TRUE,FALSE)]
[1] "huang" "si"

Applying the same idea with str_detect()

> hsy <- c("huang","si","yuan")
> hsy[str_detect(hsy,"a")]
[1] "huang" "yuan"

Alternatively, str_subset() does the same in a single call

> str_subset(hsy,"a")
[1] "huang" "yuan"
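
The equivalence between the two approaches can be checked directly. The following is a minimal sketch; the negate = TRUE example is an addition that is not in the original notes and assumes stringr 1.4.0 or later.

```r
library(stringr)

hsy <- c("huang", "si", "yuan")

# Logical subsetting and str_subset() select the same elements
by_detect <- hsy[str_detect(hsy, "a")]
by_subset <- str_subset(hsy, "a")
identical(by_detect, by_subset)

# negate = TRUE keeps the elements that do NOT match the pattern
no_a <- str_subset(hsy, "a", negate = TRUE)
no_a
```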
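
A more practical scenario

Filter the rows of a data frame to those where one column matches a pattern (words is a character vector that ships with stringr)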

> df <- tibble(
+   word=words,
+   i=seq_along(word) # add a row number
+ )
> df %>% filter(str_detect(word,"x$"))
# A tibble: 4 x 2
  word      i
  <chr> <int>
1 box     108
2 sex     747
3 six     772
4 tax     841

1.2 str_count()
# returns the number of matches in each element
> str_count(hsy,"a")
[1] 1 0 1
# average number of matches per element
> mean(str_count(hsy,"a"))
[1] 0.6666667
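
Two points about str_count() are worth a short sketch: matches never overlap, and the function combines naturally with dplyr::mutate(). The tibble below is hypothetical toy data built from the hsy vector, not the words dataset.

```r
library(stringr)
library(dplyr)
library(tibble)

# Matches never overlap: "aba" is found at positions 1 and 5 only
overlap_count <- str_count("abababa", "aba")

# Hypothetical toy data frame with one string column
df <- tibble(word = c("huang", "si", "yuan"))

# Add per-row counts of vowels and of all other characters
counted <- df %>%
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )
counted
```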

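2. Extracting matched text

This section extracts the matched text itself, as opposed to selecting the whole elements that contain a match as in section 1.

2.1 str_extract()

The sentences dataset ships with stringr; it is a character vector of 720 elements.

First, pick out the sentences that contain a match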

> has_red_blue <- str_subset(sentences,"red|blue")
> head(has_red_blue)
[1] "Glue the sheet to the dark blue background."
[2] "Two blue fish swam in the tank."            
[3] "The colt reared and threw the tall rider."  
[4] "The wide road shimmered in the hot sun."    
[5] "See the cat glaring at the scared mouse."   
[6] "A wisp of cloud hung in the blue air."  

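Extract the matched text. Note that str_extract() returns only the first match in each string, and that the pattern also matches "red" inside longer words such as "reared", "shimmered" and "scared" in the output above.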

> matches <- str_extract(has_red_blue,"red|blue") 
> head(matches)
[1] "blue" "blue" "red"  "red"  "red"  "blue"
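
If only the standalone words are wanted, one option is to anchor the pattern with word boundaries. The sketch below is an addition to the original notes; the third input string is a made-up example with no match, included to show that str_extract() returns NA in that case.

```r
library(stringr)

x <- c(
  "Two blue fish swam in the tank.",
  "The colt reared and threw the tall rider.",
  "A green leaf fell."  # hypothetical sentence with no match
)

# Plain alternation also matches "red" inside "reared"
loose <- str_extract(x, "red|blue")

# \\b restricts the alternatives to whole words
strict <- str_extract(x, "\\b(red|blue)\\b")

loose
strict
```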

2.2 str_extract_all()

How can every match be extracted, not just the first?
First check whether any sentence contains more than one match

> more <- has_red_blue[str_count(has_red_blue,"red|blue") > 1]
> more
[1] "It is hard to erase blue or red ink."

Extract all matches with str_extract_all()

> str_extract_all(more,"red|blue") # returns a list
[[1]]
[1] "blue" "red" 
> str_extract_all(more,"red|blue",simplify = TRUE) # returns a matrix
     [,1]   [,2] 
[1,] "blue" "red"
> head(str_extract_all(has_red_blue,"red|blue",simplify = TRUE)) # shorter rows are padded with "" to a common width
     [,1]   [,2]
[1,] "blue" ""  
[2,] "blue" ""  
[3,] "red"  ""  
[4,] "red"  ""  
[5,] "red"  ""  
[6,] "blue" ""
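
The difference between the list and matrix return values shows up most clearly when an element has no match at all. A minimal sketch, where the second input string is a made-up example:

```r
library(stringr)

x <- c(
  "It is hard to erase blue or red ink.",
  "The sky is grey."  # hypothetical sentence with no match
)

# List form: one character vector per input, character(0) when nothing matches
as_list <- str_extract_all(x, "red|blue")
n_matches <- lengths(as_list)

# Matrix form: one row per input, padded with "" up to the widest row
as_matrix <- str_extract_all(x, "red|blue", simplify = TRUE)

n_matches
as_matrix
```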

3. Matching with groups

str_match() returns the text matched by each capture group in its own column, which is more convenient than referring to the groups with \1, \2.

> two_words <- "(a|the) ([^ ]+)"
> has_two_words <- sentences %>% str_subset(two_words) %>% head(10) 
> has_two_words %>% str_extract(two_words) # returns the complete match only
 [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked" "the sun"   
 [7] "the huge"   "the ball"   "the woman"  "a helps"   
> has_two_words %>% str_match(two_words) # returns the complete match plus one column per group
      [,1]         [,2]  [,3]     
 [1,] "the smooth" "the" "smooth" 
 [2,] "the sheet"  "the" "sheet"  
 [3,] "the depth"  "the" "depth"  
 [4,] "a chicken"  "a"   "chicken"
 [5,] "the parked" "the" "parked" 
 [6,] "the sun"    "the" "sun"    
 [7,] "the huge"   "the" "huge"   
 [8,] "the ball"   "the" "ball"   
 [9,] "the woman"  "the" "woman"  
[10,] "a helps"    "a"   "helps"
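
The last row, "a helps", comes from the "a" at the end of a longer word, not from the article. A possible refinement, sketched here as an addition to the notes, is to require a word boundary before the article. The first input string is assumed to be the sentence behind row 10; the second is a sentence with no match, included to show the NA row.

```r
library(stringr)

x <- c(
  "A pot of tea helps to pass the evening.",  # assumed source of "a helps"
  "Rice is often served in round bowls."      # contains no "a " or "the "
)

# Original pattern: the "a" at the end of "tea" counts as a match
loose <- str_match(x, "(a|the) ([^ ]+)")

# With \\b, the article must start at a word boundary
strict <- str_match(x, "\\b(a|the) ([^ ]+)")

# Column 1 is the complete match, columns 2 and 3 are the two groups
loose
strict
```

Individual groups can then be taken by column, for example strict[, 3] for the word after the article.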

4. Replacing matched text

str_replace() replaces the first match in each string; str_replace_all() replaces every match.

> hsy <- c("huang","si","yuan")
> str_replace(hsy,"[aeiou]"," ")
[1] "h ang" "s "    "y an" 
> str_replace_all(hsy, "[aeiou]", " ")
[1] "h  ng" "s "    "y  n" 

Several replacements can be performed in one call by passing a named vector (pattern = replacement)

> x <- c("1 house", "2 cars", "3 people")
> str_replace_all(x, c("1" = "one","2" = "two", "3" = "three"))
[1] "one house"    "two cars"     "three people"
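
The replacement string can also refer back to capture groups with \\1, \\2 and so on. This is not covered in the notes above; the sketch below uses made-up input strings and swaps the second and third words.

```r
library(stringr)

x <- c("one two three", "red green blue")

# \\1, \\2, \\3 in the replacement insert the text of the matching groups
swapped <- str_replace(x, "(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2")
swapped
```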

5. Splitting

str_split()
str_split() returns a list; with simplify = TRUE it returns a character matrix instead, padding shorter rows with "".

> sentences %>% head(4) %>% str_split(" ",simplify = TRUE)
     [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]          [,9]   
[1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."     ""     
[2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background." ""     
[3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"           "well."
[4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"        "dish."

Extracting elements from the list returned by str_split()

> "a|b|c" %>% str_split("\\|") %>% .[[1]]
[1] "a" "b" "c"
> "a|b|c" %>% str_split("\\|") %>% .[[1]] %>% .[2]
[1] "b"
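
Two further options of str_split() may be useful here; both are additions to the notes. The n argument caps the number of pieces, and boundary("word") splits on word boundaries instead of a literal separator, which drops the punctuation. The example sentence is made up.

```r
library(stringr)

# Plain indexing with [[1]] works the same as piping into .[[1]]
fields <- str_split("a|b|c", "\\|")[[1]]

# n = 2 stops after the first split and keeps the rest intact
first_two <- str_split("a|b|c", "\\|", n = 2)[[1]]

# Split into words rather than on a fixed separator
word_pieces <- str_split("This is a sentence.", boundary("word"))[[1]]

fields
first_two
word_pieces
```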

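6. Locating matches

str_locate() returns the start and end positions of the first match in each string; str_locate_all() returns a list with one matrix of positions per string.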

> str_locate(hsy,"[aeiou]")
     start end
[1,]     2   2
[2,]     2   2
[3,]     2   2
> str_locate_all(hsy,"[aeiou]")
[[1]]
     start end
[1,]     2   2
[2,]     3   3

[[2]]
     start end
[1,]     2   2

[[3]]
     start end
[1,]     2   2
[2,]     3   3
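
The positions returned by str_locate() can be passed to str_sub() to read or overwrite the matched text. A minimal sketch, added here as an illustration:

```r
library(stringr)

hsy <- c("huang", "si", "yuan")
loc <- str_locate(hsy, "[aeiou]")

# Read the matched text back using the start and end columns
first_vowel <- str_sub(hsy, loc[, "start"], loc[, "end"])

# This gives the same result as str_extract()
same_as_extract <- identical(first_vowel, str_extract(hsy, "[aeiou]"))

# The assignment form of str_sub() overwrites the located span
marked <- hsy
str_sub(marked, loc[, "start"], loc[, "end"]) <- "_"

first_vowel
marked
```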

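7. Adjusting matching rules with regex()

> str_view_all(hsy,regex("[aeiou]",ignore_case = TRUE,multiline = TRUE,comments = TRUE,dotall = TRUE))

In stringr 1.5.0 and later, str_view() highlights all matches and str_view_all() is deprecated.

ignore_case = TRUE: match without regard to case
multiline = TRUE: ^ and $ match the start and end of each line rather than of the whole string
comments = TRUE: allow comments and whitespace inside the pattern to make it more readable
dotall = TRUE: the dot . also matches newline characters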
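
The effect of each option is easier to see in isolation. The inputs below are made-up examples, and the phone pattern is a hypothetical illustration of comments = TRUE, not something from the notes.

```r
library(stringr)

# ignore_case: "si" also matches "SI"
ci <- str_detect(c("Huang", "SI", "yuan"), regex("si", ignore_case = TRUE))

# multiline: ^ matches at the start of every line, not only of the string
x <- "Line 1\nLine 2\nLine 3"
single <- str_extract_all(x, "^Line")[[1]]
multi <- str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]

# dotall: by default . does not match a newline
dot_default <- str_detect("a\nb", "a.b")
dot_all <- str_detect("a\nb", regex("a.b", dotall = TRUE))

# comments: whitespace and # comments inside the pattern are ignored
phone <- regex("
  (\\d{3})  # first three digits
  -?        # optional dash
  (\\d{4})  # last four digits
", comments = TRUE)
phone_match <- str_match("555-1234", phone)
phone_match
```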
