Big Data Development with Hive, Part 12: Hive Regular Expressions

Note:
Hive version: 2.1.1

1. Overview of Hive Regular Expressions

Hive's regular expressions are not as powerful as those of some relational databases, but they are still enough to solve many of the problems that come up in HQL development, and regular expressions are part of a data practitioner's daily toolkit. This post is a short summary of regular expressions in Hive; all of the code below has been tested and runs as shown.

Hive provides the following three regular-expression functions/operators:

  1. regexp
  2. regexp_extract
  3. regexp_replace

1.1 Character classes:

Hive regular expressions use Java's java.util.regex syntax, so inside a HiveQL string literal every backslash must be doubled (e.g. '\\d').

Character  Matches
\d  any digit, 0-9
\D  any non-digit character
\w  any word character: A-Z, a-z, 0-9 and underscore
\W  any non-word character
\s  any whitespace character: tab, newline, carriage return, form feed and vertical tab
\S  any non-whitespace character
.   any single character
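Since these classes behave the same in Python's re module as in Java's engine, a pattern can be sanity-checked outside Hive before doubling the backslashes for an HQL literal; a minimal check:

```python
import re

# \d: digit, \w: word character, \s: whitespace, .: any single character.
# In a HiveQL string literal each backslash is doubled: '\\d', '\\w', ...
has_digit      = bool(re.search(r'\d', 'abc123'))        # a digit occurs somewhere
all_word_chars = bool(re.fullmatch(r'\w+', 'snake_9'))   # only letters/digits/underscore
has_whitespace = bool(re.search(r'\s', 'a\tb'))          # tab counts as whitespace
dot_any        = bool(re.fullmatch(r'.', 'x'))           # . matches exactly one character
```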

Character classes (POSIX style):

 [[:alpha:]]  any letter
 [[:digit:]]  any digit
 [[:alnum:]]  any letter or digit
 [[:space:]]  any whitespace character
 [[:upper:]]  any uppercase letter
 [[:lower:]]  any lowercase letter
 [[:punct:]]  any punctuation character
 [[:xdigit:]] any hexadecimal digit, equivalent to [0-9a-fA-F]

Caution: this [[:alpha:]] notation comes from POSIX regular expressions. Java's engine, which Hive uses, writes these classes as \p{Alpha}, \p{Digit}, \p{Alnum}, \p{Space}, \p{Upper}, \p{Lower}, \p{Punct} and \p{XDigit} instead.

1.2 Anchors:

Character  Description
^  the start of the string; in multiline mode, the start of each line
$  the end of the string; in multiline mode, the end of each line

1.3 Quantifiers:

A greedy quantifier matches as many characters as possible; a non-greedy (lazy) quantifier matches as few as possible.

Greedy  Non-greedy  Description
*       *?          zero or more times
?       ??          zero or one time
+       +?          one or more times
{m}     {m}?        exactly m times (greedy and non-greedy behave identically)
{m,}    {m,}?       at least m times
{m,n}   {m,n}?      at least m and at most n times
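The greedy/lazy difference is easiest to see on a small example; Python's re shares Java's quantifier semantics here:

```python
import re

text = '<a><b>'
greedy = re.search(r'<.+>', text).group(0)    # grabs as much as possible
lazy   = re.search(r'<.+?>', text).group(0)   # stops at the first valid match
# greedy == '<a><b>', lazy == '<a>'
```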

1.4 Composition operators:

Precedence, from highest to lowest: parentheses > quantifiers > concatenation ("and") > alternation ("or").

Operator  Description
[…]       matches any single character from the set inside the brackets
[^…]      when ^ is the first character inside the brackets, matches any character not in the set
(…)       grouping: treats a complex expression as a single unit
..|..     alternation ("or"): matches either alternative
abc       concatenation ("and"): characters written in sequence must match in sequence
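A quick illustration of the precedence rule: concatenation binds tighter than |, so grouping changes what the alternation applies to (checked here with Python's re, whose semantics match Java's for these constructs):

```python
import re

# 'ab|cd' means (ab) or (cd): concatenation binds tighter than |
assert re.fullmatch(r'ab|cd', 'ab')
assert not re.fullmatch(r'ab|cd', 'ad')
# 'a(b|c)d' restricts the alternation to the middle character only
assert re.fullmatch(r'a(b|c)d', 'abd')
assert re.fullmatch(r'a(b|c)d', 'acd')
# character classes: one character from the set; [^...] negates the set
assert re.fullmatch(r'[abc]', 'b')
assert not re.fullmatch(r'[^abc]', 'b')
```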

1.5 Match operators:

Operator  Description
\n        backreference: n is a digit 1-9 referring to the text captured by the n-th parenthesized group, counted from left to right. In the pattern argument it is written '\\n' inside a HiveQL literal; in the replacement argument of regexp_replace (which delegates to Java's replaceAll), a captured group is referenced as $n instead.
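A backreference in the pattern, and the corresponding group reference in a replacement, can be sketched as follows. Note the syntax difference: Python's re.sub uses \1 in the replacement, while Hive's regexp_replace (Java's replaceAll) uses $1:

```python
import re

# \1 in the pattern refers back to what group 1 captured: here, a doubled character
doubled = re.search(r'(\w)\1', 'abbc').group(0)          # 'bb'

# swapping two captured fields via group references in the replacement
swapped = re.sub(r'(\d+)-(\d+)', r'\2-\1', '2020-12')    # '12-2020'
# Hive equivalent: regexp_replace('2020-12', '(\\d+)-(\\d+)', '$2-$1')
```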

1.6 Escaping:

Operator  Description
\         treats the operator character that follows as a literal character. For example, abc*def matches abdef, abcccdef and so on, but not the literal string abc*def; the latter requires abc\*def (written 'abc\\*def' in a HiveQL literal).
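The escaping rule, checked with Python's re (in a HiveQL literal the backslash itself must also be doubled, so the last pattern below would be written 'abc\\*def'):

```python
import re

# unescaped: * is a quantifier, so abc*def means 'ab', zero or more 'c', then 'def'
assert re.fullmatch(r'abc*def', 'abdef')
assert re.fullmatch(r'abc*def', 'abcccdef')
assert not re.fullmatch(r'abc*def', 'abc*def')
# escaped: \* matches a literal asterisk
assert re.fullmatch(r'abc\*def', 'abc*def')
```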

2. Hive Regular Expression Examples

2.1 regexp

Syntax:

A REGEXP B
Same behavior as RLIKE.

Returns NULL if A or B is NULL, TRUE if any substring of A matches the Java regular expression B, and FALSE otherwise. Note that the match is unanchored: B only needs to match somewhere inside A, so anchor with ^ and $ when the whole string must match. (The wildcards '_' and '%' belong to the LIKE operator, not to REGEXP.)
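The unanchored behavior is the point that most often surprises people coming from LIKE. In Python's re terms, REGEXP behaves like re.search, not re.fullmatch:

```python
import re

# REGEXP is satisfied when the pattern matches ANY substring of the value
anywhere = bool(re.search(r'\d{8}', 'd111456789abc'))    # True
# anchor with ^ (and $ or \D) when the position matters
at_start = bool(re.search(r'^\d{8}', 'd111456789abc'))   # False
```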

Code:

-- match strings containing 8 consecutive digits
with tmp1 as
(
select '11145678abc' as rn
union all
select '111456789abc'
union all
select 'd111456789abc'
)
select rn
  from tmp1
 where rn regexp '\\d{8}';


-- match strings starting with 8 or more consecutive digits
with tmp1 as
(
select '11145678abc' as rn
union all
select '111456789abc'
union all
select 'd111456789abc'
)
select rn
  from tmp1
 where rn regexp '^\\d{8}';

-- match strings starting with exactly 8 consecutive digits
with tmp1 as
(
select '11145678abc' as rn
union all
select '111456789abc'
union all
select 'd111456789abc'
)
select rn
  from tmp1
 where rn regexp '^\\d{8}\\D';
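Before running these in Hive, the three patterns can be checked against the same three strings with Python's re (remember that '\\d' in a HiveQL literal reaches the engine as \d):

```python
import re

rows = ['11145678abc', '111456789abc', 'd111456789abc']

# '\\d{8}'     -> 8 consecutive digits anywhere in the string
anywhere = [r for r in rows if re.search(r'\d{8}', r)]
# '^\\d{8}'    -> at least 8 consecutive digits at the start
at_start = [r for r in rows if re.search(r'^\d{8}', r)]
# '^\\d{8}\\D' -> exactly 8 digits at the start, followed by a non-digit
exactly8 = [r for r in rows if re.search(r'^\d{8}\D', r)]
# anywhere keeps all 3 rows, at_start the first 2, exactly8 only the first
```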

Test log:

hive> 
    > -- match strings containing 8 consecutive digits
    > with tmp1 as
    > (
    > select '11145678abc' as rn
    > union all
    > select '111456789abc'
    > union all
    > select 'd111456789abc'
    > )
    > select rn
    >   from tmp1
    >  where rn regexp '\\d{8}';
Query ID = root_20201217151846_3f0eebcd-6f5a-455d-b8c9-afd5cddbc358
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0295, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0295/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0295
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-17 15:18:52,063 Stage-1 map = 0%,  reduce = 0%
2020-12-17 15:18:58,242 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.73 sec
MapReduce Total cumulative CPU time: 1 seconds 730 msec
Ended Job = job_1606698967173_0295
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.73 sec   HDFS Read: 5150 HDFS Write: 162 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 730 msec
OK
rn
11145678abc
111456789abc
d111456789abc
Time taken: 13.291 seconds, Fetched: 3 row(s)
hive> -- match strings starting with 8 or more consecutive digits
hive> with tmp1 as
    > (
    > select '11145678abc' as rn
    > union all
    > select '111456789abc'
    > union all
    > select 'd111456789abc'
    > )
    > select rn
    >   from tmp1
    >  where rn regexp '^\\d{8}';
Query ID = root_20201217151946_c5102f51-5e70-4a80-afa6-3678f92091f0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0296, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0296/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0296
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-17 15:19:54,003 Stage-1 map = 0%,  reduce = 0%
2020-12-17 15:20:00,174 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.85 sec
MapReduce Total cumulative CPU time: 3 seconds 850 msec
Ended Job = job_1606698967173_0296
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 3.85 sec   HDFS Read: 10856 HDFS Write: 223 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 850 msec
OK
rn
11145678abc
111456789abc
Time taken: 15.275 seconds, Fetched: 2 row(s)
hive> -- match strings starting with exactly 8 consecutive digits
hive> with tmp1 as
    > (
    > select '11145678abc' as rn
    > union all
    > select '111456789abc'
    > union all
    > select 'd111456789abc'
    > )
    > select rn
    >   from tmp1
    >  where rn regexp '^\\d{8}\\D';
Query ID = root_20201217152016_c920f47c-663c-4d4f-a9df-e9c77d20a126
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0297, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0297/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0297
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-17 15:20:22,628 Stage-1 map = 0%,  reduce = 0%
2020-12-17 15:20:28,804 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.61 sec
MapReduce Total cumulative CPU time: 3 seconds 610 msec
Ended Job = job_1606698967173_0297
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 3.61 sec   HDFS Read: 10990 HDFS Write: 198 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 610 msec
OK
rn
11145678abc
Time taken: 13.633 seconds, Fetched: 1 row(s)
hive> 

2.2 regexp_replace

Syntax:

regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)

Returns the string produced by replacing every substring of INITIAL_STRING that matches the Java regular expression PATTERN with REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb'. Be careful with predefined character classes: passing '\s' as the second argument matches the letter s, while '\\s' is required to match whitespace, and so on.
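The 'fb' example and the escaping caveat, replayed with Python's re (Python raw strings play the role that doubled backslashes play in HiveQL literals):

```python
import re

# regexp_replace("foobar", "oo|ar", "") -> 'fb'
assert re.sub('oo|ar', '', 'foobar') == 'fb'

# pattern 's'  (what the engine sees for the Hive literal '\s'): removes the letter s
assert re.sub('s', '', 'so so') == 'o o'
# pattern '\s' (what the engine sees for the Hive literal '\\s'): removes whitespace
assert re.sub(r'\s', '', 'so so') == 'soso'
```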

Test data:

create table test_reg(id int,str string);

insert into test_reg values (1,'我在学习Hive,大数据。');
insert into test_reg values (2,'Hive,我来了,Coming666。');
insert into test_reg values (3,'666,Hive居然拥有关系型数据库的诸多特性。');
insert into test_reg values (4,'wuwuwu,Hive学习起来还是存在一定难度。');
insert into test_reg values (5,'Hive数据仓库,6666。');

2.2.1 Extract the Chinese characters from a string

Code:

select id,regexp_replace(str,'([^\\u4E00-\\u9FA5]+)','') new_str
  from test_reg;
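The same substitution can be replayed in Python's re; \u4E00-\u9FA5 is the basic CJK Unified Ideographs range, so the negated class strips everything that is not a Chinese character:

```python
import re

s = '我在学习Hive,大数据。'
# keep only characters in the CJK range U+4E00..U+9FA5
only_cjk = re.sub(r'[^\u4E00-\u9FA5]+', '', s)
# ASCII letters, the fullwidth comma and 。 are all outside the range and removed
```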

Test log:

hive> 
    > select id,regexp_replace(str,'([^\\u4E00-\\u9FA5]+)','') new_str
    >   from test_reg;
Query ID = root_20201217153549_3a7163ba-365b-4f65-adad-930d42ec385d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0303, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0303/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0303
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-17 15:35:56,518 Stage-1 map = 0%,  reduce = 0%
2020-12-17 15:36:02,702 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.32 sec
MapReduce Total cumulative CPU time: 6 seconds 320 msec
Ended Job = job_1606698967173_0303
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.32 sec   HDFS Read: 8951 HDFS Write: 495 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 320 msec
OK
id      new_str
3       居然拥有关系型数据库的诸多特性
4       学习起来还是存在一定难度
5       数据仓库
1       我在学习大数据
2       我来了
Time taken: 14.196 seconds, Fetched: 5 row(s)
hive> 

2.2.2 Extract the digits from a string

Code:

select id,regexp_replace(str,'([^0-9]+)','') new_str
  from test_reg;

Test log:

hive> 
    > select id,regexp_replace(str,'([^0-9]+)','') new_str
    >   from test_reg;
Query ID = root_20201217153822_cf2389b9-8533-4b82-b472-57df3a1da418
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0304, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0304/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0304
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-17 15:38:30,220 Stage-1 map = 0%,  reduce = 0%
2020-12-17 15:38:37,427 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.19 sec
MapReduce Total cumulative CPU time: 6 seconds 190 msec
Ended Job = job_1606698967173_0304
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.19 sec   HDFS Read: 9139 HDFS Write: 259 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 190 msec
OK
id      new_str
3       666
4
5       6666
1
2       666
Time taken: 16.038 seconds, Fetched: 5 row(s)
hive> 

2.2.3 Extract the letters from a string

Code:

select id,regexp_replace(str,'([^a-zA-Z]+)','') new_str
  from test_reg;

Test log:

hive> 
    > select id,regexp_replace(str,'([^a-zA-Z]+)','') new_str
    >   from test_reg;
Query ID = root_20201217154102_65e60007-0a8a-45d2-94de-bb0bfe11f9e8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0305, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0305/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0305
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-17 15:41:08,696 Stage-1 map = 0%,  reduce = 0%
2020-12-17 15:41:14,864 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 3.02 sec
2020-12-17 15:41:15,891 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.13 sec
MapReduce Total cumulative CPU time: 6 seconds 130 msec
Ended Job = job_1606698967173_0305
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.13 sec   HDFS Read: 9145 HDFS Write: 281 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 130 msec
OK
id      new_str
3       Hive
4       wuwuwuHive
5       Hive
1       Hive
2       HiveComing
Time taken: 14.205 seconds, Fetched: 5 row(s)
hive> 

2.2.4 Extract the letters and digits from a string

Code:

select id,regexp_replace(str,'([^a-zA-Z0-9]+)','') new_str
  from test_reg;

Test log:

hive> select id,regexp_replace(str,'([^a-zA-Z0-9]+)','') new_str
    >   from test_reg;
Query ID = root_20201217154722_738edcd7-4c2a-4aa1-a66a-acac268f199b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0310, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0310/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0310
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-17 15:47:29,743 Stage-1 map = 0%,  reduce = 0%
2020-12-17 15:47:36,950 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.13 sec
MapReduce Total cumulative CPU time: 6 seconds 130 msec
Ended Job = job_1606698967173_0310
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.13 sec   HDFS Read: 9123 HDFS Write: 291 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 130 msec
OK
id      new_str
3       666Hive
4       wuwuwuHive
5       Hive6666
1       Hive
2       HiveComing666
Time taken: 15.686 seconds, Fetched: 5 row(s)

2.3 regexp_extract

Syntax:

regexp_extract(string subject, string pattern, int index)

Returns the string extracted by the pattern: index selects the capture group, with 0 returning the whole match, 1 the first parenthesized group, and so on. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. The same escaping caveat applies: '\s' as the pattern matches the letter s, while '\\s' is required to match whitespace.
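The group indexing can be checked in Python, where Match.group plays the role of the index argument (group 0 is the whole match):

```python
import re

m = re.search(r'foo(.*?)(bar)', 'foothebar')
whole  = m.group(0)   # 'foothebar'
first  = m.group(1)   # 'the'
second = m.group(2)   # 'bar'
```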

Test data (JSON strings):

create table test_reg2(id int,str string);
insert into test_reg2 values (1,'{"filtertype":"29","filtername":"成人Node","filtertitle":"成人Group","filtersubtype":"","filterid":"29|1","filterValue":"1|4"}');
insert into test_reg2 values (2,'{"filtertitle":"钻级","filtertype":"16","filtersubtype":"","filtername":"四钻/高档","filterid":"16|4",}');

Code:

select id
      ,regexp_extract(str,'(filtertype"\\:")(\\d+)(",)',2)               as filtertype
      ,regexp_extract(str,'(filtername"\\:")((\\W*\\w*)|(\\W*))(",)',2)  as filtername
      ,regexp_extract(str,'(filtertitle"\\:")((\\W*\\w*)|(\\W*))(",)',2) as filtertitle
      ,regexp_extract(str,'(filterid"\\:")(\\d+\\|\\d+)(",)',2)          as filterid
 from test_reg2;
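Two of these patterns, replayed in Python against the first test row: each pattern wraps the value in group 2, which is what index 2 pulls out, and the '\\:' in the Hive patterns is just an escaped colon, equivalent to ':'. (For well-formed JSON, Hive's get_json_object is usually the simpler tool; regexp_extract still works on slightly malformed strings like the second row, which has a trailing comma.)

```python
import re

row = ('{"filtertype":"29","filtername":"成人Node","filtertitle":"成人Group",'
       '"filtersubtype":"","filterid":"29|1","filterValue":"1|4"}')

# group 2 captures the digits between filtertype":" and ",
filtertype = re.search(r'(filtertype":")(\d+)(",)', row).group(2)      # '29'
filterid   = re.search(r'(filterid":")(\d+\|\d+)(",)', row).group(2)   # '29|1'
```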

Test log:

hive> 
    > select id
    >       ,regexp_extract(str,'(filtertype"\\:")(\\d+)(",)',2)               as filtertype
    >       ,regexp_extract(str,'(filtername"\\:")((\\W*\\w*)|(\\W*))(",)',2)  as filtername
    >       ,regexp_extract(str,'(filtertitle"\\:")((\\W*\\w*)|(\\W*))(",)',2) as filtertitle
    >       ,regexp_extract(str,'(filterid"\\:")(\\d+\\|\\d+)(",)',2)          as filterid
    >  from test_reg2;
Query ID = root_20201217164358_dbda76fc-2bc1-41d6-b140-72f590b05502
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0314, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0314/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0314
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-17 16:44:05,767 Stage-1 map = 0%,  reduce = 0%
2020-12-17 16:44:12,987 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.93 sec
MapReduce Total cumulative CPU time: 2 seconds 930 msec
Ended Job = job_1606698967173_0314
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.93 sec   HDFS Read: 5038 HDFS Write: 205 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 930 msec
OK
id      filtertype      filtername      filtertitle     filterid
1       29      成人Node        成人Group       29|1
2       16      四钻/高档       钻级    16|4
Time taken: 15.265 seconds, Fetched: 2 row(s)

References:

1. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-MathematicalFunctions
2. https://blog.csdn.net/weixin_37536446/article/details/81053172
3. https://www.cnblogs.com/db-record/p/11454325.html
