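Assertions in Regular Expressions
1. What is an assertion?
Broadly speaking, an assertion is, quite literally, a judgment of "yes" or "no". In the world of regular expressions, that means "match" or "no match". Any regular expression you write produces a result of "match" or "no match", so in that sense every regular expression can be called an assertion.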
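You will also often come across the notion of zero-width assertions. An ordinary assertion such as `\d+` (one or more digits) matches content that has a length, whereas assertions such as `^` and `$` (the start and end of a line, respectively) match only a position, so the content they match can be regarded as having length 0. Assertions of this kind are therefore called zero-width assertions.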
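The difference between an ordinary pattern and a zero-width one can be checked directly. The following is a minimal sketch using Python's `re` module (an assumption; the article itself does not name an engine at this point), with a made-up sample string:

```python
import re

s = "abc123def"  # hypothetical sample string

digits = re.search(r"\d+", s)    # ordinary pattern: consumes characters
line_start = re.search(r"^", s)  # anchor: matches a position only
line_end = re.search(r"$", s)    # anchor: matches a position only

# span() returns (start, end); a zero-width match has start == end
print(digits.group(), digits.span())
print(repr(line_start.group()), line_start.span())
print(repr(line_end.group()), line_end.span())
```

The digit match covers three characters, while both anchor matches are empty strings whose start and end indexes coincide.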
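In practice, however, when people say "assertion" they usually mean a zero-width assertion (see Regular Expressions Explained). (A simple way to look at it: the other assertions are straightforward and there is not much to say about them.) That is why you will sometimes see definitions like the following: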
An assertion is a test on the characters following or preceding the current matching point that does not actually consume any characters.
From: php Assertions
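In other words, an assertion checks whether the text before or after the current position matches, without consuming any characters.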
Here is another explanation of assertions:
Actually matches characters, but then gives up the match, returning only the result: match or no match. They do not consume characters in the string, but only assert whether a match is possible or not.
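What "not consuming characters" means in practice can be seen by comparing a pattern that matches two digits with one that matches one digit and only looks at the next. This is a small sketch in Python's `re` module (an assumption, not the article's tool), using a made-up input:

```python
import re

s = "1234"  # hypothetical sample string

# Both digits are consumed, so the next match starts two characters later.
pairs = re.findall(r"\d\d", s)

# The second digit is only checked by the lookahead, not consumed,
# so it is still available as the start of the next match.
peek = re.findall(r"\d(?=\d)", s)

print(pairs)
print(peek)
```

With the lookahead, every digit that has another digit after it is reported, because the digit inspected by the assertion stays available for the next match attempt.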
2. Types of assertions
Regular expressions have two kinds of assertions: anchors and lookarounds.
2.1 Anchors
Anchors, or atomic zero-width assertions, cause a match to succeed or fail depending on the current position in the string, but they do not cause the engine to advance through the string or consume characters. The metacharacters listed in the following table are anchors.
Assertion | Description | Pattern | Matches |
---|---|---|---|
`^` | The match must start at the beginning of the string or line. | `^\d{3}` | `901` in `901-333-` |
`$` | The match must occur at the end of the string or before `\n` at the end of the line or string. | `-\d{3}$` | `-333` in `-901-333` |
`\A` | The match must occur at the start of the string. | `\A\d{3}` | `901` in `901-333-` |
`\Z` | The match must occur at the end of the string or before `\n` at the end of the string. | `-\d{3}\Z` | `-333` in `-901-333` |
`\z` | The match must occur at the end of the string. | `-\d{3}\z` | `-333` in `-901-333` |
`\G` | The match must occur at the point where the previous match ended. | `\G\(\d\)` | `(1)`, `(3)`, `(5)` in `(1)(3)(5)[7](9)` |
`\b` | The match must occur on a boundary between a `\w` (alphanumeric) and a `\W` (nonalphanumeric) character. | `\b\w+\s\w+\b` | `them theme`, `them them` in `them theme them them` |
`\B` | The match must not occur on a `\b` boundary. | `\Bend\w*\b` | `ends`, `ender` in `end sends endure lender` |
From: Anchors in Regular Expressions
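Several rows of the table can be reproduced with Python's `re` module. This is only a sketch under that assumption: the table describes a different engine, and Python's `re` does not support `\G`, while its `\Z` matches only at the very end of the string, so those rows are left out. The two-line string `text` is a hypothetical addition to show the effect of multiline mode.

```python
import re

# Patterns and inputs taken from the table above.
start = re.findall(r"^\d{3}", "901-333-")
end = re.findall(r"-\d{3}$", "-901-333")
word_pairs = re.findall(r"\b\w+\s\w+\b", "them theme them them")
inner_end = re.findall(r"\Bend\w*\b", "end sends endure lender")

# With re.MULTILINE, ^ also matches at the start of every line,
# while \A still matches only at the start of the whole string.
text = "901-333\n555-111"  # hypothetical two-line input
caret_multiline = re.findall(r"^\d{3}", text, re.MULTILINE)
string_start = re.findall(r"\A\d{3}", text, re.MULTILINE)

print(start, end)
print(word_pairs)
print(inner_end)
print(caret_multiline, string_start)
```

The last two results differ because `^` is sensitive to multiline mode and `\A` is not.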
2.2 Lookarounds
Example | Lookaround Name | What it Does |
---|---|---|
`(?=foo)` | Lookahead | Asserts that what immediately follows the current position in the string is foo. |
`(?<=foo)` | Lookbehind | Asserts that what immediately precedes the current position in the string is foo. |
`(?!foo)` | Negative Lookahead | Asserts that what immediately follows the current position in the string is not foo. |
`(?<!foo)` | Negative Lookbehind | Asserts that what immediately precedes the current position in the string is not foo. |
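Because each lookaround is a statement about "the current position", one way to see what it does is to list the positions at which it succeeds. The sketch below does this with Python's `re` module (an assumption) on a made-up string; `positions` is a hypothetical helper written for this illustration.

```python
import re

s = "xfoofoo"  # hypothetical sample; indexes run from 0 to 7 (end of string)

def positions(pattern):
    """Return the indexes in s at which the zero-width pattern succeeds."""
    return [m.start() for m in re.finditer(pattern, s)]

lookahead = positions(r"(?=foo)")        # "foo" starts here
lookbehind = positions(r"(?<=foo)")      # "foo" ends here
neg_lookahead = positions(r"(?!foo)")    # "foo" does not start here
neg_lookbehind = positions(r"(?<!foo)")  # "foo" does not end here

print(lookahead, lookbehind)
print(neg_lookahead, neg_lookbehind)
```

Each negative form succeeds at exactly the positions where its positive counterpart fails, including the position at the end of the string.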
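3. Examples of using assertions
This section uses IDEA to demonstrate how assertions are used.
Create a text file named text.txt in IDEA and enter the following test text: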
in the house, there is a little horse.
finally, it won over a long race near the small inn.
all above is just a makeup story.
3.1 Lookahead
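To match an "in" that is immediately followed by "al", use the regular expression:
in(?=al)
In the test text, this matches only the "in" in "finally".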
3.2 Lookbehind
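To match an "al" that is immediately preceded by "in", use the regular expression:
(?<=in)al
In the test text, this matches only the "al" in "finally".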
3.3 Negative Lookahead
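To match an "in" that is not immediately followed by "al", use the regular expression:
in(?!al)
In the test text, this matches the standalone "in" at the start of the first line and the "in" in "inn", but not the one in "finally".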
3.4 Negative Lookbehind
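To match an "al" that is not immediately preceded by "in", use the regular expression:
(?<!in)al
In the test text, this matches the "al" in "small" and in "all", but not the one in "finally".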
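The four patterns from sections 3.1 to 3.4 can also be checked outside IDEA. The following is a minimal sketch in Python's `re` module (an assumption; IDEA uses its own engine) run against the same test text. `containing_words` is a hypothetical helper that reports which word each match falls in.

```python
import re

text = (
    "in the house, there is a little horse.\n"
    "finally, it won over a long race near the small inn.\n"
    "all above is just a makeup story."
)

def containing_words(pattern):
    """Return, for every match of pattern, the word of text it lies in."""
    words = []
    for m in re.finditer(pattern, text):
        for w in re.finditer(r"\w+", text):
            if w.start() <= m.start() and m.end() <= w.end():
                words.append(w.group())
    return words

lookahead = containing_words(r"in(?=al)")        # 3.1
lookbehind = containing_words(r"(?<=in)al")      # 3.2
neg_lookahead = containing_words(r"in(?!al)")    # 3.3
neg_lookbehind = containing_words(r"(?<!in)al")  # 3.4

print(lookahead, lookbehind)
print(neg_lookahead, neg_lookbehind)
```

In every case the match itself is only the two letters "in" or "al"; the surrounding word is reported purely to make the results easier to read.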
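3.5 Combined use
Using lookahead and lookbehind together lets you constrain both sides of the matched content.
To match an "in" that is preceded by "f" and followed by "al", use the regular expression:
(?<=f)in(?=al)
In the test text, this again matches only the "in" in "finally".
To match an "al" that is not preceded by "fin" and is followed by "ly", use the regular expression:
(?<!fin)al(?=ly)
In the test text, this matches nothing: the only "al" followed by "ly" is the one in "finally", and it is preceded by "fin".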
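The two combined patterns can be tried in the same way. This sketch assumes Python's `re` module and adds one hypothetical string, "really finally", which is not part of the article's test text, so that the second pattern has something to match.

```python
import re

text = (
    "in the house, there is a little horse.\n"
    "finally, it won over a long race near the small inn.\n"
    "all above is just a makeup story."
)

# "in" preceded by "f" and followed by "al"
both_sides = [m.span() for m in re.finditer(r"(?<=f)in(?=al)", text)]

# "al" not preceded by "fin" and followed by "ly"
mixed_on_text = re.findall(r"(?<!fin)al(?=ly)", text)

# Hypothetical extra input: "really" qualifies, "finally" does not.
extra = "really finally"
mixed_on_extra = [m.start() for m in re.finditer(r"(?<!fin)al(?=ly)", extra)]

print(len(both_sides), mixed_on_text, mixed_on_extra)
```

In the extra string only the "al" at index 2, inside "really", passes both conditions.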
3.6 Practice
Suppose we have the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<note>
<item
class="important">
<type>Reminder</type>
<headline>Weekend plan</headline>
<body>Don't forget swimming this weekend!</body>
</item>
<item class="vital">
<type>Event</type>
<headline>Exam</headline>
<body>Exam on tomorrow morning!</body>
</item>
</note>
(1) Match the content of each item
To search for the content of each item, we can use the following regular expression:
(?<=<item\s{1,200}class=".{1,200}">\s{1,200})<(.|[\n\r])+?(?=\s+<\/item>)
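This matches everything between each item's opening tag and its closing </item> tag.
The regular expression above contains `{1,200}` in several places. It merely stands in for `+` (equivalent to `{1,}`, that is, one or more occurrences). Because these sub-patterns sit inside a lookbehind (to the left of the matched content), they cannot use a quantifier without an upper bound (this may depend on the specific regex implementation). Since more than 200 repetitions cannot occur here, `{1,200}` is used in place of the unbounded quantifier.
What is this good for? IDEA can select all matches at once (the button marked with a red box in the original screenshot). After that, a single copy and paste pulls out all the matched content in one go. The result looks like this: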
<type>Reminder</type>
<headline>Weekend plan</headline>
<body>Don't forget swimming this weekend!</body>
<type>Event</type>
<headline>Exam</headline>
<body>Exam on tomorrow morning!</body>
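How strict the lookbehind restriction is does depend on the engine. As an illustration, Python's `re` module (an assumption; it is not the engine used in the article) requires lookbehind to have a fixed width and therefore rejects even the bounded `{1,200}` form. The sketch below shows that, and then a workaround that consumes the opening tag and captures the body in a group instead. The names `article_pattern` and `item_body` are hypothetical.

```python
import re

xml = """\
<?xml version="1.0" encoding="UTF-8"?>
<note>
<item
class="important">
<type>Reminder</type>
<headline>Weekend plan</headline>
<body>Don't forget swimming this weekend!</body>
</item>
<item class="vital">
<type>Event</type>
<headline>Exam</headline>
<body>Exam on tomorrow morning!</body>
</item>
</note>
"""

# The article's pattern: a bounded but variable-width lookbehind.
article_pattern = (
    r'(?<=<item\s{1,200}class=".{1,200}">\s{1,200})'
    r'<(.|[\n\r])+?(?=\s+<\/item>)'
)
try:
    re.compile(article_pattern)
    accepted = True
except re.error:
    # Python's re only allows fixed-width lookbehind.
    accepted = False

# Workaround: match the opening tag normally and capture only the body.
item_body = re.compile(
    r'<item\s+class="[^"]+">\s+(<.+?)(?=\s+</item>)', re.DOTALL
)
bodies = item_body.findall(xml)

print(accepted)
for body in bodies:
    print(body)
    print("---")
```

The capture group plays the role of the lookbehind here: the opening tag is part of the overall match, but only the group is returned.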
(2) Match the class of every Event
To match the class of every Event, we can use the following regular expression:
".+"(?=>\s+<type>Event</type>)
Because this pattern uses a lookahead, a quantifier without an upper bound, such as `+`, is allowed here.
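Since only a lookahead is involved, this pattern also runs unchanged in engines with stricter lookbehind rules. A minimal sketch with Python's `re` module (an assumption) on the same XML document:

```python
import re

xml = """\
<?xml version="1.0" encoding="UTF-8"?>
<note>
<item
class="important">
<type>Reminder</type>
<headline>Weekend plan</headline>
<body>Don't forget swimming this weekend!</body>
</item>
<item class="vital">
<type>Event</type>
<headline>Exam</headline>
<body>Exam on tomorrow morning!</body>
</item>
</note>
"""

# The pattern from the article: a quoted value followed by an Event type.
pattern = r'".+"(?=>\s+<type>Event</type>)'
quoted = re.findall(pattern, xml)

# The match includes the surrounding quotes; strip them for the bare value.
classes = [q.strip('"') for q in quoted]

print(quoted, classes)
```

Without a flag that lets `.` match newlines, `.+` stays on one line, so the quoted value cannot run across lines.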