Advanced Regular Expressions

Greedy mode

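The quantifiers * and + both allow an element to be matched repeatedly. By default they match as much text as they can, which is what is known as "greedy mode".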

Example code

### Test code
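    // Imports needed by this and the following snippets (JUnit 4 is assumed):
    //   java.util.regex.Matcher, java.util.regex.Pattern, org.junit.Test
    @Test
    public void test() {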
        String str = "<xml>helloworld<xml>";
        String regex = "<.*>";

        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(str);

        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }

### Output

<xml>helloworld<xml>

Process finished with exit code 0

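Look closely at the code above. According to the matching rules, "<.*>" matches a string that starts with '<', ends with '>', and has any number of characters in between. The first <xml> and the last <xml> each satisfy that rule on their own, yet the output is the entire string. This is the greedy behavior of the quantifier: it matches as many characters as possible, so ".*" consumed the whole run "xml>helloworld<xml".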

How can we suppress this behavior so that the expression matches the first "<xml>" and the last "<xml>" separately?

Example code


    @Test
    public void test() {
        String str = "<xml>helloworld<xml>";
        String regex = "<.*?>";

        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(str);

        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }

### Output
<xml>
<xml>

Process finished with exit code 0

Changing the expression from "<.*>" to "<.*?>", that is, appending "?" to the quantifier, turns off greedy mode.
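To compare the two modes side by side, here is a minimal sketch. The class name GreedyVsReluctant and the helper findAll are hypothetical names introduced only for this illustration; the patterns and the input string are the ones used above, plus "+?" to show that the same suffix works for the + quantifier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Collects every match that find() reports, from left to right.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "<xml>helloworld<xml>";
        // Greedy: ".*" runs to the last '>' it can reach.
        System.out.println(findAll("<.*>", input));  // [<xml>helloworld<xml>]
        // Non-greedy: ".*?" stops at the first '>' that completes a match.
        System.out.println(findAll("<.*?>", input)); // [<xml>, <xml>]
        // The same "?" suffix applies to "+".
        System.out.println(findAll("<.+?>", input)); // [<xml>, <xml>]
    }
}
```

The only difference between the first two calls is the "?" after the quantifier, which is why the result changes from one long match to two short ones.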

Grouping: what "()" does in regular expressions

Letting a quantifier apply to a group of elements as a whole

    @Test
    public void test() {
        String str = "abcabcabc";
        String regex = "(abc){3}";

        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(str);

        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
    
### Output
abcabcabc

Process finished with exit code 0

As you can see, once "abc" is wrapped in "()", the quantifier "{3}" applies to "abc" as a unit repeated three times, so the expression matches our target string "abcabcabc".
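The effect of the parentheses is easiest to see by removing them. The sketch below is an illustration only: GroupQuantifierDemo and fullMatch are hypothetical names, and it uses Matcher.matches() to test whether the whole input fits the pattern.

```java
import java.util.regex.Pattern;

class GroupQuantifierDemo {
    // True when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Without parentheses, {3} applies only to the preceding 'c'.
        System.out.println(fullMatch("abc{3}", "abccc"));       // true
        System.out.println(fullMatch("abc{3}", "abcabcabc"));   // false
        // With parentheses, {3} applies to the whole group "abc".
        System.out.println(fullMatch("(abc){3}", "abcabcabc")); // true
        System.out.println(fullMatch("(abc){3}", "abccc"));     // false
    }
}
```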

Retrieving strings

Retrieving a string means extracting a target substring. For example, given the string "<xml>helloworld</xml>", retrieval can pull out "helloworld", the content of the <xml></xml> tag.

Before looking at retrieval, a quick note on Java's Matcher class. Two of its methods are used most often:

matcher.matches();
matcher.find();

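The difference between them:

matches() means "match": the target string as a whole is matched against the regular expression, and the result is true or false. It is typically used to validate input such as email addresses or phone numbers.

find() means "search": it does not require the whole string to match. It scans the target string for a substring that fits the regular expression, in other words a partial match.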

Example code

    @Test
    public void test() {
        String str = "helloworld";
        String regex = "hello";

        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(str);

        System.out.print(matcher.matches());
//        while (matcher.find()) {
//            System.out.println(matcher.group());
//        }
    }
    
### Output

false

This is because the string "helloworld" as a whole does not match "hello".

Change the code above to:

    @Test
    public void test() {
        String str = "helloworld";
        String regex = "hello";

        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(str);

//        System.out.print(matcher.matches());
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
    
### Output

hello

This is because a substring that fits our rule, "hello", was found in the target string "helloworld".
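To connect the two methods to the input-validation scenario mentioned earlier, here is a small sketch. The phone format (exactly 11 digits starting with 1) is an assumption made up for this illustration, as are the names MatchesVsFind, isValidPhone, and extractPhone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFind {
    // Hypothetical format, for illustration only: 11 digits starting with 1.
    static final Pattern PHONE = Pattern.compile("1\\d{10}");

    // Whole-input check with matches(), suitable for validating a form field.
    static boolean isValidPhone(String input) {
        return PHONE.matcher(input).matches();
    }

    // Partial check with find(): returns the first phone-like substring, or null.
    static String extractPhone(String input) {
        Matcher matcher = PHONE.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(isValidPhone("13800000000"));          // true
        System.out.println(isValidPhone("call 13800000000 now")); // false
        System.out.println(extractPhone("call 13800000000 now")); // 13800000000
        System.out.println(extractPhone("no digits here"));       // null
    }
}
```

The same compiled pattern serves both purposes; only the Matcher method decides whether the surrounding text is tolerated.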

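find() searches from left to right, so if the string contains several substrings that satisfy the expression, calling find() repeatedly returns all of them. Example code: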

    @Test
    public void test2() {
        String str = "helloworldhello";
        String regex = "hello";

        Matcher matcher = Pattern.compile(regex).matcher(str);
        // matcher.find() must be called first; only then can the match be read through group()
        int n = 0;
        while (matcher.find()) {
            System.out.println(n++);
            int count = matcher.groupCount();
            for (int i = 0; i <= count; i++) {
                System.out.println("matcher.group(" + i + "): " + matcher.group(i));
            }
        }
    }
    
### Output
0
matcher.group(0): hello
1
matcher.group(0): hello

In the code above, the first find() located the first hello, and the second find() located the hello after world.
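The left-to-right progress of find() can be made visible with Matcher.start() and Matcher.end(), which report where each match sits in the input. This is a sketch for illustration; FindPositions and positions are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindPositions {
    // Returns one "start-end" entry per match; end is exclusive, as in String.substring.
    static List<String> positions(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            result.add(matcher.start() + "-" + matcher.end());
        }
        return result;
    }

    public static void main(String[] args) {
        // The second search resumes after the end of the first match.
        System.out.println(positions("hello", "helloworldhello")); // [0-5, 10-15]
    }
}
```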

Having seen the code above, let's look at how to use "()" for retrieval.

Example code

    @Test
    public void test() {
        String str = "<xml>helloworld</xml>";
        String regex = "<xml>([a-z]*)</xml>";

        Matcher matcher = Pattern.compile(regex).matcher(str);
        // matcher.find() must be called first; only then can the match be read through group()
        if (matcher.find()) {
            int count = matcher.groupCount();
            for (int i = 0; i <= count; i++) {
                System.out.println("matcher.group(" + i + "): " + matcher.group(i));
            }
        }
    }
    
### Output

matcher.group(0): <xml>helloworld</xml>
matcher.group(1): helloworld

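In the code above, wrapping the middle part of the pattern as "([a-z]*)" makes the Matcher store that group's match in group(1), called capture group 1, so matcher.group(1) retrieves the tag content "helloworld". Remember that group(0) is always the match of the entire regular expression, that is, capture group 0. In this example, group(0) is the match of the whole expression "<xml>([a-z]*)</xml>". The number a capture group receives is determined by the position of its opening "(" in the expression, counted from left to right, so an outer group is numbered before the groups nested inside it (you can verify this yourself).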
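One way to verify the numbering rule is a pattern with nested groups. The sketch below is an illustration under assumed names (GroupNumbering, groups) and an assumed date-like input; it lists every group in index order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    // Returns group(0)..group(n) for the first match, or an empty list if none.
    static List<String> groups(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (matcher.find()) {
            for (int i = 0; i <= matcher.groupCount(); i++) {
                result.add(matcher.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Opening parentheses, left to right:
        //   1: ((\d{4})-(\d{2}))   2: (\d{4})   3: (\d{2})   4: (\d{2})
        String regex = "((\\d{4})-(\\d{2}))-(\\d{2})";
        System.out.println(groups(regex, "2017-05-20"));
        // [2017-05-20, 2017-05, 2017, 05, 20]
    }
}
```

The outer group gets number 1 because its "(" comes first, even though it closes after groups 2 and 3.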

Backreferences

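The retrieval section introduced capture groups. A later subexpression in a regular expression can refer to an earlier capture group. Example code: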

    @Test
    public void test() {
        String str = "<xml>helloworld</xml>";

        // "\\1" is really "\1" (the first backslash escapes the second in a Java string literal); it refers to capture group 1
        String regex = "<(xml)>.*</\\1>";

        Matcher matcher = Pattern.compile(regex).matcher(str);
        // matcher.find() must be called first; only then can the match be read through group()
        if (matcher.find()) {
            int count = matcher.groupCount();
            for (int i = 0; i <= count; i++) {
                System.out.println("matcher.group(" + i + "): " + matcher.group(i));
            }
        }
    }
    
    
### Output

matcher.group(0): <xml>helloworld</xml>
matcher.group(1): xml

As the code above shows, a regular expression can use the form "\i" to refer to the content of capture group i, which here is the string "xml".
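A common use of "\1" is spotting accidentally doubled words. The sketch below is an illustration, not part of the original example: DoubledWords and findDoubled are hypothetical names, and the pattern additionally relies on the word-boundary anchor \b so that it does not match inside a longer word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubledWords {
    // A lowercase word, one space, then the same text again (via \1), bounded by \b.
    static final Pattern DOUBLED = Pattern.compile("\\b([a-z]+) \\1\\b");

    // Returns every "word word" repetition found in the input.
    static List<String> findDoubled(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = DOUBLED.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findDoubled("it is is a test test")); // [is is, test test]
        System.out.println(findDoubled("this is fine"));         // []
    }
}
```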

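Notes:

1. Why is it called a backreference? Because the capture group i that "\i" refers to must be defined before it in the expression. If the expression above were changed to "<\1>.*</xml>", where "\1" has no earlier group to refer to, it would no longer match.

2. A backreference refers to the "content" a group matched, not to the group's "expression". Example code:

Change the code from the example above to: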
    @Test
    public void test() {
        String str = "<xml>helloworld</html>";
        String regex = "<([a-z]{3})>.*</\\1>";

        Matcher matcher = Pattern.compile(regex).matcher(str);
        // matcher.find() must be called first; only then can the match be read through group()
        if (matcher.find()) {
            int count = matcher.groupCount();
            System.out.println(count);
            for (int i = 0; i <= count; i++) {
                System.out.println("matcher.group(" + i + "): " + matcher.group(i));
            }
        }
    }

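### The output is empty

During matching, "([a-z]{3})" first matches "xml", so the later "\\1" stands for the text "xml" and the whole expression effectively becomes "<([a-z]{3})>.*</xml>". That clearly cannot match "<xml>helloworld</html>", so the expression produces no match.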
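The difference between reusing the content and reusing the expression can be checked directly by writing both variants. This is a minimal sketch with hypothetical names (BackrefContent, fullMatch) and made-up inputs.

```java
import java.util.regex.Pattern;

class BackrefContent {
    // True when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String withBackref = "([a-z]{3})-\\1";       // \1 repeats the captured text
        String withRepeat = "([a-z]{3})-[a-z]{3}";   // repeats the sub-pattern instead

        System.out.println(fullMatch(withBackref, "abc-abc")); // true
        System.out.println(fullMatch(withBackref, "abc-xyz")); // false
        System.out.println(fullMatch(withRepeat, "abc-xyz"));  // true
    }
}
```

Only the variant that repeats the sub-pattern accepts "abc-xyz", which confirms that "\1" stands for the text the group matched.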

Assertions

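The concept of an assertion is somewhat fuzzy. My understanding is that an assertion works like a special quantifier: it adds a matching condition to the expression it modifies. Example code: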

    @Test
    public void test() {
        String str = "hello2017";
        // "X1(?=X2)" is one form of assertion: X1 matches only if it is followed by text that matches X2.
        String regex = "hello(?=2017)";

        Matcher matcher = Pattern.compile(regex).matcher(str);
        // matcher.find() must be called first; only then can the match be read through group()
        if (matcher.find()) {
            int count = matcher.groupCount();
            for (int i = 0; i <= count; i++) {
                System.out.println("matcher.group(" + i + "): " + matcher.group(i));
            }
        }
    }
    
### Output
matcher.group(0): hello

Process finished with exit code 0

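We said earlier that group(0) is the match of the entire expression, so why is the output only hello rather than hello2017? An assertion is similar to a quantifier in that it is not itself a matched element, so the expression being matched is really just "hello", and the assertion adds a condition to that match. In this example, hello matches only when it is followed by 2017, so it can only be the hello in hello2017. If str is changed to "hello2018", the text hello still fits the pattern hello, but it is not followed by 2017, so the assertion fails and there is no match.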

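1. Assertions do not consume characters. In the example above, 2017 takes part in evaluating "hello(?=2017)", but it is not consumed and can still be matched by the rest of the expression. Example code: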

    @Test
    public void test() {
        String str = "hello2017";
        String regex = "hello(?=2017)2";

        Matcher matcher = Pattern.compile(regex).matcher(str);
        // matcher.find() must be called first; only then can the match be read through group()
        if (matcher.find()) {
            int count = matcher.groupCount();
            for (int i = 0; i <= count; i++) {
                System.out.println("matcher.group(" + i + "): " + matcher.group(i));
            }
        }
    }
    
### Output
matcher.group(0): hello2

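This result shows that although 2017 took part in matching "hello", it was not consumed, so the following subexpression "2" could still match its first character, and the output is hello2.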
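Because an assertion consumes nothing, several of them can be placed at the same position, each inspecting the same text. The sketch below illustrates this with a made-up rule (at least one digit, at least one lowercase letter, six or more characters); the rule and the names LookaheadChecks and isAcceptable are assumptions for this example only.

```java
import java.util.regex.Pattern;

class LookaheadChecks {
    // Both lookaheads start at position 0 and consume nothing,
    // so ".{6,}" still sees the whole input afterwards.
    static final Pattern RULE = Pattern.compile("(?=.*\\d)(?=.*[a-z]).{6,}");

    static boolean isAcceptable(String input) {
        return RULE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("abc123")); // true
        System.out.println(isAcceptable("abcdef")); // false: no digit
        System.out.println(isAcceptable("123456")); // false: no lowercase letter
        System.out.println(isAcceptable("ab12"));   // false: shorter than six characters
    }
}
```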

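2. An assertion is not a capture group. The section on "()" said that the content of "()" is stored in a capture group, so how does the engine tell a capture group from an assertion? The "?" makes the difference: because a construct such as "(?=X)" begins with "?" right after the opening parenthesis, the engine knows it is an assertion and does not store it as a capture group.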
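This can be checked with Matcher.groupCount(), which reports how many capture groups a pattern defines. The sketch below is an illustration; AssertionGroupCount and groupCount are hypothetical names.

```java
import java.util.regex.Pattern;

class AssertionGroupCount {
    // Number of capture groups in the pattern (group 0 is not counted).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groupCount("hello(?=2017)"));   // 0: the assertion is not a group
        System.out.println(groupCount("(hello)(?=2017)")); // 1: only (hello) is captured
        System.out.println(groupCount("(hello)(2017)"));   // 2: two ordinary groups
    }
}
```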

There are four kinds of assertions:

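Expression	Name	Description
X1(?=X2)	Zero-width positive lookahead assertion	X1 matches only if X2 appears right after it. For example, "(hello)(?=\d)" matches in "hello1" but not in "helloworld".
X1(?!X2)	Zero-width negative lookahead assertion	The opposite of the above: X1 matches only if X2 does not appear right after it. For example, "(hello)(?!\d)" does not match in "hello1" but matches in "helloworld".
(?<=X2)X1	Zero-width positive lookbehind assertion	X1 matches only if X2 appears right before it. For example, "(?<=\d)(hello)" matches in "88hello" but not in "aahello".
(?<!X2)X1	Zero-width negative lookbehind assertion	X1 matches only if X2 does not appear right before it. For example, "(?<!\d)(hello)" matches in "ahello" but not in "88hello".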
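The examples in the table can be run as a quick check. The sketch below uses hypothetical names (LookaroundTable, found) and relies on find(), since the table is about locating hello inside a longer string.

```java
import java.util.regex.Pattern;

class LookaroundTable {
    // True when the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Positive lookahead: hello must be followed by a digit.
        System.out.println(found("(hello)(?=\\d)", "hello1"));      // true
        System.out.println(found("(hello)(?=\\d)", "helloworld"));  // false
        // Negative lookahead: hello must not be followed by a digit.
        System.out.println(found("(hello)(?!\\d)", "hello1"));      // false
        System.out.println(found("(hello)(?!\\d)", "helloworld"));  // true
        // Positive lookbehind: hello must be preceded by a digit.
        System.out.println(found("(?<=\\d)(hello)", "88hello"));    // true
        System.out.println(found("(?<=\\d)(hello)", "aahello"));    // false
        // Negative lookbehind: hello must not be preceded by a digit.
        System.out.println(found("(?<!\\d)(hello)", "ahello"));     // true
        System.out.println(found("(?<!\\d)(hello)", "88hello"));    // false
    }
}
```

With matches() instead of find(), "(hello)(?=\d)" would reject "hello1", because the lookahead does not consume the trailing 1 and the whole input would therefore not be covered.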
