HTML Tag Selector Design (Manuscript PFC0200512)

Having long relied on HTML parsing tools such as xpath, scrapy, and jsoup, I decided today to try implementing their tag selector myself.

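HTML is a standardized document format. In general an HTML document is symmetrically closed (each start tag has a matching end tag), much like an XML document, but not strictly so. An HTML tag is delimited by "<" and ">". A start tag consists of the angle-bracket delimiters, the tag name, and optionally attribute names and attribute values, in the form <[tag name] [tag attribute name]="[tag attribute value]">. An end tag has the form </[tag name]>. Attributes and their values are optional. An attribute assignment is written either as [attribute]="[value]" or as [attribute]=[value]; the former is generally recommended. Multiple attribute assignments are separated by spaces.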
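To make the syntax above concrete, here is a minimal sketch that splits a single start tag into its name and attributes, accepting both the quoted and the unquoted assignment forms. The names `START_TAG`, `ATTRIBUTE`, and `parse_start_tag` are hypothetical, and the sketch assumes tag names are letters optionally followed by digits; it is not a full HTML tokenizer.

```python
import re

# Hypothetical patterns following the tag syntax described above.
# Assumption: a tag name is a letter followed by letters or digits.
START_TAG = re.compile(
    r'<([a-zA-Z][a-zA-Z0-9]*)((?:\s+[^\s=<>]+=(?:"[^"]*"|[^\s"<>]+))*)\s*>'
)
# One attribute assignment: name="value" or name=value
ATTRIBUTE = re.compile(r'([^\s=<>]+)=(?:"([^"]*)"|([^\s"<>]+))')


def parse_start_tag(text):
    """Return (tag_name, attributes) for a start tag, or None otherwise."""
    m = START_TAG.fullmatch(text.strip())
    if m is None:
        return None
    attributes = {}
    for name, quoted, bare in ATTRIBUTE.findall(m.group(2)):
        # Exactly one of quoted/bare took part in the match; the other is ''.
        attributes[name] = quoted if quoted else bare
    return m.group(1), attributes


print(parse_start_tag('<div id="root" class=item>'))
print(parse_start_tag('</div>'))
```

An end tag such as `</div>` is rejected because the pattern requires a letter immediately after "<".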

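HTML tags have standard attributes, and custom attributes can also be defined. A custom attribute can hold some data, but a standard HTML interpreter attaches no meaning to it. In general, prefer what standard HTML already provides.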

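An HTML parser is essentially an implementation of a syntax tree: put simply, it parses an HTML document according to a specific tag vocabulary and a set of grammar rules. The underlying syntax tree is usually processed in a recursion-like manner. Function recursion and quicksort are both interesting designs of this kind, distilling human ingenuity into a few short statements. With that said, let's try implementing a tag selector in a high-level language. Here I mainly use regular expressions to do so.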
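As an illustration of the tree idea only, the following minimal sketch builds a nested structure with a stack and then searches it recursively. It uses `html.parser.HTMLParser` from Python's standard library rather than the regular-expression approach taken in the rest of this post; the names `TreeBuilder`, `find_all`, and `VOID_TAGS` are hypothetical, and the list of void tags is deliberately incomplete.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of tags that never have an end tag.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuilder(HTMLParser):
    """Builds a nested dict tree; a stack tracks the currently open elements."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "attrs": {}, "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break


def find_all(node, tag):
    """Recursively collect all descendants with the given tag name."""
    found = []
    for child in node["children"]:
        if child["tag"] == tag:
            found.append(child)
        found.extend(find_all(child, tag))
    return found


builder = TreeBuilder()
builder.feed('<div id="root"><div class="a"><br></div><p>text</p></div>')
print([n["attrs"] for n in find_all(builder.root, "div")])
```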

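The most basic task of a tag selector is recognizing tags. As mentioned above, a start tag and an end tag have the forms <[tag]> and </[tag]>, where [tag] is a tag name made up of English letters. The corresponding regular expressions: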

<python>
# [tag] is taken to be one or more English letters, as described above
tag_start_pattern = r'<[a-zA-Z]+>'
tag_end_pattern = r'</[a-zA-Z]+>'
</python>
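A short usage sketch of tag recognition follows. The patterns here are a slightly extended variant, which is my assumption and not part of the original draft: they also accept digits after the first letter (as in `h1`) and let a start tag carry attributes.

```python
import re

# Assumed variants of the start/end tag patterns: a capture group returns
# the tag name, and a start tag may contain attributes before ">".
tag_start_pattern = r'<([a-zA-Z][a-zA-Z0-9]*)(?:\s[^<>]*)?>'
tag_end_pattern = r'</([a-zA-Z][a-zA-Z0-9]*)\s*>'

snippet = '<div id="root"><p>hello</p><p>world</p></div>'

start_tags = re.findall(tag_start_pattern, snippet)
end_tags = re.findall(tag_end_pattern, snippet)

print(start_tags)
print(end_tags)
```

Comparing the two lists gives a first, rough indication of whether the snippet is symmetrically closed.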

A tag attribute assignment has the form [ ][attribute]="[value]", with the corresponding regular expression:

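<python>
# a space, the attribute name (assumed here to be letters and hyphens), "=", then the double-quoted value
tag_attribute_pattern = r'[ ][a-zA-Z-]+[=]["][^"]*["]'
</python>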

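Beyond the expressions above, a practical pattern also needs refinements such as minimal tag pairing and tag symmetry (balance) checks.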
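One way to pair a start tag with its own end tag is to count nesting depth while scanning the tags of the same name. The sketch below is an illustration under simplifying assumptions, not the method used in the draft that follows: `find_element` is a hypothetical helper, and it ignores comments, scripts, and self-closing syntax.

```python
import re


def find_element(html, tag, start=0):
    """Return the first balanced <tag>...</tag> element at or after `start`,
    or None if no balanced element is found."""
    # Matches both <tag ...> and </tag>; group 1 is "/" for an end tag.
    token = re.compile(r'<(/?)' + re.escape(tag) + r'\b[^<>]*>', re.I)
    depth = 0
    begin = None
    for m in token.finditer(html, start):
        if m.group(1) == '':          # start tag
            if depth == 0:
                begin = m.start()
            depth += 1
        elif depth > 0:               # end tag of an element we are inside
            depth -= 1
            if depth == 0:
                return html[begin:m.end()]
    return None


sample = '<div class="a"><div>x</div><div>y</div></div><div>z</div>'
print(find_element(sample, 'div'))
```

Because the depth only returns to zero at the end tag that closes the outermost element, nested elements of the same name do not cut the match short, and a trailing sibling is not swallowed.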

Below is a draft of the implementation:

<python>
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

'''
author: MRN6
blog: qq_21264377@blog.csdn.net
'''

import re

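# Define the path rules: each "//"-separated segment is tag:attribute=value,
# optionally followed by ":END".
def qpath(path=None, html=None):
    if path is None or html is None:
        return []
    matches = []
    for rule in path.split("//"):
        if len(rule.strip()) < 1:
            continue
        ruledatas = rule.strip().split(':')
        if len(ruledatas) < 2 or '=' not in ruledatas[1]:
            continue  # skip segments without an attribute=value part
        attribute, value = ruledatas[1].split('=', 1)
        print('<' + ruledatas[0] + ' ' + attribute + '="' + value + '"')
        # Escape the user-supplied parts so regex metacharacters match literally.
        tag, attribute, value = re.escape(ruledatas[0]), re.escape(attribute), re.escape(value)
        opening = '<' + tag + '[^<>]*' + attribute + '="' + value
        parts = len(ruledatas)
        if parts == 2:
            # The body may not run into another start tag with the same attribute value.
            matches = re.findall('(' + opening + '[^"]*"[^<>]*>((?!' + opening + '"[^<>]*>).)*</' + tag + '>$)', html, re.M | re.S | re.I)
        elif parts == 3 and ruledatas[2] == 'END':
            # The body may not contain the end tag, so the match stops at the first </tag>.
            matches = re.findall('(' + opening + '[^"]*"[^<>]*>((?!</' + tag + '>).)*</' + tag + '>$)', html, re.M | re.S | re.I)
            # Note: see https://blog.csdn.net/iteye_13785/article/details/82638686
        # Check tag symmetry: a match should contain as many start tags as end tags.
        for match in matches:
            smatches = re.findall('<' + tag, match[0], re.M | re.S | re.I)
            ematches = re.findall('</' + tag + '>', match[0], re.M | re.S | re.I)
            slen = len(smatches)
            elen = len(ematches)
            if slen != elen:
                print(match[0] + ' greedy match: ' + str(slen) + '-' + str(elen))
    # Each rule is matched against the whole document, so the returned list
    # reflects the last rule that was applied.
    return matches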
 
 
 
html='''
<!DOCTYPE html>
<html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge, chrome=1">
<title>标题</title>
</head>
<body>
    <div id="root">
        <div class="content-item first">
            <div class="content-title">title1</div>
            <div class="content-body">content1</div>
        </div>
        <div class="content-item">
            <div class="content-title">title_2</div>
            <div class="content-body">content2</div>
        </div>
        <div>&nbsp;</div>
        <div class="content-item">
            <div class="content-title">title3_</div>
            <div class="content-body">content3</div>
        </div>
</div>
<div>&nbsp;</div>
</body>
</html>
'''
 
mypath = "//div:id=root//div:class=content-item"
mypath2 = "//div:id=root//div:class=content-item//div:class=content-title:END"

results = qpath(mypath, html)
print(len(results))
for result in results:
    print(result)

results2 = qpath(mypath2, html)
print(len(results2))
for result in results2:
    print(result)
 
</python>
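The path syntax accepted by `qpath` can also be parsed into structured rules before any matching happens, which keeps the string handling separate from the regular expressions. The following is a minimal sketch; `Rule` and `parse_path` are hypothetical names, and it assumes that attribute values contain no ":" character.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    tag: str
    attribute: str
    value: str
    end: bool  # True when the segment carries the ":END" suffix


def parse_path(path: str) -> List[Rule]:
    """Split a path such as "//div:id=root//div:class=x:END" into rules."""
    rules = []
    for segment in path.split("//"):
        segment = segment.strip()
        if not segment:
            continue
        parts = segment.split(":")
        if len(parts) < 2 or "=" not in parts[1]:
            raise ValueError("expected tag:attribute=value, got %r" % segment)
        attribute, value = parts[1].split("=", 1)
        end = len(parts) == 3 and parts[2] == "END"
        rules.append(Rule(parts[0], attribute, value, end))
    return rules


for rule in parse_path("//div:id=root//div:class=content-title:END"):
    print(rule)
```

Raising `ValueError` on a malformed segment is a design choice of this sketch; it surfaces mistakes in the path instead of silently skipping them.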

Last edited: 2020.09.12 16:13:30