II. Net_Crawler Parsing

1. Regular Expression Matching

Matching single characters and digits

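---------- Matching single characters and digits ----------
.                Matches any character except a newline
[0123456789]     [] is a character set; it matches any one of the characters listed inside the brackets
[good]           Matches any one of the characters g, o, d
[a-z]            Matches any lowercase letter
[A-Z]            Matches any uppercase letter
[0-9]            Matches any digit; equivalent to [0123456789]
[0-9a-zA-Z]      Matches any digit or letter
[0-9a-zA-Z_]     Matches any digit, letter, or underscore
[^good]          Matches any character other than g, o, d; a ^ at the start of a set is called a caret and negates the set
[^0-9]           Matches any non-digit character
\d               Matches a digit; same as [0-9] for ASCII text
\D               Matches a non-digit character; same as [^0-9] for ASCII text
\w               Matches a digit, letter, or underscore; same as [0-9a-zA-Z_] for ASCII text
\W               Matches any character that is not a digit, letter, or underscore; same as [^0-9a-zA-Z_] for ASCII text
\s               Matches any whitespace character (space, carriage return, newline, tab, form feed, vertical tab); same as [ \t\n\r\f\v] for ASCII text
\S               Matches any non-whitespace character; same as [^ \t\n\r\f\v] for ASCII text
Note: on Python 3 str patterns, \d, \w and \s also match non-ASCII digits, letters (including Chinese characters) and whitespace unless re.ASCII is set.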
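
The table above can be tried out with a short sketch using Python's built-in re module. The sample strings and variable names are made up for illustration only.

```python
import re

# findall returns every non-overlapping match as a list of strings
digits = re.findall(r"\d", "Order A-17, 2024")

# [^0-9a-zA-Z_] matches any single character that is NOT a digit, letter or underscore
non_word = re.findall(r"[^0-9a-zA-Z_]", "a_b-c d!")

# [good] is a set: each match is ONE character out of g, o, d
set_hits = re.findall(r"[good]", "dog food")

# \s matches whitespace such as space, tab and newline
spaces = re.findall(r"\s", "a b\tc\n")

print(digits, non_word, set_hits, spaces)
```

Note that [good] does not match the word "good"; it matches one character at a time, which is why the result is a list of single letters.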

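Matching boundaries

-------------- Anchors (boundary characters) --------------

^     Matches at the start of the string (and at the start of every line under re.M); unrelated to ^ inside []   compare str.startswith
$     Matches at the end of the string or just before a trailing newline (and at the end of every line under re.M)   compare str.endswith

\A    Matches only at the start of the whole string; unlike ^, it never matches at the start of other lines, even under re.M
\Z    Matches only at the end of the whole string; unlike $, it never matches at the end of other lines, even under re.M

\b    Matches a word boundary, i.e. the position between a \w character and a \W character (or the string start/end next to a \w character)
\B    Matches a position that is not a word boundary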
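
A small sketch of how the anchors behave with and without re.M, again using only the standard re module. The two-line sample string is an assumption chosen to make the difference visible.

```python
import re

text = "first line\nsecond line"

# Without re.M, ^ matches only at the very start of the string
start_default = re.findall(r"^\w+", text)
# With re.M, ^ also matches right after every newline
start_multiline = re.findall(r"^\w+", text, re.M)
# \A ignores re.M and still matches only at the start of the whole string
start_absolute = re.findall(r"\A\w+", text, re.M)

# $ under re.M matches at the end of every line, \Z only at the very end
end_multiline = re.findall(r"\w+$", text, re.M)
end_absolute = re.findall(r"\w+\Z", text, re.M)

# \b requires a word boundary on that side, \B requires that there is none
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
inside_word = re.findall(r"\Bcat", "cat concat")

print(start_default, start_multiline, start_absolute)
print(end_multiline, end_absolute, whole_words, inside_word)
```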

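Grouping

# Grouping
# |    : alternation (or)
# ()   : groups a sub-pattern so that it is treated as a unit
# search: scans the string from left to right and stops at the first match it finds
# regex1|regex2: matches if either regex1 or regex2 matches; the alternatives are tried from left to right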
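
As an illustration of the notes above, the following sketch uses re.search with groups and alternation. The phone-number string and the variable names are hypothetical.

```python
import re

text = "tel: 010-12345678, fax: 021-87654321"

# search scans left to right and stops at the first match,
# so the fax number is never reached
m = re.search(r"(\d{3})-(\d{8})", text)
whole = m.group(0)    # the entire match
area = m.group(1)     # first parenthesised group
number = m.group(2)   # second parenthesised group

# Parentheses limit the scope of |: only "a" or "e" alternate here
m2 = re.search(r"gr(a|e)y", "a grey cat")
colour = m2.group(0)
letter = m2.group(1)

print(whole, area, number, colour, letter)
```

Without the parentheses, gra|ey would mean "gra" or "ey", which is rarely what is intended.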

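Flags

re.I: ignore case (re.IGNORECASE)
re.M: multi-line mode (re.MULTILINE); ^ and $ also match at the start and end of every line
re.S: single-line mode (re.DOTALL); . also matches newline characters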
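
A minimal sketch of the three flags, assuming a two-line sample string. Flags can be combined with the | operator.

```python
import re

text = "Hello\nWORLD"

# re.I: case-insensitive matching
ignore_case = re.findall(r"world", text, re.I)

# re.M: ^ and $ apply to every line, not only to the whole string
per_line = re.findall(r"^\w+$", text, re.M)

# By default . does not match a newline, so this search fails
no_dotall = re.search(r"Hello.WORLD", text)
# re.S lets . match the newline as well
with_dotall = re.search(r"Hello.WORLD", text, re.S)

# Combining flags
combined = re.findall(r"^world$", text, re.I | re.M)

print(ignore_case, per_line, no_dotall, with_dotall is not None, combined)
```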

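2. XPath Parsing

XPath has seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes. An XML document is treated as a tree of nodes, and the root of that tree is called the document node or root node.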

test.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>测试页面</title>
</head>
<body>
    <ol>
        <li class="haha pp">醉卧沙场君莫笑，古来征战几人回</li>
        <li class="heihei">两岸猿声啼不住，轻舟已过万重山</li>
        <li id="hehe" class="nene">一骑红尘妃子笑，无人知是荔枝来</li>
        <li class="xixi">停车坐爱枫林晚，霜叶红于二月花</li>
        <li class="lala">商女不知亡国恨，隔江犹唱后庭花</li>
    </ol>
    <div id="pp">
        <div>
            <a href="http://www.baidu.com">李白</a>
        </div>
        <ol>
            <li class="huanghe">君不见黄河之水天上来，奔流到海不复回</li>
            <li id="tata" class="hehe">李白乘舟将欲行，忽闻岸上踏歌声</li>
            <li class="tanshui">桃花潭水深千尺，不及汪伦送我情</li>
        </ol>
        <div class="hh">
            <a href="http://mi.com">雷军</a>
        </div>
        <ol>
            <li class="dudu">are you ok</li>
            <li class="meme">会飞的猪</li>
        </ol>
    </div>
</body>
</html>

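Usage

from lxml import etree
# Load the whole HTML string with etree and build a node tree.
# r is assumed to be an HTTP response object (e.g. from requests.get); r.text is its body as a str
html_tree = etree.HTML(r.text)

# 1. Select target nodes by walking the tree structure
res = html_tree.xpath('/html/body/ol/li[3]')
res = html_tree.xpath('/html/body/div/div[2]/a')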

# 2. Get a node's text and attributes
res = html_tree.xpath('/html/body/div/ol[1]/li[1]/text()')  # ['君不见黄河之水天上来，奔流到海不复回']
# In XPath, attributes are addressed with the @ prefix
res = html_tree.xpath('/html/body/div/div[2]/a/@href')  # ['http://mi.com']

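# 3. Locating nodes
# (1) By level: '/' steps down exactly one level, '//' steps down any number of levels
res = html_tree.xpath('//li/text()')

# text() returns only the text directly inside the current node
# string() returns the text of the current node together with the text of all its descendants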

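# (2) By attribute
# All li elements that have an id attribute
res = html_tree.xpath('//li[@id]')
# All li elements whose class is exactly "hehe"
res = html_tree.xpath('//li[@class="hehe"]')
# = compares the whole attribute string, so an attribute that holds several values must be written out in full
res = html_tree.xpath('//li[@class="haha pp"]')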

# 4. Fuzzy matching
# All li elements whose class starts with "h"
res = html_tree.xpath('//li[starts-with(@class,"h")]')
# All li elements whose class contains "a"
res = html_tree.xpath('//li[contains(@class,"a")]')

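# 5. Logical operators
# All li elements whose class is "hehe" and whose id is "tata"
res = html_tree.xpath('//li[@class="hehe" and @id="tata"]')
# All li elements whose class is "hehe" or contains "a"
res = html_tree.xpath('//li[@class="hehe" or contains(@class,"a")]')

obj = html_tree.xpath("//div[@id='pp']")[0]
# Continue searching inside obj
res = obj.xpath('//li/text()')   # an expression starting with '//' always searches from the document root, whatever node it is called on
res = obj.xpath('.//li/text()')  # an expression starting with './/' searches from the current node (obj)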
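
The difference between text() and string(), and between '//' and './/', is easier to see on a small self-contained document. The following is a sketch only: the HTML snippet and variable names are made up for illustration, and it assumes lxml is installed.

```python
from lxml import etree

html_text = """
<html><body>
  <ol><li class="xixi">outer item</li></ol>
  <div id="pp">
    <p>Hello <b>bold</b> world</p>
    <ol><li class="hehe" id="tata">inner item</li></ol>
  </div>
</body></html>
"""
tree = etree.HTML(html_text)

# text() selects only the text nodes that are direct children of <p>
direct_text = tree.xpath('//p/text()')
# string() concatenates the text of <p> and of all its descendants
full_text = tree.xpath('string(//p)')

div = tree.xpath('//div[@id="pp"]')[0]
# '//' restarts from the document root even when called on a sub-element
from_root = div.xpath('//li/text()')
# './/' searches only below the current element
from_div = div.xpath('.//li/text()')

print(direct_text, full_text, from_root, from_div)
```

Forgetting the leading dot is a common source of duplicated results when looping over sub-elements.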

3. BS4 Parsing

Sample HTML

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>soup测试</title>
</head>
<body>
    <div class="tang">
        <ul>
            <li><a href="http://www.baidu.com" title="出塞"><!--秦时明月汉时关，万里长征人未还，但使龙城飞将在，-->不教胡马度阴山</a></li>
            <li><a href="http://www.163.com" class="taohua">人面不知何处去，桃花依旧笑春风</a></li>
            <li><a href="http://mi.com" id="hong">去年今日此门中，人面桃花相映红</a></li>
            <li><a href="http://qq.com" name="he">故人西辞黄鹤楼，烟花三月下扬州</a></li>
        </ul>
    </div>
    <div id="meng">
        <p class="jiang">
            <span>三国猛将</span>
            <ol>
                <li>关羽</li>
                <li>张飞</li>
                <li>赵云</li>
                <li>马超</li>
                <li>黄忠</li>
            </ol>
            <div class="cao">
                <ul>
                    <li>典韦</li>
                    <li>许褚</li>
                    <li>张辽</li>
                    <li>张郃</li>
                    <li>于禁</li>
                    <li>夏侯惇</li>
                </ul>
            </div>
        </p>
    </div>
</body>
</html>

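Parsing

from bs4 import BeautifulSoup

# 1) Build a BeautifulSoup object from the HTML (a string or an open file handle)
with open("./soup_test.html", encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')

# Argument 1: the HTML markup; argument 2: the parser. bs4 does not parse HTML itself but delegates to a parser, such as Python's built-in 'html.parser' or the third-party 'lxml', which is generally faster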

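# 1. Look up a tag by name; this returns only the first tag with that name
# print(soup.title)  # <title>soup测试</title>
# print(soup.li)

# 2. Get a tag's content
obj = soup.a
# print(obj.string)  # returns the tag's only child (which may be a comment); returns None when the tag has more than one child, as this <a> does (a comment plus text)
# print(obj.get_text())  # returns all text inside the tag, including text in descendant tags, but not comments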

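# 3. Get attributes
# print(obj.get('title'))  # read an attribute with the get method
# print(obj['href'])   # read an attribute with dictionary-style indexing
# print(obj.attrs)  # all attributes of the tag, as a dict
# print(obj.name)  # the tag's name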

# 4. Child nodes
# print(soup.body.children)  # an iterator, e.g. <list_iterator object at 0x000002098C3E5EF0>

# Iterate over the direct children
# for child in soup.body.children:
#     print('------------------')
#     print(child)

print(soup.body.descendants)  # a generator object
# Iterate over all descendants of the current node
# for i in soup.body.descendants:
#     print('------------------')
#     print(i)

# 5. Search functions
# 1) find returns a single object (or None)
# print(soup.find('a'))  # the first a tag
# print(soup.find(id='hong'))  # the first tag whose id is "hong"

# 2) find_all returns a list
# print(soup.find_all('a'))
# print(soup.find_all(['a','span','li']))
# print(soup.find_all(['a','span','li'],limit = 3))
# print(soup.find_all(['a','span','li'],class_='taohua'))  # class is a Python keyword, so the filter is spelled class_

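# 3) select finds nodes by CSS selector and returns a list
# print(soup.select('.taohua'))
# print(soup.select('.tang ul li'))  # descendant selector
# print(soup.select('li#hong'))  # compound selector: li tags whose id is hong (empty for the sample above, where id="hong" is on an <a>)
# print(soup.select('[name="he"]'))  # attribute selector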
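
The behaviour of .string versus get_text(), and the class_ keyword, can be checked with a short self-contained sketch. The markup below is a reduced, made-up variant of the sample, and the built-in 'html.parser' is used here only so the example does not depend on lxml; that choice is an assumption, not the setup used above.

```python
from bs4 import BeautifulSoup

html_text = (
    '<div class="tang"><ul>'
    '<li><a href="http://www.baidu.com" title="t1"><!--hidden-->visible</a></li>'
    '<li><a href="http://www.163.com" class="taohua">second</a></li>'
    '</ul></div>'
)
soup = BeautifulSoup(html_text, 'html.parser')

first = soup.a
# The first <a> has two children (a comment and a text node), so .string is None
string_value = first.string
# get_text() joins the text nodes and skips the comment
text_value = first.get_text()

# class is a reserved word in Python, so bs4 expects class_
by_class = soup.find_all('a', class_='taohua')
single_child = by_class[0].string   # only one child, so .string returns it

# CSS descendant selector; attributes are read with dict-style indexing
hrefs = [a['href'] for a in soup.select('.tang ul li a')]

print(string_value, text_value, single_child, hrefs)
```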

4. jsonpath Parsing

Sample JSON

{ "store": {
    "book": [ 
      { "category": "reference",
        "author": "李白",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      { "category": "fiction",
        "author": "杜甫",
        "title": "Sword of Honour",
        "price": 12.99
      },
      { "category": "fiction",
        "author": "白居易",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      { "category": "fiction",
        "author": "苏轼",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 19.95
    }
  }
}

Parsing

import json
import jsonpath


# Load the sample JSON file into Python dicts and lists
with open('./book.json', 'r', encoding='utf-8') as f:
    books = json.load(f)

# print(books['store'])
# print(books['store']['book'])
# print(books['store']['book'][1]['price'])
# Find the price of every book with plain indexing

b = books['store']['book']
# for i in b:
#     print(i['price'])


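# Searching with jsonpath
# (compare the XPath form /html/body/div)

# In jsonpath, $ is the root node, "." selects a child of the current node, and ".." selects descendants at any depth
res = jsonpath.jsonpath(books,"$.store.book[*]")  # every book
res = jsonpath.jsonpath(books,"$..author")        # every author, at any depth
res = jsonpath.jsonpath(books,"$..book[:3]")      # the first three books
print(res)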
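
To make the ".." (descendants at any depth) operator less of a black box, here is a rough pure-Python sketch of the idea. The helper find_key is hypothetical and is not part of the jsonpath package; it only approximates what an expression such as "$..author" selects, using a shortened version of the sample data.

```python
def find_key(node, key):
    """Collect every value stored under `key` at any depth of nested dicts and lists."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # keep descending: the key may appear again further down
            found.extend(find_key(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found


data = {
    "store": {
        "book": [
            {"author": "李白", "price": 8.95},
            {"author": "杜甫", "price": 12.99},
        ],
        "bicycle": {"color": "red", "price": 19.95},
    }
}

authors = find_key(data, "author")
prices = find_key(data, "price")
print(authors, prices)
```

The prices list includes the bicycle's price as well, which mirrors how "$..price" differs from "$.store.book[*].price".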
