Using BeautifulSoup

These notes summarize the official Beautiful Soup 4.4.0 documentation.

Objects in BeautifulSoup

Beautiful Soup converts a complex HTML document into a tree of Python objects. Every node in the tree is one of four types: Tag, NavigableString, BeautifulSoup, and Comment.

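Tag

from bs4 import BeautifulSoup

# tag is a Tag object; its name is the HTML tag name, here 'p'
soup = BeautifulSoup('<p id="test_id" class="test-class">hello world!</p>', 'lxml')
tag = soup.p
# The attributes of the HTML <p> element become the attributes of the tag
tag['class']  # => ['test-class']  (class can hold several values, so it is a list)
tag['id']     # => 'test_id'  (id holds a single value, so it is a string)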
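
The attribute access described above can be sketched as follows. This is a minimal illustration, not the official example: it swaps in the built-in 'html.parser' instead of lxml so that no extra parser is needed, and the second class value "extra" is an assumption added to show the list behavior.

```python
from bs4 import BeautifulSoup

# 'html.parser' is used here instead of 'lxml' (an assumption for portability)
soup = BeautifulSoup('<p id="test_id" class="test-class extra">hello world!</p>', 'html.parser')
tag = soup.p

name = tag.name             # the HTML tag name as a string
attrs = tag.attrs           # all attributes as a dict
classes = tag['class']      # multi-valued attribute, returned as a list
tag_id = tag['id']          # single-valued attribute, returned as a string
missing = tag.get('title')  # .get() returns None for a missing attribute

print(name, classes, tag_id, missing)
```

Using tag.get() rather than tag['title'] avoids a KeyError when an attribute may be absent.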

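NavigableString (the text inside a tag)

The <p> element above becomes a Tag object; the text inside it becomes a NavigableString object.

# text is a NavigableString object
text = tag.string
# A string cannot be edited in place, but it can be replaced with another string
text.replace_with('yes')
print(tag)  # => <p class="test-class" id="test_id">yes</p>  (attribute order may differ)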
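
A small sketch of the replacement step, assuming the built-in 'html.parser' and a simplified <p> element without attributes (both are choices made for this illustration only):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<p>hello world!</p>', 'html.parser')
text = soup.p.string

is_navigable = isinstance(text, NavigableString)  # the text node type
plain = str(text)                                 # convert to an ordinary Python str

# The string cannot be edited in place, so swap it for a new one
text.replace_with('yes')
result = str(soup.p)
print(result)
```

Converting to str is useful when the text should outlive the parsed tree, since a NavigableString keeps a reference to the tree it came from.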

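BeautifulSoup (the document object)

The BeautifulSoup object represents the whole document and behaves much like a Tag. It does not correspond to a real HTML tag, however, so it has no real tag name and no attributes; its .name is the special value '[document]'.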
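
As a quick check of how the document object compares with an ordinary tag, the sketch below reads .name on both. The markup and the use of 'html.parser' are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="test_id">hello world!</p>', 'html.parser')

doc_name = soup.name      # special name given to the document object
tag_name = soup.p.name    # name of a real tag
first_p = soup.find('p')  # searching works on the document object as on a tag

print(doc_name, tag_name)
```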

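Comment

Besides tags, an HTML page can also contain comments; these are represented by Comment objects.

# A Comment is also reached through a Tag; its value is the text inside the comment
soup = BeautifulSoup('<p><!-- comment text --></p>', 'lxml')
comment = soup.p.string  # => ' comment text '  (a Comment, which is a special type of NavigableString)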
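
Because a Comment looks like an ordinary string, it can help to test for it explicitly. The following sketch assumes the built-in 'html.parser' and an English sample comment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup('<p><!-- a comment --></p>', 'html.parser')
comment = soup.p.string

is_comment = isinstance(comment, Comment)         # True for comment nodes
is_string = isinstance(comment, NavigableString)  # Comment subclasses NavigableString
text = comment.strip()                            # comment text without the padding spaces

print(is_comment, repr(text))
```

An isinstance check like this is a common way to skip comments when collecting the visible text of a page.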

Traversing the document tree

Sample HTML

html_doc = """
    <html>
        <head>
            <title>The Dormouse story</title>
        </head>
        <body>
            <p class="title">
                <b>The Dormouse story</b>
            </p>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """

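Child nodes

soup = BeautifulSoup(html_doc, 'lxml')
# Get the first <a> tag
soup.a
# Get all <a> tags
soup.find_all('a')
# .contents returns a tag's direct children as a list
# (the exact whitespace strings depend on the parser and on the indentation of html_doc)
soup.head.contents  # => ['\n', <title>The Dormouse story</title>, '\n']
soup.head.contents[1].contents  # => ['The Dormouse story']
# .children iterates over a tag's direct children
for child in soup.head.children:
    print(child)  # => <title>The Dormouse story</title> (plus any whitespace strings)
# .descendants iterates recursively over all of a tag's descendants
for child in soup.head.descendants:
    print(child)  # => <title>The Dormouse story</title>, then The Dormouse story
# When a tag contains several strings, use .strings; .stripped_strings also strips surrounding whitespace
for string in soup.stripped_strings:
    print(repr(string))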
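
The difference between direct children and all descendants is easier to see on markup without whitespace. This sketch assumes a one-line <head> fragment and the built-in 'html.parser', so no whitespace strings appear in the results:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<head><title>The Dormouse story</title></head>', 'html.parser')
head = soup.head

contents = head.contents                      # direct children as a list
child_names = [c.name for c in head.children] # same nodes, via an iterator
descendants = list(head.descendants)          # the <title> tag, then its text
strings = list(soup.stripped_strings)         # all text pieces, whitespace stripped

print(len(contents), child_names, len(descendants), strings)
```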

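Parent nodes

# .parent returns an element's parent node
soup.title.parent
# .parents iterates over all of an element's ancestors, up to the document itself
soup.a.parents
# .next_sibling is the next sibling node; .previous_sibling is the previous one
# .next_siblings and .previous_siblings iterate over all following or preceding siblings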
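
The navigation attributes listed above can be tried on a compact fragment. The markup below is a simplified, whitespace-free variant of the sample document (an assumption for this sketch, so that siblings are tags rather than text), parsed with the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

html = ('<html><body><p>'
        '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
        '</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')
first = soup.a

parent_name = first.parent.name                       # direct parent
ancestors = [p.name for p in first.parents]           # every ancestor, innermost first
next_id = first.next_sibling['id']                    # the sibling right after
following = [s['id'] for s in first.next_siblings]    # all later siblings
previous = first.previous_sibling                     # None: nothing comes before

print(parent_name, ancestors, next_id, following, previous)
```

In real documents the sibling of a tag is often a whitespace string, so checking the node type before indexing into it is usually necessary.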

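Searching the document tree

Filter arguments

# String: find all <b> tags
soup.find_all('b')
# Regular expression: find all tags whose name starts with "b"
import re
soup.find_all(re.compile('^b'))
# List: find all <a> and <b> tags
soup.find_all(['a', 'b'])
# True: find every tag (but not text strings)
soup.find_all(True)
# Function: pass a function that takes a tag and returns True or False;
# find_all() keeps the tags for which it returns True
# Find all elements that have a class attribute but no id attribute
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)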
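
To compare the filter types side by side, the sketch below runs each one against the same small fragment. The fragment and the use of 'html.parser' are assumptions made for this illustration.

```python
import re
from bs4 import BeautifulSoup

html = ('<body><b class="x">bold</b>'
        '<a id="link1" class="sister">Elsie</a>'
        '<p class="story">text</p></body>')
soup = BeautifulSoup(html, 'html.parser')

def has_class_but_no_id(tag):
    # Keep tags that define class but not id
    return tag.has_attr('class') and not tag.has_attr('id')

by_string = [t.name for t in soup.find_all('b')]
by_regex = [t.name for t in soup.find_all(re.compile('^b'))]  # names starting with "b"
by_list = [t.name for t in soup.find_all(['a', 'b'])]
by_true = [t.name for t in soup.find_all(True)]               # every tag
by_func = [t.name for t in soup.find_all(has_class_but_no_id)]

print(by_string, by_regex, by_list, by_true, by_func)
```

Note that the regular expression must be a compiled pattern; a plain string such as "^b" would be treated as a literal tag name.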

find_all()

find_all(name, attrs, recursive, string, limit, **kwargs)

# name: search by tag name
soup.find_all('title')
# attrs: search by attribute. class is a reserved word in Python, so use class_ or an attrs dict
soup.find_all('a', class_='sister')
soup.find_all('a', attrs={'class': 'sister'})
# recursive: recursive=False searches only direct children
soup.head.find_all('title', recursive=False)
# string: match against the string content of the document
soup.find_all('a', string='Elsie')
# kwargs: any other keyword argument is treated as a filter on the attribute of that name
soup.find_all(id='link2')
soup.find_all(href=re.compile('elsie'))
soup.find_all(id=True)
# limit: stop searching after this many results
soup.find_all('a', limit=2)
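
A runnable sketch of these arguments, assuming a trimmed version of the sample paragraph (links only, no surrounding text) and the built-in 'html.parser':

```python
import re
from bs4 import BeautifulSoup

html = ('<p class="story">'
        '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
        '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
        '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
        '</p>')
soup = BeautifulSoup(html, 'html.parser')

by_class = [a['id'] for a in soup.find_all('a', class_='sister')]
by_attrs = [a['id'] for a in soup.find_all('a', attrs={'class': 'sister'})]
by_string = [a['id'] for a in soup.find_all('a', string='Elsie')]
by_href = [a['id'] for a in soup.find_all(href=re.compile('lacie'))]
limited = [a['id'] for a in soup.find_all('a', limit=2)]
direct = soup.find_all('a', recursive=False)  # the links are nested inside <p>

print(by_class, by_string, by_href, limited, len(direct))
```

The last call returns nothing because recursive=False looks only at the direct children of the object it is called on, and here that is just the <p> tag.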

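find_all() is one of the most commonly used methods, so there is a shortcut: calling a BeautifulSoup or Tag object like a function is the same as calling find_all() on it.

# These two calls are equivalent
soup.find_all('a')
soup('a')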

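find()

find(name, attrs, recursive, string, **kwargs)

find() returns only the first match, or None if nothing matches.

# These two are nearly equivalent: find_all() returns a list with one element,
# while find() returns the element itself
soup.find_all('a', limit=1)
soup.find('a')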
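
The return types of the two calls differ, which matters most when nothing matches. A minimal sketch, assuming a two-link fragment and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="link1">Elsie</a><a id="link2">Lacie</a>', 'html.parser')

as_list = soup.find_all('a', limit=1)  # a list holding one Tag
single = soup.find('a')                # the Tag itself

no_match = soup.find('table')          # None when nothing is found
no_matches = soup.find_all('table')    # an empty list when nothing is found

print(len(as_list), single['id'], no_match, len(no_matches))
```

Code that chains calls such as soup.find('table').find('tr') should therefore check for None first.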

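Other search methods

find_parents() and find_parent() search upward through an element's ancestors

find_next_siblings() and find_next_sibling() search forward through the siblings that follow an element

find_previous_siblings() and find_previous_sibling() search backward through the siblings that precede an element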
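
These directional searches take the same filters as find_all(). The sketch below starts from the middle link of a trimmed sample paragraph; the fragment and the use of 'html.parser' are assumptions for this illustration.

```python
from bs4 import BeautifulSoup

html = ('<p class="story">'
        '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
        '</p>')
soup = BeautifulSoup(html, 'html.parser')
middle = soup.find(id='link2')

parent_class = middle.find_parent('p')['class']            # nearest <p> ancestor
next_id = middle.find_next_sibling('a')['id']              # first <a> after it
previous_id = middle.find_previous_sibling('a')['id']      # first <a> before it
later_ids = [a['id'] for a in soup.find(id='link1').find_next_siblings('a')]

print(parent_class, next_id, previous_id, later_ids)
```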

CSS selectors

# .select() finds tags using CSS selector syntax and returns a list; .select_one() returns only the first match
soup.select('title')
soup.select('html head title')
soup.select('html>head>title')
soup.select('.sister')
soup.select('#link1')
soup.select_one('.sister')
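
A short sketch of the selector calls with their results made explicit. It assumes a reduced version of the sample document and the built-in 'html.parser'; in recent Beautiful Soup releases the selector support comes from the soupsieve package that is installed alongside bs4.

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>The Dormouse story</title></head><body>'
        '<a class="sister" id="link1">Elsie</a>'
        '<a class="sister" id="link2">Lacie</a>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

title_text = soup.select('html > head > title')[0].get_text()  # child combinator
sister_ids = [a['id'] for a in soup.select('.sister')]         # by class
first_text = soup.select_one('#link1').get_text()              # by id, first match only
missing = soup.select_one('.nothing')                          # None when nothing matches

print(title_text, sister_ids, first_text, missing)
```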

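Output

Pretty-printing

# prettify() returns the document as a formatted string, with each tag and string on its own line
print(soup.prettify())

Non-pretty output

# str() returns the document as a string without extra formatting
# (in Python 3 this is a Unicode str; unicode() exists only in Python 2, where str() returned a UTF-8 encoded string)
str(soup)
# encode() returns bytes, UTF-8 by default
soup.encode('utf-8')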

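Text output

# get_text() returns all the text in a tag, including its descendants, as a single Unicode string
soup.get_text()
# Specify a separator to place between the pieces of text
soup.get_text('|')
# Strip whitespace from the beginning and end of each piece of text
soup.get_text('|', strip=True)
# Get the pieces of text as a list (.stripped_strings itself is a generator)
list(soup.stripped_strings)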
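
The effect of the separator and strip arguments is easiest to see on a tiny input. This sketch assumes a one-line fragment and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')

plain = soup.get_text()                     # pieces joined as they are
separated = soup.get_text('|')              # separator between the pieces
stripped = soup.get_text('|', strip=True)   # each piece stripped before joining
pieces = list(soup.stripped_strings)        # the stripped pieces as a list

print(repr(plain), repr(separated), repr(stripped), pieces)
```

Without strip=True the trailing space of "Hello " survives next to the separator, which is why the two separated results differ.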

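Encoding

Every HTML or XML document is written in some encoding, such as ASCII or UTF-8, but once Beautiful Soup has parsed it, the document is held as Unicode.

# markup below stands for the raw bytes of a document
# .original_encoding shows the encoding Beautiful Soup detected for the input
# (None if the input was already a Unicode string)
soup.original_encoding
# from_encoding specifies the encoding of the input bytes
soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
# exclude_encodings rules out encodings that should not be tried
soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])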
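
The sketch below shows from_encoding on input bytes. The sample bytes, the choice of Latin-1, and the use of 'html.parser' are assumptions made for this illustration; the exact spelling reported by .original_encoding can vary, so it is only printed, not relied on.

```python
from bs4 import BeautifulSoup

# Raw bytes in Latin-1: the byte 0xE9 is "e with acute accent"
markup = '<p>caf\u00e9</p>'.encode('latin-1')

soup = BeautifulSoup(markup, 'html.parser', from_encoding='latin-1')
decoded = soup.p.string            # already Unicode after parsing
detected = soup.original_encoding  # the encoding used to decode the bytes

# A str input needs no decoding, so no original encoding is recorded
from_text = BeautifulSoup('<p>caf\u00e9</p>', 'html.parser')
text_encoding = from_text.original_encoding

print(decoded, detected, text_encoding)
```

Passing from_encoding only makes sense when the input is bytes; for str input Beautiful Soup has nothing to decode.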

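Output encoding

When Beautiful Soup writes a document out as bytes, the output is UTF-8 by default, whatever the encoding of the input document.

# Pass an encoding to prettify() to change the output encoding; the result is bytes
print(soup.prettify("latin-1"))
# Call encode() to choose the encoding of a single tag
soup.p.encode("utf-8")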
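
To see what the output encoding changes, the sketch below encodes the same tag three ways. The sample text and the use of 'html.parser' are assumptions for this illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>caf\u00e9</p>', 'html.parser')

as_text = str(soup.p)               # Unicode str in Python 3
utf8 = soup.p.encode('utf-8')       # bytes, two bytes for the accented letter
latin1 = soup.p.encode('latin-1')   # bytes, one byte for the accented letter
ascii_bytes = soup.p.encode('ascii')  # unrepresentable characters become numeric entities

print(as_text, utf8, latin1, ascii_bytes)
```

Characters that the target encoding cannot represent are written as numeric character references rather than raising an error.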