How to Use BeautifulSoup

What Is BeautifulSoup

BeautifulSoup is a library for parsing, traversing, and maintaining a "tag tree".

Installation

pip3 install beautifulsoup4

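Note:

PyPI also hosts a package named BeautifulSoup, but it is probably not the one you want: it is the Beautiful Soup 3 release. Because many projects still use BS3, the BeautifulSoup package remains available. For a new project, however, you should install beautifulsoup4.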

Hello World

First, import the package:

from bs4 import BeautifulSoup

BeautifulSoup can take an open file object and parse its contents:

with open('index.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

It can also parse an HTML string directly:

soup = BeautifulSoup('<html>data</html>', 'html.parser')

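The html.parser argument above selects the parser used to process the markup. html.parser ships with Python; the others depend on separately installed libraries. BeautifulSoup currently supports the following parsers: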

html.parser
lxml
xml
html5lib
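Since lxml and html5lib are optional installs, a script can fall back to whichever parser is available. The following is a minimal sketch, not an official recipe: make_soup and its preference order are hypothetical names chosen for illustration, while FeatureNotFound is the exception bs4 raises when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # The library behind this parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>unclosed paragraph")
print(used)           # name of the parser that was actually used
print(soup.p.string)  # text of the <p>, even though it was never closed
```

Different parsers can build slightly different trees from invalid markup, so it is usually safer to pin one parser once a project depends on the exact structure.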

Let's walk through an example: fetch the HTML of the page at https://python123.io/ws/demo.html, then parse it with BeautifulSoup.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://python123.io/ws/demo.html')
html_text = r.text
soup = BeautifulSoup(html_text, 'html.parser')

The HTML content obtained is as follows:

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

Use prettify() to print the HTML in an indented, formatted form:

print(soup.prettify())

Output:

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

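Basic Elements of the BeautifulSoup Class

Element	Type	Description
Tag	bs4.element.Tag	A tag, the most basic unit of information; its start and end are marked by <> and </>
Name	str	The tag's name; the name of <p>...</p> is 'p'. Access: <tag>.name
Attributes	dict	The tag's attributes, organized as a dictionary. Access: <tag>.attrs
NavigableString	bs4.element.NavigableString	The non-attribute string inside a tag, i.e. the text between <> and </>. Access: <tag>.string
Comment	bs4.element.Comment	A comment inside a tag; a special kind of NavigableString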

[Figure: diagram illustrating the basic elements listed above]

Next, we use the HTML fetched from https://python123.io/ws/demo.html to explain each of BeautifulSoup's basic elements.

Tag

A tag is the most basic unit of information; its start and end are marked by <> and </>.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'html.parser')
print(type(soup.title))
print(soup.title)
print(soup.title.title)
print(soup.a)

Output:

<class 'bs4.element.Tag'>
<title>This is a python demo page</title>
None
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Any tag that exists in the HTML document can be accessed as soup.<tag>
You can also chain the form soup.<tag1>.<tag2> to get the <tag2> tag under <tag1>
When the document contains several <tag> elements, soup.<tag> returns the first one
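The three rules above can be checked with a short sketch. The HTML string here is a shortened stand-in for the demo page, made up for illustration; it also shows that accessing a tag that does not exist returns None rather than raising an error.

```python
from bs4 import BeautifulSoup

# Shortened, made-up stand-in for the demo page.
html_text = """<html><head><title>Demo</title></head>
<body><p class="title"><b>First</b></p><p class="course">Second</p></body></html>"""
soup = BeautifulSoup(html_text, "html.parser")

first_p = soup.p        # the first <p> in document order
bold = soup.body.p.b    # <b> inside the first <p> inside <body>
missing = soup.table    # there is no <table>, so this is None

print(first_p["class"])  # ['title']
print(bold.string)       # First
print(missing)           # None
```

Because a missing tag yields None, a chain such as soup.table.tr would fail with AttributeError, so it is worth checking intermediate results when the structure is not guaranteed.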

Tag name

The tag's name; the name of <p>…</p> is 'p'. Access it as <tag>.name.

Inspect the name of the <a> tag:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'html.parser')
print(type(soup.a.name))
print(soup.a.name)

Output:

<class 'str'>
a

Tag attrs

The tag's attributes, organized as a dictionary. Access them as <tag>.attrs.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'html.parser')
print(type(soup.p.attrs))
print(soup.p.attrs)
print(soup.a.attrs)

Output:

<class 'dict'>
{'class': ['title']}
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
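Besides reading the whole .attrs dictionary, individual attributes can be read directly from the tag. The sketch below uses the first link from the demo page; the "target" attribute is simply an example of an attribute that is absent.

```python
from bs4 import BeautifulSoup

html_text = ('<a href="http://www.icourse163.org/course/BIT-268001" '
             'class="py1" id="link1">Basic Python</a>')
link = BeautifulSoup(html_text, "html.parser").a

href = link["href"]          # dictionary-style access; raises KeyError if absent
classes = link["class"]      # class is multi-valued, so this is a list
target = link.get("target")  # .get() returns None for a missing attribute
has_id = link.has_attr("id")

print(href)     # http://www.icourse163.org/course/BIT-268001
print(classes)  # ['py1']
print(target)   # None
print(has_id)   # True
```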

Tag NavigableString

The non-attribute string inside a tag, i.e. the text between <> and </>. Access it as <tag>.string.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'html.parser')
print(type(soup.p.string))
print(soup.p.string)

Output:

<class 'bs4.element.NavigableString'>
The demo python introduces several python courses.
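One caveat worth illustrating: .string only returns a value when the tag has a single child (or a single chain of single children). For a tag that mixes text and nested tags it returns None, and get_text() is the usual way to collect all the text. The HTML below is a small made-up sample.

```python
from bs4 import BeautifulSoup

# Made-up sample: one <p> with a single child, one <p> with mixed content.
html_text = ('<p class="title"><b>Only child</b></p>'
             '<p class="course">Text <a href="#">link</a> more text</p>')
soup = BeautifulSoup(html_text, "html.parser")
title_p, course_p = soup.find_all("p")

single = title_p.string          # reaches through the lone <b> child
mixed = course_p.string          # several children, so None
all_text = course_p.get_text()   # concatenates every string in the subtree

print(single)    # Only child
print(mixed)     # None
print(all_text)  # Text link more text
```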

Tag Comment

from bs4 import BeautifulSoup

html_text = '''
<b><!--This is a comment--></b>
<p>This is not a comment</p>
'''
soup = BeautifulSoup(html_text, 'html.parser')
print(soup.b.string)
print(type(soup.b.string))
print(soup.p.string)
print(type(soup.p.string))

Output:

This is a comment
<class 'bs4.element.Comment'>
This is not a comment
<class 'bs4.element.NavigableString'>

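As you can see, although both the comment and the tag text are obtained through the <tag>.string attribute, their types differ:
a Comment is a special kind of NavigableString, of type bs4.element.Comment.

Keep this in mind in real applications. You can use an if-else statement to tell the two apart:

from bs4.element import Comment

if isinstance(tag.string, Comment):
    pass  # tag.string is a comment
else:
    pass  # tag.string is ordinary text or None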
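To make the check concrete, here is a small runnable sketch that separates comments from ordinary text for every tag in a document. The variable names comments and texts are chosen for illustration; the sample HTML is the same as in the Comment example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html_text = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(html_text, "html.parser")

comments, texts = [], []
for tag in soup.find_all(True):   # True matches every tag
    s = tag.string
    if s is None:                 # tag has no single string child
        continue
    if isinstance(s, Comment):
        comments.append(str(s))
    else:
        texts.append(str(s))

print(comments)  # ['This is a comment']
print(texts)     # ['This is not a comment']
```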

Analyzing the HTML Tree with BeautifulSoup

The HTML document downloaded from https://python123.io/ws/demo.html is as follows:

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

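You can think of it as a tree: the <html> tag is the root, and the other tags hang below it.

This gives several ways to traverse the tag tree:

Downward traversal
Upward traversal
Sideways traversal

Downward traversal

Attribute	Description
.contents	List of child nodes; all direct children of <tag> are stored in a list
.children	Iterator over child nodes; like .contents, used to loop over direct children
.descendants	Iterator over all descendant nodes, used to loop over the whole subtree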

.contents

Inspect the child nodes of the head tag:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'html.parser')
print(soup.head.contents)

Output:

[<title>This is a python demo page</title>]

Inspect the child nodes of the body tag:

print(soup.body.contents)

Output:

['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']

Note that the newline character \n is also returned as a node. This is important: BeautifulSoup treats the text between tags, including a lone newline, as a node. For example:

soup = BeautifulSoup('''
''', 'html.parser')
print(soup.contents)
for child in soup.contents:
    print(type(child))

Output:

['\n']
<class 'bs4.element.NavigableString'>
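Because whitespace counts as a node, code that expects only tags among the children usually needs a filter. A minimal sketch, using a small made-up document, is to keep only the children that are Tag instances:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up document with newlines between the tags.
html_text = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html_text, "html.parser")

all_children = soup.body.contents
tag_children = [child for child in all_children if isinstance(child, Tag)]

print(len(all_children))                       # 5: three '\n' strings and two <p> tags
print([child.name for child in tag_children])  # ['p', 'p']
```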

.children

Used the same way as .contents to walk a node's direct children, except that it returns an iterator instead of a list:

soup = BeautifulSoup(html_text, 'html.parser')
for child in soup.head.children:
    print(child)

Output:

<title>This is a python demo page</title>

.descendants

Inspect all descendant nodes of a tag:

soup = BeautifulSoup(html_text, 'html.parser')
for child in soup.head.descendants:
    print(child)

Output:

<title>This is a python demo page</title>
This is a python demo page

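As you can see, calling .children and calling .descendants on the same head tag gives quite different results: .children yields only the tag's direct children, while .descendants yields all of its descendants. Because BeautifulSoup also treats strings as nodes, the output is: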

<title>This is a python demo page</title>
This is a python demo page
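The difference between the three downward attributes can be summarized in a few lines. This sketch parses only the head portion of the demo page and compares what each attribute returns.

```python
from bs4 import BeautifulSoup

html_text = "<head><title>This is a python demo page</title></head>"
soup = BeautifulSoup(html_text, "html.parser")

contents = soup.head.contents                # a real list
children = list(soup.head.children)          # iterator, materialized here
descendants = list(soup.head.descendants)    # includes the string inside <title>

print(isinstance(contents, list))  # True
print(len(children))               # 1: just <title>
print(len(descendants))            # 2: <title> and its text
```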

Upward traversal

Attribute	Description
.parent	The node's parent tag
.parents	Iterator over the node's ancestor tags, used to loop over ancestors

.parent

print(soup.title.parent)
print(soup.html.parent)
print(soup.parent)

Output, in order:

<head><title>This is a python demo page</title></head>

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

None

.parents

for parent in soup.a.parents:
    print(type(parent), parent.name)

Output:

<class 'bs4.element.Tag'> p
<class 'bs4.element.Tag'> body
<class 'bs4.element.Tag'> html
<class 'bs4.BeautifulSoup'> [document]
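One practical use of .parents is building the path from the document root down to a node. The sketch below does this for a link in a shortened, made-up version of the demo page; the " > " separator is just a display choice.

```python
from bs4 import BeautifulSoup

html_text = ('<html><body><p class="course">'
             '<a id="link1">Basic Python</a></p></body></html>')
soup = BeautifulSoup(html_text, "html.parser")

# .parents walks from the nearest ancestor up to the BeautifulSoup object itself.
names = [parent.name for parent in soup.a.parents]
path = " > ".join(reversed([soup.a.name] + names))

print(names)  # ['p', 'body', 'html', '[document]']
print(path)   # [document] > html > body > p > a
```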

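Sideways traversal

Attribute	Description
.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	Iterator over all following sibling nodes in HTML document order
.previous_siblings	Iterator over all preceding sibling nodes in HTML document order

Note that sibling traversal only takes place among nodes that share the same parent.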
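Here is a small sketch of sibling traversal on a fragment modeled on the demo page's second paragraph. It shows that a sibling is often a string rather than a tag, so filtering by type (or using find_next_sibling) is needed when only tags are wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_text = ('<p><a id="link1">Basic Python</a> and '
             '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html_text, "html.parser")
first = soup.a

next_node = first.next_sibling      # the string ' and ', not the second <a>
prev_node = first.previous_sibling  # nothing precedes the first <a> inside <p>
following = list(first.next_siblings)
tag_siblings = [s for s in following if isinstance(s, Tag)]
next_link = first.find_next_sibling("a")  # skips over the string nodes

print(repr(str(next_node)))   # ' and '
print(prev_node)              # None
print(len(following))         # 3: ' and ', the second <a>, and '.'
print(tag_siblings[0]["id"])  # link2
print(next_link["id"])        # link2
```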

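Summary of traversal methods

Downward: .contents, .children, .descendants
Upward: .parent, .parents
Sideways: .next_sibling, .previous_sibling, .next_siblings, .previous_siblings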

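Finding nodes

The two most commonly used methods for finding nodes are:

find(name, attrs, recursive, string, **kwargs)
find_all(name, attrs, recursive, string, limit, **kwargs)

find: returns the first node that matches, or None if nothing matches
find_all: returns a list of all nodes that match

(Older versions of Beautiful Soup call the string argument text.)

Here are a few common examples.

Find the first <a> tag: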

a = soup.find('a')
print(a)

Output:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Find all <a> tags:

for a in soup.find_all('a'):
    print(a)

Output:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>

Find the <a> tag whose class attribute is py1:

a = soup.find('a', {'class': 'py1'})
print(a)

Output:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Find all tags whose name starts with the letter p:

import re

for tag in soup.find_all(re.compile(r'^p')):
    print(tag.name)

Output:

p
p

Find all <a> tags whose href attribute starts with http://www.icourse163.org/course/:

import re

for tag in soup.find_all('a', {'href': re.compile(r'^http://www\.icourse163\.org/course/.+')}):
    print(tag)

Output:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
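A few more filter forms are often handy: keyword filters such as class_ and id, the limit argument, and matching on text with string. The sketch below runs them against a shortened, made-up version of the demo page; it assumes a Beautiful Soup version recent enough to accept the string argument.

```python
import re
from bs4 import BeautifulSoup

# Shortened, made-up version of the demo page's second paragraph.
html_text = ('<p class="course">Courses: '
             '<a href="#1" class="py1" id="link1">Basic Python</a> and '
             '<a href="#2" class="py2" id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html_text, "html.parser")

by_class = soup.find_all("a", class_="py1")   # class_ avoids the Python keyword
by_id = soup.find(id="link2")                 # any attribute works as a keyword
limited = soup.find_all("a", limit=1)         # stop after the first match
by_text = soup.find_all(string=re.compile(r"Python$"))  # matches strings, not tags
no_match = soup.find("table")                 # find returns None when nothing matches
no_matches = soup.find_all("table")           # find_all returns an empty list

print(len(by_class), by_id.string, len(limited))
print([str(s) for s in by_text])
print(no_match, len(no_matches))
```

Since find returns None on a miss while find_all returns an empty list, looping over find_all results is safe without a prior check, whereas the result of find should be tested before use.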

Summary

Usage

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'html.parser')

Basic elements

Tag
name
attrs
NavigableString
Comment

Traversal methods

Downward traversal
Upward traversal
Sideways traversal

Finding nodes

find
find_all
