Python Study Notes (3): The BeautifulSoup Library

Introduction

BeautifulSoup is a Python library for parsing HTML and XML. It makes it easy to extract data from web pages.

Installation

pip install beautifulsoup4 lxml

Import

from bs4 import BeautifulSoup

Usage

from bs4 import BeautifulSoup

# Define an HTML string to parse
html = '''
<!doctype html>
<html lang="en"><head><title class='title'>Document</title></head>
<body>
<div><!-- comment content --></div>
<p class="p1">hello world <span>!</span><span>?</span></p>
<p class="p2">world hello</p> 
</body>
</html>
'''

soup = BeautifulSoup(html,'lxml')
print(soup.prettify())


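soup.prettify() re-indents the parsed document, putting each tag on its own line. The result is easier to read, and the visible nesting makes it easier to decide what to extract.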
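To make the effect of prettify() concrete, here is a minimal sketch on a one-line fragment. It assumes the built-in 'html.parser' instead of the 'lxml' parser used in these notes, so that it runs without the extra lxml install. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Assumption: 'html.parser' ships with Python; 'lxml' is a separate install.
compact = '<div><p class="p1">hello world <span>!</span></p></div>'
soup = BeautifulSoup(compact, 'html.parser')

# prettify() returns a new string with one node per line and indentation.
pretty = soup.prettify()
print(pretty)

# The parsed tree itself is unchanged; str(soup) is still the compact form.
print(str(soup) == compact)
```

prettify() is meant for reading and debugging. It adds whitespace, so extract text from the soup object itself, not from the prettified string.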

Matching tags

from bs4 import BeautifulSoup

html = '''
<!doctype html>
<html lang="en"><head><title class='title'>我的python学习之路</title></head>
<body>
<div><!-- comment content --></div>
<p class="p1">hello world <span>!</span><span>?</span></p>
<p class="p2">world hello</p> 
</body>
</html>
'''
soup = BeautifulSoup(html,'lxml')

# Get the first div tag
print('soup.div:',soup.div)
# Get the first span inside the first p inside body
print('soup.body.p.span:',soup.body.p.span)
# Get the class attribute of the first p tag
print("soup.p['class']:",soup.p['class'])

Output:

soup.div: <div><!-- comment content --></div>
soup.body.p.span: <span>!</span>
soup.p['class']: ['p1']

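As you can see, accessing a tag by name returns only the first match, and the result is the whole element, tags included. That alone is rarely enough.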
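The object returned by this kind of access is a Tag, and it exposes more than its markup. The sketch below shows a few of its attributes. It assumes the built-in 'html.parser', and the id value "greeting" is made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a small hypothetical fragment parsed with the built-in parser.
soup = BeautifulSoup('<p class="p1" id="greeting">hello <span>!</span></p>', 'html.parser')

tag = soup.p
tag_name = tag.name            # the tag's name as a string
tag_attrs = tag.attrs          # dict of attributes; class is stored as a list
missing_attr = tag.get('title')  # .get() gives None for a missing attribute
missing_tag = soup.table       # a tag that does not exist gives None

print(tag_name, tag_attrs, missing_attr, missing_tag)
```

Because a missing tag comes back as None, chaining such as soup.table.tr raises AttributeError. Check for None first when the page structure is not guaranteed.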

string(s)

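Extracting the text inside a tag is usually what matters most. BeautifulSoup provides the .string and .strings attributes for this (they are attributes, not methods). Let's see how to use them.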

from bs4 import BeautifulSoup

html = '''
<!doctype html>
<html lang="en"><head><title class='title'>我的python学习之路</title></head>
<body>
<div><!-- comment content --></div>
<p class="p1">hello world <span>!</span><span>?</span></p>
<p class="p2">world hello</p> 
</body>
</html>
'''
soup = BeautifulSoup(html,'lxml')

# Get the text of the title tag
print('soup.title.string:',soup.title.string)
# Get the comment inside the first div
print('soup.div.string:',soup.div.string)
# Get every string inside the first p tag
print('soup.p.strings',list(soup.p.strings))

Output:

soup.title.string: 我的python学习之路
soup.div.string:  comment content
soup.p.strings ['hello world ', '!', '?']

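So .string returns the single string inside a tag, while .strings is a generator over every string inside it, including strings in nested tags.

If a tag has exactly one child and that child is itself a tag, .string still returns the text of that nested tag.

If a tag has more than one child, .string returns None (it does not raise an error), so use .strings instead.

Also, soup.p only refers to the first p tag.

To get every match, use the find_all() method, which is covered later.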
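The three cases above can be checked side by side. This is a minimal sketch that assumes the built-in 'html.parser' and a made-up fragment with two p tags.

```python
from bs4 import BeautifulSoup

# Assumption: hypothetical fragment; the first p has one nested child,
# the second p has two children (a text node and a span).
html = '<div><p><b>only child</b></p><p>hello <span>!</span></p></div>'
soup = BeautifulSoup(html, 'html.parser')
first, second = soup.find_all('p')

single = first.string          # one nested child: its text is returned
multi = second.string          # several children: None
parts = list(second.strings)   # every string, in document order
joined = second.get_text()     # all strings concatenated

print(single, multi, parts, joined)
```

When the number of children is unknown, get_text() is usually the safer choice, since it always returns a string.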

Direct children

from bs4 import BeautifulSoup

html = '''
<!doctype html>
<html lang="en"><head><title class='title'>我的python学习之路</title></head>
<body>
<div><!-- comment content --></div>
<p class="p1">hello world <span>!</span><span>?</span></p>
<p class="p2">world hello</p> 
</body>
</html>
'''
soup = BeautifulSoup(html,'lxml')

# .contents returns the direct children of the first p tag as a list
print('soup.p.contents',soup.p.contents)
print('list(soup.p.contents)',list(soup.p.contents))
print('soup.p.contents[0]',soup.p.contents[0])
# .children returns an iterator over the same direct children
print('soup.p.children',soup.p.children)
print('list(soup.p.children)',list(soup.p.children))

Output:

soup.p.contents ['hello world ', <span>!</span>, <span>?</span>]
list(soup.p.contents) ['hello world ', <span>!</span>, <span>?</span>]
soup.p.contents[0] hello world 
soup.p.children <list_iterator object at 0x000002270F3AEFD0>
list(soup.p.children) ['hello world ', <span>!</span>, <span>?</span>]

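The contents attribute is a list of the tag's direct children, while the children attribute is an iterator over the same nodes, so it has to be wrapped in list() or looped over before the items can be seen. Both give the same direct children, and neither descends into grandchildren.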
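Direct children are a mix of text nodes and tags, and it is often useful to tell them apart. The sketch below does this with an isinstance check against bs4.element.Tag. It assumes the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: the first p tag from the notes, parsed on its own.
soup = BeautifulSoup('<p>hello world <span>!</span><span>?</span></p>', 'html.parser')

# Tags among the direct children
child_tags = [child.name for child in soup.p.children if isinstance(child, Tag)]
# Plain text nodes among the direct children
text_nodes = [str(child) for child in soup.p.children if not isinstance(child, Tag)]
# .contents is a real list, so len() and indexing work directly
child_count = len(soup.p.contents)

print(child_tags, text_nodes, child_count)
```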

Parent nodes

from bs4 import BeautifulSoup

html = '''
<!doctype html>
<html lang="en"><head><title class='title'>我的python学习之路</title></head>
<body>
<div><!-- comment content --></div>
<p class="p1">hello world <span>!</span><span>?</span></p>
<p class="p2">world hello</p> 
</body>
</html>
'''
soup = BeautifulSoup(html,'lxml')

# Parent of the first p tag
print('soup.p.parent',soup.p.parent)
# Parent of the first span tag
print('soup.span.parent',soup.span.parent)
# All ancestors of the first span tag
print('soup.span.parents',soup.span.parents)
# The line above only prints the generator object, so iterate over it instead
for parent in soup.span.parents:
    print(parent.name)

Output:

soup.p.parent <body>
<div><!-- comment content --></div>
<p class="p1">hello world <span>!</span><span>?</span></p>
<p class="p2">world hello</p>
</body>
soup.span.parent <p class="p1">hello world <span>!</span><span>?</span></p>
soup.span.parents <generator object parents at 0x000002F9AC27BF10>
p
body
html
[document]

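Use the .parent attribute to get an element's parent node.

Use the .parents attribute to walk up through all of an element's ancestors.

.parents is a generator, so printing it directly shows only the generator object. Loop over it to see the ancestors.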
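As a compact variant of the loop above, the ancestors can be collected with a list comprehension, and find_parent() can jump straight to a named ancestor. This is a minimal sketch that assumes the built-in 'html.parser' and a cut-down version of the sample page.

```python
from bs4 import BeautifulSoup

# Assumption: a reduced version of the sample HTML from the notes.
html = '<html><body><p class="p1">hello world <span>!</span></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Names of every ancestor of the first span, nearest first
ancestor_names = [parent.name for parent in soup.span.parents]
# find_parent() returns the nearest ancestor with the given tag name
body = soup.span.find_parent('body')

print(ancestor_names)
print(body.name)
```

The last entry, '[document]', is the name of the BeautifulSoup object itself, which sits at the top of the tree.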

The find_all method

print(soup.find_all('title'))
print(soup.find_all('p'))
# class is a reserved word in Python, so the keyword argument is spelled class_
print(soup.find_all(class_='p2'))

Output:

[<title class="title">我的python学习之路</title>]
[<p class="p1">hello world <span>!</span><span>?</span></p>, <p class="p2">world hello</p>]
[<p class="p2">world hello</p>]

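The matches are returned as a list.

find_all(name, attrs, recursive, string, limit, **kwargs)

(text is the older name for the string argument.)

find_all is one of the most commonly used methods when writing scrapers.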
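The parameters in the signature can be tried out on the sample markup. This is a minimal sketch that assumes the built-in 'html.parser' and only the body of the sample page.

```python
from bs4 import BeautifulSoup

# Assumption: the body of the sample HTML from the notes, on one line.
html = ('<body><p class="p1">hello world <span>!</span><span>?</span></p>'
        '<p class="p2">world hello</p></body>')
soup = BeautifulSoup(html, 'html.parser')

# name: loop over every p tag and pull out its text
all_p_text = [p.get_text() for p in soup.find_all('p')]
# attrs: filter by attribute values with a dict (same effect as class_='p2')
by_attrs = soup.find_all('p', attrs={'class': 'p2'})
# limit: stop after the first N matches
first_span = soup.find_all('span', limit=1)
# recursive=False: only look at direct children of body
direct_spans = soup.body.find_all('span', recursive=False)

print(all_p_text)
print(by_attrs, first_span, direct_spans)
```

The spans are grandchildren of body, so the recursive=False search finds none of them.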

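CSS selectors

In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used here to filter elements through soup.select(), which returns a list.

This one is used quite often too!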

Tag name
Class name
id
Combined lookup
Attribute lookup

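Note: select() returns a list, so get_text() is called on the individual elements in it, not on the list itself, to get their text.

It is easy to follow, so I won't paste a code demo here!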
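For anyone who would still like to see the five kinds of lookup listed above in action, here is a minimal sketch. It assumes the built-in 'html.parser'. The id "main" and the data-role attribute are hypothetical additions to the sample markup, included only so that every selector has something to match.

```python
from bs4 import BeautifulSoup

# Assumption: sample markup extended with a made-up id and data-role attribute.
html = '''
<div id="main">
  <p class="p1">hello world <span>!</span><span>?</span></p>
  <p class="p2" data-role="note">world hello</p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

by_tag = soup.select('span')                 # tag name
by_class = soup.select('.p2')                # class name
by_id = soup.select('#main')                 # id
combined = soup.select('#main p.p1 > span')  # combined lookup
by_attr = soup.select('p[data-role="note"]') # attribute lookup

# select() returns a list, so get_text() is called on each element
span_texts = [element.get_text() for element in by_tag]
# select_one() returns only the first match (or None)
first_p2 = soup.select_one('p.p2')

print(span_texts)
print(first_p2.get_text())
```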

To be continued
