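Beautiful Soup is a Python library for extracting data from HTML and XML files, and it is very convenient to use. Here are two links to start with: the official tutorial is rather long, while the blog post in the second link is very easy to get started with.
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
Blog tutorial: http://cuiqingcai.com/1319.html
Below are my own brief notes for future reference.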
Installation
****One-step install****:
easy_install beautifulsoup4
pip install beautifulsoup4
****Install from source****:
http://www.crummy.com/software/BeautifulSoup/download/4.x/
sudo python setup.py install
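(If you do not have sudo privileges, see my other blog post.)
****Installing a parser****:
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, a third-party parser can be installed with one of the following:
pip install lxml or pip install html5lib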
![Explanation][2]
[2]: http://upload-images.jianshu.io/upload_images/5223866-35b53ffb567d03ff.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240
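Since the parser you get depends on what is installed, here is a minimal sketch of picking one at runtime. The helper name `make_soup` and its preference order are my own assumptions for illustration, not part of Beautiful Soup; the sketch relies on `bs4` raising `FeatureNotFound` when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: try each parser in order, return (soup, parser_name)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; fall through to the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>hi</p>")
print(used)           # whichever parser was available first
print(soup.p.string)
```

Because `html.parser` ships with Python, the loop should always find something to use; the third-party parsers are only tried first because they tend to be faster or more lenient.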
Usage:
import re
import bs4
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
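# Create a BeautifulSoup object; naming the parser explicitly avoids a warning
soup = BeautifulSoup(html_doc, "html.parser")
# Or parse an HTML file: soup = BeautifulSoup(open('index.html'), "html.parser")
# To parse XML, lxml must be installed first (markup stands for your XML string):
# soup = BeautifulSoup(markup, "xml")
# Print the contents of the soup object as formatted output
print(soup.prettify())
# Print only the text, with the HTML tags stripped:
print(soup.get_text())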
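As a small follow-up to `get_text()`, the sketch below shows its `separator` and `strip` arguments on a short snippet. The snippet itself is made up for illustration and is not part of the document above.

```python
from bs4 import BeautifulSoup

# A made-up snippet, just to keep the output short.
snippet = "<p>Once upon a time <b>three</b> sisters</p>"
soup = BeautifulSoup(snippet, "html.parser")

# Plain concatenation of all text nodes.
plain = soup.get_text()

# Join the text nodes with a separator and strip whitespace around each one.
pieces = soup.get_text(separator="|", strip=True)

print(plain)
print(pieces)
```

The second form is handy when the text of neighbouring tags would otherwise run together.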
Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
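To make the four kinds concrete, here is a minimal sketch that pulls one object of each kind out of a shortened version of the sample document. The variable names are mine, and `html.parser` is used only so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<a href="http://example.com/elsie" id="link1"><!-- Elsie --></a>'
)
soup = BeautifulSoup(html, "html.parser")

tag = soup.b             # Tag: an element such as <b>...</b>
text = soup.b.string     # NavigableString: the text inside a tag
comment = soup.a.string  # Comment: a special kind of NavigableString

for obj in (soup, tag, text, comment):
    print(type(obj).__name__)

# Tags expose their name and attributes like a dictionary.
print(tag.name, soup.p["class"], soup.a["id"])
```

Note that `.string` on the `<a>` tag returns the comment without its `<!-- -->` markers, so checking `isinstance(obj, Comment)` is a reasonable way to avoid treating comments as ordinary text.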
To be continued...