1.1 Running BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(str(html.read(), encoding='utf-8'), 'lxml')
print(bs.h1)
# Output
<h1>An Interesting Title</h1>
bs = BeautifulSoup(str(html.read(), encoding='utf-8'), 'lxml')
bs = BeautifulSoup(str(html.read(), encoding='utf-8'), 'html5lib')
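The first argument is the HTML content and the second selects the parser. The two calls above are alternatives that differ only in the parser; the available parsers are html.parser, lxml, and html5lib. Each has trade-offs: html.parser ships with Python, lxml is generally the fastest but must be installed separately, and html5lib is the most lenient with malformed HTML but the slowest, and it also requires a separate install.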
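Since lxml and html5lib may not be installed on every machine, one option is to try the parsers in order of preference and fall back to the built-in one. The sketch below is only an illustration: the helper name `make_soup`, the preference order, and the inline `SAMPLE_HTML` string are assumptions and do not come from the original notes. It relies on `bs4.FeatureNotFound`, which BeautifulSoup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Inline sample so the example runs without a network connection (hypothetical data).
SAMPLE_HTML = "<html><body><h1>An Interesting Title</h1><p>Some text</body></html>"


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_used = make_soup(SAMPLE_HTML)
print(parser_used)  # name of the parser that was actually used
print(soup.h1)
```

Because html.parser is part of the standard library, putting it last in the tuple means the helper should always find a usable parser.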
1.2 Reliable Network Connections and Exception Handling
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    # Return None if the server responds with an HTTP error (e.g. 404 or 500)
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    # Return None if the page has no <body>: bs.body is then None,
    # and accessing .h1 on None raises AttributeError
    try:
        bs = BeautifulSoup(str(html.read(), encoding='utf-8'), 'lxml')
        title = bs.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle('https://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)
# Output
<h1>An Interesting Title</h1>
When writing code, think about its overall layout so that it both catches exceptions and stays easy to read.
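As one possible way to apply this advice, the sketch below also handles the case where the server cannot be reached at all, which `urlopen` reports as `URLError`. The names `get_title`, the `opener` parameter, and the three `fake_*` functions are hypothetical and exist only so the example can run without a network connection; they are not part of the original notes.

```python
import io
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

from bs4 import BeautifulSoup


def get_title(url, opener=urlopen):
    """Return the page's <h1> tag, or None if the page cannot be fetched or has no title."""
    try:
        response = opener(url)
    except HTTPError:
        # The server answered, but with an error status such as 404 or 500.
        return None
    except URLError:
        # The server could not be reached (bad host name, no connection, ...).
        return None
    soup = BeautifulSoup(response.read(), "html.parser")
    try:
        return soup.body.h1
    except AttributeError:
        # soup.body is None when the document has no <body> tag.
        return None


# Hypothetical stand-ins for urlopen, used only for this offline demonstration.
def fake_ok(url):
    return io.BytesIO(b"<html><body><h1>An Interesting Title</h1></body></html>")


def fake_not_found(url):
    raise HTTPError(url, 404, "Not Found", None, None)


def fake_unreachable(url):
    raise URLError("server not found")


for opener in (fake_ok, fake_not_found, fake_unreachable):
    title = get_title("https://www.pythonscraping.com/pages/page1.html", opener)
    if title is None:
        print("Title could not be found")
    else:
        print(title)
```

`HTTPError` is a subclass of `URLError`, so it has to be listed first if the two cases are to be told apart. Keeping all the error handling inside one function means the calling code only has to check for `None`.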