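Python Web Scraping, Chapter 3: Crawling Wikipedia (1)

3.1 Traversing a Single Domain

Goal: collect every link on the Wikipedia Kevin Bacon page that points to another article.

3.1.1 Crawling an Arbitrary Wikipedia Page

Example code: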

from urllib.request import urlopen
from bs4 import BeautifulSoup


html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
soup = BeautifulSoup(html, 'lxml')
for link in soup.find_all('a'):  # every hyperlink on the page is an <a> tag
    if 'href' in link.attrs:
        print(link.attrs['href'])  # the href attribute holds the link target

Output:

....
/wiki/Michael_Douglas
/wiki/Miguel_Ferrer
/wiki/Albert_Finney
/wiki/Topher_Grace
....
https://wikimediafoundation.org/wiki/Privacy_policy
/wiki/Wikipedia:About
/wiki/Wikipedia:General_disclaimer
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://www.mediawiki.org/wiki/Special:MyLanguage/How_to_contribute
https://wikimediafoundation.org/wiki/Cookie_statement
//en.m.wikipedia.org/w/index.php?title=Kevin_Bacon&mobileaction=toggle_view_mobile
https://wikimediafoundation.org/
//www.mediawiki.org/
[Finished in 4.7s]

The output contains every link on the page, including some we do not need, for example:

//en.m.wikipedia.org/w/index.php?title=Kevin_Bacon&mobileaction=toggle_view_mobile
https://wikimediafoundation.org/
//www.mediawiki.org/
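
One way to see how these links differ from article links is to look at the host part of each href. The sketch below uses urlsplit from the standard library's urllib.parse; the helper name link_host is made up for illustration and is not part of the original code.

```python
from urllib.parse import urlsplit


def link_host(href):
    """Return the host named in href, or '' for a site-relative path."""
    return urlsplit(href).netloc


# Sample hrefs copied from the output above
for href in ["/wiki/Michael_Douglas",
             "//www.mediawiki.org/",
             "https://wikimediafoundation.org/"]:
    print(repr(link_host(href)), href)
```

Links that name a host explicitly, including protocol-relative ones that start with //, return that host; site-relative paths such as /wiki/Michael_Douglas return an empty string.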

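Inspecting the page structure with the browser's Inspect tool shows that links to article pages share three traits:
1. They sit inside the div whose id is bodyContent.
2. Their URLs do not contain a colon (":").
3. Their URLs start with "/wiki/".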

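The first trait can be expressed as a BeautifulSoup lookup, and the other two as regular expressions:
1. soup.find('div', {'id': 'bodyContent'})
2. regex = re.compile(r'^((?!:).)*$')  # (?!:) is a negative lookahead: the next character must not be a colon
3. regex = re.compile(r'^(/wiki/)')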
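
As a quick offline check, the two patterns can be combined and tried on a few hrefs taken from the output above. This is only a minimal sketch; the name ARTICLE_HREF and the sample list are chosen for illustration.

```python
import re

# Combined pattern: starts with /wiki/ and has no colon anywhere after it
ARTICLE_HREF = re.compile(r"^(/wiki/)((?!:).)*$")

samples = [
    "/wiki/Michael_Douglas",
    "/wiki/Wikipedia:About",
    "//www.mediawiki.org/",
    "https://wikimediafoundation.org/",
]

# Keep only the hrefs that look like article links
article_links = [href for href in samples if ARTICLE_HREF.search(href)]
print(article_links)  # ['/wiki/Michael_Douglas']
```

The colon check rejects /wiki/Wikipedia:About, and the /wiki/ prefix check rejects the two links that point at other hosts.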

The improved code is therefore:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


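html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
soup = BeautifulSoup(html, 'lxml')
# Starts with /wiki/ and contains no colon after that prefix
regex = re.compile(r"^(/wiki/)((?!:).)*$")
# href=regex keeps only <a> tags whose href matches, so no separate 'href' check is needed
for link in soup.find('div', {'id': 'bodyContent'}).find_all('a', href=regex):
    print(link.attrs['href'])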

Result:

/wiki/Kevin_Bacon_(disambiguation)
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Footloose_(1984_film)
....
/wiki/International_Standard_Name_Identifier
/wiki/Integrated_Authority_File
/wiki/Syst%C3%A8me_universitaire_de_documentation
/wiki/Biblioth%C3%A8que_nationale_de_France
/wiki/MusicBrainz
/wiki/Biblioteca_Nacional_de_Espa%C3%B1a
/wiki/SNAC
[Finished in 8.4s]
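
The hrefs printed above are site-relative and some are percent-encoded. If they are to be fetched or displayed later, something like the following sketch could help. It uses urljoin and unquote from the standard library's urllib.parse; BASE_URL and to_absolute are names assumed here for illustration.

```python
from urllib.parse import urljoin, unquote

# The page the hrefs were collected from (same URL as in the example above)
BASE_URL = "http://en.wikipedia.org/wiki/Kevin_Bacon"


def to_absolute(href):
    """Resolve a site-relative href against the page it was found on."""
    return urljoin(BASE_URL, href)


print(to_absolute("/wiki/Kyra_Sedgwick"))
# http://en.wikipedia.org/wiki/Kyra_Sedgwick

# unquote decodes percent-encoded UTF-8 bytes into readable characters
print(unquote("/wiki/Biblioteca_Nacional_de_Espa%C3%B1a"))
# /wiki/Biblioteca_Nacional_de_España
```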

Last edited: 2018.02.23 10:51:30
