For web scraping in Python there are plenty of mature options, such as the Scrapy framework and the urllib2 module. What I use most is requests + BeautifulSoup: with BeautifulSoup you only need the select() method and CSS selectors, which in my experience is the easiest approach for beginners to pick up. Below, scraping administrative division codes serves as a simple example to get you started.
Target URL: http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/52.html
1. Install requests and beautifulsoup4
pip install requests
pip install beautifulsoup4
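To confirm both packages are importable, a quick optional sanity check (not part of the original steps) is:
import requests, bs4
print(requests.__version__, bs4.__version__)   # prints the installed version strings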
2. Import requests and BeautifulSoup
import requests
from bs4 import BeautifulSoup
3. Request the page
url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/52.html'
res = requests.get(url=url).content
soup = BeautifulSoup(res, 'html.parser', from_encoding='GBK')
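If the plain request above turns out to be flaky, a slightly more defensive variant might look like the sketch below; the User-Agent string and the 10-second timeout are arbitrary choices of mine, not part of the original example:
import requests
from bs4 import BeautifulSoup

url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/52.html'
headers = {'User-Agent': 'Mozilla/5.0'}          # an assumed browser-like header
resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()                          # stop early on 4xx/5xx responses
resp.encoding = 'GBK'                            # decode the body the same way as above
soup = BeautifulSoup(resp.text, 'html.parser')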
4. Inspect the page elements
Hover over one of the division codes, right-click, and choose "Inspect". You will find that the codes all sit inside table rows with the class name "citytr"; a simplified version of that markup is sketched below.
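To make the selector in the next step concrete, here is a simplified stand-in for one such row; the href and the exact values are only illustrative of the structure, not copied from the live page:
from bs4 import BeautifulSoup

sample = '''
<tr class="citytr">
  <td><a href="52/5201.html">520100000000</a></td>
  <td><a href="52/5201.html">贵阳市</a></td>
</tr>
'''
demo = BeautifulSoup(sample, 'html.parser')
print([a.string for a in demo.select('.citytr a')])   # ['520100000000', '贵阳市']
Each row carries two links: the first holds the 12-digit code, the second the city name.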
5. Select the elements and print them
citys = soup.select('.citytr a')
for city in citys:
    print(city.string)
The output is each city's 12-digit code followed by its name, printed one per line.
The complete code:
import requests
from bs4 import BeautifulSoup

url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/52.html'
res = requests.get(url=url).content                            # raw bytes of the page
soup = BeautifulSoup(res, 'html.parser', from_encoding='GBK')  # parse, decoding as GBK
citys = soup.select('.citytr a')                               # every link inside a "citytr" row
for city in citys:
    print(city.string)                                         # link text: a code or a city name
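Because each "citytr" row contains the code link followed by the name link, the loop above prints codes and names alternately. If you would rather have them paired, one possible sketch, continuing from the soup object built above and assuming every row holds exactly two links, is:
for row in soup.select('.citytr'):
    code, name = [a.string for a in row.select('a')]
    print(code, name)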
In the next installment, we will look at how to structure the scraped data and save it.