I. Load the requests and lxml libraries
import requests
from lxml import etree
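II. Open the target page and analyze it
1. On 51job, search nationwide for Python development engineer positions. The goal is to scrape five fields: job title, company name, work location, salary, and posting date.
2. Start with pagination: turn the page and check whether the URL changes each time.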
Page 2 URL: https://search.51job.com/list/000000,000000,0000,00,9,99,Python%25E5%25BC%2580%25E5%258F%2591%25E5%25B7%25A5%25E7%25A8%258B%25E5%25B8%2588,2,2.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=
Page 3 URL: https://search.51job.com/list/000000,000000,0000,00,9,99,Python%25E5%25BC%2580%25E5%258F%2591%25E5%25B7%25A5%25E7%25A8%258B%25E5%25B8%2588,2,3.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=
Comparing the two URLs (2,2.html?... versus 2,3.html?...), the only part that changes from page to page is the number just before .html.
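The observation above can be sketched as a small URL builder. This is only an illustration: the names URL_TEMPLATE, encode_keyword, and build_page_url are made up here, and the path and query string are copied from the page 2 URL. The keyword segment in that URL happens to match the search term percent-encoded twice, which the sketch reproduces with the standard library.

```python
from urllib.parse import quote

# Path and query string copied from the page 2 URL above; only the keyword
# and the page number are left as placeholders (hypothetical template name).
URL_TEMPLATE = (
    "https://search.51job.com/list/000000,000000,0000,00,9,99,{keyword},2,{page}.html"
    "?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99"
    "&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0"
    "&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00"
    "&from=&welfare="
)


def encode_keyword(keyword):
    """Percent-encode the keyword twice, giving the %25E5... form seen in the URL."""
    return quote(quote(keyword))


def build_page_url(page, keyword="Python开发工程师"):
    """Return the search URL for one result page."""
    return URL_TEMPLATE.format(keyword=encode_keyword(keyword), page=page)


if __name__ == "__main__":
    for page in (2, 3):
        print(build_page_url(page))
```

Keeping the page number as a named placeholder makes it harder to forget the substitution than editing the long URL by hand.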
3. Fetch the data with requests
# Fetch a single page first (page 2 is hard-coded, so no .format() call is needed)
url = "https://search.51job.com/list/000000,000000,0000,00,9,99,Python%25E5%25BC%2580%25E5%258F%2591%25E5%25B7%25A5%25E7%25A8%258B%25E5%25B8%2588,2,2.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
web_data = requests.get(url)
# Stop early on an HTTP error status before touching the body
web_data.raise_for_status()
# Let requests guess the encoding from the content so Chinese text decodes correctly
web_data.encoding = web_data.apparent_encoding
print(web_data.text)
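The fetch step can also be wrapped in a small function. The sketch below is one possible shape, not the author's code: fetch_html and HEADERS are hypothetical names, the User-Agent value is a placeholder, and whether the site needs such a header at all is an assumption. The timeout and the optional session argument are additions for robustness and testing.

```python
import requests

# Placeholder header value; whether 51job requires a User-Agent is an assumption.
HEADERS = {"User-Agent": "Mozilla/5.0 (tutorial-example)"}


def fetch_html(url, session=None, timeout=10):
    """Download one page and return its decoded text.

    `session` may be a requests.Session (or any object with a compatible
    get method); when omitted, the module-level requests.get is used.
    """
    http = session if session is not None else requests
    response = http.get(url, headers=HEADERS, timeout=timeout)
    # Raise on 4xx/5xx before decoding the body
    response.raise_for_status()
    # Guess the encoding from the content, as in the step above
    response.encoding = response.apparent_encoding
    return response.text
```

A caller would pass the page URL, for example fetch_html(url), and hand the returned text to the parsing step.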
4. Parse the data with XPath
# Parse a single page
# Extract five fields: job title, company name, work location, salary, posting date
e = etree.HTML(web_data.text)
jobnames = e.xpath('//*[@id="resultList"]/div/p/span/a/text()')
companys = e.xpath('//*[@id="resultList"]/div/span[1]/a/text()')
places = e.xpath('//*[@id="resultList"]/div/span[2]/text()')
salarys = e.xpath('//*[@id="resultList"]/div/span[3]/text()')
times = e.xpath('//*[@id="resultList"]/div/span[4]/text()')
print(jobnames,companys,places,salarys,times)
This gives us the data we want from a single page.
5. Fetching and parsing multiple pages
Use a for loop to fetch and parse 30 pages in turn.
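One caveat with five separate lists is that they can drift out of alignment if a listing lacks a field, for example an empty salary cell. A possible alternative, sketched below, walks the rows one at a time and applies the same XPath steps relative to each row. SAMPLE_HTML is invented for illustration and only mimics the structure implied by the XPath expressions above; parse_rows is a hypothetical helper name.

```python
from lxml import etree

# Invented sample markup, shaped to match the XPath expressions used above.
SAMPLE_HTML = """
<div id="resultList">
  <div><p><span><a> Python Developer </a></span></p>
       <span><a>Company A</a></span><span>Shanghai</span>
       <span>10-15k/month</span><span>05-20</span></div>
  <div><p><span><a> Backend Engineer </a></span></p>
       <span><a>Company B</a></span><span>Beijing</span>
       <span></span><span>05-21</span></div>
</div>
"""


def parse_rows(html):
    """Return one dict per listing, using an empty string for missing fields."""
    tree = etree.HTML(html)
    records = []
    for row in tree.xpath('//*[@id="resultList"]/div'):

        def first_text(path):
            values = row.xpath(path)
            return values[0].strip() if values else ""

        job = first_text("./p/span/a/text()")
        if not job:
            continue  # skip rows without a job link, such as a header row
        records.append({
            "job": job,
            "company": first_text("./span[1]/a/text()"),
            "place": first_text("./span[2]/text()"),
            "salary": first_text("./span[3]/text()"),
            "time": first_text("./span[4]/text()"),
        })
    return records


if __name__ == "__main__":
    for record in parse_rows(SAMPLE_HTML):
        print(record)
```

Because each record is built from a single row, a missing salary leaves an empty string in that record instead of shifting every later salary up by one position.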
# Pages appear to be numbered from 1 (the URLs above are pages 2 and 3), so loop over 1..30
for i in range(1, 31):
    # Fetch a single page; the page number fills the {} placeholder
    url = "https://search.51job.com/list/000000,000000,0000,00,9,99,Python%25E5%25BC%2580%25E5%258F%2591%25E5%25B7%25A5%25E7%25A8%258B%25E5%25B8%2588,2,{}.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=".format(i)
    ....
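The loop body is elided in the original. As a rough sketch of how the pieces might fit together, the function below takes three caller-supplied callables (build_url, fetch, and parse, all hypothetical names standing in for the URL formatting, the requests call in step 3, and the XPath parsing in step 4) and adds a short pause between requests. The pause is an addition made here out of politeness to the server, not something the original describes.

```python
import time


def crawl_pages(first_page, last_page, build_url, fetch, parse, delay_seconds=1.0):
    """Fetch and parse pages first_page..last_page and return all records.

    build_url(page) -> str, fetch(url) -> str, and parse(html) -> list are
    supplied by the caller; they are placeholders for steps 3 and 4.
    """
    all_records = []
    for page in range(first_page, last_page + 1):
        html = fetch(build_url(page))
        all_records.extend(parse(html))
        if page != last_page:
            time.sleep(delay_seconds)  # brief pause between requests
    return all_records


if __name__ == "__main__":
    # Offline demonstration with stand-in callables instead of real requests
    records = crawl_pages(
        1, 3,
        build_url=lambda page: "page-{}".format(page),
        fetch=lambda url: url.upper(),
        parse=lambda html: [html],
        delay_seconds=0,
    )
    print(records)
```

Passing the steps in as functions keeps the loop itself easy to test without touching the network.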
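6. Partial results
III. Full code
IV. Summary
requests and XPath are two handy tools that make fetching and parsing web pages straightforward. In addition, some job-listing sites have simple page structures and are fairly easy to scrape.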