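Lagou (拉勾网) loads its data asynchronously from the back end via AJAX, so the first step is to work out which URL the data actually comes from. If that is unfamiliar, see my earlier post on scraping AJAX-loaded pages.
Finding the data URL only led to the next problem: the site's anti-scraping measures, which I found thoroughly frustrating.
I read through a few write-ups on anti-scraping analysis [1] and then got to work.
The approach: open the Network panel in Chrome's developer tools, find the request, and copy every field of its Request Headers into the script, the Cookie in particular.
Then run the scraper script and it works.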
The core code is below; the full source is on GitHub.
import requests

# Request headers copied from the Network panel in Chrome's developer tools.
headers = {'Accept': 'application/json, text/javascript, */*; q=0.01',
'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
'Host': 'www.lagou.com',
'Origin': 'https://www.lagou.com',
'Connection': 'keep-alive',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Cookie': 'user_trace_token=20170308132543-ad47299a-03bf-11e7-9229-5254005c3644; LGUID=20170308132543-ad472cba-03bf-11e7-9229-5254005c3644; index_location_city=%E5%85%A8%E5%9B%BD; JSESSIONID=9B7C15BE2C65CF24358F30F168876EBE; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; TG-TRACK-CODE=index_search; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1488950746,1489900831; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1489900857; _ga=GA1.2.1209512299.1488950743; LGSID=20170319132032-c6079e4e-0c63-11e7-9505-5254005c3644; LGRID=20170319132057-d4f51474-0c63-11e7-9505-5254005c3644; SEARCH_ID=ae3e7bf206f6479fb4c553f6e556682d'
}
# url is the AJAX data URL found in the Network panel.
f = requests.get(url, headers=headers)
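
Copying headers by hand is error-prone: it is easy to run two of them together into a single string. As a rough sketch of one way to avoid that, the helper below turns a header block pasted straight from the developer tools into a dict. The function name `parse_raw_headers` and the sample text in `RAW` are illustrative assumptions, not part of the original script.

```python
def parse_raw_headers(raw: str) -> dict:
    """Turn a 'Name: value' header block copied from the browser into a dict.

    Hypothetical helper for illustration; not part of the original script.
    """
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ":authority".
        if not line or line.startswith(":"):
            continue
        # Split on the first colon only, so values like URLs stay intact.
        name, sep, value = line.partition(":")
        if not sep:
            continue
        headers[name.strip()] = value.strip()
    return headers


# Sample block in the shape the Network panel shows (values taken from the headers above).
RAW = """
Accept: application/json, text/javascript, */*; q=0.01
Host: www.lagou.com
Origin: https://www.lagou.com
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
"""

headers = parse_raw_headers(RAW)
```

The resulting dict can be passed to `requests.get(url, headers=headers)` exactly as before. The Cookie line can be pasted in the same way, since only the first colon on each line is treated as the separator.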