- Hello everyone, here we are again.
- This time this noob went after Lagou (拉勾网). Its anti-crawler measures had me stepping in pit after pit, but I got it scraped in the end. The Lagou programmers really have it rough =-=
- Alright, enough talk, on to the approach plus the code. This time I'll do it a little differently and walk through my thinking too.
- When I first opened Lagou, the job info was right there in the page, but it wasn't easy to get at; I'd have had to match it with regexes or something. Then I remembered something I'd seen in a group chat: capture the request. So I hit F12, poked around the Network panel, and sure enough, there it was:
- This is the data we want. Then look at how the request is structured.
Now, if you open that url on its own, you'll find...
- Ha! Surprised? Shocked? The data doesn't come back.
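- You can see the same "surprise" from code. Here's a bare request with none of the browser's headers (just a sketch; the exact rejection response the site sends back varies):

import requests

url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
resp = requests.post(url, data={'first': 'false', 'kd': 'python', 'pn': 1})
print(resp.status_code, resp.text[:200])  ## without the right headers and cookie, no job data comes back

- So the request headers mentioned above have to go into the code, which is exactly what the next block does.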
user_agent = [
'Mozilla/5.0 (Windows NT 6.1; rv:50.0) Gecko/20100101 Firefox/50.0',
'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
'Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; Trident/5.0)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14',
'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
'Mozilla/5.0 (iPad; CPU OS 10_1_1 like Mac OS X) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0 Mobile/14B100 Safari/602.1',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0'
]
user_agent = random.choice(user_agent)  ## pick one of the user agents at random
header = {  ## then the request headers; these are copied straight from my own browser request
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',  ## (no Content-Length here: requests fills that in itself)
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=sug&fromSearch=true&suginput=p',
    'User-Agent': user_agent,
    'X-Anit-Forge-Code': '0',
    'X-Anit-Forge-Token': 'None',
    'X-Requested-With': 'XMLHttpRequest'
}
## this "cooking" dict is the login-simulation parameter; it's the cookie
cooking = {
    'cookie': 'your cookie here'  ## this should probably stay secret, so put your own cookie here
}
## "dates" is the POST form data: just the page number and the keyword
dates = {
    'first': 'false',
    'kd': self.kd,
    'pn': self.page
}
html = requests.post(self.url, headers=header, cookies=cooking, data=dates)
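- Before wrapping all of this in a class, it's worth sanity-checking that the capture works. A minimal sketch (url here stands for the positionAjax.json address; header, cooking, and dates are the dicts built above; my assumption is that a blocked response simply lacks the 'content' key, which the full code below handles with a bare except):

resp = requests.post(url, headers=header, cookies=cooking, data=dates)
print(resp.status_code)  ## 200 alone doesn't mean success on this site
body = resp.json()  ## equivalent to json.loads(resp.text)
print('content' in body)  ## True means the job data actually came back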
- And that grabs the response. Below is the complete code:
'''
This time I'm trying to scrape Lagou and then analyze it.
A noob wants to play too.
Author of this code: 高佳乐
'''
import requests  ## the requests library, for HTTP
import json  ## the json library, for parsing the response
import random  ## the random library, for picking user agents and sleep times
from openpyxl import Workbook  ## openpyxl, for managing Excel files
import time  ## time controls the crawler's sleep; a crawler that doesn't sleep is just being a hooligan
class Reptilian():  ## the class; Baidu Translate told me this word means "crawler", so it has more swagger
    def __init__(self):  ## the constructor; I wanted global variables at first, but that looked ugly, so they live here
        self.url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'  ## the search page's Ajax url
        self.pages = int(input('How many pages do you want to fetch? '))  ## the total number of pages to POST for
        self.page = 1  ## start from page 1
        self.kd = input('Which job do you want info on? ')  ## the POST keyword
        self.number = 1  ## just a row counter; a global looked ugly so it lives here too
        self.shuju = Workbook()  ## this line opens a new Excel workbook
        self.shuju_one = self.shuju.active  ## and this is the workbook's first sheet
    def headers(self):  ## builds the request; the list below holds 11 user agents, and one gets picked at random further down
user_agent = [
'Mozilla/5.0 (Windows NT 6.1; rv:50.0) Gecko/20100101 Firefox/50.0',
'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
'Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; Trident/5.0)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14',
'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
'Mozilla/5.0 (iPad; CPU OS 10_1_1 like Mac OS X) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0 Mobile/14B100 Safari/602.1',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0'
]
        user_agent = random.choice(user_agent)  ## pick one of the user agents at random
        header = {  ## then the request headers; these are copied straight from my own browser request
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Connection': 'keep-alive',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',  ## (no Content-Length here: requests fills that in itself)
            'Host': 'www.lagou.com',
            'Origin': 'https://www.lagou.com',
            'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=sug&fromSearch=true&suginput=p',
            'User-Agent': user_agent,
            'X-Anit-Forge-Code': '0',
            'X-Anit-Forge-Token': 'None',
            'X-Requested-With': 'XMLHttpRequest'
        }
        ## this "cooking" dict is the login-simulation parameter; it's the cookie (use your own)
        cooking = {
'cookie': 'user_trace_token=20180710103626-b4c2ffdc-1f66-4faf-9d75-4722b6cfd916; LGUID=20180710103627-0baf62d4-83ea-11e8-8271-525400f775ce; WEBTJ-ID=20180710140917-16482cef3b816-07cbfad53c941d-5b4b2b1d-1327104-16482cef3b91cb; _gat=1; PRE_UTM=m_cf_cpt_baidu_pc; PRE_HOST=www.baidu.com; PRE_SITE=https%3A%2F%2Fwww.baidu.com%2Fs%3Fie%3Dutf-8%26f%3D8%26rsv_bp%3D1%26rsv_idx%3D1%26tn%3Dbaidu%26wd%3D%25E6%258B%2589%25E5%258B%25BE%25E7%25BD%2591%26oq%3D%2525E7%252588%2525AC%2525E8%252599%2525AB%2525E6%25258B%252589%2525E5%25258B%2525BE%2525E7%2525BD%252591%26rsv_pq%3De97818190000b3ba%26rsv_t%3Deb2ei8ThN4xypS3meOdbjcF6svWBOdHFVTnNnKnHn64IbwKkxuhAYbl4Oxw%26rqlang%3Dcn%26rsv_enter%3D1%26inputT%3D437%26rsv_sug3%3D22%26rsv_sug1%3D13%26rsv_sug7%3D100%26rsv_sug2%3D0%26rsv_sug4%3D1747; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Flp%2Fhtml%2Fcommon.html%3Futm_source%3Dm_cf_cpt_baidu_pc; TG-TRACK-CODE=index_search; JSESSIONID=ABAAABAABEEAAJAEF19E7498CA3C7D6C23289D4F4DAFC62; X_HTTP_TOKEN=a320c7314b39615089e7c8d4e844cdcd; _putrc=51B6FCDBAC8CA5C6123F89F2B170EADC; login=true; unick=%E6%8B%89%E5%8B%BE%E7%94%A8%E6%88%B72721; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=0; gate_login_token=25dac49e83fe12a32f74b689b48b5d7ce91e70cb80e78cb5ed521cb200b461e9; _ga=GA1.2.994010031.1531190203; _gid=GA1.2.814166295.1531190203; LGSID=20180710140902-be2e38ab-8407-11e8-8281-525400f775ce; LGRID=20180710141058-0329240f-8408-11e8-8281-525400f775ce; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1531190203,1531202958; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1531203074; SEARCH_ID=1c3bb17eded44f4ba841ca9a0b08909d; index_location_city=%E5%85%A8%E5%9B%BD'
}
        ## "dates" is the POST form data: just the page number and the keyword
        dates = {
            'first': 'false',
            'kd': self.kd,
            'pn': self.page
        }
        html = requests.post(self.url, headers=header, cookies=cooking, data=dates)  ## html is the JSON response to the POST
        return html
    def parse_json(self):  ## the parsing function
        html = self.headers()  ## get the response back from the function above
        html = html.text  ## pull the JSON out as text
        html = json.loads(html)  ## then parse it with loads
        return html  ## and return the parsed html
    def save(self):  ## the save function
        shuju = self.shuju  ## shuju is the workbook from the constructor
        shuju.save(self.kd + '.xlsx')  ## save an Excel file named after the keyword
    def content(self):  ## now the main part
        while self.page <= self.pages:  ## loop while the current page <= the total given at the prompt
            html = self.parse_json()  ## the parsed JSON for this page
            try:  ## if it worked
                html_content = html['content']  ## html_content is the value of html's 'content' key
                html_positionResult = html_content['positionResult']  ## the 'positionResult' value inside content
                html_result = html_positionResult['result']  ## the 'result' value inside positionResult
                self.shuju_one.title = 'Data'  ## the first sheet's title
                for result in html_result:  ## 'result' is a list, so loop over it
                    positionName = result['positionName']  ## the job title
                    education = result['education']  ## the education the job asks for
                    city = result['city']  ## where the job is
                    self.shuju_one['A%d' % self.number].value = positionName  ## column A gets the job title; number (from the constructor) tracks which row we're on
                    self.shuju_one['B%d' % self.number].value = education  ## column B gets the education
                    self.shuju_one['C%d' % self.number].value = city  ## column C gets the city
                    self.number += 1  ## then bump the row counter
                print('Page %d saved' % self.page)  ## one page done
            except:
                print('Access denied')  ## if it failed, we got blocked
            self.page += 1  ## then move on to the next page
            sleep = random.randint(28, 32)  ## sleep 28-32 seconds, because otherwise only the first three pages come through; a crawler that doesn't sleep is just being a hooligan
            time.sleep(sleep)  ## sleep
        self.save()  ## save the workbook
######################################################################################################################################################
shuju = Reptilian()
shuju.content()
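- One small design note on the Excel writing: the three per-cell writes (A%d / B%d / C%d) can be a single append call, which also makes it easy to give the sheet a header row. A standalone sketch (the file name and sample row are made up for illustration):

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = 'Data'
ws.append(['positionName', 'education', 'city'])  ## header row in row 1
ws.append(['some job title', 'Bachelor', 'Beijing'])  ## one data row, just for illustration
wb.save('demo.xlsx')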
Alright, the code is all up there. Now some things to watch out for:
- You need the request headers, you need the cookie, and you need to keep POST and GET straight (see the session sketch after this list for another way to handle the cookie).
- A crawler has to learn to sleep. A few days ago I saw this in a group chat: a crawler that doesn't sleep is just being a hooligan =-=. So let's have some manners.
- Basically, mind the second point.
- And I'd say the third point is spot on.
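- About that cookie: instead of pasting one copied out of the browser, another thing to try (just a sketch; whether it actually satisfies Lagou's checks is my assumption, not something the code above relies on) is letting a requests.Session collect cookies by visiting the search page first:

import requests

list_url = 'https://www.lagou.com/jobs/list_python'  ## the search page that sets the cookies
ajax_url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0',
    'Referer': list_url
}
session = requests.Session()
session.get(list_url, headers=header)  ## the response's Set-Cookie headers land on the session
resp = session.post(ajax_url, headers=header, data={'first': 'false', 'kd': 'python', 'pn': 1})
print(resp.status_code)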
Finally, a few charts of the data (I used an online tool for the analysis).
Since I'm studying UI design, I searched for UI jobs, plus the education they ask for and where they're located.
It runs a bit slowly. With multithreading or multiprocessing it would probably be faster, but I don't know how to do that yet, haha.
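- (For the curious, here's roughly the shape a threaded version could take with concurrent.futures. Purely a hypothetical sketch: fetch_page is a stand-in I made up, and given the 28-32 second sleeps the site seems to demand, firing pages in parallel may just get you blocked faster.)

import time
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page):
    ## hypothetical stand-in: the POST + JSON parsing for one page would go here
    time.sleep(1)  ## pretend to do the network work
    return page

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch_page, range(1, 6)))  ## pages 1-5, at most 3 in flight
print(results)  ## [1, 2, 3, 4, 5]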
Alright, time to say goodbye again. See you next time, in a few days!