This article is for learning purposes only; if you spot any mistakes, corrections are welcome.
1. Create a project
mkdir lagou
2. Enter the directory and create a Python 3 virtual environment
pipenv install scrapy
3. Inside the pipenv shell, use the scrapy command to create the crawler project
pipenv shell
scrapy startproject lagou
cd lagou
scrapy genspider -t crawl test www.lagou.com
4. Analyze the pages
Opening a job category, the URL is https://www.lagou.com/zhaopin/Java/?labelWords=label
Let's copy a few more and compare:
https://www.lagou.com/zhaopin/Java/?labelWords=label
https://www.lagou.com/zhaopin/chanpinjingli1/?labelWords=label
https://www.lagou.com/zhaopin/xinmeitiyunying/?labelWords=label
All of them contain the keyword zhaopin/, so we can define a Rule with a regex that matches every URL on the page fitting this pattern:
Rule(LinkExtractor(allow=r'zhaopin/'), follow=True),
After entering a list page, what we actually need are the job-detail pages linked below. Let's open a few of them:
https://www.lagou.com/jobs/4597655.html
https://www.lagou.com/jobs/4182278.html
https://www.lagou.com/jobs/4692422.html
They all contain the keyword jobs/, so the Rule we add looks like this (a full spider skeleton combining both rules follows below):
Rule(LinkExtractor(allow=r'jobs/\d+.*'), callback='parse_item', follow=True),
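Putting the two rules together, the CrawlSpider looks roughly like this. This is a minimal sketch based on the genspider command and the rules above; start_urls and allowed_domains are assumptions, and parse_item is filled in during the next step:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/']

    rules = (
        # follow the job-category list pages, e.g. /zhaopin/Java/
        Rule(LinkExtractor(allow=r'zhaopin/'), follow=True),
        # parse the job-detail pages, e.g. /jobs/4597655.html
        Rule(LinkExtractor(allow=r'jobs/\d+.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        pass  # item extraction is written in the next step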
5. Write the item extraction code
def parse_item(self, response):
    item = zhaoping_xinxi_itemloader()
    item['name'] = response.css('.company::text')[0].extract()
    item['title'] = response.css('.job-name .name::text')[0].extract()
    item['salary'] = response.css('.salary::text')[0].extract()
    item['city'] = response.css('.job_request span::text')[1].extract().split('/')[1]
    item['jingyan'] = response.css('.job_request span::text')[2].extract().split('/')[0]
    item['xueli'] = response.css('.job_request span::text')[3].extract().split('/')[0]
    item['lists'] = response.css('.position-label li::text').extract()  # multiple values
    item['miaosu'] = response.css('.job_bt')[0].extract()
    yield item
Likewise, you need to define the corresponding Item in items.py.
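A minimal sketch of that Item, with one Field per key used in parse_item above (the full version with processors appears in step 8):

import scrapy

class zhaoping_xinxi_itemloader(scrapy.Item):
    name = scrapy.Field()      # company name
    title = scrapy.Field()     # job title
    salary = scrapy.Field()
    city = scrapy.Field()
    jingyan = scrapy.Field()   # experience required
    xueli = scrapy.Field()     # education required
    lists = scrapy.Field()     # position labels (multiple)
    miaosu = scrapy.Field()    # job description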
6. Run the spider
scrapy crawl test -o test.json
We find that what comes back is
https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fpassport.lagou.com%2Flogin%2Flogin.html%3Fmsg%3Dvalidation%26uStatus%3D2%26clientIp%3D101.66.185.15&t=1528418878&_ti=1
That is a login page. In other words, we can crawl the list pages matched by the first rule, but when we try to fetch the job-detail pages, Lagou redirects us to a login wall. To get past it we need to simulate a logged-in user, either by logging in with a POST request or, as this article does in step 7, by carrying the headers and cookies of an already logged-in session.
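For reference, the POST-based route would look roughly like this with scrapy.FormRequest.from_response. The login URL comes from the redirect above; the form field names and credentials are hypothetical, since the article never shows Lagou's login form:

import scrapy

class LagouLoginSpider(scrapy.Spider):
    name = 'lagou_login'
    start_urls = ['https://passport.lagou.com/login/login.html']

    def parse(self, response):
        # submit the login form; field names here are assumptions
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'you@example.com', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        pass  # continue crawling with the logged-in session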
7. Simulate login
In the Spider source code, we can see:
class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """

    name = None
    custom_settings = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []
It has a class attribute called custom_settings, which overrides the project-wide settings for this one spider. We can use it to set the default request headers so that every request looks like it comes from a logged-in browser.
Configure the simulated request by copying the request headers (including the Cookie) of a logged-in browser session, e.g. from the browser's developer tools:
custom_settings = {
    "COOKIES_ENABLED": False,
    "DOWNLOAD_DELAY": 1,
    'DEFAULT_REQUEST_HEADERS': {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'Cookie': 'JSESSIONID=ABAAABAAAFCAAEGBC99154D1A744BD8AD12BA0DEE80F320; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=0; _ga=GA1.2.1111395267.1516570248; _gid=GA1.2.1409769975.1516570248; user_trace_token=20180122053048-58e2991f-fef2-11e7-b2dc-525400f775ce; PRE_UTM=; LGUID=20180122053048-58e29cd9-fef2-11e7-b2dc-525400f775ce; index_location_city=%E5%85%A8%E5%9B%BD; X_HTTP_TOKEN=7e9c503b9a29e06e6d130f153c562827; _gat=1; LGSID=20180122055709-0762fae6-fef6-11e7-b2e0-525400f775ce; PRE_HOST=github.com; PRE_SITE=https%3A%2F%2Fgithub.com%2Fconghuaicai%2Fscrapy-spider-templetes; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2F4060662.html; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1516569758,1516570249,1516570359,1516571830; _putrc=88264D20130653A0; login=true; unick=%E7%94%B0%E5%B2%A9; gate_login_token=3426bce7c3aa91eec701c73101f84e2c7ca7b33483e39ba5; LGRID=20180122060053-8c9fb52e-fef6-11e7-a59f-5254005c3644; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1516572053; TG-TRACK-CODE=index_navigation; SEARCH_ID=a39c9c98259643d085e917c740303cc7',
        'Host': 'www.lagou.com',
        'Origin': 'https://www.lagou.com',
        'Referer': 'https://www.lagou.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    }
}
Note that "COOKIES_ENABLED": False disables Scrapy's cookie middleware, so the hard-coded Cookie header above is sent verbatim with every request. Crawl again, and this time it succeeds!
8. Simplify the spider with ItemLoader
def parse_item(self, response):
    # putongItem is our ItemLoader subclass; pass the Item and response as keyword arguments
    item_loader = putongItem(item=zhaoping_xinxi_itemloader(), response=response)
    item_loader.add_css('name', '.company::text')
    item_loader.add_css('title', '.job-name .name::text')
    item_loader.add_css('salary', '.salary::text')
    item_loader.add_css('city', '.job_request span:nth-child(1)::text')
    item_loader.add_css('jingyan', '.job_request span:nth-child(2)::text')
    item_loader.add_css('xueli', '.job_request span:nth-child(3)::text')
    item_loader.add_css('lists', '.position-label li::text')
    item_loader.add_css('miaosu', '.job_bt')
    yield item_loader.load_item()
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

class zhaoping_xinxi_itemloader(scrapy.Item):  # despite its name, this is a scrapy.Item, not an ItemLoader
    name = scrapy.Field(
        input_processor=MapCompose(change.change_string)  # change_string is the author's cleaning helper
    )
    title = scrapy.Field()
    salary = scrapy.Field()
    city = scrapy.Field()
    jingyan = scrapy.Field()
    xueli = scrapy.Field()
    lists = scrapy.Field()
    miaosu = scrapy.Field()

class putongItem(ItemLoader):
    # every field takes the first extracted value by default
    default_output_processor = TakeFirst()
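The change module used in the input_processor is the author's own helper, and its code is never shown in the article. A plausible minimal sketch, assuming change_string simply cleans whitespace from each extracted value (hypothetical, not from the source):

# change.py -- hypothetical helper module; the real implementation is not shown in the article
def change_string(value):
    # strip surrounding whitespace and newlines from one extracted string
    return value.strip()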