- Install scrapy-redis on each worker machine (e.g. pip install scrapy-redis)
- Edit settings.py (copy the block below into your own settings and adjust as needed)
# Scrapy settings for example project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/topics/settings.html
#
SPIDER_MODULES = ['example.spiders']
NEWSPIDER_MODULE = 'example.spiders'
USER_AGENT = 'scrapy-redis (+https://github.com/rolando/scrapy-redis)'
# Dedup filter: use the Redis-backed request fingerprint filter shared by all workers
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Scheduler: replace Scrapy's default scheduler with the Redis-backed one
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the request queue and dupefilter in Redis between runs (task persistence)
SCHEDULER_PERSIST = True
# Optional: request queue type (priority queue is the default; SpiderQueue is FIFO, SpiderStack is LIFO)
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
# Item pipelines
ITEM_PIPELINES = {
    # 'example.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,  # store scraped items in Redis, replacing the ordinary pipeline
}
# Log level
LOG_LEVEL = 'DEBUG'
# Introduce an artificial download delay to make use of parallelism without hammering the target site.
DOWNLOAD_DELAY = 1
# Redis server that every worker connects to
REDIS_HOST = 'xxx.xxx.xxx.xxx'
REDIS_PORT = 6379
REDIS_PARAMS = {
    'password': 'password',  # replace with your Redis password
}
- Edit the spider: import RedisSpider and inherit from it (if the spider was a CrawlSpider, inherit from both), delete start_urls, and add a redis_key; see the fuller sketch after this snippet
from scrapy.spiders import CrawlSpider
from scrapy_redis.spiders import RedisSpider

class CrawlbiqugeSpider(RedisSpider, CrawlSpider):
    name = 'crawlbiquge'
    redis_key = 'lalala'  # Redis list key the start URLs are read from
    allowed_domains = ['biquge.lu']
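Since the class still inherits from CrawlSpider, it only follows links through its rules. A minimal runnable sketch, in which the link pattern, callback, and item fields are hypothetical placeholders (scrapy-redis also ships a ready-made RedisCrawlSpider for this combination):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisSpider

class CrawlbiqugeSpider(RedisSpider, CrawlSpider):
    name = 'crawlbiquge'
    redis_key = 'lalala'  # start URLs are popped from this Redis list
    allowed_domains = ['biquge.lu']

    # Hypothetical rule: follow links matching /book/ and parse each page.
    rules = (
        Rule(LinkExtractor(allow=r'/book/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Hypothetical fields; yielded items are written to Redis by RedisPipeline.
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url,
        }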
- Start the crawl on each worker machine (e.g. scrapy crawl crawlbiquge); the spiders will idle until start URLs appear in Redis
- Push the initial URL(s) into Redis
- lpush lalala https://xxx (the key lalala just has to match the redis_key defined in the spider; the URL here is a placeholder)
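Seeding can also be done from Python with the redis-py client; a minimal sketch, with the host, password, and URL as placeholders matching the settings above:

import redis

r = redis.Redis(host='xxx.xxx.xxx.xxx', port=6379, password='password')
# The key must match the spider's redis_key; the URL is a placeholder.
r.lpush('lalala', 'https://xxx')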