- Install scrapy-redis on each worker machine (e.g. pip install scrapy-redis)
- Edit settings.py (copy the block below into your own settings and adjust as needed)
# Scrapy settings for example project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/topics/settings.html
#
SPIDER_MODULES = ['example.spiders']
NEWSPIDER_MODULE = 'example.spiders'
USER_AGENT = 'scrapy-redis (+https://github.com/rolando/scrapy-redis)'
# Dedup filter: use the Redis-backed request fingerprint filter shared by all workers
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Scheduler: replace Scrapy's default scheduler with the Redis-backed one
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the request queue and dupefilter in Redis between runs (task persistence)
SCHEDULER_PERSIST = True
# Optional: request queue type (priority queue is the default; SpiderQueue is FIFO, SpiderStack is LIFO)
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
# Item pipelines
ITEM_PIPELINES = {
    # 'example.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,  # store scraped items in Redis, replacing the ordinary pipeline
}
# Log level
LOG_LEVEL = 'DEBUG'
# Introduce an artificial download delay to make use of parallelism without hammering the target site.
DOWNLOAD_DELAY = 1
# Redis server that every worker connects to
REDIS_HOST = 'xxx.xxx.xxx.xxx'
REDIS_PORT = 6379
REDIS_PARAMS = {
    'password': 'password',  # replace with your Redis password
}
- Edit the spider: import RedisSpider and inherit from it (if the spider was a CrawlSpider, inherit from both), delete start_urls, and add a redis_key; see the fuller sketch after this snippet
from scrapy.spiders import CrawlSpider
from scrapy_redis.spiders import RedisSpider

class CrawlbiqugeSpider(RedisSpider, CrawlSpider):
    name = 'crawlbiquge'
    redis_key = 'lalala'  # Redis list key the start URLs are read from
    allowed_domains = ['biquge.lu']
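Since the class still inherits from CrawlSpider, it only follows links through its rules. A minimal runnable sketch, in which the link pattern, callback, and item fields are hypothetical placeholders (scrapy-redis also ships a ready-made RedisCrawlSpider for this combination):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisSpider

class CrawlbiqugeSpider(RedisSpider, CrawlSpider):
    name = 'crawlbiquge'
    redis_key = 'lalala'  # start URLs are popped from this Redis list
    allowed_domains = ['biquge.lu']

    # Hypothetical rule: follow links matching /book/ and parse each page.
    rules = (
        Rule(LinkExtractor(allow=r'/book/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Hypothetical fields; yielded items are written to Redis by RedisPipeline.
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url,
        }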
- Start the crawl on each worker machine (e.g. scrapy crawl crawlbiquge); the spiders will idle until start URLs appear in Redis
- Push the initial URL(s) into Redis
- lpush lalala https://xxx (the key lalala just has to match the redis_key defined in the spider; the URL here is a placeholder)
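Seeding can also be done from Python with the redis-py client; a minimal sketch, with the host, password, and URL as placeholders matching the settings above:

import redis

r = redis.Redis(host='xxx.xxx.xxx.xxx', port=6379, password='password')
# The key must match the spider's redis_key; the URL is a placeholder.
r.lpush('lalala', 'https://xxx')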