scrapy-splash is a third-party library (package) used together with Scrapy to crawl pages whose content is rendered dynamically by JavaScript.
Installation
pip install scrapy-splash
Usage
This is best enjoyed together with the previous post on installing Docker; from here on I will assume you have read that article on installing and using Docker.
Pull the Splash image (this is run on the Docker host, not inside a container):
docker pull scrapinghub/splash
Then start the Splash service, publishing port 8050:
docker run -p 8050:8050 scrapinghub/splash
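Before wiring Splash into Scrapy, it is worth checking that the service is actually up. You can open http://localhost:8050 in a browser, or hit the render.html endpoint from Python. A minimal sketch (the target URL is simply the example site used later in this post):

import requests

# ask Splash to render the page and return the final HTML
resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'http://bcch.ahnw.gov.cn/default.aspx', 'wait': 0.5},
)
print(resp.status_code)  # 200 means Splash rendered the page
print(resp.text[:200])   # the start of the rendered HTML

A 200 response with HTML in the body means Splash is ready to serve Scrapy.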
Configure the Splash service (all of the following goes in settings.py):
1) Add the Splash server address:
SPLASH_URL = 'http://localhost:8050'
2) Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
3) Enable SplashDeduplicateArgsMiddleware:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
4) Set a custom DUPEFILTER_CLASS:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
5) Set a custom cache storage backend:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
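For reference, the scrapy-splash related part of settings.py then looks like this as a whole:

SPLASH_URL = 'http://localhost:8050'  # where the Splash service listens

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'       # fingerprints Splash requests correctly
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'  # cache storage aware of Splash arguments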
Using it in a Scrapy spider:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Selector
from scrapy_splash import SplashRequest


class DmozSpider(scrapy.Spider):
    name = "bcch"
    # allowed_domains entries are bare domain names, without the scheme
    allowed_domains = ["bcch.ahnw.gov.cn"]
    start_urls = [
        "http://bcch.ahnw.gov.cn/default.aspx",
    ]

    def start_requests(self):
        for url in self.start_urls:
            # 'wait' gives the page's JavaScript time to run before Splash returns
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # the response body is the HTML as rendered by Splash
        resp_sel = Selector(response)
        # for example, pull the page title out of the rendered HTML
        title = resp_sel.xpath('//title/text()').extract_first()
        self.logger.info('page title: %s', title)
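Run the spider as usual with scrapy crawl bcch. If a fixed wait is not enough (for example, the page needs more rendering work before the data appears), scrapy-splash can also run a Lua script through Splash's execute endpoint. A sketch of an alternative start_requests, assuming a one-second wait is enough for this page (tune it for your target):

    def start_requests(self):
        # Lua script executed inside Splash; args.url is filled in per request
        script = """
        function main(splash, args)
            splash:go(args.url)
            splash:wait(1.0)  -- give the page's JavaScript time to finish
            return splash:html()
        end
        """
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='execute',
                                args={'lua_source': script})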
It is easy to use, but for anyone who has never touched Docker, it is still a little bit of a hassle.