The conventional pyppeteer middleware
In the conventional pyppeteer middleware, pyppeteer is an asyncio-based asynchronous framework, but because it is invoked synchronously, its asynchronous nature is wasted: every render blocks Scrapy, effectively dropping the overall concurrency to 1. See the GitHub project (https://github.com/Python3WebSpider/ScrapyPyppeteer.git) for reference.
import websockets
from scrapy.http import HtmlResponse
from logging import getLogger
import asyncio
import pyppeteer
import logging
from concurrent.futures._base import TimeoutError


class PyppeteerMiddleware():
    def render(self, url, timeout=8.0, **kwargs):
        async def async_render(url, **kwargs):
            page = None
            try:
                page = await self.browser.newPage()
                response = await page.goto(url, options={'timeout': int(timeout * 1000)})
                content = await page.content()
                return content, response.status
            except TimeoutError:
                return None, 500
            finally:
                # Guard against newPage() failing, which would leave `page`
                # unset when this finally block runs.
                if page and not page.isClosed():
                    await page.close()

        # Drive the coroutine to completion synchronously: the Twisted
        # reactor is blocked until the page has rendered, which is exactly
        # why the overall concurrency collapses to 1.
        content, status = asyncio.get_event_loop().run_until_complete(
            async_render(url, **kwargs))
        return content, status

    def process_request(self, request, spider):
        if request.meta.get('render') == 'pyppeteer':
            try:
                html, status = self.render(request.url)
                return HtmlResponse(url=request.url, body=html, request=request,
                                    encoding='utf-8', status=status)
            except websockets.exceptions.ConnectionClosed:
                pass
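Hooking the middleware into a project works like any other downloader middleware, with rendering opted into per request via meta. A minimal sketch; the module path myproject.middlewares and the priority 543 are illustrative, not part of the referenced project:

# settings.py -- module path and priority are placeholders
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.PyppeteerMiddleware': 543,
}

# spider: only requests tagged render='pyppeteer' go through Chromium
import scrapy

class RenderSpider(scrapy.Spider):
    name = 'render'

    def start_requests(self):
        yield scrapy.Request('https://example.com',
                             meta={'render': 'pyppeteer'})

    def parse(self, response):
        self.logger.info('rendered %d bytes', len(response.body))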
The asynchronous pyppeteer middleware
Making the pyppeteer middleware asynchronous takes two steps:
- In process_request, invoke the pyppeteer request coroutine without blocking, and use Deferred.fromFuture to turn the asyncio Future into a Twisted Deferred:
import asyncio

from twisted.internet.defer import Deferred
from scrapy.http import HtmlResponse


def as_deferred(f):
    """Wrap an asyncio Future (or coroutine) in a Twisted Deferred."""
    return Deferred.fromFuture(asyncio.ensure_future(f))


class PuppeteerMiddleware:
    async def _process_request(self, request, spider):
        """Handle the request using Puppeteer"""
        page = await self.browser.newPage()
        # ... navigation/waiting logic elided in the original; at a minimum
        # the page is fetched and its rendered HTML read back:
        response = await page.goto(request.url)
        body = str.encode(await page.content())
        await page.close()
        return HtmlResponse(
            page.url,
            status=response.status,
            headers=response.headers,
            body=body,
            encoding='utf-8',
            request=request
        )

    def process_request(self, request, spider):
        """Check if the Request should be handled by Puppeteer"""
        if request.meta.get('render') == 'pyppeteer':
            return as_deferred(self._process_request(request, spider))
        return None
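Note that both excerpts assume a self.browser attribute that is never created above. One way to set it up (a sketch, not the referenced project's exact code) is through from_crawler and the spider_opened/spider_closed signals, whose handlers may return Deferreds; the launch arguments are illustrative:

import pyppeteer
from scrapy import signals


class PuppeteerMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def spider_opened(self, spider):
        # Launching Chromium is itself a coroutine, so it is bridged with
        # the as_deferred() helper defined above; this requires the asyncio
        # reactor described in the next step.
        return as_deferred(self._launch())

    async def _launch(self):
        self.browser = await pyppeteer.launch(headless=True)

    def spider_closed(self, spider):
        return as_deferred(self.browser.close())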
- Since Scrapy runs on Twisted while pyppeteer runs on asyncio, the two reactors/event loops have to be bridged.
Twisted has a solution for this: asyncioreactor, which runs Twisted on top of asyncio. It must be installed before Scrapy is imported or anything else happens, so sort the reactor out before importing execute:
import asyncio
from twisted.internet import asyncioreactor

asyncioreactor.install(asyncio.get_event_loop())

# The three lines above must come before any Scrapy import; otherwise
# Scrapy cannot be hooked up to asyncio.
from scrapy.cmdline import execute

execute("scrapy crawl spider_name".split())
See the GitHub project (https://github.com/clemfromspace/scrapy-puppeteer.git) for the full implementation.
With the middleware no longer blocking the reactor, Scrapy's concurrency settings apply as usual.
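One practical consequence: once the middleware is non-blocking, Scrapy may open one Chromium page per concurrent request. If that is too heavy, the number of simultaneously open pages can be capped independently of CONCURRENT_REQUESTS. A minimal sketch using asyncio.Semaphore; the max_pages parameter and the limit of 4 are illustrative, not part of the referenced project:

import asyncio
from scrapy.http import HtmlResponse


class PuppeteerMiddleware:
    def __init__(self, max_pages=4):
        # Caps how many Chromium pages are open at once, independently of
        # Scrapy's CONCURRENT_REQUESTS setting.
        self.semaphore = asyncio.Semaphore(max_pages)

    async def _process_request(self, request, spider):
        async with self.semaphore:
            page = await self.browser.newPage()
            try:
                response = await page.goto(request.url)
                body = str.encode(await page.content())
            finally:
                await page.close()
        return HtmlResponse(request.url, status=response.status, body=body,
                            encoding='utf-8', request=request)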