Paginated crawling with Scrapy

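First, find out in the browser where the "next page" button points. I use XPath for this, which is fairly convenient.

Open the developer tools, select the "next page" button, then right-click it and choose Copy - Copy XPath.

You can then install an XPath extension in Chrome to check that the expression matches the right element.

The XPath gives you the JS function that runs when "next page" is clicked. Next, find the definition of that function in the page source. On this site the JS takes the arguments it is given and submits a form, so Scrapy has to submit the same form. The code is as follows: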

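# These imports belong at the top of the spider module.
from scrapy import FormRequest, Request, Selector

# parse() is a method of the spider class.
def parse(self, response):
    baseUrl = 'http://zc.xatrm.com'
    selector = Selector(response)
    # Links to the detail pages listed on the current page
    links = selector.xpath('//*[@id="listForm"]/div/div/div/div[1]/div/a/@href').extract()
    for link in links:
        url = baseUrl + link
        yield Request(url, callback=self.parse_news_detail)

    # Check whether there is a next page
    next_pages = selector.xpath('//*[@id="listForm"]/nav/ul/li[@class="pagat pagt-next"]/a/@onclick').extract()
    if next_pages:
        page_url = "http://zc.xatrm.com/front_policy/list.action"
        # The onclick value has the form "name(begin, length)"; slice out both arguments
        onclick = next_pages[0]
        begin = onclick[onclick.find("(") + 1 : onclick.find(",")]
        # "+ 2" skips the comma and the space that follows it
        length = onclick[onclick.find(",") + 2 : onclick.find(")")]
        # Parameters for the POST request
        form_data = {"begin": begin, "length": length}
        # FormRequest submits the form; with no callback given, Scrapy passes the response back to parse()
        yield FormRequest(page_url, formdata=form_data)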

Leave the other methods of the spider unchanged. With this in place, the spider follows the pagination and crawls every page.
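The slicing with find() above depends on the exact spacing inside the onclick value. As a minimal sketch of an alternative, the two arguments can be pulled out with a regular expression that tolerates optional whitespace. The helper names parse_onclick_args and build_form_data, the sample value "goPage(20, 10)", and the assumption that both arguments are plain integers are mine for illustration, not taken from the site. The last line only shows what the URL-encoded POST body for such a form would look like.

```python
import re
from urllib.parse import urlencode

# Matches "(arg1, arg2)" and captures two integer arguments,
# allowing optional whitespace around each of them.
# Assumption: both arguments are plain non-negative integers.
ONCLICK_ARGS = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_onclick_args(onclick):
    """Return (begin, length) as strings, or None if the pattern does not match."""
    match = ONCLICK_ARGS.search(onclick)
    if match is None:
        return None
    return match.group(1), match.group(2)


def build_form_data(onclick):
    """Build the form fields for the next page, or None if there is nothing to parse."""
    args = parse_onclick_args(onclick)
    if args is None:
        return None
    begin, length = args
    return {"begin": begin, "length": length}


if __name__ == "__main__":
    sample = "goPage(20, 10)"  # hypothetical onclick value, for illustration only
    form_data = build_form_data(sample)
    print(form_data)
    # The form fields as a URL-encoded request body
    print(urlencode(form_data))
```

In the spider, the result of build_form_data(next_pages[0]) could be passed as formdata to FormRequest, skipping the request when it returns None.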
