Python web scraping

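In pyspider, response.json is the parsed JSON body. To loop over the records, index down to the list that holds the data:

for each in response.json['top_level_key'][...intermediate keys, as many as the JSON nesting requires...]['data_key']: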

For example, given JSON of this shape (the list entry is truncated):

{
    "code": 1,
    "msg": "操作成功",
    "data": {
        "pageNo": 1,
        "hasNext": true,
        "list": [{"docid":"DRQQ35F90511ELD5","boardid":"dy_wemedia_bbs","postid":null,"topicid":null,"recommendtids":null,"userid":null,"nickname":null,"userinfo":null,"title":"海湾被鲜血染成血红色：100多只海豚和鲸鱼惨遭法罗群岛渔民斩杀"}]
    }
}

The corresponding code:

for each in response.json['data']['list']:
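The same lookup can be tried outside pyspider with the standard json module. This is only a minimal sketch: SAMPLE is an abbreviated copy of the response above, and dig is a hypothetical helper of my own (not part of pyspider) that follows however many keys the nesting requires.

```python
import json

# Abbreviated copy of the sample response shown above.
SAMPLE = '''
{"code": 1, "msg": "操作成功",
 "data": {"pageNo": 1, "hasNext": true,
          "list": [{"docid": "DRQQ35F90511ELD5",
                    "boardid": "dy_wemedia_bbs"}]}}
'''


def dig(obj, *keys):
    """Hypothetical helper: index into nested dicts/lists one key at a time."""
    for key in keys:
        obj = obj[key]
    return obj


payload = json.loads(SAMPLE)

# Equivalent to: for each in payload['data']['list']
docids = [each["docid"] for each in dig(payload, "data", "list")]
print(docids)
```

Inside a pyspider callback, response.json takes the place of payload here.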

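Passing parameters in pyspider
Until now I had not been using save to pass parameters between callbacks. The basic pattern looks like this: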

def on_start(self):
    self.crawl('http://www.example.org/',
    callback=self.callback, save={'a': 123})

def callback(self, response):
    return response.save['a']

Values scraped in the previous step can be handed to the next request and then read back in its callback:

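    def index_page(self, response):
        for each in response.json['data']['list']:
            docid = each['docid']
            title = each['title']
            imgsrc = each['imgsrc']
            # Pass imgsrc along via save so detail_page can read it from response.save.
            self.crawl('http://www.***.com/***', callback=self.detail_page,
                       save={'imgsrc': imgsrc})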

    @config(priority=2)
    def detail_page(self, response):
        imgsrc=response.save['imgsrc']
        content=response.doc('#content').html()
        return {
            "content":content,
            "title": response.doc('h2').text(),
            "imgsrc":imgsrc
        }

This way the detail callback can use the values collected in the previous step.
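One thing to watch: if a list entry lacks a field (the truncated sample entry above shows no imgsrc, for instance), each['imgsrc'] raises KeyError. A rough sketch of one way around that is to build the save dict with dict.get. Note that build_save and the example entry are hypothetical names and data of mine, not anything pyspider provides.

```python
def build_save(entry, keys=("docid", "title", "imgsrc")):
    """Hypothetical helper: collect the fields to forward to the next callback.

    Missing keys become None instead of raising KeyError.
    """
    return {key: entry.get(key) for key in keys}


# Made-up entry without an imgsrc field, for illustration only.
entry = {"docid": "DRQQ35F90511ELD5", "title": "example title"}
save = build_save(entry)
print(save)

# Inside index_page this could be used as:
#   self.crawl(url, callback=self.detail_page, save=build_save(each))
```

detail_page would then need to handle response.save['imgsrc'] being None.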

Last edited: 2018.09.16 15:42:06
