Scraping Zhihu Questions and Answers with Scrapy

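Zhihu is an online Q&A community with a friendly, rational atmosphere that connects experts from every field. Users share their professional knowledge, experience, and insights, providing a steady stream of high-quality information for the Chinese-language internet.
Strictly speaking, Zhihu is more like a forum: users discuss topics they are interested in and can follow people with similar interests. For explanations of concepts, online encyclopedias cover almost every question you might have; pulling together divergent lines of thought, however, is what sets Zhihu apart.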

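To learn from the "highly educated, high-income, high-spending" experts there, this humble beginner tried using Scrapy to simulate a login and scrape Zhihu's questions and their answers.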

Simulating login

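Before trying this with Scrapy, I had simulated logins with requests, where session and cookies saved me a lot of time.
With Scrapy, the login has to go through Scrapy's own Request objects.
The login process first requires replacing Scrapy's default User-Agent and then POSTing the required data to the login URL. Inspecting the page and the Network tab of Chrome's developer tools reveals the URL to POST to and the data to send.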

    headers={
    "HOST":"www.zhihu.com",
    "Referer":"https://www.zhihu.com",
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0"
    }

Scrapy's default User-Agent cannot crawl a site like Zhihu that has some anti-crawler measures, so we need to supply our own headers.
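
As an alternative to passing headers on every request, the same values could be set once in the project's settings.py through Scrapy's USER_AGENT and DEFAULT_REQUEST_HEADERS settings. This is a minimal sketch that assumes a standard Scrapy project layout; the header values are copied from the dictionary above, and whether Zhihu accepts them is not something this sketch can guarantee.

```python
# settings.py (sketch): project-wide defaults applied to every request

# Replaces Scrapy's built-in default User-Agent
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) "
    "Gecko/20100101 Firefox/55.0"
)

# Extra headers merged into each request unless the request sets its own
DEFAULT_REQUEST_HEADERS = {
    "Referer": "https://www.zhihu.com",
}
```

Headers passed explicitly to a Request, as the spider in this post does, take precedence over these defaults.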

Since we are simulating a login, the data POSTed to the login page is essential.

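import re

account = input("Enter your account\n--->")
password = input("Enter your password\n--->")
# response is the Scrapy response for the login page
_xsrf = response.xpath('/html/body/div[1]/div/div[2]/div[2]/form/input/@value').extract_first()
if re.match(r"^1\d{10}$", account):
    # An 11-digit number starting with 1 is treated as a phone number
    print("Logging in with phone number")
    post_url = "https://www.zhihu.com/login/phone_num"
    post_data = {
        "_xsrf": _xsrf,
        "phone_num": account,
        "password": password,
        "captcha": ""
    }
elif "@" in account:
    # Treat an account containing "@" as an email address
    print("Logging in with email")
    post_url = "https://www.zhihu.com/login/email"
    post_data = {
        "_xsrf": _xsrf,
        "email": account,
        "password": password,
        "captcha": ""
    }
else:
    # Without this branch post_url and post_data would be undefined
    raise ValueError("The account must be a phone number or an email address")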

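A regular expression decides whether the account you entered is a phone number or an email address. Zhihu uses a different POST URL depending on whether you log in with a phone number or an email.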
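
The account check can also be pulled out into a small function that is easy to test on its own. This is only a sketch: the helper name choose_login is hypothetical, and the two URLs are the ones quoted in this post.

```python
import re

# Endpoints taken from the post; the helper name below is hypothetical
PHONE_LOGIN_URL = "https://www.zhihu.com/login/phone_num"
EMAIL_LOGIN_URL = "https://www.zhihu.com/login/email"


def choose_login(account):
    """Return (post_url, field_name) for a phone number or an email account."""
    # fullmatch requires the whole string to be 1 followed by ten digits
    if re.fullmatch(r"1\d{10}", account):
        return PHONE_LOGIN_URL, "phone_num"
    if "@" in account:
        return EMAIL_LOGIN_URL, "email"
    raise ValueError("account must be a phone number or an email address")


post_url, field = choose_login("13800000000")
print(post_url, field)
```

Using fullmatch rather than a prefix match keeps an address such as 13800000000@example.com from being mistaken for a phone number.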

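_xsrf is a random token hidden in the login page; it can be extracted with a regular expression or with Scrapy's own XPath or CSS selectors.
captcha is the verification code. Zhihu asks for a captcha when logging in.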
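
As a rough illustration of the regular-expression route, the sketch below pulls the token out of a made-up HTML fragment. The markup and the token value are assumptions for the example; the real login page may order its attributes differently, in which case the pattern would need adjusting.

```python
import re

# Hypothetical fragment of a login page; the real markup may differ
SAMPLE_HTML = '''
<form method="POST">
  <input type="hidden" name="_xsrf" value="abc123def456"/>
</form>
'''


def extract_xsrf(html):
    """Return the value of the hidden _xsrf input, or None if it is absent."""
    match = re.search(r'name="_xsrf"\s+value="([^"]*)"', html)
    return match.group(1) if match else None


print(extract_xsrf(SAMPLE_HTML))
```

Inside a spider, a selector such as response.css('input[name="_xsrf"]::attr(value)').extract_first() would do the same job without depending on attribute order.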
The full login source is as follows:

import json
import re
from urllib import parse

import scrapy
from PIL import Image


class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    start_urls = ['https://www.zhihu.com/explore']
    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0",
    }

    def parse(self, response):
        # Placeholder; the parsing logic is covered in the data section
        pass

    def start_requests(self):
        # Zhihu can only be browsed after logging in, so override the entry point
        return [scrapy.Request("https://www.zhihu.com/#signin", headers=self.headers, callback=self.login)]

    def login(self, response):
        _xsrf = response.xpath('/html/body/div[1]/div/div[2]/div[2]/form/input/@value').extract_first()
        account = input("Enter your account\n--->")
        password = input("Enter your password\n--->")
        if re.match(r"^1\d{10}$", account):
            print("Logging in with phone number")
            post_url = "https://www.zhihu.com/login/phone_num"
            post_data = {
                "_xsrf": _xsrf,
                "phone_num": account,
                "password": password,
                "captcha": ""
            }
        elif "@" in account:
            # Treat an account containing "@" as an email address
            print("Logging in with email")
            post_url = "https://www.zhihu.com/login/email"
            post_data = {
                "_xsrf": _xsrf,
                "email": account,
                "password": password,
                "captcha": ""
            }
        else:
            # Without this branch post_url and post_data would be undefined
            raise ValueError("The account must be a phone number or an email address")

        # Keep the form data in meta so it can be re-sent after a captcha
        return [scrapy.FormRequest(
            url=post_url,
            formdata=post_data,
            headers=self.headers,
            meta={"post_data": post_data,
                  "post_url": post_url,
                  },
            callback=self.check_login
        )]

    def login_after_captcha(self, response):
        # The response body is the captcha image
        print(response.headers)
        post_data = response.meta.get("post_data", "")
        post_url = response.meta.get("post_url", "")
        with open('captcha.gif', 'wb') as f:
            f.write(response.body)
        try:
            im = Image.open("captcha.gif")
            im.show()
            captcha = input("please input the captcha:")
            post_data["captcha"] = captcha
        except OSError:
            print("Could not open the captcha file")
        return [scrapy.FormRequest(
            url=post_url,
            formdata=post_data,
            headers=self.headers,
            # Pass the form data along again so check_login can retry if needed
            meta={"post_data": post_data,
                  "post_url": post_url,
                  },
            callback=self.check_login,
        )]

    def check_login(self, response):
        response_text = json.loads(response.text)
        if response_text["r"] == 0:
            # Scrapy's cookies middleware keeps the session cookies for later
            # requests, so they do not have to be copied from the headers by hand.
            # start_urls are only requested after a successful login.
            for url in self.start_urls:
                yield scrapy.Request(url, headers=self.headers, dont_filter=True)
        else:
            captcha_url = "https://www.zhihu.com/captcha.gif?&type=login"
            # Scrapy is asynchronous, so to keep the captcha in the same session
            # the captcha request is yielded back to the engine
            yield scrapy.Request(url=captcha_url,
                                 headers=self.headers,
                                 meta={"post_data": response.meta.get("post_data"),
                                       "post_url": response.meta.get("post_url"),
                                       },
                                 callback=self.login_after_captcha)

Once logged in, all of Zhihu is open to you.

Crawling the data

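How do we visit every page we need on a site? It is a tricky problem, usually solved with depth-first search (DFS) or breadth-first search (BFS). I tried to use Scrapy's asynchronous machinery to keep following and downloading every URL I could reach, DFS-style, so that every URL I need is eventually visited once.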
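
To make the traversal idea concrete outside of Scrapy, here is a small depth-first crawl over an in-memory link graph. The graph and the function name are made up for illustration; in the spider itself the engine schedules the requests, and Scrapy's duplicate filter plays the role of the seen set.

```python
# Hypothetical site map: each page maps to the links found on it
LINKS = {
    "/explore": ["/question/1", "/topic/a"],
    "/topic/a": ["/question/2", "/explore"],
    "/question/1": ["/question/2"],
    "/question/2": [],
}


def crawl_dfs(start, links):
    """Visit every page reachable from start, depth first, once each."""
    visited = []
    seen = {start}
    stack = [start]
    while stack:
        page = stack.pop()          # LIFO: the most recently found page goes first
        visited.append(page)
        for url in links.get(page, []):
            if url not in seen:     # skip pages that are already scheduled
                seen.add(url)
                stack.append(url)
    return visited


print(crawl_dfs("/explore", LINKS))
```

Swapping the stack for a FIFO queue (for example collections.deque with popleft) would turn the same loop into a breadth-first crawl.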

    def parse(self, response):
        """
        The URLs yielded in check_login are the entry point for collecting Zhihu URLs.
        Extract every URL that looks like /question/xxxx from the page, download it,
        and hand it to the question parser.
        :param response:
        :return:
        """
        all_urls = response.css("a::attr(href)").extract()
        all_urls = [parse.urljoin(response.url, url) for url in all_urls]
        all_urls = [url for url in all_urls if url.startswith("https")]
        for url in all_urls:
            print(url)
            match_obj = re.match(r"(.*zhihu\.com/question/(\d+))(/|$).*", url)
            # A question URL is downloaded and passed to parse_question
            if match_obj:
                request_url = match_obj.group(1)
                question_id = match_obj.group(2)
                yield scrapy.Request(request_url,
                                     headers=self.headers,
                                     meta={"question_id": question_id},
                                     callback=self.parse_question)
            # Any other URL is followed to look for more links
            else:
                yield scrapy.Request(url, headers=self.headers, callback=self.parse)

The same URL-finding logic also works on question pages. URLs of the form /question/... are handed to a function dedicated to question pages.
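
The link-sorting step can be tried in isolation with only the standard library. The sketch below reuses the regular expression from the spider on a few invented hrefs; the function name split_links and the sample links are assumptions for the example.

```python
import re
from urllib.parse import urljoin

QUESTION_RE = re.compile(r"(.*zhihu\.com/question/(\d+))(/|$)")


def split_links(base_url, hrefs):
    """Separate question links (url, id) from other https links to follow."""
    questions, others = [], []
    for href in hrefs:
        url = urljoin(base_url, href)       # make relative links absolute
        if not url.startswith("https"):
            continue                        # drop javascript: and similar links
        match = QUESTION_RE.match(url)
        if match:
            questions.append((match.group(1), match.group(2)))
        else:
            others.append(url)
    return questions, others


# Hypothetical hrefs as they might appear on a page
hrefs = ["/question/12345/answer/678", "/topic/19550517", "javascript:;"]
questions, others = split_links("https://www.zhihu.com/explore", hrefs)
print(questions)
print(others)
```

The first capture group trims everything after the question id, so an answer link and its question link collapse to the same request URL.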

from scrapy.loader import ItemLoader

from ..items import ZhihuAnswerItem, ZhihuQuestionItem

    def parse_question(self, response):
        """
        Handle a question page and extract the item fields we need.
        :param response:
        :return:
        """
        question_id = response.meta.get("question_id")
        if "QuestionHeader-title" in response.text:
            # New version of the Zhihu page
            item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
            item_loader.add_css("title", ".QuestionHeader-main .QuestionHeader-title::text")
            item_loader.add_css("topics", ".TopicLink .Popover div::text")
            item_loader.add_css("content", ".QuestionHeader-detail")
            item_loader.add_value("url", response.url)
            item_loader.add_value("zhihu_id", int(question_id))
            item_loader.add_css("answer_num", ".List-headerText span::text")
            item_loader.add_css("watch_user_num", '.NumberBoard-value::text')
            item_loader.add_css("click_num", '.NumberBoard-value::text')
            item_loader.add_css("comments_num", '.QuestionHeader-Comment button::text')

            question_item = item_loader.load_item()
            # Request the answers to this question; start_answer_urls is given later
            yield scrapy.Request(self.start_answer_urls.format(question_id, 20, 0), headers=self.headers, callback=self.parse_answer)
            yield question_item
            # Also look for question URLs on the question page.
            # This part is optional; the extraction above is the main point.
            all_urls = response.css("a::attr(href)").extract()
            all_urls = [parse.urljoin(response.url, url) for url in all_urls]
            all_urls = [url for url in all_urls if url.startswith("https")]
            for url in all_urls:
                print(url)
                match_obj = re.match(r"(.*zhihu\.com/question/(\d+))(/|$).*", url)
                # A question URL is downloaded and passed to parse_question
                if match_obj:
                    request_url = match_obj.group(1)
                    question_id = match_obj.group(2)
                    yield scrapy.Request(request_url,
                                         headers=self.headers,
                                         meta={"question_id": question_id},
                                         callback=self.parse_question)
                # Any other URL is followed to look for more links
                else:
                    yield scrapy.Request(url, headers=self.headers, callback=self.parse)

        else:
            # Old version of the Zhihu page
            pass

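Zhihu provides an open API for fetching the public information about answers.

Opening it shows us a JSON document.

It contains a lot of useful information, for example under paging: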

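is_end tells whether this page of answers is the last one for the question
totals is the total number of answers to the question
next is the most important field for crawling Zhihu answers. It serves as our entry point for crawling a question, and it carries three important pieces of data: question/xxxxxx/.... shows that the answers can be found through the question_id; limit is the number of answers per page; offset is the position of this page's answers among all the answers.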
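
To show how these paging fields fit together, the sketch below reads limit and offset back out of a next URL. The JSON is a made-up, heavily trimmed stand-in for the API response; only the three field names come from this post, and the values are invented.

```python
import json
from urllib.parse import urlparse, parse_qs

# Hypothetical, trimmed response; only the paging fields described above
SAMPLE = json.loads('''
{
  "paging": {
    "is_end": false,
    "totals": 45,
    "next": "https://www.zhihu.com/api/v4/questions/12345/answers?limit=20&offset=20"
  },
  "data": []
}
''')


def next_page(payload):
    """Return (limit, offset) of the next page, or None on the last page."""
    paging = payload["paging"]
    if paging["is_end"]:
        return None
    query = parse_qs(urlparse(paging["next"]).query)
    return int(query["limit"][0]), int(query["offset"][0])


print(next_page(SAMPLE))
```

In the spider there is no need to rebuild the URL by hand, since next can be requested as it is; the parsing here only makes the limit and offset visible.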

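The rest of the data contains many of the fields we need. (I opened a JSON at random; if I accidentally captured you in the screenshot, please contact me.)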

import datetime

class ZhihuSpider(scrapy.Spider):
        ....
    # Request URL for the first page of answers
    start_answer_urls = "http://www.zhihu.com/api/v4/questions/{0}/answers?" \
                        "sort_by=default&include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2" \
                        "Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2" \
                        "Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2" \
                        "Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2" \
                        "Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2" \
                        "Cupvoted_followees%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%" \
                        "5B%3F%28type%3Dbest_answerer%29%5D.topics&limit={1}&offset={2}"

    def parse_answer(self, response):
        answer_json = json.loads(response.text)
        is_end = answer_json["paging"]["is_end"]
        total_answers = answer_json["paging"]["totals"]
        next_url = answer_json["paging"]["next"]
        # Extract the fields of each answer
        for answer in answer_json.get("data", []):
            # A fresh item per answer, so yielded items do not share state
            answer_item = ZhihuAnswerItem()
            author = answer["author"]
            # Compare with != rather than "is not": identity checks on strings are unreliable
            has_author = "id" in author and author["id"] != "0"
            answer_item["zhihu_id"] = answer["id"]
            answer_item["url"] = answer["url"]
            answer_item["question_id"] = answer["question"]["id"]
            answer_item["author_id"] = author["id"] if has_author else None
            answer_item["author_name"] = author["name"] if has_author else "匿名用户"
            answer_item["content"] = answer.get("content")
            answer_item["praise_num"] = answer["voteup_count"]
            answer_item["comments_num"] = answer["comment_count"]
            answer_item["update_time"] = answer["updated_time"]
            answer_item["create_time"] = answer["created_time"]
            answer_item["crawl_time"] = datetime.datetime.now()
            yield answer_item
        # Keep following the next page until the API reports the last one
        if not is_end:
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse_answer)

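That completes a simple `Scrapy` spider for scraping Zhihu questions and answers. In theory it can crawl every page; an actual trial will have to wait until I finish the `pipeline` and the data processing and storage, and find a server to run it on.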
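
As a placeholder for the storage step that is still to be written, a pipeline could start out as simple as the sketch below, which appends each item to a JSON Lines file. This is not the pipeline used in this post; the class name and the file name are assumptions, and only the open_spider, close_spider, and process_item hooks are part of Scrapy's pipeline interface.

```python
import json


class JsonLinesPipeline:
    """Write each scraped item as one JSON object per line."""

    def __init__(self, path="zhihu_items.jl"):   # file name is an assumption
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # default=str turns values such as datetime objects into strings
        line = json.dumps(dict(item), ensure_ascii=False, default=str)
        self.file.write(line + "\n")
        return item   # pass the item on to any later pipeline
```

A pipeline like this would be switched on through the ITEM_PIPELINES setting in settings.py, with the dotted path depending on where the class lives in the project.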

Last edited: 2017.12.10 03:46:13
