Getting started with Scrapy

Source: MOOC notes by 天涯明月笙

Preparation

  • OS: Windows 7

  • Install MySQL
    Tips:

    • During installation, choose the "Server only" setup type
    • If an installer screen has no Next button, it can be driven from the keyboard
      keyboard shortcuts
      b-back, n-next, x-execute, f-finish, c-cancel
    • After the installer finishes, go into the installation directory, initialize the data directory with mysqld --initialize, then enter the shell with 'mysql -uroot -p'
    • Start the MySQL service with net start mysql; if the service name is reported as invalid,
      open cmd in the mysql/bin directory, run mysqld --install, and start the MySQL service from the Services panel in Control Panel. It may take a few tries (see the command sketch after this list).
  • Install PyCharm
    Activate PyCharm's professional ("member") mode
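
A rough sketch of the MySQL steps above, run from an administrator cmd prompt inside the MySQL bin directory (the exact paths and service name depend on your installation):

# initialize the data directory (a temporary root password is written to the error log)
mysqld --initialize
# register the Windows service if "net start mysql" reports an invalid service name
mysqld --install
# start the service and open a shell
net start mysql
mysql -uroot -p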

Crawling all articles from 伯乐在线 (blog.jobbole.com)

Installing the required modules

scrapy, pymysql, pillow, pypiwin32

  • pymysql is the module used to insert records into the database
  • Scrapy's built-in ImagesPipeline requires the pillow module
  • On Windows, running scrapy crawl jobbole after creating the spider fails unless pypiwin32 is installed (install sketch below)
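
All of them can be installed with pip, roughly (inside the project's virtualenv):

pip install scrapy pymysql pillow pypiwin32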

Project structure

  • items: the fields for the data the spider extracts
    includes the field names and their input/output processors
  • pipelines: the item pipelines that persist the parsed data
    includes image storage, JSON file storage, and database storage
  • settings: all spider-related configuration
    includes whether to obey robots.txt, the download delay, the image download directory, the log file location, and which pipelines are enabled and their priority
  • spiders: the spiders themselves
    the main crawling logic

Basic commands

# create the Scrapy project
scrapy startproject jobbole_article 
# from the spiders directory, generate the spider
scrapy genspider jobbole blog.jobbole.com
# run the spider
scrapy crawl jobbole

The final file layout; right after the commands above the images folder does not exist yet (a rough sketch of the layout is shown below the figure).

[Figure: 伯乐在线爬虫目录.png — final project directory layout]
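
Since the screenshot is not reproduced here, a rough sketch of the generated project (reconstructed, not copied from the original image; the images folder is created later by the image pipeline):

jobbole_article/
    scrapy.cfg
    jobbole_article/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        images/            # created later (IMAGES_STORE)
        spiders/
            __init__.py
            jobbole.py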

jobbole.py

# -*- coding: utf-8 -*-
import scrapy
from urllib import parse
from jobbole_article.items import ArticleItemLoader, JobboleArticleItem
from scrapy.http import Request


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts']

    @staticmethod
    def add_num(value):
        return value if value else [0]

    def parse_detail(self, response):
        response_url = response.url
        front_image_url = response.meta.get('front_image_url', '')
        item_loader = ArticleItemLoader(item=JobboleArticleItem(), response=response)
        item_loader.add_xpath('title', "//div[@class='entry-header']/h1/text()")
        item_loader.add_value('url', response_url)
        item_loader.add_value('url_object_id', response_url)
        item_loader.add_value('front_image_url', front_image_url)
        item_loader.add_xpath('content', "//div[@class='entry']//text()")
        # span_loader = loader.nested_path('//span[@class='href-style'])
        # praise count
        item_loader.add_xpath('praise_nums', "//span[contains(@class,'vote-post-up')]/h10/text()", self.add_num)
        # comment count
        item_loader.add_xpath('comment_nums', "//span[contains(@class, 'hide-on-480')]/text()", self.add_num)
        # bookmark (favorite) count
        item_loader.add_xpath('fav_nums', "//span[contains(@class, 'bookmark-btn')]/text()", self.add_num)
        item_loader.add_xpath('tags', "//p[@class='entry-meta-hide-on-mobile']/a[not(@href='#article-comment')]/text()")
        return item_loader.load_item()

    def parse(self, response):
        post_nodes = response.xpath("//div[@class='post floated-thumb']")
        for post_node in post_nodes:
            post_url = post_node.xpath(".//a[@title]/@href").extract_first("")
            img_url = post_node.xpath(".//img/@src").extract_first("")
            yield Request(url=parse.urljoin(response.url, post_url), meta={'front_image_url': img_url},
                          callback=self.parse_detail)
        next_url = response.xpath('//a[@class="next page-numbers"]/@href').extract_first('')
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

Modules

  • from urllib import parse
    This module is mainly used to complete relative URLs.
url = parse.urljoin('http://blog.jobbole.com/', '10000')
# url is the joined result 'http://blog.jobbole.com/10000'; if the second argument is already an absolute URL, it is returned unchanged
  • from jobbole_article.items import ArticleItemLoader, JobboleArticleItem
    These are the classes defined in items.py.
  • from scrapy.http import Request
    • Builds a Scrapy request for a URL that needs to be followed.
    • The meta argument is a dict used to pass extra data along with the Request to the response; it can be read with response.meta.get().
    • The callback argument is the parse function called once the response content has been downloaded.

For example, to get the article content from http://blog.jobbole.com/all-posts/, we build a request for the URL the arrow points to in the image below. Once the content is downloaded, parse_detail is called to process it; inside it, the value img_url passed under the Request's front_image_url key can be retrieved.
The corresponding code:

yield Request(url=parse.urljoin(response.url, post_url), meta={'front_image_url': img_url},
                          callback=self.parse_detail)
[Figure: Request.png — the article link on the post-list page that the Request targets]

The JobboleSpider class

  • This class inherits from scrapy.Spider; see the documentation for its other attributes.
@staticmethod
def add_num(value):

This can be skipped for now.
It is a static method of the class, used in the code below as an input processor; its main purpose is to return a default value when the extracted field is empty.

item_loader.add_xpath('comment_nums', "//span[contains(@class, 'hide-on-480')]/text()", self.add_num)
  • The custom method parse_detail
  1. Purpose: parses an article detail page (e.g. http://blog.jobbole.com/114420/), extracts the field values, and returns the populated item.
  2. Some of the variables:
    response_url is the URL of the response, e.g. http://blog.jobbole.com/114420/
    front_image_url is the cover image URL taken from http://blog.jobbole.com/all-posts
    item_loader is an instance with methods for populating the item, most commonly add_xpath and add_value. Note that each populated value, e.g. item['title'], is a list.
    • add_xpath
      Parses the response with XPath; the first argument (e.g. 'title') is the item key (field), the second is the XPath expression, the third is a processor.
    • add_value
      Assigns a value directly.
    • load_item
      Performs the actual population of the item.
  • JobboleSpider's built-in parse method
  1. Purpose: like parse_detail it parses a response, but parse is the default callback the spider uses.
  2. response.xpath
    Takes an XPath expression and returns Selector objects; extract() returns the list of all text values, extract_first() returns the first one.
  • XPath expressions
  1. Some rules
    • Expressions can be chained much like URL joining, but note that the second expression must start with a dot:
post_nodes = response.xpath("//div[@class='post floated-thumb']")
        for post_node in post_nodes:
            #  .//a[@title]/@href
            post_url = post_node.xpath(".//a[@title]/@href").extract_first("")
            img_url = post_node.xpath(".//img/@src").extract_first("")
- XPath for elements without a given attribute value: "//div[not(@class='xx')]"
- XPath for elements whose attribute contains a value: "//div[contains(@class, 'xx')]"
- @href extracts the value of the href attribute; text() extracts the element's text
- // selects descendants at any depth; / selects direct children
  2. Debugging
    You can test the path in the browser, but there you have to write CSS rules.


    [Figure: css规则浏览器.png — testing CSS selectors in the browser]

    Test with the scrapy shell command:

scrapy shell http://blog.jobbole.com/all-posts
# then type an expression to see what it returns
response.xpath("...").extract()
# fetch(url) switches to a different downloaded response
fetch('http://blog.jobbole.com/10000')

Alternatively, set a breakpoint and run the spider under PyCharm to inspect the values.

items.py

import scrapy
import re
import hashlib
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join, Identity, Compose


def get_md5(value):
    if isinstance(value, str):
        value = value.encode(encoding='utf-8')
        # print('value--------------------------', value)
        m = hashlib.md5()
        m.update(value)
        return m.hexdigest()


def get_num(value):
    # print(value)
    if value:
        num = re.match(r".*?(\d+)", value)  # greedy \d+ so the whole number is captured
        try:
            # print("----------------",num.group(1), int(num.group(1)))
            return int(num.group(1))
        except (AttributeError, TypeError):
            return 0
    else:
        return 0

# redundant
def return_num(value):
    # return value[0] if value else 0
    if value:
        return value
    else:
        return "1"


class JobboleArticleItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field(
        input_processor=MapCompose(get_md5)
    )
    front_image_url = scrapy.Field(
        output_processor=Identity()
    )
    front_image_path = scrapy.Field()
    content = scrapy.Field(
        output_processor=Join()
    )
    praise_nums = scrapy.Field(
        input_processor=MapCompose(get_num),
        # output_processor=MapCompose(return_num)
    )
    fav_nums = scrapy.Field(
        input_processor=MapCompose(get_num),
        # output_processor=MapCompose(return_num)
        # input_processor=Compose(get_num, stop_on_none=False)
    )
    comment_nums = scrapy.Field(
        input_processor=MapCompose(get_num),
        # output_processor=MapCompose(return_num)
        # input_processor=Compose(get_num, stop_on_none=False)
    )
    tags = scrapy.Field(
        output_processor=Join()
    )

    def get_insert_sql(self):
        insert_sql = """
                insert into jobbole(title, url, url_object_id, front_image_url, front_image_path,praise_nums, fav_nums, 
                comment_nums, tags, content)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                """
        params = (
            self['title'], self['url'], self['url_object_id'], self['front_image_url'], self['front_image_path'],
            self['praise_nums'], self['fav_nums'], self['comment_nums'], self['tags'], self['content']
        )
        return insert_sql, params


class ArticleItemLoader(ItemLoader):
    default_output_processor = TakeFirst()

Modules

  • import re
    The regular-expression module.
# re.match matches from the beginning of the string
num = re.match(".*(\d)", 'xx')
# if the match above fails, num is None and num.group(1) raises AttributeError
num = num.group(1)
# also, int([]) raises TypeError
  • import hashlib
    Converts a string to an MD5 digest; the string must be UTF-8 encoded first, since values in Scrapy are unicode.
  • from scrapy.loader import ItemLoader
    The custom ArticleItemLoader inherits from Scrapy's ItemLoader.
  • from scrapy.loader.processors import TakeFirst, MapCompose, Join, Identity, Compose
    A set of processor classes provided by Scrapy. TakeFirst returns the first non-empty value of a list. MapCompose takes several functions and passes each value of the list through them, gathering the results into a list that is fed to the next function. Join joins the list into a single string. Identity returns the input unchanged. Compose also takes several functions, but unlike MapCompose it passes the whole list to each function. (A small example follows this list.)
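
To make the processor behavior concrete, a small standalone sketch (my own, not from the course notes) applied to a plain list:

from scrapy.loader.processors import TakeFirst, MapCompose, Join, Identity, Compose

def strip(value):
    return value.strip()

values = ['', ' 10 ', 'python']
print(TakeFirst()(values))          # ' 10 '  -> the first non-empty value
print(MapCompose(strip)(values))    # ['', '10', 'python'] -> strip applied to each value
print(Join(',')(values))            # ', 10 ,python' -> one joined string
print(Identity()(values))           # ['', ' 10 ', 'python'] -> unchanged
print(Compose(len)(values))         # 3 -> the whole list is passed to len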

The get_num function

Since add_num is already passed to add_xpath() in jobbole.py, the if check here is redundant; catching TypeError is also somewhat redundant, but I haven't bothered to change it.

def get_num(value):
    num = re.match(r".*?(\d+)", value)
    try:
        return int(num.group(1))
    except AttributeError:
        return 0

The JobboleArticleItem class

Defines the item's fields and their input/output processors; the two kinds of processors take effect at different times.
One open question: the docs say an input processor runs as soon as each value is extracted, while the output processor runs after the whole list has been collected. If I put Compose in an output processor, doesn't Compose operate on the whole list anyway? That seems a bit contradictory.
Note that an output_processor given in scrapy.Field() overrides default_output_processor.
Also, the functions inside MapCompose() never process empty input: if the extracted list is empty, the functions are simply not called.
In the Scrapy source you can see a for loop applying the functions to each value of the list (a small sketch follows the snippet):

            for v in values:
                next_values += arg_to_iter(func(v))
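
A small standalone sketch (my own addition) of the difference this makes: with an empty list MapCompose never calls its functions, while Compose receives the list as a whole:

from scrapy.loader.processors import MapCompose, Compose

def to_int(value):
    return int(value)

print(MapCompose(to_int)([]))           # [] -> to_int is never called
print(MapCompose(to_int)(['3', '7']))   # [3, 7]
print(Compose(len)([]))                 # 0 -> len is called with the whole (empty) list
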
  • The get_insert_sql method
    Builds the SQL statement and parameters for writing to MySQL; it is used in pipelines.py.

The ArticleItemLoader class

Gives every item field a default output processor (a small sketch of the effect follows).
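
A minimal sketch (not from the notes, assuming Scrapy 1.x where the processors live in scrapy.loader.processors as above) of what the default TakeFirst changes: without it every field value stays a list.

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst

class DemoItem(scrapy.Item):
    title = scrapy.Field()

class DemoLoader(ItemLoader):
    default_output_processor = TakeFirst()

plain = ItemLoader(item=DemoItem())
plain.add_value('title', 'hello')
print(plain.load_item())        # {'title': ['hello']}

loader = DemoLoader(item=DemoItem())
loader.add_value('title', 'hello')
print(loader.load_item())       # {'title': 'hello'}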

pipelines.py

import pymysql
from twisted.enterprise import adbapi
from scrapy.pipelines.images import ImagesPipeline


class JobboleArticlePipeline(object):
    def process_item(self, item, spider):
        return item


class JobboleMysqlPipeline(object):

    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        params = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True
        )
        dbpool = adbapi.ConnectionPool('pymysql', **params)
        return cls(dbpool)

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)

    def do_insert(self, cursor, item):
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)

    def handle_error(self, failure, item, spider):
        print(failure)


class ArticleImagePipeline(ImagesPipeline):

    def item_completed(self, results, item, info):
        # note this check: front_image_url may be missing or empty
        if 'front_image_url' in item:
            for _, value in results:
                # print(value)
                image_file_path = value['path']
            item['front_image_path'] = image_file_path
        return item

Modules

  • import pymysql
    The module used to connect to and write to the database.
import pymysql
# connect to MySQL
db = pymysql.connect('localhost', 'root', '123456', 'jobbole')
# get a cursor with cursor()
cursor = db.cursor()
# the INSERT statement
insert_sql = "insert into jobbole (col) values ('val')"
# execute the insert
try:
  cursor.execute(insert_sql)
  # commit the transaction
  db.commit()
except:
  # roll back on error (rollback is a method of the connection, not the cursor)
  db.rollback()
# close the connection
db.close()
  • from twisted.enterprise import adbapi
    For asynchronous database access; not entirely clear to me yet, memorizing it for now.
  • from scrapy.pipelines.images import ImagesPipeline
    Scrapy's image-download pipeline; the pillow module has to be installed for it.

The JobboleArticlePipeline class

The auto-generated pipeline class.

The JobboleMysqlPipeline class: custom asynchronous writes to MySQL

  • The settings values are read from settings.py
  • Create an asynchronous connection pool to MySQL:
dbpool = adbapi.ConnectionPool('pymysql', **params)
  • Build the instance:
# calls __init__(dbpool) and returns the instance
return cls(dbpool)
  • process_item, the pipeline method that handles each item:
# runs the insert in a thread from the pool; the transaction is committed
# automatically, so no explicit db.commit() is needed
query = self.dbpool.runInteraction(self.do_insert, item)
# attach an error handler (note: the item is not returned here, which
# later pipelines would normally expect)
query.addErrback(self.handle_error, item, spider)
  • do_insert
    The cursor argument is supplied by the pool: runInteraction calls the function with a cursor-like transaction object (see the sketch after this list).
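
A minimal standalone sketch of how runInteraction supplies the cursor (my own; the database name test_db and table t(id, name) are made up): the callable receives a cursor-like Transaction object, runs in a thread from the pool, and is committed automatically when it returns without an exception.

import pymysql
from twisted.enterprise import adbapi
from twisted.internet import reactor

# hypothetical connection parameters and table
dbpool = adbapi.ConnectionPool('pymysql', host='127.0.0.1', user='root',
                               passwd='123456', db='test_db', charset='utf8')

def do_insert(cursor, name):
    # `cursor` is the transaction object handed over by runInteraction;
    # it proxies the usual cursor methods
    cursor.execute("insert into t(id, name) values (%s, %s)", (1, name))

d = dbpool.runInteraction(do_insert, 'scrapy')
d.addErrback(lambda failure: print(failure))
d.addBoth(lambda _: reactor.stop())
reactor.run()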

ArticleImagePipeline

  • item_completed
    Takes the arguments results, item, info.
    It is mainly used here to record front_image_path; results holds one (success, info) pair per downloaded image (a sketch of its shape follows).
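
For reference, a sketch of the shape of results (the values here are made up, not real crawl data); info['path'] is the file path relative to IMAGES_STORE:

results = [
    (True, {
        'url': 'http://example.com/cover.jpg',  # the image URL that was requested
        'path': 'full/0a1b2c3d4e5f.jpg',        # path relative to IMAGES_STORE
        'checksum': 'b9628c4ab9b595f72f280b90c4fd093d',
    }),
]
for ok, value in results:
    if ok:
        print(value['path'])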

settings.py

General

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1

MySQL settings

MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_DBNAME = 'jobbole'
MYSQL_PASSWORD = '123456'

Enabling pipelines and their priority

The lower the number, the higher the priority; the entries correspond to the pipelines written in pipelines.py.

ITEM_PIPELINES = {
# 'jobbole_article.pipelines.JobboleArticlePipeline': 300,
    'jobbole_article.pipelines.ArticleImagePipeline': 1,
    'jobbole_article.pipelines.JobboleMysqlPipeline': 2,
}

Image storage directory

import os
# the item field that holds the image download URLs
IMAGES_URLS_FIELD = 'front_image_url'
# parent directory for image storage, which is also the directory containing settings.py (__file__ is settings.py here)
# abspath gives the absolute path, dirname the parent directory
image_dir = os.path.abspath(os.path.dirname(__file__))
# the folder where the images are stored
IMAGES_STORE = os.path.join(image_dir, 'images')

MySQL commands needed

# list the databases
show databases;
# list the tables
show tables;
# create the database
create database jobbole;
# switch to the database
use jobbole;
# create the table (the columns must match the insert in items.py)
create table jobbole(
title varchar(200) not null,
url varchar(300) not null,
url_object_id varchar(50) primary key not null,
front_image_url varchar(200),
front_image_path varchar(200),
praise_nums int(11) not null,
fav_nums int(11) not null,
comment_nums int(11) not null,
tags varchar(200),
content longtext not null
);
# check the database character set
show variables like 'character_set_database';
# view the first record of the table
select * from jobbole limit 1;
# count the records in the table
select count(title) from jobbole;
# check the size of the table
use information_schema;
select concat(round(sum(DATA_LENGTH/1024/1024),2),'MB') as data from TABLES where table_schema='jobbole' and table_name='jobbole';
# truncate the table
truncate table jobbole;
# drop a column
alter table <tablename> drop column <column_name>;

Problems

  • The first run stopped after crawling only about 1,300 articles; the exact cause is unclear.
  • There are noticeably fewer cover images than records: over 9,000 rows in the database but only about 6,000 images.
  • An empty cover-image URL raises an error:
'fav_nums': 2,
 'front_image_url': [''],
 'praise_nums': 2,
 'tags': '职场 产品经理 程序员 职场',
 'title': '程序员眼里的 PM 有两种:有脑子的和没脑子的。后者占 90%',
 'url': 'http://blog.jobbole.com/92328/',
 'url_object_id': 'f74aa62b6a79fcf8f294173ab52f4459'}
Traceback (most recent call last):
  File "g:\py3env\bole2\venv\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "g:\py3env\bole2\venv\lib\site-packages\scrapy\pipelines\media.py", line 79, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "g:\py3env\bole2\venv\lib\site-packages\scrapy\pipelines\images.py", line 155, in get_media_requests
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "g:\py3env\bole2\venv\lib\site-packages\scrapy\pipelines\images.py", line 155, in <listcomp>
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "g:\py3env\bole2\venv\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "g:\py3env\bole2\venv\lib\site-packages\scrapy\http\request\__init__.py", line 62, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url:
  • Articles containing emoji cause an encoding error.
  • No output log file was configured for the crawl.
  • When the path in add_xpath() extracts an empty list, the MapCompose() input/output processors have no effect.
    The workaround is to pass an extra processor directly in the add_xpath arguments (for the empty-URL error above, see the sketch below).
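
One way to avoid the "Missing scheme in request url" error above (a sketch I added, not from the course) is to filter out empty URLs before ImagesPipeline builds its requests, by overriding get_media_requests:

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class ArticleImagePipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # skip empty strings so Request() is never built without a scheme
        return [Request(url) for url in item.get('front_image_url', []) if url]

    def item_completed(self, results, item, info):
        image_paths = [value['path'] for ok, value in results if ok]
        if image_paths:
            item['front_image_path'] = image_paths[0]
        return item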

Summary

  • If you can follow it, reading the English documentation beats machine-translated Chinese docs.
  • When something is hard to understand, look at the source code.
  • Send the log output to a file.
  • If you can't fully understand each part, review it again after going through the whole thing.

Logging into Zhihu with Selenium and crawling questions and answers

Encoding issues

Converting between str and bytes (binary) in Python

Because response.body in Scrapy is bytes, it has to be converted to a string before being written to a file.

s = 'abc'
# errors can be 'strict' (the default) or 'ignore'
byt = s.encode(encoding='utf8', errors='strict')
# bytes -> str
s = byt.decode(encoding='utf8', errors='ignore')

Encoding to watch when writing bytes to a file

On Windows a new file defaults to the GBK encoding, so Python will try to encode the network data as GBK, which usually fails. Specify the encoding when opening the file.

with open('c:test.txt', 'w', encoding='utf8') as f:
  f.write(response.body.decode('utf8', errors='ignore'))

Base64-encoded images

from PIL import Image
from io import BytesIO
import base64
img_src = "data:image/jpg;base64,R0lGODdh.."
# strip the data-URI prefix, then decode the base64 payload
img_data = base64.b64decode(img_src.split(',')[1])
img = Image.open(BytesIO(img_data))
img.show()

Small spider tricks

Manually constructing a response

from scrapy.http import HtmlResponse
body = open("example.html").read()
response = HtmlResponse(url='http://example.com', body=body.encode('utf-8'))

Joining and following URLs in a spider

def parse(self, response):
    yield {}
    for url in response.xpath("//a/@href").extract():
        yield scrapy.Request(url=response.urljoin(url), callback=self.parse)
    # simpler: response.follow accepts relative URLs (and selectors),
    # so extract() and response.urljoin() are not needed
    # (call url.extract() yourself if you want to post-process the URL first)
    for url in response.xpath("//a/@href"):
        yield response.follow(url, callback=self.parse)

Spider logging

See the Scrapy documentation on logging.
Spider log levels are the same as Python's: debug, info, warning, error, critical.
The Spider class has a built-in logger attribute:

class ZhihuSpider(scrapy.Spider):
  def func(self):
      self.logger.warning('this is a log')

Outside a Spider class you can use:

import logging
logging.warning('this is a log')
# or create a separate named logger
logger = logging.getLogger('mycustomlogger')
logger.warning('this is a log')

The console log level and the log file can also be set in settings.py (or from the command line):

LOG_FILE = 'zhihu_spider.log'
LOG_LEVEL = logging.WARNING
# command-line equivalents
--logfile FILE
--loglevel LEVEL

Regex: matching a string that does not contain a substring

Note that a lookahead (?=) does not consume any characters.

import re
s = 'sda'
re.match('s(?=d)$', s)  # fails: the lookahead does not consume 'd', so $ is not at the end of the string
# does not match an 's' followed by 'da'
re.match('s(?!da)', s)

Using Selenium

Selenium Python API documentation

Downloading the browser driver

The driver matching Chrome version 6.0 was used; the newest version gives a "missing arguments granttype" error.

Selenium methods

from selenium import webdriver
driver = webdriver.Chrome(executable_path="path to chromedriver")
# driver.page_source is the page source
# Selenium waits
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec
# 10 is the timeout in seconds; until() takes a callable that receives the driver and returns a truthy/falsy value
element = WebDriverWait(driver, 10).until(lambda x:x.find_element_by_xpath(
            "//div[@class='SignContainer-switch']/span"))
# same as above, but using one of Selenium's built-in expected conditions
WebDriverWait(driver, 10).until(ec.text_to_be_present_in_element(
            (By.XPATH, "//div[@class='SignContainer-switch']/span"), '注册'))

The full spider code

settings.py

# -*- coding: utf-8 -*-
import logging
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zhihu'

SPIDER_MODULES = ['zhihu.spiders']
NEWSPIDER_MODULE = 'zhihu.spiders'
SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
MYSQL_HOST = '127.0.0.1'
MYSQL_DBNAME = 'zhihuSpider'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'

LOG_LEVEL = logging.WARNING
LOG_FILE = r'G:\py3env\bole2\zhihu\zhihu\zhihu_spider.log'  # raw string so the backslashes are kept literally

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

COOKIES_ENABLED = True

# Override the default request headers:
# required
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0"

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
  'zhihu.pipelines.ZhihuPipeline': 300,
}

zhihu_login.py

# -*- coding: utf-8 -*-
import scrapy
from zhihu.items import ZhihuQuestionItem, ZhihuAnswerItem, ZhihuItem
import re
import json
import datetime
from selenium import webdriver
# to parse raw text with Scrapy selectors
#from scrapy.selector import Selector
# usage: Selector(text=driver.page_source).css(...).extract()
# for opening base64-encoded images
#import base64
#from io import BytesIO, StringIO
import logging
# Selenium wait-related modules
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec


class ZhihuLoginSpider(scrapy.Spider):

    name = 'zhihu_login'
    allowed_domains = ['www.zhihu.com']
    # initial URL for start_requests
    start_urls = ['https://www.zhihu.com/signup?next=%2F']
    # API that returns a question's answers
    start_answer_url = ["https://www.zhihu.com/api/v4/questions/{0}/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics&offset={2}&limit={1}&sort_by=default"]

    def start_requests(self):
        driver = webdriver.Chrome(executable_path='C:/Users/Administrator/Desktop/chromedriver.exe')
        # open the signup page
        driver.get(self.start_urls[0])
        # wait up to 10 seconds for the login switch element to appear
        element = WebDriverWait(driver, 10).until(lambda x:x.find_element_by_xpath(
            "//div[@class='SignContainer-switch']/span"))
        # click "log in"
        element.click()
        # wait until the switch text changes to '注册' (sign up) after the click
        WebDriverWait(driver, 10).until(ec.text_to_be_present_in_element(
            (By.XPATH, "//div[@class='SignContainer-switch']/span"), '注册'))
        # type the account and password, then submit
        driver.find_element_by_css_selector("div.SignFlow-account input").send_keys("your account")
        driver.find_element_by_css_selector("div.SignFlow-password input").send_keys("your password")
        driver.find_element_by_css_selector("button.SignFlow-submitButton").click()
        # wait for an element of the logged-in page to load
        WebDriverWait(driver, 10).until(lambda x:x.find_element_by_xpath(
            "//div[@class='GlobalWrite-navTitle']"))
        # grab the cookies from the browser session
        Cookies = driver.get_cookies()
        cookie_dict = {}
        for cookie in Cookies:
            cookie_dict[cookie['name']] = cookie['value']
        # close the driver
        driver.close()
        return [scrapy.Request('https://www.zhihu.com/', cookies=cookie_dict, callback=self.parse)]

    def parse(self, response):
        # collect all the links on the page
        all_urls = response.css("a::attr(href)").extract()
        for url in all_urls:
            # exclude URLs like https://www.zhihu.com/question/13413413/log
            match_obj = re.match('.*zhihu.com/question/(\d+)(/|$)(?!log)', url)
            if match_obj:
                yield scrapy.Request(response.urljoin(url), callback=self.parse_question)
            else:
                yield scrapy.Request(response.urljoin(url), callback=self.parse)

    def parse_question(self, response):
        if "QuestionHeader-title" in response.text:
            match_obj = re.match(".*zhihu.com/question/(\d+)(/|$)", response.url)
            self.logger.warning('Parse function called on {}'.format(response.url))
            if match_obj:
                self.logger.warning('zhihu id is {}'.format(match_obj.group(1)))
                question_id = int(match_obj.group(1))
                item_loader = ZhihuItem(item=ZhihuQuestionItem(), response=response)
                # no space before ::text selects only the direct children's text
                item_loader.add_css("title", "h1.QuestionHeader-title::text")
                item_loader.add_css("content", ".QuestionHeader-detail ::text")
                item_loader.add_value("url", response.url)
                item_loader.add_value("zhihu_id", question_id)
                # the CSS rule for answer_num differs depending on whether "查看全部答案" (view all answers) was clicked,
                # so both CSS rules are added here
                item_loader.add_css("answer_num", "h4.List-headerText span ::text")
                item_loader.add_css("answer_num", "a.QuestionMainAction::text")
                item_loader.add_css("comments_num", "div.QuestionHeader-Comment button::text")
                item_loader.add_css("watch_user_num", "strong.NumberBoard-itemValue::text")
                item_loader.add_css("topics", ".QuestionHeader-topics ::text")
                item_loader.add_value("crawl_time", datetime.datetime.now())
                question_item = item_loader.load_item()
        """没用
        else:
            match_obj = re.match(".*zhihu.com/question/(\d+)(/|$)", response.url)
            if match_obj:
                question_id = int(match_obj.group(1))
                item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
                item_loader.add_css("title",
                                "//*[id='zh-question-title']/h2/a/text()|//*[@id='zh-question-title']/h2/span/text()")
                item_loader.add_css("content", ".QuestionHeader-detail")
                item_loader.add_value("url", response.url)
                item_loader.add_value("zhihu_id", question_id)
                item_loader.add_css("answer_num", "#zh-question-answer-num::text")
                item_loader.add_css("comment_num", "#zh-question-meta-wrap a[name='addcomment']::text")
                item_loader.add_css("watch_user_num", "//*[@id='zh-question-side-header-wrap']/text()|"
                                                  "//*[@class='zh-question-followers-sidebar]/div/a/strong/text()")
                item_loader.add_css("topics", ".zm-tag-editor-labels a::text")
                question_item = item_loader.load_item()
        """
        # format(*args, **kwargs)
        # print("{1}{程度}{0}".format("开心", "今天", 程度="很")
        # 今天很开心
        yield scrapy.Request(self.start_answer_url[0].format(question_id, 20, 0),
                             callback=self.parse_answer)
        yield question_item

    def parse_answer(self, response):
        # the page returns a JSON string; convert it to a dict
        ans_json = json.loads(response.text)
        is_end = ans_json["paging"]['is_end']
        next_url = ans_json["paging"]["next"]
        for answer in ans_json["data"]:
            # assigning to the item directly is simpler, but then processors cannot be used
            answer_item = ZhihuAnswerItem()
            answer_item["zhihu_id"] = answer["id"]
            answer_item["url"] = answer["url"]
            answer_item["question_id"] = answer["question"]["id"]
            answer_item["author_id"] = answer["author"]["id"] if "id" in answer["author"] else None
            answer_item["content"] = answer["content"] if "content" in answer else None
            answer_item["parise_num"] = answer["voteup_count"]
            answer_item["comments_num"] = answer["comment_count"]
            answer_item["create_time"] = answer["created_time"]
            answer_item["update_time"] = answer["updated_time"]
            answer_item["crawl_time"] = datetime.datetime.now()
            yield answer_item
        if not is_end:
            yield scrapy.Request(next_url,  callback=self.parse_answer)

The screenshots below show how to find this API in the browser's developer tools.


[Figures: three developer-tools screenshots of the answers API request]

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from twisted.enterprise import adbapi


class ZhihuPipeline(object):

    def __init__(self, dbpool):
        self.dbpool = dbpool
    
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.do_insert_sql, item)
        query.addErrback(self.handle_error, item, spider)

    def do_insert_sql(self, cursor, item):
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)

    def handle_error(self, failure, item, spider):
        print(failure)

    @classmethod
    def from_settings(cls, settings):
        params = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("pymysql", **params)
        return cls(dbpool)

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import logging
import datetime
import re
import scrapy
from scrapy.loader.processors import TakeFirst, Join, Compose, MapCompose
from scrapy.loader import ItemLoader


# extract the number from the follower/answer/comment count text
def extract_num(value):
    # log the raw value for debugging
    logging.warning('this is function extract_num value:{}'.format(value))
    for val in value:
        if val is not None:
            # strip the commas used as thousands separators
            val = ''.join(val.split(','))
            match_obj = re.match(".*?(\d+)", val)
            if match_obj:
                logging.warning('this is one of value:{}'.format(match_obj.group(1)))
                return int(match_obj.group(1))
            break


# subclass ItemLoader to set the default output processor
class ZhihuItem(ItemLoader):
    # take the first element of the list
    default_output_processor = TakeFirst()

class ZhihuQuestionItem(scrapy.Item):

    topics = scrapy.Field(
            # join the topics into one string
            output_processor=Join(',')
            )
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    answer_num = scrapy.Field(
            # extract the number
            output_processor=Compose(extract_num)
            )
    comments_num = scrapy.Field(
            output_processor=Compose(extract_num)
            )
    # number of followers
    watch_user_num = scrapy.Field(
            output_processor=Compose(extract_num)
            )
    zhihu_id = scrapy.Field()
    crawl_time = scrapy.Field()

    def get_insert_sql(self):
        # on duplicate key update col_name=value(col_name)
        insert_sql = """
            insert into zhihu_question(zhihu_id, topics, url, title, content, answer_num, comments_num,
            watch_user_num,  crawl_time
            )
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
            on duplicate key update content=values(content), answer_num=values(answer_num), comments_num=values(
            comments_num), watch_user_num=values(watch_user_num)
            """
        # [Failure instance: Traceback: <class 'AttributeError'>: Use item['crawl_time'] = '2018-10-29 19:16:24' to set field value
        # self.crawl_time = datetime.datetime.now()
        # use get() so that missing keys fall back to a default value
        # the value returned by datetime.datetime.now() can be inserted into the database directly
        params = (self.get('zhihu_id'), self.get('topics','null'), self.get('url'), self.get('title'), self.get('content','null'), self.get('answer_num',0), self.get('comments_num',0),
                  self.get('watch_user_num',0),  self.get('crawl_time'))
        return insert_sql, params


class ZhihuAnswerItem(scrapy.Item):

    zhihu_id = scrapy.Field()
    url = scrapy.Field()
    question_id = scrapy.Field()
    author_id = scrapy.Field()
    content = scrapy.Field()
    # upvotes
    parise_num = scrapy.Field()
    comments_num = scrapy.Field()
    # creation time
    create_time = scrapy.Field()
    update_time = scrapy.Field()
    crawl_time = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """
            insert into zhihu_answer(zhihu_id, url, question_id, author_id, content, parise_num, comments_num,
            create_time, update_time, crawl_time
            )
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            on duplicate key update content=values(content), comments_num=values(comments_num), parise_num=values(
            parise_num), update_time=values(update_time)
            """
        # fromtimestamp converts the unix timestamp to a datetime
        params = (
            self.get("zhihu_id"), self.get("url"), self.get("question_id"), self.get("author_id"), self.get("content"), self.get("parise_num", 0),
            self.get("commennts_num", 0), datetime.datetime.fromtimestamp(self.get("create_time")), datetime.datetime.fromtimestamp(self.get("update_time")), self.get("crawl_time"),
        )
        return insert_sql, params

Summary

  1. Some problems come up again and again.
  2. Write the program step by step and remember to add comments.
  3. Note down useful links as you find them.
  4. The Zhihu login approaches that don't use Selenium have all stopped working; if you know of one that still works without Selenium, please let me know.