Python Web Scraping: Downloading Files and Images with the Scrapy Framework

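Background: Scrapy provides reusable item pipelines for downloading the files attached to an item (for example, when you scrape a product and also want to save its images). These pipelines share some methods and structure (together they are called the media pipeline). In most cases you will use either the Files Pipeline or the Images Pipeline.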

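1. Why use Scrapy's built-in download pipelines:

1. They avoid re-downloading files that were downloaded recently;
2. The storage location for the files is easy to configure;
3. Downloaded images can be converted to a common format (JPG) and mode (RGB);
4. Thumbnails are easy to generate;
5. Image width and height can be checked to make sure they meet a minimum size;
6. Downloads run asynchronously, so they do not block the rest of the crawl.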
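
Several of these features are controlled from settings.py. The fragment below is a minimal sketch of the relevant options. The setting names are Scrapy's, while the specific values (30 days, the thumbnail sizes, the 110-pixel minimum, the `images` directory) are illustrative assumptions rather than recommendations.

```python
# settings.py (sketch): all values are illustrative assumptions

# Skip files that were downloaded within the last N days (feature 1)
FILES_EXPIRES = 30
IMAGES_EXPIRES = 30

# Where downloaded images are stored (feature 2); hypothetical directory
IMAGES_STORE = "images"

# Thumbnails to generate for every image: name -> (width, height) (feature 4)
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}

# Drop images smaller than this size, in pixels (feature 5)
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110
```

The thumbnail and minimum-size options apply to the Images Pipeline only.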

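2. Downloading files with the `Files Pipeline`

To download files with the Files Pipeline, follow these steps:
1. Define an item with two fields, file_urls and files. file_urls holds the URLs of the files to download and must be a list;
2. When the downloads finish, information about each file (the storage path, the source URL, the checksum, and so on) is stored in the item's files field;
3. Set FILES_STORE in settings.py; it specifies the directory where the downloaded files are saved;
4. Enable the pipeline by adding 'scrapy.pipelines.files.FilesPipeline': 1 to ITEM_PIPELINES.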
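
Taken together, the steps above amount to a couple of settings and an item with the two expected fields. The sketch below uses a plain dict as the item (Scrapy also accepts dicts as items). The URL, the `downloads` directory name, and the `make_file_item` helper are made-up placeholders.

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "downloads"  # hypothetical directory name


def make_file_item(urls):
    """Build an item the Files Pipeline can process.

    file_urls must be a list; files is filled in by the pipeline
    once the downloads have finished.
    """
    return {"file_urls": list(urls), "files": []}


item = make_file_item(["https://example.com/manual.pdf"])  # placeholder URL
```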

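3. Downloading images with the `Images Pipeline`

To download images with the Images Pipeline, follow these steps:
1. Define an item with two fields, image_urls and images. image_urls holds the URLs of the images to download and must be a list;
2. When the downloads finish, information about each image (the storage path, the source URL, the checksum, and so on) is stored in the item's images field;
3. Set IMAGES_STORE in settings.py; it specifies the directory where the downloaded images are saved;
4. Enable the pipeline by adding 'scrapy.pipelines.images.ImagesPipeline': 1 to ITEM_PIPELINES.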
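
After the download, each entry in the images field is a dict describing one file. The sketch below shows how such results might be read. The sample entry is made up for illustration, its keys (url, path, checksum) follow the description in step 2, and `downloaded_paths` is a hypothetical helper name.

```python
def downloaded_paths(item):
    """Return the storage paths (relative to IMAGES_STORE) of an item's images."""
    return [entry["path"] for entry in item.get("images", [])]


# Hypothetical item as it might look after the Images Pipeline has run
item = {
    "image_urls": ["https://example.com/a.jpg"],
    "images": [
        {
            "url": "https://example.com/a.jpg",
            "path": "full/0a1b2c.jpg",   # made-up file name
            "checksum": "d41d8cd9",      # made-up checksum
        }
    ],
}

print(downloaded_paths(item))  # ['full/0a1b2c.jpg']
```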

4. Conventional implementation

A) settings.py configuration:

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 1

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36',
}

ITEM_PIPELINES = {
   'autohome.pipelines.AutohomePipeline': 300,
}

B) start.py:

from scrapy import cmdline
cmdline.execute("scrapy crawl bmw5".split())

C) bmw5.py:

# -*- coding: utf-8 -*-
import scrapy
from autohome.items import AutohomeItem


class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html#pvareaid=3454438']

    def parse(self, response):
        uiboxes = response.xpath("//div[@class='uibox']")[1:]
        for uibox in uiboxes:
            boxTitle = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
            urls = uibox.xpath(".//ul/li/a/img/@src").getall()
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = AutohomeItem(boxTitle=boxTitle, urls=urls)
            yield item

D) items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class AutohomeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    boxTitle = scrapy.Field()
    urls = scrapy.Field()

E) pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os
from urllib import request

class AutohomePipeline:
    def __init__(self):
        # Save images in an "images" directory one level above this package
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "images")
        os.makedirs(self.path, exist_ok=True)

    def process_item(self, item, spider):
        boxTitle = item['boxTitle']
        urls = item['urls']
        # One sub-directory per category title
        boxTitlePath = os.path.join(self.path, boxTitle)
        os.makedirs(boxTitlePath, exist_ok=True)
        for url in urls:
            # Use the part of the URL after the last underscore as the file name
            imageName = url.split("_")[-1]
            # urlretrieve is a blocking call, so images are fetched one at a time
            request.urlretrieve(url, os.path.join(boxTitlePath, imageName))
        return item
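
The file name in this pipeline comes from splitting the URL on underscores and keeping the last piece. Here is a small standalone sketch of that step, using a made-up URL in the shape the code appears to expect; `image_name_from_url` is a hypothetical helper name.

```python
def image_name_from_url(url):
    """Return the part of the URL after the last underscore."""
    return url.split("_")[-1]


# Hypothetical URL; real image URLs on the site may look different
url = "https://img.example.com/240x180_0_q95_c42_autohomecar__abc123.jpg"
print(image_name_from_url(url))  # abc123.jpg
```

If a URL contains no underscore, the whole URL is returned, which is not a usable file name, so a more defensive version would fall back to the last path segment.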

5. Scrapy implementation

A) settings.py configuration:

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 1

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36',
}

ITEM_PIPELINES = {
    # 'autohome.pipelines.AutohomePipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

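import os  # required by IMAGES_STORE below; normally placed at the top of settings.py

# Directory where downloaded images are stored; used by the Images Pipeline
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')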
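
The nested os.path.dirname calls walk up two levels from settings.py. The sketch below traces that with a hypothetical project location, using posixpath so the result is the same on every platform.

```python
import posixpath

# Hypothetical location of settings.py inside the project
settings_file = "/home/user/autohome/autohome/settings.py"

package_dir = posixpath.dirname(settings_file)   # /home/user/autohome/autohome
project_dir = posixpath.dirname(package_dir)     # /home/user/autohome
images_store = posixpath.join(project_dir, "images")

print(images_store)  # /home/user/autohome/images
```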

B) start.py:

from scrapy import cmdline
cmdline.execute("scrapy crawl bmw5".split())

C) bmw5.py:

# -*- coding: utf-8 -*-
import scrapy

from autohome.items import AutohomeItem


class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html#pvareaid=3454438']

    def parse(self, response):
        uiboxes = response.xpath("//div[@class='uibox']")[1:]
        for uibox in uiboxes:
            boxTitle = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
            urls = uibox.xpath(".//ul/li/a/img/@src").getall()
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = AutohomeItem(boxTitle=boxTitle, image_urls=urls)
            yield item

D) items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class AutohomeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    boxTitle = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

E) pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# import os
# from urllib import request


# class AutohomePipeline:
#     def __init__(self):
#         self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "images")
#         if not os.path.exists(self.path):
#             os.mkdir(self.path)
#
#     def process_item(self, item, spider):
#         boxTitle = item['boxTitle']
#         urls = item['image_urls']
#         boxTitlePath = os.path.join(self.path, boxTitle)
#         if not os.path.exists(boxTitlePath):
#             os.mkdir(boxTitlePath)
#         for url in urls:
#             imageName = url.split("_")[-1]
#             request.urlretrieve(url, os.path.join(boxTitlePath, imageName))
#         return item

6. Scrapy implementation (improved)

A) settings.py configuration:

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 1

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36',
}

ITEM_PIPELINES = {
    # 'autohome.pipelines.AutohomePipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'autohome.pipelines.BMWImagesPipeline': 1,
}

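import os  # required by IMAGES_STORE below; normally placed at the top of settings.py

# Directory where downloaded images are stored; used by the Images Pipeline
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')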

B) start.py:

from scrapy import cmdline
cmdline.execute("scrapy crawl bmw5".split())

C) bmw5.py:

# -*- coding: utf-8 -*-
import scrapy
from autohome.items import AutohomeItem


class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html#pvareaid=3454438']

    def parse(self, response):
        uiboxes = response.xpath("//div[@class='uibox']")[1:]
        for uibox in uiboxes:
            boxTitle = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
            urls = uibox.xpath(".//ul/li/a/img/@src").getall()
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = AutohomeItem(boxTitle=boxTitle, image_urls=urls)
            yield item

D) items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class AutohomeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    boxTitle = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

E) pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os
# from urllib import request
from scrapy.pipelines.images import ImagesPipeline
from autohome import settings


# class AutohomePipeline:
#     def __init__(self):
#         self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "images")
#         if not os.path.exists(self.path):
#             os.mkdir(self.path)
#
#     def process_item(self, item, spider):
#         boxTitle = item['boxTitle']
#         urls = item['image_urls']
#         boxTitlePath = os.path.join(self.path, boxTitle)
#         if not os.path.exists(boxTitlePath):
#             os.mkdir(boxTitlePath)
#         for url in urls:
#             imageName = url.split("_")[-1]
#             request.urlretrieve(url, os.path.join(boxTitlePath, imageName))
#         return item


class BMWImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Called before the downloads start; it builds the download requests.
        # Attach the item to each request so that file_path() can read boxTitle.
        request_objs = super().get_media_requests(item, info)
        for request_obj in request_objs:
            request_obj.item = item
        return request_objs

    def file_path(self, request, response=None, info=None, *, item=None):
        # Called when an image is about to be stored; returns the path to store it at.
        # The item keyword is accepted because newer Scrapy versions pass it in.
        path = super().file_path(request, response, info)
        boxTitle = request.item.get('boxTitle')
        imagesStore = settings.IMAGES_STORE
        boxTitlePath = os.path.join(imagesStore, boxTitle)
        os.makedirs(boxTitlePath, exist_ok=True)
        # The default path looks like "full/<hash>.jpg"; drop the "full/" prefix
        imageName = path.replace("full/", "")
        imagePath = os.path.join(boxTitlePath, imageName)
        return imagePath
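
The path handling in file_path can be tried outside Scrapy. The sketch below assumes the default image path has the form full/<SHA-1 of the URL>.jpg, which is how ImagesPipeline names files by default as far as I know. `default_image_path` and `grouped_image_path` are hypothetical helper names, and the URL and title are placeholders. Unlike the pipeline above, it returns a path relative to IMAGES_STORE.

```python
import hashlib
import posixpath


def default_image_path(url):
    """Mimic the assumed default naming: full/<sha1 of url>.jpg."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{digest}.jpg"


def grouped_image_path(box_title, default_path):
    """Swap the 'full/' directory for a directory named after the category."""
    image_name = posixpath.basename(default_path)
    return posixpath.join(box_title, image_name)


url = "https://example.com/a.jpg"  # placeholder URL
path = grouped_image_path("Exterior", default_image_path(url))
print(path)  # Exterior/<40 hex characters>.jpg
```

Returning a relative path should be enough, since Scrapy's file store joins it onto IMAGES_STORE and creates missing directories when it writes the file. Newer Scrapy releases also pass the item to file_path as a keyword argument, which may make attaching it to the request unnecessary.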

7. Scrapy implementation (downloading high-resolution images)

A) settings.py configuration:

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 1

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36',
}

ITEM_PIPELINES = {
    # 'autohome.pipelines.AutohomePipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'autohome.pipelines.BMWImagesPipeline': 1,
}

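import os  # required by IMAGES_STORE below; normally placed at the top of settings.py

# Directory where downloaded images are stored; used by the Images Pipeline
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')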

B) start.py:

from scrapy import cmdline
cmdline.execute("scrapy crawl bmw5".split())

C) bmw5.py:

# -*- coding: utf-8 -*-

# import scrapy
from autohome.items import AutohomeItem
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor


# class Bmw5Spider(scrapy.Spider):
#     name = 'bmw5'
#     allowed_domains = ['car.autohome.com.cn']
#     start_urls = ['https://car.autohome.com.cn/pic/series/65.html']
#
#     def parse(self, response):
#         uiboxes = response.xpath("//div[@class='uibox']")[1:]
#         for uibox in uiboxes:
#             boxTitle = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
#             urls = uibox.xpath(".//ul/li/a/img/@src").getall()
#             urls = list(map(lambda url: response.urljoin(url), urls))
#             item = AutohomeItem(boxTitle=boxTitle, image_urls=urls)
#             yield item


class Bmw5Spider(CrawlSpider):
    name = 'bmw5'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']
    # CrawlSpider uses parse() for its own logic, so the rule callback needs another name
    rules = (
        Rule(
            LinkExtractor(allow=r'https://car.autohome.com.cn/pic/series/65.+'),
            callback="parse_page",
            follow=True,
        ),
    )

    def parse_page(self, response):
        uiboxes = response.xpath("//div[@class='uibox']")[1:]
        for uibox in uiboxes:
            boxTitle = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
            imageSrcs = uibox.xpath(".//ul/li/a/img/@src").getall()
            imageSrcs = list(map(lambda imageSrc: imageSrc.replace("c42_", ""), imageSrcs))
            imageSrcs = list(map(lambda imageSrc: imageSrc.replace("240x180_0", "1024x0_1"), imageSrcs))
            imageSrcs = list(map(lambda imageSrc: response.urljoin(imageSrc), imageSrcs))
            item = AutohomeItem(boxTitle=boxTitle, image_urls=imageSrcs)
            yield item
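
The three map calls rewrite each thumbnail URL into what the code takes to be the full-size image URL. Here is a standalone sketch of that rewrite. The thumbnail src is a made-up value shaped like the patterns being replaced, and `to_hd_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def to_hd_url(page_url, src):
    """Apply the same replacements as the spider, then make the URL absolute."""
    src = src.replace("c42_", "")
    src = src.replace("240x180_0", "1024x0_1")
    return urljoin(page_url, src)


page_url = "https://car.autohome.com.cn/pic/series/65.html"
# Hypothetical protocol-relative thumbnail src
thumb = "//img.example.com/cardfs/240x180_0_q95_c42_autohomecar__abc123.jpg"

print(to_hd_url(page_url, thumb))
# https://img.example.com/cardfs/1024x0_1_q95_autohomecar__abc123.jpg
```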

D) items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class AutohomeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    boxTitle = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

E) pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os
# from urllib import request
from scrapy.pipelines.images import ImagesPipeline
from autohome import settings


# class AutohomePipeline:
#     def __init__(self):
#         self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "images")
#         if not os.path.exists(self.path):
#             os.mkdir(self.path)
#
#     def process_item(self, item, spider):
#         boxTitle = item['boxTitle']
#         urls = item['image_urls']
#         boxTitlePath = os.path.join(self.path, boxTitle)
#         if not os.path.exists(boxTitlePath):
#             os.mkdir(boxTitlePath)
#         for url in urls:
#             imageName = url.split("_")[-1]
#             request.urlretrieve(url, os.path.join(boxTitlePath, imageName))
#         return item


class BMWImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Called before the downloads start; it builds the download requests.
        # Attach the item to each request so that file_path() can read boxTitle.
        request_objs = super().get_media_requests(item, info)
        for request_obj in request_objs:
            request_obj.item = item
        return request_objs

    def file_path(self, request, response=None, info=None, *, item=None):
        # Called when an image is about to be stored; returns the path to store it at.
        # The item keyword is accepted because newer Scrapy versions pass it in.
        path = super().file_path(request, response, info)
        boxTitle = request.item.get('boxTitle')
        imagesStore = settings.IMAGES_STORE
        boxTitlePath = os.path.join(imagesStore, boxTitle)
        os.makedirs(boxTitlePath, exist_ok=True)
        # The default path looks like "full/<hash>.jpg"; drop the "full/" prefix
        imageName = path.replace("full/", "")
        imagePath = os.path.join(boxTitlePath, imageName)
        return imagePath
