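Background: Scrapy provides reusable item pipelines for downloading files attached to an item (for example, saving each product's images while scraping the product). These pipelines share some methods and structure and are collectively called media pipelines. In most cases you will use either the Files Pipeline or the Images Pipeline.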
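1. Why use Scrapy's built-in file download support:
- 1. It avoids re-downloading files that were downloaded recently;
- 2. The storage path is easy to configure;
- 3. Downloaded images can be converted to a common format such as PNG or JPG;
- 4. Thumbnails are easy to generate;
- 5. Image width and height can be checked to make sure they meet a minimum size;
- 6. Downloads are asynchronous, so throughput is high.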
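Several of these benefits are controlled through settings. The fragment below is a minimal sketch, assuming the setting names from Scrapy's media pipeline documentation; the values are illustrative and do not come from the project described in this post.

```python
# settings.py fragment (illustrative values, not from the original project)

# Benefit 1: skip files/images that were downloaded within the last N days.
FILES_EXPIRES = 90
IMAGES_EXPIRES = 30

# Benefit 4: generate thumbnails next to the full-size image.
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}

# Benefit 5: drop images smaller than these dimensions (in pixels).
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110
```

The image-specific settings only take effect when the Images Pipeline is enabled.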
2. Downloading files with the Files Pipeline
To download files with the Files Pipeline, follow these steps:
1. Define an item with two fields, file_urls and files. file_urls holds the URLs of the files to download and must be a list;
2. Once a download completes, information about it (the storage path, the source URL, the file checksum and so on) is stored in the item's files field;
3. In settings.py, set FILES_STORE to the directory where downloaded files should be saved;
4. Enable the pipeline by adding 'scrapy.pipelines.files.FilesPipeline': 1 to ITEM_PIPELINES.
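The steps above can be sketched as follows. This is only an illustration: the directory name, the URL, and the helper make_file_item are hypothetical, and a plain dict stands in for the Item class, which Scrapy also accepts as an item.

```python
# settings.py fragment
FILES_STORE = "downloads"  # hypothetical directory for downloaded files
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}


def make_file_item(title, urls):
    """Build an item for the Files Pipeline (hypothetical helper).

    file_urls must be a list, so a single URL string is wrapped in one.
    The pipeline fills in the files field itself after downloading.
    """
    if isinstance(urls, str):
        urls = [urls]
    return {"title": title, "file_urls": list(urls)}


# What a spider callback would yield (hypothetical URL).
item = make_file_item("manual", "https://example.com/manual.pdf")
print(item)
```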
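3. Downloading images with the Images Pipeline
To download images with the Images Pipeline, follow these steps:
1. Define an item with two fields, image_urls and images. image_urls holds the URLs of the images to download and must be a list;
2. Once a download completes, information about it (the storage path, the source URL, the image checksum and so on) is stored in the item's images field;
3. In settings.py, set IMAGES_STORE to the directory where downloaded images should be saved;
4. Enable the pipeline by adding 'scrapy.pipelines.images.ImagesPipeline': 1 to ITEM_PIPELINES.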
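To make step 2 more concrete, the sketch below imitates what one entry of the images field could look like. It assumes the documented defaults (a storage path derived from the SHA-1 hash of the URL, and an MD5 checksum of the stored bytes). fake_result is a hypothetical helper, not part of Scrapy, and the URL and bytes are made up.

```python
import hashlib


def fake_result(url: str, body: bytes) -> dict:
    """Imitate one entry of item['images'] (hypothetical helper)."""
    return {
        "url": url,
        # Default layout: full/<sha1 of the URL>.jpg, relative to IMAGES_STORE.
        "path": f"full/{hashlib.sha1(url.encode('utf-8')).hexdigest()}.jpg",
        # Checksum of the bytes that were stored.
        "checksum": hashlib.md5(body).hexdigest(),
    }


result = fake_result("https://example.com/a.jpg", b"not really an image")
print(result["path"])
```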
4. Conventional implementation
A) settings.py configuration:
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36',
}
ITEM_PIPELINES = {
    'autohome.pipelines.AutohomePipeline': 300,
}
B) start.py:
from scrapy import cmdline

cmdline.execute("scrapy crawl bmw5".split())
C) bmw5.py:
# -*- coding: utf-8 -*-
import scrapy
from autohome.items import AutohomeItem


class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html#pvareaid=3454438']

    def parse(self, response):
        # Skip the first uibox block.
        uiboxes = response.xpath("//div[@class='uibox']")[1:]
        for uibox in uiboxes:
            boxTitle = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
            urls = uibox.xpath(".//ul/li/a/img/@src").getall()
            # Turn each src value into an absolute URL.
            urls = [response.urljoin(url) for url in urls]
            item = AutohomeItem(boxTitle=boxTitle, urls=urls)
            yield item
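The response.urljoin call in the spider resolves each src value against the page URL. A rough standalone sketch of that step, using urllib.parse.urljoin as a stand-in and made-up src values:

```python
from urllib.parse import urljoin

page_url = "https://car.autohome.com.cn/pic/series/65.html"

# Hypothetical src attributes: one protocol-relative, one site-relative.
srcs = ["//img.example.com/a_1.jpg", "/pic/b_2.jpg"]

absolute = [urljoin(page_url, src) for src in srcs]
print(absolute)
```

A protocol-relative src picks up the scheme of the page, while a site-relative one also picks up its host.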
D) items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class AutohomeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    boxTitle = scrapy.Field()
    urls = scrapy.Field()
E) pipelines.py:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os
from urllib import request


class AutohomePipeline:
    def __init__(self):
        # Store images in an "images" directory in the project root.
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "images")
        os.makedirs(self.path, exist_ok=True)

    def process_item(self, item, spider):
        boxTitle = item['boxTitle']
        urls = item['urls']
        # One sub-directory per box title.
        boxTitlePath = os.path.join(self.path, boxTitle)
        os.makedirs(boxTitlePath, exist_ok=True)
        for url in urls:
            # Use the part after the last underscore as the file name.
            imageName = url.split("_")[-1]
            # urlretrieve blocks until the download has finished.
            request.urlretrieve(url, os.path.join(boxTitlePath, imageName))
        return item
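The file name rule url.split("_")[-1] returns the whole URL when it contains no underscore, which is not a usable file name. One possible variant, sketched under that assumption, takes the basename of the URL path first. image_name is a hypothetical helper and the URLs are made up.

```python
import posixpath
from urllib.parse import urlparse


def image_name(url: str) -> str:
    """Derive a file name from an image URL (hypothetical helper).

    Keeps the "part after the last underscore" rule, but applies it to
    the basename of the URL path so the result never contains slashes.
    """
    basename = posixpath.basename(urlparse(url).path)
    return basename.split("_")[-1]


print(image_name("https://img.example.com/cardfs/240x180_0_q95_c42_autohomecar__abc123.jpg"))
print(image_name("https://img.example.com/pics/plain.jpg"))
```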
5. Scrapy implementation
A) settings.py configuration:
import os

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36',
}
ITEM_PIPELINES = {
    # 'autohome.pipelines.AutohomePipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# Directory where the Images Pipeline stores downloaded images
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
B) start.py:
from scrapy import cmdline

cmdline.execute("scrapy crawl bmw5".split())
C) bmw5.py:
# -*- coding: utf-8 -*-
import scrapy
from autohome.items import AutohomeItem


class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html#pvareaid=3454438']

    def parse(self, response):
        # Skip the first uibox block.
        uiboxes = response.xpath("//div[@class='uibox']")[1:]
        for uibox in uiboxes:
            boxTitle = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
            urls = uibox.xpath(".//ul/li/a/img/@src").getall()
            # Turn each src value into an absolute URL.
            urls = [response.urljoin(url) for url in urls]
            # The Images Pipeline reads the URLs from image_urls.
            item = AutohomeItem(boxTitle=boxTitle, image_urls=urls)
            yield item
D) items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class AutohomeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    boxTitle = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
E) pipelines.py:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# import os
# from urllib import request
# class AutohomePipeline:
#     def __init__(self):
#         self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "images")
#         if not os.path.exists(self.path):
#             os.mkdir(self.path)
#
#     def process_item(self, item, spider):
#         boxTitle = item['boxTitle']
#         urls = item['image_urls']
#         boxTitlePath = os.path.join(self.path, boxTitle)
#         if not os.path.exists(boxTitlePath):
#             os.mkdir(boxTitlePath)
#         for url in urls:
#             imageName = url.split("_")[-1]
#             request.urlretrieve(url, os.path.join(boxTitlePath, imageName))
#         return item
6. Scrapy implementation (improved)
A) settings.py configuration:
import os

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36',
}
ITEM_PIPELINES = {
    # 'autohome.pipelines.AutohomePipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'autohome.pipelines.BMWImagesPipeline': 1,
}
# Directory where the Images Pipeline stores downloaded images
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
B) start.py:
from scrapy import cmdline

cmdline.execute("scrapy crawl bmw5".split())
C) bmw5.py:
# -*- coding: utf-8 -*-
import scrapy
from autohome.items import AutohomeItem


class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html#pvareaid=3454438']

    def parse(self, response):
        # Skip the first uibox block.
        uiboxes = response.xpath("//div[@class='uibox']")[1:]
        for uibox in uiboxes:
            boxTitle = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
            urls = uibox.xpath(".//ul/li/a/img/@src").getall()
            # Turn each src value into an absolute URL.
            urls = [response.urljoin(url) for url in urls]
            item = AutohomeItem(boxTitle=boxTitle, image_urls=urls)
            yield item
D) items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class AutohomeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    boxTitle = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
E) pipelines.py:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os
# from urllib import request
from scrapy.pipelines.images import ImagesPipeline
from autohome import settings

# class AutohomePipeline:
#     def __init__(self):
#         self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "images")
#         if not os.path.exists(self.path):
#             os.mkdir(self.path)
#
#     def process_item(self, item, spider):
#         boxTitle = item['boxTitle']
#         urls = item['image_urls']
#         boxTitlePath = os.path.join(self.path, boxTitle)
#         if not os.path.exists(boxTitlePath):
#             os.mkdir(boxTitlePath)
#         for url in urls:
#             imageName = url.split("_")[-1]
#             request.urlretrieve(url, os.path.join(boxTitlePath, imageName))
#         return item


class BMWImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Called before the download requests are sent; this method is what
        # builds those requests.
        request_objs = super().get_media_requests(item, info)
        for request_obj in request_objs:
            # Attach the item so file_path can read boxTitle later.
            request_obj.item = item
        return request_objs

    def file_path(self, request, response=None, info=None, *, item=None):
        # Called when an image is about to be stored; returns its storage path.
        # The keyword-only item parameter is accepted because newer Scrapy
        # versions pass it to file_path.
        path = super().file_path(request, response, info)
        boxTitle = request.item.get('boxTitle')
        imagesStore = settings.IMAGES_STORE
        boxTitlePath = os.path.join(imagesStore, boxTitle)
        os.makedirs(boxTitlePath, exist_ok=True)
        # Drop the default "full/" prefix and keep only the file name.
        imageName = path.replace("full/", "")
        imagePath = os.path.join(boxTitlePath, imageName)
        return imagePath
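The path logic in file_path can be tried outside Scrapy. The sketch below assumes the default Images Pipeline layout of full/<sha1 of the URL>.jpg. default_path and grouped_path are hypothetical helpers that only mirror the pipeline's string handling, and the store directory, title, and URL are made up.

```python
import hashlib
import os


def default_path(url: str) -> str:
    """Mimic the default relative path: full/<sha1 of the URL>.jpg."""
    return f"full/{hashlib.sha1(url.encode('utf-8')).hexdigest()}.jpg"


def grouped_path(images_store: str, box_title: str, url: str) -> str:
    """Mirror BMWImagesPipeline.file_path: <store>/<boxTitle>/<sha1>.jpg."""
    image_name = default_path(url).replace("full/", "")
    return os.path.join(images_store, box_title, image_name)


print(grouped_path("images", "exterior", "https://example.com/a.jpg"))
```

Images from the same box then land in one directory named after its title, instead of all sharing the full directory.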
7. Scrapy implementation (downloading high-resolution images)
A) settings.py configuration:
import os

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36',
}
ITEM_PIPELINES = {
    # 'autohome.pipelines.AutohomePipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'autohome.pipelines.BMWImagesPipeline': 1,
}
# Directory where the Images Pipeline stores downloaded images
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
B) start.py:
from scrapy import cmdline

cmdline.execute("scrapy crawl bmw5".split())
C) bmw5.py:
# -*- coding: utf-8 -*-
# import scrapy
from autohome.items import AutohomeItem
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

# class Bmw5Spider(scrapy.Spider):
#     name = 'bmw5'
#     allowed_domains = ['car.autohome.com.cn']
#     start_urls = ['https://car.autohome.com.cn/pic/series/65.html']
#
#     def parse(self, response):
#         uiboxes = response.xpath("//div[@class='uibox']")[1:]
#         for uibox in uiboxes:
#             boxTitle = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
#             urls = uibox.xpath(".//ul/li/a/img/@src").getall()
#             urls = list(map(lambda url: response.urljoin(url), urls))
#             item = AutohomeItem(boxTitle=boxTitle, image_urls=urls)
#             yield item


class Bmw5Spider(CrawlSpider):
    name = 'bmw5'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']
    rules = (
        Rule(
            LinkExtractor(allow=r'https://car.autohome.com.cn/pic/series/65.+'),
            # CrawlSpider uses parse() internally to apply its rules, so the
            # callback must have a different name.
            callback="parse_page",
            follow=True,
        ),
    )

    def parse_page(self, response):
        uiboxes = response.xpath("//div[@class='uibox']")[1:]
        for uibox in uiboxes:
            boxTitle = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
            imageSrcs = uibox.xpath(".//ul/li/a/img/@src").getall()
            # Rewrite the thumbnail URLs so they point at the larger images.
            imageSrcs = [imageSrc.replace("c42_", "") for imageSrc in imageSrcs]
            imageSrcs = [imageSrc.replace("240x180_0", "1024x0_1") for imageSrc in imageSrcs]
            imageSrcs = [response.urljoin(imageSrc) for imageSrc in imageSrcs]
            item = AutohomeItem(boxTitle=boxTitle, image_urls=imageSrcs)
            yield item
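The URL rewriting in this spider can be checked in isolation. The sketch below applies the same two replacements to a made-up thumbnail src. to_high_res is a hypothetical helper, the host and file name are invented, and urllib.parse.urljoin stands in for response.urljoin.

```python
from urllib.parse import urljoin


def to_high_res(page_url: str, src: str) -> str:
    """Apply the spider's two replacements, then make the URL absolute."""
    src = src.replace("c42_", "")
    src = src.replace("240x180_0", "1024x0_1")
    return urljoin(page_url, src)


page_url = "https://car.autohome.com.cn/pic/series/65.html"
thumb = "//img.example.com/cardfs/240x180_0_q95_c42_autohomecar__abc.jpg"
print(to_high_res(page_url, thumb))
```

Whether the rewritten URL actually serves a larger image depends on the naming scheme of the site's image server, which may change over time.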
D) items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class AutohomeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    boxTitle = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
E) pipelines.py:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os
# from urllib import request
from scrapy.pipelines.images import ImagesPipeline
from autohome import settings

# class AutohomePipeline:
#     def __init__(self):
#         self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "images")
#         if not os.path.exists(self.path):
#             os.mkdir(self.path)
#
#     def process_item(self, item, spider):
#         boxTitle = item['boxTitle']
#         urls = item['image_urls']
#         boxTitlePath = os.path.join(self.path, boxTitle)
#         if not os.path.exists(boxTitlePath):
#             os.mkdir(boxTitlePath)
#         for url in urls:
#             imageName = url.split("_")[-1]
#             request.urlretrieve(url, os.path.join(boxTitlePath, imageName))
#         return item


class BMWImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Called before the download requests are sent; this method is what
        # builds those requests.
        request_objs = super().get_media_requests(item, info)
        for request_obj in request_objs:
            # Attach the item so file_path can read boxTitle later.
            request_obj.item = item
        return request_objs

    def file_path(self, request, response=None, info=None, *, item=None):
        # Called when an image is about to be stored; returns its storage path.
        # The keyword-only item parameter is accepted because newer Scrapy
        # versions pass it to file_path.
        path = super().file_path(request, response, info)
        boxTitle = request.item.get('boxTitle')
        imagesStore = settings.IMAGES_STORE
        boxTitlePath = os.path.join(imagesStore, boxTitle)
        os.makedirs(boxTitlePath, exist_ok=True)
        # Drop the default "full/" prefix and keep only the file name.
        imageName = path.replace("full/", "")
        imagePath = os.path.join(boxTitlePath, imageName)
        return imagePath