Scraping 360 Photography Images
Reference: "Python3网络爬虫开发实战" (Python 3 Web Crawler Development in Action), p. 497, by Cui Qingcai
Goal: use Scrapy to crawl 360's photography gallery, save the image metadata to MongoDB, and download the images locally
Target URL: http://image.so.com/z?ch=photography
Analysis / key points:
Difficulty:
a. Beginner level. The static page carries no image data; images are fetched via AJAX and rendered client-side, with the response in JSON format. Image downloading uses the built-in ImagesPipeline, with a few methods overridden;
b. MongoDB storage.
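To illustrate the analysis above: the gallery page loads its images from an AJAX endpoint that returns JSON with a `list` key. The payload below is a made-up sample for illustration only (field names mirror the real response; the values are invented):

```python
import json

# Hypothetical excerpt of the JSON the AJAX endpoint returns
# (field names match the item fields used later; values are invented)
sample = '''
{
  "total": 1200,
  "list": [
    {"id": "b0e7d4b1", "group_title": "Sample group",
     "qhimg_url": "http://p0.so.qhimgs1.com/t01example.jpg"}
  ]
}
'''

results = json.loads(sample)
# The images sit under the 'list' key, one dict per image
images = results.get('list', [])
print(len(images), images[0]['qhimg_url'])
```

The spider later parses the real response with exactly this `json.loads` / `get('list')` pattern.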
Steps:
- Create the Scrapy project and generate the images spider
Terminal: > scrapy startproject images360
Terminal: > scrapy genspider images image.so.com
- Configure settings.py
# MongoDB configuration
MONGO_URI = 'localhost'
MONGO_DB = 'images360'
# Default save directory for downloaded images (used by ImagesPipeline)
IMAGES_STORE = './images'
# Do not respect robots.txt
ROBOTSTXT_OBEY = False
# headers
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
# Enable the pipelines (ImagePipeline must have the highest priority, i.e. the lowest order number)
ITEM_PIPELINES = {
'images360.pipelines.ImagePipeline': 300,
'images360.pipelines.MongoPipeline': 301,
}
- Write items.py
from scrapy import Item, Field

# Capture every field of the image metadata
class ImageItem(Item):
    cover_height = Field()
    cover_imgurl = Field()
    cover_width = Field()
    dsptime = Field()
    group_title = Field()
    grpseq = Field()
    id = Field()
    imageid = Field()
    index = Field()
    label = Field()
    qhimg_height = Field()
    qhimg_thumb_url = Field()
    qhimg_url = Field()
    qhimg_width = Field()
    tag = Field()
    total_count = Field()
- Write pipelines.py
a) ImagePipeline, adapted from the Scrapy docs section
"Downloading and processing files and images":
# Image download pipeline
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        '''
        Override file_path to name each file after the last URL segment
        '''
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        '''
        Drop items whose image failed to download, so they are not saved to the database
        '''
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        '''
        Re-request each image URL so the scheduler queues the download
        '''
        yield Request(url=item['qhimg_url'])
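As a quick sanity check on the file_path override: it keeps only the last path segment of the image URL as the filename (the URL below is a made-up example):

```python
# Made-up image URL; file_path keeps only the last path segment
url = 'http://p0.so.qhimgs1.com/t017a4e3f5b2c.jpg'
file_name = url.split('/')[-1]
print(file_name)  # t017a4e3f5b2c.jpg
```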
b) MongoPipeline, adapted from the Scrapy docs: https://doc.scrapy.org/en/latest/topics/item-pipeline.html?highlight=mongo (code omitted)
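Since the note omits the MongoPipeline code, here is a sketch following the pattern in the linked Scrapy docs, reading MONGO_URI and MONGO_DB from settings.py; using the item class name as the collection name is an assumption, not necessarily the book's exact code:

```python
class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the connection settings configured in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        # Deferred import so the class can be defined without pymongo installed
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Collection named after the item class (an assumed convention)
        self.db[item.__class__.__name__].insert_one(dict(item))
        return item
```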
- Write spiders/images.py
Notes:
a) Override start_requests(self);
b) Build the request URLs dynamically; assign each Field dynamically and yield the corresponding ImageItem
# For each image, assign fields dynamically and yield an ImageItem
for image in images:
    item = ImageItem()
    for field in item.fields:
        if field in image.keys():
            item[field] = image.get(field)
    yield item
c) Full code:
import json
from scrapy import Spider, Request
from images360.items import ImageItem

class ImagesSpider(Spider):
    name = 'images'
    # allowed_domains = ['image.so.com']
    # start_urls = ['http://image.so.com/z?ch=photography']
    url = 'http://image.so.com/zj?ch=photography&sn={sn}&listtype=new&temp=1'

    # Overridden
    def start_requests(self):
        # Request the first 1200 images in batches of 30 (sn = 30, 60, ..., 1200)
        for sn in range(1, 41):
            yield Request(url=self.url.format(sn=sn * 30), callback=self.parse)

    def parse(self, response):
        results = json.loads(response.text)
        # Check that 'list' is among the result keys
        if 'list' in results.keys():
            images = results.get('list')
            # For each image, assign fields dynamically and yield an ImageItem
            for image in images:
                item = ImageItem()
                for field in item.fields:
                    if field in image.keys():
                        item[field] = image.get(field)
                yield item
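The dynamic field assignment above can be exercised without Scrapy; below, a plain set stands in for ImageItem.fields and a dict for one JSON entry (both hypothetical samples):

```python
# Stand-in for ImageItem.fields: the declared field names
fields = {'id', 'group_title', 'qhimg_url'}

# Hypothetical JSON entry from results['list']; 'extra_key' is not a declared field
image = {'id': '42', 'group_title': 'Sunset',
         'qhimg_url': 'http://p0.so.qhimgs1.com/t01.jpg', 'extra_key': 'ignored'}

# Copy only the declared fields, exactly as the spider's inner loop does
item = {field: image[field] for field in fields if field in image}
print(sorted(item))  # ['group_title', 'id', 'qhimg_url']
```

Undeclared keys in the JSON are silently skipped, so the item never raises a KeyError for unknown fields.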
- Run and check the results
Terminal: > scrapy crawl images
Summary
- A beginner-level project; good practice for the overall Scrapy workflow;
- Practice locating and parsing a page's AJAX (JSON) responses;
- A first look at ImagesPipeline and how to override its methods as needed.