Build a Crawler with Scrapy and Save the Results to MongoDB
This article walks through a crawler built with the Scrapy framework. Installing and configuring Scrapy is not covered here; please refer to the relevant documentation. The example uses PyCharm as the IDE and was developed and tested on a Mac. The target site is http://quotes.toscrape.com. The goal is to give you, through a simple example, a quick hands-on feel for how a Scrapy spider is created.
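If Scrapy and pymongo are not installed yet, a single pip command is usually enough (your environment may differ):
pip install scrapy pymongo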
Download this project
1 Create the Scrapy project
Three commands are all it takes. Generate the project:
scrapy startproject mini_scrapy
Enter the project directory:
cd mini_scrapy
Generate the spider:
scrapy genspider quotes quotes.toscrape.com
The commands run as follows:
➜ PyProjects which scrapy
/Library/Frameworks/Python.framework/Versions/3.6/bin/scrapy
➜ PyProjects scrapy startproject mini_scrapy
New Scrapy project 'mini_scrapy', using template directory '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/templates/project', created in:
/Users/thomas/PyProjects/mini_scrapy
You can start your first spider with:
cd mini_scrapy
scrapy genspider example example.com
➜ PyProjects cd mini_scrapy
➜ mini_scrapy ll
total 8
drwxr-xr-x 9 thomas staff 288B 4 4 15:30 mini_scrapy
-rw-r--r-- 1 thomas staff 265B 4 4 15:30 scrapy.cfg
➜ mini_scrapy scrapy genspider quotes quotes.toscrape.com
Created spider 'quotes' using template 'basic' in module:
mini_scrapy.spiders.quotes
Open the project in PyCharm to see the generated directory structure. Set up a Python virtual environment (venv) for the project, then run the spider in the IDE's terminal to get a feel for it:
(venv) ➜ mini_scrapy scrapy crawl quotes
2019-04-04 15:44:34 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: mini_scrapy)
2019-04-04 15:44:34 [scrapy.utils.log] INFO: Versions: lxml 4.2.4.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.3 (v3.6.3:2c5fed86e0, Oct 3 2017, 00:32:08) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Darwin-18.5.0-x86_64-i386-64bit
2019-04-04 15:44:34 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'mini_scrapy', 'NEWSPIDER_MODULE': 'mini_scrapy.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['mini_scrapy.spiders']}
2019-04-04 15:44:34 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2019-04-04 15:44:34 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-04 15:44:34 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-04 15:44:34 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-04 15:44:34 [scrapy.core.engine] INFO: Spider opened
2019-04-04 15:44:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-04 15:44:34 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-04 15:44:36 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-04 15:44:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2019-04-04 15:44:36 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-04 15:44:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 446,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 4, 7, 44, 36, 482556),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'memusage/max': 65634304,
'memusage/startup': 65634304,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 4, 4, 7, 44, 34, 345877)}
2019-04-04 15:44:36 [scrapy.core.engine] INFO: Spider closed (finished)
From the output above you can see the order in which things are loaded at startup: settings -> middleware -> pipelines -> spider; pipelines and spiders are connected through items.
2 Parse the fetched pages
Use the developer tools in Chrome or Firefox to work out the CSS selectors for the content you want to scrape. In this example we extract the quote text, the author and the tags. The code is as follows:
quotes = response.css('.quote')
for quote in quotes:
    text = quote.css('.text::text').extract_first()
    author = quote.css('.author::text').extract_first()
    tags = quote.css('.tags .tag::text').extract()
For convenient debugging you can also run Scrapy's interactive shell from the terminal at the bottom of the IDE and test things by hand:
(venv) ➜ scrapy shell quotes.toscrape.com
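Inside the shell a response object for the page is already available, so you can try out selectors interactively before putting them into the spider; for example (illustrative commands, output omitted):
>>> response.css('.quote .text::text').extract_first()
>>> response.css('.quote .author::text').extract_first()
>>> response.css('.quote .tags .tag::text').extract()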
3 Define and use an Item
The spider and the pipelines are connected through the Item. In items.py, define the fields to be scraped as follows:
class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
Then import the item in the spider file quotes.py and use it:
def parse(self, response):
    quotes = response.css('.quote')
    for quote in quotes:
        item = QuoteItem()
        text = quote.css('.text::text').extract_first()
        author = quote.css('.author::text').extract_first()
        tags = quote.css('.tags .tag::text').extract()
        item['text'] = text
        item['author'] = author
        item['tags'] = tags
        yield item
This completes the parsing of the current page. The page also has a Next button, and those pages need to be crawled and processed as well, which brings us to step 4.
4 Crawl the Next pages
First check whether the page has a Next button. If it does, extract its URL and submit a new request; the response is again handled by the parse function, so this is a typical recursive process. The code is as follows:
next_page_url = response.css('.pager .next a::attr(href)').extract_first()
if next_page_url:
    url = response.urljoin(next_page_url)
    yield scrapy.Request(url=url, callback=self.parse)
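Putting the pieces together, quotes.py at this point looks roughly like the sketch below; the class skeleton (name, allowed_domains, start_urls) is what genspider typically generates, and QuoteItem is the item defined in step 3 (the import path is one common option):

import scrapy

from mini_scrapy.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # extract every quote block on the current page
        for quote in response.css('.quote'):
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
        # follow the Next button; the new response is handled by parse again
        next_page_url = response.css('.pager .next a::attr(href)').extract_first()
        if next_page_url:
            yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)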
With that, all pages are crawled and processed, and in the next step you can look at the results.
5 Run the spider and save the results
There are several ways to run a spider: from the command line, by invoking it from a Python file, or via a web call. This article covers the first two; in this step we use the command line. The -o option saves the scraped results and supports json, jl, csv, xml and marshal output formats; it can also write to a remote file server (e.g. ftp://user:pass@ftp.example.com/path/quotes.csv). The following command saves the results to a JSON file:
(venv) ➜ scrapy crawl quotes -o quotes.json
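Changing the file extension changes the output format; for instance (illustrative variants of the same command):
(venv) ➜ scrapy crawl quotes -o quotes.csv
(venv) ➜ scrapy crawl quotes -o quotes.jl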
6 Add main.py to make it easier to run or debug the spider
import sys
import os

from scrapy.cmdline import execute

# make sure the project root is on sys.path, then run the quotes spider
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "quotes"])
You can now run the spider through main.py.
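As an alternative to cmdline.execute, Scrapy also provides CrawlerProcess for running a spider from your own script; a minimal sketch, assuming it is launched from the project root so that the project settings can be found:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load the project settings (pipelines, feed exports, etc.) and run the quotes spider
process = CrawlerProcess(get_project_settings())
process.crawl('quotes')
process.start()  # blocks here until the crawl finishes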
7 Save to MongoDB
So far the results have not been saved to a database. The following shows how to store them in MongoDB. In Scrapy, once an item is yielded, it is handed to the pipelines configured in the settings file.
1. First, write the pipelines in pipelines.py as follows. Two pipelines are defined here: TextPipeline checks and truncates overly long text, and MongoDBPipeline stores the data in the database.
from scrapy.exceptions import DropItem
import pymongo


class TextPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')


class MongoDBPipeline(object):
    def __init__(self, host, port, db):
        self.host = host
        self.port = port
        self.db = db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(host=crawler.settings.get("MONGODB_SERVER"),
                   port=crawler.settings.get("MONGODB_PORT"),
                   db=crawler.settings.get("MONGODB_DB"))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.host, self.port)
        # select the database to operate on
        self.db_conn = self.client[self.db]

    def process_item(self, item, spider):
        # name = item.__class__.__name__
        name = item.getCollection()
        # name is the MongoDB collection (roughly a table name); the item is
        # expected to provide it via getCollection()
        count = self.db_conn[name].find(item.getId()).count()
        if count == 0:
            print('Inserting data...')
            self.db_conn[name].insert(dict(item))
        else:
            # print('Duplicate data, updating! praise_nums %d' % item.get_random_id())
            # self.db_conn[name].update({"url_object_id": item.getId()},
            #                           {"$set": {"praise_nums": item.get_random_id()}})
            print('Duplicate data, ignored')
        return item

    def close_spider(self, spider):
        self.client.close()
The code above reads the database connection information from the settings, so it must be configured in settings.py; be sure to replace it with your own MongoDB connection details. Also enable the pipelines by adding them under ITEM_PIPELINES; note that a lower number means a higher priority:
ITEM_PIPELINES = {
    'mini_scrapy.pipelines.TextPipeline': 100,
    'mini_scrapy.pipelines.MongoDBPipeline': 150
}
MONGODB_SERVER = "48.97.213.218"
MONGODB_PORT = 27017
MONGODB_DB = "tuan"
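Once the spider has run with these settings, a quick sanity check with pymongo shows whether the documents arrived (an illustrative snippet; sample_saying is the collection name defined by the item in the next step, and the host, port and database must match your own settings):

import pymongo

client = pymongo.MongoClient("48.97.213.218", 27017)
db = client["tuan"]
# count the stored quotes and print one of them
print(db["sample_saying"].count_documents({}))
print(db["sample_saying"].find_one())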
2. Add a uniqueness check for items. In real projects a crawler is usually run over and over, so each scraped item gets a unique identifier; when the crawler is re-run, content that is already in the database is not inserted again. A few item-related methods are added as well. The main code is below. For this single example the design is more elaborate than strictly necessary, but in the examples that follow you will see the benefit of writing it this way.
# -*- coding: utf-8 -*-
import scrapy
from abc import ABCMeta, abstractmethod


class BaseItem(metaclass=ABCMeta):
    field_list = []  # field names

    @abstractmethod
    def clean_data(self):
        pass

    @staticmethod
    @abstractmethod
    def help_fields(fields: list):
        pass


class MongoDBItem(BaseItem):
    @abstractmethod
    def getCollection(self):
        pass

    @abstractmethod
    def getId(self):
        pass


class QuoteItem(scrapy.Item, MongoDBItem):
    field_list = ['text', 'author', 'tags', 'url_object_id']
    table_name = 'sample_saying'

    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    url_object_id = scrapy.Field()

    def getCollection(self):
        return self.table_name

    def getId(self):
        return {'url_object_id': self["url_object_id"]}

    def help_fields(self):
        for field in self.field_list:
            print(field, "= scrapy.Field()")

    def clean_data(self):
        pass
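Note that url_object_id has to be filled in by the spider (the downloadable project presumably does this); one possible sketch, assuming an MD5 hash of the quote text and author is used as the unique key, is to add a couple of lines to parse in quotes.py:

import hashlib

# inside the for-loop of parse(), after text/author/tags have been set:
item['url_object_id'] = hashlib.md5(
    (item['text'] + item['author']).encode('utf-8')).hexdigest()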
8 Summary
This example gives you a quick look at the steps involved in developing a crawler with the Scrapy framework. Along the way you can see what each file in the project does and how the pieces call each other, which is why no time was spent on regular expressions. If you need the code, download this project.