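The goal of this Scrapy crawler is to scrape the information for every bank wealth-management product listed on the Rong360 (融360) website and store it in MongoDB. A screenshot of the page is shown below; the full dataset contains more than 120,000 records.
We will not go over how to create and run a Scrapy project here and only give the relevant code. Readers interested in project setup can refer to: Scrapy爬虫(4)爬取豆瓣电影Top250图片 (Scrapy Crawler (4): Scraping the Douban Movie Top 250 Images).
Edit items.py as follows. It defines the fields stored for each wealth-management product, such as the product name and the issuing bank.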
import scrapy


class BankItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    bank = scrapy.Field()
    currency = scrapy.Field()
    startDate = scrapy.Field()
    endDate = scrapy.Field()
    period = scrapy.Field()
    proType = scrapy.Field()
    profit = scrapy.Field()
    amount = scrapy.Field()
Create the spider file bankSpider.py as follows. It extracts the details of each wealth-management product from the listing pages.
import scrapy
from bank.items import BankItem


class bankSpider(scrapy.Spider):
    name = 'bank'
    start_urls = ['https://www.rong360.com/licai-bank/list/p1']

    def parse(self, response):
        # skip the first <tr>, which is the table header row
        trs = response.css('tr')[1:]
        for tr in trs:
            # create a fresh item for every row so yielded items do not share state
            item = BankItem()
            item['name'] = tr.xpath('td[1]/a/text()').extract_first()
            item['bank'] = tr.xpath('td[2]/p/text()').extract_first()
            item['currency'] = tr.xpath('td[3]/text()').extract_first()
            item['startDate'] = tr.xpath('td[4]/text()').extract_first()
            item['endDate'] = tr.xpath('td[5]/text()').extract_first()
            item['period'] = tr.xpath('td[6]/text()').extract_first()
            item['proType'] = tr.xpath('td[7]/text()').extract_first()
            item['profit'] = tr.xpath('td[8]/text()').extract_first()
            item['amount'] = tr.xpath('td[9]/text()').extract_first()
            yield item

        # find the "next page" link; when more than one a.next-page link
        # is present, the second one is used
        next_pages = response.css('a.next-page')
        if len(next_pages) == 1:
            next_page_link = next_pages.xpath('@href').extract_first()
        elif len(next_pages) > 1:
            next_page_link = next_pages[1].xpath('@href').extract_first()
        else:
            # no pagination link at all: nothing more to follow
            next_page_link = None

        if next_page_link:
            # urljoin resolves the relative href against the current page URL
            next_page = response.urljoin(next_page_link)
            yield scrapy.Request(next_page, callback=self.parse)
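The pagination step above turns the href of the "next page" link into an absolute URL before requesting it. As a minimal sketch of that step outside Scrapy, the standard-library `urljoin` can do the joining and also copes with hrefs that are already absolute. The helper name `build_next_url` and the sample href `/licai-bank/list/p2` are assumptions made for illustration; they are not taken from the site.

```python
from urllib.parse import urljoin


def build_next_url(current_url, href):
    """Return the absolute URL of the next page, or None when there is no link."""
    if not href:
        # extract_first() returns None when the selector matches nothing
        return None
    # urljoin resolves a relative href against the page it was found on
    return urljoin(current_url, href)


# hypothetical href, used only to show the behaviour
current = "https://www.rong360.com/licai-bank/list/p1"
print(build_next_url(current, "/licai-bank/list/p2"))
print(build_next_url(current, None))
```

Returning None for a missing link lets the caller stop paginating with a simple truthiness check, which mirrors the `if next_page_link:` test in the spider.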
To store the scraped data in MongoDB, we need to modify pipelines.py as follows:
# pipelines to insert the data into mongodb
import pymongo


class BankPipeline:
    def __init__(self, settings):
        # connect database
        self.client = pymongo.MongoClient(host=settings['MONGO_HOST'], port=settings['MONGO_PORT'])
        # to log in with a user name and password, also pass
        # username=settings['MONGO_USER'], password=settings['MONGO_PSW'] to MongoClient
        # handle of the database and collection of mongodb
        self.db = self.client[settings['MONGO_DB']]
        self.coll = self.db[settings['MONGO_COLL']]

    @classmethod
    def from_crawler(cls, crawler):
        # read the project settings from the crawler; the old scrapy.conf
        # module is not available in recent Scrapy versions
        return cls(crawler.settings)

    def process_item(self, item, spider):
        postItem = dict(item)
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        self.coll.insert_one(postItem)
        return item
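The pipeline above writes each item to MongoDB exactly as scraped. Text taken from table cells may carry surrounding whitespace, and `extract_first()` returns None when a cell is missing, so it can be useful to normalize the record before inserting it. The following is only a sketch of such a step: `clean_record` is a hypothetical helper that is not part of the original project, and the sample values are made up.

```python
def clean_record(record):
    """Strip whitespace from string values and drop fields that are None or empty."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value is None or value == "":
            # skip missing cells instead of storing empty fields
            continue
        cleaned[key] = value
    return cleaned


# made-up sample row, shaped like dict(item) in the pipeline
sample = {"name": "  Example Product \n", "bank": "Example Bank", "currency": None, "period": ""}
print(clean_record(sample))
```

In the pipeline this could be applied as `postItem = clean_record(dict(item))` before the insert, if dropping empty fields suits the later analysis.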
The MongoDB parameters used here, such as MONGO_HOST and MONGO_PORT, are set in settings.py. Modify settings.py as follows:
- ROBOTSTXT_OBEY = False
- ITEM_PIPELINES = {'bank.pipelines.BankPipeline': 300}
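- Add the MongoDB connection parameters
MONGO_HOST = "localhost" # host IP
MONGO_PORT = 27017 # port number
MONGO_DB = "Spider" # database name
MONGO_COLL = "bank" # collection name
# MONGO_USER = ""
# MONGO_PSW = ""
The user name and password can be added if your MongoDB instance requires them.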
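When a user name and password are needed, one common option is to hand `pymongo.MongoClient` a single connection URI instead of separate host and port arguments. The sketch below builds such a URI with the standard library only. `build_mongo_uri` is a hypothetical helper, and the credentials shown are placeholders; the percent-encoding with `quote_plus` follows the PyMongo recommendation for user names and passwords that contain special characters.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, port, user="", password=""):
    """Build a mongodb:// URI, adding percent-encoded credentials when a user is given."""
    if user:
        return "mongodb://%s:%s@%s:%d" % (
            quote_plus(user), quote_plus(password), host, port)
    return "mongodb://%s:%d" % (host, port)


# placeholder credentials, for illustration only
print(build_mongo_uri("localhost", 27017))
print(build_mongo_uri("localhost", 27017, "spider", "p@ss/word"))
```

The resulting string can be passed as the first argument to `pymongo.MongoClient`.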
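Now we can run the spider. The results are shown below:
The crawl took about 3 hours and collected more than 120,000 records, which is an impressive throughput!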
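To put these figures in perspective, here is a rough back-of-the-envelope calculation. It assumes exactly 120,000 records and exactly 3 hours, which are the rounded numbers quoted above and not precise measurements.

```python
# rounded figures from the text; the real counts differ slightly
records = 120_000
hours = 3

per_hour = records / hours
per_second = records / (hours * 3600)

print(f"{per_hour:.0f} records per hour")
print(f"{per_second:.1f} records per second")
```

Under these assumptions the average works out to roughly 40,000 records per hour, or about 11 records per second.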
Finally, let's take a look at the data in MongoDB:
Perfect! That's all for this post. Questions and comments are welcome!