Experiment Content
Use Scrapy to crawl the content of Quotes to Scrape (quotes.toscrape.com).
Experiment Environment
Operating system: Windows 7, 32-bit
Python version: Python 3.6.5
Experiment Steps
1. Inspect the page structure and plan the crawl logic
Open Quotes to Scrape in the Chrome browser and use the developer tools to inspect the page structure, obtain the XPath expressions of the relevant elements, and note the URL pattern of the author-detail pages.
Based on this inspection, the crawl logic is planned as follows.
Crawl logic for the quotes:
- iterate over every quote block on the current page;
- extract the quote text, author, and tags from each block;
- extract the URL of the next page;
- follow that URL and repeat.
Crawl logic for the author information:
- use a CrawlSpider whose rules are restricted by the URL patterns observed above, so that author pages across the whole site are crawled;
- extract each author's name, date of birth, place of birth, and biography.
The XPath expressions involved can be checked interactively beforehand, as sketched right after this list.
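A minimal scrapy shell session for verifying the selectors (these are the same expressions used later in quote.py; outputs are omitted here):

$ scrapy shell 'http://quotes.toscrape.com/'
>>> quote = response.xpath('//div[@class="quote"]')[0]                      # first quote block
>>> quote.xpath('./span[@class="text"]/text()').extract_first()             # quote text
>>> quote.xpath('./span[2]/small[@class="author"]/text()').extract_first()  # author name
>>> quote.xpath('./div[@class="tags"]/a[@class="tag"]/text()').extract()    # list of tags
>>> response.xpath('//nav/ul[@class="pager"]/li[@class="next"]/a/@href').extract_first()  # next-page href, e.g. '/page/2/'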
2. Main code
Create a new Scrapy project by entering the following command on the command line:
scrapy startproject scrapy_quote
The final directory structure of the project is shown below.
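This is the standard layout generated by scrapy startproject; the two spiders quote.py and author.py are added under spiders/ later:

scrapy_quote/
├── scrapy.cfg            # deploy configuration file
└── scrapy_quote/         # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # project middlewares
    ├── pipelines.py      # project pipelines
    ├── settings.py       # project settings
    └── spiders/          # directory holding the spiders
        └── __init__.py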
Full code of items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyQuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    quotes = scrapy.Field()        # quote text
    author = scrapy.Field()        # author name
    tags = scrapy.Field()          # list of tag strings
    bornDate = scrapy.Field()      # author's date of birth
    bornLocation = scrapy.Field()  # author's place of birth
    description = scrapy.Field()   # author biography
Full code of quote.py:
# -*- coding: utf-8 -*-
import re

import scrapy

from scrapy_quote.items import ScrapyQuoteItem


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # collapse any run of whitespace into a single space
        re_s = re.compile(r'\s+')
        for line in response.xpath('//div[@class="quote"]'):
            item = ScrapyQuoteItem()
            item['quotes'] = re_s.sub(' ', line.xpath('./span[@class="text"]/text()').extract_first()).strip()
            item['author'] = re_s.sub(' ', line.xpath('./span[2]/small[@class="author"]/text()').extract_first()).strip()
            item['tags'] = line.xpath('./div[@class="tags"]/a[@class="tag"]/text()').extract()
            yield item
        # follow the "Next" link until the last page is reached
        next_page = response.xpath('//nav/ul[@class="pager"]/li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
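The spider can then be run and its items exported with, for example (the output filename is arbitrary):

scrapy crawl quote -o quotes.json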
Full code of author.py:
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from scrapy_quote.items import ScrapyQuoteItem


class AuthorSpider(CrawlSpider):
    name = 'author'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    # The first rule only follows the list pages (no callback, so follow
    # defaults to True); the second rule matches the author-detail pages
    # and hands each of them to parse_item for extraction.
    rules = (
        Rule(LinkExtractor(allow=r'http://quotes.toscrape.com/page/\d+/')),
        Rule(
            LinkExtractor(allow=r'http://quotes.toscrape.com/author/.+'),
            callback='parse_item'
        ),
    )

    def parse_item(self, response):
        item = ScrapyQuoteItem()
        item['author'] = response.xpath('//h3[@class="author-title"]/text()').extract_first()
        item['bornDate'] = response.xpath('//span[@class="author-born-date"]/text()').extract_first()
        item['bornLocation'] = response.xpath('//span[@class="author-born-location"]/text()').extract_first()
        item['description'] = response.xpath('//div[@class="author-description"]/text()').extract_first()
        yield item
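Likewise, the author spider can be run with (again, the output filename is arbitrary):

scrapy crawl author -o authors.json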
To keep the JSON output from containing escaped Unicode (\uXXXX sequences), add the following to settings.py:
FEED_EXPORT_ENCODING = 'utf-8'
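The effect, illustrated on an abridged, hypothetical exported record: without the setting, the curly quotation marks around each quote are written as \u201c/\u201d escapes; with it, they appear verbatim:

{"quotes": "\u201cThe world as we have created it ...\u201d", ...}   # default export
{"quotes": "“The world as we have created it ...”", ...}             # with FEED_EXPORT_ENCODING = 'utf-8'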
Experiment Results
1. Scraping the author information
As shown in the figure below, a total of 44 author records were scraped.
2. Scraping the quotes
As shown in the figure below, a total of 100 quotes were scraped.