Crawler in Practice (2): Crawling a News Site with CrawlSpider

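Earlier we used Scrapy to crawl web pages automatically. Scrapy also ships with a spider class built for exactly this purpose, CrawlSpider, which makes following links across a site straightforward. For the basics of CrawlSpider, see the official documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/spiders.html#crawlspider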

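Let's go straight to the example. Goal: for news pages on Sohu News (http://news.sohu.com/) whose links match our pattern, extract the title, publication time, and source.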

First, create a crawler project in a directory of your choice: scrapy startproject mycwpjt
Next, go into the spiders directory and list the available spider templates: scrapy genspider -l

12.png

Then create the spider file from the crawl template: scrapy genspider -t crawl crawlsohu sohu.com

With the project and the spider file created, we write items.py based on the fields we want to scrape, shown in the screenshot below:

10.png

The code is as follows:

import scrapy

class MycwpjtItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Fields for the news page
    title = scrapy.Field()
    time = scrapy.Field()
    source = scrapy.Field()

Before writing the spider, let's look at the pattern in the links we need to request:

9.png

The red boxes in the screenshot show the URL pattern to match, which the spider below encodes in its rule:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from mycwpjt.items import MycwpjtItem


class CrawlsohuSpider(CrawlSpider):
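    name = 'crawlsohu'  # spider name
    # allowed_domains = ['sohu.com']  # restricts crawling to this domain (left disabled here)
    start_urls = ['http://news.sohu.com/']  # URL where crawling starts

    # rules: the rules for automatic crawling
    # LinkExtractor: extracts links from a page that match the given conditions, to be crawled next
    # callback='parse_item': the method that processes each matched page
    # follow: whether to keep following links found on matched pages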
    rules = (
        Rule(LinkExtractor(allow=('http://www.sohu.com/a/230.*?')), callback='parse_item', follow=True),
    )


    def parse_item(self, response):
        
        item_news = MycwpjtItem()

        # Extract the title, publication time, and source from the news page using XPath expressions
        item_news['title'] = response.xpath('//div[@class="text-title"]/h1/text()').extract_first()
        item_news['time'] = response.xpath('//span[@id="news-time"]/text()').extract_first()
        item_news['source'] = response.xpath('//span[@data-role="original-link"]/a/text()').extract_first()

        return item_news
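
To see which links the rule above would follow, the allow pattern can be tried out with Python's re module before running the crawl. This is only a sketch: the sample URLs are made up for illustration, and it assumes the link extractor applies the pattern as a regular-expression search against each absolute URL.

```python
import re

# Same pattern as in the Rule above; the sample URLs below are hypothetical.
ALLOW_PATTERN = re.compile(r'http://www.sohu.com/a/230.*?')

sample_urls = [
    'http://www.sohu.com/a/230123456_115479',   # article ID starts with 230
    'http://www.sohu.com/a/229987654_100001',   # article ID does not start with 230
    'http://news.sohu.com/',                    # the start page itself
]

def is_followed(url):
    """Return True if the URL matches the allow pattern (regex search)."""
    return ALLOW_PATTERN.search(url) is not None

matched = [url for url in sample_urls if is_followed(url)]
print(matched)
```

Note that the dots in the pattern are unescaped, so each one matches any character; writing them as \. would make the pattern stricter.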

Once we have the data we want, we can process it. Open the pipeline file pipelines.py:

import json

class MycwpjtPipeline(object):

    def __init__(self):  # open the output file when the pipeline is created
        self.f = open("news.json", "w", encoding="utf-8")


    def process_item(self, item, spider):
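        # Convert the item to a dict, then serialize it to a JSON string
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"  # one record per line, followed by a comma
        # Print for debugging
        print("output -> %s" % content)
        # Write to the local file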
        self.f.write(content)

        return item


    def close_spider(self, spider):  # close the file when the spider finishes
        self.f.close()
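
The ensure_ascii=False argument matters for Chinese news titles. The short sketch below, using a made-up record rather than real crawl output, shows the difference, and also shows the shape of each line the pipeline writes.

```python
import json

# A made-up record standing in for one scraped item (not real crawl output).
record = {'title': '新闻标题', 'time': '2018-05-01 10:00', 'source': '搜狐'}

escaped = json.dumps(record)                       # default: non-ASCII escaped as \uXXXX
readable = json.dumps(record, ensure_ascii=False)  # keeps the Chinese characters as-is

# The line written by the pipeline: the JSON object plus a trailing comma and newline.
line = readable + ",\n"

print(escaped)
print(readable)
```

Both strings parse back to the same dictionary; the second is simply easier to read when the file is opened in an editor.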

As the last step, enable the pipeline in the settings file (settings.py) and set a few common options:

# Default request headers, including the user agent
DEFAULT_REQUEST_HEADERS = {
  'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

# Disable cookies
COOKIES_ENABLED = False

ITEM_PIPELINES = {
   'mycwpjt.pipelines.MycwpjtPipeline': 300,
}

# Download delay in seconds
DOWNLOAD_DELAY = 0.3 

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
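
With Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting left at its default (enabled), the actual wait between requests to the same site is a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. The following plain-Python sketch (not Scrapy code) works out that range for the value used above; delay_range is a hypothetical helper introduced only for illustration.

```python
DOWNLOAD_DELAY = 0.3  # same value as in the settings above

def delay_range(base_delay):
    """Return the (min, max) wait in seconds when the delay is
    randomized between 0.5x and 1.5x of the base delay."""
    return (0.5 * base_delay, 1.5 * base_delay)

low, high = delay_range(DOWNLOAD_DELAY)
print(f"wait between requests: {low:.2f}s to {high:.2f}s")
```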

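At this point the project is complete. The main difference from the basic template is the mechanism the spider file uses; the rest is essentially the same. Run the project (scrapy crawl crawlsohu) to see the result: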

11.png
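
Because every line the pipeline writes to news.json ends with a comma, the file as a whole is not a single valid JSON document. One way to load it back, sketched here under the assumption that each line holds exactly one record, is to strip the trailing comma and parse line by line. The helper name load_records and the sample file content are made up for illustration.

```python
import json
import os
import tempfile

def load_records(path):
    """Parse a file in which each line is a JSON object followed by a comma."""
    records = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip().rstrip(',')  # drop the newline and trailing comma
            if line:                         # skip blank lines
                records.append(json.loads(line))
    return records

# Demo with a temporary file that mimics the pipeline's output format.
sample = (
    '{"title": "标题一", "time": "2018-05-01 10:00", "source": "搜狐"},\n'
    '{"title": "标题二", "time": null, "source": null},\n'
)
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(sample)

records = load_records(tmp.name)
os.remove(tmp.name)
print(len(records), records[0]['title'])
```

A field is null in the file when the corresponding XPath expression found nothing on the page, since extract_first() returns None in that case.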

After this example you should have a basic grasp of CrawlSpider. As a next exercise, try using this template to crawl job listings from Lagou (拉勾网). Good luck ^_^
