python_day9: Web Scraping with the Scrapy Framework

The Scrapy Framework

1. Official Documentation


Official documentation: https://docs.scrapy.org/en/latest/topics/architecture.html

The data flow in Scrapy is controlled by the execution engine, and goes like this:

  1. The Engine gets the initial Requests to crawl from the Spider.
  2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
  3. The Scheduler returns the next Requests to the Engine.
  4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
  5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
  6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
  7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
  8. The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
  9. The process repeats (from step 1) until there are no more requests from the Scheduler.
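
To make the data flow concrete, here is a minimal spider sketch. The target site quotes.toscrape.com is only a placeholder (not part of the original notes); the comments map to the numbered steps above:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # step 1: the Engine gets these initial Requests from the Spider
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # step 6: the Engine hands the downloaded Response to the Spider
        for quote in response.css('div.quote'):
            # step 7: the Spider returns scraped items to the Engine ...
            yield {'text': quote.css('span.text::text').get()}
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # ... and new Requests to follow (the cycle repeats from step 2)
            yield response.follow(next_page, callback=self.parse)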

2. Scrapy Components

Components:

1. Engine (ENGINE)
The engine is responsible for controlling the data flow among all components of the system, and for triggering events when certain actions occur. For details, see the data flow section above.

2. Scheduler (SCHEDULER)
Accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs: it decides which URL to crawl next and removes duplicate URLs.

3. Downloader (DOWNLOADER)
Downloads web page content and returns it to the ENGINE. The downloader is built on Twisted, an efficient asynchronous model.

4. Spiders (SPIDERS)
SPIDERS are classes defined by the developer, used to parse responses, extract items, or send new requests.

5. Item Pipelines (ITEM PIPELINES)
Responsible for processing items after they are extracted; typical operations include cleaning, validation, and persistence (e.g. saving to a database). A hedged pipeline sketch appears after the lianjia.py example in section 4.

6. Downloader Middlewares
Sit between the Scrapy ENGINE and the DOWNLOADER, handling requests passing from the ENGINE to the DOWNLOADER, as well as responses passing from the DOWNLOADER back to the ENGINE.
You can use this middleware to do the following (a minimal sketch follows this list):
  (1) process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);
  (2) change received response before passing it to a spider;
  (3) send a new Request instead of passing received response to a spider;
  (4) pass response to a spider without fetching a web page;
  (5) silently drop some requests.

7. Spider Middlewares
Sit between the ENGINE and the SPIDERS; their main job is to handle the SPIDERS' input (responses) and output (requests).
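
Items 6 and 7 describe hook points; here is a minimal downloader-middleware sketch. The hook names process_request and process_response are Scrapy's real middleware API, while the class name and its behavior are hypothetical:

# middlewares.py -- a hypothetical downloader middleware sketch
import random
import time

class RandomDelayMiddleware(object):

    def process_request(self, request, spider):
        # called for each request on its way from the ENGINE to the DOWNLOADER;
        # returning None lets Scrapy continue processing the request
        time.sleep(random.uniform(0.1, 0.5))  # hypothetical polite delay
        return None

    def process_response(self, request, response, spider):
        # called for each response on its way from the DOWNLOADER back to the ENGINE;
        # must return a Response (or a new Request, or raise IgnoreRequest)
        spider.logger.debug('got %s for %s', response.status, request.url)
        return response

To enable it, register it in settings.py, e.g. DOWNLOADER_MIDDLEWARES = {'Spider_Project.middlewares.RandomDelayMiddleware': 543}; the number sets its order relative to the built-in middlewares.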


3. Basic Scrapy Usage (cmd)

1. Open a cmd terminal:
        -scrapy

2. Create a scrapy project
    1. Create a folder to hold the scrapy project
        -D:\Scrapy_project\

    2. Enter the command in the cmd terminal
    -scrapy startproject Spider_Project
    This generates a folder under D:\Scrapy_project\:
        -Spider_Project : the Scrapy project files

    3. After creation, the output suggests the next commands (a sketch of the generated spider follows):
        -cd Spider_Project     # switch into the scrapy project directory
                          # spider name    # target site domain
        -scrapy genspider  baidu          www.baidu.com     # create a spider program
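
For reference, scrapy genspider baidu www.baidu.com writes a spider skeleton roughly like the one below (the exact template varies slightly across Scrapy versions):

# D:\Scrapy_project\Spider_Project\Spider_Project\spiders\baidu.py (approximate)
import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'                          # spider name, used by `scrapy crawl baidu`
    allowed_domains = ['www.baidu.com']     # domain restriction
    start_urls = ['http://www.baidu.com/']  # initial request url

    def parse(self, response):
        pass  # parsing logic goes here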

3. Start the scrapy project and run a spider
    # run a spider by pointing at its file
    scrapy runspider <spider_file>.py
    # switch into the directory that holds the spider file
        -cd D:\Scrapy_project\Spider_Project\Spider_Project\spiders
        -scrapy runspider baidu.py

    # run a spider by its name
    scrapy crawl <spider_name>
        # switch into the project directory
        - cd D:\Scrapy_project\Spider_Project
        - scrapy crawl baidu
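
After startproject and genspider, the layout under D:\Scrapy_project\ looks roughly like this (standard Scrapy layout; __init__.py files omitted):

    Spider_Project\                # project root: cd here for `scrapy crawl`
        scrapy.cfg                 # project configuration file
        Spider_Project\            # the project's Python module
            settings.py            # project settings (e.g. ROBOTSTXT_OBEY)
            items.py               # item definitions
            pipelines.py           # item pipelines
            middlewares.py         # downloader / spider middlewares
            spiders\               # spider programs live here
                baidu.py           # created by `scrapy genspider`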

** Using Scrapy in PyCharm **
    1. Create a py file
        from scrapy.cmdline import execute
        execute()  # pass the scrapy command here

4. Using Scrapy in PyCharm

'''
main.py
'''
from scrapy.cmdline import execute

# write the terminal command here
# scrapy crawl baidu
# run the baidu spider
# execute(['scrapy', 'crawl', 'baidu'])

# create the lianjia spider
# execute(['scrapy', 'genspider', 'lianjia', 'lianjia.com'])

# --nolog suppresses the log output
execute('scrapy crawl --nolog lianjia'.split(' '))

'''
Using Scrapy in PyCharm
1. Create a scrapy project.
settings.py contains:
    -ROBOTSTXT_OBEY = True     # obeys the robots.txt protocol by default
Change it to:
    -ROBOTSTXT_OBEY = False
'''


'''
lianjia.py
'''
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request

# spider class that parses the responses

class LianjiaSpider(scrapy.Spider):
    name = 'lianjia'  # spider name
    # only follow urls under lianjia.com
    allowed_domains = ['lianjia.com']  # domain restriction

    # initial request urls
    start_urls = ['https://bj.lianjia.com/ershoufang/']

    def parse(self, response):  # response is the returned Response object
        # print(response)
        # print(type(response))
        # get the response text
        # print(response.text)
        # print(response.url)
        # //*[@id="position"]/dl[2]/dd/div[1]

        # get the district list urls
        area_list = response.xpath('//div[@data-role="ershoufang"]/div/a')

        # iterate over all districts
        for area in area_list:
            # print(area)
            '''
            .extract() extracts all matches
            .extract_first() extracts the first match
            '''
            # 1. district name
            area_name = area.xpath('./text()').extract_first()

            # 2. district second-level url
            area_url = 'https://bj.lianjia.com/' + area.xpath('./@href').extract_first()

            # the response for area_url will be handed to the parse_area method
            # everything that follows yield is added to the generator
            yield Request(url=area_url, callback=self.parse_area)

    def parse_area(self, response):
        # print(response)

        # get the li node of every listing (select li, not the ul itself,
        # so the loop below actually visits each listing)
        house_list = response.xpath('//ul[@class="sellListContent"]/li')
        # print(house_list)
        if house_list:
            for house in house_list:
                # listing name
                # //*[@id="leftContent"]/ul/li[1]/div/div[1]/a
                house_name = house.xpath('.//div[@class="title"]/a/text()').extract_first()
                print(house_name)

                # listing total price ('万' = ten thousand RMB); `or ''` guards a missing node
                # //*[@id="leftContent"]/ul/li[1]/div/div[4]/div[2]/div[1]/span
                house_cost = (house.xpath('.//div[@class="totalPrice"]/span/text()').extract_first() or '') + '万'
                print(house_cost)

                # listing unit price
                # //*[@id="leftContent"]/ul/li[1]/div/div[4]/div[2]/div[2]/span
                house_price = house.xpath('.//div[@class="unitPrice"]/span/text()').extract_first()
                print(house_price)

                # yield Request(url='next-level url', callback=self.parse_area)
                pass
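
Instead of printing, parse_area could yield dicts, e.g. yield {'name': house_name, 'cost': house_cost, 'price': house_price}, and let an item pipeline persist them, as section 2 describes. Below is a minimal sketch: open_spider, process_item and close_spider are Scrapy's real pipeline hooks, while the class name and output file are hypothetical.

'''
pipelines.py
'''
import json

class LianjiaPipeline(object):

    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.file = open('lianjia.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # called for every item the spider yields: persist it as one json line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        # called once when the spider finishes: release the file handle
        self.file.close()

To enable it, add ITEM_PIPELINES = {'Spider_Project.pipelines.LianjiaPipeline': 300} to settings.py.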


5. WeChat Friend Statistics

from wxpy import Bot
from pyecharts import Pie
import webbrowser

# instantiate a WeChat bot object
bot = Bot()

# get all WeChat friends
friends = bot.friends()

# labels for male / female / unknown-gender friends
attr = ['男朋友', '女朋友', '人妖']

# initialize the corresponding friend counts
value = [0, 0, 0]

# iterate over all friends and check each one's gender
for friend in friends:
    if friend.sex == 1:
        value[0] += 1
    elif friend.sex == 2:
        value[1] += 1
    else:
        value[2] += 1

# instantiate a pie chart object
pie = Pie('Forver的好友们!')

# series name (str), attribute names (list), corresponding values (list); is_label_show toggles whether labels are displayed
pie.add('', attr, value, is_label_show=True)

# render to an html file
pie.render('friends.html')

# open the html file in the browser
webbrowser.open('friends.html')