The final result looks like this:
[Image: Qiushibaike crawl results (糗事百科爬取结果.png)]
Prerequisites:
Python 3.6
Scrapy 2.1.0
Create the Scrapy project:
scrapy startproject scrapy_demo
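The project structure is as follows (the listing was taken after the spider first.py had been added and the project run once, which is why the __pycache__ directories appear):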
scrapy_demo
├── scrapy.cfg
└── scrapy_demo
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-37.pyc
    │   └── settings.cpython-37.pyc
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        ├── __pycache__
        │   ├── __init__.cpython-37.pyc
        │   └── first.cpython-37.pyc
        └── first.py
Modify the project settings file:
cat settings.py
# Change the default User-Agent string to the following
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
# Ignore robots.txt rules
ROBOTSTXT_OBEY = False
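Because the crawl above ignores robots.txt, it is common to throttle it by hand. The fragment below is only a sketch of optional additions to settings.py. `DOWNLOAD_DELAY` and `LOG_LEVEL` are standard Scrapy settings, but the values chosen here are assumptions and not part of the original project.

```python
# Optional additions to settings.py (values are illustrative assumptions).

# Wait about one second between requests to the same site.
DOWNLOAD_DELAY = 1

# Show only warnings and errors, so the crawl output is easier to read.
LOG_LEVEL = 'WARNING'
```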
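Write the spider logic:
cat first.py
# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        all_data = []
        # Each child <div> of the main column holds one post.
        div_list = response.xpath('//div[@class="col1 old-style-col1"]/div')
        self.logger.debug('matched %d post blocks', len(div_list))
        for div in div_list:
            # .get() returns None instead of raising IndexError when nothing matches.
            author = (div.xpath('./div[1]/a[2]/h2/text()').get() or '').strip()
            # The post body may be split across several text nodes, so join them.
            content = div.xpath('./a[1]/div/span[1]/text()').getall()
            content = ''.join(content).strip(' \n\t')
            dic = {
                'author': author,
                'content': content
            }
            all_data.append(dic)
        # Scrapy accepts an iterable of dicts as the items produced by parse().
        return all_data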
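To see how the two relative XPath expressions in parse() pick out the author and the content, the following sketch runs the same paths against a small, made-up page. It assumes a hypothetical, simplified markup (`SAMPLE`) and uses the standard library's `xml.etree.ElementTree` in place of Scrapy's selectors so that it runs without a network connection. ElementTree has no `text()` step, so `itertext()` stands in for it here; unlike `text()`, it also collects text from nested elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mimics the nesting the spider expects.
SAMPLE = """
<html><body>
<div class="col1 old-style-col1">
  <div>
    <div>
      <a href="#">avatar</a>
      <a href="#"><h2>
        Alice
      </h2></a>
    </div>
    <a href="#"><div><span>


First line.<br/>Second line.

</span></div></a>
  </div>
  <div>
    <div>
      <a href="#">avatar</a>
      <a href="#"><h2>Bob</h2></a>
    </div>
    <a href="#"><div><span>Another post.</span></div></a>
  </div>
</div>
</body></html>
"""


def parse_posts(markup):
    """Return one {'author', 'content'} dict per post block in the markup."""
    root = ET.fromstring(markup.strip())
    posts = []
    # Same outer path as the spider: every child <div> of the main column.
    for div in root.findall('.//div[@class="col1 old-style-col1"]/div'):
        # Author: <h2> inside the second <a> of the first inner <div>.
        author = div.find('./div[1]/a[2]/h2').text.strip()
        # Content: first <span> under the first <a> that is a direct child.
        span = div.find('./a[1]/div/span[1]')
        # A <br/> splits the text into several pieces, so join before stripping.
        content = ''.join(span.itertext()).strip(' \n\t')
        posts.append({'author': author, 'content': content})
    return posts
```

Calling `parse_posts(SAMPLE)` returns one dict per post. The first post shows why the spider joins the extracted fragments: the line break splits the body into two text pieces that only form the full content once concatenated.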
Run the project and export the results:
# Change into the project root directory
cd scrapy_demo
# Export the scraped items to a CSV file
scrapy crawl first -o qiubai.csv
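Once the crawl has finished, the exported file can be checked with a few lines of standard-library Python. This is a minimal sketch, assuming the CSV starts with a header row naming the `author` and `content` columns and is UTF-8 encoded. `load_items` is a hypothetical helper name, not part of Scrapy.

```python
import csv


def load_items(path):
    """Read an exported CSV into a list of dicts keyed by the header row."""
    # newline='' lets the csv module handle line breaks inside quoted fields,
    # which matters because post content can span several lines.
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))
```

For example, `load_items('qiubai.csv')` run from the project root would return one dict per scraped post, which makes it easy to spot empty authors or truncated content.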