By default, a Scrapy spider needs two methods:
- start_requests(self): (can be replaced by a start_urls = [...] class attribute)
- parse(self, response):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "test1"  # spider name

    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    # def start_requests(self):  # equivalent to the start_urls attribute above
    #     urls = [
    #         'http://quotes.toscrape.com/page/1/',
    #         'http://quotes.toscrape.com/page/2/',
    #     ]
    #     for url in urls:
    #         yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # response.css('div.quote') selects the same nodes as the XPath below
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),  # .extract_first() is an older alias of .get()
                'author': quote.xpath('.//small[@class="author"]/text()').get(),
            }
        # Crawl further pages:
        # next_page = response.xpath('//li[@class="next"]/a/@href').get()  # e.g. /page/2/
        # if next_page is not None:  # response.follow is a shortcut for building the next Request
        #     yield response.follow(next_page, callback=self.parse)
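A note on the commented-out pagination: response.follow accepts a relative URL taken straight from the page, whereas scrapy.Request needs an absolute one. A minimal runnable sketch of that pattern (the spider name quotes_paging is made up for illustration):

import scrapy

class QuotesPagingSpider(scrapy.Spider):
    name = "quotes_paging"  # hypothetical name, for illustration only
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {'text': quote.xpath('./span[@class="text"]/text()').get()}
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page is not None:
            # response.follow resolves the relative href against the page URL;
            # the scrapy.Request equivalent would be:
            # yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
            yield response.follow(next_page, callback=self.parse)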
Running from the console:
Method 1: scrapy runspider <spider_file> (runs a standalone spider file, no Scrapy project scaffolding required)
Method 2:
- 1. Create a project:
    scrapy startproject myproject
- 2. Save the code above as /myproject/myproject/spiders/first_scrapy.py (create the file yourself, or have Scrapy generate a stub with scrapy genspider <spider_name> <domain_name> and then overwrite it with the code above; the resulting project layout is sketched after this list)
    scrapy genspider test1 test.com
- 3. Run the spider: scrapy crawl <spider_name>
    scrapy crawl test1
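For orientation, this is the directory tree scrapy startproject myproject generates (Scrapy's standard layout; first_scrapy.py is the file you add yourself in step 2):

myproject/
    scrapy.cfg            # deploy configuration
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            first_scrapy.py   # your spider goes here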
Saving the output:
- If you used Method 1: scrapy runspider <spider_file> -o test.jl (-o appends, and the .jl (JSON Lines) format stays valid across repeated runs, unlike a plain .json export)
    scrapy runspider first_scrapy.py -o test.jl
- If you used Method 2: scrapy crawl <spider_name> -o test.jl
    scrapy crawl test1 -o test.jl
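A JSON Lines export holds one JSON object per line, which is exactly why appending with -o never corrupts it. A small sketch for reading test.jl back in Python:

import json

# Each line is an independent JSON object, so the file parses cleanly
# even after several appended crawl runs.
with open('test.jl', encoding='utf-8') as f:
    quotes = [json.loads(line) for line in f]

print(len(quotes), 'items')
print(quotes[0]['text'], '-', quotes[0]['author'])

On Scrapy 2.0 or newer, -O (capital) overwrites the output file instead of appending.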