Scraping 1,000 Books with Scrapy

The result looks like this:

From left to right the fields are: the book's UPC code, name, category, stock, price, rating, number of reviews, and description.

The target site is http://books.toscrape.com/

We start with scrapy shell to run a quick extraction experiment and work out the page structure:

scrapy shell http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html

This drops us into the interactive shell:


In [2]: sel = response.css('div.col-sm-6.product_main')

In [4]: sel.xpath('./h1/text()').extract_first()

Out[4]: u'A Light in the Attic'  # name

In [5]: sel.css('p.price_color::text').extract_first()

Out[5]: u'\xa351.77'  # price (\xa3 is the pound sign)
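The price comes back with the pound sign (u'\xa3') still attached. A small helper along these lines (my own sketch, not taken from the post's code) could normalize it, for example later in a pipeline:

```python
def parse_price(raw):
    """Convert the raw price string Scrapy returns (e.g. u'\xa351.77')
    into a float; \xa3 is the pound sign the site prefixes prices with."""
    return float(raw.lstrip(u'\xa3'))

print(parse_price(u'\xa351.77'))  # 51.77
```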

In [30]: response.xpath('//*[@id="content_inner"]/article/p/text()').extract_first()

Out[30]: u"It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more"

# description

In [31]: response.xpath('//*[@id="default"]/div/div/ul/li[3]/a/text()').extract_first()

Out[31]: u'Poetry'

# category

In [46]: table2 = response.css("table.table.table-striped")

In [48]: table2.xpath("(.//tr)[1]/td/text()").extract_first()

Out[48]: u'a897fe39b1053632'  # UPC code

In [49]: table2.xpath("(.//tr)[last()-1]/td/text()").extract_first()

Out[49]: u'In stock (22 available)'  # stock
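The stock field arrives as a sentence rather than a number. If we only want the count, a regex helper like the following (my own sketch, not from the original post) would extract it:

```python
import re

def parse_stock(raw):
    """Pull the available count out of strings like
    'In stock (22 available)'; returns 0 if no number is found."""
    m = re.search(r'\((\d+) available\)', raw)
    return int(m.group(1)) if m else 0

print(parse_stock(u'In stock (22 available)'))  # 22
```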

In [56]: from scrapy.linkextractors import LinkExtractor

In [57]: le = LinkExtractor(restrict_css='article.product_pod')

In [59]: le.extract_links(response)

Out[59]:

[Link(url='http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', text=u'', fragment='', nofollow=False),

Link(url='http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html', text=u'', fragment='', nofollow=False),

Link(url='http://books.toscrape.com/catalogue/soumission_998/index.html', text=u'', fragment='', nofollow=False), ... (remaining links omitted)]

Every book has its own detail page, so we need to collect the URLs of all the book pages and then crawl each one for its details.
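One detail worth noting: the listing pages use relative hrefs, so URLs extracted by hand (LinkExtractor resolves them for us) have to be joined against the current page URL. A quick illustration using Python 3's urllib.parse, where 'page-2.html' stands in for what `response.css('li.next a::attr(href)').extract_first()` would return (an assumed value, for illustration):

```python
from urllib.parse import urljoin

# Relative href from the "next" button, joined against the listing page URL.
base = 'http://books.toscrape.com/catalogue/page-1.html'
next_href = 'page-2.html'
print(urljoin(base, next_href))
# http://books.toscrape.com/catalogue/page-2.html
```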

First, create the project:

scrapy startproject toscrappe_book

Then generate a spider inside the project, restricting the crawl domain to the target site:

scrapy genspider books books.toscrape.com

Design plan:

1. Define the items based on the page analysis above.

2. Write the spider based on the same analysis:

(1) extract the required fields from a single book page;

(2) after finishing one page, move on to the next target page.

3. Configure the relevant options in settings.

4. Clean up the awkward fields in pipelines.

The full code has been uploaded to Git.
