Intro

(Optional) Create virtual environment

Python 3 is preferred. With virtualenvwrapper installed:
mkvirtualenv --python=/usr/bin/python3 python3

Run pip --version to confirm that pip belongs to the Python 3 environment; the output ends with the Python version it runs under.
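
The same check can be made from inside Python. This is a minimal sketch, assuming it is run with the interpreter of the active virtual environment; the variable names are only illustrative.

```python
import sys

# Major version of the interpreter running this script; should be 3.
major = sys.version_info.major

# Inside a virtual environment sys.prefix points at the environment, while
# sys.base_prefix points at the Python installation it was created from.
# (Assumption: venv or a recent virtualenv; very old virtualenv releases
# set sys.real_prefix instead, which the second test covers.)
in_virtualenv = (
    sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    or hasattr(sys, "real_prefix")
)

print(f"Python {major}.{sys.version_info.minor}, virtual environment: {in_virtualenv}")
```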

Steps

  • scrapy startproject name - creates the project folder
  • scrapy genspider botname url - creates a spider called botname for the given site

ROBOTSTXT_OBEY in settings.py should be True, so that only pages permitted by robots.txt are crawled. Be a good web citizen.
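
In a project created by scrapy startproject this is a single line in settings.py. A sketch of the relevant fragment:

```python
# settings.py (fragment)
# Ask Scrapy to download each site's robots.txt and skip disallowed URLs.
ROBOTSTXT_OBEY = True
```

Recent project templates appear to set this already, so it is usually enough to check that the line has not been changed to False.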

  • inside the project folder, run scrapy crawl botname
  • test selectors in the shell (see the next section)
  • run scrapy crawl botname -o xx.json (or xx.csv) to save the results and inspect them

Shell to debug and test

scrapy shell

  • test that the url is valid - fetch(url)
  • test that the html is what you expect - view(response), which opens the downloaded page in a browser

Alternative xpath testing tool
http://www.freeformatter.com/xpath-tester.html

Xpath docs

XPath queries run against a selector built from the response.

A selector, as the name suggests, selects parts of the html content:
from scrapy.selector import Selector
Since this is such a common operation, response.selector.xpath() is shortened to response.xpath()

Extra
css can also be used for selecting, via response.css(), but these notes stick to xpath; Scrapy converts css selectors to xpath internally

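//name or //* - select every instance of the html tag name, or of all tags, anywhere in the document
text() - the text content of a node, returned as a unicode string
'(//name)[1]' - xpath equivalent of Python's response.xpath('//name')[0], use either. xpath counts from 1. Without the parentheses, '//name[1]' means something else: every name that is the first name child of its parent
. - the current node. Start with . (or omit the leading //) when querying from a selector that is not the response itself, so the search stays inside that node
@ - attribute grabbing, e.g. //a/@href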

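If an element has an itemprop attribute, use it rather than class to extract the value; itemprop describes the data itself, so it tends to be more stable than styling classes.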

Tool to get an xpath fast - the XPath Helper extension for Chrome

https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl
