use case - generic spiders provide useful methods for common crawling actions, such as following all links on a site based on certain rules, crawling from Sitemaps, or parsing an XML/CSV feed
CrawlSpider
rules - a list of Rule objects that define the crawling behavior
parse_start_url - a method that can be overridden to parse the initial responses; it must return an item object, a Request object, or an iterable containing either
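A minimal sketch of overriding parse_start_url; the spider name and the quotes.toscrape.com practice site are assumptions, not from these notes:

```python
from scrapy.spiders import CrawlSpider


class StartUrlSpider(CrawlSpider):
    name = "start_url_example"                    # hypothetical spider name
    start_urls = ["https://quotes.toscrape.com"]  # illustrative practice site

    def parse_start_url(self, response):
        # must return an item, a Request, or an iterable containing either
        yield {"start_page_title": response.css("title::text").get()}
```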
Rules
scrapy.spiders.Rule
can declare multiple rules for followed links; when rules is written as a tuple, always add a trailing , (required when there is only a single rule)

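A hedged sketch of a rules declaration with a single Rule; the spider name, domain, and the /tag/ URL pattern are assumptions:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RulesSpider(CrawlSpider):
    name = "rules_example"                        # hypothetical name
    allowed_domains = ["quotes.toscrape.com"]     # hypothetical domain
    start_urls = ["https://quotes.toscrape.com"]

    rules = (
        # extract links matching /tag/, parse them with parse_page, keep following
        Rule(LinkExtractor(allow=r"/tag/"), callback="parse_page", follow=True),
    )  # the comma after the Rule keeps `rules` a tuple even with a single Rule

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```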
- link_extractor - defines how links will be extracted from each crawled page; allow/deny take regular-expression patterns that URLs must match to be extracted or ignored (allow_domains/deny_domains do the same at the domain level)
- callback - the method called with the response of each extracted link; if no callback is specified, follow defaults to True. Avoid using parse as the callback, since CrawlSpider reserves parse to implement its rule logic
- follow - a boolean; if set to True, links will be followed from each response extracted with this rule. Scrapy filters out duplicate links by default
beware that start_urls should not contain a trailing slash: a start URL without the slash works, while the same URL with a trailing slash does not (the sketch after this list uses the slash-free form)
- process_links - a callable (or the name of a spider method) used to filter or modify the links extracted by link_extractor before they are followed
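A sketch of process_links used as a filter; the method name drop_query_strings, the /page/ pattern, and the start URL are assumptions. The start URL is written without a trailing slash, per the warning above:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FilteredSpider(CrawlSpider):
    name = "filtered_example"                     # hypothetical name
    start_urls = ["https://quotes.toscrape.com"]  # no trailing slash, per the note above

    rules = (
        Rule(
            LinkExtractor(allow=r"/page/"),
            callback="parse_page",
            follow=True,
            process_links="drop_query_strings",   # spider method that filters extracted links
        ),
    )

    def drop_query_strings(self, links):
        # receives the list of Link objects extracted by the rule;
        # return only the ones that should actually be followed
        return [link for link in links if "?" not in link.url]

    def parse_page(self, response):
        yield {"url": response.url}
```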

