Installation (Python 3.4)
We need the Scrapy library (v1.3.3) along with PyMongo (v3.4.0) for storing the data in MongoDB; these were the latest versions when this post was written. You also need to install MongoDB itself, which is not covered here.
$ pip install Scrapy==1.3.3
$ pip freeze > requirements.txt
$ pip install pymongo==3.4.0
$ pip freeze > requirements.txt
Start the Project
$ scrapy startproject stack
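This generates a project skeleton roughly like the following (the exact files can vary slightly between Scrapy versions):

stack/
    scrapy.cfg
    stack/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py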
Specify Data
Those familiar with Django will notice that Scrapy Items are declared similarly to Django Models, except that Scrapy Items are much simpler, as there is no concept of different field types.
In the items.py file:

# stack/items.py

from scrapy.item import Item, Field


class StackItem(Item):
    title = Field()
    url = Field()
Create the Spider
Create a file called stack_spider.py in the "spiders" directory. Use Chrome's Inspect tool to copy the XPath of the scraped element; you can verify it interactively before writing the spider, as shown below.
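A quick way to sanity-check the XPath is the Scrapy shell; the selector below is the same one used in the spider that follows:

$ scrapy shell "http://stackoverflow.com/questions?pagesize=50&sort=newest"
>>> # should return the title of the first (newest) question on the page
>>> response.xpath('//div[@class="summary"]/h3/a[@class="question-hyperlink"]/text()').extract()[0]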
# stack/spiders/stack_spider.py

from scrapy import Spider
from scrapy.selector import Selector

from stack.items import StackItem


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')
        for question in questions:
            item = StackItem()
            item['title'] = question.xpath(
                'a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath(
                'a[@class="question-hyperlink"]/@href').extract()[0]
            yield item
Test
$ scrapy crawl stack -o items.json -t json
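To actually land the scraped items in MongoDB (the reason PyMongo was installed above), you can add an item pipeline. This is a minimal sketch, not a definitive implementation; MONGO_URI and MONGO_DATABASE are assumed custom setting names, and "questions" is an assumed collection name, none of them Scrapy built-ins:

# stack/pipelines.py

import pymongo


class MongoDBPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are custom settings (assumed names)
        # that you would define in stack/settings.py.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'stackoverflow'),
        )

    def open_spider(self, spider):
        # One client per crawl, opened when the spider starts.
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Convert the Scrapy Item to a plain dict and insert it.
        self.db['questions'].insert_one(dict(item))
        return item

Enable the pipeline in stack/settings.py:

ITEM_PIPELINES = {'stack.pipelines.MongoDBPipeline': 300}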