This exercise scrapes the iPad product listing pages on JD.com (京东).
items.py
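Scrape each product's name, shop, price, and promotion tag.
import scrapy


class JingdongItem(scrapy.Item):
    name = scrapy.Field()   # product title
    shop = scrapy.Field()   # shop (seller) name
    icon = scrapy.Field()   # promotion tag shown on the product card
    price = scrapy.Field()  # listed price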
jd_spider.py
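The URLs are built from a pattern found by inspecting the site: only the page parameter changes from one results page to the next, and it increases in steps of 2.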
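As a small standalone illustration of that pattern (separate from the spider itself), the page URLs can be generated in a loop over odd page values. This is only a sketch: the helper name `build_search_urls` and its parameters are hypothetical, and the query parameters are the ones that appear in the spider's `start_urls`.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.jd.com/Search"


def build_search_urls(keyword="ipad", first_page=1, last_page=99):
    """Return one search URL per results page, stepping the page value by 2."""
    urls = []
    for page in range(first_page, last_page + 1, 2):
        # urlencode keeps the insertion order of the dict keys.
        query = urlencode({"keyword": keyword, "enc": "utf-8", "page": page})
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = build_search_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Building the list up front keeps every entry a plain string, which is what `start_urls` expects.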
import scrapy

# Assumes the standard Scrapy project layout, with items.py one package above spiders/.
from ..items import JingdongItem


class JdSpiderSpider(scrapy.Spider):
    name = 'jd_spider'
    # The search pages are served from search.jd.com, so allow the whole jd.com domain.
    allowed_domains = ['jd.com']
    # One URL per page value: 1, 3, 5, ..., 99.
    start_urls = [
        'https://search.jd.com/Search?keyword=ipad&enc=utf-8&page={}'.format(i)
        for i in range(1, 101, 2)
    ]

    def parse(self, response):
        products = response.xpath('//li[@class="gl-item"]/div')
        for product in products:
            item = JingdongItem()
            item['name'] = product.xpath('.//div[@class="p-name p-name-type-2"]/a/em/text()').extract_first()
            item['shop'] = product.xpath('.//div[@class="p-shop"]/span/a/text()').extract_first()
            item['icon'] = product.xpath('.//div[@class="p-icons"]/i[@class="goods-icons J-picon-tips J-picon-fix"]/text()').extract_first()
            item['price'] = product.xpath('.//div[@class="p-price"]/strong/i/text()').extract_first()
            yield item
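Note that `extract_first()` returns `None` when an XPath matches nothing, so some fields of a yielded item may be empty. Below is a minimal sketch of tidying the values afterwards, assuming the item can be read like a plain dict. The helpers `clean_text`, `parse_price`, and `clean_item` are hypothetical (they are not part of Scrapy or of the original project), and the sample values are made up for illustration.

```python
def clean_text(value):
    """Strip surrounding whitespace; map None or empty strings to None."""
    if value is None:
        return None
    value = value.strip()
    return value or None


def parse_price(value):
    """Convert a price string to float, or return None if it is unusable."""
    text = clean_text(value)
    if text is None:
        return None
    try:
        return float(text)
    except ValueError:
        return None


def clean_item(item):
    """Return a new dict with normalized text fields and a numeric price."""
    cleaned = {key: clean_text(item.get(key)) for key in ("name", "shop", "icon")}
    cleaned["price"] = parse_price(item.get("price"))
    return cleaned


# Made-up sample standing in for one scraped item.
sample = {"name": "  iPad 10.2  ", "shop": None, "icon": "", "price": "2499.00"}
print(clean_item(sample))
```

In a real project this kind of logic would more naturally live in an item pipeline, so that the spider stays focused on extraction.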
- The rest of the crawler code is available on GitHub.