This post uses Scrapy to crawl posts in Jianshu's "Story" collection, extracting each post's title, URL, like count, author, comment count, comment URL, reward count, and more.
1. After analyzing the structure of the HTML pages, write the corresponding spider. The spider code is below:
# -*- coding: utf-8 -*-
import scrapy


class JiansSpider(scrapy.Spider):
    name = 'jians'
    allowed_domains = ['jianshu.com']

    def start_requests(self):
        # Request the first 49 pages of the "Story" collection,
        # ordered by most recently commented.
        for i in range(1, 50):
            url = "https://www.jianshu.com/c/fcd7a62be697?order_by=commented_at&page={}".format(i)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        conList = response.xpath("//ul[@class='note-list']/li")
        for li in conList:
            # Build a fresh dict per post; reusing one dict across the loop
            # would overwrite fields of items already yielded.
            item = {}
            item["title"] = li.xpath(".//a[@class='title']/text()").extract_first()
            item["href"] = li.xpath(".//a[@class='title']/@href").extract_first()
            if item["href"] is not None:
                item["href"] = "https://www.jianshu.com" + item["href"]
            # extract() returns a list (possibly empty), never None;
            # the reward count is the last text node inside the span.
            diamond = li.xpath(".//span[@class='jsd-meta']/text()").extract()
            item["diamond"] = diamond[-1].strip() if diamond else None
            item["author"] = li.xpath(".//a[@class='nickname']/text()").extract_first()
            # The comment count is the second text node of the comments link.
            comments_num = li.xpath(".//div[@class='meta']/a[2]/text()").extract()
            item["comments_num"] = comments_num[-1].strip() if len(comments_num) >= 2 else None
            item["comments_href"] = li.xpath(".//div[@class='meta']/a[2]/@href").extract_first()
            if item["comments_href"] is not None:
                item["comments_href"] = "https://www.jianshu.com" + item["comments_href"]
            item["like_num"] = li.xpath(".//div[@class='meta']/span[2]/text()").extract_first()
            yield item
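The spider yields plain dicts, which Scrapy accepts as items directly. If you prefer declared fields (the usual Scrapy convention, and what a project's items.py is for), a minimal sketch could look like the following; the class name JianshuItem is an assumption, and the field names simply mirror the keys yielded above:

# items.py -- hypothetical sketch; not part of the original post
import scrapy

class JianshuItem(scrapy.Item):
    title = scrapy.Field()
    href = scrapy.Field()
    diamond = scrapy.Field()
    author = scrapy.Field()
    comments_num = scrapy.Field()
    comments_href = scrapy.Field()
    like_num = scrapy.Field()

Either way, the crawl is started from the project root with scrapy crawl jians.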
2. Next, configure settings.py, e.g. USER_AGENT, LOG_LEVEL, the robots.txt policy, DOWNLOAD_DELAY, and so on. The code is as follows:
LOG_LEVEL = "WARNING"
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
ROBOTSTXT_OBEY = False   # do not honor robots.txt
DOWNLOAD_DELAY = 3       # wait 3 seconds between requests to avoid hammering the site
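A fixed DOWNLOAD_DELAY works, but Scrapy also ships an AutoThrottle extension that adapts the delay to the server's response times. A minimal sketch of the relevant settings (the values are illustrative, not from the original configuration):

# settings.py -- optional alternative to a fixed delay
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3            # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10             # ceiling when the server responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average concurrent requests per remote host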
3. Enable the pipeline in settings.py, then configure pipelines.py to connect to the database and store the scraped items in MongoDB. The code is as follows:
from pymongo import MongoClient


class JianshuPipeline(object):
    def open_spider(self, spider):
        # Open one connection when the spider starts instead of
        # reconnecting for every single item.
        self.connection = MongoClient("localhost", 27017)
        self.collection = self.connection.mydb.story

    def process_item(self, item, spider):
        # insert_one() replaces the insert() method deprecated in pymongo 3.
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.connection.close()
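Note that Scrapy only calls this pipeline if it is registered in settings.py. A one-line sketch, assuming the project module is named jianshu:

# settings.py -- the number (0-1000) sets the pipeline's execution order
ITEM_PIPELINES = {
    'jianshu.pipelines.JianshuPipeline': 300,
}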
Finally, verify that the scraped data is complete and accurate.
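One quick way to check is to query MongoDB directly with pymongo; a minimal sketch, reusing the database and collection names from the pipeline above:

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
story = client.mydb.story
print(story.count_documents({}))               # total number of stored posts
print(story.find_one())                        # spot-check one document
print(story.count_documents({"title": None}))  # posts whose title failed to parse
client.close()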

PS: Since only a small amount of data was scraped, no anti-scraping IP bans were encountered, so no IP proxy was configured.
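If bans did become an issue on a larger crawl, the simplest hook Scrapy provides is the proxy key in request.meta. A minimal, hypothetical sketch (the proxy address is a placeholder, and a real setup would rotate over a proxy pool):

import scrapy

class ProxySpiderSketch(scrapy.Spider):
    # Hypothetical variant of the spider above; only the meta line is new.
    name = 'jians_proxy'

    def start_requests(self):
        url = "https://www.jianshu.com/c/fcd7a62be697?order_by=commented_at&page=1"
        # 'http://127.0.0.1:8888' is a placeholder proxy address.
        yield scrapy.Request(url, callback=self.parse,
                             meta={'proxy': 'http://127.0.0.1:8888'})

    def parse(self, response):
        pass  # same parsing logic as in the spider above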