1. Project Plan
Requirements: 1. crawl all articles on Jianshu; 2. crawl each article's tags; 3. crawl each article's author information, view count, word count, comment count, and like count; 4. record each article's URL.
Approach: 1. follow the recommended articles on each detail page, analyze the article URLs, and use the CrawlSpider class to crawl the whole site; 2. analyze how the tags, author information, view count, word count, comment count, and like count are loaded on the detail page, and choose a suitable extraction method for each.
2. Creating the Project
Open cmd, change into your workspace directory, and run scrapy startproject jianshu_spider to create the Scrapy project; then run cd jianshu_spider to enter the project. Since CrawlSpider will be used to crawl Jianshu's articles, run scrapy genspider -t crawl js jianshu.com to create the CrawlSpider-based spider.
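For reference, scrapy genspider -t crawl generates a spider skeleton roughly like the one below (the exact template varies slightly between Scrapy versions); its placeholder rule and callback are replaced with real ones in section 5.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class JsSpider(CrawlSpider):
    name = 'js'
    allowed_domains = ['jianshu.com']
    start_urls = ['http://jianshu.com/']

    # Placeholder rule from the template; rewritten in section 5.
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item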
3. Project Setup
Edit the settings.py file.
Set the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36',
}
Tell Scrapy not to obey robots.txt:
ROBOTSTXT_OBEY = False
In the project folder, create a start.py file so the spider can be relaunched easily during development:
from scrapy import cmdline
cmdline.execute("scrapy crawl js".split())
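Running python start.py from the project root is then equivalent to running scrapy crawl js on the command line.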
4. The items.py File
Note: the article_id field was only added to the item during page analysis, once it turned out to be needed.
import scrapy


class JianshuItem(scrapy.Item):
    # Title
    title = scrapy.Field()
    # Author avatar
    avatar = scrapy.Field()
    # Author ID
    author = scrapy.Field()
    # Publish time
    pub_time = scrapy.Field()
    # Article URL
    origin_url = scrapy.Field()
    # Article id
    article_id = scrapy.Field()
    # Article content
    content = scrapy.Field()
    # Word count
    word_count = scrapy.Field()
    # View count
    view_count = scrapy.Field()
    # Comment count
    comment_count = scrapy.Field()
    # Like count
    like_count = scrapy.Field()
    # Article tags
    subjects = scrapy.Field()
5. The Spider
Example: https://www.jianshu.com/p/379c0c04b838?utm_campaign=maleskine&utm_content=note&utm_medium=pc_all_hots&utm_source=recommendation
Observation shows that article_id=379c0c04b838 consists only of the digits 0-9 and the lowercase letters a-z and is exactly 12 characters long, so the crawl rule is set to allow=r'.*/p/[0-9a-z]{12}.*'. Matched URLs are passed to parse_detail for parsing, and follow=True makes the spider keep following matching links from every page it visits, so the whole site gets crawled.
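Before wiring the pattern into the spider, it can be sanity-checked with the re module; a minimal sketch, where the two sample URLs are made-up examples chosen only to show one match and one non-match:

import re

# The same pattern used by the LinkExtractor rule below.
pattern = re.compile(r'.*/p/[0-9a-z]{12}.*')

urls = [
    'https://www.jianshu.com/p/379c0c04b838?utm_source=recommendation',  # 12-character id: matches
    'https://www.jianshu.com/p/abc123',                                  # id too short: no match
]
for url in urls:
    print(url, '->', bool(pattern.match(url)))

With the pattern confirmed, the full spider looks like this: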
import json

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from jianshu_spider.items import JianshuItem


class JsSpider(CrawlSpider):
    name = 'js'
    allowed_domains = ['jianshu.com']
    start_urls = ['http://jianshu.com/']

    # Crawl rule: follow every article detail page.
    rules = (
        Rule(LinkExtractor(allow=r'.*/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
    )

    # Parse a detail page.
    def parse_detail(self, response):
        title = response.xpath("//h1[@class='title']/text()").get()
        avatar = response.xpath("//div[@class='author']/a/img/@src").get()
        author = response.xpath("//div[@class='author']//span[@class='name']//text()").get()
        pub_time = response.xpath("//span[@class='publish-time']/text()").get().strip("*")  # drop any trailing '*'
        origin_url = response.url
        article_id = origin_url.split("?")[0].split("/")[-1]
        content = response.xpath("//div[@class='show-content-free']").get()
        # Word count, view count, comment count and like count live in a JSON <script> block.
        json_str = response.xpath("//script[@type='application/json']/text()").get()
        article_data = json.loads(json_str)
        word_count = article_data['note']['public_wordage']
        view_count = article_data['note']['views_count']
        comment_count = article_data['note']['comments_count']
        like_count = article_data['note']['likes_count']
        subjects = ",".join(response.xpath("//div[@class='include-collection']/a/div//text()").getall())
        # Hand the item off to the pipeline.
        item = JianshuItem(
            title=title,
            avatar=avatar,
            pub_time=pub_time,
            origin_url=origin_url,
            article_id=article_id,
            author=author,
            content=content,
            word_count=word_count,
            view_count=view_count,
            comment_count=comment_count,
            like_count=like_count,
            subjects=subjects,
        )
        yield item
6. Using Selenium + Chromedriver
While working out the parsing, you can use scrapy shell in the terminal to try extractions interactively; run
scrapy shell https://www.jianshu.com/p/379c0c04b838
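Inside the shell, the XPath expressions from parse_detail can be tried one at a time. A minimal sketch of such a session, using the same selectors as the spider above (they may need adjusting if Jianshu changes its page layout):

# Run these lines inside scrapy shell after it has fetched the page.
response.xpath("//h1[@class='title']/text()").get()           # article title
response.xpath("//span[@class='publish-time']/text()").get()  # publish time
# The tag list comes back empty here, which hints that the tags are loaded asynchronously.
response.xpath("//div[@class='include-collection']/a/div//text()").getall()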
Further analysis of the page shows that the article tags are loaded into the page asynchronously, so Selenium + Chromedriver is used to open the page in a real browser and let it finish loading before extraction.
Approach:
- In middlewares.py, use Selenium + Chromedriver to intercept the requests issued by the engine: when a request arrives, open the URL in the browser, let the page finish loading, and return the rendered response to the spider, so the spider can extract the asynchronously loaded content.
- On some pages the tag list is collapsed and has to be expanded by clicking "show more"; such interactions are easy to drive with Selenium. Example code:
from selenium import webdriver
from scrapy.http.response.html import HtmlResponse
import time


class SeleniumDownloadMiddleware(object):
    def __init__(self):
        # Path to the local chromedriver executable.
        self.driver = webdriver.Chrome(executable_path=r'D:\SoftWare\chromedriver\chromedriver.exe')

    def process_request(self, request, spider):
        self.driver.get(request.url)
        time.sleep(0.5)
        try:
            # Keep clicking "show more" until the button can no longer be found or clicked.
            while True:
                show_more = self.driver.find_element_by_class_name("show-more")
                show_more.click()
        except Exception:
            # Nothing (left) to expand on this page.
            pass
        # Build a Scrapy response from the fully rendered page and hand it back to the spider.
        source = self.driver.page_source
        response = HtmlResponse(url=self.driver.current_url, body=source, encoding='utf-8', request=request)
        return response
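Note that find_element_by_class_name was deprecated and later removed in Selenium 4. If a newer Selenium is installed, the element lookup inside process_request would use a By locator instead; a sketch of the replacement lines, assuming the same "show-more" class name:

from selenium.webdriver.common.by import By

# Inside process_request, replacing the find_element_by_class_name call:
show_more = self.driver.find_element(By.CLASS_NAME, "show-more")
show_more.click()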
7. Saving the Data
In the pipeline, connect to the MySQL database and write each item into it.
import pymysql


class JianshuPipeline(object):
    # Set up the database connection.
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': 'root',
            'database': 'jianshu',
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None

    # Expose the SQL statement as a property for convenient reuse.
    @property
    def sql(self):
        if not self._sql:
            self._sql = '''
                insert into article(id, title, content, avatar, pub_time, origin_url, author, article_id)
                values(null, %s, %s, %s, %s, %s, %s, %s)
            '''
        return self._sql

    # Write one item to the database.
    def process_item(self, item, spider):
        self.cursor.execute(self.sql, (item['title'], item['content'], item['avatar'], item['pub_time'],
                                       item['origin_url'], item['author'], item['article_id']))
        self.conn.commit()
        return item
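Both the pipeline above and the Twisted version below assume that an article table already exists in the jianshu database. A hypothetical one-off setup script is sketched here; the column names match the insert statements, but the types and lengths are assumptions and may need adjusting:

import pymysql

# One-off setup: create the article table the pipelines write into (schema is an assumption).
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='root', database='jianshu', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute('''
        create table if not exists article(
            id int primary key auto_increment,
            title varchar(255),
            content longtext,
            avatar varchar(255),
            pub_time varchar(64),
            origin_url varchar(255),
            author varchar(128),
            article_id varchar(32),
            word_count int,
            view_count int,
            comment_count int,
            like_count int,
            subjects varchar(255)
        ) default charset=utf8
    ''')
conn.commit()
conn.close()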
The inserts can also be done asynchronously with Twisted's adbapi connection pool to improve write throughput. Example code:
import pymysql
from twisted.enterprise import adbapi
from pymysql import cursors


class JianshuTwistPipeline(object):
    # Set up the database connection pool.
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': 'root',
            'database': 'jianshu',
            'cursorclass': cursors.DictCursor
        }
        self.dbpool = adbapi.ConnectionPool('pymysql', **dbparams)
        self._sql = None

    # Expose the SQL statement as a property for convenient reuse.
    @property
    def sql(self):
        if not self._sql:
            self._sql = '''
                insert into article(id, title, content, avatar, pub_time, origin_url, author, article_id,
                                    word_count, view_count, comment_count, like_count, subjects)
                values(null, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            '''
        return self._sql

    # Two parts: schedule the insert asynchronously, and log an error if the write fails.
    def process_item(self, item, spider):
        defer = self.dbpool.runInteraction(self.insert_item, item)
        defer.addErrback(self.handle_error, item, spider)

    # Write one item to the database.
    def insert_item(self, cursor, item):
        cursor.execute(self.sql, (item['title'], item['content'], item['avatar'], item['pub_time'],
                                  item['origin_url'], item['author'], item['article_id'],
                                  item['word_count'], item['view_count'], item['comment_count'],
                                  item['like_count'], item['subjects']))

    # Error logging.
    def handle_error(self, error, item, spider):
        print('=' * 10 + "error" + '=' * 10)
        print(error)
        print('=' * 10 + "error" + '=' * 10)
8. settings.py Configuration
In settings.py, enable the pipeline and middleware written above, and turn on a download delay:
DOWNLOAD_DELAY = 2
SPIDER_MIDDLEWARES = {
    'jianshu_spider.middlewares.UserAgentSpiderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
    'jianshu_spider.middlewares.SeleniumDownloadMiddleware': 543,
}
ITEM_PIPELINES = {
    # 'jianshu_spider.pipelines.JianshuPipeline': 300,
    # Use the Twisted asynchronous pipeline instead.
    'jianshu_spider.pipelines.JianshuTwistPipeline': 300,
}
9. Crawl Results
Use Navicat, a graphical MySQL client, to inspect the crawled data.
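Alternatively, a quick sanity check can be run straight from Python; a minimal sketch that reuses the connection parameters assumed above:

import pymysql

# Count the rows written so far.
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='root', database='jianshu', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute('select count(*) from article')
    print('articles saved:', cursor.fetchone()[0])
conn.close()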