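Preface
As I understand it, writing a crawler breaks down into the following steps:
- Analyze the target site
- Request a single page URL and get its source
- Extract the data
- Save the data
- Crawl the remaining pages
Now to the main topic.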
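1. Analyze the target site
- The target is Jianshu's weekly trending articles: http://www.jianshu.com/trending/weekly . The fields to extract are author, title, views, comments, likes, and rewards.
- Inspecting the page with Chrome DevTools shows that the article list is loaded via AJAX. Looking at the request pattern, the URLs run from http://www.jianshu.com/trending/weekly?page=1 (page=1) through page=5.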
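The page pattern found above can be turned into a list of URLs up front. This is a minimal sketch, assuming the five-page range seen in DevTools still holds; `BASE_URL` and `page_urls` are names made up for this example.

```python
# Hypothetical helper: build the URL of every page found during the analysis.
BASE_URL = 'http://www.jianshu.com/trending/weekly?page={}'

def page_urls(first=1, last=5):
    """Return the page URLs from `first` to `last`, inclusive."""
    return [BASE_URL.format(i) for i in range(first, last + 1)]

for url in page_urls():
    print(url)
```

Keeping the range in one place makes it easy to adjust if the site ever serves more or fewer pages.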
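2. Request a single page URL and get its source
- Set the URL (the snippets use Python 3's urllib.request, the same as the complete code)
    from urllib.request import Request, urlopen
    url = 'http://www.jianshu.com/trending/weekly?page=1'
- Set the headers (used to disguise the request as a browser; optional in this case)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
    request = Request(url=url, headers=headers)
- Send the request and receive the response
    html = urlopen(request)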
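The request above has no timeout and no error handling. One possible way to wrap it is sketched here; `build_request` and `fetch` are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')

def build_request(url):
    """Create a Request that carries a browser-like User-Agent."""
    return Request(url=url, headers={'User-Agent': USER_AGENT})

def fetch(url, timeout=10):
    """Return the response body as bytes, or None on an HTTP or URL error."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except (HTTPError, URLError) as e:
        print('Request failed for {}: {}'.format(url, e))
        return None
```

Calling `fetch('http://www.jianshu.com/trending/weekly?page=1')` would then give either the raw page bytes or `None`, so the caller can decide whether to skip the page or stop.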
3. Extract the data from the source
    import re
    from bs4 import BeautifulSoup
    # Parse the response with BeautifulSoup first so it can be queried afterwards
    bsObj = BeautifulSoup(html.read(), 'lxml')
- Pull out the markup of each article and extract the target fields (crudely written, but it works)
    items = bsObj.findAll("div", {"class": "content"})
    for item in items:
        author = item.find("a", {"class": "blue-link"}).get_text()
        title = item.find("a", {"class": "title"}).get_text()
        other = item.find("div", {"class": "meta"}).get_text()
        pattern = re.compile(r'(\d+)')  # raw string, so \d is not treated as a string escape
        content = re.findall(pattern, other)
        view = content[0]
        comment = content[1]
        like = content[2]
        money = content[3] if (len(content) == 4) else 0  # not rigorous at all; good enough for now
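The positional indexing above raises an IndexError when the meta text holds fewer numbers than expected. One way to make it more forgiving is sketched below, assuming the numbers always appear in the order views, comments, likes, rewards; `parse_counts` is a hypothetical helper, not part of the original script.

```python
import re

NUMBER = re.compile(r'\d+')

def parse_counts(meta_text, fields=4):
    """Extract up to `fields` integers from the meta text, padding with 0."""
    numbers = [int(n) for n in NUMBER.findall(meta_text)][:fields]
    return numbers + [0] * (fields - len(numbers))

# Made-up meta text with only three numbers: money falls back to 0.
view, comment, like, money = parse_counts(' 1024  36  128 ')
print(view, comment, like, money)
```

This still relies on the order of the numbers, so it is no more rigorous about which number means what; it only avoids the crash on short input.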
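4. Save the data
    import csv
    # newline='' avoids blank lines between rows on Windows; utf-8 keeps the Chinese text intact
    with open('articlesOfSevenDays.csv', 'a', newline='', encoding='utf-8') as resultFile:
        wr = csv.writer(resultFile, dialect='excel')
        wr.writerow([author, title, view, comment, like, money])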
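I ran into an encoding problem when writing the Chinese text. Under Python 2 the workaround below got rid of it. In Python 3 reload is no longer a builtin and sys.setdefaultencoding does not exist, so pass encoding='utf-8' to open() instead, as the complete code does.
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')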
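In Python 3 the encoding can be chosen when the file is opened, so no global workaround is needed. The sketch below writes to any text stream, which keeps it easy to try out in memory; `HEADER` and `write_rows` are hypothetical names, and the sample row is made-up data.

```python
import csv
import io

HEADER = ['author', 'title', 'view', 'comment', 'like', 'money']

def write_rows(stream, rows):
    """Write the header followed by one CSV line per article."""
    writer = csv.writer(stream)
    writer.writerow(HEADER)
    writer.writerows(rows)

# In-memory demo with made-up data. For a real file, pass
# open('articlesOfSevenDays.csv', 'w', newline='', encoding='utf-8').
buffer = io.StringIO()
write_rows(buffer, [['某作者', '某标题', 1024, 36, 128, 0]])
print(buffer.getvalue())
```

Opening the file once and writing all rows through the same writer also avoids reopening the CSV for every article.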
5. Crawl the remaining pages
    for i in range(1, 6):
        print("Fetching page {}...".format(i))
        url = 'http://www.jianshu.com/trending/weekly?page={}'.format(i)
        # Repeat the extraction and saving code from above
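The paging loop above can also be written so that fetching and parsing are passed in as functions, which keeps the loop itself small. This is only a sketch: `crawl`, `fetch_page`, and `parse_page` are hypothetical names, and the one-second pause between pages is an assumption about polite crawling, not something the site documents.

```python
import time

def crawl(pages, fetch_page, parse_page, delay=1.0):
    """Fetch and parse each page in turn and collect all rows."""
    rows = []
    for n, i in enumerate(pages):
        if n > 0:
            time.sleep(delay)  # pause between consecutive requests
        print('Fetching page {}...'.format(i))
        html = fetch_page(i)
        if html is None:  # skip pages that could not be downloaded
            continue
        rows.extend(parse_page(html))
    return rows

# Stand-in functions so the sketch runs without network access.
demo = crawl(range(1, 4),
             fetch_page=lambda i: 'page {}'.format(i),
             parse_page=lambda html: [[html]],
             delay=0)
print(demo)
```

With real functions plugged in, a failed page is skipped rather than ending the whole run, which is a design choice and can be changed by raising instead of continuing.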
Complete code
#!/usr/bin/env python
# coding=utf-8
from urllib.request import Request,urlopen
from bs4 import BeautifulSoup
from urllib.error import HTTPError
import re
import csv
import os
def getHTML(i):
    # Download page i and return the <div class="content"> block of every article
    url = 'http://www.jianshu.com/trending/weekly?page={}'.format(i)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
    try:
        request = Request(url=url, headers=headers)
        html = urlopen(request)
        bsObj = BeautifulSoup(html.read(), 'lxml')
        items = bsObj.findAll("div", {"class": "content"})
    except HTTPError as e:
        # Stop the whole run if the server answers with an HTTP error
        print(e)
        exit()
    return items
def getArticleInfo(items):
    # Turn each article block into [author, title, view, comment, like, money]
    articleInfo = []
    for item in items:
        author = item.find("a", {"class": "blue-link"}).get_text()
        title = item.find("a", {"class": "title"}).get_text()
        other = item.find("div", {"class": "meta"}).get_text()
        pattern = re.compile(r'(\d+)')  # raw string, so \d is not treated as a string escape
        content = re.findall(pattern, other)
        view = content[0]
        comment = content[1]
        like = content[2]
        money = content[3] if (len(content) == 4) else 0  # not very rigorous
        articleInfo.append([author, title, view, comment, like, money])
    return articleInfo
dir = "../jianshu/"
if not os.path.exists(dir):
    os.makedirs(dir)
# newline='' stops the csv module from writing blank lines between rows on Windows
csvFile = open("../jianshu/jianshuSevenDaysArticles.csv", "wt", newline='', encoding='utf-8')
writer = csv.writer(csvFile)
writer.writerow(("author", "title", "view", "comment", "like", "money"))
try:
    # Pages 1 to 5 are the ones found during the site analysis
    for i in range(1, 6):
        items = getHTML(i)
        articleInfo = getArticleInfo(items)
        for item in articleInfo:
            writer.writerow(item)
finally:
    csvFile.close()
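Scraping results
Summary
- My page-parsing skills are weak. Next I need to study regular expressions, BeautifulSoup, and lxml
- The encoding problem I ran into still needs further study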