Once again we turn the magic of requests loose, this time on Romance of the Three Kingdoms (三国演义).
First, open the book's table of contents in Chrome and press F12 to inspect the page. The chapter links are laid out very regularly, and every one of them can be reached with the following CSS selector:
'#middlediv > #mulu > ul > li > a'
indexUrl = "http://www.shicimingju.com/book/sanguoyanyi.html"
base_url = 'http://www.shicimingju.com'
r = requests.get(indexUrl, proxies=proxies)  # proxies is defined in the full script below
soup = BS(r.text, "lxml")
book_lists = soup.select('#middlediv > #mulu > ul > li > a')  # one <a> per chapter
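To sanity-check the selector, print a few of the matches; each <a> should carry a relative href that gets joined with base_url later:

# Peek at the first few chapter links found by the selector
for book in book_lists[:3]:
    print book.get('href'), book.get_text()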
With the link to each chapter in hand, the real scraping can begin.
First, each chapter's title will serve as its file name. The title sits under the following CSS selector:
'#alldiv > #main > #chaptercontent > #con > h2'
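Pulling the title out and turning it into a .txt file name is then straightforward (the full script below does exactly this):

title_lists = soup.select('#alldiv > #main > #chaptercontent > #con > h2')
file_name = title_lists[0].get_text() + ".txt"  # chapter title as file name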
Then scrape the body text, which lives under this CSS selector:
'#alldiv > #main > #chaptercontent > #con > #con2 > p'
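Each <p> holds one paragraph of the chapter, so the body is just the concatenation of their text. The full script appends them in a loop; a small variant that joins with newlines would keep the paragraph breaks in the saved file:

content_lists = soup.select('#alldiv > #main > #chaptercontent > #con > #con2 > p')
# "\n".join keeps one paragraph per line instead of one long run of text
content_of_novel = "\n".join(p.get_text() for p in content_lists)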
One thing to watch out for: when handling Chinese web pages, we predictably ran into encoding problems. Some chapter pages come back as ISO-8859-1 and others as utf-8, so the two cases have to be handled separately:
if r.encoding == "ISO-8859-1":
    soup = BS(r.text.encode('ISO-8859-1', 'ignore').decode('utf-8'), "lxml")
else:
    soup = BS(r.text, "lxml")
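The underlying cause is that requests falls back to ISO-8859-1 whenever the server's Content-Type header omits a charset, even though the bytes themselves are utf-8. An alternative sketch lets requests re-guess the encoding from the page body via its apparent_encoding attribute (backed by chardet), which avoids the encode/decode round trip:

# Let the body-based guess override a missing or wrong header charset
if r.encoding == "ISO-8859-1":
    r.encoding = r.apparent_encoding
soup = BS(r.text, "lxml")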
The complete code:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import requests
import sys
import os
from bs4 import BeautifulSoup as BS

# Python 2 hack: make implicit str/unicode conversions default to utf-8
reload(sys)
sys.setdefaultencoding("utf-8")

# One .txt per chapter, saved under ./sanguoyanyi/
sub_folder = os.path.join(os.getcwd(), "sanguoyanyi")
if not os.path.exists(sub_folder):
    os.mkdir(sub_folder)

proxies = {
    "http": "http://yourproxy.com:8080/",
    "https": "https://yourproxy.com:8080/",
}

indexUrl = "http://www.shicimingju.com/book/sanguoyanyi.html"
base_url = 'http://www.shicimingju.com'

# Fetch the table of contents and collect the chapter links
r = requests.get(indexUrl, proxies=proxies)
soup = BS(r.text, "lxml")
book_lists = soup.select('#middlediv > #mulu > ul > li > a')

for book in book_lists:
    real_url = base_url + book.get('href')  # hrefs are relative, so prefix the site root
    print real_url
    r = requests.get(real_url, proxies=proxies)
    print r.encoding
    try:
        # Some chapters report ISO-8859-1; recover the utf-8 text for those
        if r.encoding == "ISO-8859-1":
            soup = BS(r.text.encode('ISO-8859-1', 'ignore').decode('utf-8'), "lxml")
        else:
            soup = BS(r.text, "lxml")
        # The chapter title becomes the file name
        title_lists = soup.select('#alldiv > #main > #chaptercontent > #con > h2')
        file_name = title_lists[0].get_text() + ".txt"
        print file_name
        filename = os.path.join(sub_folder, file_name)
        print filename
        # Concatenate every <p> of the chapter body
        content_lists = soup.select('#alldiv > #main > #chaptercontent > #con > #con2 > p')
        content_of_novel = ""
        for content in content_lists:
            content_of_novel += content.get_text()
        with open(filename, "wb") as f:
            f.write(content_of_novel)
    except UnicodeDecodeError:
        print "you need to re-check the encoding: " + real_url
        continue
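Since content_of_novel is a unicode string, writing it to a file opened in "wb" mode only works because of the setdefaultencoding hack at the top. If you would rather drop that hack, encoding explicitly at write time does the same job; a minimal variant of the write step:

# Explicit encode: no reload(sys)/setdefaultencoding needed
with open(filename, "wb") as f:
    f.write(content_of_novel.encode('utf-8'))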