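I'm very fond of the Zhuangzi (《庄子》), especially two of its stories, "dumb as a wooden chicken" (呆若木鸡) and "Cook Ding butchers an ox" (庖丁解牛), which reveal three levels of attainment in how to conduct oneself and handle affairs. I wanted to download a plain-text copy to reread soon. A search showed that the edition on 华语网 (thn21.com) is of high quality, but it is split into one page per chapter. Rather than copy the chapters by hand one at a time, I decided to scrape them with Python.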
1. Scrape the link to each chapter from the index page (https://www.thn21.com/wen/Famous/5609.html).
import re

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.thn21.com'

names = []  # chapter titles
urls = []   # absolute chapter URLs

req = requests.get(url=BASE_URL + '/wen/Famous/5609.html')
req.encoding = 'gb2312'
html = req.text
div_bf = BeautifulSoup(html, "html.parser")
# Take the part of the page title before "简介" as the book name.
novelname = re.search(r'(.+?)简介', div_bf.title.string).group(1)
# Chapter links all share this href prefix.
a = div_bf.body.select('[href^="/wen/famous/hdnj/zuangzi"]')
for each in a:
    names.append(each.string)  # chapter title
    # Each href already starts with "/", so BASE_URL must not end with one.
    urls.append(BASE_URL + each.get('href'))  # chapter link
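Building URLs by string concatenation is sensitive to stray or missing slashes, and calling `.group(1)` on a failed `re.search` raises an `AttributeError`. The following is a minimal sketch of two more defensive helpers using only the standard library. The helper names `absolute_url` and `extract_novel_name`, the chapter file name `zuangzi1.html`, and the sample titles are assumptions made for illustration; they are not taken from the site.

```python
import re
from urllib.parse import urljoin

INDEX_URL = 'https://www.thn21.com/wen/Famous/5609.html'


def absolute_url(href, page_url=INDEX_URL):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


def extract_novel_name(title, default='novel'):
    """Return the text before "简介" in title, or default if it is absent."""
    match = re.search(r'(.+?)简介', title or '')
    return match.group(1) if match else default


# Hypothetical inputs, for illustration only.
print(absolute_url('/wen/famous/hdnj/zuangzi1.html'))
print(extract_novel_name('庄子简介'))
print(extract_novel_name('untitled page'))
```

`urljoin` resolves the href against the page it was found on, so it works whether or not the href begins with a slash.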
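2. Fetch the content of each chapter link in turn.
Note that the chapter text contains traditional Chinese characters, so it has to be decoded as gb18030 rather than gb2312.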
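The reason for the codec choice can be seen in a small offline experiment. This is only a sketch: the sample string `莊子` (the traditional form of 庄子) is my own example, not text taken from the site. GB2312 covers simplified characters only, while gb18030 is a superset that also covers traditional ones.

```python
original = '莊子'  # sample text with one traditional character (an assumption)

# Bytes as a gb18030-encoded page would carry them.
raw = original.encode('gb18030')

# Decoding with the matching codec recovers the text.
decoded_gb18030 = raw.decode('gb18030')

# The gb2312 codec has no mapping for 莊, so with errors='ignore'
# the character is lost or garbled instead of raising an error.
decoded_gb2312 = raw.decode('gb2312', 'ignore')

print(decoded_gb18030 == original)
print(decoded_gb2312 == original)
```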
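# Open the output file once and append every chapter to it.
with open(novelname + '.txt', 'a', encoding='utf-8') as f:
    for name, url in zip(names, urls):
        f.write(' ' + name + '\n')  # chapter heading
        req = requests.get(url=url)
        # gb18030 covers the traditional characters; undecodable bytes are dropped.
        html = req.content.decode('gb18030', 'ignore')
        bf = BeautifulSoup(html, "html.parser")
        texts = ""
        # Keep only <p> tags whose content is a single text node
        # (.string is None for tags with several children).
        for p in bf.body.find_all('p'):
            if p.string:
                texts += " " + p.string + "\n"
        f.write(texts)
        f.write('\n\n')  # blank lines between chapters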
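The formatting of each chapter can also be pulled out into a pure function, which makes it easy to check without touching the network. This is a sketch under my own assumptions: the name `format_chapter` and the sample paragraphs are hypothetical, and the layout (a heading line, one indented line per paragraph, then two blank lines) simply mirrors the loop above.

```python
def format_chapter(title, paragraphs):
    """Build the text for one chapter: heading, indented paragraphs, blank lines."""
    lines = [' ' + title + '\n']
    for text in paragraphs:
        if text:  # skip None and empty strings, as the loop above does
            lines.append(' ' + text + '\n')
    lines.append('\n\n')
    return ''.join(lines)


# Hypothetical chapter data, for illustration only.
sample = format_chapter('逍遥游', ['北冥有鱼', None, '其名为鲲'])
print(sample)
```

With such a helper, the scraping loop would only need to collect `p.string` for each paragraph and write the returned string to the file.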