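While downloading papers through the library, I saw a notice saying that Springer is offering free downloads of some of its books during the pandemic. According to my university library, the offer is valid until July 1, 2020. The download address is: https://link.springer.com/search?query=&facet-content-type=%22Book%22&showAll=false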
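More than 17,000 books can be downloaded for free, and a further 230,000 books can be read online. I clicked through a few at random and found that both PDF and EPUB downloads are offered. So I wrote a crawler and downloaded some books on Python, R, bioinformatics, and biology. The code is as follows: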
# Import libraries (the "lxml" parser used below also requires the lxml package to be installed)
import requests
from bs4 import BeautifulSoup
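# Build the search-result page URLs by hand; vTag is the search term, vNum is the number of result pages, i is the page number
# Example URL: https://link.springer.com/search/page/1?facet-content-type=%22Book%22&showAll=false&query=Python
def fTag(vTag, vNum):
    vUrls = []
    # range() excludes its upper bound, so use vNum + 1 to cover pages 1..vNum
    for i in range(1, vNum + 1):
        vUrls.append("https://link.springer.com/search/page/" + str(i) + "?facet-content-type=%22Book%22&showAll=false&query=" + vTag)
    return vUrls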
# Collect the URL of each book listed on one search-result page
def fPageUrls(vUrl):
    r = requests.get(vUrl)
    vHtml = r.text
    vSoup = BeautifulSoup(vHtml, "lxml")
    vPages = vSoup.select("h2")
    vPageUrls = []
    for vPage in vPages:
        try:
            vPageUrls.append("https://link.springer.com" + vPage.find("a")["href"])
        except (TypeError, KeyError):
            # Skip any h2 that has no link or whose link has no href
            continue
    return vPageUrls
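# Download the PDF and the EPUB of each book. There is currently one bug: books with the same title overwrite earlier ones. It would not be hard to fix, but I was too lazy to do it
def fDownLoad(vPageUrls):
    for PageUrl in vPageUrls:
        r = requests.get(PageUrl)
        vHtml = r.text
        vSoup = BeautifulSoup(vHtml, "lxml")
        # Strip surrounding whitespace so the title can be used as a file name
        vName = vSoup.find("h1").text.strip()
        print(vName)
        vUrls = vSoup.select("div.sticky-banner__container")
        for vUrl in vUrls:
            try:
                vDownload = vUrl.find_all("a")
                # j tracks which format comes next: 0 = PDF, 1 = EPUB (relies on the order of the links on the page)
                j = 0
                for i in vDownload:
                    vBookUrl = "https://link.springer.com" + i.get("href")
                    r = requests.get(vBookUrl)
                    if j == 0:
                        with open(vName + ".pdf", "wb") as f:
                            f.write(r.content)
                        j = 1
                    elif j == 1:
                        # If you only need one format, comment out the file-writing part
                        with open(vName + ".epub", "wb") as f:
                            f.write(r.content)
                        j = 0
            except Exception:
                # Skip this banner if a link is missing or the download fails
                continue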
# Program entry point (the parameters are the search term and how many of the first result pages to download)
def main(vTag, vNum):
    vUrls = fTag(vTag, vNum)
    for vUrl in vUrls:
        vPageUrls = fPageUrls(vUrl)
        fDownLoad(vPageUrls)
# Example run
main("with R", 2)
# main("Python", 3)
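The comment on fDownLoad mentions that books sharing a title overwrite each other. Below is a minimal sketch of one way that could be avoided, by picking a file name that is not taken yet. The helper names safe_name and unique_path are my own and hypothetical, not part of the script above, and the set of characters replaced is only an assumption about what a file system might reject.

```python
import os
import re


def safe_name(title):
    """Replace characters that are commonly invalid in file names.

    Hypothetical helper: the character set below is an assumption,
    not an exhaustive list for every file system.
    """
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', "_", title).strip()
    return cleaned or "untitled"


def unique_path(title, ext, folder="."):
    """Return a path inside `folder` that does not exist yet.

    If "Title.pdf" is already taken, try "Title (2).pdf",
    "Title (3).pdf", and so on.
    """
    base = safe_name(title)
    candidate = os.path.join(folder, base + ext)
    counter = 2
    while os.path.exists(candidate):
        candidate = os.path.join(folder, "{} ({}){}".format(base, counter, ext))
        counter += 1
    return candidate
```

With something like this, the two open() calls in fDownLoad could use unique_path(vName, ".pdf") and unique_path(vName, ".epub") instead of vName + ".pdf" and vName + ".epub", so a second book with the same title would be saved under a numbered name rather than replacing the first.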
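I have already downloaded a lot of e-books. To be honest, with this many I probably won't have time to read them, so I'm just stockpiling them for now.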