Note
The 2.csv used in the code below is a CSV file holding part of my gene ID list (OMIM entry numbers), one ID per line.
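For reference, the file is expected to look like the sketch below, one entry number per line. The numbers shown are placeholders, not my actual list:

100100
100200
100300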
import os
import time

import requests

# Request headers copied straight out of Postman; User-Agent and Host are
# what matter for getting the page, the Postman-Token is just leftover noise.
headers = {
    'User-Agent': "PostmanRuntime/7.20.1",
    'Accept': "*/*",
    'Cache-Control': "no-cache",
    'Postman-Token': "4847ea30-a4b1-4f98-bd3d-c5f41c8ed792,8906cc9c-db81-4f9d-9a3e-4e89ffac7404",
    'Host': "omim.org",
    'Accept-Encoding': "gzip, deflate",
    'Connection': "keep-alive",
}

# Read the gene IDs, one per line.
with open('2.csv', 'r', encoding='utf-8') as f:
    ids = f.readlines()

# Make sure the output directory exists before the loop starts.
os.makedirs('./htmldata/', exist_ok=True)

for gene_id in ids:
    gene_id = gene_id.strip()
    url = "https://omim.org/entry/" + gene_id
    querystring = {"search": gene_id, "highlight": gene_id}
    response = requests.get(url, headers=headers, params=querystring)
    time.sleep(1)  # throttle requests so OMIM does not ban the IP
    data = response.content.decode('utf-8')
    if 'Error 403' in data:
        # The server has started refusing us; stop rather than hammer it.
        print('Error 403')
        break
    # Save the raw page HTML under its entry number.
    filename = './htmldata/' + gene_id + '.html'
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(data)
    print(gene_id, url)
The crawler here is the simplest kind: all I need is to grab the full page HTML and write it out as an .html file.
The sleep(1) inside is there so that OMIM's server does not ban my IP. Depending on your situation you can go through a proxy instead: write a helper that picks a random proxy, or one that sleeps a random number of seconds. There is no need to hard-code a one-second sleep like I did; a sketch of both ideas follows.
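Here is a minimal sketch of those two helpers. It assumes you have your own pool of proxy addresses; PROXY_POOL, random_sleep, and fetch_with_random_proxy are hypothetical names, and the addresses below are placeholders:

import random
import time

import requests

# Placeholder proxy pool -- substitute proxies you actually control.
PROXY_POOL = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]

def random_sleep(low=1.0, high=4.0):
    # Sleep a random number of seconds instead of a fixed 1s,
    # so the request pattern looks less mechanical.
    time.sleep(random.uniform(low, high))

def fetch_with_random_proxy(url, headers):
    # Route the request through a randomly chosen proxy.
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy})

Swapping random_sleep() in place of the time.sleep(1) in the loop above is usually enough on its own; the proxy helper only becomes worthwhile if your IP actually gets blocked.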