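I recently used Python to scrape data from medical web pages. The main tools are selenium and geckodriver.exe; make sure both are installed.
First, open the page you want to scrape and view its source (F12).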
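The source contains <div class = "search-list">, so we can first use driver.find_element_by_class_name("search-list") to get the main content of the page. (The find_element_by_* helpers were removed in Selenium 4.3; there the same call is written driver.find_element(By.CLASS_NAME, "search-list").)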
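Then look at the code behind the "下一页" (next page) link. Its target URL can be read as follows (Selenium 4 syntax, with By imported from selenium.webdriver.common.by):
detail_url = driver.find_element(By.LINK_TEXT, "下一页").get_attribute('href')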
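To make the two lookups concrete without starting a browser, here is a minimal sketch that uses the standard library's html.parser instead of Selenium. The HTML fragment, the class PageProbe and its attribute names are all made up for illustration; the real site's markup is not shown in this post and may differ.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the real site's markup is an assumption.
SAMPLE_HTML = """
<div class="search-list">
  <div class="item">Result A</div>
  <div class="item">Result B</div>
</div>
<a href="/search?page=2">下一页</a>
"""


class PageProbe(HTMLParser):
    """Collects text inside div.search-list and the href of the "下一页" link."""

    def __init__(self):
        super().__init__()
        self.depth = 0            # > 0 while inside the search-list div (counts nested divs)
        self.texts = []           # text fragments found inside the search-list div
        self.current_href = None  # href of the <a> that is currently open, if any
        self.next_href = None     # href of the link whose text is "下一页"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self.depth > 0 or "search-list" in classes:
                self.depth += 1
        elif tag == "a":
            self.current_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
        elif tag == "a":
            self.current_href = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.depth > 0:
            self.texts.append(text)
        if self.current_href is not None and text == "下一页":
            self.next_href = self.current_href


probe = PageProbe()
probe.feed(SAMPLE_HTML)
print(probe.texts)
print(probe.next_href)
```

In the raw HTML the href may be relative, as in this sample. Selenium's get_attribute('href') normally returns the resolved absolute URL, which is why the result can be passed straight to driver.get.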
Here is the code:
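import codecs

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


def get_text():
    driver = webdriver.Firefox()  # needs geckodriver
    urls = ["url1",
            "url2",
            "url3"
            ]
    ref = ['url1', 'url2', 'url3']  # file name prefix for each start URL
    for i in range(len(urls)):
        driver.get(urls[i])
        count = 0
        while True:
            count += 1
            # find_element(By..., ...) also works on Selenium 4.3+, where find_element_by_* is gone
            content = driver.find_element(By.CLASS_NAME, "search-list")
            # Save the visible text of the result list on this page
            with codecs.open(ref[i] + '_page_' + str(count) + '.txt', 'w', encoding='utf-8') as f:
                f.write(content.text)
            try:
                detail_url = driver.find_element(By.LINK_TEXT, "下一页").get_attribute('href')
            except NoSuchElementException:
                break  # no "下一页" link: this was the last page
            if not detail_url:
                break  # link present but without a usable href
            driver.get(detail_url)
    driver.quit()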
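The paging loop above can also be written independently of the browser, which makes it easy to try out and to guard against a "下一页" link that points back to a page already visited. The following is only a sketch under assumptions: crawl_pages, fetch, max_pages and the FAKE_SITE data are hypothetical names introduced here, not part of Selenium or of the original script.

```python
def crawl_pages(start_url, fetch, max_pages=1000):
    """Follow "next page" links from start_url and return the text of each page.

    fetch(url) is a hypothetical callable returning (page_text, next_url),
    where next_url is None on the last page.
    """
    pages = []
    seen = set()  # URLs already visited, to stop on cyclic links
    url = start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        text, url = fetch(url)
        pages.append(text)
    return pages


# Fake three-page site standing in for the browser (made-up data).
FAKE_SITE = {
    "p1": ("first page", "p2"),
    "p2": ("second page", "p3"),
    "p3": ("third page", None),
}

pages = crawl_pages("p1", FAKE_SITE.__getitem__)
print(pages)
```

With Selenium, fetch would be a small function that calls driver.get(url), reads the text of the "search-list" element, and returns it together with the href of the "下一页" link, or None when that link is missing.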