1. Prerequisites
- Python 3.6
- requests
- BeautifulSoup
- selenium
- chromedriver
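2. What selenium does
Jandan has an anti-scraping mechanism: the image URLs are obfuscated. You can see them in the browser's developer tools (F12), but BeautifulSoup cannot extract them from the HTML returned by a plain request. I originally set out to find a way to decode them, and stumbled across selenium along the way. selenium is a web automation testing tool that can simulate a user operating a browser. Because the page is loaded in a real browser, the image URLs can be read directly from the rendered page.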
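To make the difference concrete, here is a minimal sketch that uses only the standard library. Both HTML snippets are hypothetical stand-ins, not Jandan's actual markup: `RAW_HTML` mimics what a plain HTTP request might return before any script runs, and `RENDERED_HTML` mimics what the page might look like after the browser has run its scripts, which is the state that selenium's `driver.page_source` exposes. The helper names `LinkCollector` and `find_image_links` are also made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup, for illustration only.
RAW_HTML = '<span class="img-hash">aGFzaGVkLXZhbHVl</span>'
RENDERED_HTML = (
    '<a class="view_img_link" href="//example.com/large/abc.jpg">'
    '[view original]</a>'
)


class LinkCollector(HTMLParser):
    """Collect href values of <a class="view_img_link"> elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "view_img_link" in classes and attrs.get("href"):
            self.links.append(attrs["href"])


def find_image_links(html):
    """Return every view_img_link href found in the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


print(find_image_links(RAW_HTML))       # [] : nothing to find before scripts run
print(find_image_links(RENDERED_HTML))  # ['//example.com/large/abc.jpg']
```

The parser is the same in both calls; only the input differs. That is the whole reason for putting a browser in the loop.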
3. Downloading chromedriver
From within mainland China: https://npm.taobao.org/mirrors/chromedriver/
From outside mainland China: https://sites.google.com/a/chromium.org/chromedriver/downloads
4. Source code
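import os
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

Directory = 'ooxx/'  # folder where downloaded images are saved
base_url = "http://jandan.net/ooxx/page-"
path = r"D:\chrome\chromedriver.exe"  # raw string so the backslashes are not treated as escapes
# executable_path is the Selenium 3 style; newer Selenium 4 releases take a Service object instead
driver = webdriver.Chrome(executable_path=path)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
}
img_url = []
urls = [base_url + "{}#comments".format(i) for i in range(80, 85)]


def getImg():
    """Download every URL collected in img_url into Directory."""
    os.makedirs(Directory, exist_ok=True)  # create the folder if it is missing
    for n, url in enumerate(img_url, start=1):
        print('Image ' + str(n), end='')
        with open(Directory + url[-15:], 'wb') as f:
            f.write(requests.get(url, headers=headers).content)
        print('...OK!')


def getImgUrl(url):
    """Load one page in Chrome and collect its non-GIF image links."""
    driver.get(url)
    data = driver.page_source  # HTML after the browser has run the page's scripts
    soup = BeautifulSoup(data, "html.parser")  # parse the page
    images = soup.select("a.view_img_link")  # locate the image links
    for i in images:
        z = i.get('href')
        if z is None or 'gif' in z:
            continue  # skip GIFs and anchors without an href
        http_url = "http:" + z
        img_url.append(http_url)
        print(http_url)


if __name__ == "__main__":
    for url in urls:
        getImgUrl(url)
    driver.quit()  # close the browser once all pages have been read
    getImg()
    print("")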
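The script builds each image URL by prepending `"http:"` to the link and takes the last 15 characters of the URL as the file name. Both steps assume the links always look the same. As a more defensive alternative, here is a small sketch using only the standard library. The helper names `normalize` and `filename_for`, the `example.com` URLs, and the `"unnamed"` fallback are assumptions made for illustration, not part of the original project.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

PAGE_URL = "http://jandan.net/ooxx/page-80#comments"


def normalize(href, page_url=PAGE_URL):
    """Turn a protocol-relative or relative href into an absolute URL."""
    return urljoin(page_url, href)


def filename_for(url):
    """Use the last path segment as the file name, ignoring any query string."""
    name = posixpath.basename(urlsplit(url).path)
    return name or "unnamed"  # hypothetical fallback for URLs without a file part


url = normalize("//example.com/large/abc123.jpg")
print(url)                # http://example.com/large/abc123.jpg
print(filename_for(url))  # abc123.jpg
```

`urljoin` fills in the scheme from the page URL, so a link that already starts with `http://` or `https://` passes through unchanged instead of gaining a second prefix.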
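Project repository: https://github.com/aszt/jiandan-gril
Note: the source in the repository includes the latest version, which supports Chrome v62-64.
PS: Don't scrape Jandan too aggressively; it puts a lot of load on their servers. Once you have practiced, move on to other large sites.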
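One simple way to go easier on the server is to pause between page loads. The sketch below is only an illustration of that idea: the `throttled` helper and the two-second default are assumptions, not something the original project does, and the right delay depends on the site.

```python
import time


def throttled(items, delay_seconds=2.0, sleep=time.sleep):
    """Yield items one by one, pausing between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)  # wait before every item except the first
        yield item


pages = ["http://jandan.net/ooxx/page-{}#comments".format(i) for i in range(80, 85)]
for page in throttled(pages, delay_seconds=0.01):  # tiny delay just for the demo
    print("would fetch", page)
```

In the script above, the same wrapper could go around `urls` in the main loop and around `img_url` inside `getImg()`. The `sleep` parameter exists so the pause can be swapped out when testing.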