Takeaways
- To download images, load the urllib.request library and use the urllib.request.urlretrieve() function; the target directory has to be created before downloading (see the sketch after this list)
- The local filename can be built by slicing the server-side filename out of the image URL
- The URLs of dynamically loaded content can be found by watching the Network/XHR requests in the browser's developer tools
- Wrapping the steps in functions makes the code easier to reuse
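A minimal sketch of the first two points: create the save directory up front, then name the local file after the server-side filename. The image URL here is a made-up placeholder; only the save directory matches the code further down.

import os
import urllib.request
from urllib.parse import urlparse

save_dir = 'C:/Users/tanghx/Desktop/pic/'    # same directory as in the code below
os.makedirs(save_dir, exist_ok=True)         # create it first, otherwise urlretrieve() fails

img_url = 'http://data.whicdn.com/images/123456789/original.jpg'   # hypothetical image URL
filename = os.path.basename(urlparse(img_url).path)                # reuse the server-side filename
urllib.request.urlretrieve(img_url, os.path.join(save_dir, filename))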
My code
from bs4 import BeautifulSoup
import requests
import urllib.request
import time

# First page plus the XHR URLs the site requests while scrolling (pages 2-20)
url = 'http://weheartit.com/inspirations/taylorswift'
urls = ['http://weheartit.com/inspirations/taylorswift?scrolling=true&page={}&before=256131287'.format(i) for i in range(2, 21)]
urls.insert(0, url)

# Target directory; it must exist before urlretrieve() is called
save_path = 'C:/Users/tanghx/Desktop/pic/'

def get_img_src(url, data=None):
    # Fetch one page and collect the src of every preview image on it
    web_data = requests.get(url)
    soup = BeautifulSoup(web_data.text, 'lxml')
    imgs = soup.select('div.entry-preview > a > img')
    photo_links = []
    for img in imgs:
        photo_links.append(img.get('src'))
    return photo_links

def retrieve_pics(photo_links):
    # Name the local file by slicing the server-side filename out of the URL
    for item in photo_links:
        urllib.request.urlretrieve(item, save_path + item[-24:-15] + item[-4:])

n = 0
for single_url in urls:
    retrieve_pics(get_img_src(single_url))
    n = n + 1
    print('Page', n, 'retrieved')
    time.sleep(2)
Results
The downloaded images