Day 4 of the hands-on plan: scraped 100 photos.
Here is the final result:
My code:
#!/usr/bin/env python  # tells the system which interpreter to run; its location is resolved via the PATH environment variable
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import time
import urllib.request

url = 'http://weheartit.com/inspirations/taylorswift?page='  # getting this URL wrong earlier cost me a lot of time
proxies = {"HTTP":"121.58.227.252:8080"}  # defined but never passed to requests below; see the proxy sketch after the summary
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0'}

def download(url):
    wb_data = requests.get(url, headers=headers)
    if wb_data.status_code != 200:
        return
    filename = url.split('/')[4]  # split() cuts the string on '/'; piece 4 of the image URL becomes the file name
    target = r'E:\PycharmProjects\homework4\imgs\{}.jpg'.format(filename)  # raw string so the backslashes stay literal
    with open(target, 'wb') as fs:
        fs.write(wb_data.content)
    print('%s -> %s' % (url, target))  # print source URL -> saved path; %s placeholders work like C's printf

'''
def dl_image(url):
    urllib.request.urlretrieve(url, path + url.split('/')[2] + url.split('.')[-1])
    print('Done')
'''

def get_img(url, data=None):
    wb_data = requests.get(url, headers=headers)  # request with the custom headers (the proxies dict above is not actually passed in)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    imgs = soup.select('#main-container > div > div > div > div > div > a > img')  # CSS selector copied from the browser's inspector
    if data is None:
        for img in imgs:
            data = img.get('src')
            print(data)
            download(data)

def get_more_pages(start, end):
    for one in range(start, end):
        get_img(url + str(one))
        time.sleep(2)

get_more_pages(1, 10)  # range() excludes the end value, so this covers pages 1 through 9
Summary
- Handling URLs: more than once I picked the wrong URL and the errors that followed wasted time
- Proxies took forever; none of the ones I tried worked and they kept throwing errors, so in the end I got around it with a VPN (a proxy sketch follows this list)
- Reading and writing files with with ... as (a small sketch after the list shows it together with split)
- Splitting strings with split
- Asynchronous loading: watch the page's XHR requests in the browser inspector to see how more content gets loaded
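
On the proxy point: in my script the proxies dict is defined but never handed to requests, and requests matches proxies by the URL's lowercase scheme, so the uppercase "HTTP" key would have been ignored anyway. A minimal sketch of how it could be wired in, reusing the same proxy address purely as an example (no guarantee that address still works):

import requests

# lowercase scheme key; requests looks the proxy up by the request URL's scheme
proxies = {'http': 'http://121.58.227.252:8080'}

resp = requests.get('http://weheartit.com/inspirations/taylorswift?page=1',
                    proxies=proxies, timeout=10)
print(resp.status_code)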
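
And for the with ... as and split bullets, a tiny self-contained sketch; the URL and the bytes written are made up for illustration:

img_url = 'http://example.com/images/123456789/superthumb.jpg'  # made-up URL, same shape as a thumbnail src
filename = img_url.split('/')[4]  # split on '/' -> ['http:', '', 'example.com', 'images', '123456789', 'superthumb.jpg']

with open('{}.jpg'.format(filename), 'wb') as fs:  # the file is closed automatically when the block ends
    fs.write(b'fake image bytes')  # in the real script this is wb_data.content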