I. Requirement:
Use Python to download the images on a Neihan Duanzi (neihan8.com) web page to the local disk.
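II. Implementation:
1. Get the URL of the page to crawl
2. Set the headers
3. Request the page and parse the HTML into an element tree that XPath can query
4. Extract the image addresses and download the images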
III. Walkthrough, using the page in the figure below as the example:
1. Get the URL of the page to crawl:
url = "http://www.neihan8.com/gaoxiaomanhua/index_2.html"
2. Set the headers:
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"}
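The snippets in this article use urllib2, which exists only in Python 2. As a rough sketch, assuming Python 3, the same request object could be built with urllib.request. Nothing is sent over the network here; the code only constructs the request and inspects it.

```python
from urllib.request import Request

url = "http://www.neihan8.com/gaoxiaomanhua/index_2.html"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"}

# Building the Request does not contact the server; urlopen(request) would.
request = Request(url, headers=headers)

# urllib stores header names in capitalized form, so the key is "User-agent".
user_agent = request.get_header("User-agent")
print(request.full_url)
print(user_agent)
```

Setting a browser-like User-Agent is the whole point of step 2: without it, urllib identifies itself with its own default agent string, which some sites may reject.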
3. Request the page and parse the HTML into an element tree:
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request).read()
content = etree.HTML(response)  # etree must be imported first: from lxml import etree
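4. Extract the image addresses and download the images. Inspecting the page in the figure above gives the XPath that selects the image addresses:
linklist = content.xpath('/html/body/div[@class="main wrap"]//div[@class="left"]/div[@class="pic-column-list mt10"]/div/a/img/@src')
PS: linklist holds every value matched by this XPath, so to download the images you take the links out one at a time: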
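To see how such a path picks out the image addresses without touching the network, here is a small offline sketch. It is only an illustration under loud assumptions: the markup in SAMPLE is a made-up, well-formed stand-in for the real page, and it uses the standard library's xml.etree.ElementTree rather than lxml, so the path is written relative to the html root and the src attribute is read from each matched img element instead of being selected with /@src.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the list page (not the real markup).
SAMPLE = """
<html><body>
  <div class="main wrap">
    <div class="left">
      <div class="pic-column-list mt10">
        <div><a href="/a.html"><img src="http://example.com/img/0001.jpg"/></a></div>
        <div><a href="/b.html"><img src="http://example.com/img/0002.jpg"/></a></div>
      </div>
    </div>
    <div class="right"><img src="http://example.com/ad.png"/></div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Same steps as the XPath in the article, relative to the <html> root.
# ElementTree's limited XPath cannot select "@src" directly, so the
# attribute is read from each matched <img> element instead.
path = ("body/div[@class='main wrap']//div[@class='left']"
        "/div[@class='pic-column-list mt10']/div/a/img")
linklist = [img.get("src") for img in root.findall(path)]
print(linklist)
```

Note that a predicate such as @class="main wrap" compares the whole attribute string, so the image outside the "left" column is not matched, and a page whose class attribute differs even slightly would match nothing.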
for link in linklist:
    image_request = urllib2.Request(link)
    response = urllib2.urlopen(image_request).read()
    fileName = link[-10:]  # use the last 10 characters of the link as the file name
    with open(fileName, "wb") as f:
        f.write(response)
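The download code names each file after the last 10 characters of its link, which assumes every file name has that length. As an alternative sketch (not the approach used in this article), the name could be taken from the last path segment of the URL using only the standard library. This assumes Python 3; filename_from_url is a hypothetical helper and the example.com link is made up.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(link):
    """Return the last path segment of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(link).path)


# Made-up link, used only to compare the two naming approaches.
link = "http://example.com/uploads/2017/abcd1234.jpg"
print(link[-10:])               # fixed-width slice of the last 10 characters
print(filename_from_url(link))  # last path segment of the URL
```

With a 12-character file name like this one, the fixed-width slice drops the first two characters, while the path-based helper keeps the full name.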
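The steps above walk through the process one piece at a time. All of the code was typed by hand, and this is my first article, so please forgive the rough edges. The complete code follows: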
# Python 2 only: urllib2 and the print statement do not exist in Python 3
import urllib2
from lxml import etree


class Spider:
    def __init__(self):
        self.pageNum = 2     # index of the list page to crawl
        self.switch = True   # not used in this version

    def loadImage(self):
        # Fetch one list page and download every image it links to
        url = "http://www.neihan8.com/gaoxiaomanhua/index_" + str(self.pageNum) + ".html"
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"}
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request).read()
        content = etree.HTML(response)
        linklist = content.xpath('/html/body/div[@class="main wrap"]//div[@class="left"]/div[@class="pic-column-list mt10"]/div/a/img/@src')
        for image_link in linklist:
            print "downLoading..."
            self.writeImage(image_link)

    def writeImage(self, link_address):
        # Download one image and save it under the last 10 characters of its URL
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"}
        download_request = urllib2.Request(link_address, headers=headers)
        response = urllib2.urlopen(download_request).read()
        fileName = link_address[-10:]
        with open(fileName, "wb") as f:
            f.write(response)
        print "downLoad---FINISH"


if __name__ == "__main__":
    spider = Spider()
    spider.loadImage()
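Spider builds its URL from pageNum but fetches only that one page. As a small sketch of how the URLs for several pages might be generated with the same index_N.html pattern: page_url is a hypothetical helper, the page range is arbitrary, and whether those pages actually exist on the site is not checked here.

```python
BASE = "http://www.neihan8.com/gaoxiaomanhua/index_"


def page_url(page_num):
    """Build the list-page URL the same way loadImage does."""
    return BASE + str(page_num) + ".html"


# Pages 2 to 4 are an arbitrary example range.
urls = [page_url(n) for n in range(2, 5)]
for u in urls:
    print(u)
```

Each of these URLs could then go through the same request, parse, and download steps as the single page in the script.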