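Web Crawler 01

My grasp of regular expressions from the earlier lessons is still shaky. I understand the principle, but I am not yet sure how to write a pattern that matches the information I want.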

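import urllib.request, re  # import the required modules

data=urllib.request.urlopen("url").read().decode("utf-8")  # open the URL and read the page

req=".*?"  # the regular expression that matches the information you want

res=re.compile(req).findall(data)  # pick the matching information out of the page content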

for i in range(0,len(res)):
    print(res[i])  # shows every item that was scraped
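The matching step is easier to see on a concrete string. Below is a minimal sketch, assuming a made-up fragment of HTML in place of the downloaded page; the markup and the pattern are illustrative and not taken from any real site.

```python
import re

# Sample page content standing in for the decoded result of urlopen(...).read()
data = (
    '<a class="title" href="/p/1">First post</a>'
    '<a class="title" href="/p/2">Second post</a>'
)

# The parentheses mark the parts to capture. .*? is non-greedy, so each match
# stops at the nearest closing quote or tag instead of swallowing the rest.
pattern = r'<a class="title" href="(.*?)">(.*?)</a>'

res = re.compile(pattern).findall(data)

for href, text in res:
    print(href, text)
```

With two capture groups, findall() returns a list of tuples, one tuple per match.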

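Writing the results to a file:

with open("path","w") as f:  # fill in the file path; on Windows, backslashes in the path must be escaped as \\
    for i in range(0,len(res)):  # iterate over the results
        f.write(res[i]+"\n")  # write each item; "\n" starts a new line after every item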
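A runnable version of the file-writing step might look like the sketch below. The list `res` and the temporary output path are stand-ins chosen for illustration; a real script would use the findall() result and its own path.

```python
import os
import tempfile

res = ["line one", "line two", "line three"]  # stand-in for the findall() result

# A temporary directory keeps the sketch self-contained. On Windows a real path
# would be written as "C:\\data\\result.txt" or r"C:\data\result.txt".
out_path = os.path.join(tempfile.mkdtemp(), "result.txt")

with open(out_path, "w", encoding="utf-8") as f:
    for item in res:
        f.write(item + "\n")  # one match per line

# Read the file back to check what was written.
with open(out_path, encoding="utf-8") as f:
    written = f.read().splitlines()
```

Passing encoding="utf-8" explicitly avoids surprises when the scraped text contains Chinese characters.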

Basics of urllib:

import urllib.request

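urllib.request.urlretrieve(url, local_file_path)  # downloads a resource from the web straight to a local file

urllib.request.urlcleanup()  # clears the temporary files (cache) left behind by urlretrieve()

There is also info(), a method of the object returned by urlopen(), which gives the response metadata (the headers)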
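The three calls can be tried without a network connection by pointing them at a local file through a file:// URL. This is only a sketch: the helper names `download` and `page_info` are hypothetical, and a real crawler would pass an http(s) address instead.

```python
import os
import pathlib
import tempfile
import urllib.request


def download(url, dest):
    """Save url to dest and return (local path, response headers)."""
    local_path, headers = urllib.request.urlretrieve(url, dest)
    urllib.request.urlcleanup()  # remove temporary files left by urlretrieve
    return local_path, headers


def page_info(url):
    """Return the response metadata (headers) of url."""
    with urllib.request.urlopen(url) as resp:
        return resp.info()


# Demo on a local file so that no network is needed.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "source.txt")
with open(src, "w", encoding="utf-8") as f:
    f.write("hello crawler")
src_url = pathlib.Path(src).as_uri()

saved_path, saved_headers = download(src_url, os.path.join(workdir, "copy.txt"))
info = page_info(src_url)
print(info)
```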

Encoding Chinese text for use in a URL:

keyword="彭坤"

keyword=urllib.request.quote(keyword)
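As a small sketch of what the encoding does: quote() percent-encodes the UTF-8 bytes of the keyword, and unquote() reverses it. The search address below is a hypothetical placeholder; `urllib.parse.quote` is the same function that `urllib.request.quote` refers to.

```python
import urllib.parse

keyword = "彭坤"

# quote() percent-encodes the UTF-8 bytes of the string so it is safe in a URL.
encoded = urllib.parse.quote(keyword)

# Hypothetical search address, used only to show where the encoded keyword goes.
search_url = "https://www.example.com/search?q=" + encoded

# unquote() reverses the encoding.
decoded = urllib.parse.unquote(encoded)

print(encoded)
print(search_url)
```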

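Timeout settings:

How long a page takes to load depends on how quickly the site's server responds, so set a timeout that suits your needs

urllib.request.urlopen("url",timeout=5)  # this sets a timeout of 5 seconds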
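A timeout is only useful if the resulting exception is caught. The sketch below shows one possible way to do that, with a hypothetical helper `fetch` that retries a few times and returns None when every attempt fails; the retry count is an assumption and not part of the notes above.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=5, retries=3):
    """Return the page bytes, or None if every attempt fails or times out."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, socket.timeout) as exc:
            # A timeout while connecting arrives wrapped in URLError;
            # a timeout while reading is raised as socket.timeout.
            print("attempt", attempt, "failed:", exc)
    return None
```

A call such as `fetch("url", timeout=5)` then either returns the raw bytes or None, so the caller can decide whether to skip the page.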

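Simulating HTTP requests automatically

post(): form submissions, the kind used for logging in; and get()

GET: the address usually has the form url?field=value&field=value and so on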
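The GET address format can be built with urllib.parse.urlencode rather than by hand. A minimal sketch, assuming a hypothetical address and field names:

```python
import urllib.parse


def build_get_url(base, params):
    """Append params to base as ?field=value&field=value."""
    return base + "?" + urllib.parse.urlencode(params)


# Hypothetical address and fields, for illustration only.
get_url = build_get_url("https://www.example.com/search", {"q": "python", "page": 2})
print(get_url)
```

urlencode() also percent-encodes the values, so Chinese keywords do not need a separate quote() call.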

POST format: form submission

import urllib.request

import urllib.parse

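posturl="url"  # the address the form is submitted to

posttt=urllib.parse.urlencode({"name":"nideminzi","password":"nidemima"}).encode("utf-8")  # encode the form fields; the POST body must be bytes, so use encode(), not decode()

To send the POST, use urllib.request.Request(address, encoded data)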

req=urllib.request.Request(posturl,posttt)

res=urllib.request.urlopen(req).read().decode("utf-8")
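The POST steps can be checked without sending anything by building the Request and inspecting it. In this sketch the login address is a hypothetical placeholder, and the field names depend on the target form.

```python
import urllib.parse
import urllib.request


def build_post_request(posturl, fields):
    """Encode the form fields and wrap them in a Request object."""
    body = urllib.parse.urlencode(fields).encode("utf-8")  # must be bytes
    return urllib.request.Request(posturl, body)


# Hypothetical login address; the field names depend on the target form.
req = build_post_request(
    "https://www.example.com/login",
    {"name": "nideminzi", "password": "nidemima"},
)

print(req.get_method())  # a Request that carries data is sent as POST
print(req.data)
# Sending it would be: urllib.request.urlopen(req).read().decode("utf-8")
```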

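Exception handling in crawlers:

Without exception handling, the crawler crashes as soon as an exception occurs, and the next run has to start again from the beginning

URLError causes: the server cannot be reached, the remote URL does not exist, there is no network connection, or an HTTPError was raised

HTTPError: a subclass of URLError, raised when the server answers with an HTTP error status; it carries the status code (code) and the reason (reason)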
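One way to keep a crawl running through failures is sketched below. The helper names `describe_error` and `crawl` are hypothetical; the point is that HTTPError has to be checked before URLError because it is the more specific class.

```python
import urllib.error
import urllib.request


def describe_error(exc):
    """Turn a urllib exception into a short log message."""
    # HTTPError is a subclass of URLError, so it has to be tested first.
    if isinstance(exc, urllib.error.HTTPError):
        return "HTTP error %d: %s" % (exc.code, exc.reason)
    return "URL error: %s" % exc.reason


def crawl(urls):
    """Fetch each URL; log failures and keep going instead of crashing."""
    pages = {}
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                pages[url] = resp.read()
        except urllib.error.URLError as exc:
            print(url, "->", describe_error(exc))
    return pages
```

Because failed URLs are only logged, the pages that did download are kept and the loop moves on to the next address.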

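Browser disguise for crawlers: press F12 (Fn+F12 on some laptops) to open the developer tools, where the browser's User-Agent can be found

Request header format: ("User-Agent", the actual value), written as a tuple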

headers=("User-Agent"," ")

opener=urllib.request.build_opener()

opener.addheaders=[headers]

data=opener.open(url).read().decode("utf-8")
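Filled in with a concrete value, the opener setup might look like this. The User-Agent string is only an example of the usual shape; the real one should be copied from the request headers shown in the developer tools. The helper name `make_opener` is hypothetical.

```python
import urllib.request

# Example User-Agent value; copy the real one from the request headers shown
# in the browser's developer tools.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def make_opener(user_agent=USER_AGENT):
    """Build an opener that sends the given User-Agent with every request."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


opener = make_opener()
# Optional: make plain urllib.request.urlopen() use this opener as well.
urllib.request.install_opener(opener)
```

After install_opener(), calls such as urllib.request.urlopen("url") send the same header, so the rest of the script does not need to change.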

Using a user-agent pool
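A user-agent pool is a list of User-Agent strings from which one is picked at random for each request, so that requests do not all look alike. A minimal sketch, assuming illustrative pool entries and a hypothetical helper name:

```python
import random
import urllib.request

# Illustrative values only; a real pool would hold User-Agent strings copied
# from actual browsers.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def opener_with_random_ua(pool=UA_POOL):
    """Build an opener whose User-Agent is picked at random from the pool."""
    ua = random.choice(pool)
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", ua)]
    return opener
```

Calling opener_with_random_ua() before each request, then opener.open(url), rotates the header across requests.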

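IP proxies and building an IP proxy pool: to crawl a site through a proxy IP, search Baidu for 西刺代理 (Xici proxy) to find proxy addresses

Initializing the IP proxy follows nearly the same steps as initializing the user agent.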
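The proxy setup mirrors the user-agent one, with urllib.request.ProxyHandler supplying the address. In the sketch below the addresses are placeholders from a reserved documentation range and will not work as proxies; the helper name is hypothetical.

```python
import random
import urllib.request

# Placeholder addresses from a reserved documentation range; replace them
# with working proxies in the form "ip:port".
IP_POOL = ["192.0.2.10:8080", "192.0.2.11:3128"]


def opener_with_random_proxy(pool=IP_POOL):
    """Build an opener that routes HTTP and HTTPS traffic through a random proxy."""
    addr = random.choice(pool)
    proxy = urllib.request.ProxyHandler({"http": addr, "https": addr})
    return urllib.request.build_opener(proxy)
```

As with the user-agent pool, building a fresh opener per request and calling opener.open(url) spreads requests over the pool. Free proxies fail often, so this is usually combined with a timeout and exception handling.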

from bs4 import BeautifulSoup
import os
import urllib.request

# create the output folder if it does not exist yet
if not os.path.exists('photofirst'):
    os.makedirs('photofirst')

url="https://pixabay.com/zh/photos/?q=%E9%A3%8E%E6%99%AF&image_type=&min_width=&min_height=&cat=&pagi="

count=0  # running image number, kept across pages so files are not overwritten
for page in range(1,200):
    res=urllib.request.urlopen(url+str(page))
    data=BeautifulSoup(res,'lxml')
    datas=data.find_all('img')
    link=[]
    for img in datas:
        s=img.get('srcset')
        if s is None:  # skip images that have no srcset attribute
            continue
        else:
            link.append(s.split(' ')[0])  # the first entry of srcset is the image URL
    for links in link:
        count+=1
        filename='photofirst/'+'photofirst'+str(count)+'.jpg'
        urllib.request.urlretrieve(links,filename)  # urlretrieve writes the file itself
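The srcset handling in the script can be tried offline. The sketch below swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages; the markup is made up to match the shape the script expects, and the class name is hypothetical.

```python
from html.parser import HTMLParser


class SrcsetCollector(HTMLParser):
    """Collect the first URL of every <img srcset="..."> attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        srcset = dict(attrs).get("srcset")
        if srcset:  # skip images without a srcset, as the script does
            self.links.append(srcset.split(" ")[0])


# Made-up markup in the shape the script expects.
html = (
    '<img srcset="https://cdn.example.com/a_340.jpg 1x, https://cdn.example.com/a_480.jpg 2x">'
    '<img src="https://cdn.example.com/logo.png">'
)
collector = SrcsetCollector()
collector.feed(html)
print(collector.links)
```

A srcset value lists candidates as "URL descriptor" pairs separated by commas, which is why splitting on a space and taking the first piece yields the first image URL.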
