Libraries to install (an install note follows the list):
beautifulsoup4==4.7.1
certifi==2019.6.16
chardet==3.0.4
fake-useragent==0.1.11
freeze==1.0.10
idna==2.8
lxml==4.3.4
Pillow==6.1.0
pymongo==3.8.0
PyMySQL==0.9.3
requests==2.22.0
selenium==3.141.0
six==1.12.0
soupsieve==1.9.2
urllib3==1.25.3
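These are all standard PyPI packages; if the list above is saved as a requirements.txt file, running pip install -r requirements.txt installs everything in one step. Note that fake-useragent may need network access the first time it is used, since it fetches its User-Agent data online.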
A quick review of the prerequisites (a minimal end-to-end sketch follows this list):
- Requesting a site and fetching its source: urllib, requests, selenium, pyquery
- Parsing the source: regular expressions, lxml.etree, beautifulsoup4, selenium
- Storing the data: pymysql, pymongo
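A minimal sketch tying the three steps together (assumptions: network access to httpbin.org; the storage step is only indicated in a comment since it needs a running database):
import requests
from lxml import etree
# 1. Request: fetch a small demo page
res = requests.get('http://httpbin.org/html')
# 2. Parse: pull the <h1> text out with an XPath expression
tree = etree.HTML(res.text)
title = tree.xpath('//h1/text()')[0]
print(title)
# 3. Store: the extracted data could now be written out with pymysql or pymongo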
1. Proxy syntax
1.1.1 urllib: setting the User-Agent
import urllib.request
from fake_useragent import UserAgent
# Put a random User-Agent into the request headers
ua = UserAgent()
headers = {
    'User-Agent': ua.random
}
req = urllib.request.Request(url='http://httpbin.org/get', headers=headers)
res = urllib.request.urlopen(req)
print(res.read().decode('utf-8'))
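fake-useragent can fail to load its User-Agent data when its online source is unreachable; as a fallback, a hard-coded User-Agent string works just as well. A minimal sketch (the UA string below is only an example):
import urllib.request
# Any realistic browser User-Agent string can be used here
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/75.0.3770.100 Safari/537.36'
}
req = urllib.request.Request(url='http://httpbin.org/get', headers=headers)
res = urllib.request.urlopen(req)
print(res.read().decode('utf-8'))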
1.1.2 urllib: setting the User-Agent and proxy IP
import urllib.request
from fake_useragent import UserAgent
ua = UserAgent()
# The proxy address is an example and has likely expired; replace it with a live one
proxies = {
    'http': 'http://182.116.234.232:9999',
    'https': 'https://182.116.234.232:9999',
}
headers = {
    'User-Agent': ua.random
}
url = 'http://httpbin.org/get'
proxy_handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(proxy_handler)
# Two ways to attach the User-Agent:
# 1. Set the opener's addheaders attribute
# opener.addheaders = [('User-Agent', ua.random)]
# 2. Change what opener.open() receives: pass a Request object that already carries the headers
req = urllib.request.Request(url, headers=headers)
res = opener.open(req)
print(res.read().decode('utf-8'))
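For completeness, a sketch of option 1 from the comments above: attach the User-Agent to the opener itself via addheaders and pass a plain URL to opener.open() (the proxy address is the same placeholder as above):
import urllib.request
from fake_useragent import UserAgent
ua = UserAgent()
proxy_handler = urllib.request.ProxyHandler({'http': 'http://182.116.234.232:9999'})
opener = urllib.request.build_opener(proxy_handler)
# Headers set on the opener apply to every request it makes
opener.addheaders = [('User-Agent', ua.random)]
res = opener.open('http://httpbin.org/get')
print(res.read().decode('utf-8'))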
1.2 requests: User-Agent and proxy IP settings
import requests
from fake_useragent import UserAgent
ua = UserAgent()
url = 'http://httpbin.org/get'
# The proxy address is an example; replace it with a live one
proxies = {
    'http': 'http://113.128.8.9:9999',
    'https': 'https://113.128.8.9:9999',
}
headers = {
    'User-Agent': ua.random
}
# requests takes the headers and proxies directly as keyword arguments
res = requests.get(url, headers=headers, proxies=proxies)
print(res.text)
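Free proxies fail often, so it helps to add a timeout and check whether the proxy was really used; httpbin echoes the client IP in the "origin" field of its JSON response. A sketch (the proxy address is a placeholder):
import requests
from requests.exceptions import RequestException
proxies = {
    'http': 'http://113.128.8.9:9999',
    'https': 'https://113.128.8.9:9999',
}
try:
    res = requests.get('http://httpbin.org/get', proxies=proxies, timeout=5)
    # "origin" should show the proxy's IP rather than your own
    print(res.json()['origin'])
except RequestException as e:
    print('proxy request failed:', e)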
1.3 selenium: proxy IP setting
from selenium import webdriver
from selenium.webdriver import ChromeOptions
# Pass the proxy to Chrome as a command-line switch
options = ChromeOptions()
options.add_argument('--proxy-server=http://113.128.8.9:9999')
browser = webdriver.Chrome(options=options)
url = 'http://httpbin.org/get'
browser.get(url)
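The same flag also works together with Chrome's headless mode; a sketch that additionally reads the response back and closes the browser (the proxy address is a placeholder, and a matching chromedriver must be on the PATH):
from selenium import webdriver
from selenium.webdriver import ChromeOptions
options = ChromeOptions()
options.add_argument('--headless')
options.add_argument('--proxy-server=http://113.128.8.9:9999')
browser = webdriver.Chrome(options=options)
browser.get('http://httpbin.org/get')
# httpbin echoes the request details, including the origin IP, so the proxy can be verified
print(browser.page_source)
browser.quit()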