大师兄's Python Study Notes (18): Python and HTTP

Previous: 大师兄's Python Study Notes (17): Mail Programming
Next: 大师兄's Python Study Notes (19): Python and XML/JSON

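I. HTTP Concepts

1. About HTML

HTML (HyperText Markup Language) is a markup language.
It uses a set of tags to give documents on the network a uniform format, linking scattered Internet resources into one logical whole.
An HTML document is descriptive text made up of HTML tags, which can describe text, images, animation, sound, tables, links, and so on.
Most web pages are written in HTML.
A simple HTML example: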

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>HTML Sample</title>
</head>
<body>
    <p>Hello World!</p>
</body>
</html>

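2. About the HTTP protocol

HTTP (HyperText Transfer Protocol) transfers hypertext such as HTML documents, along with other resources, between web servers and clients such as browsers.
HTTP relies on TCP/IP to carry its data.
HTTP sits in the application layer of the four-layer TCP/IP model.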

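3. About URLs

A URL (Uniform Resource Locator) is an address that identifies a resource on the Internet.
A complete URL consists of the following parts:

Part	Description
Scheme	The protocol in use, such as HTTP or FTP. The scheme name is followed by ":", and "//" separates it from what follows.
Domain	The address of the host. Either a domain name or an IP address can be used.
Port	Follows the domain, separated from it by ":". The port is optional; if omitted, the scheme's default port is used.
Virtual directory	Runs from the first "/" after the domain to the last "/". The virtual directory is optional.
File name	Runs from the last "/" to the "?" (or to the "#" or the end of the URL when there are no parameters). The file name is optional; if omitted, the server's default file is used.
Parameters	Runs from "?" to "#". There can be several parameters, separated by "&".
Anchor	Runs from "#" to the end of the URL. The anchor is optional.

Example 1: https://www.baidu.com/?tn=62095104_26_oem_dg
Example 2: ftp://127.0.0.1:8080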
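
The parts in the table can be picked out of a URL with the standard library. The following is a minimal sketch using urllib.parse.urlsplit; the URL itself is hypothetical and was chosen only because it contains every part.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only because it contains every part of the table.
url = "http://www.example.com:8080/news/index.html?id=7&page=2#top"
parts = urlsplit(url)

print(parts.scheme)    # scheme (protocol)
print(parts.hostname)  # domain
print(parts.port)      # port, as an int (None if the URL omits it)
print(parts.path)      # virtual directory + file name
print(parts.query)     # parameters, without the leading "?"
print(parts.fragment)  # anchor, without the leading "#"

# The virtual directory and the file name both live inside the path,
# split at the last "/".
directory, _, filename = parts.path.rpartition("/")
print(directory + "/", filename)
```

Note that urlsplit reports the directory and file name together as the path; splitting them at the last "/" is a convention of this table rather than something the library does for you.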

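4. How an HTTP exchange works

4.1 The client connects to the web server

An HTTP client, usually a browser, opens a TCP socket connection to the web server's HTTP port (80 by default).

4.2 The client sends an HTTP request

Over the TCP socket, the client sends the web server a plain-text request message.

4.3 The server accepts the request and returns an HTTP response

The web server parses the request and locates the requested resource.
The server writes a copy of the resource to the TCP socket as the response, and the client reads it.

4.4 The connection is released

If the Connection mode is close, the server actively closes the TCP connection and the client closes its end in response.
If the Connection mode is keep-alive, the connection stays open for a while and can carry further requests during that time.

4.5 The client browser parses the HTML content

The browser first parses the status line and checks the status code to see whether the request succeeded.
It then parses each response header; headers such as Content-Length and Content-Type tell it how many bytes of HTML follow and which character set the document uses.
Finally, the browser reads the HTML in the response body, lays it out according to HTML syntax, and displays it in the browser window.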
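
Steps 4.1 to 4.4 can be tried out with a plain TCP socket. The sketch below is only an illustration under some assumptions: instead of a real web server on port 80 it starts a throwaway local server (the HelloHandler class is hypothetical) on a free port, and it always sends Connection: close so the server ends the connection after one response.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a tiny HTML page."""

    def do_GET(self):
        page = b"<html><body>Hello World!</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=UTF-8")
        self.send_header("Content-Length", str(len(page)))
        self.end_headers()
        self.wfile.write(page)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


# Port 0 asks the OS for any free port, so port 80 is not needed here.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# 4.1: open a TCP connection to the server.
with socket.create_connection(("127.0.0.1", port), timeout=5) as sock:
    # 4.2: send a plain-text request (request line, headers, blank line).
    request = (
        "GET / HTTP/1.1\r\n"
        f"Host: 127.0.0.1:{port}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    sock.sendall(request.encode("ascii"))

    # 4.3 and 4.4: read the response until the server closes the connection.
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)

server.shutdown()
server.server_close()

# 4.5 (first half): separate the status line and headers from the body.
raw = b"".join(chunks)
head, _, body = raw.partition(b"\r\n\r\n")
status_line = head.split(b"\r\n")[0].decode("ascii")
print(status_line)
print(body.decode("utf-8"))
```

A browser would go on to render the body; here it is simply printed.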

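5. About HTTP requests

The client sends an HTTP request message to the server.

5.1 Request structure

A request consists of four parts: the request line, the request headers, a blank line, and the request body.
Example: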

GET /home/msg/data/personalcontent?num=8&indextype=manht&_req_seqid=xx&asyn=1&t=xx&sid=xx HTTP/1.1
Host: www.baidu.com
Connection: keep-alive
Accept: text/plain, */*; q=0.01
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Referer: https://www.baidu.com/?tn=62095104_26_oem_dg
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
... ...

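5.2 Request methods

HTTP/1.1 originally defined eight request methods; PATCH was added later by RFC 5789, giving the nine methods in common use:

No.	Method	Description
1	GET	Requests the specified resource and returns it in the response body.
2	HEAD	Like GET, but the response has no body; used to fetch the headers only.
3	POST	Submits data to the specified resource for processing (for example, submitting a form or uploading a file). The data is carried in the request body. A POST may create a new resource and/or modify existing ones.
4	PUT	Replaces the content of the specified resource with the data sent by the client.
5	DELETE	Asks the server to delete the specified resource.
6	CONNECT	Reserved in HTTP/1.1 for proxies that can switch the connection to a tunnel.
7	OPTIONS	Lets the client ask which methods and options the server supports.
8	TRACE	Echoes back the request the server received; mainly used for testing or diagnostics.
9	PATCH	Complements PUT by applying a partial update to an existing resource.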

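5.3 The GET method

With GET, the request data is appended to the URL, with "?" separating the URL from the data.
Multiple parameters are joined with "&".
Letters and digits are sent as-is, a space is converted to "+", and Chinese or other special characters are percent-encoded (URL-encoded): each byte of the character's encoding, usually UTF-8, is written as "%XX". This is neither Base64 nor encryption.
Example: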

GET /home/msg/data/personalcontent?num=8&indextype=manht&_req_seqid=xx&asyn=1&t=xx&sid=xx HTTP/1.1
Host: www.baidu.com
Connection: keep-alive
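
The encoding rules for GET parameters can be checked with urllib.parse. This is a small sketch; the parameter names wd, page and note are made up for illustration.

```python
from urllib.parse import quote, quote_plus, urlencode

# A space becomes "+" in form-style encoding.
print(quote_plus("hello world"))

# Non-ASCII text is percent-encoded byte by byte (UTF-8 by default).
print(quote("雪纳瑞"))

# urlencode joins the pairs with "&" and applies quote_plus to each value.
# The parameter names here are hypothetical.
query = urlencode({"wd": "雪纳瑞", "page": 1, "note": "a b"})
print(query)
```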

5.4 The POST method

With POST, the submitted data is placed in the body of the HTTP message.

POST / HTTP/1.1
Host: www.kaixin001.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
Content-Length: 4
Content-Type: application/x-www-form-urlencoded
Referer: http://www.kaixin001.com/photo/album.php?flag=6&uid=3629403&albumid=49478971
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
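
To see how form data ends up in the body, a POST request can be built with the standard library without sending it. In this sketch the URL and the form fields are hypothetical; the only point is that passing data makes the request a POST and that the body is the URL-encoded form as bytes.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields.
form = {"name": "xxx", "age": 20}
body = urlencode(form).encode("ascii")  # the request body must be bytes

req = Request(
    "http://www.example.com/submit",  # hypothetical URL, never contacted here
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(req.get_method())  # supplying data switches the method from GET to POST
print(req.data)          # what would travel in the message body
print(len(req.data))     # the value Content-Length would carry
```

Nothing is sent until the Request object is passed to urlopen(), so the snippet is safe to run offline.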

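6. About response messages

Normally, after receiving and processing a client's request, the server returns an HTTP response message.

6.1 Response structure

A response also has four parts: the status line, the response headers, a blank line, and the response body.
Example:

HTTP/1.1 200 OK
Date: Wed, 20 May 2020 15:57:21 GMT
Content-Type: text/html; charset=UTF-8

<html>
      <head></head>
      <body>
            <!--body goes here-->
      </body>
</html>

Status line and headers of a redirect response:

HTTP/1.1 302 Moved Temporarily
Server: bfe/1.0.8.18
Date: Wed, 20 May 2020 07:51:29 GMT
Content-Type: text/html
Content-Length: 161
Connection: keep-alive
Location: https://www.baidu.com/?tn=62095104_26_oem_dg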
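
The four-part structure can be made concrete by taking a response apart by hand. The sketch below is deliberately naive and works on a shortened, hand-written message; a real parser (such as the one inside http.client) also has to deal with repeated headers, case-insensitive names, and bodies sent as bytes.

```python
# A shortened, hand-written response used only for illustration.
raw = (
    "HTTP/1.1 302 Moved Temporarily\r\n"
    "Content-Type: text/html\r\n"
    "Location: https://www.baidu.com/?tn=62095104_26_oem_dg\r\n"
    "\r\n"
    "<html></html>"
)

# The blank line separates status line + headers from the body.
head, _, body = raw.partition("\r\n\r\n")

# The first line of the head is the status line; the rest are headers.
status_line, *header_lines = head.split("\r\n")
version, code, reason = status_line.split(" ", 2)
headers = dict(line.split(": ", 1) for line in header_lines)

print(version, code, reason)
print(headers["Location"])
print(body)
```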

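6.2 Response status codes

The status code appears in the status line and is how the server tells the client what happened.
It is a three-digit number whose first digit defines the class of the response.
There are five classes, and the full three digits identify the specific status:

Class	Meaning
1xx	Informational: the request was received and processing continues
2xx	Success: the request was received, understood, and accepted
3xx	Redirection: further action is needed to complete the request
4xx	Client error: the request is malformed or cannot be fulfilled
5xx	Server error: the server failed to fulfil a valid request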
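
The standard library's http.HTTPStatus enum knows the standard reason phrases, and the class of a code is simply its first digit. A small sketch follows; the describe helper and the CLASSES mapping are hypothetical names introduced here.

```python
from http import HTTPStatus

# First digit of the status code -> response class (hypothetical helper table).
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code):
    """Return the code, its standard reason phrase, and its class."""
    status = HTTPStatus(code)
    return f"{code} {status.phrase} ({CLASSES[code // 100]})"


for code in (200, 302, 404, 500):
    print(describe(code))
```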

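7. A brief introduction to cookies

HTTP is stateless: the server does not know what the user did in the previous request, which seriously hinders interactive web applications.
Cookies are one of the "extra mechanisms" for working around this statelessness. The server can set or read the information stored in cookies and use it to maintain the state of the user's session with the server.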

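II. The http package

The http package is part of the Python standard library and implements parts of the HTTP protocol.

1. http.client

http.client implements much of what an HTTP client needs.

1.1 The HTTPConnection(host, port, [timeout]) class

Returns an HTTPConnection instance.
host: IP address or domain name of the target server.
port: port number of the target server.
timeout: timeout, in seconds, for blocking operations.

1) HTTPConnection.request(method, url, body=None, headers={}) method

Sends the request message.

method: the request method, usually GET or POST

url: the URL to request

body: the data to send

headers: the HTTP headers

2) HTTPConnection.getresponse() method

Gets the response message.

1.2 The HTTPResponse class

The instance returned by HTTPConnection.getresponse().

1) HTTPResponse.getheader(name)

Returns the value of the header field name.

2) HTTPResponse.getheaders()

Returns all the headers as a list of (header, value) tuples.

3) HTTPResponse.read()

Returns the body of the response as bytes.

4) HTTPResponse.status

The status code.

5) HTTPResponse.version

The HTTP protocol version (10 for HTTP/1.0, 11 for HTTP/1.1).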

>>>import http.client

>>>host = 'www.baidu.com'
>>>port = 80

>>>connection = http.client.HTTPConnection(host,port)
>>>connection.request('GET','/') # send the request
>>>res = connection.getresponse() # get the response
>>>print(f'Status code:{res.status}')
>>>print(f'Protocol version:{res.version}')
>>>print(f'HTTP headers:{res.getheaders()}')
>>>print(f'Body:{res.read()}')
Status code:200
Protocol version:11
HTTP headers:[('Accept-Ranges', 'bytes'), ('Cache-Control', 'no-cache'), ('Connection', 'keep-alive'), ('Content-Length', '14615'), ('Content-Type', 'text/html'), ('Date', 'Fri, 22 May 2020 03:19:17 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Pragma', 'no-cache'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BAIDUID=62F12DC1BA263D19E1B7BA2621BEC17D:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max- ... ...

2. http.server

http.server implements the server side of an HTTP exchange.
A simple example:

>>>from http.server import HTTPServer,BaseHTTPRequestHandler
>>>import json

>>>host = ('localhost',10005)
>>>data = {'result':'Hello from server'}

>>>class Request(BaseHTTPRequestHandler):
>>>    def do_GET(self):
>>>        self.send_response(200)
>>>        self.send_header('Content-type','application/json')
>>>        self.end_headers()
>>>        self.wfile.write(json.dumps(data).encode())
>>>if __name__ == '__main__':
>>>    server = HTTPServer(host,Request)
>>>    print(f"Starting server,listen at:{host}")
>>>    server.serve_forever()
Starting server,listen at:('localhost', 10005)
127.0.0.1 - - [23/May/2020 22:09:35] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [23/May/2020 22:09:35] "GET /favicon.ico HTTP/1.1" 200 -
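
The client and server halves of the http package can be exercised together without touching the network. The following sketch makes a few assumptions of its own: the server runs in a background thread on a free local port rather than on port 10005, and JSONHandler is a hypothetical handler name.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

data = {"result": "Hello from server"}


class JSONHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with the same JSON."""

    def do_GET(self):
        payload = json.dumps(data).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # suppress the access log for this demo


# Port 0 lets the OS choose a free port.
server = HTTPServer(("127.0.0.1", 0), JSONHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the same calls as in the http.client section.
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
conn.request("GET", "/")
res = conn.getresponse()
status = res.status
content_type = res.getheader("Content-Type")
reply = json.loads(res.read().decode("utf-8"))
conn.close()

server.shutdown()
server.server_close()

print(status, content_type, reply)
```

Running the server in a daemon thread is a convenience for demos and tests; a standalone server would call serve_forever() in the main thread as in the example above.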

3. http.cookies

http.cookies implements cookie handling.
Creating a simple cookie:

>>> from http import cookies
>>> cookie = cookies.SimpleCookie()
>>> cookie['new_cookie'] = 'my_new_cookie_value'
>>> print(cookie)
Set-Cookie: new_cookie=my_new_cookie_value
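
SimpleCookie can also go the other way and parse a Cookie header string, and each cookie can carry attributes such as path and max-age. A short sketch follows; the cookie names and values are made up.

```python
from http import cookies

# Parse a Cookie header value as a browser might send it (hypothetical data).
jar = cookies.SimpleCookie()
jar.load("session_id=abc123; theme=dark")

print(jar["session_id"].value)
print(sorted(jar.keys()))

# Each entry is a Morsel; attributes are set like dictionary items.
jar["theme"]["path"] = "/"
jar["theme"]["max-age"] = 300
print(jar["theme"].OutputString())
```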

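III. The urllib package

The urllib package is also part of the Python standard library and also deals with HTTP.
The difference is that the http package implements the HTTP protocol itself, while the urllib package focuses on working with URLs.
urllib.request is built on top of the http package (http.client).

1. The urllib.request module

Provides requests, responses, browser emulation, proxies, cookies, and more.
It is the most commonly used module in the urllib package.

1.1 Quick requests

1) request.urlopen(url, data=None, timeout=10) function

Opens the URL and returns an HTTPResponse instance.
url: the target address
data: the data to submit with POST (bytes)
timeout: timeout for the request, in seconds

2) HTTPResponse.read() method

Returns the response body as bytes; call decode() to get text.

3) HTTPResponse.info() method

Returns the headers sent by the server.

4) HTTPResponse.getcode() method

Returns the status code.

5) HTTPResponse.geturl() method

Returns the URL of the resource actually retrieved, which may differ from the requested URL after a redirect.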

>>>from urllib.request import urlopen

>>>url = 'https://www.baidu.com'
>>>response = urlopen(url,data=None,timeout=10)

>>>print(f"code:{response.getcode()}\n{'*'*20}")
>>>print(f"url:{response.geturl()}\n{'*'*20}")
>>>print(f"info:{response.info()}\n{'*'*20}")
>>>print(f"page:{response.read().decode('utf-8')}\n{'*'*20}")
code:200
********************
url:https://www.baidu.com
********************
info:Accept-Ranges: bytes
Cache-Control: no-cache
Content-Length: 227
Content-Type: text/html
Date: Mon, 25 May 2020 08:27:08 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Pragma: no-cache
Server: BWS/1.1
Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: BIDUPSID=BB6BFE68C85C0EAD420212462967E21F; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1590395228; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BAIDUID=BB6BFE68C85C0EAD97009D0ABA0A5025:FG=1; max-age=31536000; expires=Tue, 25-May-21 08:27:08 GMT; domain=.baidu.com; path=/; version=1; comment=bd
Strict-Transport-Security: max-age=0
Traceid: 1590395228050946970613001546894747393222
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close


********************
page:<html>
<head>
  <script>
      location.replace(location.href.replace("https://","http://"));
  </script>
</head>
<body>
  <noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
********************

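1.2 Emulating a browser

Changing the headers lets the request present itself as a browser instead of being identified as a Python program.

1) Emulating a desktop browser

Pass the headers with request.Request(url, headers=headers).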

>>>from urllib.request import urlopen,Request

>>>url = 'https://www.baidu.com'
>>>headers = {
>>>'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'
>>>}
>>>request = Request(url,headers=headers)
>>>response = urlopen(request)

>>>print(response.read().decode('utf-8'))
<!DOCTYPE html><!--STATUS OK-->
                           <html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">... ...

2) Emulating a mobile browser

Add a header with Request.add_header(key, val).

>>>from urllib.request import urlopen,Request

>>>url = 'https://www.baidu.com'
>>>req = Request(url)
>>>req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) '
>>>                             'AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
>>>print(req.headers)
{'User-agent': 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25'}

1.3 Using cookies

>>>from urllib import request
>>>import http.cookiejar

>>>def get_cookie(url,filename="cookie_sample.txt"):
>>>    # fetch the page's cookies and save them to a file
>>>    cookie = http.cookiejar.MozillaCookieJar(filename)
>>>    handler = request.HTTPCookieProcessor(cookie)
>>>    opener = request.build_opener(handler)
>>>    resp = opener.open(url)
>>>    cookieStr = ''
>>>    for item in cookie:
>>>        cookieStr += f"{item.name}={item.value};"
>>>    print(cookieStr)
>>>    cookie.save()

>>>if __name__ == '__main__':
>>>    url = "https://www.baidu.com"
>>>    req = request.Request(url)
>>>    get_cookie(req)
BAIDUID=12619442C4A42DEEF0B77B57F14485B8:FG=1;BIDUPSID=12619442C4A42DEE693CFC78F0B1716D;PSTM=1590397291;BD_NOT_HTTPS=1;

1.4 Setting a proxy

>>>import urllib.request

>>>def load_proxy(proxy):
>>>    proxies = urllib.request.ProxyHandler(proxy) # create the proxy handler
>>>    opener = urllib.request.build_opener(proxies,urllib.request.HTTPHandler) # build an opener that uses it
>>>    urllib.request.install_opener(opener) # install it as the global opener

>>>if __name__ == '__main__':
>>>    proxy = {'http': 'xx.xx.xx.xx:xxxx', 'https': 'xx.xx.xx.xx:xxxx'}  # proxy addresses
>>>    load_proxy(proxy)

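2. The urllib.error module

Defines the exceptions raised by urllib.request so that they can be caught.

2.1 urllib.error.URLError

Usually raised when the network is unreachable, the server does not exist, and so on.

>>>import urllib.request
>>>import urllib.error

>>>request = urllib.request.Request("http://www.mustnotaanurl.com/") # a server that does not exist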
>>>try:
>>>    urllib.request.urlopen(request).read()
>>>except urllib.error.URLError as e:
>>>    print(e.reason)
[Errno 11001] getaddrinfo failed

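2.2 urllib.error.HTTPError

HTTPError is a subclass of URLError.
It is raised when the server answers with an error status code, typically 4xx or 5xx; redirects (3xx) are normally followed automatically.

>>>import urllib.request
>>>import urllib.error

>>>request = urllib.request.Request("https://www.sina.com.cn/notexist") # a page that does not exist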
>>>try:
>>>    response = urllib.request.urlopen(request)
>>>except urllib.error.HTTPError as e:
>>>    print(e)
HTTP Error 404: Not Found
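
Because HTTPError is a subclass of URLError, code that handles both should list the more specific HTTPError first. The sketch below shows that ordering under a few assumptions of its own: it talks to a throwaway local server that always answers 404 (NotFoundHandler is a hypothetical name), and it uses an opener with an empty ProxyHandler so that system proxy settings do not interfere with a localhost request.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, format, *args):
        pass  # suppress the access log for this demo


server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/notexist"

# An empty ProxyHandler makes the opener ignore any configured proxies.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

outcome = None
try:
    opener.open(url, timeout=5)
except urllib.error.HTTPError as e:
    # The server was reached but answered with an error status.
    outcome = ("HTTPError", e.code, e.reason)
except urllib.error.URLError as e:
    # The server could not be reached at all.
    outcome = ("URLError", None, e.reason)
finally:
    server.shutdown()
    server.server_close()

print(outcome)
```

If the two except clauses were swapped, the URLError branch would also catch the 404 and the status code would be lost.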

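3. The urllib.parse module

Parses URLs and splits them into components.

3.1 urllib.parse.urljoin(base, url) function

Joins URLs.
The second argument should be on the same site as the first (typically a relative path); if it is a complete URL on another site, it replaces the first address.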

>>> from urllib.parse import urljoin

>>> url1 = urljoin('http://www.baidu.com','index.html')
>>> url2 = urljoin('http://www.baidu.com','http://www.sina.com/index.html')
>>> print(url1)
>>> print(url2)
http://www.baidu.com/index.html
http://www.sina.com/index.html
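
urljoin resolves its second argument the way a browser resolves a link found on the base page. A few more cases, as a sketch; the paths under www.baidu.com are invented for illustration.

```python
from urllib.parse import urljoin

# Hypothetical base page; only its structure matters here.
base = "http://www.baidu.com/news/2020/index.html"

same_dir = urljoin(base, "detail.html")      # relative to the base directory
from_root = urljoin(base, "/about.html")     # leading "/" restarts at the site root
parent = urljoin(base, "../archive.html")    # ".." climbs one directory

print(same_dir)
print(from_root)
print(parent)
```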

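3.2 urllib.parse.urlunparse() function

Builds a URL from six components: (scheme, netloc, path, params, query, fragment).
A port number belongs in netloc; params is the rarely used ";parameters" section after the path.

>>>from urllib.parse import urlunparse

>>>url = urlunparse(('http','www.baidu.com:8080','/index.html','','id=name',''))
>>>print(url)
http://www.baidu.com:8080/index.html?id=name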

3.3 urlencode() function

Serializes parameters given as a dictionary into a URL-encoded string.

>>>from urllib.parse import urlencode

>>>params ={
>>>    'name':'xxx',
>>>    'age':20
>>>}
>>>print(urlencode(params))
name=xxx&age=20
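
The reverse direction is handled by parse_qs and parse_qsl from the same module. A minimal sketch, reusing the parameters from the example above:

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = urlencode({"name": "xxx", "age": 20})

# parse_qs returns a dict whose values are lists, since a key may repeat.
as_dict = parse_qs(query)
# parse_qsl returns a list of (key, value) pairs in their original order.
as_pairs = parse_qsl(query)

print(as_dict)
print(as_pairs)
```

Note that everything comes back as a string: the integer 20 that went in is returned as '20'.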

3.4 quote() and unquote() functions

quote() converts Chinese and other special characters to URL encoding (percent-encoding).
unquote() reverses quote().

>>>from urllib.parse import quote,unquote

>>>kw = '雪纳瑞'
>>>url_quote =f'www.baidu.com/s?wd={quote(kw)}'
>>>print(url_quote)
www.baidu.com/s?wd=%E9%9B%AA%E7%BA%B3%E7%91%9E

>>>url_unquote = unquote(url_quote)
>>>print(url_unquote)
www.baidu.com/s?wd=雪纳瑞

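4. The urllib.robotparser module

Parses a site's robots.txt, i.e. its Robots protocol.
The Robots protocol, formally the Robots Exclusion Protocol, is how a site tells search engines which pages may be crawled and which may not.
When a crawler visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the crawler uses its contents to decide what it may visit; if it does not, crawlers can visit every page on the site that is not password-protected.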

>>>from urllib import robotparser

>>>rp = robotparser.RobotFileParser()
>>>rp.set_url("http://www.sina.com.cn/robots.txt")
>>>rp.read()
>>>print(rp.can_fetch("admin", "https://news.sina.com.cn/")) # whether this page may be crawled
True
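
The rules can also be fed to the parser directly with parse(), which makes it easy to see how robots.txt lines map to can_fetch() results without downloading anything. In this sketch the rules, the crawler name mybot, and the example.com URLs are all made up.

```python
from urllib import robotparser

# Hypothetical robots.txt content, given as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)  # parse the lines instead of fetching them with read()

public_ok = rp.can_fetch("mybot", "http://www.example.com/index.html")
private_ok = rp.can_fetch("mybot", "http://www.example.com/private/data.html")

print(public_ok)
print(private_ok)
```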

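IV. The urllib3 package

urllib3 is a powerful third-party package for HTTP clients.
Compared with the urllib package, it offers many enhancements.

1.1 Thread-safe connection pooling

>>>import urllib3
>>>from urllib.parse import quote

>>>kw = '雪纳瑞'
>>>headers={"content-type":"application/json"}
>>>http = urllib3.PoolManager() # create a connection pool manager
>>>res_get = http.request("GET","http://www.baidu.com",fields={"key":"value"},headers=headers) # send a GET request
>>>res_post = http.request("POST",f"https://www.baidu.com?wd={quote(kw)}",headers=headers) # send a POST request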
>>>print(f"get code:{res_get.status}")
>>>print(f"post code:{res_post.status}")
get code:200
post code:200

1.2 Client-side SSL/TLS verification

>>>import urllib3,certifi

>>>http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED',ca_certs=certifi.where())

1.3 Retries and HTTP redirects

>>>import urllib3
>>>from urllib3.exceptions import MaxRetryError

>>>http = urllib3.PoolManager(timeout=urllib3.Timeout(connect=1.0,read=2.0)) # a Timeout object can set the connect timeout, the read timeout, or both
>>>try:
>>>    res = http.request("GET","http://www.google.com",retries=10,redirect=False) # retry up to 10 times and do not follow redirects
>>>except MaxRetryError as e:
>>>    print(e)
HTTPConnectionPool(host='www.google.com', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x000001CF2E33F908>, 'Connection to www.google.com timed out. (connect timeout=1.0)'))

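1.4 HTTP and SOCKS proxies

For an HTTP proxy, use ProxyManager in place of PoolManager; it is used the same way. For a SOCKS proxy, urllib3 provides SOCKSProxyManager in urllib3.contrib.socks, which needs the optional PySocks dependency.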

import urllib3

proxy = urllib3.ProxyManager("http://proxyurl:10005")
request = proxy.request("GET","https://www.baidu.com/")


Author: 大师兄 (superkmi)