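Packet Capture Analysis
Packet capture analysis is an essential skill for writing crawlers. Commonly used tools include Fiddler4, Charles, Wireshark, and the browser's developer tools.
When is packet capture analysis needed?
- Scraping data from apps, usually combined with decompiling (a later article covers scraping app data)
- Pages that require login
- More complex scraping, such as analyzing request headers and response headers, or working out why a request failed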
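For the last case, a common first step is to compare the headers the browser sent (as seen in the capture) with the headers the script sends. The following is a minimal sketch under the assumption that both sets are available as plain dictionaries. `missing_headers` is a hypothetical helper, not part of any library, and the sample values are illustrative only.

```python
def missing_headers(browser_headers, script_headers):
    """Return browser header names the script does not send (case-insensitive)."""
    sent = {name.lower() for name in script_headers}
    return sorted(name for name in browser_headers if name.lower() not in sent)


# Illustrative values only: the first dict stands in for a captured browser
# request, the second for what a bare script might send.
browser = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "http://example.com/login/",
}
script = {"user-agent": "python-requests"}

print(missing_headers(browser, script))
```

Headers that show up in this list are reasonable candidates to add to the script one at a time when a request works in the browser but fails from code.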
Login
Here I use Fiddler4. Logging in on the web page at http://www.kanzhun.com/login/ produces the following captured request:
POST http://www.kanzhun.com/login.json HTTP/1.1
Host: www.kanzhun.com
Proxy-Connection: keep-alive
Content-Length: 69
Accept: application/json, text/javascript, */*; q=0.01
Origin: http://www.kanzhun.com
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Referer: http://www.kanzhun.com/login/
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.8,en;q=0.6
Cookie: W_CITY_S_V=0; ac="c_wno@sina.com"; t=EhhloI4AGnoXJMz; aliyungf_tc=AQAAAOkujCD0yw4ASSL+myvvTxkg1TH/; __c=1465718622; __g=-; __l=l=%2F&r=; __a=74808725.1465379010.1465379010.1465718622.6.2.3.6; AB_T=abvb
redirect=%2F&account=casd1%40sina.com&password=123456&remember=true
Clicking the WebForms tab shows the submitted form fields (I have masked part of the values with *):
Name | Value |
---|---|
redirect | / |
account | c_***@sina.com |
password | 1111**** |
remember | true |
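The same fields can also be recovered from the raw request body without Fiddler's WebForms view, since the body is ordinary URL-encoded form data. A small sketch using only the standard library, with the body string copied from the capture above:

```python
try:
    from urllib.parse import parse_qsl  # Python 3
except ImportError:
    from urlparse import parse_qsl  # Python 2

# Request body copied from the captured login request.
body = "redirect=%2F&account=casd1%40sina.com&password=123456&remember=true"

# parse_qsl splits on "&" and "=" and decodes percent-escapes such as %2F and %40.
fields = dict(parse_qsl(body))
for name, value in fields.items():
    print("%s = %s" % (name, value))
```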
With the form fields known, I can use requests to simulate the form submission and log in.
# -*- coding:utf-8 -*-
"""
File Name : 'test3'.py
Description:
Author: 'chengwei'
Date: '2016/5/24' '14:08'
python: 2.7.10
"""
import requests


def main():
    # A Session keeps the cookies set by the login response
    s = requests.Session()
    # Form fields taken from the captured request; fill in real credentials
    data = {
        "redirect": '/',
        "account": 'username',
        "password": 'passwd',
        "remember": 'true',
    }
    # Headers copied from the captured browser request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Accept-Encoding': 'gzip, deflate',
        'X-Requested-With': 'XMLHttpRequest',
        'Accept': 'application/json, text/javascript, */*; q=0.01'
    }
    # Submit the login form, then fetch a page that requires login
    s.post('http://www.kanzhun.com/login.json', headers=headers, data=data)
    res = s.get('http://www.kanzhun.com/gsx3195.html?ka=com-blocker1-salary', headers=headers)
    print(res.status_code)


if __name__ == '__main__':
    main()
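Without logging in, the salary page does not show its full content. After logging in by submitting the form, later requests to the salary page through the same session return the full content.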
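This works because a `requests.Session` stores cookies it receives and attaches them to later requests to the same site. The sketch below shows that mechanism offline, without contacting the server. The cookie name `t` is borrowed from the capture above purely for illustration (the capture does not tell us which cookie actually marks a login), and the value `demo-token` is a placeholder set by hand.

```python
import requests

url = "http://www.kanzhun.com/gsx3195.html?ka=com-blocker1-salary"

# Stand-in for a logged-in session: set a cookie by hand instead of logging in.
logged_in = requests.Session()
logged_in.cookies.set("t", "demo-token", domain="www.kanzhun.com")

# prepare_request() builds the outgoing request without sending it,
# merging in whatever cookies the session currently holds.
with_cookie = logged_in.prepare_request(requests.Request("GET", url))
without_cookie = requests.Session().prepare_request(requests.Request("GET", url))

print(with_cookie.headers.get("Cookie"))
print(without_cookie.headers.get("Cookie"))
```

Printing `s.cookies` after the login POST in the script above is a similar quick check that the server actually set something.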