First, start mitmproxy with the following command:
mitmweb -s 抓取.py
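By default mitmweb's proxy listens on 127.0.0.1:8080 (its web UI is served on port 8081), and the browser has to be configured to use that proxy before any traffic shows up. A minimal sanity check, assuming the requests library is installed (the target URL is just a placeholder):

# Send one request through mitmproxy's default proxy; it should then
# appear in the mitmweb UI. http://example.com is only a test target.
import requests

proxies = {'http': 'http://127.0.0.1:8080',
           'https': 'http://127.0.0.1:8080'}
resp = requests.get('http://example.com', proxies=proxies)
print(resp.status_code)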
We need to filter the request URLs: whenever a request URL contains the string "ajax", we extract the response body, parse the XML with bs4, and pull out the fields we need. The code is as follows:
from mitmproxy import ctx
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver

# Optionally rewrite the user-agent before the request goes out:
# def request(flow):
#     flow.request.headers['user-agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'

def run_selenium():
    # Open the search page in PhantomJS (not actually used yet, see below)
    driver = webdriver.PhantomJS()
    url = 'http://wsjs.saic.gov.cn/txnRead01.do?SVVVdE0o=KalIkqkedI6edI6edpSi_r6ZKYhRAJQahSFFMpYtTEaqqH0'
    driver.get(url)

def response(flow):
    ctx.log.error('Captured URL: ' + flow.request.url)
    # Only the ajax endpoint carries the result records we want
    if 'txnRead02.ajax' in flow.request.url:
        soup = BeautifulSoup(flow.response.text, 'xml')
        for record in soup.find_all('record'):
            item = {}
            item['index'] = record.find('index').get_text()
            item['注册号'] = record.find('sn').get_text()    # registration number
            item['中文名称'] = record.find('hnc').get_text()  # Chinese name
            item['注册时间'] = record.find('mno').get_text()  # registration date
            item['英文名称'] = record.find('hne').get_text()  # English name
            item['国际分类'] = record.find('nc').get_text()   # international class
            ctx.log.warn(str(item))
            df = pd.DataFrame(item, index=['0'])
            # get_text() returns a string, so compare with '1' (not the
            # integer 1): write the CSV header only for the first record
            header = item['index'] == '1'
            df.to_csv('/爬虫例子/商标.csv', mode='a', encoding='utf_8_sig',
                      index=False, header=header)
        # [ctx.log.warn(a.get('href')) for a in soup.find_all('a')]

# mitmproxy imports this file as a module, so __name__ is not
# "__main__" under `mitmweb -s`; run_selenium() only fires when the
# script is executed directly.
if __name__ == "__main__":
    run_selenium()
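To sanity-check the field extraction without touching the site, the same parsing logic can be run against a hand-made XML snippet. The sample values below are invented; only the tag names (record, index, sn, hnc, mno, hne, nc) come from the script above:

# Offline test of the record parsing; the data is made up, only the
# tag names match what the response handler expects.
from bs4 import BeautifulSoup

sample = '''
<records>
  <record>
    <index>1</index><sn>12345678</sn><hnc>示例商标</hnc>
    <mno>2019-07-01</mno><hne>EXAMPLE</hne><nc>35</nc>
  </record>
</records>
'''

soup = BeautifulSoup(sample, 'xml')
for record in soup.find_all('record'):
    print({tag: record.find(tag).get_text()
           for tag in ['index', 'sn', 'hnc', 'mno', 'hne', 'nc']})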
Here I did not actually use selenium to load the page automatically; I plan to improve that later. The only manual step is paging through the results: after each page turn, mitmproxy automatically captures the response, parses it, and appends the records to the CSV file.
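As a sketch of that improvement, the paging could be driven by selenium too. This assumes PhantomJS is routed through mitmproxy (otherwise the ajax responses are never captured) and that the next-page control is a link with the text '下一页'; both that link text and the page count are assumptions, not verified against the site:

# Hypothetical paging loop (selenium 3 era API, matching the PhantomJS
# driver above). Adjust the link text and page count to the real page.
import time
from selenium import webdriver

def run_selenium_paged(pages=10):
    # Route PhantomJS through mitmproxy so its traffic is captured
    driver = webdriver.PhantomJS(service_args=['--proxy=127.0.0.1:8080',
                                               '--proxy-type=http'])
    driver.get('http://wsjs.saic.gov.cn/txnRead01.do?SVVVdE0o=KalIkqkedI6edI6edpSi_r6ZKYhRAJQahSFFMpYtTEaqqH0')
    for _ in range(pages):
        time.sleep(2)  # crude wait for the ajax response to arrive
        driver.find_element_by_link_text('下一页').click()
    driver.quit()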