Since 2020, Zhihu's anti-crawling mechanisms have become increasingly thorough, and API calls are now protected by encryption, so the quick methods that used to work no longer do. Time to go back to the old requests + selenium approach. As it happens, my graduation project needs a small crawler, so Zhihu is a good place to practice.
Crawl target: all answers by a given Zhihu user (easily adapted to articles, activity feeds, etc.)
Environment: Python 3.8 + requests + bs4 + selenium
Workflow:
1. Determine the start_url;
2. Locate the pagination element with an XPath and read the total number of pages;
3. Call get_answers(question_url) in a loop to collect the answers on each page. A page holds at most 20 answers, so the function tries 20 slots per page.
Issues encountered:
- Logging in. Without a login, frequent pop-ups slow the crawl down, and some anti-crawling limits also apply.
Solution: the usual idea is to log in once manually (or via a script), grab the cookies, and attach them to subsequent HTTP requests. Since selenium is used here, it is simpler to attach directly to a browser that is already logged in.
First, launch Chrome from the command line with remote debugging enabled:
:: the port can be changed to any free private port
chrome.exe --remote-debugging-port=8222 --user-data-dir="C:\selenum\AutomationProfile"
Log in to Zhihu in the browser that opens, then run the Python code.
chrome_options = Options()
chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:8222")
return webdriver.Chrome(executable_path='F:/ChromeDriver/chromedriver.exe', chrome_options=chrome_options)
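A note on versions: `executable_path` and `chrome_options` are the Selenium 3 calling convention and are deprecated in Selenium 4. A minimal sketch of the equivalent attach code for Selenium 4, assuming the same chromedriver path and debugging port as above:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# attach to the Chrome instance started with --remote-debugging-port=8222
options = Options()
options.add_experimental_option("debuggerAddress", "127.0.0.1:8222")
service = Service(executable_path='F:/ChromeDriver/chromedriver.exe')
driver = webdriver.Chrome(service=service, options=options)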
- IP bans. The data volume here is small, so this has not come up yet. The usual remedy is to rotate requests through an IP proxy pool; a sketch is shown below.
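A minimal sketch of proxy rotation, assuming you supply your own proxy addresses (the ones below are placeholders): pass a proxies dict for plain requests calls, or pass --proxy-server when building the Chrome options.
import random
import requests
from selenium.webdriver.chrome.options import Options

# placeholder addresses; replace with proxies from your own pool
PROXY_POOL = ['http://10.0.0.1:8080', 'http://10.0.0.2:8080']

def fetch_with_proxy(url):
    # rotate proxies per request for plain HTTP calls
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)

def proxied_chrome_options():
    # route the whole browser session through one proxy from the pool
    options = Options()
    options.add_argument('--proxy-server={}'.format(random.choice(PROXY_POOL)))
    return options
With those issues out of the way, the full script follows.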
# -*- coding: utf-8 -*-
"""
Created on Mon Mar 29 16:30:00 2021
@author: zw
"""
# import the required libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import json
import time
import re
def get_driver():
    # attach to a Chrome instance that is already open and logged in
    try:
        chrome_options = Options()
        chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:8222")
        return webdriver.Chrome(executable_path='F:/ChromeDriver/chromedriver.exe', chrome_options=chrome_options)
    except Exception:
        # fall back to a fresh Firefox session if attaching fails
        return webdriver.Firefox()
# crawl one page of answers
def get_answers(question_url):
    driver.get(question_url)
    count = 0
    for k in range(20):
        try:
            # xpath = '/html/body/div[1]/div/main/div/div[2]/div[1]/div/div[3]/div/div[2]/div[{}]/div/div[2]/div'.format(k+1)
            # XPath variant used when the account is banned (the layout differs)
            xpath = '/html/body/div[1]/div/main/div/div[3]/div[1]/div/div[3]/div/div[2]/div[{}]/div/div[2]/div'.format(k+1)
            # expand the answer by clicking its "read more" button
            button = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, xpath + '/button')))
            button.click()
            # grab the expanded answer text
            element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, xpath + '/span')))
            answer = element.text
            # print(answer)
            # append the answer to the output file
            with open('F:/爬虫/jin-rong-nan.txt', 'a', encoding='utf-8') as file:
                file.write(answer)
            count = count + 1
            # print('answer ' + str(count) + ' collected!')
        except Exception:
            continue
    return count
if __name__ == "__main__":
    driver = get_driver()
    # URL of the target user's answer list
    # 兰飞鸿
    # start_url = 'https://www.zhihu.com/people/lan-fei-hong-3/answers'
    # start_url = 'https://www.zhihu.com/people/lan-fei-hong-2/answers'
    # start_url = 'https://www.zhihu.com/people/lan-fei-hong-26/answers'
    # 爱分析的金融男
    start_url = 'https://www.zhihu.com/people/ai-jin-rong-de-fen-xi-ren/answers'
    # xpath = '/html/body/div[1]/div/main/div/div[2]/div[1]/div/div[3]/div/div[2]/div[last()]/button[last()-1]'
    # XPath variant used when the account is banned (the layout differs)
    xpath = '/html/body/div[1]/div/main/div/div[3]/div[1]/div/div[3]/div/div[2]/div[last()]/button[last()-1]'
    driver.get(start_url)
    # read the total page count from the pagination bar
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, xpath)))
    pagenum = int(element.text)
    count = 0
    print(pagenum)
    print('crawl running...')
    for k in range(pagenum):
        # URL of the k-th page of the answer list
        question_url = start_url + '?page={}'.format(k + 1)
        count = count + get_answers(question_url)
        print('answer ' + str(count) + ' collected!')