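Python Web Data Collection: Inspect Element

Ordinary web data can be collected with GET or POST requests. On some sites, however, the content appears only in the source shown by the browser's Inspect Element view (the rendered DOM), not in the raw response. This article shows how to collect such data with Python.

Collecting data through Inspect Element in Python requires the selenium library. The steps are as follows:

Download and install selenium from the official site, then write a Python test script.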

  from selenium import webdriver

  browser = webdriver.Firefox()
  browser.get("http://www.baidu.com")

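Running the script raises the following error:

  WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

This happens because, starting with selenium 3.x, the __init__ in webdriver/firefox/webdriver.py defaults to executable_path="geckodriver". In addition, Firefox 47 and later need a separately downloaded driver, geckodriver.

Download the geckodriver build that matches your browser version, unzip it, place geckodriver.exe in the Firefox installation directory, and add that directory to Path under Environment Variables > System variables.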
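Before launching the browser, it can be useful to confirm that the driver is actually reachable. The following is a minimal sketch using only the standard library; the helper name find_driver and its extra_dirs parameter are hypothetical and not part of selenium.

```python
import os
import shutil


def find_driver(name="geckodriver", extra_dirs=()):
    """Return the full path of the driver executable, or None if it is not found.

    extra_dirs (hypothetical parameter) lists directories to search
    before the ones already on PATH.
    """
    search_path = os.pathsep.join(list(extra_dirs) + [os.environ.get("PATH", "")])
    # On Windows, shutil.which also tries the extensions in PATHEXT,
    # so "geckodriver" matches geckodriver.exe.
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    location = find_driver()
    if location is None:
        print("geckodriver is not on PATH; add its directory to the Path variable")
    else:
        print("geckodriver found at " + location)
```

If the function returns None after the environment variable has been edited, reopening the terminal or IDE is usually needed so that the new Path value is picked up.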

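Running the script again may raise this error:

  WebDriverException: Message: Unable to find a matching set of capabilities

This error typically points to a version mismatch in the environment. First, check the local Java version: Selenium 3.x supports only Java 8 or later. Second, check the Firefox version: uninstall Firefox 47 and install the latest release (version 57 at the time of writing). With both in order the script runs successfully; this was verified by hand.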
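Since the fix depends on version numbers, the check can also be sketched in code. The example below parses a version string such as the one `firefox --version` prints on Linux and macOS. The helper names, the sample strings, and the threshold of 57 (the version the article reports as working, not an official compatibility limit) are assumptions.

```python
import re

# Assumption: 57 is the Firefox version the article reports as working.
MIN_FIREFOX_MAJOR = 57


def parse_major_version(version_text):
    """Extract the leading major version number from a version string."""
    match = re.search(r"(\d+)(?:\.\d+)*", version_text)
    return int(match.group(1)) if match else None


def is_supported(version_text, minimum=MIN_FIREFOX_MAJOR):
    """Return True when the parsed major version is at least `minimum`."""
    major = parse_major_version(version_text)
    return major is not None and major >= minimum


if __name__ == "__main__":
    for sample in ("Mozilla Firefox 47.0.2", "Mozilla Firefox 57.0"):
        print(sample, "->", "ok" if is_supported(sample) else "upgrade needed")
```

The authoritative pairing of Firefox, geckodriver, and selenium versions is the compatibility information published with geckodriver, so a check like this is only a quick first filter.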

Once the test script runs without errors, the data collection code can be written, as shown below.

  # -*- coding: utf-8 -*-
  import random
  import re
  import time

  from bs4 import BeautifulSoup
  from selenium import webdriver

  # Path to the geckodriver executable (raw string keeps the backslashes literal).
  geckodriver_path = r"D:\Program Files\Mozilla Firefox\geckodriver"
  # executable_path is how Selenium 3.x is told where the driver lives.
  driver = webdriver.Firefox(executable_path=geckodriver_path)

  total_pages = 200
  head = 'http://apps.webofknowledge.com'
  # The SID value looks like a session ID and will likely need replacing.
  base_url = ('http://apps.webofknowledge.com/summary.do?product=WOS&parentProduct=WOS'
              '&search_mode=GeneralSearch&parentQid=&qid=1&SID=U2NEeEDm3nTK7peYwWt'
              '&&update_back2search_link_param=yes&page=')

  try:
      with open('WOS_file.csv', 'w') as f:
          for page in range(1, total_pages + 1):
              print("Page %i ..." % page)
              driver.get(base_url + str(page))

              # page_source is the DOM as rendered by the browser.
              sourcePage = driver.page_source
              soup = BeautifulSoup(sourcePage, "html.parser")
              urls = soup.find_all("a", class_="smallV110",
                                   href=re.compile("full_record.do"),
                                   attrs={"tabindex": "0"})
              print(len(urls))

              # Write one absolute record URL per line.
              for r_url in urls:
                  f.write(head + str(r_url.attrs['href']) + '\n')

              # Pause 0.5 to 1.0 seconds between pages.
              time.sleep(random.randint(5, 10) / 10.0)
  finally:
      # Close the browser even if a page fails.
      driver.quit()
  print("done...")

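This example relies on two third-party libraries: selenium (through its webdriver module) and BeautifulSoup. driver.page_source returns the page source as it appears in Inspect Element; BeautifulSoup(sourcePage, "html.parser") parses that source into a navigable tree; and BeautifulSoup's find_all method searches the tree for matching elements. These are the libraries and methods most commonly used for web data collection. Their APIs are covered in the official documentation and in many blog posts, so they are not described further here.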
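To make the link filter concrete without launching a browser, the sketch below applies the same three conditions as the find_all call (class smallV110, tabindex "0", and an href containing full_record.do) to a small static snippet. It swaps in the standard library's html.parser for BeautifulSoup so that it has no dependencies, and it joins URLs with urljoin instead of string concatenation. The sample markup and the names RecordLinkParser and extract_record_links are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

HEAD = "http://apps.webofknowledge.com"

# Hypothetical markup for illustration; not copied from the real site.
SAMPLE_HTML = """
<a class="smallV110" tabindex="0" href="/full_record.do?doc=1">Record 1</a>
<a class="smallV110" href="/full_record.do?doc=2">No tabindex</a>
<a class="other" tabindex="0" href="/full_record.do?doc=3">Wrong class</a>
<a class="smallV110" tabindex="0" href="/summary.do?page=2">Not a record</a>
"""


class RecordLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags that pass all three conditions."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        href = attr.get("href") or ""
        if ("smallV110" in classes
                and attr.get("tabindex") == "0"
                and re.search(r"full_record\.do", href)):
            # urljoin resolves the relative href against the site root.
            self.links.append(urljoin(HEAD, href))


def extract_record_links(html):
    """Return the filtered, absolute record links found in `html`."""
    parser = RecordLinkParser()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for link in extract_record_links(SAMPLE_HTML):
        print(link)
```

Only the first anchor in the sample satisfies all three conditions, which mirrors how combining class_, href, and attrs in find_all narrows the result to record links.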
