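Python 3.x Crawler in Practice (Crawling and Parsing Dynamic Pages)

Crawling Toutiao (今日头条) with Python 3 + Scrapy + PhantomJS + Selenium
When writing crawlers we inevitably run into sites whose content is generated by JavaScript, Ajax, and other dynamic-page techniques; Toutiao is a good example.
This article describes how to crawl dynamic pages with Python 3 together with Scrapy, PhantomJS, and Selenium.
The two projects built for this article have been uploaded to GitHub (stars welcome): 1. Crawling the Toutiao news-list URLs: 2. Crawling the Toutiao news content:
For static-page crawling and for setting up a crawler environment on Windows, see the previous posts in this series; the required software is also provided in the previous post.

This article uses PhantomJS + Selenium to crawl the news content. Crawling the news-list URLs works on the same principle and is not repeated here.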
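Project structure

[Figure: project structure]

How the project works
The code is written in Python 3, with Scrapy as the underlying crawler framework. Because the target pages are dynamic, the HTML returned by the server is not the finished page; the content is generated afterwards through Ajax and similar techniques. PhantomJS + Selenium are therefore used as a headless browser that simulates user actions and captures the rendered page source.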
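Code files
From top to bottom, the project structure contains:
middleware.py: the core of the project; the downloader middleware that drives the simulated browser while Scrapy processes each request
ContentSpider.py: the spider class file, which defines the spider
commonUtils: utility functions
items.py: the item classes that hold the scraped fields
pipelines.py: the classes that process the scraped data

These five are the key files; the rest of the code is business-specific.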
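To see why a downloader middleware is a convenient hook for the browser, it may help to look at a toy model of the request flow. In Scrapy, when a middleware's process_request returns a response object, the normal download is skipped and that response is handed on to the spider. The sketch below imitates only that rule in plain Python; the classes Request, Response, RenderingMiddleware, PassThroughMiddleware and the functions fetch, default_download, fake_render are all hypothetical stand-ins, not Scrapy's own code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Request:
    url: str


@dataclass
class Response:
    url: str
    body: str
    source: str  # which component produced this response


class PassThroughMiddleware:
    """A middleware that does nothing, so processing continues."""

    def process_request(self, request: Request) -> Optional[Response]:
        return None


class RenderingMiddleware:
    """Hypothetical stand-in for a middleware that drives a headless browser."""

    def __init__(self, render: Callable[[str], str]):
        self.render = render  # function that returns rendered HTML for a URL

    def process_request(self, request: Request) -> Optional[Response]:
        html = self.render(request.url)
        return Response(request.url, html, source="middleware")


def default_download(request: Request) -> Response:
    # Stand-in for a plain HTTP fetch: only the unrendered page shell comes back.
    return Response(request.url, "<div id='app'></div>", source="downloader")


def fetch(request: Request, middlewares: List[object]) -> Response:
    # The first middleware that returns a response short-circuits the download.
    for mw in middlewares:
        response = mw.process_request(request)
        if response is not None:
            return response
    return default_download(request)


def fake_render(url: str) -> str:
    # Pretend a browser executed the page scripts and filled in the content.
    return "<div id='app'><h1>rendered</h1></div>"


plain = fetch(Request("http://example.com/a"), [PassThroughMiddleware()])
rendered = fetch(
    Request("http://example.com/a"),
    [PassThroughMiddleware(), RenderingMiddleware(fake_render)],
)
print(plain.source, rendered.source)
```

With only the pass-through middleware the response comes from the stand-in downloader; once the rendering middleware is in the chain, its response wins and the spider would see the rendered markup.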
Key code walkthrough
middleware.py

# douguo request middleware
# for pages that are loaded by js/ajax
# any changes should be recorded here:
# @author zhangjianfei
# @date 2017/05/04

from selenium import webdriver
from scrapy.http import HtmlResponse
from DgSpiderPhantomJS import settings
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
import random


class JavaScriptMiddleware(object):
    print("LOGS Starting Middleware ...")

    def process_request(self, request, spider):

        print("LOGS:  process_request is starting  ...")

        # Capabilities for the headless browser
        dcap = dict(DesiredCapabilities.PHANTOMJS)

        # Pick a random User-Agent
        dcap["phantomjs.page.settings.userAgent"] = random.choice(settings.USER_AGENTS)

        # Start PhantomJS (webdriver.PhantomJS requires Selenium 3.x; it was removed in Selenium 4)
        driver = webdriver.PhantomJS(executable_path=r"D:\phantomjs-2.1.1\bin\phantomjs.exe", desired_capabilities=dcap)

        try:
            # Give up on the page load after 60 seconds
            driver.set_page_load_timeout(60)
            # Give up on script execution after 60 seconds
            driver.set_script_timeout(60)

            # get page request
            driver.get(request.url)

            # simulate user behavior
            js = "document.body.scrollTop=10000"
            driver.execute_script(js)  # Run JavaScript to imitate a user; here the page is scrolled down to 10000.

            # Intended to wait for the asynchronous requests. Note that implicitly_wait
            # only sets the timeout for later element lookups; it does not pause here.
            driver.implicitly_wait(20)

            # Grab the rendered page source
            body = driver.page_source
            current_url = driver.current_url
        finally:
            # Always shut the browser down so PhantomJS processes do not pile up
            driver.quit()

        return HtmlResponse(current_url, body=body, encoding='utf-8', request=request)
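The middleware only runs if the project settings enable it, and it reads settings.USER_AGENTS, which the article does not show. The fragment below is a minimal sketch of what the relevant part of settings.py might look like. DOWNLOADER_MIDDLEWARES and ITEM_PIPELINES are standard Scrapy settings; the dotted module paths, the priority numbers, and the User-Agent strings are assumptions chosen for illustration.

```python
# settings.py (fragment); every value below is an illustrative assumption

# Route requests through the rendering middleware. The dotted path assumes the
# class lives in DgSpiderPhantomJS/middleware.py, as the file list suggests.
DOWNLOADER_MIDDLEWARES = {
    "DgSpiderPhantomJS.middleware.JavaScriptMiddleware": 543,
}

# Pool sampled by the middleware via random.choice(settings.USER_AGENTS).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0",
]

# Enable the item pipeline; the path assumes the class lives in pipelines.py.
ITEM_PIPELINES = {
    "DgSpiderPhantomJS.pipelines.DgspiderphantomjsPipeline": 300,
}
```

Keeping several User-Agent strings in one list lets each request present a different browser identity without touching the middleware code.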

ContentSpider.py

# -*- coding: utf-8 -*-

import scrapy
import random
import time
from DgSpiderPhantomJS.items import DgspiderPostItem
from scrapy.selector import Selector
from DgSpiderPhantomJS import urlSettings
from DgSpiderPhantomJS import contentSettings
from DgSpiderPhantomJS.mysqlUtils import dbhandle_update_status
from DgSpiderPhantomJS.mysqlUtils import dbhandle_geturl


class DgContentSpider(scrapy.Spider):
    # Note: everything in the class body below runs once, when the class is defined.
    print('LOGS: Spider Content_Spider Starting ...')

    sleep_time = random.randint(60, 90)
    print("LOGS: Sleeping :" + str(sleep_time))
    time.sleep(sleep_time)

    # get url from db
    result = dbhandle_geturl()
    url = result[0]
    # spider_name = result[1]
    site = result[2]
    gid = result[3]
    module = result[4]

    # set spider name
    name = 'Content_Spider'
    # name = 'DgUrlSpiderPhantomJS'

    # set domains
    allowed_domains = [site]

    # set scrapy url
    start_urls = [url]

    # change status
    # Whether or not the crawl succeeds, set status to 1 so the same page
    # is not picked up again in an endless loop.
    dbhandle_update_status(url, 1)

    # scrapy crawl
    def parse(self, response):

        # init the item
        item = DgspiderPostItem()

        # get the page source
        sel = Selector(response)

        print(sel)

        # get post title
        title_date = sel.xpath(contentSettings.POST_TITLE_XPATH)
        item['title'] = title_date.xpath('string(.)').extract()

        # get post page source
        item['text'] = sel.xpath(contentSettings.POST_CONTENT_XPATH).extract()

        # get url
        item['url'] = DgContentSpider.url

        yield item
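The title is read with the XPath expression string(.), which returns the concatenated text of a node and all of its descendants, so inline tags inside the title do not split it into pieces. The sketch below shows the same idea using only the standard library: ElementTree's itertext() is used as an analogue of string(.), not as the selector Scrapy uses. The fragment TITLE_HTML and the helpers node_text and leading_text are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a title node.
TITLE_HTML = (
    "<h1 class='article-title'>Dynamic <em>page</em> crawling "
    "<span>in practice</span></h1>"
)


def node_text(fragment: str) -> str:
    """Concatenate all text inside the node, descendants included."""
    node = ET.fromstring(fragment)
    return "".join(node.itertext())


def leading_text(fragment: str) -> str:
    """Return only the text that precedes the node's first child element."""
    node = ET.fromstring(fragment)
    return node.text or ""


print(node_text(TITLE_HTML))     # Dynamic page crawling in practice
print(leading_text(TITLE_HTML))  # Dynamic (followed by a space)
```

Taking only the leading text would cut the title off at the first inline tag, which is why the whole-node string value is the safer choice for titles.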

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DgspiderUrlItem(scrapy.Item):
    url = scrapy.Field()


class DgspiderPostItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    text = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import re
import datetime
import urllib.request
from DgSpiderPhantomJS import urlSettings
from DgSpiderPhantomJS import contentSettings
from DgSpiderPhantomJS.mysqlUtils import dbhandle_insert_content
from DgSpiderPhantomJS.uploadUtils import uploadImage
from DgSpiderPhantomJS.mysqlUtils import dbhandle_online
from DgSpiderPhantomJS.PostHandle import post_handel
from DgSpiderPhantomJS.mysqlUtils import dbhandle_update_status
from bs4 import BeautifulSoup
from DgSpiderPhantomJS.commonUtils import get_random_user
from DgSpiderPhantomJS.commonUtils import get_linkmd5id


class DgspiderphantomjsPipeline(object):

    # replies built for the post
    cs = []

    # post title
    title = ''

    # post text
    text = ''

    # URL currently being crawled
    url = ''

    # random user ID
    user_id = ''

    # image flag
    has_img = 0

    # get title flag
    get_title_flag = 0

    def __init__(self):
        DgspiderphantomjsPipeline.user_id = get_random_user(contentSettings.CREATE_POST_USER)

    # process the data
    def process_item(self, item, spider):
        self.get_title_flag += 1

        # URL of the current page
        DgspiderphantomjsPipeline.url = item['url']

        # post title
        if len(item['title']) == 0:
            title_tmp = ''
        else:
            title_tmp = item['title'][0]

        # Replace characters in the title that could cause SQL syntax errors.
        # For paginated articles, only the title of the first page is used.
        if self.get_title_flag == 1:

            # Clean up the title with BeautifulSoup
            soup_title = BeautifulSoup(title_tmp, "lxml")
            title = ''
            # Do not use .prettify() on the parsed tree: it puts every tag on its own
            # line and leaves runs of spaces and line breaks. Use stripped_strings
            # to collect the text instead.
            for string in soup_title.stripped_strings:
                title += string

            title = title.replace("'", "”").replace('"', '“')
            DgspiderphantomjsPipeline.title = title

        # post body
        if len(item['text']) == 0:
            text_temp = ''
        else:
            text_temp = item['text'][0]

        soup = BeautifulSoup(text_temp, "lxml")
        text_temp = str(soup)

        # Find the images
        reg_img = re.compile(r'<img.*?>')
        imgs = reg_img.findall(text_temp)
        for img in imgs:
            # matchObj = re.search('.*src="(.*)"{2}.*', img, re.M | re.I)
            match_obj = re.search('.*src="(.*)".*', img, re.M | re.I)
            if match_obj is None:
                # <img> tag without a src attribute: nothing to download
                continue
            DgspiderphantomjsPipeline.has_img = 1

            # The greedy match can run past the closing quote of src;
            # keep only the part before the first double quote
            img_url_tmp = match_obj.group(1).split('"')[0]

            # Add a scheme to protocol-relative URLs (//host/path); leave others untouched
            if img_url_tmp.startswith('//'):
                imgUrl = 'http:' + img_url_tmp
            else:
                imgUrl = img_url_tmp

            list_name = imgUrl.split('/')
            file_name = list_name[len(list_name)-1]

            # if os.path.exists(settings.IMAGES_STORE):
            #     os.makedirs(settings.IMAGES_STORE)

            # Local path where the image is stored
            file_path = contentSettings.IMAGES_STORE + file_name
            # Download the image to local disk, then upload it
            urllib.request.urlretrieve(imgUrl, file_path)
            upload_img_result_json = uploadImage(file_path, 'image/jpeg', DgspiderphantomjsPipeline.user_id)
            # Server-side image path, width, and height returned by the upload
            img_u = upload_img_result_json['result']['image_url']
            img_w = upload_img_result_json['result']['w']
            img_h = upload_img_result_json['result']['h']
            img_upload_flag = str(img_u)+';'+str(img_w)+';'+str(img_h)

            # Replace the <img> tag with the upload result wrapped in marker tags
            text_temp = text_temp.replace(img, '[dgimg]' + img_upload_flag + '[/dgimg]')

        # Strip <strong> tags
        text_temp = text_temp.replace('<strong>', '').replace('</strong>', '')

        # Clean up the HTML with BeautifulSoup
        soup = BeautifulSoup(text_temp, "lxml")
        text = ''
        # As above, avoid .prettify(), which would introduce extra spaces and line breaks
        for string in soup.stripped_strings:
            text += string + '\n\n'

        # Replace double quotes with Chinese double quotes to avoid MySQL syntax errors
        DgspiderphantomjsPipeline.text = self.text + text.replace('"', '“')

        return item

    # called when the spider is opened
    def open_spider(self, spider):
        pass

    # called when the spider is closed
    def close_spider(self, spider):

        # Save the data to the database: 235
        url = DgspiderphantomjsPipeline.url
        title = DgspiderphantomjsPipeline.title
        content = DgspiderphantomjsPipeline.text
        user_id = DgspiderphantomjsPipeline.user_id
        create_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        dbhandle_insert_content(url, title, content, user_id, DgspiderphantomjsPipeline.has_img, create_time)

        # Process the text, set status, and upload to dgCommunity.dg_post.
        # The post is uploaded only if has_img is 1.
        if DgspiderphantomjsPipeline.has_img == 1:
            if title.strip() != '' and content.strip() != '':
                spider.logger.info('status=2 , has_img=1, title and content is not null! Uploading post into db...')
                post_handel(url)
            else:
                spider.logger.info('status=1 , has_img=1, but title or content is null! ready to exit...')
        else:
            spider.logger.info('status=1 , has_img=0, changing status and ready to exit...')
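The pipeline locates images with regular expressions, which is fragile when attributes appear in a different order or a tag has no src at all. As an alternative sketch, the standard-library HTML parser can collect the src attributes, and urllib.parse can derive the file name. This is not the author's code: the names ImgSrcCollector, normalize, image_targets, and the sample markup are hypothetical, and the default scheme of http is an assumption carried over from the pipeline.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit
import posixpath


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a usable src
                self.sources.append(src)


def normalize(src, default_scheme="http"):
    """Add a scheme to protocol-relative URLs; leave other URLs unchanged."""
    if src.startswith("//"):
        return default_scheme + ":" + src
    return src


def image_targets(html):
    """Return (absolute_url, file_name) pairs for the images in html."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    targets = []
    for src in collector.sources:
        url = normalize(src)
        # Take the file name from the URL path only, ignoring any query string.
        file_name = posixpath.basename(urlsplit(url).path)
        targets.append((url, file_name))
    return targets


sample = (
    '<p>text<img src="//p3.example.com/large/abc.jpg" alt="a"/>'
    '<img alt="no source">'
    '<img src="https://img.example.com/x/y.png?w=1"></p>'
)
for url, file_name in image_targets(sample):
    print(url, file_name)
```

Each pair could then be passed to a download step such as urllib.request.urlretrieve(url, local_dir + file_name), where local_dir plays the role of contentSettings.IMAGES_STORE.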

Reposted from:

http://blog.csdn.net/qq_31573519/article/details/74248559

Last edited: 2017.12.10 01:53:17
