Scraping Sina Weibo with Scrapy

Project overview: Most readers have probably used Sina Weibo, one of the most popular social apps today. For exactly that reason, we want to collect user profiles, posts, comments, publish times and so on from Weibo to meet business needs: daily trending topics, comment counts, like counts and similar metrics. In the age of big data, whoever holds the data holds the advantage, so this article walks through how to scrape Sina Weibo.

First, set up the Python environment (Python 2.7 plus Scrapy + Selenium + PhantomJS + Chrome).

I. Installing Python 2.7 + Scrapy + Selenium + PhantomJS

The example below is based on Python 2.7.9; other versions work the same way.
1. Download Python

wget https://www.python.org/ftp/python/2.7.9/Python-2.7.9.tgz

2. Extract, then compile and install (run the following five commands in order)

tar -zxvf Python-2.7.9.tgz
cd Python-2.7.9
./configure --prefix=/usr/local/python-2.7.9
make
make install

3. The system ships with its own Python, so add a symlink for the newly installed version:

 ln -s /usr/local/python-2.7.9/bin/python /usr/bin/python2.7.9 

4. To run a script with this version, just type "python2.7.9", a space, and the path to the .py script:

python2.7.9 ~/helloworld.py

Installing Scrapy:

pip install scrapy
# only needed for a distributed crawl (see the settings sketch below)
pip install scrapy_redis
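
If you do turn on scrapy_redis for a distributed crawl, a few project settings are needed so that every crawler node shares the same Redis-backed request queue. A minimal sketch, assuming Redis runs locally on its default port:

# settings.py -- minimal scrapy_redis wiring (only relevant if scrapy_redis is installed;
# the Redis address below is a placeholder)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # share the request queue through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # share request fingerprints for dedup
SCHEDULER_PERSIST = True                                    # keep the queue between runs
REDIS_URL = "redis://127.0.0.1:6379"                        # every node points at the same Redis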

Installing Selenium:

pip install selenium

Installing PhantomJS:

Download PhantomJS into /usr/local/src/packet/ (or any directory you prefer).

Operating system: CentOS 7 64-bit

1. Install the dependencies:

yum -y install wget fontconfig

2. Decompress the downloaded archive:

bzip2 -d phantomjs-2.1.1-linux-x86_64.tar.bz2

3. Extract the tar into /usr/local/:

tar xvf phantomjs-2.1.1-linux-x86_64.tar -C /usr/local/

4. Rename the directory (so the phantomjs command is easier to use later):

mv /usr/local/phantomjs-2.1.1-linux-x86_64/ /usr/local/phantomjs

5. Finally, create a symlink (this puts a phantomjs link in /usr/bin/, which is on PATH; check with echo $PATH if you are not sure):

ln -s /usr/local/phantomjs/bin/phantomjs /usr/bin/

At this point the installation is complete. Thanks to the symlink created above you can now run phantomjs like any other command, so give it a quick test:

[root@localhost ~]# phantomjs
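
As an extra sanity check you can also drive the freshly installed PhantomJS from Selenium, which is exactly how it will be used later. A minimal sketch, assuming a Selenium release that still ships webdriver.PhantomJS (the 2.x/3.x era):

# check_phantomjs.py -- confirm that Selenium can drive PhantomJS
from selenium import webdriver

driver = webdriver.PhantomJS()   # found via the /usr/bin/phantomjs symlink created above
driver.get("https://weibo.cn")   # load the mobile Weibo entry page
print driver.title               # the title should contain "微博"
driver.quit()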

II. Installing Chrome

Note: to give the server a Chrome runtime that Selenium automation scripts can drive for scraping, the Chrome dependencies have to be configured first.

Running Selenium + ChromeDriver on a server

1. Introduction
I wanted to use Selenium to pull data from websites, but PhantomJS would sometimes throw errors. Chrome now offers a headless mode, so PhantomJS is no longer strictly necessary.
However, installing Chrome on the server produced a few errors of its own, so the whole installation process is summarized here.

2. Installing Chrome on Ubuntu

# Install Google Chrome
# https://askubuntu.com/questions/79280/how-to-install-chrome-browser-properly-via-command-line
sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb  # Might show "errors", fixed by next line
sudo apt-get install -f

Chrome should now be installed; test it with the command below:

google-chrome --headless --remote-debugging-port=9222 https://chromium.org --disable-gpu

This starts Chrome in headless mode with remote debugging enabled. Most Ubuntu servers have no GPU, so --disable-gpu is passed to avoid errors.

Then open another SSH connection to the server and query its local port 9222 from the command line:

curl http://localhost:9222

If the installation succeeded you will see debugging information. In my case it reported an error instead; the fix is below.

A possible error and its fix:
Running the command above may report an error saying Chrome cannot be run as root. In that case, configure Chrome as follows.
(1) Locate the google-chrome launcher file. Mine is in /opt/google/chrome/.
(2) Open the google-chrome file with vi:

vi /opt/google/chrome/google-chrome

Find the following line in the file:

exec -a "$0" "$HERE/chrome" "$@"

(3) Append --user-data-dir --no-sandbox to it, so the whole shell command becomes:

exec -a "$0" "$HERE/chrome" "$@" --user-data-dir --no-sandbox

(4) Relaunch google-chrome and it will now run normally.
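
If you prefer not to edit the wrapper script, the same flags can normally be passed through Selenium's ChromeOptions instead (this also needs the ChromeDriver installed in the next step). A sketch under that assumption; the profile directory path is made up:

# Pass --no-sandbox via ChromeOptions instead of editing /opt/google/chrome/google-chrome
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')                         # allow running as root
chrome_options.add_argument('--user-data-dir=/tmp/chrome-profile')  # hypothetical profile dir
wd = webdriver.Chrome(chrome_options=chrome_options)
wd.get("https://chromium.org")
print wd.title
wd.quit()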

3. Installing the Chrome driver (ChromeDriver)

Download ChromeDriver.
ChromeDriver exposes an API for driving Chrome; it is the bridge through which Selenium controls the browser.
It is best to install the latest ChromeDriver. I initially installed an older version and got an error; the latest one worked fine. It can be found at:
https://sites.google.com/a/chromium.org/chromedriver/downloads

At the time of writing, the latest version was 2.37.

wget https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip
unzip chromedriver_linux64.zip

With that, headless Chrome on the server is fully set up.

4. How to use headless Chrome

from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("user-agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'")
wd = webdriver.Chrome(chrome_options=chrome_options,executable_path='/home/chrome/chromedriver')
wd.get("https://www.163.com")
content = wd.page_source.encode('utf-8')
print content
wd.quit()

III. Scraping the data:

To scrape Sina Weibo we need to simulate a login and save the cookies once the login succeeds. To avoid getting accounts banned, the crawl rotates over many Weibo accounts (how many depends on how much data you need).
1. First, simulate the login and obtain the cookies

#!/usr/bin/env python
# encoding: utf-8
import datetime
import json
import base64
from time import sleep
import pymongo
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Enter your Weibo usernames and passwords below (accounts can be bought on Taobao for about one yuan per seven).
It is worth buying a few dozen: Weibo's anti-scraping is aggressive, and crawling too frequently triggers 302 redirects.
Alternatively, you can increase the interval between requests.

'''
WeiBoAccounts = [
{'username': 'javzx61369@game.weibo.com', 'password': 'esdo77127'},
{'username': 'v640e2@163.com', 'password': 'wy539067'},
{'username': 'd3fj3l@163.com', 'password': 'af730743'},
{'username': 'oia1xs@163.com', 'password': 'tw635958'},
]
'''
WeiBoAccounts = [{'username': 'your_username', 'password': 'your_password'}]  # fill in your own accounts
cookies = []
client = pymongo.MongoClient("192.168.98.5", 27017)
db = client["Sina"]
userAccount = db["userAccount"]
def get_cookie_from_weibo(username, password):
    driver = webdriver.PhantomJS()
    driver.get('https://weibo.cn')
    print driver.title
    assert "微博" in driver.title
    login_link = driver.find_element_by_link_text('登录')
    ActionChains(driver).move_to_element(login_link).click().perform()
    login_name = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.ID, "loginName"))
    )
    login_password = driver.find_element_by_id("loginPassword")
    login_name.send_keys(username)
    login_password.send_keys(password)
    login_button = driver.find_element_by_id("loginAction")
    login_button.click()
    # Pause here for 10 seconds to check whether the login succeeded in the launched browser; if not, log in manually
    sleep(10)
    cookie = driver.get_cookies()
    #print driver.page_source
    print driver.current_url
    driver.close()
    return cookie
def init_cookies():
    for cookie in userAccount.find():
        cookies.append(cookie['cookie'])
if __name__ == "__main__":
    try:
        userAccount.drop()
    except Exception as e:
        pass
    for account in WeiBoAccounts:
        cookie = get_cookie_from_weibo(account["username"], account["password"])
        userAccount.insert_one({"_id": account["username"], "cookie": cookie})
    init_cookies()

The code is straightforward: it simulates the login, grabs the cookies and inserts them into MongoDB so they can be reused for later requests. The init_cookies() function is what the downloader middleware uses later on.
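
To double-check that a stored cookie is still usable, you can read it back from Mongo and attach it to an ordinary requests call. A rough sketch reusing the collection from the script above (judging validity only by the HTTP status is an assumption on my part):

# check_cookie.py -- read one saved cookie back from Mongo and reuse it with requests
import pymongo
import requests

client = pymongo.MongoClient("192.168.98.5", 27017)
account = client["Sina"]["userAccount"].find_one()  # assumes at least one account was saved

# Selenium stores cookies as a list of {'name': ..., 'value': ...} dicts;
# flatten them into the plain dict that requests expects.
cookie_dict = dict((c['name'], c['value']) for c in account['cookie'])

r = requests.get("https://weibo.cn", cookies=cookie_dict, timeout=5)
print account['_id'], r.status_code  # 200 suggests the cookie was accepted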

The Scrapy items code is as follows:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Item, Field
class InformationItem(Item):
    """ 个人信息 """
    _id = Field()  # 用户ID
    NickName = Field()  # 昵称
    Gender = Field()  # 性别
    Province = Field()  # 所在省
    City = Field()  # 所在城市
    BriefIntroduction = Field()  # 简介
    Birthday = Field()  # 生日
    Num_Tweets = Field()  # 微博数
    Num_Follows = Field()  # 关注数
    Num_Fans = Field()  # 粉丝数
    SexOrientation = Field()  # 性取向
    Sentiment = Field()  # 感情状况
    VIPlevel = Field()  # 会员等级
    Authentication = Field()  # 认证
    URL = Field()  # 首页链接
class TweetsItem(Item):
    """ 微博信息 """
    _id = Field()  # 用户ID-微博ID
    ID = Field()  # 用户ID
    Content = Field()  # 微博内容
    PubTime = Field()  # 发表时间
    Co_oridinates = Field()  # 定位坐标
    Tools = Field()  # 发表工具/平台
    Like = Field()  # 点赞数
    Comment = Field()  # 评论数
    Transfer = Field()  # 转载数
    filepath = Field()
class RelationshipsItem(Item):
    """ 用户关系,只保留与关注的关系 """
    fan_id = Field()
    followed_id = Field()  # 被关注者的ID

2. The initial seed list:

#!/usr/bin/env python
# encoding: utf-8
""" 初始的待爬队列 """
weiboID = [
    #"5303798085"
    #'6033587203'
    '6092234294']

3. The Scrapy spider code is as follows:

This code is mainly responsible for parsing the data out of each fetched page; please read it carefully.

# encoding: utf-8
import datetime
import requests
import re
from lxml import etree
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sina.config import weiboID
from sina.items import TweetsItem, InformationItem, RelationshipsItem
import time
import random
def rand_num():
    number = ""
    for i in range(5):
        number += str(random.randint(0,9))
    return number
class SinaSpider(Spider):
    name = "SinaSpider"
    host = "https://weibo.cn"
    start_urls = list(set(weiboID))
    filepath = '/home/YuQing/content/'
    def start_requests(self):
        for uid in self.start_urls:
            yield Request(url="https://weibo.cn/%s/info" % uid, callback=self.parse_information)
    def parse_information(self, response):
        """ 抓取个人信息 """
        informationItem = InformationItem()
        selector = Selector(response)
        ID = re.findall('(\d+)/info', response.url)[0]
        print response.url, response.body
        try:
            text1 = ";".join(selector.xpath('body/div[@class="c"]//text()').extract())  # 获取标签里的所有text()
            nickname = re.findall('昵称;?[::]?(.*?);', text1)
            gender = re.findall('性别;?[::]?(.*?);', text1)
            place = re.findall('地区;?[::]?(.*?);', text1)
            briefIntroduction = re.findall('简介;?[::]?(.*?);', text1)
            birthday = re.findall('生日;?[::]?(.*?);', text1)
            sexOrientation = re.findall('性取向;?[::]?(.*?);', text1)
            sentiment = re.findall('感情状况;?[::]?(.*?);', text1)
            vipLevel = re.findall('会员等级;?[::]?(.*?);', text1)
            authentication = re.findall('认证;?[::]?(.*?);', text1)
            url = re.findall('互联网;?[::]?(.*?);', text1)
            informationItem["_id"] = ID
            if nickname and nickname[0]:
                informationItem["NickName"] = nickname[0].replace(u"\xa0", "")
            if gender and gender[0]:
                informationItem["Gender"] = gender[0].replace(u"\xa0", "")
            if place and place[0]:
                place = place[0].replace(u"\xa0", "").split(" ")
                informationItem["Province"] = place[0]
                if len(place) > 1:
                    informationItem["City"] = place[1]
            if briefIntroduction and briefIntroduction[0]:
                informationItem["BriefIntroduction"] = briefIntroduction[0].replace(u"\xa0", "")
            if birthday and birthday[0]:
                try:
                    birthday = datetime.datetime.strptime(birthday[0], "%Y-%m-%d")
                    informationItem["Birthday"] = birthday - datetime.timedelta(hours=8)
                except Exception:
                    informationItem['Birthday'] = birthday[0]  # may be a zodiac sign rather than a date
            if sexOrientation and sexOrientation[0]:
                if sexOrientation[0].replace(u"\xa0", "") == gender[0]:
                    informationItem["SexOrientation"] = "同性恋"
                else:
                    informationItem["SexOrientation"] = "异性恋"
            if sentiment and sentiment[0]:
                informationItem["Sentiment"] = sentiment[0].replace(u"\xa0", "")
            if vipLevel and vipLevel[0]:
                informationItem["VIPlevel"] = vipLevel[0].replace(u"\xa0", "")
            if authentication and authentication[0]:
                informationItem["Authentication"] = authentication[0].replace(u"\xa0", "")
            if url:
                informationItem["URL"] = url[0]
            try:
                urlothers = "https://weibo.cn/attgroup/opening?uid=%s" % ID
                new_ck = {}
                for ck in response.request.cookies:
                    new_ck[ck['name']] = ck['value']
                r = requests.get(urlothers, cookies=new_ck, timeout=5)
                if r.status_code == 200:
                    selector = etree.HTML(r.content)
                    texts = ";".join(selector.xpath('//body//div[@class="tip2"]/a//text()'))
                    print texts
                    if texts:
                        # num_tweets = re.findall(r'微博\[(\d+)\]', texts)
                        num_tweets = texts.split(';')[0].replace('微博[', '').replace(']','')
                        # num_follows = re.findall(r'关注\[(\d+)\]', texts)
                        num_follows = texts.split(';')[1].replace('关注[', '').replace(']','')
                        # num_fans = re.findall(r'粉丝\[(\d+)\]', texts)
                        num_fans = texts.split(';')[2].replace('粉丝[', '').replace(']','')
                        if len(num_tweets) > 0:
                            informationItem["Num_Tweets"] = int(num_tweets)
                        if num_follows:
                            informationItem["Num_Follows"] = int(num_follows)
                        if num_fans:
                            informationItem["Num_Fans"] = int(num_fans)
            except Exception as e:
                print e
        except Exception as e:
            pass
        else:
            yield informationItem
        if informationItem["Num_Tweets"] and informationItem["Num_Tweets"] < 5000:
            yield Request(url="https://weibo.cn/%s/profile?filter=1&page=1" % ID, callback=self.parse_tweets,
                          dont_filter=True)
        if informationItem["Num_Follows"] and informationItem["Num_Follows"] < 500:
            yield Request(url="https://weibo.cn/%s/follow" % ID, callback=self.parse_relationship, dont_filter=True)
        if informationItem["Num_Fans"] and informationItem["Num_Fans"] < 500:
            yield Request(url="https://weibo.cn/%s/fans" % ID, callback=self.parse_relationship, dont_filter=True)
    def parse_tweets(self, response):
        """ 抓取微博数据 """
        selector = Selector(response)
        ID = re.findall('(\d+)/profile', response.url)[0]
        divs = selector.xpath('body/div[@class="c" and @id]')
        for div in divs:
            try:
                tweetsItems = TweetsItem()
                id = div.xpath('@id').extract_first()  # post ID
                content = div.xpath('div/span[@class="ctt"]//text()').extract()  # post content
                cooridinates = div.xpath('div/a/@href').extract()  # geo coordinates
                like = re.findall('赞\[(\d+)\]', div.extract())  # like count
                transfer = re.findall('转发\[(\d+)\]', div.extract())  # repost count
                comment = re.findall('评论\[(\d+)\]', div.extract())  # comment count
                others = div.xpath('div/span[@class="ct"]/text()').extract()  # publish time and client (phone or platform)
                tweetsItems["_id"] = ID + "-" + id
                tweetsItems["ID"] = ID
                if content:
                    tweetsItems["Content"] = " ".join(content).strip('[位置]')  # 去掉最后的"[位置]"
                if cooridinates:
                    cooridinates = re.findall('center=([\d.,]+)', cooridinates[0])
                    if cooridinates:
                        tweetsItems["Co_oridinates"] = cooridinates[0]
                if like:
                    tweetsItems["Like"] = int(like[0])
                if transfer:
                    tweetsItems["Transfer"] = int(transfer[0])
                if comment:
                    tweetsItems["Comment"] = int(comment[0])
                if others:
                    others = others[0].split('来自')
                    tweetsItems["PubTime"] = others[0].replace(u"\xa0", "")
                    if len(others) == 2:
                        tweetsItems["Tools"] = others[1].replace(u"\xa0", "")
                filename = 'wb_'+time.strftime('%Y%m%d%H%M%S')+'_'+rand_num()+'.txt'
                tweetsItems["filepath"] = self.filepath + filename
                yield tweetsItems
            except Exception as e:
                self.logger.info(e)
        next_page = '下页'.decode('utf-8')
        url_next = selector.xpath('body/div[@class="pa" and @id="pagelist"]/form/div/a[text()="%s"]/@href' % next_page).extract()
        if url_next:
            yield Request(url=self.host + url_next[0], callback=self.parse_tweets, dont_filter=True)
    def parse_relationship(self, response):
        """ 打开url爬取里面的个人ID """
        selector = Selector(response)
        if "/follow" in response.url:
            ID = re.findall('(\d+)/follow', response.url)[0]
            flag = True
        else:
            ID = re.findall('(\d+)/fans', response.url)[0]
            flag = False
        he = "关注他".decode('utf-8')
        she = "关注她".decode('utf-8')
        urls = selector.xpath('//a[text()="%s" or text()="%s"]/@href' % (he, she)).extract()
        uids = re.findall('uid=(\d+)', ";".join(urls), re.S)
        for uid in uids:
            relationshipsItem = RelationshipsItem()
            relationshipsItem["fan_id"] = ID if flag else uid
            relationshipsItem["followed_id"] = uid if flag else ID
            yield relationshipsItem
            yield Request(url="https://weibo.cn/%s/info" % uid, callback=self.parse_information)
        next_page = '下页'.decode('utf-8')
        next_url = selector.xpath('//a[text()="%s"]/@href' % next_page).extract()
        if next_url:
            yield Request(url=self.host + next_url[0], callback=self.parse_relationship, dont_filter=True)
4. The Scrapy middlewares code is as follows:

# encoding: utf-8
import random
from sina.cookies import cookies, init_cookies
from sina.user_agents import agents
class UserAgentMiddleware(object):
    """ 换User-Agent """
    def process_request(self, request, spider):
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent
class CookiesMiddleware(object):
    """ 换Cookie """
    def __init__(self):
        init_cookies()
    def process_request(self, request, spider):
        cookie = random.choice(cookies)
        request.cookies = cookie

This middleware takes the cookies saved in MongoDB and attaches one of them to each request inside Scrapy's downloader middleware, so every request is sent with a logged-in cookie and can fetch data.

5. Saving data with Scrapy pipelines:

As we all know, pipelines are the place where scraped data is cleaned and persisted.

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from sina.items import RelationshipsItem, TweetsItem, InformationItem
import time
import random
import json
class MongoDBPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("your_server_ip", 27017)  # fill in your MongoDB server address
        db = client["Sina"]
        self.Information = db["Information"]
        self.Tweets = db["Tweets"]
        self.Relationships = db["Relationships"]
    def process_item(self, item, spider):
        """ 判断item的类型,并作相应的处理,再入数据库 """
        if isinstance(item, RelationshipsItem):
            try:
                self.Relationships.insert(dict(item))
            except Exception:
                pass
        elif isinstance(item, TweetsItem):
            try:
                self.Tweets.insert(dict(item))
                filename = item['filepath']
                lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
                with open(filename, 'w') as f:
                    f.write(lines)
            except Exception as e:
                print e
        elif isinstance(item, InformationItem):
            try:
                self.Information.insert(dict(item))
            except Exception:
                pass
        return item

6. For stable crawling we also need a pool of User-Agent strings so that each request can masquerade as a different browser. It is implemented as follows:

#!/usr/bin/env python
# encoding: utf-8
""" User-Agents """
agents = [
    "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
    "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
    "Mozilla/2.02E (Win95; U)",
    "Mozilla/3.01Gold (Win95; I)",
    "Mozilla/4.8 [en] (Windows NT 5.1; U)",
    "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
    "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3",
    "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1"]

The middlewares module imports this agents list, and in Scrapy's process_request() each outgoing request is sent with one of these User-Agent headers attached.
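
To wire everything together, the middlewares and the pipeline still have to be registered in the project settings. A sketch of what that might look like, assuming the project package is named sina with the usual module names (middlewares.py, pipelines.py); the priorities and the delay are only example values:

# settings.py -- registering the custom middlewares and pipeline
DOWNLOADER_MIDDLEWARES = {
    'sina.middlewares.UserAgentMiddleware': 401,   # rotate User-Agent headers
    'sina.middlewares.CookiesMiddleware': 402,     # attach a random logged-in cookie
}
ITEM_PIPELINES = {
    'sina.pipelines.MongoDBPipeline': 300,         # write items to MongoDB / text files
}
DOWNLOAD_DELAY = 3                                 # a larger delay helps avoid the 302 redirects mentioned earlier

With the settings in place, the crawl can be started with "scrapy crawl SinaSpider" (the name defined in the spider).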

That covers just about everything. Thanks for reading! If you have questions, leave a comment and I will reply in detail. (This is my first post on Jianshu, so please bear with me.) Follow-up posts will cover how to scrape Zhihu, Toutiao, and other sites!
