Preparing to Crawl

Background research:

Before crawling a website, first get a sense of the size and structure of the target site. The site's own robots.txt and Sitemap files can both help.
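As a rough illustration of how a Sitemap can help, the sketch below pulls the page URLs out of a sitemap document using only the standard library. The sample XML and the helper name `sitemap_urls` are assumptions made for this example; a real sitemap would first have to be downloaded from the target site.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Hypothetical sitemap content, inlined so the example runs offline
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/view/1</loc></url>
  <url><loc>http://example.com/view/2</loc></url>
</urlset>
"""

def sitemap_urls(sitemap_xml):
    """Return the URLs listed in the <loc> elements of a sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip()
            for loc in root.findall('.//sm:loc', SITEMAP_NS)
            if loc.text]

print(sitemap_urls(SAMPLE_SITEMAP))
# ['http://example.com/view/1', 'http://example.com/view/2']
```

The number of entries gives a quick estimate of the site's size, and the list itself can seed the crawl.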

1. Checking robots.txt

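Most websites define a robots.txt file that tells crawlers which restrictions apply when crawling the site. These restrictions are only advisory, but a good web citizen should follow them. Checking robots.txt also reduces the chance of your crawler being blocked.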
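One way to run this check is the standard library's urllib.robotparser module. The sketch below parses a made-up robots.txt (the rules and the crawler names "BadCrawler" and "wswp" are assumptions for illustration) and asks whether a given user agent may fetch a URL.

```python
from urllib import robotparser

# Hypothetical robots.txt content, inlined so the example runs offline
ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch('BadCrawler', 'http://example.com/'))   # False
print(rp.can_fetch('wswp', 'http://example.com/'))         # True
print(rp.can_fetch('wswp', 'http://example.com/trap'))     # False
```

Against a live site you would call rp.set_url() with the address of the site's robots.txt and then rp.read() instead of parse(), and check can_fetch() before each download.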

2. Identifying the technology a site uses

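The technology a site is built with also affects how we crawl it. The third-party builtwith package takes a URL, downloads and analyzes the page, and returns the technologies the site uses.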

pip install builtwith

Example:

import builtwith

html_1 = builtwith.parse('https://www.jianshu.com/')

print(html_1)

3. Finding the site's owner

For some sites you may want to know who the owner is. The third-party python-whois package can look up the WHOIS record for the domain.

pip install python-whois

import whois

print(whois.whois('https://www.jianshu.com/'))

4. Writing the first web crawler

To scrape a website, we first need to download the pages that contain the information we want. This process is called crawling.

import urllib.request

def download_url(url):
    # Fetch the page and return its content as raw bytes
    return urllib.request.urlopen(url).read()

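Given a URL, this function downloads the page and returns its HTML. It can fail, though: if the requested page does not exist, for example, urllib raises an exception and the script exits. To be safe, we make it more robust.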

A more robust version:

import urllib.request
import urllib.error

def download_url(url):
    print('Download', url)
    try:
        html = urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:
        # URLError also covers HTTPError (e.g. 404), so failed requests land here
        print('download error', e.reason)
        html = None
    return html

Now, if a download fails, the function catches the exception and returns None.

5. Retrying downloads

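Download errors are often temporary, such as a 503 returned by an overloaded server, and in those cases it is worth retrying the request. If the server returns 404 Not Found, however, retrying will not help. So we retry only on 5xx errors: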

import urllib.request
import urllib.error

def download(url, num_retries=2):
    print('download', url)
    try:
        html = urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:
        print('download error', e.reason)
        html = None
        if num_retries > 0:
            # Only HTTP errors carry a status code; retry on 5xx server errors
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1)
    return html

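Now, when the function hits a 5xx status code, it calls itself recursively to retry. The num_retries parameter limits the number of retries and defaults to 2.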
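The hasattr(e, 'code') check is there because only HTTP errors carry a status code: urllib.error.HTTPError is a subclass of URLError with a code attribute, while a plain URLError (for example, a refused connection) has none. The sketch below isolates that decision in a hypothetical helper, is_retryable, and builds the exceptions by hand so that it runs without network access.

```python
import io
import urllib.error
from email.message import Message

def is_retryable(exc):
    """Hypothetical helper: True only for HTTP 5xx (server-side) errors."""
    code = getattr(exc, 'code', None)
    return code is not None and 500 <= code < 600

# Exceptions constructed manually for illustration; urlopen would raise these
overloaded = urllib.error.HTTPError(
    'http://example.com/', 503, 'Service Unavailable', Message(), io.BytesIO())
missing = urllib.error.HTTPError(
    'http://example.com/', 404, 'Not Found', Message(), io.BytesIO())
unreachable = urllib.error.URLError('connection refused')

print(isinstance(overloaded, urllib.error.URLError))  # True
print(is_retryable(overloaded))   # True
print(is_retryable(missing))      # False
print(is_retryable(unreachable))  # False
```

Because HTTPError inherits from URLError, a single except urllib.error.URLError clause catches both kinds of failure.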

6. Setting the user agent

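To make downloads more reliable, we can control the user agent. By default urllib identifies itself as Python-urllib/3.x, which some sites block. We modify the download function once more to set a user agent, defaulting to "wswp" (short for "Web Scraping with Python").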

import urllib.request
import urllib.error

def download(url, user_agent='wswp', num_retries=2):
    print('download', url)
    # The HTTP header name is User-Agent (with a hyphen)
    headers = {'User-Agent': user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print('download error', e.reason)
        html = None
        if num_retries > 0:
            # Only HTTP errors carry a status code; retry on 5xx server errors
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, num_retries - 1)
    return html

We now have a flexible download function that catches exceptions, retries failed requests, and sets the user agent.
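A quick way to confirm that the user agent is really set is to inspect the Request object before sending it. The sketch below is only an illustration (the helper name build_request and the example.com URL are assumptions). It also shows why the header must be spelled User-Agent: a key such as user_agent is stored as a different, unrelated header, so urllib would still send its default user agent.

```python
import urllib.request

def build_request(url, user_agent='wswp'):
    """Hypothetical helper: build a Request carrying a custom User-Agent."""
    headers = {'User-Agent': user_agent}
    return urllib.request.Request(url, headers=headers)

request = build_request('http://example.com/')
# urllib stores header names capitalized as 'User-agent'
print(request.get_header('User-agent'))  # wswp

# A misspelled key does not set the User-Agent header at all
wrong = urllib.request.Request('http://example.com/',
                               headers={'user_agent': 'wswp'})
print(wrong.has_header('User-agent'))  # False
```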