Crawler Architecture

I. Crawler scheduler (starts, stops, and monitors the crawler)
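The scheduler's job can be sketched as a loop that drives the other three components covered in the sections that follow (URL manager, downloader, parser). This is only a minimal sketch: the function name crawl, the max_pages limit, and the convention that download returns None on failure and parse returns (new_urls, data) are assumptions made for illustration, not something the notes prescribe.

```python
from collections import deque


def crawl(root_url, download, parse, max_pages=100):
    """Run the crawl loop; download and parse are supplied by the caller."""
    pending = deque([root_url])   # URLs waiting to be crawled
    seen = {root_url}             # every URL ever queued, so nothing is queued twice
    results = []
    while pending and len(results) < max_pages:
        url = pending.popleft()
        html = download(url)
        if html is None:          # assumed convention: None means the download failed
            continue
        new_urls, data = parse(url, html)
        results.append(data)
        for new_url in new_urls:
            if new_url not in seen:
                seen.add(new_url)
                pending.append(new_url)
    return results
```

Passing download and parse in as arguments keeps the loop independent of any particular library, so it can be exercised with stub functions before a real downloader exists.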

Note: several standard-library modules were renamed between Python 2 and Python 3:

Python 3	Python 2
urllib.request	urllib and urllib2
urllib.parse	urlparse
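As a small illustration of the Python 3 names in the table, the sketch below uses urllib.parse to split a URL and to resolve a relative link. The example.com URLs are placeholders chosen for the example.

```python
# Python 3 import; under Python 2 the same functions lived in the urlparse module
from urllib.parse import urljoin, urlparse

parts = urlparse('http://example.com/view/123.htm?a=1')
print(parts.netloc, parts.path, parts.query)

# Resolve a relative link against the page it was found on
absolute = urljoin('http://example.com/view/123.htm', '456.htm')
print(absolute)
```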

II. URL manager (manages the set of URLs waiting to be crawled and the set of URLs already crawled)

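Purpose: prevent crawling the same page twice and prevent crawling in circles
Functions it needs to provide:
1. Add a new URL to the to-crawl set
2. Check whether a URL about to be added is already in the container
3. Fetch a URL to crawl
4. Check whether any URLs are still waiting to be crawled
5. Move a URL from the to-crawl set to the crawled set
Ways to implement it:
1. In memory. With Python, the URL collections can be stored directly in set().
2. In a relational database such as MySQL, using a single table url(url, is_crawled) to hold both collections.
3. In a cache database such as redis, storing the two collections in two sets.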
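The in-memory option could look roughly like the sketch below, which covers the five functions listed above with two Python sets. The class and method names are assumptions made for this example.

```python
class UrlManager:
    """Keeps to-crawl and crawled URLs in two in-memory sets."""

    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already handed out for crawling

    def add_new_url(self, url):
        # Functions 1 and 2: add only if the URL is in neither set
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        # Function 4: is anything still waiting?
        return bool(self.new_urls)

    def get_new_url(self):
        # Functions 3 and 5: hand out one URL and mark it as crawled
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Because set.pop() returns an arbitrary element, this sketch does not guarantee any crawl order; a queue would be needed if order matters.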

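III. Web page downloader (downloads the page a URL points to as HTML, saves it as a local file or passes it to the web page parser as a string, and reports the crawled URL back to the URL manager)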

Note: in Python 3, urllib.request replaces urllib2, and http.cookiejar replaces cookielib.

Downloaders
urllib2: Python's official built-in module (urllib.request in Python 3)
requests: a third-party package with more features
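urllib2 download method 1: pass the URL directly to urlopen(url)

import urllib.request  # Python 3 name; in Python 2 this was urllib2
# Request the page directly
response = urllib.request.urlopen('http://www.baidu.com')
# Get the status code; 200 means the request succeeded
print(response.getcode())
# Read the page content (returned as bytes)
cont = response.read()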
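In a crawler, method 1 is usually wrapped in a function that does not crash on a bad URL. The following is a minimal sketch under some assumptions: the function name download, the 10-second timeout, the User-Agent value, and the choice to return None on any failure are illustrative rather than part of the original notes.

```python
import urllib.request


def download(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            if response.getcode() != 200:
                return None
            return response.read()
    except OSError:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError
        return None
```

A call such as download('http://www.baidu.com') would then return either the raw bytes of the page or None, which the caller can check before handing the content to the parser.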

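urllib2 download method 2: add data and an HTTP header

Pass the URL, data, and headers to the Request class, then pass the resulting request object to urlopen.

import urllib.parse
import urllib.request  # Python 3 name; in Python 2 this was urllib2

url = 'http://www.baidu.com'
# Create the Request object
request = urllib.request.Request(url)
# Attach data to submit to the server (this makes the request a POST);
# Python 2 used request.add_data(...), which no longer exists in Python 3
request.data = urllib.parse.urlencode({'a': '1'}).encode('utf-8')
# Add an HTTP header to send with the request
request.add_header('User-Agent', 'Mozilla/5.0')
# Send the request and get the result
response = urllib.request.urlopen(request)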
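To see what method 2 prepares without sending anything over the network, a Request object can be built and inspected locally. This is a sketch using the Python 3 urllib.request API; the example.com URL and the form fields are made up for illustration.

```python
import urllib.parse
import urllib.request

form = {'a': '1', 'q': 'web crawler'}            # hypothetical form fields
body = urllib.parse.urlencode(form).encode('utf-8')

request = urllib.request.Request(
    'http://example.com/form',                   # placeholder URL
    data=body,
    headers={'User-Agent': 'Mozilla/5.0'},
)

print(request.get_method())              # POST, because data is set
print(request.data)                      # the URL-encoded body as bytes
print(request.get_header('User-agent'))  # Request stores header names capitalized
```

Nothing is transmitted until the request is passed to urlopen, so this is a convenient way to check the body and headers first.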

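urllib2 download method 3: add handlers for special situations
HTTPCookieProcessor: for pages that can only be accessed after logging in
ProxyHandler: for pages that must be accessed through a proxy
HTTPSHandler: for pages served over encrypted HTTPS
HTTPRedirectHandler: for pages whose URLs redirect automatically
Pass these handlers to build_opener(), which returns an opener object, then register that opener with install_opener(opener). After that, urlopen(url) and urlopen(request) download pages through the installed handlers.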

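import http.cookiejar  # Python 3 name; in Python 2 this was cookielib
import urllib.request  # Python 3 name; in Python 2 this was urllib2
# Create a cookie container to store cookie data
cj = http.cookiejar.CookieJar()
# Create an opener that handles cookies
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
# Install the opener so that urlopen uses it
urllib.request.install_opener(opener)
# Access the page with cookie support enabled
response = urllib.request.urlopen('http://www.baidu.com/')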
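Several handlers can be combined in one opener. The sketch below, written against the Python 3 urllib.request API, builds an opener with both a proxy handler and a cookie handler and lists the handlers it ends up with; no request is sent. The proxy address 127.0.0.1:8080 is a hypothetical placeholder.

```python
import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8080'}),  # hypothetical proxy
    urllib.request.HTTPCookieProcessor(cookie_jar),
)

# build_opener adds the default handlers (redirects, errors, and so on) automatically
handler_names = [type(handler).__name__ for handler in opener.handlers]
print(handler_names)
```

Calling opener.open(url) uses these handlers for that call only, which is an alternative to install_opener when the global urlopen behaviour should stay unchanged.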

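IV. Web page parser

A tool for extracting valuable data from web pages.

It takes the downloaded HTML page string as input and outputs the valuable data together with a list of new URLs to crawl.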

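Types of web page parsers

Regular expressions: fuzzy matching that treats the page as a plain string
html.parser (module bundled with Python)
Beautiful Soup (third-party library; can use html.parser or lxml as its underlying parser)
lxml (third-party library)
The last three use structured parsing.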
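The difference between the two approaches can be sketched with the standard library alone: a regular expression scans the raw string, while html.parser reports tags and attributes as structured events. The sample HTML and the class name LinkCollector are made up for this example.

```python
import re
from html.parser import HTMLParser

html = ('<p>See <a href="/view/1.htm">one</a> and '
        '<a class="x" href="/view/2.htm">two</a>.</p>')

# Fuzzy matching: look for anything that resembles href="..." in the raw text
regex_links = re.findall(r'href="([^"]+)"', html)


class LinkCollector(HTMLParser):
    """Structured parsing: collect href values from <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(html)

print(regex_links)
print(collector.links)
```

On this small sample both give the same links, but the regular expression would also match href="..." text inside comments or other tags, which the structured parser distinguishes.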

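Structured parsing: the DOM (Document Object Model) tree

The web document is parsed into a DOM tree, which is then accessed and traversed level by level, from parent nodes down to child nodes.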

[Figure: DOM tree]
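The parent-to-child traversal can be illustrated with the standard library's xml.dom.minidom. This is a sketch under one important assumption: minidom is an XML parser, so it only accepts well-formed markup like the small hand-written document below, and real-world HTML usually needs one of the parsers listed earlier. The helper name outline is made up for the example.

```python
from xml.dom import minidom


def outline(node, depth=0):
    """Return one indented line per element, walking the tree from the top down."""
    lines = []
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:   # skip text nodes
            lines.append('  ' * depth + child.tagName)
            lines.extend(outline(child, depth + 1))
    return lines


doc = minidom.parseString(
    '<html><head><title>Demo</title></head>'
    '<body><div><a href="1.html">Python</a></div></body></html>'
)
print('\n'.join(outline(doc)))
```

Each level of indentation in the printed outline corresponds to one level of the tree: html contains head and body, body contains div, and div contains the a element.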

Web page parser: Beautiful Soup

A third-party Python library for extracting data from HTML or XML.

BeautifulSoup syntax

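Creating a BeautifulSoup object

from bs4 import BeautifulSoup
# Create a BeautifulSoup object from the HTML page
soup = BeautifulSoup(
    html_doc,               # the HTML document (str or bytes)
    'html.parser',          # the HTML parser to use
    from_encoding='utf8'    # document encoding; only applies when html_doc is bytes
)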

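Searching for nodes (find_all, find)

import re  # needed for the regular-expression search below

# Signature: find_all(name, attrs, string)
# Find all nodes whose tag is a
soup.find_all('a')

# Find all a nodes whose link is exactly /view/123.htm, or matches the pattern /view/<digits>.htm
soup.find_all('a', href='/view/123.htm')
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))

# Find all div nodes whose class is abc and whose text is Python
soup.find_all('div', class_='abc', string='Python')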

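Accessing node information

# Suppose node is the node that was found: <a href='1.html'>Python</a>

# Get the tag name of the node
node.name

# Get the href attribute of the a node
node['href']

# Get the link text of the a node
node.get_text()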

Example

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
from bs4 import BeautifulSoup
import re

# html_doc is already a str, so from_encoding is not needed (bs4 would ignore it with a warning)
soup = BeautifulSoup(html_doc, 'html.parser')

print('Get all links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())


print('Get the link for lacie')
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())


print('Regular-expression match')
link_node = soup.find('a', href=re.compile(r'ill'))
print(link_node.name, link_node['href'], link_node.get_text())

print('Get the text of the p paragraph')
p_node = soup.find('p', class_='title')
print(p_node.name, p_node.get_text())
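Putting the calls above together, the parser's contract described at the start of this section (HTML string in, valuable data and new URLs out) might be sketched as follows. The function name parse, the choice of the page title as the "valuable data", and the returned dictionary layout are assumptions made for illustration; the sketch requires the third-party bs4 package.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse(page_url, html):
    """Return (new_urls, data) extracted from one downloaded page."""
    soup = BeautifulSoup(html, 'html.parser')

    # New URLs: every <a> that has an href, resolved against the page URL
    new_urls = set()
    for link in soup.find_all('a', href=True):
        new_urls.add(urljoin(page_url, link['href']))

    # Valuable data: here simply the page title, if there is one
    title_node = soup.find('title')
    data = {
        'url': page_url,
        'title': title_node.get_text() if title_node else None,
    }
    return new_urls, data
```

Resolving links with urljoin matters because pages often contain relative links such as /view/123.htm, which cannot be downloaded until they are turned into absolute URLs.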
