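Web Crawler Development in Practice: Study Notes

Using the Basic Libraries

The urllib library

The urllib library contains four modules:
· request: the request module, used to simulate sending requests.
· error: the exception-handling module.
· parse: a utility module that provides URL-processing methods.
· robotparser: parses a site's robots.txt file.

Sending requests

1. urlopen(): the basic method for sending an HTTP request.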

import urllib.request

response = urllib.request.urlopen('http://www.python.org')
print(response.read().decode('utf-8'))  # print the page's source code

Use type() to print the type of the response:

import urllib.request
response = urllib.request.urlopen('http://www.python.org')
print(type(response))

Result: <class 'http.client.HTTPResponse'>
The response is an HTTPResponse object. Its main methods are read(), readinto(), getheader(name), getheaders(), and fileno(); its main attributes are msg, version, status, reason, debuglevel, and closed.
Example:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur'), ('Via', '1.1 varnish'), ('Content-Length', '48425'), ('Accept-Ranges', 'bytes'), ('Date', 'Thu, 15 Aug 2019 02:32:21 GMT'), ('Via', '1.1 varnish'), ('Age', '1947'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2151-IAD, cache-hkg17932-HKG'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '1, 1030'), ('X-Timer', 'S1565836342.629250,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx

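The output shows the status code, the response headers, and the value of the Server header (nginx means the server is built on Nginx).

Passing parameters to the URL: the urlopen() API

urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)

· The data parameter

The data parameter is optional. If you supply it, it must first be converted to the bytes type, for example with bytes(). In addition, once this parameter is passed, the request method is no longer GET but POST.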

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read())

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "word": "hello"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7"
  },
  "json": null,
  "origin": "112.42.28.190, 112.42.28.190",
  "url": "https://httpbin.org/post"
}

The submitted parameter appears in the form field, which shows that the request simulated a form submission and sent the data by POST.
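
The switch from GET to POST can be checked without sending anything. The following is a minimal sketch using only the standard library; get_method() reports the method urllib would use for each request:

```python
from urllib import parse, request

# Encode the form fields and convert them to bytes, as urlopen() requires
payload = parse.urlencode({'word': 'hello'}).encode('utf-8')

get_req = request.Request('http://httpbin.org/get')
post_req = request.Request('http://httpbin.org/post', data=payload)

print(payload)                # b'word=hello'
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```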

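· The timeout parameter

The timeout parameter sets the timeout in seconds. If the request exceeds this time without receiving a response, an exception is raised.

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

socket.timeout: timed out

If more than 1 second passes and the server still has not responded, a URLError exception is raised whose reason is socket.timeout.
Use a try/except statement to skip a page when the request times out: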

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

TIME OUT
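
The isinstance() check can be wrapped in a small helper. This is only a sketch: is_timeout is a hypothetical name, not part of urllib, and it also accepts a bare socket.timeout on the assumption that a timeout may surface without being wrapped in URLError. The exceptions are built by hand so the snippet runs without a network connection:

```python
import socket
from urllib.error import URLError

def is_timeout(exc):
    """Return True if exc represents a timed-out request."""
    if isinstance(exc, socket.timeout):
        return True
    return isinstance(exc, URLError) and isinstance(exc.reason, socket.timeout)

print(is_timeout(URLError(socket.timeout('timed out'))))  # True
print(is_timeout(URLError('connection refused')))         # False
print(is_timeout(socket.timeout('timed out')))            # True
```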

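· Other parameters

context: must be of type ssl.SSLContext and specifies the SSL settings.
cafile specifies a CA certificate file and capath specifies the path to a directory of CA certificates.
cadefault is deprecated; its default value is False.

2. Request

If a request needs extra information such as headers, build it with the more powerful Request class.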

import urllib.request

request = urllib.request.Request('http://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

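The request is still sent with urlopen(), but this time the argument is a Request object. Separating the request into its own object makes its parameters easier to configure.
Constructor:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

· url: the request URL; required.
· data: must be of type bytes.
· headers: a dictionary of request headers. It can be built directly through the headers parameter or added by calling the instance's add_header() method.
· origin_req_host: the host name or IP address of the requester.
· unverifiable: whether the request is unverifiable; the default is False.
· method: a string indicating the request method, such as GET, POST, or PUT.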

from urllib import request,parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Named form_data rather than dict to avoid shadowing the built-in dict type
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

# Output:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
  },
  "json": null,
  "origin": "112.42.28.190, 112.42.28.190",
  "url": "https://httpbin.org/post"
}

Here url is the request URL, headers specifies User-Agent and Host, the data argument is converted to bytes with urlencode() and bytes(), and the request method is set to POST.
Headers can also be added with the add_header() method:

req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
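
One detail worth knowing when reading headers back from a Request: urllib stores header names in capitalized form, so User-Agent becomes User-agent. A short sketch that needs no network access (the data value is just a placeholder):

```python
from urllib import request

req = request.Request('http://httpbin.org/post', data=b'name=Germey', method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

# Header names are stored via str.capitalize(), so the key is 'User-agent'
print(req.has_header('User-agent'))  # True
print(req.has_header('User-Agent'))  # False
print(req.get_header('User-agent'))
print(req.header_items())
```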

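3. Advanced usage

Handlers are processors of various kinds: some handle login authentication, some handle Cookies, and some handle proxy settings.
The BaseHandler class in the urllib.request module is the parent class of all other Handlers. It provides the most basic methods, such as default_open() and protocol_request().
Various Handler subclasses inherit from BaseHandler:
· HTTPDefaultErrorHandler: handles HTTP response errors.
· HTTPRedirectHandler: handles redirects.
· HTTPCookieProcessor: handles Cookies.
· ProxyHandler: sets a proxy; by default the proxy is empty.
· HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords.
· HTTPBasicAuthHandler: manages authentication.

The OpenerDirector class (Opener)

An Opener provides an open() method whose return type is the same as that of urlopen(). Openers are built from Handlers.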
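
As a rough illustration of how Handlers and an Opener fit together, the sketch below builds an Opener from one Handler and lists the Handlers it ends up with. Nothing is sent over the network, and the User-Agent value is only a placeholder:

```python
import http.cookiejar
from urllib.request import HTTPCookieProcessor, build_opener

jar = http.cookiejar.CookieJar()
opener = build_opener(HTTPCookieProcessor(jar))

# Optional: default headers sent with every request made through this opener
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

# build_opener() keeps the default handlers and adds the ones passed in
handler_names = sorted(type(h).__name__ for h in opener.handlers)
print(handler_names)
```

The printed list contains the custom HTTPCookieProcessor alongside defaults such as HTTPRedirectHandler.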

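· Authentication

Some sites require a user name and password, and show the page only after authentication succeeds.
HTTPBasicAuthHandler can be used to request such pages: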

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

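First an HTTPBasicAuthHandler object is instantiated with an HTTPPasswordMgrWithDefaultRealm object as its argument; the user name and password are added to that object with add_password(). This creates a Handler that takes care of authentication.
Then build_opener() builds an Opener from this Handler, and the Opener's open() method opens the link and completes the authentication.

· Proxies

Adding a proxy:

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743'
})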
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

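Here a proxy has been set up locally, running on port 9743.
ProxyHandler takes a dictionary whose keys are protocol types (such as http or https) and whose values are proxy URLs; several proxies can be added.
This Handler is then passed to build_opener() to construct an Opener, which sends the request.

· Cookies

Getting a site's Cookies: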

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

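First a CookieJar object is created. Then HTTPCookieProcessor builds a Handler from it. Finally build_opener() builds an Opener, and calling its open() method sends the request.
Cookies can also be saved to a file.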
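
One way to save Cookies to a file is MozillaCookieJar from the standard library. The sketch below is offline: it puts a hand-made cookie into the jar instead of receiving one from a server, and the cookie's name, value, and domain are made up for illustration:

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar

path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')

jar = MozillaCookieJar(path)
# A hand-made session cookie standing in for one received from a server
jar.set_cookie(Cookie(
    version=0, name='number', value='12345', port=None, port_specified=False,
    domain='httpbin.org', domain_specified=False, domain_initial_dot=False,
    path='/', path_specified=True, secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
))
# ignore_discard / ignore_expires keep session and expired cookies in the file
jar.save(ignore_discard=True, ignore_expires=True)

restored = MozillaCookieJar()
restored.load(path, ignore_discard=True, ignore_expires=True)
pairs = [(c.name, c.value) for c in restored]
print(pairs)  # [('number', '12345')]
```

In a real crawler the jar would be passed to HTTPCookieProcessor, so that it is filled by the server's responses before save() is called.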

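Handling exceptions

1. URLError

The URLError class comes from the error module of urllib. It inherits from OSError and is the base class of the error module; any exception produced by the request module can be handled by catching it.
Attribute reason: returns the cause of the error.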

from urllib import request, error
try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

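The code opens a page that does not exist. Normally this would abort the program with an error, but because URLError is caught, the program only prints the reason: Not Found

2. HTTPError

HTTPError is a subclass of URLError that deals specifically with HTTP request errors, such as a failed request. Attributes:
· code: the HTTP status code.
· reason: the cause of the error, as in the parent class.
· headers: the response headers.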

from urllib import request,error

try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')

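Here the HTTPError exception is caught and its reason, code, and headers attributes are printed.
Because URLError is the parent class of HTTPError, you can catch the subclass error first and then the parent class error: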

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

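HTTPError is caught first to obtain its status code, reason, headers, and so on. If the exception is not an HTTPError, the URLError clause catches it and prints the reason. Finally, else handles the normal case.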
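
The subclass-before-parent order can be tried out offline by constructing the exceptions by hand. In this sketch, describe is a hypothetical helper and the example.com URL is a placeholder:

```python
import io
from urllib.error import HTTPError, URLError

def describe(exc):
    """Mimic the except-clause order: subclass first, then parent class."""
    if isinstance(exc, HTTPError):
        return 'HTTPError {}: {}'.format(exc.code, exc.reason)
    if isinstance(exc, URLError):
        return 'URLError: {}'.format(exc.reason)
    return 'other error'

not_found = HTTPError('http://example.com/missing', 404, 'Not Found', {}, io.BytesIO())
unreachable = URLError('connection refused')

print(issubclass(HTTPError, URLError))  # True
print(describe(not_found))              # HTTPError 404: Not Found
print(describe(unreachable))            # URLError: connection refused
```

If the two checks were swapped, the HTTPError would be reported by the URLError branch and its status code would be lost.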

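Sometimes the reason attribute is not a string but an object, for example a socket.timeout instance when a request times out.

Parsing links

The urllib library provides the parse module, which defines a standard interface for handling URLs, such as extracting and combining the parts of a URL and converting links.

1. urlparse()

This method identifies a URL and splits it into parts: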

from urllib.parse import urlparse

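result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

The result is a ParseResult object with six parts: scheme, netloc, path, params, query, and fragment:
The part before :// is the scheme, which is the protocol;
the part before the first / is the netloc, which is the domain name;
after it comes the path, which is the access path;
after the ; come the params, which are the parameters;
after the ? comes the query, which is generally used in GET-type URLs;
after the # comes the fragment, an anchor used to jump directly to a position inside the page.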
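
ParseResult is a named tuple, so its parts can be read by attribute name, by index, or by unpacking. A brief sketch:

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

print(result.scheme, result[0])   # http http
print(result.netloc, result[1])   # www.baidu.com www.baidu.com

# Unpack all six parts at once
scheme, netloc, path, params, query, fragment = result
print(path, params, query, fragment)  # /index.html user id=5 comment
```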

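API usage:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Three parameters:
· urlstring: required; the URL to parse.
· scheme: the default protocol (such as http or https). It is used only when the URL itself carries no scheme; if the URL contains one, the scheme parameter is ignored.
· allow_fragments: whether to parse the fragment. If it is set to False, the fragment is ignored and is parsed as part of the path, params, or query instead.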
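
The effect of the scheme and allow_fragments parameters can be seen in a short sketch:

```python
from urllib.parse import urlparse

# No scheme in the URL: the default is applied (and netloc stays empty)
no_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(no_scheme.scheme, repr(no_scheme.netloc), no_scheme.path)
# https '' www.baidu.com/index.html

# The URL already has a scheme: the scheme argument is ignored
has_scheme = urlparse('http://www.baidu.com/index.html', scheme='https')
print(has_scheme.scheme)  # http

# allow_fragments=False: the fragment stays attached to the query
no_frag = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(no_frag.query, repr(no_frag.fragment))  # id=5#comment ''
```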

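2. urlunparse()

The counterpart of urlparse(); it assembles the parts into a URL.
The argument it accepts must have exactly 6 elements, otherwise an error about too few or too many arguments is raised: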

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

http://www.baidu.com/index.html;user?a=6#comment

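3. urlsplit()

Similar to urlparse(), except that params is not parsed separately; it is merged into path.

4. urlunsplit()

Similar to urlunparse(); it combines the parts of a link into a complete link. The argument must be an iterable (a list, a tuple, and so on) of exactly 5 elements.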
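
A short sketch of both functions, reusing the URL from the urlparse() example:

```python
from urllib.parse import urlsplit, urlunsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)
# SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

# Five parts: scheme, netloc, path, query, fragment
url = urlunsplit(['http', 'www.baidu.com', 'index.html', 'a=6', 'comment'])
print(url)  # http://www.baidu.com/index.html?a=6#comment
```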

5. urljoin()

urljoin() takes base_url as its first argument and the new link as its second. It analyzes the scheme, netloc, and path of base_url, fills in whatever the new link is missing, and returns the result.

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))

http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
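
A few more cases show how the missing parts are filled in from the base URL. The base path /dir/page.html is made up for illustration:

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/dir/page.html'

print(urljoin(base, 'FAQ.html'))       # http://www.baidu.com/dir/FAQ.html
print(urljoin(base, '/FAQ.html'))      # http://www.baidu.com/FAQ.html
print(urljoin(base, '../FAQ.html'))    # http://www.baidu.com/FAQ.html
print(urljoin(base, '?category=2'))    # http://www.baidu.com/dir/page.html?category=2
print(urljoin(base, '//cuiqingcai.com/FAQ.html'))  # http://cuiqingcai.com/FAQ.html
```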

6. urlencode()

Serialization. urlencode() is very useful when building the parameters of a GET request:

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

http://www.baidu.com?name=germey&age=22

7. parse_qs()

Deserialization. Given a string of GET request parameters, parse_qs() converts it back into a dictionary:

from urllib.parse import parse_qs

query = 'name=germey&age=22'
print(parse_qs(query))

{'name': ['germey'], 'age': ['22']}

8. parse_qsl()

Converts the parameters into a list of tuples:

from urllib.parse import parse_qsl

query = 'name=germey&age=22'
print(parse_qsl(query))

[('name', 'germey'), ('age', '22')]

9. quote()

quote() converts content into URL-encoded form:

from urllib.parse import quote

keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

10. unquote()

URL decoding:

from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))

https://www.baidu.com/s?wd=壁纸
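
The serialization and encoding functions are inverses of one another, which a round trip makes visible. A small sketch (the page parameter is made up for illustration):

```python
from urllib.parse import parse_qs, quote, unquote, urlencode

query = urlencode({'wd': '壁纸', 'page': 1})
print(query)            # wd=%E5%A3%81%E7%BA%B8&page=1
print(parse_qs(query))  # {'wd': ['壁纸'], 'page': ['1']}

# quote() leaves '/' unescaped by default; pass safe='' to escape it too
print(quote('a b/c'))           # a%20b/c
print(quote('a b/c', safe=''))  # a%20b%2Fc
print(unquote('a%20b%2Fc'))     # a b/c
```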

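Analyzing the Robots protocol

1. The Robots protocol

Its full name is the Robots Exclusion Protocol.
When a search crawler visits a site, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler crawls within the scope defined there.
Common patterns:

Forbid all crawlers from accessing any directory:
User-agent: *
Disallow: /
Allow all crawlers to access any directory:
User-agent: *
Disallow:
Forbid all crawlers from accessing certain directories:
User-agent: *
Disallow: /private/
Disallow: /tmp/
Allow only one particular crawler:
User-agent: WebCrawler
Disallow:
User-agent: *
Disallow: /

2. Crawler names

Each crawler has a fixed name, for example Googlebot for Google and Baiduspider for Baidu.

3. robotparser

Use the robotparser module to parse robots.txt.
Declaration: urllib.robotparser.RobotFileParser(url='')
· set_url(): sets the link to the robots.txt file.
· read(): reads robots.txt and analyzes it.
· parse(): parses the lines of a robots.txt file.
· can_fetch(): takes two arguments, the User-agent and the URL to fetch, and returns True or False to indicate whether that URL may be fetched.
· mtime(): returns the time robots.txt was last fetched and analyzed.
· modified(): sets the time robots.txt was last fetched and analyzed to the current time.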

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()  # passing the URL to RobotFileParser('http://www.jianshu.com/robots.txt') would replace the set_url() call
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=1&type=collections'))

False
False
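
The rules can also be tested offline by handing the lines of a robots.txt file to parse() directly. The sketch below uses the "allow only one crawler" pattern; the example.com URL and the OtherBot name are placeholders:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # parse() takes an iterable of lines

url = 'http://example.com/private/page.html'
print(rp.can_fetch('WebCrawler', url))  # True
print(rp.can_fetch('OtherBot', url))    # False
```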

requests

urlopen() in urllib requests a page with the GET method; the counterpart in requests is get():

import requests

r = requests.get('http://www.baidu.com/')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text)
print(r.cookies)

Other request types can also be done in a single line:

r = requests.post('http://httpbin.org/post')
r = requests.put('http://httpbin.org/put')
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')

GET requests

Building a GET request with requests:

import requests

r = requests.get('http://httpbin.org/get')
print(r.text)

To attach extra information to a GET request:

import requests

data = {
        'name' : 'germey',
        'age' : 22
}
r = requests.get('http://httpbin.org/get', params=data)
print(r.text)
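
What params does to the URL can be inspected without sending the request, by preparing it first. This is a sketch that assumes the requests package is installed:

```python
import requests

data = {'name': 'germey', 'age': 22}

# prepare() builds the final URL but does not send anything
prepared = requests.Request('GET', 'http://httpbin.org/get', params=data).prepare()
print(prepared.url)  # http://httpbin.org/get?name=germey&age=22
```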

The returned text is a JSON-formatted string. To parse it directly into a dictionary, call the json() method.

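Fetching web pages

Add headers information that includes the User-Agent field, that is, the browser identification. Without it, some sites refuse to be crawled.

POST requests

Implementing a POST request with requests: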

import requests

data = {'name': 'germey', 'age': '22'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)

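The form part of the returned result is the submitted data.

Responses

text and content give the content of the response
status_code gives the status code
headers gives the response headers
cookies gives the Cookies
url gives the URL
history gives the request history

Advanced usage

1. File upload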

import requests

files = {'file': open('favicon.ico', 'rb')}
r = requests.post("http://httpbin.org/post", files=files)
print(r.text)

favicon.ico must be in the same directory as the current script.

2. Cookies

Getting Cookies:

import requests

r = requests.get("http://www.baidu.com")
print(r.cookies)
for key, value in r.cookies.items():
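    print(key + '=' + value)

The cookies attribute gives the Cookies; it is of type RequestsCookieJar.
The items() method then converts it into a list of tuples, and the loop prints the name and value of each Cookie, which completes the traversal and parsing of the Cookies.

3. Session persistence

A Session object makes it easy to maintain a session: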

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/12345679')
r = s.get('http://httpbin.org/cookies')
print(r.text)

{
  "cookies": {
    "number": "12345679"
  }
}

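A Session is usually used to simulate the next steps after a successful login. It can also simulate opening different pages of the same site in one browser.

4. SSL certificate verification

5. Proxy settings

For some sites, a few requests go through normally. Once large-scale crawling starts, however, the site may show a CAPTCHA, redirect to a login page, or even ban the client's IP for a period of time.
To prevent this, set a proxy with the proxies parameter: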

import requests

proxies = {
    "http": "http://10.10.1.10:3218",
    "https": "http://10.10.1.10:1080"
}

requests.get("https://www.taobao.com", proxies=proxies)

If the proxy requires HTTP Basic Auth, it can be set with a URL of the form http://user:password@host:port.

6. Timeout settings

import requests

r = requests.get('https://www.taobao.com', timeout = 1)
print(r.status_code)

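A request has two phases, connecting and reading. A single timeout value is applied to each phase separately; to set them individually, pass a tuple such as timeout=(5, 30), where the first value is the connect timeout and the second is the read timeout.

7. Authentication

The authentication feature built into requests can be used: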

import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://localhost:5000', auth=HTTPBasicAuth('username', 'password'))
print(r.status_code)

A simpler way to write it:

import requests

r = requests.get('http://localhost:5000', auth=('username', 'password'))
print(r.status_code)

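There are other authentication methods as well, such as OAuth.

8. Prepared Request

A request can be represented as a data structure in which every parameter is expressed through a Request object. In requests this is called a Prepared Request.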
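
A minimal sketch of the idea, assuming the requests package is installed. The header value is a placeholder, and the send step is left commented out so that the snippet runs without network access:

```python
from requests import Request, Session

url = 'http://httpbin.org/post'
data = {'name': 'germey'}
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder value

s = Session()
req = Request('POST', url, data=data, headers=headers)
prepped = s.prepare_request(req)  # a PreparedRequest object

print(prepped.method, prepped.url)  # POST http://httpbin.org/post
print(prepped.body)                 # name=germey
# r = s.send(prepped)  # uncomment to actually send it (needs network access)
```

Because the request is an independent object, it can be inspected, queued, or scheduled before it is sent.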

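Regular expressions

Pattern	Description
\w	Matches letters, digits, and the underscore
\W	Matches any character that is not a letter, digit, or underscore
\s	Matches any whitespace character; equivalent to [ \t\n\r\f]
\S	Matches any non-whitespace character
\d	Matches any digit; equivalent to [0-9]
\D	Matches any non-digit character
\A	Matches the start of the string
\Z	Matches the end of the string; if there is a trailing newline, matches only up to the position before it
\z	Matches the end of the string; if there is a trailing newline, the newline is matched too
\G	Matches the position where the last match finished
\n	Matches a newline
\t	Matches a tab
^	Matches the start of a line
$	Matches the end of a line
.	Matches any character except a newline; if the re.DOTALL flag is set, newlines are included
[...]	A set of characters listed individually; matches any one of them
[^...]	Matches any character not in [ ]
*	Matches 0 or more of the preceding expression
+	Matches 1 or more of the preceding expression
?	Matches 0 or 1 of the fragment defined by the preceding regular expression; non-greedy
{n}	Matches exactly n of the preceding expression
{n,m}	Matches the preceding fragment n to m times; greedy
a|b	Matches a or b
( )	Matches the expression inside the parentheses and represents a group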

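Basic use of the re library

re is part of Python's standard library and is mainly used for string matching.
Import: import re
Raw string type
Form: r'text'
Example: r'[1-9]\d{5}'

Main functions of the re library

Function	Description
re.search()	Searches a string for the first position that matches the regular expression; returns a match object
re.match()	Matches the regular expression from the start of a string; returns a match object
re.findall()	Searches the string and returns all matching substrings as a list
re.split()	Splits a string at the matches of the regular expression; returns a list
re.finditer()	Searches the string and returns an iterator over the matches; each element is a match object
re.sub()	Replaces every substring that matches the regular expression; returns the resulting string

1. re.search(pattern, string, flags=0)
Searches a string for the first position that matches the regular expression; returns a match object
· pattern: the regular expression, as a string or raw string
· string: the string to be matched
· flags: control flags used when applying the regular expression

Common flag	Description
re.I re.IGNORECASE	Ignores case in the regular expression; [A-Z] also matches lowercase characters
re.M re.MULTILINE	The ^ operator matches at the start of every line of the given string
re.S re.DOTALL	The . operator matches every character; by default it matches every character except a newline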

>>> import re
>>> match = re.search(r'[1-9]\d{5}', 'BIT 100081')
>>> if match:
    print(match.group(0))  # search() scans the whole string
    
100081
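
The three flags can be compared side by side on a two-line string. A brief sketch (the sample text is made up):

```python
import re

text = 'Python\npython'

# re.I: ignore case
print(re.findall(r'python', text, re.I))   # ['Python', 'python']

# re.M: ^ also matches at the start of each line
print(re.findall(r'^python', text))        # []
print(re.findall(r'^python', text, re.M))  # ['python']

# re.S: . also matches the newline
print(re.search(r'Python.python', text))   # None
print(re.search(r'Python.python', text, re.S).group(0) == text)  # True
```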

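2. re.match(pattern, string, flags=0)

>>> import re
>>> match = re.match(r'[1-9]\d{5}', 'BIT 100081')
>>> if match:
    match.group(0)  # match() starts at the beginning of the string; the letters do not match the digit pattern, so the result is None

>>> match = re.match(r'[1-9]\d{5}', '100081 BIT')
>>> if match:
    print(match.group(0))  # with the digits moved to the front of the target string, the match succeeds

100081

3. re.findall(pattern, string, flags=0)
Searches the string and returns all matching substrings as a list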

>>> import re
>>> ls = re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084')
>>> ls
['100081', '100084']

4. re.split(pattern, string, maxsplit=0, flags=0)
maxsplit: the maximum number of splits
Splits a string at the matches of the regular expression; returns a list

>>> import re
>>> re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084')
['BIT', ' TSU', '']

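The parts of the string that match the regular expression are removed, and the pieces that remain are placed in the list as its elements.

5. re.finditer(pattern, string, flags=0)
Searches the string and returns an iterator over the matches; each element is a match object.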

>>> import re
>>> for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
    if m:
        print(m.group(0))

        
100081
100084

6. re.sub(pattern, repl, string, count=0, flags=0)
· repl: the string that replaces each match
· count: the maximum number of replacements

>>> re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT100081 TSU100084')
'BIT:zipcode TSU:zipcode'
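
The maxsplit and count parameters limit how many matches are used. The sketch below also shows that repl may be a function; the masking rule in it is made up for illustration:

```python
import re

text = 'BIT100081 TSU100084'
pattern = r'[1-9]\d{5}'

# Only the first match is used as a separator or replaced
print(re.split(pattern, text, maxsplit=1))         # ['BIT', ' TSU100084']
print(re.sub(pattern, ':zipcode', text, count=1))  # BIT:zipcode TSU100084

# repl may also be a function that receives each match object
masked = re.sub(pattern, lambda m: m.group(0)[:3] + '***', text)
print(masked)  # BIT100*** TSU100***
```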

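An equivalent way to use the re library

regex = re.compile(pattern, flags=0)
Compiles the string form of a regular expression into a regular expression object
>>> regex = re.compile(r'[1-9]\d{5}')

Function	Description
regex.search()	Searches a string for the first position that matches the regular expression; returns a match object
regex.match()	Matches from the start of the string
regex.findall()	Returns all matching substrings as a list
regex.split()	Splits at the matching parts and removes them
regex.finditer()	Searches the string and returns an iterator over the matches
regex.sub()	Replaces the matches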
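
Once compiled, the pattern no longer needs to be passed to each call. A short sketch running all six methods on the same text:

```python
import re

regex = re.compile(r'[1-9]\d{5}')
text = 'BIT100081 TSU100084'

print(regex.search(text).group(0))               # 100081
print(regex.match(text))                         # None
print(regex.findall(text))                       # ['100081', '100084']
print(regex.split(text))                         # ['BIT', ' TSU', '']
print([m.span() for m in regex.finditer(text)])  # [(3, 9), (13, 19)]
print(regex.sub(':zipcode', text))               # BIT:zipcode TSU:zipcode
```

Compiling is mainly worthwhile when the same pattern is applied many times, for example inside a crawl loop.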

The Match object of the re library

A match object is the result of one match and carries a lot of information about it

>>> match = re.search(r'[1-9]\d{5}', 'BIT 100081')
>>> if match:
    print(match.group(0))

    
100081
>>> type(match)
<class 're.Match'>

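Attributes of the Match object

Attribute	Description
.string	The text to be matched
.re	The pattern object (regular expression) used for the match
.pos	The position in the text where the search starts
.endpos	The position in the text where the search ends
.group(0)	Returns the matched string
.start()	The start position of the match in the original string
.end()	The end position of the match in the original string
.span()	Returns (.start(), .end())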

>>> import re
>>> m = re.search(r'[1-9]\d{5}', "BIT100081 TSU100084")
>>> m.string
'BIT100081 TSU100084'
>>> m.re
re.compile('[1-9]\\d{5}')
>>> m.pos
0
>>> m.endpos
19
>>> m.group(0)
'100081'
>>> m.start()
3
>>> m.end()
9
>>> m.span()
(3, 9)
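
When the pattern contains parentheses, the same methods accept a group number. A sketch with two groups added to the postcode pattern (the grouping is an illustration, not part of the original example):

```python
import re

m = re.search(r'([A-Z]+)([1-9]\d{5})', 'BIT100081 TSU100084')

print(m.group(0))  # BIT100081  (the whole match)
print(m.group(1))  # BIT        (first group)
print(m.group(2))  # 100081     (second group)
print(m.groups())  # ('BIT', '100081')
print(m.span(2))   # (3, 9)
```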

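Greedy matching and minimal matching in the re library

The re library uses greedy matching by default, that is, it returns the longest matching substring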

>>> match = re.search(r'PY.*N', 'PYANBNCNDN')
>>> match.group(0)
'PYANBNCNDN'
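# PY.*N: starts with PY and ends with N; .* matches any characters in between, as many as possible

Minimal-match operators

Operator	Description
*?	0 or more repetitions of the preceding character; minimal match
+?	1 or more repetitions of the preceding character; minimal match
??	0 or 1 repetition of the preceding character; minimal match
{m,n}?	m to n repetitions (n inclusive) of the preceding character; minimal match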

>>> match = re.search(r'PY.*?N', 'PYANBNCNDN')
>>> match.group(0)
'PYAN'
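
The difference matters most when extracting tags from HTML, which is the typical case in a crawler. A brief sketch with a made-up fragment:

```python
import re

html = '<b>one</b> and <b>two</b>'

# Greedy: .* runs to the last </b>
print(re.findall(r'<b>.*</b>', html))   # ['<b>one</b> and <b>two</b>']
# Minimal: .*? stops at the first </b>
print(re.findall(r'<b>.*?</b>', html))  # ['<b>one</b>', '<b>two</b>']

# The same applies to counted repetition
print(re.search(r'\d{2,4}', '12345').group(0))   # 1234
print(re.search(r'\d{2,4}?', '12345').group(0))  # 12
```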

Scraping Taobao product price information

import requests
import re

def getHTMLText(url):   # fetch the page
    try:
        r = requests.get(url, timeout = 30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
    
def parsePage(ilt, html):   # parse the fetched page
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            # each match looks like "key":"value"; eval() strips the quotes around the value
            price = eval(plt[i].split(':')[1])
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except:
        print("")

def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count,g[0],g[1]))

def main():
    goods = '书包'  # search keyword ("schoolbag")
    depth = 2       # number of result pages to crawl
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []   # collected results
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)

main()
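
The parsing step can be tried without a network connection. The sketch below is a possible variation rather than the code above: it uses capture groups instead of split() and eval(), so a title containing a colon is kept whole and no scraped text is evaluated. parse_items and the sample fragment are hypothetical:

```python
import re

def parse_items(html):
    """Return [price, title] pairs found in the page text."""
    prices = re.findall(r'"view_price":"([\d.]*)"', html)
    titles = re.findall(r'"raw_title":"(.*?)"', html)
    return [[price, title] for price, title in zip(prices, titles)]

# Hypothetical fragment in the shape the patterns expect
sample = ('{"raw_title":"书包 A","view_price":"59.00"},'
          '{"raw_title":"书包 B: 大号","view_price":"128.50"}')

items = parse_items(sample)
print(items)  # [['59.00', '书包 A'], ['128.50', '书包 B: 大号']]
```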

Scraping the Maoyan movie rankings

import json
import requests
from requests.exceptions import RequestException
import re
import time


def get_one_page(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    # Raw strings keep \d from being treated as a string escape
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                         + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],
            'time': item[4].strip()[5:],
            'score': item[5] + item[6]
        }


def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # the request failed, so there is nothing to parse
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)
