Web Crawler Study Notes

Web Crawlers

Python Review

Data Types

Lists

# List elements can be reassigned (lists are mutable)
a = [11,'avc',1.1]

Tuples

# Tuple elements cannot be reassigned (tuples are immutable)
b = (11,'avc',1.1)
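
The difference between the two comments above can be checked directly. This is a minimal sketch; the variable `message` is only an illustrative name for the captured error text.

```python
a = [11, 'avc', 1.1]
a[0] = 22                 # allowed: a list can be changed in place

b = (11, 'avc', 1.1)
try:
    b[0] = 22             # not allowed: a tuple cannot be changed in place
except TypeError as err:
    message = str(err)
    print(message)

print(a)                  # [22, 'avc', 1.1]
```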

Dictionaries

# Equivalent to a Map in Java
c = {'key': 'value', 'abc': 7}

Sets

d = set("asdffa")
e = {'a','c','d'}
# Difference
d-e
# Union
d|e
# Intersection
d&e
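
To make the three set operations concrete, here is a small sketch with the results worked out. `sorted()` is used only because sets have no defined order; the result variable names are illustrative.

```python
d = set("asdffa")        # duplicates collapse, leaving 'a', 's', 'd', 'f'
e = {'a', 'c', 'd'}

difference = d - e       # elements in d but not in e
union = d | e            # elements in either set
intersection = d & e     # elements in both sets

print(sorted(difference))    # ['f', 's']
print(sorted(union))         # ['a', 'c', 'd', 'f', 's']
print(sorted(intersection))  # ['a', 'd']
```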

File Operations

open(file_path, mode)

w: write, r: read, b: binary, a: append
```
# A Windows path such as "D:\Python\1.txt" must be written as "D:\\Python\\1.txt" or "D:/Python/1.txt"
fh = open("D:\\Python\\1.txt", "r")
data = fh.read()
print(data)
fh.close()
```
Problem encountered: when the code below runs, readline() prints nothing
```
fh = open("D:\\Python\\1.txt", "r", encoding="utf-8")
data = fh.read()
print(data)
line = fh.readline()
print(line)
fh.close()
```
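Explanation: readline() reads from the current file position. Because data = fh.read() has already consumed the whole file, the position is at the end of the file, so readline() has nothing left to read and returns an empty string.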
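
One way to read the file again is to move the position back to the start with `seek(0)`. The sketch below writes its own temporary file (the file name and contents are assumptions made for the example) so it does not depend on `D:\Python\1.txt`.

```python
import os
import tempfile

# Throwaway file with two known lines
path = os.path.join(tempfile.mkdtemp(), "1.txt")
with open(path, "w", encoding="utf-8") as fh:
    fh.write("first line\nsecond line\n")

with open(path, "r", encoding="utf-8") as fh:
    data = fh.read()        # reads everything; the position is now at the end
    empty = fh.readline()   # '' because nothing is left to read
    fh.seek(0)              # move the position back to the start of the file
    line = fh.readline()    # 'first line\n'

print(repr(empty), repr(line))
```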

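In w mode the content is saved when the file is closed. Opening the file again in w mode truncates it, so the new write replaces the previous content.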

fh = open("D:\\Python\\1.txt", "w", encoding="utf-8")
data = "来啊"
fh.write(data)
fh.close()
fh = open("D:\\Python\\1.txt", "w", encoding="utf-8")
data = "快活啊"
fh.write(data)
fh.close()

# File contents: 快活啊

Opening the file in a mode instead appends to the existing content

fh = open("D:\\Python\\1.txt", "w", encoding="utf-8")
data = "来啊"
fh.write(data)
fh.close()
fh = open("D:\\Python\\1.txt", "a", encoding="utf-8")
data = "快活啊"
fh.write(data)
fh.close()
# File contents: 来啊快活啊
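
The same overwrite and append behaviour can be reproduced with the `with` statement, which closes the file automatically. This is a sketch under assumptions: the temporary path and the helper names `write_text` and `read_text` are hypothetical and exist only for this example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

def write_text(mode, text):
    # The with block closes the file even if write() raises
    with open(path, mode, encoding="utf-8") as fh:
        fh.write(text)

def read_text():
    with open(path, "r", encoding="utf-8") as fh:
        return fh.read()

write_text("w", "来啊")
write_text("w", "快活啊")      # w truncates the file before writing
after_overwrite = read_text()

write_text("w", "来啊")
write_text("a", "快活啊")      # a keeps the old content and writes after it
after_append = read_text()

print(after_overwrite)   # 快活啊
print(after_append)      # 来啊快活啊
```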

Exception Handling

Exception handling syntax

'''
try:
    code that may raise
except Exception as name:
    handling code
'''
try:
    for i in range(0, 10):
        print(i)
        if i == 2:
            print(1/0)
    print("hello")
except Exception as err:
    print(err)
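
In the loop above, the division fails when i == 2, so 0, 1 and 2 are printed, then "division by zero", and "hello" is never reached. As a complementary sketch, the optional `else` and `finally` clauses can be added and a specific exception caught; the function name `safe_divide` is hypothetical.

```python
def safe_divide(a, b):
    try:
        result = a / b
    except ZeroDivisionError as err:
        # Catching the specific exception avoids hiding unrelated bugs
        print("error:", err)
        return None
    else:
        # Runs only when the try block raised nothing
        return result
    finally:
        # Runs in every case, which makes it a good place for cleanup
        print("done")

print(safe_divide(1, 2))   # 0.5
print(safe_divide(1, 0))   # None
```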

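Object-Oriented Programming

An ordinary method in a class must declare self as its first parameter to be called on an instance; the instance is passed in as self automatically.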

class Father:
    def speak(self):
        print("I can speak")


class Mother:
    def write(self):
        print("I can write")


class Son(Father):
    def speak(self):
        print("I can speak well")


class Daughter(Father, Mother):
    pass


father = Father()
father.speak()
son = Son()
son.speak()
daughter = Daughter()
daughter.speak()
daughter.write()


# I can speak
# I can speak well
# I can speak
# I can write
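
With multiple inheritance, Python looks up a method by walking the class's method resolution order (MRO). The sketch below repeats trimmed versions of the classes above and inspects that order; the list `mro_names` is an illustrative name.

```python
class Father:
    def speak(self):
        return "I can speak"

class Mother:
    def write(self):
        return "I can write"

class Daughter(Father, Mother):
    pass

# The lookup order follows the order of the base classes in the class statement
mro_names = [cls.__name__ for cls in Daughter.__mro__]
print(mro_names)   # ['Daughter', 'Father', 'Mother', 'object']

daughter = Daughter()
print(daughter.speak(), "/", daughter.write())
```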

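Regular Expressions

Regular expressions filter data and extract the information we care about.

Atoms

Regular expressions in Python are provided by the re module.

An atom is the most basic unit of a regular expression, and every regular expression contains at least one. Common atom types are:

Ordinary characters as atoms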

import re

string = "xixihaha"
# Ordinary characters as atoms
pat = "ih"
result = re.search(pat, string)
print(result)
# <re.Match object; span=(3, 5), match='ih'>

Non-printing characters as atoms

# Non-printing characters as atoms
# \n newline, \t tab
string = '''xixi
haha
'''
pat = "\n"
result = re.search(pat, string)
print(result)
# <re.Match object; span=(4, 5), match='\n'>

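General characters as atoms
- \w matches a letter, digit, or underscore; \W matches any other character
- \d matches a decimal digit; \D matches any non-digit
- \s matches a whitespace character; \S matches any non-whitespace character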
```
# General characters as atoms: \w matches 'a', then \d\d matches '12'
string = '''xixiha123haha'''
pat = r"\w\d\d"
result = re.search(pat, string)
print(result)
# <re.Match object; span=(5, 8), match='a12'>
```
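
The general character classes listed above can be compared side by side. This is a small sketch on a made-up sample string.

```python
import re

text = "id_42 ok!"   # hypothetical sample input

words = re.findall(r"\w+", text)      # runs of letters, digits, underscores
non_words = re.findall(r"\W", text)   # everything else, one character at a time
digits = re.findall(r"\d", text)      # decimal digits
spaces = re.findall(r"\s", text)      # whitespace characters

print(words)      # ['id_42', 'ok']
print(non_words)  # [' ', '!']
print(digits)     # ['4', '2']
print(spaces)     # [' ']
```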

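Atom tables (character classes) group several atoms into one set

string = '''zhangfuzhi'''
pat = "zhang[efg]u" # the atom table [efg] contains 'e', 'f', and 'g'; any one of them may match at this position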
result = re.search(pat, string)
print(result)
# <re.Match object; span=(0, 7), match='zhangfu'>
# Counterexample:
string = '''zhangfuzhi'''
pat = "zhang[eg]u"
result = re.search(pat, string)
print(result)
# None

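Metacharacters

Metacharacters are characters with a special meaning in a regular expression, for example repeating the preceding atom N times.

. matches any single character except a newline
^ outside an atom table, matches the start position; as the first character inside an atom table, means negation
$ matches the end position
* the preceding atom appears zero or more times
? the preceding atom appears zero or one time
+ the preceding atom appears one or more times
{n} the preceding atom appears exactly n times
{n,} the preceding atom appears at least n times
{n,m} the preceding atom appears at least n and at most m times
| alternation ("or")
() pattern unit (group)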
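
A few of the metacharacters in the list can be tried on one sample string. This is a minimal sketch; the input text is made up for the example.

```python
import re

text = "ab abb abbb ac"

star = re.findall(r"ab*", text)         # zero or more 'b'
plus = re.findall(r"ab+", text)         # one or more 'b'
optional = re.findall(r"ab?", text)     # zero or one 'b'
exactly = re.findall(r"ab{2}", text)    # exactly two 'b'
at_least = re.findall(r"ab{2,}", text)  # two or more 'b'
negated = re.findall(r"a[^b]", text)    # 'a' followed by anything except 'b'
anchored = re.findall(r"^ab|ac$", text) # 'ab' at the start or 'ac' at the end

print(star)      # ['ab', 'abb', 'abbb', 'a']
print(plus)      # ['ab', 'abb', 'abbb']
print(optional)  # ['ab', 'ab', 'ab', 'a']
print(exactly)   # ['abb', 'abb']
print(at_least)  # ['abb', 'abbb']
print(negated)   # ['ac']
print(anchored)  # ['ab', 'ac']
```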

Pattern modifiers (flags)

I ignore case
M multi-line matching
L locale-aware matching
U Unicode-aware matching
S make . match newlines as well

string = '''Python'''
pat = "pyt"
result = re.search(pat, string, re.I)
print(result)
# <re.Match object; span=(0, 3), match='Pyt'>
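
The M and S modifiers are easiest to see on a string that contains a newline. The following sketch uses a made-up two-line string; the variable names are illustrative.

```python
import re

text = "first\nsecond"

# Without re.M, ^ matches only at the very start of the string
no_flag = re.findall(r"^\w+", text)
# With re.M, ^ also matches right after each newline
multi = re.findall(r"^\w+", text, re.M)

# Without re.S, . does not match the newline, so there is no match
dot_plain = re.search(r"first.second", text)
# With re.S, . matches the newline too
dot_all = re.search(r"first.second", text, re.S)

print(no_flag)    # ['first']
print(multi)      # ['first', 'second']
print(dot_plain)  # None
print(repr(dot_all.group()))
```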

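Greedy and lazy modes

Greedy mode matches as much as possible; lazy mode matches as little as possible. Greedy is the default.

string = '''Pythony'''
pat = "p.*y"  # greedy mode
pat2 = "p.*?y"  # the ? after * switches to lazy mode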
result = re.search(pat, string, re.I)
result2 = re.search(pat2, string, re.I)
print(result)
print(result2)

# <re.Match object; span=(0, 7), match='Pythony'>
# <re.Match object; span=(0, 2), match='Py'>
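
The difference matters most when extracting repeated fragments such as HTML tags. A short sketch on a made-up snippet:

```python
import re

html = "<b>one</b><b>two</b>"   # hypothetical sample markup

# Greedy: .* runs to the last </b>, so both tags are swallowed in one match
greedy = re.findall(r"<b>.*</b>", html)
# Lazy: .*? stops at the first </b>, so each tag is matched separately
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```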

Regular expression functions

match() matches only from the beginning of the string

string = '''Pythony'''
pat = "p.*y"
pat2 = "o.*?y"
result = re.match(pat, string, re.I)
result2 = re.match(pat2, string, re.I)
print(result)
print(result2)

# <re.Match object; span=(0, 7), match='Pythony'>
# None

search() can match anywhere in the string
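
As a quick comparison, the pattern that returned None with match() above does succeed with search(). A minimal sketch:

```python
import re

string = "Pythony"

matched = re.match(r"o.*?y", string)    # None: the string does not start with 'o'
found = re.search(r"o.*?y", string)     # scans forward until a match is found

print(matched)       # None
print(found.span())  # (4, 7)
print(found.group()) # ony
```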

Global matching format: re.compile(pattern).findall(data)

string = '''pythonypouyppypady'''
pat = "p.*?y"
result = re.compile(pat).findall(string)
print(result)
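
findall() returns every non-overlapping match from left to right. The sketch below repeats the example and uses finditer() to show where each match sits; the names `pattern` and `spans` are illustrative.

```python
import re

string = "pythonypouyppypady"
pattern = re.compile(r"p.*?y")

matches = pattern.findall(string)
# finditer() yields Match objects, which also carry the position of each match
spans = [m.span() for m in pattern.finditer(string)]

print(matches)  # ['py', 'pouy', 'ppy', 'pady']
print(spans)    # [(0, 2), (7, 11), (11, 14), (14, 18)]
```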

Regex examples

Matching URLs that end in .com or .cn

string = "<a href='http://www.baidu.com'>百度</a><a href='http://www.jd.com'>百度</a>"
pat = r"[a-zA-Z]+://[^\s']*?\.(?:com|cn)\b"  # a group is needed for alternation; [.com|.cn] would only be a set of single characters
result = re.compile(pat).findall(string)
print(result)
# ['http://www.baidu.com', 'http://www.jd.com']
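
Square brackets always define a set of single characters, so a suffix such as .com or .cn has to be written as a group with alternation. The sketch below compares the two forms on a made-up string that also contains a .org link; the URLs and variable names are assumptions for illustration.

```python
import re

string = "<a href='http://www.example.org'>x</a> <a href='http://www.jd.com'>y</a>"

# [.com|.cn] is the character set '.', 'c', 'o', 'm', '|', 'n',
# so the .org link is accepted up to its last 'o'
char_class = re.findall(r"[a-zA-Z]+://[^\s]*[.com|.cn]", string)

# \.(?:com|cn)\b requires a literal dot followed by the whole suffix
grouped = re.findall(r"[a-zA-Z]+://[^\s']*?\.(?:com|cn)\b", string)

print(char_class)  # ['http://www.example.o', 'http://www.jd.com']
print(grouped)     # ['http://www.jd.com']
```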

Matching phone numbers

string = "dashd0534-5657888dasbd a0534-5325695asdshgiusae//.001-12345678"
pat = r"\d{4}-\d{7}|\d{3}-\d{8}"
result = re.compile(pat).findall(string)
print(result)
# ['0534-5657888', '0534-5325695', '001-12345678']

A Simple Exercise

Scrape publisher names and write them to a file

import re
import urllib.request

url = "https://read.douban.com/provider/all"
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36 Edg/80.0.361.62'}
ret = urllib.request.Request(url, headers=header)
res = urllib.request.urlopen(ret)
data = res.read().decode('utf-8')

# The target markup looks like: <div class="name">重庆大学出版社</div>
pat = ">(.{1,10}?出版社)"
result = re.compile(pat).findall(data)
print(result)
# Write one publisher name per line; utf-8 keeps the Chinese names intact
fh = open("D:\\Python\\1.txt", "w", encoding="utf-8")
for i in result:
    fh.write(i + '\n')
fh.close()
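
The extraction step can be tried offline against a hard-coded snippet. When a pattern contains one group, findall() returns only the text captured by that group, which is why the leading ">" does not appear in the result. In this sketch the second publisher name, 示例出版社, is a made-up placeholder and the markup is assumed to follow the sample shown in the comment above.

```python
import re

# Hypothetical markup modelled on the sample <div class="name">...</div> element
data = '<div class="name">重庆大学出版社</div><div class="name">示例出版社</div>'

pat = ">(.{1,10}?出版社)"
result = re.compile(pat).findall(data)

print(result)  # ['重庆大学出版社', '示例出版社']
```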

The Requests Library

GET and POST requests

import requests

# GET request
r = requests.get('http://httpbin.org/get')
print(r.status_code, r.reason)
print(r.text)

# POST request
r = requests.post('http://httpbin.org/post', data={'a': '1'})
print(r.json())

httpbin is an HTTP Request & Response Service: you send it a request, and it echoes the request back according to fixed rules.

Requests with custom headers

ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36 Edg/80.0.361.62'
headers = {'User-Agent': ua}
r = requests.get('http://httpbin.org/get', headers=headers, params={'c': '3'})
print(r.json())

Requests with cookies

cookies = dict(usrid='12345', token='hhhhhhh')
r = requests.get('http://httpbin.org/cookies', cookies=cookies)
print('Request with cookies', r.json())
# Request with cookies {'cookies': {'token': 'hhhhhhh', 'usrid': '12345'}}

Basic-auth authenticated requests

r = requests.get('http://httpbin.org/basic-auth/zfz/123456', auth=('zfz', '123456'))
print('Basic-auth request', r.json())
# Basic-auth request {'authenticated': True, 'user': 'zfz'}

Raising an exception for an error status code

bad_r = requests.get('http://httpbin.org/404')
print(bad_r.status_code)
bad_r.raise_for_status()

Using the requests.Session object

# Create a Session object
s = requests.Session()
# The Session object stores the cookies set by the server's Set-Cookie response headers
s.get('http://httpbin.org/cookies/set/userid/123456')
# The next request automatically sends all locally stored cookies in its headers
r = s.get('http://httpbin.org/cookies')
print('Cookies in the session', r.json())

# Cookies in the session {'cookies': {'userid': '123456'}}

The BeautifulSoup Library

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

Select the first title tag

soup.title
# <title>The Dormouse's story</title>

Text content of the title tag

soup.title.text
# "The Dormouse's story"

Get all attributes of the first a tag

soup.a.attrs
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

Get the href attribute of the a tag

soup.a.attrs['href']
# 'http://example.com/elsie'

Check whether the tag has a given attribute (here, class)
```
soup.a.has_attr('class')
```
Get all child nodes of the first p tag; the result is an iterator, so convert it with list()
```
list(soup.p.children)
```

Get all links on the page

for a in soup.find_all('a'):
    print(a.attrs['href'])
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

Get the node with id=link3

soup.find(id='link3')
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

BeautifulSoup supports CSS selectors

soup.select('.story')
# [<p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>, <p class="story">...</p>]

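XPath

XPath is a language for locating information in XML documents.

Nodes

Node types include element, attribute, text, namespace, and document (root) nodes.

Node relationships

Parent
Children
Sibling
Ancestor
Descendant

XPath syntax

Expression	Description
nodename	Selects the child elements with this name
/	Selects direct children of the current node
//	Selects descendants of the current node at any depth
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes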
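
The expressions in the table can be tried with the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. In that subset `@` works inside a predicate such as `[@id='b2']`, and attribute values are read with `.get()`. This is a sketch under assumptions: the XML document and all names in it are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two books on a shelf and one directly under the root
xml_doc = """
<library>
  <shelf name="top">
    <book id="b1"><title>First</title></book>
    <book id="b2"><title>Second</title></book>
  </shelf>
  <book id="b3"><title>Third</title></book>
</library>
"""

root = ET.fromstring(xml_doc)

# ./book selects only direct children of the current node
direct_books = [b.get("id") for b in root.findall("./book")]
# .//book selects matching descendants at any depth
all_books = [b.get("id") for b in root.findall(".//book")]
# [@id='b2'] filters on an attribute value
second_title = [b.find("title").text for b in root.findall(".//book[@id='b2']")]
# .. steps up to the parent of the matched node
parents = [p.tag for p in root.findall(".//book[@id='b1']/..")]

print(direct_books)  # ['b3']
print(all_books)     # ['b1', 'b2', 'b3']
print(second_title)  # ['Second']
print(parents)       # ['shelf']
```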

Last edited: 2020.03.09 21:44:43
