协程
Python社区虽然对于异步编程的支持相比其他语言稍显迟缓,但是也在Python3.4中加入了asyncio,在Python3.5上又提供了async/await语法层面的支持,刚正式发布的Python3.6中asyncio也已经由临时版改为了稳定版。
通常在Python中我们进行并发编程一般都是使用多线程或者多进程来实现的,对于计算型任务由于GIL的存在我们通常使用多进程来实现,而对与IO型任务我们可以通过线程调度来让线程在执行IO任务时让出GIL,从而实现表面上的并发。
其实对于IO型任务我们还有一种选择就是协程,协程是运行在单线程当中的“并发”,协程相比多线程一大优势就是省去了多线程之间的切换开销,获得了更大的运行效率。Python中的asyncio也是基于协程来进行实现的。在进入asyncio之前我们先来了解一下Python中怎么通过生成器进行协程来实现并发。
生成器(generator)
Iterables
When you create a list, you can read its items one by one. Reading its items one by one is called iteration:
>>> mylist = [1, 2, 3]
>>> for i in mylist:
... print(i)
1
2
3
Generators
Generators are iterators, but you can only iterate over them once. It's because they do not store all the values in memory, they generate the values on the fly:
>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
... print(i)
0
1
4
Yield
yield is a keyword that is used like return, except the function will return a generator.
>>> def createGenerator():
... mylist = range(3)
... for i in mylist:
... yield i*i
...
>>> mygenerator = createGenerator() # create a generator
>>> print(mygenerator) # mygenerator is an object!
<generator object createGenerator at 0xb7555c34>
>>> for i in mygenerator:
... print(i)
0
1
4
To master yield, you must understand that when you call the function, the code you have written in the function body does not run. The function only returns the generator object, this is a bit tricky :-)
Now the hard part:
The first time the for calls the generator object created from your function, it will run the code in your function from the beginning until it hits yield, then it'll return the first value of the loop. Then, each other call will run the loop you have written in the function one more time, and return the next value, until there is no value to return.
使用asyncio模块实现协程
我们自己定义了一个协程display_date(num, loop),然后它使用关键字yield from来等待协程asyncio.sleep(2)的返回结果。而在这等待的2s之间它会让出CPU的执行权,直到asyncio.sleep(2)返回结果。gather()或者wait()来返回future的执行结果。
from collections import deque
def student(name, homeworks):
for homework in homeworks.items():
yield (name, homework[0], homework[1]) # 学生"生成"作业给老师
class Teacher(object):
def __init__(self, students):
self.students = deque(students)
def handle(self):
"""老师处理学生作业"""
while len(self.students):
student = self.students.pop()
try:
homework = next(student)
print('handling', homework[0], homework[1], homework[2])
except StopIteration:
pass
else:
self.students.appendleft(student)
if __name__ == '__main__':
Teacher([
student('Student1', {'math': '1+1=2', 'cs': 'operating system'}),
student('Student2', {'math': '2+2=4', 'cs': 'computer graphics'}),
student('Student3', {'math': '3+3=5', 'cs': 'compiler construction'})
]).handle()
运行结果如下
ziwenxie :: ~ » python coroutine.py
Loop: 1 Time: 2016-12-19 16:06:46.515329
Loop: 2 Time: 2016-12-19 16:06:46.515446
Loop: 1 Time: 2016-12-19 16:06:48.517613
Loop: 2 Time: 2016-12-19 16:06:48.517724
Loop: 1 Time: 2016-12-19 16:06:50.520005
Loop: 2 Time: 2016-12-19 16:06:50.520169
Loop: 1 Time: 2016-12-19 16:06:52.522452
Loop: 2 Time: 2016-12-19 16:06:52.522567
Loop: 1 Time: 2016-12-19 16:06:54.524889
Loop: 2 Time: 2016-12-19 16:06:54.525031
Loop: 1 Time: 2016-12-19 16:06:56.527713
Loop: 2 Time: 2016-12-19 16:06:56.528102
在Python3.5中为我们提供更直接的对协程的支持,引入了async/await关键字,上面的代码可以这样改写,使用async代替了@asyncio.coroutine,使用了await代替了yield from,这样代码变得更加简洁可读。
import asyncio
import datetime
async def display_date(num, loop): # 声明一个协程
end_time = loop.time() + 10.0
while True:
print("Loop: {} Time: {}".format(num, datetime.datetime.now()))
if (loop.time() + 1.0) >= end_time:
break
await asyncio.sleep(2) # 等同于yield from
loop = asyncio.get_event_loop() # 获取一个event_loop
tasks = [display_date(1, loop), display_date(2, loop)]
loop.run_until_complete(asyncio.gather(*tasks)) # 阻塞直到所有的tasks完成
loop.close()
例:Chain coroutines
import asyncio
async def compute(x, y):
print("Compute %s + %s ..." % (x, y))
await asyncio.sleep(1.0) # 协程compute不会继续往下面执行,直到协程sleep返回结果
return x + y
async def print_sum(x, y):
result = await compute(x, y) # 协程print_sum不会继续往下执行,直到协程compute返回结果
print("%s + %s = %s" % (x, y, result))
loop = asyncio.get_event_loop()
loop.run_until_complete(print_sum(1, 2))
loop.close()
运行结果如下
Compute 1 + 2 ...
1 + 2 = 3
并发哪家强
requests + ThreadPoolExecutor
import time
import requests
from concurrent.futures import ThreadPoolExecutor
NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'
def fetch(a):
r = requests.get(URL.format(a))
return r.json()['args']['a']
start = time.time()
with ThreadPoolExecutor(max_workers=3) as executor:
for num, result in zip(NUMBERS, executor.map(fetch, NUMBERS)):
print('fetch({}) = {}'.format(num, result))
print('Use requests+ThreadPoolExecutor cost: {}'.format(time.time() - start))
运行结果如下
fetch(0) = 0
fetch(1) = 1
fetch(2) = 2
fetch(3) = 3
fetch(4) = 4
fetch(5) = 5
fetch(6) = 6
fetch(7) = 7
fetch(8) = 8
fetch(9) = 9
fetch(10) = 10
fetch(11) = 11
Use requests+ThreadPoolExecutor cost: 5.653115510940552
asyncio + aiohttp
import time
import requests
import asyncio
import aiohttp
NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'
async def fetch_async(a):
async with aiohttp.request('GET', URL.format(a)) as r:
data = await r.json()
return data['args']['a']
start = time.time()
event_loop = asyncio.get_event_loop()
tasks = [fetch_async(num) for num in NUMBERS]
results = event_loop.run_until_complete(asyncio.gather(*tasks))
for num, result in zip(NUMBERS, results):
print('fetch({}) = {}'.format(num, result))
print('Use asyncio+requests+ThreadPoolExecutor cost: {}'.format(time.time() - start))
运行结果如下
fetch(0) = 0
fetch(1) = 1
fetch(2) = 2
fetch(3) = 3
fetch(4) = 4
fetch(5) = 5
fetch(6) = 6
fetch(7) = 7
fetch(8) = 8
fetch(9) = 9
fetch(10) = 10
fetch(11) = 11
Use asyncio+requests+ThreadPoolExecutor cost: 3.3252382278442383
同步机制
asyncio模块包含多种同步机制,每个原语的解释可以看 线程篇 ,这些原语的用法上和线程/进程有一些区别。
Semaphore(信号量)
并发的去爬取显然可以让爬虫工作显得更有效率,但是我们应该把抓取做的无害,这样既可以保证我们不容易发现,也不会对被爬的网站造成一些额外的压力。
在这里吐槽下,豆瓣现在几乎成了爬虫练手专用网站,我个人也不知道为啥?欢迎留言告诉我。难道是豆瓣一直秉承尊重用户的原则不轻易对用户才去封禁策略,造成大家觉得豆瓣最适合入门么?BTW,我每天在后台都能看到几十万次无效的抓取,也就是抓取程序写的有问题,但还在不停地请求着...
好吧回到正题,比如我现在要抓取http://httpbin.org/get?a=X这样的页面,X为1-10000的数字,一次性的产生1w次请求显然很快就会被封掉。那么我们可以用Semaphore控制同时的并发量(例子中为了演示,X为0-11):
import aiohttp
import asyncio
NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'
sema = asyncio.Semaphore(3)
async def fetch_async(a):
async with aiohttp.request('GET', URL.format(a)) as r:
data = await r.json()
return data['args']['a']
async def print_result(a):
with (await sema):
r = await fetch_async(a)
print('fetch({}) = {}'.format(a, r))
loop = asyncio.get_event_loop()
f = asyncio.wait([print_result(num) for num in NUMBERS])
loop.run_until_complete(f)
asyncio — Asynchronous I/O, event loop, coroutines and tasks
Base Event Loop
The event loop is the central execution device provided by asyncio
. It provides multiple facilities, including:
- Registering, executing and cancelling delayed calls (timeouts).
- Creating client and server transports for various kinds of communication.
- Launching subprocesses and the associated transports for communication with an external program.
- Delegating costly function calls to a pool of threads.
Event loop examples
Example using the AbstractEventLoop.call_soon()
method to schedule a callback. The callback displays "Hello World"
and then stops the event loop:
import asyncio
def hello_world(loop):
print('Hello World')
loop.stop()
loop = asyncio.get_event_loop()
# Schedule a call to hello_world()
loop.call_soon(hello_world, loop)
# Blocking call interrupted by loop.stop()
loop.run_forever()
loop.close()
Example of callback displaying the current date every second. The callback uses the AbstractEventLoop.call_later()
method to reschedule itself during 5 seconds, and then stops the event loop:
import asyncio
import datetime
def display_date(end_time, loop):
print(datetime.datetime.now())
if (loop.time() + 1.0) < end_time:
loop.call_later(1, display_date, end_time, loop)
else:
loop.stop()
loop = asyncio.get_event_loop()
# Schedule the first call to display_date()
end_time = loop.time() + 5.0
loop.call_soon(display_date, end_time, loop)
# Blocking call interrupted by loop.stop()
loop.run_forever()
loop.close()
Event loops
Tasks and coroutines
Coroutines used with asyncio may be implemented using the async def
statement, or by using generators. The async def type of coroutine was added in Python 3.5, and is recommended if there is no need to support older Python versions.
Generator-based coroutines should be decorated with @asyncio.coroutine
, although this is not strictly enforced. The decorator enables compatibility with async def coroutines, and also serves as documentation. Generator-based coroutines use the yield from syntax introduced in PEP 380, instead of the original yield syntax.
The word “coroutine”, like the word “generator”, is used for two different (though related) concepts:
- The function that defines a coroutine (a function definition using async def
or decorated with @asyncio.coroutine). If disambiguation is needed we will call this a coroutine function (iscoroutinefunction() returns True). - The object obtained by calling a coroutine function. This object represents a computation or an I/O operation (usually a combination) that will complete eventually. If disambiguation is needed we will call it a coroutine object(iscoroutine() returns True).
Example of coroutine displaying "Hello World":
import asyncio
async def hello_world():
print("Hello World!")
loop = asyncio.get_event_loop()
# Blocking call which returns when the hello_world() coroutine is done
loop.run_until_complete(hello_world())
loop.close()
Example of coroutine displaying the current date every second during 5 seconds using the sleep()
function:
import asyncio
import datetime
async def display_date(loop):
end_time = loop.time() + 5.0
while True:
print(datetime.datetime.now())
if (loop.time() + 1.0) >= end_time:
break
await asyncio.sleep(1)
loop = asyncio.get_event_loop()
# Blocking call which returns when the display_date() coroutine is done
loop.run_until_complete(display_date(loop))
loop.close()
Example chaining coroutines:
import asyncio
async def compute(x, y):
print("Compute %s + %s ..." % (x, y))
await asyncio.sleep(1.0)
return x + y
async def print_sum(x, y):
result = await compute(x, y)
print("%s + %s = %s" % (x, y, result))
loop = asyncio.get_event_loop()
loop.run_until_complete(print_sum(1, 2))
loop.close()
compute() is chained to print_sum(): print_sum() coroutine waits until compute() is completed before returning its result.
Sequence diagram of the example:
Example combining a Future
and a coroutine function:
import asyncio
async def slow_operation(future):
await asyncio.sleep(1)
future.set_result('Future is done!')
loop = asyncio.get_event_loop()
future = asyncio.Future()
asyncio.ensure_future(slow_operation(future))
loop.run_until_complete(future)
print(future.result())
loop.close()
The coroutine function is responsible for the computation (which takes 1 second) and it stores the result into the future. The run_until_complete()
method waits for the completion of the future.
Example executing 3 tasks (A, B, C) in parallel:
import asyncio
async def factorial(name, number):
f = 1
for i in range(2, number+1):
print("Task %s: Compute factorial(%s)..." % (name, i))
await asyncio.sleep(1)
f *= i
print("Task %s: factorial(%s) = %s" % (name, number, f))
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(
factorial("A", 2),
factorial("B", 3),
factorial("C", 4),
))
loop.close()
Output
Task A: Compute factorial(2)...
Task B: Compute factorial(2)...
Task C: Compute factorial(2)...
Task A: factorial(2) = 2
Task B: Compute factorial(3)...
Task C: Compute factorial(3)...
Task B: factorial(3) = 6
Task C: Compute factorial(4)...
Task C: factorial(4) = 24
参考文献:
文中大部分内容摘自Python并发编程之协程/异步IO
关于生成器摘自What does the “yield” keyword do in Python?