Python并发编程

协程

Python社区虽然对于异步编程的支持相比其他语言稍显迟缓,但是也在Python3.4中加入了asyncio,在Python3.5上又提供了async/await语法层面的支持,刚正式发布的Python3.6中asyncio也已经由临时版改为了稳定版。

通常在Python中我们进行并发编程一般都是使用多线程或者多进程来实现的,对于计算型任务由于GIL的存在我们通常使用多进程来实现,而对与IO型任务我们可以通过线程调度来让线程在执行IO任务时让出GIL,从而实现表面上的并发。

其实对于IO型任务我们还有一种选择就是协程,协程是运行在单线程当中的“并发”,协程相比多线程一大优势就是省去了多线程之间的切换开销,获得了更大的运行效率。Python中的asyncio也是基于协程来进行实现的。在进入asyncio之前我们先来了解一下Python中怎么通过生成器进行协程来实现并发。

生成器(generator)

Iterables

When you create a list, you can read its items one by one. Reading its items one by one is called iteration:

>>> mylist = [1, 2, 3]
>>> for i in mylist:
...    print(i)
1
2
3

Generators

Generators are iterators, but you can only iterate over them once. It's because they do not store all the values in memory, they generate the values on the fly:

>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...    print(i)
0
1
4

Yield

yield is a keyword that is used like return, except the function will return a generator.

>>> def createGenerator():
...    mylist = range(3)
...    for i in mylist:
...        yield i*i
...
>>> mygenerator = createGenerator() # create a generator
>>> print(mygenerator) # mygenerator is an object!
<generator object createGenerator at 0xb7555c34>
>>> for i in mygenerator:
...     print(i)
0
1
4

To master yield, you must understand that when you call the function, the code you have written in the function body does not run. The function only returns the generator object, this is a bit tricky :-)

Now the hard part:

The first time the for calls the generator object created from your function, it will run the code in your function from the beginning until it hits yield, then it'll return the first value of the loop. Then, each other call will run the loop you have written in the function one more time, and return the next value, until there is no value to return.

使用asyncio模块实现协程

我们自己定义了一个协程display_date(num, loop),然后它使用关键字yield from来等待协程asyncio.sleep(2)的返回结果。而在这等待的2s之间它会让出CPU的执行权,直到asyncio.sleep(2)返回结果。gather()或者wait()来返回future的执行结果。

from collections import deque


def student(name, homeworks):
    for homework in homeworks.items():
        yield (name, homework[0], homework[1])  # 学生"生成"作业给老师


class Teacher(object):
    def __init__(self, students):
        self.students = deque(students)

    def handle(self):
        """老师处理学生作业"""
        while len(self.students):
            student = self.students.pop()
            try:
                homework = next(student)
                print('handling', homework[0], homework[1], homework[2])
            except StopIteration:
                pass
            else:
                self.students.appendleft(student)

if __name__ == '__main__':
    Teacher([
        student('Student1', {'math': '1+1=2', 'cs': 'operating system'}),
        student('Student2', {'math': '2+2=4', 'cs': 'computer graphics'}),
        student('Student3', {'math': '3+3=5', 'cs': 'compiler construction'})
    ]).handle()

运行结果如下

ziwenxie :: ~ » python coroutine.py
Loop: 1 Time: 2016-12-19 16:06:46.515329
Loop: 2 Time: 2016-12-19 16:06:46.515446
Loop: 1 Time: 2016-12-19 16:06:48.517613
Loop: 2 Time: 2016-12-19 16:06:48.517724
Loop: 1 Time: 2016-12-19 16:06:50.520005
Loop: 2 Time: 2016-12-19 16:06:50.520169
Loop: 1 Time: 2016-12-19 16:06:52.522452
Loop: 2 Time: 2016-12-19 16:06:52.522567
Loop: 1 Time: 2016-12-19 16:06:54.524889
Loop: 2 Time: 2016-12-19 16:06:54.525031
Loop: 1 Time: 2016-12-19 16:06:56.527713
Loop: 2 Time: 2016-12-19 16:06:56.528102

在Python3.5中为我们提供更直接的对协程的支持,引入了async/await关键字,上面的代码可以这样改写,使用async代替了@asyncio.coroutine,使用了await代替了yield from,这样代码变得更加简洁可读。

import asyncio
import datetime


async def display_date(num, loop):  # 声明一个协程
    end_time = loop.time() + 10.0
    while True:
        print("Loop: {} Time: {}".format(num, datetime.datetime.now()))
        if (loop.time() + 1.0) >= end_time:
            break
        await asyncio.sleep(2)  # 等同于yield from


loop = asyncio.get_event_loop()  # 获取一个event_loop

tasks = [display_date(1, loop), display_date(2, loop)]

loop.run_until_complete(asyncio.gather(*tasks))  # 阻塞直到所有的tasks完成
loop.close()

例:Chain coroutines

import asyncio

async def compute(x, y):
    print("Compute %s + %s ..." % (x, y))
    await asyncio.sleep(1.0)  # 协程compute不会继续往下面执行,直到协程sleep返回结果
    return x + y

async def print_sum(x, y):
    result = await compute(x, y)  # 协程print_sum不会继续往下执行,直到协程compute返回结果
    print("%s + %s = %s" % (x, y, result))

loop = asyncio.get_event_loop()
loop.run_until_complete(print_sum(1, 2))
loop.close()

运行结果如下

Compute 1 + 2 ...
1 + 2 = 3

并发哪家强

requests + ThreadPoolExecutor

import time
import requests
from concurrent.futures import ThreadPoolExecutor

NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'

def fetch(a):
   r = requests.get(URL.format(a))
   return r.json()['args']['a']

start = time.time()
with ThreadPoolExecutor(max_workers=3) as executor:
   for num, result in zip(NUMBERS, executor.map(fetch, NUMBERS)):
       print('fetch({}) = {}'.format(num, result))
print('Use requests+ThreadPoolExecutor cost: {}'.format(time.time() - start))

运行结果如下

fetch(0) = 0
fetch(1) = 1
fetch(2) = 2
fetch(3) = 3
fetch(4) = 4
fetch(5) = 5
fetch(6) = 6
fetch(7) = 7
fetch(8) = 8
fetch(9) = 9
fetch(10) = 10
fetch(11) = 11
Use requests+ThreadPoolExecutor cost: 5.653115510940552

asyncio + aiohttp

import time
import requests
import asyncio
import aiohttp

NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'

async def fetch_async(a):
   async with aiohttp.request('GET', URL.format(a)) as r:
       data = await r.json()
   return data['args']['a']

start = time.time()
event_loop = asyncio.get_event_loop()
tasks = [fetch_async(num) for num in NUMBERS]
results = event_loop.run_until_complete(asyncio.gather(*tasks))
for num, result in zip(NUMBERS, results):
   print('fetch({}) = {}'.format(num, result))

print('Use asyncio+requests+ThreadPoolExecutor cost: {}'.format(time.time() - start))

运行结果如下

fetch(0) = 0
fetch(1) = 1
fetch(2) = 2
fetch(3) = 3
fetch(4) = 4
fetch(5) = 5
fetch(6) = 6
fetch(7) = 7
fetch(8) = 8
fetch(9) = 9
fetch(10) = 10
fetch(11) = 11
Use asyncio+requests+ThreadPoolExecutor cost: 3.3252382278442383

同步机制

asyncio模块包含多种同步机制,每个原语的解释可以看 线程篇 ,这些原语的用法上和线程/进程有一些区别。

Semaphore(信号量)

并发的去爬取显然可以让爬虫工作显得更有效率,但是我们应该把抓取做的无害,这样既可以保证我们不容易发现,也不会对被爬的网站造成一些额外的压力。

在这里吐槽下,豆瓣现在几乎成了爬虫练手专用网站,我个人也不知道为啥?欢迎留言告诉我。难道是豆瓣一直秉承尊重用户的原则不轻易对用户才去封禁策略,造成大家觉得豆瓣最适合入门么?BTW,我每天在后台都能看到几十万次无效的抓取,也就是抓取程序写的有问题,但还在不停地请求着...

好吧回到正题,比如我现在要抓取http://httpbin.org/get?a=X这样的页面,X为1-10000的数字,一次性的产生1w次请求显然很快就会被封掉。那么我们可以用Semaphore控制同时的并发量(例子中为了演示,X为0-11):

import aiohttp
import asyncio

NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'
sema = asyncio.Semaphore(3)

async def fetch_async(a):
   async with aiohttp.request('GET', URL.format(a)) as r:
       data = await r.json()
   return data['args']['a']
   
async def print_result(a):
   with (await sema):
       r = await fetch_async(a)
       print('fetch({}) = {}'.format(a, r))

loop = asyncio.get_event_loop()
f = asyncio.wait([print_result(num) for num in NUMBERS])
loop.run_until_complete(f)

asyncio — Asynchronous I/O, event loop, coroutines and tasks

Base Event Loop

The event loop is the central execution device provided by asyncio
. It provides multiple facilities, including:

  • Registering, executing and cancelling delayed calls (timeouts).
  • Creating client and server transports for various kinds of communication.
  • Launching subprocesses and the associated transports for communication with an external program.
  • Delegating costly function calls to a pool of threads.

Event loop examples

Example using the AbstractEventLoop.call_soon()
method to schedule a callback. The callback displays "Hello World"
and then stops the event loop:

import asyncio

def hello_world(loop):
    print('Hello World')
    loop.stop()

loop = asyncio.get_event_loop()

# Schedule a call to hello_world()
loop.call_soon(hello_world, loop)

# Blocking call interrupted by loop.stop()
loop.run_forever()
loop.close()

Example of callback displaying the current date every second. The callback uses the AbstractEventLoop.call_later()
method to reschedule itself during 5 seconds, and then stops the event loop:

import asyncio
import datetime

def display_date(end_time, loop):
    print(datetime.datetime.now())
    if (loop.time() + 1.0) < end_time:
        loop.call_later(1, display_date, end_time, loop)
    else:
        loop.stop()

loop = asyncio.get_event_loop()

# Schedule the first call to display_date()
end_time = loop.time() + 5.0
loop.call_soon(display_date, end_time, loop)

# Blocking call interrupted by loop.stop()
loop.run_forever()
loop.close()

Event loops

Tasks and coroutines

Coroutines used with asyncio may be implemented using the async def
statement, or by using generators. The async def type of coroutine was added in Python 3.5, and is recommended if there is no need to support older Python versions.

Generator-based coroutines should be decorated with @asyncio.coroutine
, although this is not strictly enforced. The decorator enables compatibility with async def coroutines, and also serves as documentation. Generator-based coroutines use the yield from syntax introduced in PEP 380, instead of the original yield syntax.

The word “coroutine”, like the word “generator”, is used for two different (though related) concepts:

  • The function that defines a coroutine (a function definition using async def
    or decorated with @asyncio.coroutine). If disambiguation is needed we will call this a coroutine function (iscoroutinefunction() returns True).
  • The object obtained by calling a coroutine function. This object represents a computation or an I/O operation (usually a combination) that will complete eventually. If disambiguation is needed we will call it a coroutine object(iscoroutine() returns True).

Example of coroutine displaying "Hello World":

import asyncio

async def hello_world():
    print("Hello World!")

loop = asyncio.get_event_loop()
# Blocking call which returns when the hello_world() coroutine is done
loop.run_until_complete(hello_world())
loop.close()

Example of coroutine displaying the current date every second during 5 seconds using the sleep()
function:

import asyncio
import datetime

async def display_date(loop):
    end_time = loop.time() + 5.0
    while True:
        print(datetime.datetime.now())
        if (loop.time() + 1.0) >= end_time:
            break
        await asyncio.sleep(1)

loop = asyncio.get_event_loop()
# Blocking call which returns when the display_date() coroutine is done
loop.run_until_complete(display_date(loop))
loop.close()

Example chaining coroutines:

import asyncio

async def compute(x, y):
    print("Compute %s + %s ..." % (x, y))
    await asyncio.sleep(1.0)
    return x + y

async def print_sum(x, y):
    result = await compute(x, y)
    print("%s + %s = %s" % (x, y, result))

loop = asyncio.get_event_loop()
loop.run_until_complete(print_sum(1, 2))
loop.close()

compute() is chained to print_sum(): print_sum() coroutine waits until compute() is completed before returning its result.

Sequence diagram of the example:

Paste_Image.png

Example combining a Future
and a coroutine function:

import asyncio

async def slow_operation(future):
    await asyncio.sleep(1)
    future.set_result('Future is done!')

loop = asyncio.get_event_loop()
future = asyncio.Future()
asyncio.ensure_future(slow_operation(future))
loop.run_until_complete(future)
print(future.result())
loop.close()

The coroutine function is responsible for the computation (which takes 1 second) and it stores the result into the future. The run_until_complete()
method waits for the completion of the future.

Example executing 3 tasks (A, B, C) in parallel:

import asyncio

async def factorial(name, number):
    f = 1
    for i in range(2, number+1):
        print("Task %s: Compute factorial(%s)..." % (name, i))
        await asyncio.sleep(1)
        f *= i
    print("Task %s: factorial(%s) = %s" % (name, number, f))

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(
    factorial("A", 2),
    factorial("B", 3),
    factorial("C", 4),
))
loop.close()

Output

Task A: Compute factorial(2)...
Task B: Compute factorial(2)...
Task C: Compute factorial(2)...
Task A: factorial(2) = 2
Task B: Compute factorial(3)...
Task C: Compute factorial(3)...
Task B: factorial(3) = 6
Task C: Compute factorial(4)...
Task C: factorial(4) = 24

参考文献:
文中大部分内容摘自Python并发编程之协程/异步IO
关于生成器摘自What does the “yield” keyword do in Python?

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 211,948评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,371评论 3 385
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 157,490评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,521评论 1 284
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,627评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,842评论 1 290
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,997评论 3 408
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,741评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,203评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,534评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,673评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,339评论 4 330
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,955评论 3 313
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,770评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,000评论 1 266
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,394评论 2 360
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,562评论 2 349

推荐阅读更多精彩内容