[论文笔记] IOS: Inter-Operator Scheduler for CNN Acceleration

IOS: Inter-Operator Scheduler for CNN Acceleration
Proceedings of the 4 th MLSys Conference, San Jose, CA, USA, 2021.
Arxiv | Github | Website

主要内容

随着计算能力的提升，卷积神经网络（Convolutional neural network，CNN）的顺序执行（sequential execution）无法为了更加充分地利用计算资源而提供更多的并行化机会。在该篇论文中作者提出了一种从全局角度对深度学习模型计算图进行算子调度的算法：IOS（Inter-Operator Scheduler）。通过将算子间并行和算子内并行结合，并且采用动态规划算法查找更高效的调度策略，IOS算法可以帮助CNN实现更高的设备利用率。实验表明，相比于TensorRT，IOS可以使CNN的推理性能提升1.1倍到1.5倍。

现有算法及其存在的问题

MetaFlow^[1], fuses multiple operators matching a specific pattern into a larger operator to increase operator granularity.
Greedy Schedule^[2]，directly executes all available CNN operators on CPU tomaximize resource utilization. 存在的问题：
1. puts more operators in the early stages and fewer operators in subsequent stages, resulting in low utilization in later stages.
2. executing too many operators on the device concurrently may lead to resource contention problem that hurts the performance.
获得最优并行化 CNN 模型调度策略所面临的挑战：
1. 调度策略的可选个数随着算子数量的增加呈指数式增长；
2. 最优调度策略依赖于具体硬件设备和运行参数（如：batch size）

算法详解

计算图递归分割

The illustration of ending

将计算图 $G = (V;E)$ 分割为 $V-S'$ 和 $S'$ ， $V-S'$ 和 $S'$ 之间的边都是从 $V-S'$ 开始到 $S'$ 为止。 $S'$ 被称为 $V$ 的终点（ $ending$ ）。
在 $V$ 的最优调度策略中的最后一个阶段（ $stage$ ）所包含的算子，必定是 $V$ 的终点。
通过枚举 $V$ 的终点，将查找 $V$ 的最优调度策略转换为查找 $V-S'$ 的最优调度策略。
由此通过对计算图 $G$ 进行递归分割，完成整图的调度

定义代价函数

$cost[S] = \underset{S'}{min} (cost[S-S'] + stage\_latency[S'])$

其中： $S'$ 为 $S$ 的终点， $cost[\varnothing] = 0$ 。由此， $cost[V]$ 是全图最优调度策略的得到的模型推理耗时（latency）。

伪代码实现

Algorithm 1 Inter-Operator Scheduler (IOS)

IOS算法的实现由三个函数组成

INTEROPERATORSCHEDULER InterOperatorScheduler
takes a computation graph as an input and returns the optimal schedule found by IOS
SCHEDULER Scheduler
a recursive function implementing the dynamic programming algorithm to find the optimal schedule for a subset of operators in $G$
GENERATESTAGE GeneratorStage
chooses a better parallelization strategy for given operators $S'$

算法时间复杂度

（略）

实验

不同调度策略间的比较

End-to-end performance comparison

IOS-Merge schedule
only takes the “operator merge” strategy
IOS-Parallel schedule
only takes the “concurrent execution” strategy
IOS-Both schedule
considers both parallelization strategies
sequential schedule
executes the operator one-by-one according to certain topological ordering
greedy schedule
puts all the operators that can be executed currently in one stage, and repeats this process until all operators have been scheduled.

基于cuDNN框架的比较

End-to-end performance comparison

Figure 7 shows that IOS consistently outperforms all five baseline frameworks on four benchmark CNNs. IOS can achieve 1:1 to 1:5× speedup comparing to the state of the art library TASO, TVM-cuDNN, and TensorRT

更多的激活线程束提升设备利用率

激活线程束比较

算子计算本质上是执行GPU核函数。一个核函数调用一组线程，然后将其分组为多个线程块，并将每个线程块分发到流多处理器（stream multiprocessors, SMs）。SM上的线程块可以进一步划分为多个线程束，线程束作为最基本的执行单元，包含固定的线程个数，去执行单指令多线程方法。

当一个线程束从SM上开始执行到执行完成最后一个指令为止，可以认为其处于激活（active）状态。SM可以通过快速上下文切换来隐藏内存访问导致的线程束停顿：在每个循环中，每个线程束调度器会选择符合条件的线程束并且发射指令。如果此时没有合适线程束可调用，则导致了计算资源处于闲置状态。

增加激活线程束是提升每个周期具有符合条件线程束的几率的有效方法，因此，增加激活线程束至关重要。

Figure 8 shows the number of active warps on the whole GPU throughout the repeated execution of both the IOS and the Sequential schedule, sampled using NVIDIA’s CUPTI profiling toolset every 2.1 ms. IOS schedule achieves 58% more active warps on average compared to the Sequential schedule.

控制变量研究

通过调度调优减少搜索时间

Trade-off between the optimized latency and the optimization cost for Inception V3 and NasNet.

IOS可以通过调整调度策略参数，对优化后推理耗时和优化耗时之间调整。其中，参数 $r$ 为 the maximum number of operators in each group， $s$ 为 the maximum number of groups in a stage。Figure 9 展示了不同策略参数对优化后推理耗时和优化耗时的影响。

专门调度有益于性能提升

Latency (ms) of specialized schedules for batch size 1, 32 and 128, and specialized schedules for NVIDIA Tesla K80 and V100

不同的工作负载（比如：具有不同批量的网络模型）具有不同的计算特点。因此不同的工作负载需要确定不同的调度策略。由Table 3 (1) 可知，针对不同的批量大小选择对应的批量大小运行可以得到最佳的网络性能。

同时，针对不同的运行设备，同样需要确定不同的调度策略。由Table 3 (2) 可知，针对特定设备确定专门策略可以实现更好的网络性能。

The schedule found by IOS for the last block of Inception V3

由Figure 10 可知，同一个网络结构针对不同的批量大小选择不同的调度策略。

There are two stages in the schedule (1), which is optimized for batch size 1 while there are four stages in the schedule (2), which is optimized for batch size 32. The schedule (1) is 28% faster than the schedule (2) on batch size 1, while the schedule (2) is 8% faster than (1) on batch size 32. There are two differences between them. The first one is that convolution f and g in the schedule (2) are merged into a single convolution. This is because activation (the output tensor of an operator) is the memory bottleneck at large batch size. It is more crucial to reduce memory access, even at the cost of larger computation cost. Merging can reduce the memory access, because the merged kernel only access the output of convolution c once, instead of twice in the schedule (1). However, because the kernel size of f and g are 3x1 and 1x3, respectively, their kernel size would be expanded to 3x3 by padding zeros, which increases the amount of computation. Another difference between the schedule (1) and (2) is that the schedule (2) has more stages than the schedule (1). We found a similar phenomenon for large batch sizes because of resource contention. When multiple operators are executed on the device, there is a conflict over access to the shared resources such as the lastlevel cache, making the concurrent execution degrades the performance. This gets more severe for larger batch sizes because the demand for shared resources gets larger

针对不同批量大小的持续提升

The throughput comparison

在实际的应用中，需要不同的批量大小用模型推理。由 Figure 11 可知，模型吞吐量（throughput）随着批量的增加而增加，当批量大小大于128时，模型性能达到饱和，模型吞吐量不再随着批量的增加而增加。同时，相比于其他比较基线，IOS的模型吞吐量在所有批量大小都能达最优

算子内和算子间的并行

End-to-end performance comparison

TVM exploits the intra-operator parallelism by searching the schedule for each kernel on a specific device. IOS focuses on inter-operator parallelism and leaves the exploitation of intra-operator parallelism to cuDNN library

算子内并行和算子间并行并没有什么必然的联系，而且可以同时使用。由 Figure 12 可知，IOS 在部分网络的性能上优于 TVM，这是因为对于性能足够强的计算设备仅使用算子内并行无法提供足够的并行性。再另一部分网络模型，TVM则优于IOS，这是因为TVM找到了可分离卷积更高效的核函数实现，在这些网络模型的算子中可分离卷积占据了较大的比例。

Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing dnn computation with relaxed graph substitutions. In Talwalkar, A., Smith, V., and Zaharia, M. (eds.), Proceedings of Machine Learning and Systems, volume 1, pp. 27–39. 2019b. ↩
Tang, L., Wang, Y., Willke, T. L., and Li, K. Scheduling computation graphs of deep learning models on manycore cpus. arXiv preprint arXiv:1807.09667, 2018. ↩

最后编辑于：2022.03.27 16:34:40

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 217,826评论 6赞 506
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 92,968评论 3赞 395
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 164,234评论 0赞 354
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 58,562评论 1赞 293
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 67,611评论 6赞 392
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 51,482评论 1赞 302
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 40,271评论 3赞 418
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 39,166评论 0赞 276
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 45,608评论 1赞 314
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 37,814评论 3赞 336
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 39,926评论 1赞 348
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 35,644评论 5赞 346
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 41,249评论 3赞 329
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 31,866评论 0赞 22
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 32,991评论 1赞 269
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 48,063评论 3赞 370
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 44,871评论 2赞 354

[论文笔记] IOS: Inter-Operator Scheduler for CNN Acceleration

主要内容

算法详解

实验

控制变量研究

推荐阅读更多精彩内容