Inside Volta: The World’s Most Advanced Data Center GPU

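These notes summarize NVIDIA's official blog post on its new GPU architecture, Volta. The post uses the ResNet-50 deep neural network to compare the Tesla P100 with the Tesla V100; the Tesla P100 is based on the Pascal architecture.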
https://devblogs.nvidia.com/inside-volta/

Key Features of Tesla V100:

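1. Redesigned SM processor architecture optimized for deep learning.
2. Second-generation NVLink.
3. HBM2 memory: a faster memory from Samsung, paired with a new-generation memory controller.
4. New Volta Multi-Process Service (MPS) feature, which improves speed and performance when multiple compute processes share the GPU in parallel.
5. Enhanced Unified Memory and address translation services. On IBM Power platforms, new Address Translation Services (ATS) support allows the GPU to access the CPU's page tables directly.
6. Cooperative Groups and new Cooperative Launch APIs. Cooperative Groups is a new programming model introduced in CUDA 9 for organizing groups of communicating threads. It lets developers express the granularity at which threads communicate, which makes it easier to organize how work is decomposed across threads. It is supported from the Kepler architecture onward. The new Cooperative Launch APIs support synchronization among CUDA thread blocks, and Volta adds support for new synchronization patterns.
7. Maximum Performance and Maximum Efficiency modes.
8. Software optimization: optimized versions of CUDA libraries such as cuDNN, cuBLAS, and TensorRT.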
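
As a rough illustration of the Cooperative Groups model from item 6, the sketch below partitions a thread block into 32-thread tiles and reduces one value per thread within each tile. The kernel, the helper `run_sum`, and the launch configuration are assumptions made for this note, not code from the NVIDIA post; it only uses `cooperative_groups` calls available since CUDA 9.

```cuda
#include <cstdio>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Reduce one value per thread across a 32-thread tile with shuffle-down.
// After the loop, the thread with thread_rank() == 0 holds the tile's sum.
__device__ int tile_sum(cg::thread_block_tile<32> tile, int val)
{
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        val += tile.shfl_down(val, offset);
    return val;
}

__global__ void sum_kernel(const int *in, int *out, int n)
{
    // Every thread of the block reaches tiled_partition (no early return).
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int v = (i < n) ? in[i] : 0;   // out-of-range threads contribute 0
    int s = tile_sum(tile, v);
    if (tile.thread_rank() == 0)
        atomicAdd(out, s);         // one atomic per tile
}

// Hypothetical host helper: sums n ones on the GPU and returns the result.
int run_sum(int n)
{
    int *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;
    *out = 0;
    sum_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    int result = *out;
    cudaFree(in);
    cudaFree(out);
    return result;
}

int main()
{
    printf("sum = %d\n", run_sum(1000));
    return 0;
}
```

The tile object makes the group of cooperating threads explicit in the function signature, which is the point of the model: `tile_sum` works for whichever tile it is handed instead of assuming an implicit warp.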

GV100 Hardware Architecture

GV100 is composed of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and memory controllers.
A full GV100 has 6 GPCs, 84 Volta SMs, 42 TPCs (each including two SMs), and eight 512-bit memory controllers (4096 bits total). Each SM contains 64 FP32 cores, 64 INT32 cores, 32 FP64 cores, 8 Tensor Cores, and 4 texture units.
In total that gives 5376 FP32 cores, 5376 INT32 cores, 2688 FP64 cores, 672 Tensor Cores, and 336 texture units.
Each memory controller is attached to 768 KB of L2 cache, and each pair of memory controllers drives one HBM2 DRAM stack.
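
The totals above follow directly from the per-SM figures. The host-only sketch below re-derives them at compile time; the constant and function names are made up for this note, and the numbers are the ones quoted in the text.

```cuda
#include <cstdio>

// Chip-level figures quoted in the text for a full GV100.
constexpr int kSMs = 84;
constexpr int kTPCs = 42;
constexpr int kSMsPerTPC = 2;
constexpr int kMemControllers = 8;

// Total units on the chip for a resource with `per_sm` units in each SM.
constexpr int gv100_total(int per_sm) { return kSMs * per_sm; }

static_assert(kTPCs * kSMsPerTPC == kSMs, "42 TPCs x 2 SMs = 84 SMs");
static_assert(gv100_total(64) == 5376, "FP32 cores (same count for INT32)");
static_assert(gv100_total(32) == 2688, "FP64 cores");
static_assert(gv100_total(8) == 672, "Tensor Cores");
static_assert(gv100_total(4) == 336, "texture units");
static_assert(kMemControllers * 512 == 4096, "memory interface width in bits");
static_assert(kMemControllers * 768 == 6144, "total L2 cache in KB");

int main()
{
    printf("Tensor Cores: %d, L2 cache: %d KB\n",
           gv100_total(8), kMemControllers * 768);
    return 0;
}
```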

SM (Streaming Multiprocessor)

The GP100 SM is partitioned into two processing blocks, each with 32 FP32 Cores, 16 FP64 Cores, an instruction buffer, one warp scheduler, two dispatch units, and a 128 KB Register File.
The GV100 SM is partitioned into four processing blocks, each with 16 FP32 Cores, 8 FP64 Cores, 16 INT32 Cores, two of the new mixed-precision Tensor Cores for deep learning matrix arithmetic, a new L0 instruction cache, one warp scheduler, one dispatch unit, and a 64 KB Register File.

Independent Thread Scheduling

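The Volta architecture adds independent thread scheduling, first available in GV100. It makes the GPU easier to program and enables finer-grained synchronization and cooperation between parallel threads.
The previous SIMT model:
Pascal and earlier NVIDIA GPUs execute in SIMT fashion, one warp (32 threads) at a time. A warp has a single program counter shared by all 32 threads, combined with an active mask that records which threads of the warp are currently active.
When the threads of a warp diverge, the warp is split into groups, the groups are executed one after another, and once all of them have finished, the threads are merged back together.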

image.png

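The figure above shows thread scheduling under the SIMT warp execution model. Divergent branches within a warp are serialized so that all statements in one side of the branch are executed together to completion before any statements in the other side are executed. After the else statement, the threads of the warp will typically reconverge.
The Pascal SIMT execution model gains efficiency in two ways: it reduces the resources needed to track thread state, and it reconverges threads aggressively to maximize parallelism. However, tracking thread state in aggregate for the whole warp means that when the execution path diverges, the threads which take different branches lose concurrency until they reconverge. This loss of concurrency means that threads of the same warp that are in different divergent regions or in different states of execution cannot exchange data or signal each other. The result is an inconsistency: threads from different warps keep running concurrently, but diverged threads from the same warp run sequentially until they reconverge. For example, algorithms that need fine-grained sharing of data guarded by locks or mutexes can easily deadlock, depending on which warp the contending threads come from. Programmers therefore have to either avoid fine-grained synchronization or rely on lock-free or warp-aware algorithms.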
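
To make the lock-free alternative concrete, here is a minimal sketch of a lock-free push onto a singly linked list using `atomicCAS`. No thread ever waits for another thread to release anything, so the pattern does not depend on how diverged threads of a warp are scheduled. The `Node` layout, `push`, `push_all`, and `run_push` are hypothetical names for this note, not code from the NVIDIA post.

```cuda
#include <cstdio>

struct Node { int value; Node *next; };

static_assert(sizeof(Node *) == sizeof(unsigned long long),
              "this sketch assumes 64-bit pointers");

// Lock-free push: retry the compare-and-swap until the head did not change
// between reading it and replacing it.
__device__ void push(Node **head, Node *n)
{
    unsigned long long *addr = reinterpret_cast<unsigned long long *>(head);
    unsigned long long old = *addr;
    unsigned long long assumed;
    do {
        assumed = old;
        n->next = reinterpret_cast<Node *>(assumed);
        __threadfence();   // publish n->next before n becomes reachable
        old = atomicCAS(addr, assumed,
                        reinterpret_cast<unsigned long long>(n));
    } while (old != assumed);
}

__global__ void push_all(Node **head, Node *nodes, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        nodes[i].value = i;
        push(head, &nodes[i]);
    }
}

// Hypothetical host helper: pushes n nodes and returns the list length.
int run_push(int n)
{
    Node **head = nullptr;
    Node *nodes = nullptr;
    cudaMallocManaged(&head, sizeof(Node *));
    cudaMallocManaged(&nodes, n * sizeof(Node));
    *head = nullptr;
    push_all<<<(n + 127) / 128, 128>>>(head, nodes, n);
    cudaDeviceSynchronize();
    int count = 0;
    for (Node *p = *head; p != nullptr; p = p->next) ++count;
    cudaFree(nodes);
    cudaFree(head);
    return count;
}

int main()
{
    printf("nodes in list: %d\n", run_push(1000));
    return 0;
}
```

The cost of this style is that every data structure needs its own carefully designed retry loop, which is the burden that independent thread scheduling is meant to reduce.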

Volta SIMT Model

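Volta maintains execution state per thread, including a program counter and a call stack, so that all threads get equal concurrency regardless of which warp they belong to.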

image.png

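Each thread therefore keeps its own program counter (PC) and call stack (S), instead of one PC and S per warp as before.
Volta's independent thread scheduling lets the GPU yield execution of any thread, for example to allow one thread to wait for data produced by another. It also includes a schedule optimizer that determines how to group active threads from the same warp together into SIMT units to improve parallel efficiency. This keeps the high throughput of SIMT execution while adding flexibility: threads can now diverge and reconverge at sub-warp granularity, and threads that execute the same code are still grouped together and run in parallel.
The if-else example above is therefore executed at a finer granularity, with statements from the two branches interleaved. Execution is still SIMT (at any given clock cycle CUDA cores execute the same instruction for all active threads in a warp just as before), so it keeps the execution efficiency of previous architectures. More importantly, Volta's independent thread scheduling lets complex, fine-grained algorithms and data structures be implemented in a more natural way. The scheduler also optimizes non-synchronizing code to keep threads as converged as possible for maximum SIMT efficiency.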

image.png

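The figure above assumes that statement Z does not depend on the results of A, B, X, or Y. If it does, the threads have to be synchronized first: in CUDA 9, a program can call the warp synchronization function __syncwarp() to force the warp to reconverge.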
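
A minimal sketch of that explicit reconvergence: a single warp takes a divergent if-else, then calls `__syncwarp()` before a statement that reads data written by other lanes. The kernel, the shared buffer, and the helper `z_result` are made up for illustration; the A/B/X/Y/Z labels in the comments mirror the figure.

```cuda
#include <cstdio>

// Launched with exactly one warp (32 threads).
__global__ void diverge_then_sync(int *out)
{
    __shared__ int buf[32];
    int lane = threadIdx.x;

    if (lane < 4) {
        buf[lane] = lane * 10;     // stands in for statements A; B
    } else {
        buf[lane] = lane + 100;    // stands in for statements X; Y
    }

    // Reconverge the warp and order the writes above before the reads below.
    __syncwarp();

    // Statement Z: each lane reads what its neighbour wrote,
    // possibly on the other side of the branch.
    out[lane] = buf[(lane + 1) % 32];
}

// Hypothetical host helper: runs the kernel and returns lane `lane`'s result.
int z_result(int lane)
{
    int *out = nullptr;
    cudaMallocManaged(&out, 32 * sizeof(int));
    diverge_then_sync<<<1, 32>>>(out);
    cudaDeviceSynchronize();
    int value = out[lane];
    cudaFree(out);
    return value;
}

int main()
{
    printf("lane 3 read %d\n", z_result(3));   // lane 4 wrote 4 + 100
    return 0;
}
```

Without the `__syncwarp()`, lane 3 could reach Z before lane 4 has executed its side of the branch, since diverged threads are no longer guaranteed to run in lockstep.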

image.png

Starvation-Free Algorithms

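Starvation-free algorithms are a key pattern enabled by independent thread scheduling. These are concurrent algorithms that are guaranteed to execute correctly as long as the system ensures that all threads have adequate access to a contended resource. For example, a mutex may be used in a starvation-free algorithm if a thread that tries to acquire the mutex is guaranteed to succeed eventually.
A simplified example that relies on Volta's independent thread scheduling: inserting nodes into a doubly linked list in a multithreaded application.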

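__device__ void insert_after(Node *a, Node *b)
{
    Node *c;
    // Lock the two nodes whose pointers are about to change.
    lock(a); lock(a->next);
    c = a->next;

    // Link b in after a.
    a->next = b;
    b->prev = a;

    // Link b in before c.
    b->next = c;
    c->prev = b;

    // Release both locks.
    unlock(c); unlock(a);
}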

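In this example, each element of the doubly linked list has at least three components: a next pointer, a previous pointer, and a lock that gives its owner exclusive access to update the node. Figure 14 shows the insertion of node B after node A, with the next and previous pointers of A and C being updated.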

image.png

Independent thread scheduling in Volta ensures that even if a thread T0 currently holds the lock for node A, another thread T1 in the same warp can successfully wait for the lock to become available without impeding the progress of thread T0. Note, however, that because active threads in a warp execute together, threads spinning on a lock may degrade the performance of the thread holding the lock.
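
The post does not show how `lock` and `unlock` are implemented. The sketch below fills them in with a simple `atomicCAS` spin lock so the insertion can be tried end to end; the `Node` layout, the sentinel head and tail nodes, and the helper `run_insert` are assumptions for this note. It also assumes a GPU of compute capability 7.0 or higher and compilation for that target (for example `nvcc -arch=sm_70`); on earlier architectures threads of the same warp spinning on the lock may hang.

```cuda
#include <cstdio>

// Hypothetical node layout: lock is 0 when free, 1 when held.
struct Node { Node *prev; Node *next; int lock; };

__device__ void lock(Node *n)
{
    while (atomicCAS(&n->lock, 0, 1) != 0) { /* spin until the lock is free */ }
    __threadfence();   // read protected fields only after the lock is held
}

__device__ void unlock(Node *n)
{
    __threadfence();   // make pointer updates visible before releasing
    atomicExch(&n->lock, 0);
}

__device__ void insert_after(Node *a, Node *b)
{
    Node *c;
    lock(a); lock(a->next);
    c = a->next;

    a->next = b;
    b->prev = a;

    b->next = c;
    c->prev = b;

    unlock(c); unlock(a);
}

// One block: every thread inserts its own node right after the head,
// so threads of the same warp contend for the same lock.
__global__ void insert_all(Node *head, Node *nodes)
{
    insert_after(head, &nodes[threadIdx.x]);
}

// Hypothetical host helper (n <= 1024): returns the number of nodes found
// between the sentinels, or -1 on an unsupported GPU or a broken back link.
int run_insert(int n)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (prop.major < 7) return -1;   // needs independent thread scheduling

    Node *list = nullptr;   // list[0] = head sentinel, list[1] = tail sentinel
    cudaMallocManaged(&list, (n + 2) * sizeof(Node));
    for (int i = 0; i < n + 2; ++i) list[i] = Node{nullptr, nullptr, 0};
    list[0].next = &list[1];
    list[1].prev = &list[0];

    insert_all<<<1, n>>>(&list[0], list + 2);
    cudaDeviceSynchronize();

    int count = 0;
    for (Node *p = list[0].next; p != &list[1]; p = p->next) {
        if (p->next->prev != p) { count = -1; break; }
        ++count;
    }
    cudaFree(list);
    return count;
}

int main()
{
    printf("inserted nodes: %d\n", run_insert(64));
    return 0;
}
```

The tail sentinel guarantees that `a->next` is never null, which the original `insert_after` silently assumes. Because every thread inserts after the same head node, this demo is the worst case for contention; it is meant to exercise intra-warp waiting, not to show good performance.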

It is also important to note that the use of a per-node lock in the above example is critical for performance on the GPU. Traditional doubly-linked list implementations may use a coarse-grained lock that provides exclusive access to the entire structure, rather than separately protecting individual nodes. This approach typically leads to poor performance in applications with many threads—Volta may have up to 163,840 concurrent threads—caused by extremely high contention for the lock. By using a fine-grained lock on each node, the average per-node contention in large lists will usually be low except under certain pathological node insertion patterns.

This doubly-linked list with fine-grained locks is a simple example, but it demonstrates how independent thread scheduling gives developers the capability to implement familiar algorithms and data structures on the GPU in a natural way.