1. TensorRT

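TensorRT is a high-performance deep learning inference library. It targets inference in real production environments, not training.
Two key metrics: power efficiency and fast response.
Both directly affect user experience and cost.
Where TensorRT sits in the deep learning workflow: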

tensorRT in deeplearning.png

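In a real runtime environment, TensorRT automatically optimizes the neural network to deliver better performance.
The figure below compares GPU and CPU inference under the same power conditions: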

TensorRT1_Efficiency-1.png

1.1 The two phases of a DNN: training and inference

Solving a supervised machine learning problem with deep neural networks involves a two-step process.

1. The first step is to train a deep neural network on massive amounts of labeled data using GPUs. During this step, the neural network learns millions of weights or parameters that enable it to map input data examples to correct responses. Training requires iterative forward and backward passes through the network as the objective function is minimized with respect to the network weights. Often several models are trained and accuracy is validated against data not seen during training in order to estimate real-world performance.
In short (training): a deep network is fitted to a large labeled dataset, learning millions of weights through repeated forward and backward passes that reduce the objective function. Several networks are often trained so that the one that predicts real-world data best can be chosen.
2. The next step, inference, uses the trained model to make predictions from new data. During this step, the best trained model is used in an application running in a production environment such as a data center, an automobile, or an embedded platform. For some applications, such as autonomous driving, inference is done in real time and therefore high throughput is critical.
In short (inference): the best trained model is used to predict on new data in the target environment (data center, smartphone, embedded platform). Latency-critical applications such as autonomous driving need both high throughput and low latency at inference time.

1.2 Inference Versus Training

train compare to inference.png

Both DNN training and Inference start out with the same forward propagation calculation, but training goes further. As Figure 1 illustrates, after forward propagation, the results from the forward propagation are compared against the (known) correct answer to compute an error value. A backward propagation phase propagates the error back through the network’s layers and updates their weights using gradient descent in order to improve the network’s performance at the task it is trying to learn. It is common to batch hundreds of training inputs (for example, images in an image classification network or spectrograms for speech recognition) and operate on them simultaneously during DNN training in order to prevent overfitting and, more importantly, amortize loading weights from GPU memory across many inputs, increasing computational efficiency.
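In short: training and inference share the same forward pass, but training goes further. After the forward pass, the result is compared with the known correct answer to compute an error; backpropagation sends that error back through the layers, and gradient descent updates the weights to improve the network. Training commonly batches hundreds of inputs (images for a classification network, spectrograms for speech recognition) and processes them together, which helps prevent overfitting and, more importantly, amortizes the cost of loading weights from GPU memory across many inputs.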
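
The difference between the two passes can be sketched with a toy model. The one-weight network `y = w * x`, the squared-error loss, and the names `TinyNet`, `forward`, and `trainStep` are all assumptions made for illustration; nothing here comes from a real framework.

```cpp
#include <cassert>
#include <cmath>

// Toy one-weight "network": y = w * x. Purely illustrative.
struct TinyNet {
    double w;
};

// Inference: a forward pass only.
double forward(const TinyNet& net, double x) {
    return net.w * x;
}

// One training step: forward pass, error against the known answer,
// backward pass (gradient of 0.5 * error^2 with respect to w), and a
// gradient-descent update. Returns the error measured before the update.
double trainStep(TinyNet& net, double x, double target, double learningRate) {
    assert(learningRate > 0.0);
    const double y = forward(net, x);      // same computation as inference
    const double error = y - target;       // compare with the correct answer
    const double grad = error * x;         // d(0.5 * error^2) / dw
    net.w -= learningRate * grad;          // gradient-descent weight update
    return error;
}
```

Calling `forward` alone is all that inference needs; `trainStep` does the same forward computation and then the extra error and update work that makes training more expensive per input.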
For inference, the performance goals are different. To minimize the network’s end-to-end response time, inference typically batches a smaller number of inputs than training, as services relying on inference to work (for example, a cloud-based image-processing pipeline) are required to be as responsive as possible so users do not have to wait several seconds while the system is accumulating images for a large batch. In general, we might say that the per-image workload for training is higher than for inference, and while high throughput is the only thing that counts during training, latency becomes important for inference as well.
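In short: inference has different goals. To cut end-to-end response time, it uses much smaller batches than training, because a service that depends on inference (for example, a cloud image-processing pipeline) must respond quickly rather than make users wait while a large batch fills up. The per-image workload is higher in training than in inference; training cares only about throughput, whereas for inference latency matters as well.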
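
A rough way to see why large batches hurt responsiveness is to estimate how long the first request waits for its batch to fill. The sketch below assumes requests arrive at a steady rate and ignores compute time; the function name and the example rates are hypothetical.

```cpp
#include <cassert>

// Worst-case time (in milliseconds) that the first request of a batch
// waits for the batch to fill, assuming a steady arrival rate.
// Compute time is deliberately ignored.
double batchFillWaitMs(int batchSize, double arrivalsPerSecond) {
    assert(batchSize >= 1 && arrivalsPerSecond > 0.0);
    // The first request waits for the remaining (batchSize - 1) arrivals.
    return (batchSize - 1) / arrivalsPerSecond * 1000.0;
}
```

With a hypothetical 100 requests per second, a batch of 128 makes the first request wait about 1270 ms before any computation starts, while a batch of 1 adds no waiting at all.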

1.3 Inference on GPU vs. CPU

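The experiments use two classic neural network architectures:
AlexNet (winner of ImageNet ILSVRC 2012)
GoogLeNet (winner of ImageNet ILSVRC 2014), which is much deeper and more complex than AlexNet
jetson_tx1_whitepaper.pdf considers two cases for each network: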

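Case 1: input images may be batched. This mainly models cloud inference, where many users upload images continuously and the extra latency of packing inputs into a batch is acceptable. The experiments use a batch size of 48 on the CPU and 128 on the GPU.
Case 2: no batching (extremely latency-sensitive), batch size = 1.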

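Four devices: NVIDIA Tegra X1 (TX1) vs. Intel Core i7 6700K, and NVIDIA Titan X vs. a 16-core Intel Xeon E5.
GPU framework: Caffe with cuDNN.
The Intel CPUs run optimized CPU inference code (the Intel Deep Learning Framework, which supports only the CaffeNet architecture, similar to AlexNet, with batch sizes from 1 to 48).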

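The TX1 runs inference at two floating-point precisions: 16-bit and 32-bit.
Tegra X1 raises FP16 arithmetic throughput, and the newer cuDNN release adds support for FP16 arithmetic. Together they improve performance significantly without any loss of classification accuracy.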

AlexNet tx1 vs i6700.png

AlexNet Titan X vs E5.png

GoogleNet Titan X vs E5.png

Conclusions of the GPU vs. CPU comparison:

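1. The TX1 with FP16 is far more power-efficient at inference than the CPU:
Tegra X1 in FP16: 45 img/sec/W, compared with 3.9 img/sec/W on the Core i7 6700K
Absolute performance: 258 img/sec on Tegra X1 in FP16, compared with 242 img/sec on the Core i7
2. Titan X vs. Xeon E5 gives a similar result:
With large batch sizes, Titan X delivers better performance while consuming less energy: 3000 images/second vs. 500 images/second.
With large batch sizes Titan X outperforms the Xeon E5; even with no batching, the TX1 and Titan X achieve better performance per watt (thanks to its 12 GB framebuffer, Titan X does especially well with FFT-based convolution algorithms, which need a lot of memory).
3. The whitepaper draws one more conclusion: beyond the optimizations added to the Caffe deep learning framework, most of the new cuDNN's inference optimizations target the convolution algorithms (splitting the work of small batches across multiple multiprocessors to improve small-batch performance on the GPU). The new cuDNN also adds FP16 support for convolutions. FP16 arithmetic can reach twice the performance of FP32 and, like FP16 storage, does not reduce accuracy relative to FP32 inference.
4. Part of the GPU gain is also due to the Caffe framework, which lets inference use cuBLAS GEMV (matrix-vector multiplication) in place of GEMM (matrix-matrix multiplication).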
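
The reason GEMV can stand in for GEMM at batch size 1 is that a matrix-matrix product whose right-hand operand has a single column is just a matrix-vector product. The sketch below shows this with plain loops; it does not call cuBLAS, and the `Matrix` alias and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // row-major: rows x cols

// GEMM: C = A * B, with A of shape (m x k) and B of shape (k x n).
Matrix gemm(const Matrix& a, const Matrix& b) {
    assert(!a.empty() && !b.empty() && a[0].size() == b.size());
    const std::size_t m = a.size(), k = b.size(), n = b[0].size();
    Matrix c(m, std::vector<float>(n, 0.0f));
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t p = 0; p < k; ++p)
                c[i][j] += a[i][p] * b[p][j];
    return c;
}

// GEMV: y = A * x, with A of shape (m x k) and x of length k.
std::vector<float> gemv(const Matrix& a, const std::vector<float>& x) {
    assert(!a.empty() && a[0].size() == x.size());
    std::vector<float> y(a.size(), 0.0f);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t p = 0; p < x.size(); ++p)
            y[i] += a[i][p] * x[p];
    return y;
}
```

For a fully connected layer with weights `W`, a batch of `n` inputs forms a `k x n` matrix and needs `gemm`; a single input is one column, so `gemv(W, x)` gives the same numbers as the first column of `gemm(W, X)` without the bookkeeping for the extra dimension.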

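The environment where a trained model is deployed can differ greatly from the training environment. When the target is an embedded device, for example, inference comes with strict response-time and power requirements.
Key metric: efficiency, that is, inference performance per watt.
Efficiency is also a critical metric for large data centers. Latency, floor space, and cooling must be considered as well, since all of them affect the performance that can actually be achieved.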
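
As a small worked example of the performance-per-watt metric, the helpers below apply it to the AlexNet figures quoted earlier (258 img/sec at 45 img/sec/W for Tegra X1 in FP16, 242 img/sec at 3.9 img/sec/W for the Core i7). The function names are hypothetical, and the implied power draw is a derived estimate, not a number reported in the source; it is only meaningful if both figures come from the same run.

```cpp
#include <cassert>

// Efficiency metric used in the comparison: images per second per watt.
double imagesPerSecondPerWatt(double imagesPerSecond, double watts) {
    assert(watts > 0.0);
    return imagesPerSecond / watts;
}

// Power draw implied by a reported throughput and efficiency.
// This is a back-of-the-envelope estimate, not a measurement.
double impliedWatts(double imagesPerSecond, double efficiency) {
    assert(efficiency > 0.0);
    return imagesPerSecond / efficiency;
}
```

Under that assumption, `impliedWatts(258.0, 45.0)` is roughly 5.7 W and `impliedWatts(242.0, 3.9)` is roughly 62 W: similar absolute throughput at about a tenth of the power.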

1.4 TensorRT build and deployment

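TensorRT is a high-performance inference engine designed to maximize inference throughput and efficiency for applications such as image classification, segmentation, and object detection. It optimizes a trained neural network for the target scenario (web, mobile, embedded, or autonomous driving) to get the best performance from GPU-accelerated inference.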

tensorRT-two function.png

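TensorRT has two main functions:
- 1 Optimizing the trained network
- 2 Target runtime
Using TensorRT takes two steps: build and deployment.
Build: optimizes the network configuration and generates an optimized plan for the forward pass. The plan is optimized object code that can be serialized and stored in memory or on disk.
Deployment: usually takes the form of a long-running service or user application that accepts batches of input data and returns outputs (classification, object detection, and so on). With TensorRT there is no need to install or run any other deep learning framework on the deployment hardware.
Other aspects of an inference service, such as batching and pipelining, are not discussed here; the focus is on using TensorRT for inference.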

1.4.1 Build

Build phase: deploying a classification network requires three files

A network architecture file (deploy.prototxt),
Trained weights (net.caffemodel), and
A label file to provide a name for each output class.

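In addition, the batch size and the output layer must be defined. The listing below shows the steps for converting a Caffe model into a TensorRT object; lines 3-5 of the listing read the network information. If no network architecture file (deploy.prototxt) is available, the network can be defined through the builder API instead.
Converting a Caffe model into a TensorRT object: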

IBuilder* builder = createInferBuilder(gLogger);
// parse the caffe model to populate the network, then set the outputs
INetworkDefinition* network = builder->createNetwork();
CaffeParser parser;
auto blob_name_to_tensor = parser.parse("deploy.prototxt",
                                        trained_file.c_str(),
                                        *network,
                                        DataType::kFLOAT); 
// specify which tensors are outputs
network->markOutput(*blob_name_to_tensor->find("prob"));
// Build the engine
builder->setMaxBatchSize(1);
builder->setMaxWorkspaceSize(1 << 30); 
ICudaEngine* engine = builder->buildCudaEngine(*network);

Layer types supported by TensorRT:

Convolution: 2D
Activation: ReLU, tanh and sigmoid
Pooling: max and average
ElementWise: sum, product or max of two tensors
LRN: cross-channel only
Fully-connected: with or without bias
SoftMax: cross-channel only
Deconvolution

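Without the Caffe parser, the network can be defined with the TensorRT C++ API. The API can define any of the supported layers listed above together with their parameters, including the ones that vary between networks, such as the weight dimensions of a convolution layer or the window size and stride of a pooling layer.
Defining a network with the TensorRT C++ API: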

ITensor* in = network->addInput("input", DataType::kFLOAT, Dims3{…});
IPoolingLayer* pool = network->addPooling(*in, PoolingType::kMAX, …);

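Steps after the network is defined and loaded:

The output tensors must be marked explicitly; see network->markOutput in the Caffe conversion listing. The example uses "prob" (for probability).
The batch size is set with builder->setMaxBatchSize and can be changed to suit the deployment environment (application requirements and system configuration).
TensorRT performs layer optimizations to reduce inference time. This is transparent to the API user, but analyzing the layers needs memory, so the amount of memory available must be specified with builder->setMaxWorkspaceSize.
buildCudaEngine performs the layer optimizations and builds the optimized engine from the supplied inputs and parameters. Once the model has been converted into a TensorRT object, it can be stored on the host device or used elsewhere.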

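TensorRT applies several important transformations and optimizations to the network. First, layers whose outputs are not used are removed to avoid unnecessary computation. Then, where possible, convolution, bias, and ReLU layers are fused into a single layer. Fusion comes in two forms, vertical and horizontal: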
Vertical Layer Fusion

network_optimization-1.png

For the network shown above, the result of vertical layer fusion is shown below; the fused layers are labeled CBR.
Layer fusion makes the TensorRT-optimized network more efficient.

network_vertical_fusion.png
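
To make the idea of a fused CBR layer concrete, the sketch below computes convolution, bias, and ReLU first as three separate passes and then as one fused pass. It is an illustration of the concept only, not TensorRT's implementation: the convolution is reduced to a 1×1 convolution at a single pixel, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;  // weights[outChannel][inChannel]

// Unfused: three separate layers, each writing an intermediate tensor.
Vec convBiasReluUnfused(const Mat& weights, const Vec& bias, const Vec& input) {
    assert(weights.size() == bias.size());
    Vec conv(weights.size(), 0.0f);
    for (std::size_t o = 0; o < weights.size(); ++o) {
        assert(weights[o].size() == input.size());
        for (std::size_t i = 0; i < input.size(); ++i)
            conv[o] += weights[o][i] * input[i];   // convolution layer
    }
    Vec biased(conv.size());
    for (std::size_t o = 0; o < conv.size(); ++o)
        biased[o] = conv[o] + bias[o];             // bias layer
    Vec out(biased.size());
    for (std::size_t o = 0; o < biased.size(); ++o)
        out[o] = std::max(biased[o], 0.0f);        // ReLU layer
    return out;
}

// Fused "CBR": the same arithmetic in one pass, with no intermediate tensors.
Vec convBiasReluFused(const Mat& weights, const Vec& bias, const Vec& input) {
    assert(weights.size() == bias.size());
    Vec out(weights.size());
    for (std::size_t o = 0; o < weights.size(); ++o) {
        assert(weights[o].size() == input.size());
        float acc = 0.0f;
        for (std::size_t i = 0; i < input.size(); ++i)
            acc += weights[o][i] * input[i];
        out[o] = std::max(acc + bias[o], 0.0f);
    }
    return out;
}
```

Both functions return the same values; the fused version simply avoids writing and re-reading the two intermediate tensors, which on a GPU corresponds to fewer kernel launches and less memory traffic.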

Horizontal Layer Fusion
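Also called layer aggregation. Horizontal layer fusion combines layers that take the same source tensor and apply the same operation with similar parameters into a single larger layer, which is more computationally efficient. In the figure below, three 1×1 CBR layers are combined into one larger CBR layer. Note that the output of the combined layer must be split and routed to the consumers of the original CBR layers' outputs.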

network_horizontal_fusion.png
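
Horizontal fusion can be sketched in the same simplified setting: sibling 1×1 CBR layers that read the same input are stacked into one wider layer, run once, and the wide output is split back into per-layer slices. The `CbrLayer` struct and the helper names are hypothetical, and the single-pixel 1×1 convolution is a simplifying assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Tensor = std::vector<float>;
using Weights = std::vector<Tensor>;  // [outChannel][inChannel]

struct CbrLayer {
    Weights weights;
    Tensor bias;
};

// One 1x1 convolution + bias + ReLU applied at a single pixel.
Tensor runCbr(const CbrLayer& layer, const Tensor& input) {
    assert(layer.weights.size() == layer.bias.size());
    Tensor out(layer.weights.size());
    for (std::size_t o = 0; o < out.size(); ++o) {
        assert(layer.weights[o].size() == input.size());
        float acc = layer.bias[o];
        for (std::size_t i = 0; i < input.size(); ++i)
            acc += layer.weights[o][i] * input[i];
        out[o] = std::max(acc, 0.0f);
    }
    return out;
}

// Horizontal fusion: stack the siblings' output channels into one wider layer.
CbrLayer fuseSiblings(const std::vector<CbrLayer>& siblings) {
    CbrLayer fused;
    for (const CbrLayer& s : siblings) {
        fused.weights.insert(fused.weights.end(), s.weights.begin(), s.weights.end());
        fused.bias.insert(fused.bias.end(), s.bias.begin(), s.bias.end());
    }
    return fused;
}

// Split the wide output back into one slice per original layer.
std::vector<Tensor> splitOutputs(const Tensor& wide,
                                 const std::vector<CbrLayer>& siblings) {
    std::vector<Tensor> parts;
    std::size_t offset = 0;
    for (const CbrLayer& s : siblings) {
        parts.emplace_back(wide.begin() + offset,
                           wide.begin() + offset + s.bias.size());
        offset += s.bias.size();
    }
    assert(offset == wide.size());
    return parts;
}
```

Running `runCbr(fuseSiblings(siblings), input)` once and passing the result to `splitOutputs` yields the same tensors as running each sibling separately, which is the splitting step the note above refers to.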

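TensorRT applies these transformations during the build phase, after the TensorRT parser has read the trained network and its configuration file, as in the Caffe conversion listing earlier in this section.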
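
The first build-time transformation mentioned in this section, dropping layers whose outputs are never used, amounts to a backward reachability sweep from the marked outputs. The sketch below is a generic illustration under stated assumptions (each layer produces one tensor named after itself, and layers are listed in topological order); the `Layer` struct and `pruneUnusedLayers` are hypothetical and say nothing about how TensorRT does this internally.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct Layer {
    std::string name;                 // also the name of the tensor it produces
    std::vector<std::string> inputs;  // names of the tensors it consumes
};

// Keep only layers that contribute to a marked output; original order is kept.
std::vector<Layer> pruneUnusedLayers(const std::vector<Layer>& layers,
                                     const std::vector<std::string>& markedOutputs) {
    assert(!markedOutputs.empty());
    std::set<std::string> needed(markedOutputs.begin(), markedOutputs.end());
    // Layers are assumed to be in topological order, so a single sweep from
    // the last layer to the first finds everything the outputs depend on.
    for (std::size_t i = layers.size(); i-- > 0;) {
        if (needed.count(layers[i].name))
            needed.insert(layers[i].inputs.begin(), layers[i].inputs.end());
    }
    std::vector<Layer> kept;
    for (const Layer& l : layers)
        if (needed.count(l.name)) kept.push_back(l);
    return kept;
}
```

For a hypothetical network conv1, relu1, aux, fc, prob in which only "prob" is marked as an output and nothing consumes aux, the sweep keeps conv1, relu1, fc, and prob and drops aux.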

1.4.2 Deployment

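The inference builder (buildCudaEngine) returns a pointer to a new inference engine runtime object (ICudaEngine). The runtime object is ready for immediate use; alternatively, its state can be serialized and saved to disk or to an object store for distribution. The serialized form is called a plan.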

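As noted earlier, batching and streaming data to and from the runtime inference engine is beyond the scope of this document. The code below shows how to use the inference engine to process a batch of input data and produce results.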

// The execution context is responsible for launching the 
// compute kernels
IExecutionContext *context = engine->createExecutionContext();

// In order to bind the buffers, we need to know the names of the 
// input and output tensors.
int inputIndex = engine->getBindingIndex(INPUT_LAYER_NAME);
int outputIndex = engine->getBindingIndex(OUTPUT_LAYER_NAME);

// Allocate GPU memory for Input / Output data
void** buffers = static_cast<void**>(malloc(engine->getNbBindings() * sizeof(void*)));
cudaMalloc(&buffers[inputIndex], batchSize * size_of_single_input);
cudaMalloc(&buffers[outputIndex], batchSize * size_of_single_output);

// Use CUDA streams to manage the concurrency of copying and executing
cudaStream_t stream;
cudaStreamCreate(&stream);

// Copy Input Data to the GPU
cudaMemcpyAsync(buffers[inputIndex], input, 
                batchSize * size_of_single_input, 
                cudaMemcpyHostToDevice, stream);

// Launch an instance of the GIE compute kernel
context->enqueue(batchSize, buffers, stream, nullptr);

// Copy Output Data to the Host
cudaMemcpyAsync(output, buffers[outputIndex], 
                batchSize * size_of_single_output, 
                cudaMemcpyDeviceToHost, stream);

// It is possible to have multiple instances of the code above
// in flight on the GPU in different streams.
// The host can then sync on a given stream and use the results
cudaStreamSynchronize(stream);
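
The listing above uses size_of_single_input and size_of_single_output without defining them. One plausible way to compute such byte counts, assuming FP32 tensors laid out as channels × height × width, is sketched below; the helper names and the example shapes are hypothetical and are not taken from the original code.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for one FP32 tensor of shape channels x height x width.
std::size_t singleTensorBytes(std::size_t channels, std::size_t height,
                              std::size_t width) {
    assert(channels > 0 && height > 0 && width > 0);
    return channels * height * width * sizeof(float);
}

// Bytes to request from cudaMalloc (and to copy) for a whole batch.
std::size_t batchBytes(std::size_t batchSize, std::size_t bytesPerItem) {
    assert(batchSize > 0);
    return batchSize * bytesPerItem;
}
```

For instance, a hypothetical 3 × 224 × 224 input image holds 150528 floats, and a 1000-class probability output holds 1000 floats; multiplying each by the batch size gives the sizes passed to the allocation and copy calls.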

1.5 Maximizing TensorRT performance and efficiency

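TensorRT simplifies deploying neural networks and helps products deliver deep learning with higher performance and efficiency.
The build phase identifies opportunities to optimize the network; the deployment phase runs the optimized network to reduce latency and increase throughput.
For web or mobile applications backed by data center servers, TensorRT can deploy varied and complex models that make the product smarter for end users while keeping the client lightweight. For next-generation devices, TensorRT helps deploy networks with high performance, high accuracy, and high efficiency.
Moreover, running inference with mixed-precision FP16 data lowers GPU power consumption, halves memory use, and delivers higher performance.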
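
The "half the memory" claim follows directly from element width: an FP16 value occupies 16 bits and an FP32 value 32 bits. The sketch below only does that arithmetic; it uses `std::uint16_t` as a stand-in for the size of a half-precision value (an assumption, since standard C++ before C++23 has no half type), and the names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kFp32Bytes = sizeof(float);          // 4 on common platforms
constexpr std::size_t kFp16Bytes = sizeof(std::uint16_t);  // FP16 occupies 16 bits

// Storage needed for a weight tensor at a given element width.
std::size_t weightBytes(std::size_t parameterCount, std::size_t bytesPerElement) {
    assert(bytesPerElement > 0);
    return parameterCount * bytesPerElement;
}
```

For any parameter count, `weightBytes(n, kFp16Bytes)` is half of `weightBytes(n, kFp32Bytes)`; the same ratio applies to activations stored in FP16.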

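I checked, and TensorRT is not installed on my host PC. The host PC is mainly used for training, where DIGITS is the tool I use most often; TensorRT is for inference, so it is not present on the host.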

1.6 Update

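Most of the material above was written in 2012 and is now quite dated. TensorRT has since reached version 4.0.1, which changes some behavior and introduces new features.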

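New layers: Top-k, LSTM with projection, Constant, Softmax, and Batch GEMM.
Multilayer perceptrons (MLPs) are optimized through layer fusion (vertical or horizontal).
Samples are provided for recurrent neural networks (RNNs), multilayer perceptrons (MLPs), and neural machine translation (NMT) to help get started quickly.
The parser now supports ONNX models, so TensorRT can optimize models exported from ONNX-compatible frameworks (such as Caffe 2, Chainer, MXNet, and PyTorch). Both C++ and Python APIs are supported.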

ONNX support.png
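Support for TensorFlow models: TensorFlow 1.7 provides a simple API for accelerating inference with TensorRT. Optimization for the different precisions (FP32, FP16, INT8) and for Tensor Cores is applied automatically.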

image.png