Converting and deploying a model with TensorRT mainly involves the following performance metrics:
- Throughput
Unit: qps (queries per second), the number of queries served per second.
The observed throughput, computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
- Latency
The sum of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
- End-to-End Host Latency
The duration from when the H2D of a query is called to when the D2H of the same query is completed, which includes the time spent waiting for the previous query to complete. This is the latency of a query when multiple queries are enqueued consecutively.
- Enqueue Time
The host latency to enqueue a single query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
- H2D Latency (host-to-device latency)
The latency of the host-to-device data transfers for the input tensors of a single query.
- GPU Compute Time
The GPU latency to execute the kernels for a single query, that is, the time the GPU spends on the computation itself.
- D2H Latency (device-to-host latency)
The latency of the device-to-host data transfers for the output tensors of a single query.
- Total Host Walltime[1]
The host walltime from when the first query (after warm-ups) is enqueued to when the last query is completed.
- Total GPU Compute Time
The sum of the GPU Compute Time of all queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
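The definitions above reduce to a few lines of arithmetic. The following is a minimal Python sketch of them, not part of TensorRT: the function names are hypothetical, and the 0.9 threshold standing in for "significantly lower" is an assumption, since no number is given for it.

```python
def throughput_qps(num_queries: int, total_host_walltime_s: float) -> float:
    """Throughput = number of queries / Total Host Walltime (seconds)."""
    return num_queries / total_host_walltime_s


def latency_ms(h2d_ms: float, gpu_compute_ms: float, d2h_ms: float) -> float:
    """Latency = H2D Latency + GPU Compute Time + D2H Latency."""
    return h2d_ms + gpu_compute_ms + d2h_ms


def gpu_may_be_underutilized(throughput: float, gpu_compute_ms: float,
                             tolerance: float = 0.9) -> bool:
    """Compare throughput with the reciprocal of GPU Compute Time.

    `tolerance` is an assumed cut-off for "significantly lower";
    tune it for your own workload.
    """
    ideal_qps = 1000.0 / gpu_compute_ms  # queries per second if the GPU never idles
    return throughput < tolerance * ideal_qps


# Illustrative numbers only: 100 queries in 2 s, 17 ms of GPU work per query.
qps = throughput_qps(100, 2.0)
print(qps, latency_ms(4.0, 17.0, 1.0), gpu_may_be_underutilized(qps, 17.0))
```

The reciprocal of GPU Compute Time is the throughput the GPU would reach if it were busy all the time, which is why a large gap between the two points at host-side overheads or data transfers.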
Example:
The following performance figures were measured for one model's inference run:
Throughput: 56.2013 qps
Latency: min = 22 ms, max = 22.5906 ms, mean = 22.1677 ms, median = 22.1396 ms, percentile(99%) = 22.588 ms
End-to-End Host Latency: min = 34.3153 ms, max = 35.616 ms, mean = 35.2231 ms, median = 35.2316 ms, percentile(99%) = 35.5511 ms
Enqueue Time: min = 0.937988 ms, max = 2.3905 ms, mean = 1.54232 ms, median = 1.5459 ms, percentile(99%) = 1.79907 ms
H2D Latency: min = 4.42554 ms, max = 4.89941 ms, mean = 4.47215 ms, median = 4.43042 ms, percentile(99%) = 4.88135 ms
GPU Compute Time: min = 17.5708 ms, max = 17.8975 ms, mean = 17.689 ms, median = 17.6906 ms, percentile(99%) = 17.8646 ms
D2H Latency: min = 0.00292969 ms, max = 0.0129395 ms, mean = 0.00656637 ms, median = 0.0055542 ms, percentile(99%) = 0.0128784 ms
Total Host Walltime: 3.04263 s
Total GPU Compute Time: 3.02481 s
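The relationships between these figures can be checked directly against the sample. The sketch below is a hedged illustration that assumes the log lines have exactly the format shown above; `parse_summary` is a hypothetical helper, and the query count is inferred from the data rather than reported in the log.

```python
import re

SAMPLE = """\
Throughput: 56.2013 qps
Latency: min = 22 ms, max = 22.5906 ms, mean = 22.1677 ms, median = 22.1396 ms, percentile(99%) = 22.588 ms
H2D Latency: min = 4.42554 ms, max = 4.89941 ms, mean = 4.47215 ms, median = 4.43042 ms, percentile(99%) = 4.88135 ms
GPU Compute Time: min = 17.5708 ms, max = 17.8975 ms, mean = 17.689 ms, median = 17.6906 ms, percentile(99%) = 17.8646 ms
D2H Latency: min = 0.00292969 ms, max = 0.0129395 ms, mean = 0.00656637 ms, median = 0.0055542 ms, percentile(99%) = 0.0128784 ms
Total Host Walltime: 3.04263 s
Total GPU Compute Time: 3.02481 s
"""


def parse_summary(text: str) -> dict:
    """Map each metric name to its mean (per-query lines) or its single value (totals)."""
    metrics = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        mean = re.search(r"mean = ([0-9.]+) ms", rest)
        single = re.fullmatch(r"\s*([0-9.]+)\s*(qps|s)\s*", rest)
        if mean:
            metrics[name.strip()] = float(mean.group(1))
        elif single:
            metrics[name.strip()] = float(single.group(1))
    return metrics


m = parse_summary(SAMPLE)

# Latency should equal H2D Latency + GPU Compute Time + D2H Latency.
latency_sum = m["H2D Latency"] + m["GPU Compute Time"] + m["D2H Latency"]
# Throughput the GPU would reach if it never idled (reciprocal of GPU Compute Time).
ideal_qps = 1000.0 / m["GPU Compute Time"]
# Throughput * Total Host Walltime recovers the (inferred) number of queries.
queries = m["Throughput"] * m["Total Host Walltime"]
# Fraction of the host walltime during which the GPU was computing.
gpu_busy = m["Total GPU Compute Time"] / m["Total Host Walltime"]

print(f"latency sum = {latency_sum:.4f} ms, reported = {m['Latency']} ms")
print(f"ideal = {ideal_qps:.2f} qps, observed = {m['Throughput']} qps")
print(f"queries = {queries:.1f}, GPU busy fraction = {gpu_busy:.4f}")
```

For this run the mean latencies add up to the reported Latency, the observed throughput is within about 1% of the reciprocal of GPU Compute Time, and Total GPU Compute Time is over 99% of Total Host Walltime, so by the criteria above the GPU appears well utilized. The mean Enqueue Time (about 1.5 ms) is also far shorter than the GPU Compute Time.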
[1] Walltime (wall-clock time) is the real time that elapses between the start and the end of a computation.