INT8 Convolution
Before quantization: conv2d-fp32
After quantization[1], the single fp32 convolution expands into the following sequence (a minimal numeric sketch of this flow follows the list):
- Quantization encoding: fp32-to-int8-IO
  - Offline weight quantization encoding
  - Online input quantization encoding
- INT8 convolution: conv2d-int8
- Dequantization encoding: int32-to-fp32-IO
- Addition: Add
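As a concrete illustration of the quantize → INT8 compute → dequantize flow above, here is a minimal NumPy sketch. It is not TensorRT code: the convolution is reduced to a toy GEMM, the scales are simple symmetric max-abs scales rather than calibrated ones, and all names (quantize_int8, weights_scale, input_scale) are illustrative.

import numpy as np

def quantize_int8(x_fp32, scale):
    # fp32 -> int8: scale, round to nearest, saturate to the symmetric range [-127, 127]
    return np.clip(np.round(x_fp32 * scale), -127, 127).astype(np.int8)

# Offline: weights are quantized once, ahead of inference
weights_fp32 = np.random.randn(4, 8).astype(np.float32)   # toy [K, C] "convolution" as a GEMM
weights_scale = 127.0 / np.abs(weights_fp32).max()
weights_i8 = quantize_int8(weights_fp32, weights_scale)

# Online: each input tensor is quantized at inference time
input_fp32 = np.random.randn(8, 16).astype(np.float32)    # toy [C, N] activations
input_scale = 127.0 / np.abs(input_fp32).max()
input_i8 = quantize_int8(input_fp32, input_scale)

# INT8 "convolution": accumulate in int32 to avoid overflow
acc_i32 = weights_i8.astype(np.int32) @ input_i8.astype(np.int32)

# Dequantization encoding: int32 accumulator back to fp32
output_fp32 = acc_i32.astype(np.float32) / (input_scale * weights_scale)

# The dequantized result approximates the original fp32 computation
print(np.abs(output_fp32 - weights_fp32 @ input_fp32).max())

The int32 accumulator produced by the INT8 GEMM is exactly what the int32-to-fp32-IO dequantization step in the list consumes.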
TensorRT 2.1 convolution kernel pseudocode implementation[2]:
// I8 input tensors: I8_input, I8_weights,
// I8 output tensors: I8_output
// F32 bias (original bias from the F32 model)
// F32 scaling factors: input_scale, output_scale, weights_scale[K]
I32_gemm_out = I8_input * I8_weights // Compute INT8 GEMM (DP4A)
F32_gemm_out = (float)I32_gemm_out // Cast I32 GEMM output to F32 float
// At this point we have F32_gemm_out which is scaled by ( input_scale * weights_scale[K] ),
// but to store the final result in int8 we need to have scale equal to "output_scale", so we have to rescale:
// (this multiplication is done in F32, *_gemm_out arrays are in NCHW format)
For i in 0, ... K-1:
rescaled_F32_gemm_out[ :, i, :, :] = F32_gemm_out[ :, i, :, :] * \
[ output_scale / (input_scale * weights_scale[ i ] ) ]
// Add bias, to perform addition we have to rescale original F32 bias so that it's scaled with "output_scale"
rescaled_F32_gemm_out_with_bias = rescaled_F32_gemm_out + output_scale * bias
// Perform ReLU (in F32)
F32_result = ReLU(rescaled_F32_gemm_out_with_bias)
// Convert to INT8 and save to global
I8_output = Saturate( Round_to_nearest_integer( F32_result ) )
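The pseudocode above can be rendered step by step in NumPy. This is a sketch, not the TensorRT kernel: the DP4A GEMM is replaced by an int32 einsum over a 1x1 convolution, and the shapes and scale values are made-up toy numbers.

import numpy as np

N, K, C, H, W = 1, 4, 8, 5, 5
rng = np.random.default_rng(0)

I8_input   = rng.integers(-127, 128, size=(N, C, H, W), dtype=np.int8)
I8_weights = rng.integers(-127, 128, size=(K, C), dtype=np.int8)   # 1x1 conv == per-pixel GEMM
F32_bias   = rng.standard_normal(K).astype(np.float32)

input_scale   = 0.05                                                # illustrative scale values
output_scale  = 0.1
weights_scale = rng.uniform(0.01, 0.03, size=K).astype(np.float32)

# Compute INT8 GEMM with int32 accumulation (stands in for DP4A)
I32_gemm_out = np.einsum('kc,nchw->nkhw',
                         I8_weights.astype(np.int32),
                         I8_input.astype(np.int32))

# Cast I32 GEMM output to F32
F32_gemm_out = I32_gemm_out.astype(np.float32)

# Rescale each output channel from (input_scale * weights_scale[k]) to output_scale
rescale = output_scale / (input_scale * weights_scale)              # shape [K]
rescaled_F32_gemm_out = F32_gemm_out * rescale[None, :, None, None]

# Add bias, rescaled so that it is also expressed in output_scale
rescaled_F32_gemm_out_with_bias = (rescaled_F32_gemm_out
                                   + (output_scale * F32_bias)[None, :, None, None])

# Perform ReLU (in F32)
F32_result = np.maximum(rescaled_F32_gemm_out_with_bias, 0.0)

# Round to nearest integer and saturate to the int8 range
I8_output = np.clip(np.rint(F32_result), -127, 127).astype(np.int8)

Note that the three scaling factors only meet in the per-channel factor output_scale / (input_scale * weights_scale[i]), which is why per-channel weight scales cost nothing extra here: the rescale is a single F32 multiply per output element regardless of how the scales were chosen.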