Description
Download the demo (link: https://dubox.com/s/1S3PAyHFeBtyk-Xj-jeB-0Q, password: 9gt7).
Refer to its readme or the following.
Problem
The fake int8 model is generated by PaddleSlim, and the real int8 model is the model optimized by save_quant_model.py.
With the same input data, we find that the results of the fake int8 model and the real int8 model differ numerically. For most models, the numerical difference does not affect the statistical accuracy measured over many input samples. For specific models, however, it leads to completely incorrect results.
For mobilenetv2:
# Test 100 images and compare the statistical accuracy.
python run_eval.py --model_path models/mobilenetv2_fp32
# test_acc1: 0.78, test_acc5: 0.95
python run_eval.py --model_path models/mobilenetv2_fake_int8
# test_acc1: 0.77, test_acc5: 0.93
python run_eval.py --model_path models/mobilenetv2_real_int8
# test_acc1: 0.77, test_acc5: 0.96
# Test 1 image and compare the numerical difference.
python run_infer.py models/mobilenetv2_fp32
# max value: 0.868, arg_max: 65
python run_infer.py models/mobilenetv2_fake_int8
# max value: 0.835, arg_max: 65
python run_infer.py models/mobilenetv2_real_int8
# max value: 0.902, arg_max: 65
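For reference, a single-image check like the one printed by run_infer.py can be reproduced with the Paddle Inference Python API roughly as below. This is only a sketch: the model directory layout, preprocessing, and config calls are assumptions and may differ from the demo script.

```python
import numpy as np
from paddle.inference import Config, create_predictor

def infer_one(model_dir, img_chw):
    """Run one preprocessed image and report the max probability and arg_max."""
    # Assumes a non-combined model directory; adjust the Config constructor if the
    # demo saves combined model/params files instead.
    config = Config(model_dir)
    config.disable_gpu()      # run on Intel CPU
    config.enable_mkldnn()    # use the oneDNN (MKL-DNN) kernels under discussion
    predictor = create_predictor(config)

    input_name = predictor.get_input_names()[0]
    input_handle = predictor.get_input_handle(input_name)
    input_handle.copy_from_cpu(img_chw[np.newaxis].astype('float32'))

    predictor.run()

    output_name = predictor.get_output_names()[0]
    out = predictor.get_output_handle(output_name).copy_to_cpu().flatten()
    print('max value: %.3f, arg_max: %d' % (out.max(), out.argmax()))

# Hypothetical usage: img is a preprocessed 3x224x224 CHW float array.
# infer_one('models/mobilenetv2_real_int8', img)
```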
For the mobilenetv3 model, we apply the original QAT algorithm to generate a fake int8 model, but the accuracy of the fake int8 model is lower than that of the fp32 model. Therefore, we use PACT in the QAT algorithm, which adds a clip operation before fake_quantize_op, and the accuracy of the fake int8 model then matches the fp32 model. PaddleLite deploys the fake int8 model on ARM CPU and the accuracy is the same. However, the fake int8 model deployed on Intel CPU by PaddleInference produces completely incorrect results, and the one deployed on NV GPU by PaddleInference has a 10% accuracy drop.
Note that we skip quantizing the se_block in mobilenetv3 and set --ops_to_quantize='conv2d,fc' for save_quant_model.py.
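For clarity, the original PACT formulation bounds each activation with a learnable threshold alpha before quantization, i.e. clip(x, 0, alpha); the PaddleSlim variant may differ in detail (for example, in whether the clip is one-sided). A minimal NumPy sketch of the forward pass, with illustrative names:

```python
import numpy as np

def pact_fake_quant(x, alpha, num_bits=8):
    """PACT forward pass (sketch): clip the activation, then fake-quantize it."""
    clipped = np.clip(x, 0.0, alpha)           # the extra clip PACT inserts before fake_quantize_op
    scale = alpha / (2 ** (num_bits - 1) - 1)  # map [0, alpha] onto 127 quantization levels
    q = np.round(clipped / scale)              # quantize
    return q * scale                           # dequantize back to float, i.e. a "fake" int8 value
```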
For 100 images, the statistical accuracy is as follows:
- The fp32 mobilenetv3: test_acc1 is 0.78, test_acc5 is 0.96
- The fake int8 mobilenetv3: test_acc1 is 0.76, test_acc5 is 0.95
- The real int8 mobilenetv3: test_acc1 is 0.0, test_acc5 is 0.0
Analysis
After comparing the int8 model deployment on ARM CPU and Intel CPU, I have found two main differences so far.
- Int8 range
For the quantize op and the int8 ops (conv2d, fc, etc.), the range of the output tensors is [-127, 127] on ARM CPU, but it is [-128, 127] or [0, 255] on Intel CPU. The difference between [-127, 127] and [-128, 127] may be the main problem. Does oneDNN decide the output range, and can we remove this difference? When the int8 op is followed by a relu or relu6 op, quantizing to [0, 255] may reduce the quantization loss. In order to carry out some tests, I want to know how to set the output range to [-128, 127] for all int8 ops. Is that also decided by oneDNN? (See the sketch after this list for an illustration of the different ranges.)
- Quantize bias
On ARM CPU, PaddleLite does not quantize the bias of conv and fc. On Intel CPU, PaddleInference quantizes the bias to int32. Does oneDNN support using an fp32 bias in the int8 kernels?
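To make the two differences concrete, here is a small NumPy sketch of how the same float tensor maps to symmetric [-127, 127] int8 (the ARM CPU convention), to [-128, 127], and to unsigned [0, 255] int8 after relu, plus an int32 bias quantization of the kind an int8 kernel with int32 accumulation typically expects. All values and scales are illustrative, not taken from the models.

```python
import numpy as np

x = np.array([-1.0, -0.51, 0.0, 0.37, 1.0], dtype=np.float32)  # illustrative activations
abs_max = float(np.abs(x).max())

# Symmetric signed int8 with range [-127, 127] (as observed on ARM CPU)
scale_127 = abs_max / 127.0
q_sym127 = np.clip(np.round(x / scale_127), -127, 127).astype(np.int8)

# Signed int8 using the full [-128, 127] range: the negative side gains one code,
# which produces exactly the kind of small numerical mismatch discussed above
scale_128 = abs_max / 128.0
q_sym128 = np.clip(np.round(x / scale_128), -128, 127).astype(np.int8)

# After relu the tensor is non-negative, so it can use unsigned int8 in [0, 255]
x_relu = np.maximum(x, 0.0)
scale_u8 = float(x_relu.max()) / 255.0
q_u8 = np.clip(np.round(x_relu / scale_u8), 0, 255).astype(np.uint8)

# Bias handling: an int8 kernel with int32 accumulators typically takes an int32 bias,
# quantized with scale = input_scale * weight_scale (weight_scale below is illustrative)
bias_fp32 = np.array([0.01, -0.02], dtype=np.float32)
input_scale, weight_scale = scale_127, 0.005
bias_int32 = np.round(bias_fp32 / (input_scale * weight_scale)).astype(np.int32)

print(q_sym127)
print(q_sym128)
print(q_u8)
print(bias_int32)
```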
Compare intermediate tensors
Using Netron to load the fp32 or int8 model, we can find the intermediate tensor names. Then
python run_infer.py model_path tensor_name1 tensor_name2...
can run the model and fetch those intermediate tensors.
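The per-tensor statistics reported in the table below (avg, min, max, and arg_max over the flattened tensor) can be computed with a small helper like the following; this is only a sketch, and the demo script may compute them differently.

```python
import numpy as np

def tensor_stats(t):
    """Return the avg / min / max / arg_max statistics used in the comparison table."""
    flat = np.asarray(t).flatten()
    return {
        'avg': float(flat.mean()),
        'min': float(flat.min()),
        'max': float(flat.max()),
        'arg_max': int(flat.argmax()),  # index into the flattened tensor
    }
```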
I have compared some intermediate tensors of the fake int8 mobilenetv3 and the real int8 mobilenetv3.
Tensor name in the fake int8 mobilenetv3 | Tensor name in the real int8 mobilenetv3 | Tensor info in the fake int8 mobilenetv3 | Tensor info in the real int8 mobilenetv3 |
---|---|---|---|
image.quantized | quantize/out/0 | avg: 59.359283 , min: -85.0 , max: 97.0 , arg_max: 134929 | avg: 59.35928199404762 , min: -85 , max: 97 , arg_max: 134929 |
batch_norm_0.tmp_2 | batch_norm_0.tmp_2 | avg: 2.2439551 , min: -10.029446 , max: 14.263074 , arg_max: 193725 | avg: 2.2423842 , min: -9.997816 , max: 14.297043 , arg_max: 193725 |
tmp_2 | tmp_2 | avg: 2.2459602 , min: -0.37499997 , max: 14.263075 , arg_max: 193725 | avg: 2.2411668 , min: -0.375 , max: 14.297043 , arg_max: 193725 |
relu_0.tmp_0.quantized | dequantize/in/1 | avg: 31.141064 , min: 0.0 , max: 127.0 , arg_max: 18106 | avg: 62.13507453762755 , min: 0 , max: 255 , arg_max: 18106 |
relu_1.tmp_0.quantized | dequantize/in/2 | avg: 31.885214 , min: 0.0 , max: 127.0 , arg_max: 12432 | avg: 65.45506816007654 , min: 0 , max: 255 , arg_max: 12432 |
elementwise_add_0.tmp_0.quantized | dequantize/in/38 | avg: 14.022595 , min: -49.0 , max: 105.0 , arg_max: 159600 | avg: 14.208042689732142 , min: -48 , max: 113 , arg_max: 157920 |
relu_2.tmp_0.quantized | dequantize/in/3 | avg: 14.235955 , min: 0.0 , max: 127.0 , arg_max: 30068 | avg: 29.650545081313776 , min: 0 , max: 255 , arg_max: 30068 |
relu_3.tmp_0.quantized | dequantize/in/4 | avg: 20.625692 , min: 0.0 , max: 127.0 , arg_max: 40085 | avg: 42.55670539700255 , min: 0 , max: 255 , arg_max: 16739 |
batch_norm_6.tmp_2.quantized | dequantize/in/5 | avg: 0.56964815 , min: -127.0 , max: 127.0 , arg_max: 45407 | avg: -0.604405824829932 , min: -128 , max: 127 , arg_max: 3965 |