Description
Download the demo (link: https://dubox.com/s/1S3PAyHFeBtyk-Xj-jeB-0Q, password: 9gt7).
Refer to its readme or the following.
Problem
The fake int8 model is generated by PaddleSlim, and the real int8 model is the model optimized by save_quant_model.py.
With the same input data, we find that the results of the fake int8 model and the real int8 model differ numerically. For most models, the numerical difference does not affect the statistical accuracy measured over many input samples. For specific models, however, it leads to completely incorrect results.
For mobilenetv2:
# Test 100 images and compare the statistical accuracy.
python run_eval.py --model_path models/mobilenetv2_fp32
# test_acc1: 0.78, test_acc5: 0.95
python run_eval.py --model_path models/mobilenetv2_fake_int8
# test_acc1: 0.77, test_acc5: 0.93
python run_eval.py --model_path models/mobilenetv2_real_int8
# test_acc1: 0.77, test_acc5: 0.96
# Test 1 image and compare the numerical difference.
python run_infer.py models/mobilenetv2_fp32
# max value: 0.868, arg_max: 65
python run_infer.py models/mobilenetv2_fake_int8
# max value: 0.835, arg_max: 65
python run_infer.py models/mobilenetv2_real_int8
# max value: 0.902, arg_max: 65
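For reference, a single-image check like the one printed by run_infer.py can be reproduced with the Paddle Inference Python API roughly as below. This is only a sketch: the model directory layout, preprocessing, and config calls are assumptions and may differ from the demo script.

```python
import numpy as np
from paddle.inference import Config, create_predictor

def infer_one(model_dir, img_chw):
    """Run one preprocessed image and report the max probability and arg_max."""
    # Assumes a non-combined model directory; adjust the Config constructor if the
    # demo saves combined model/params files instead.
    config = Config(model_dir)
    config.disable_gpu()      # run on Intel CPU
    config.enable_mkldnn()    # use the oneDNN (MKL-DNN) kernels under discussion
    predictor = create_predictor(config)

    input_name = predictor.get_input_names()[0]
    input_handle = predictor.get_input_handle(input_name)
    input_handle.copy_from_cpu(img_chw[np.newaxis].astype('float32'))

    predictor.run()

    output_name = predictor.get_output_names()[0]
    out = predictor.get_output_handle(output_name).copy_to_cpu().flatten()
    print('max value: %.3f, arg_max: %d' % (out.max(), out.argmax()))

# Hypothetical usage: img is a preprocessed 3x224x224 CHW float array.
# infer_one('models/mobilenetv2_real_int8', img)
```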
For the mobilenetv3 model, we apply the original QAT algorithm to generate a fake int8 model, but the accuracy of the fake int8 model is lower than that of the fp32 model. Therefore, we use PACT in the QAT algorithm, which adds a clip operation before fake_quantize_op, and the accuracy of the fake int8 model then matches the fp32 model. PaddleLite deploys the fake int8 model on ARM CPU and the accuracy is the same. However, the fake int8 model deployed on Intel CPU by PaddleInference produces completely incorrect results, and the one deployed on NV GPU by PaddleInference has a 10% accuracy drop.
Note that we skip quantizing the se_block in mobilenetv3 and set --ops_to_quantize='conv2d,fc' for save_quant_model.py.
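For clarity, the original PACT formulation bounds each activation with a learnable threshold alpha before quantization, i.e. clip(x, 0, alpha); the PaddleSlim variant may differ in detail (for example, in whether the clip is one-sided). A minimal NumPy sketch of the forward pass, with illustrative names:

```python
import numpy as np

def pact_fake_quant(x, alpha, num_bits=8):
    """PACT forward pass (sketch): clip the activation, then fake-quantize it."""
    clipped = np.clip(x, 0.0, alpha)           # the extra clip PACT inserts before fake_quantize_op
    scale = alpha / (2 ** (num_bits - 1) - 1)  # map [0, alpha] onto 127 quantization levels
    q = np.round(clipped / scale)              # quantize
    return q * scale                           # dequantize back to float, i.e. a "fake" int8 value
```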
For 100 images, the statistical accuracy is as follows:
- The fp32 mobilenetv3: test_acc1 is 0.78, test_acc5 is 0.96
- The fake int8 mobilenetv3: test_acc1 is 0.76, test_acc5 is 0.95
- The real int8 mobilenetv3: test_acc1 is 0.0, test_acc5 is 0.0
Analysis
After comparing the int8 model deployment on ARM CPU and Intel CPU, I have found two main differences so far.
- Int8 range
For the quantize op and the int8 ops (conv2d, fc, etc.), the range of the output tensors is [-127, 127] on ARM CPU, but it is [-128, 127] or [0, 255] on Intel CPU. The difference between [-127, 127] and [-128, 127] may be the main problem. Does oneDNN decide the output range, and can we remove this difference? When the int8 op is followed by a relu or relu6 op, quantizing to [0, 255] may reduce the quantization loss. In order to carry out some tests, I want to know how to set the output range to [-128, 127] for all int8 ops. Is that also decided by oneDNN? (See the sketch after this list for an illustration of the different ranges.)
- Quantize bias
On ARM CPU, PaddleLite does not quantize the bias of conv and fc. On Intel CPU, PaddleInference quantizes the bias to int32. Does oneDNN support using an fp32 bias in the int8 kernels?
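To make the two differences concrete, here is a small NumPy sketch of how the same float tensor maps to symmetric [-127, 127] int8 (the ARM CPU convention), to [-128, 127], and to unsigned [0, 255] int8 after relu, plus an int32 bias quantization of the kind an int8 kernel with int32 accumulation typically expects. All values and scales are illustrative, not taken from the models.

```python
import numpy as np

x = np.array([-1.0, -0.51, 0.0, 0.37, 1.0], dtype=np.float32)  # illustrative activations
abs_max = float(np.abs(x).max())

# Symmetric signed int8 with range [-127, 127] (as observed on ARM CPU)
scale_127 = abs_max / 127.0
q_sym127 = np.clip(np.round(x / scale_127), -127, 127).astype(np.int8)

# Signed int8 using the full [-128, 127] range: the negative side gains one code,
# which produces exactly the kind of small numerical mismatch discussed above
scale_128 = abs_max / 128.0
q_sym128 = np.clip(np.round(x / scale_128), -128, 127).astype(np.int8)

# After relu the tensor is non-negative, so it can use unsigned int8 in [0, 255]
x_relu = np.maximum(x, 0.0)
scale_u8 = float(x_relu.max()) / 255.0
q_u8 = np.clip(np.round(x_relu / scale_u8), 0, 255).astype(np.uint8)

# Bias handling: an int8 kernel with int32 accumulators typically takes an int32 bias,
# quantized with scale = input_scale * weight_scale (weight_scale below is illustrative)
bias_fp32 = np.array([0.01, -0.02], dtype=np.float32)
input_scale, weight_scale = scale_127, 0.005
bias_int32 = np.round(bias_fp32 / (input_scale * weight_scale)).astype(np.int32)

print(q_sym127)
print(q_sym128)
print(q_u8)
print(bias_int32)
```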
Compare intermediate tensors
Using Netron to load the fp32 or int8 model, we can find the intermediate tensor names. Then
python run_infer.py model_path tensor_name1 tensor_name2...
can run the model and fetch those intermediate tensors.
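The per-tensor statistics reported in the table below (avg, min, max, and arg_max over the flattened tensor) can be computed with a small helper like the following; this is only a sketch, and the demo script may compute them differently.

```python
import numpy as np

def tensor_stats(t):
    """Return the avg / min / max / arg_max statistics used in the comparison table."""
    flat = np.asarray(t).flatten()
    return {
        'avg': float(flat.mean()),
        'min': float(flat.min()),
        'max': float(flat.max()),
        'arg_max': int(flat.argmax()),  # index into the flattened tensor
    }
```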
I have compared some intermediate tensors of the fake int8 mobilenetv3 and the real int8 mobilenetv3.
Tensor name in the fake int8 mobilenetv3 | Tensor name in the real int8 mobilenetv3 | Tensor info in the fake int8 mobilenetv3 | Tensor info in the real int8 mobilenetv3 |
---|---|---|---|
image.quantized | quantize/out/0 | avg: 59.359283 , min: -85.0 , max: 97.0 , arg_max: 134929 | avg: 59.35928199404762 , min: -85 , max: 97 , arg_max: 134929 |
batch_norm_0.tmp_2 | batch_norm_0.tmp_2 | avg: 2.2439551 , min: -10.029446 , max: 14.263074 , arg_max: 193725 | avg: 2.2423842 , min: -9.997816 , max: 14.297043 , arg_max: 193725 |
tmp_2 | tmp_2 | avg: 2.2459602 , min: -0.37499997 , max: 14.263075 , arg_max: 193725 | avg: 2.2411668 , min: -0.375 , max: 14.297043 , arg_max: 193725 |
relu_0.tmp_0.quantized | dequantize/in/1 | avg: 31.141064 , min: 0.0 , max: 127.0 , arg_max: 18106 | avg: 62.13507453762755 , min: 0 , max: 255 , arg_max: 18106 |
relu_1.tmp_0.quantized | dequantize/in/2 | avg: 31.885214 , min: 0.0 , max: 127.0 , arg_max: 12432 | avg: 65.45506816007654 , min: 0 , max: 255 , arg_max: 12432 |
elementwise_add_0.tmp_0.quantized | dequantize/in/38 | avg: 14.022595 , min: -49.0 , max: 105.0 , arg_max: 159600 | avg: 14.208042689732142 , min: -48 , max: 113 , arg_max: 157920 |
relu_2.tmp_0.quantized | dequantize/in/3 | avg: 14.235955 , min: 0.0 , max: 127.0 , arg_max: 30068 | avg: 29.650545081313776 , min: 0 , max: 255 , arg_max: 30068 |
relu_3.tmp_0.quantized | dequantize/in/4 | avg: 20.625692 , min: 0.0 , max: 127.0 , arg_max: 40085 | avg: 42.55670539700255 , min: 0 , max: 255 , arg_max: 16739 |
batch_norm_6.tmp_2.quantized | dequantize/in/5 | avg: 0.56964815 , min: -127.0 , max: 127.0 , arg_max: 45407 | avg: -0.604405824829932 , min: -128 , max: 127 , arg_max: 3965 |