
Fake int8 model and real int8 model have different outputs on Intel CPU  #31103

@juncaipeng

Description


Download the demo (Link: https://dubox.com/s/1S3PAyHFeBtyk-Xj-jeB-0Q Password: 9gt7).

Refer to the readme or the following.

Problem

The fake int8 model is generated by PaddleSlim, and the real int8 model is the model optimized by save_quant_model.py.

With the same input data, we find that the outputs of the fake int8 model and the real int8 model have numerical differences. For most models, the numerical differences don't affect the statistical accuracy over many input samples. For specific models, however, the numerical differences lead to completely incorrect results.
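To make the numerical difference concrete, here is a minimal numpy sketch for comparing the two models' output vectors on one image. The .npy file names are hypothetical placeholders, not files shipped with the demo.

```python
import numpy as np

# Hypothetical dumps of the final output of each model for the same input image.
fake_out = np.load("mobilenetv2_fake_int8_out.npy").flatten()
real_out = np.load("mobilenetv2_real_int8_out.npy").flatten()

abs_diff = np.abs(fake_out - real_out)
cos_sim = float(np.dot(fake_out, real_out) /
                (np.linalg.norm(fake_out) * np.linalg.norm(real_out)))

print("max abs diff :", abs_diff.max())
print("mean abs diff:", abs_diff.mean())
print("cosine sim   :", cos_sim)
print("arg_max fake :", fake_out.argmax(), "arg_max real:", real_out.argmax())
```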

For mobilenetv2:

# Test 100 imgs, compare statistical accuracy.

python run_eval.py --model_path models/mobilenetv2_fp32
# test_acc1: 0.78, test_acc5: 0.95

python run_eval.py --model_path models/mobilenetv2_fake_int8
# test_acc1: 0.77, test_acc5: 0.93

python run_eval.py --model_path models/mobilenetv2_real_int8
# test_acc1: 0.77, test_acc5: 0.96

# Test 1 img, compare numerical difference.

python run_infer.py models/mobilenetv2_fp32
# max value: 0.868, arg_max: 65

python run_infer.py models/mobilenetv2_fake_int8
# max value: 0.835, arg_max: 65

python run_infer.py models/mobilenetv2_real_int8
# max value: 0.902, arg_max: 65

For the mobilenetv3 model, we apply the original QAT algorithm to generate a fake int8 model, but the accuracy of the fake int8 model is lower than that of the fp32 model. Therefore, we use PACT in the QAT algorithm, which adds a clip operation before fake_quantize_op, and the accuracy of the fake int8 model then matches the fp32 model. PaddleLite deploys the fake int8 model on ARM CPU and the accuracy is the same. However, the fake int8 model deployed on Intel CPU by PaddleInference produces completely incorrect results, and the fake int8 model deployed on NV GPU by PaddleInference has a 10% accuracy drop.
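For readers unfamiliar with PACT, the following is a minimal numpy sketch of the idea only, not PaddleSlim's actual implementation: the activation is clipped to a learnable threshold alpha before the fake quantize-dequantize step, so the int8 scale comes from alpha instead of the raw activation maximum.

```python
import numpy as np

def pact_fake_quant(x, alpha, num_bits=8):
    """Clip to [0, alpha] (PACT targets post-ReLU activations), then
    simulate int8 quantization with scale = alpha / 127 (sketch only)."""
    x_clipped = np.clip(x, 0.0, alpha)      # the extra clip PACT inserts before fake_quantize_op
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = alpha / qmax
    return np.round(x_clipped / scale) * scale

x = np.abs(np.random.randn(8).astype("float32")) * 6.0
print(pact_fake_quant(x, alpha=4.0))
```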

Note that we skip quantizing the se_block in mobilenetv3 and set --ops_to_quantize='conv2d,fc' for save_quant_model.py.
For 100 imgs, the statistical accuracy is as follows (a minimal sketch of the top-1/top-5 computation follows the list):

  • The fp32 mobilenetv3: test_acc1 is 0.78, test_acc5 is 0.96
  • The fake int8 mobilenetv3: test_acc1 is 0.76, test_acc5 is 0.95
  • The real int8 mobilenetv3: test_acc1 is 0.0, test_acc5 is 0.0
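For reference, test_acc1 / test_acc5 are standard top-1 / top-5 accuracy. This is not the demo's run_eval.py, just a sketch of the definition on dummy data:

```python
import numpy as np

def topk_accuracy(logits, labels, k=1):
    """logits: (N, num_classes), labels: (N,) class ids."""
    topk = np.argsort(-logits, axis=1)[:, :k]        # indices of the k highest scores
    hit = (topk == labels[:, None]).any(axis=1)
    return float(hit.mean())

# Dummy data standing in for 100 evaluated images.
logits = np.random.randn(100, 1000).astype("float32")
labels = np.random.randint(0, 1000, size=100)
print("test_acc1:", topk_accuracy(logits, labels, k=1))
print("test_acc5:", topk_accuracy(logits, labels, k=5))
```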

Analysis

After comparing the int8 model deployment on ARM CPU and Intel CPU, I have found two main differences so far.

  • Int8 range

For the quantize op and the int8 ops (conv2d, fc, etc.), the range of the output tensors is [-127, 127] on ARM CPU, but it is [-128, 127] or [0, 255] on Intel CPU. The difference between [-127, 127] and [-128, 127] may be the main problem. Does oneDNN decide the output range? Can we fix the difference? When the int8 op is followed by a relu or relu6 op, quantizing to [0, 255] may decrease the quantization loss. In order to carry out some tests, I want to know how to set the output range to [-128, 127] for all int8 ops. Is that also decided by oneDNN? (A small numerical sketch of these ranges and of bias quantization follows this list.)

  • Quantize bias

On ARM CPU, PaddleLite doesn't quantize the bias of Conv and FC. On Intel CPU, PaddleInference quantizes the bias to int32. Does oneDNN support using an fp32 bias in an int8 kernel?
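To make the two differences concrete, here is a small numpy sketch of the three output ranges and of int32 bias quantization. The scale conventions used here (abs-max / 127, max / 255, bias_scale = input_scale * weight_scale) are common choices and are assumptions, not necessarily what oneDNN or PaddleLite do internally.

```python
import numpy as np

x = np.random.randn(1000).astype("float32") * 3.0
relu_x = np.maximum(x, 0.0)

# ARM CPU (PaddleLite) style: symmetric int8, saturated to [-127, 127].
s = np.abs(x).max() / 127.0
q_arm = np.clip(np.round(x / s), -127, 127).astype(np.int8)

# Intel CPU style: full signed range [-128, 127] ...
q_s8 = np.clip(np.round(x / s), -128, 127).astype(np.int8)

# ... or unsigned [0, 255] when the op is followed by relu/relu6.
s_u8 = relu_x.max() / 255.0
q_u8 = np.clip(np.round(relu_x / s_u8), 0, 255).astype(np.uint8)

# Bias quantized to int32, assuming bias_scale = input_scale * weight_scale.
bias = np.random.randn(16).astype("float32")
weight_scale = 0.02
q_bias = np.round(bias / (s * weight_scale)).astype(np.int32)
```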

Comparing intermediate tensors

Using Netron to load the fp32 or int8 model, we can see the intermediate tensor names.

python run_infer.py model_path tensor_name1 tensor_name2 ... runs the model and fetches the intermediate tensors.
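The tensor info in the table below is just avg / min / max / arg_max over the flattened tensor. A trivial helper, assuming the fetched tensor is available as a numpy array:

```python
import numpy as np

def tensor_stats(name, t):
    """Print the statistics listed in the comparison table below."""
    t = np.asarray(t, dtype="float64").flatten()
    print(f"{name}: avg: {t.mean()}, min: {t.min()}, max: {t.max()}, arg_max: {int(t.argmax())}")

# Dummy tensor standing in for an intermediate tensor fetched by run_infer.py.
tensor_stats("relu_0.tmp_0.quantized", np.random.randint(0, 128, size=(1, 16, 112, 112)))
```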

I have compared some intermediate tensors between the fake int8 mobilenetv3 and the real int8 mobilenetv3.

| Tensor name in the fake int8 mobilenetv3 | Tensor name in the real int8 mobilenetv3 | Tensor info in the fake int8 mobilenetv3 | Tensor info in the real int8 mobilenetv3 |
|---|---|---|---|
| image.quantized | quantize/out/0 | avg: 59.359283, min: -85.0, max: 97.0, arg_max: 134929 | avg: 59.35928199404762, min: -85, max: 97, arg_max: 134929 |
| batch_norm_0.tmp_2 | batch_norm_0.tmp_2 | avg: 2.2439551, min: -10.029446, max: 14.263074, arg_max: 193725 | avg: 2.2423842, min: -9.997816, max: 14.297043, arg_max: 193725 |
| tmp_2 | tmp_2 | avg: 2.2459602, min: -0.37499997, max: 14.263075, arg_max: 193725 | avg: 2.2411668, min: -0.375, max: 14.297043, arg_max: 193725 |
| relu_0.tmp_0.quantized | dequantize/in/1 | avg: 31.141064, min: 0.0, max: 127.0, arg_max: 18106 | avg: 62.13507453762755, min: 0, max: 255, arg_max: 18106 |
| relu_1.tmp_0.quantized | dequantize/in/2 | avg: 31.885214, min: 0.0, max: 127.0, arg_max: 12432 | avg: 65.45506816007654, min: 0, max: 255, arg_max: 12432 |
| elementwise_add_0.tmp_0.quantized | dequantize/in/38 | avg: 14.022595, min: -49.0, max: 105.0, arg_max: 159600 | avg: 14.208042689732142, min: -48, max: 113, arg_max: 157920 |
| relu_2.tmp_0.quantized | dequantize/in/3 | avg: 14.235955, min: 0.0, max: 127.0, arg_max: 30068 | avg: 29.650545081313776, min: 0, max: 255, arg_max: 30068 |
| relu_3.tmp_0.quantized | dequantize/in/4 | avg: 20.625692, min: 0.0, max: 127.0, arg_max: 40085 | avg: 42.55670539700255, min: 0, max: 255, arg_max: 16739 |
| batch_norm_6.tmp_2.quantized | dequantize/in/5 | avg: 0.56964815, min: -127.0, max: 127.0, arg_max: 45407 | avg: -0.604405824829932, min: -128, max: 127, arg_max: 3965 |
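One hedged way to compare the relu rows above on the same scale: if the real int8 model maps the same float maximum to 255 where the fake int8 model maps it to 127 (an assumption about the scales, not verified from the kernels), rescaling by 127/255 should make the two quantized tensors roughly comparable.

```python
import numpy as np

# Hypothetical dumps of the two quantized relu tensors from the table.
fake_q = np.load("relu_0_fake.npy").astype("float32")   # values in [0, 127]
real_q = np.load("relu_0_real.npy").astype("float32")   # values in [0, 255]

real_rescaled = real_q * (127.0 / 255.0)
print("max abs diff after rescale:", np.abs(fake_q - real_rescaled).max())
```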
