Dear @cccclai, @shewu-quic, @chunit-quic
I'm trying to quantize the dummy Llama 2 model and run it on my Qualcomm device:
python ./examples/qualcomm/scripts/dummy_llama2.py --model SM8550 --device *** -b build_android --ptq 16a4w
However, the output of the quantized model is far from the x86 golden reference; see the is_close comparison at the end of the full log below.
Is this a known issue, or did I do something wrong? Are there guidelines or a strategy for quantizing these models to 16a4w while keeping accuracy at a reasonable level? I would be grateful for any insight. For what it's worth, the non-quantized model produces accurate results.
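For reference, here is roughly the PTQ flow I understand the script to be running. This is a minimal sketch, assuming the QnnQuantizer import path and the pre-autograd capture API from this PyTorch/ExecuTorch era; the exact setter that selects the 16a4w config is version-dependent, so it is left as a commented assumption:

import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer

def quantize_16a4w(model: torch.nn.Module, calib_batches):
    quantizer = QnnQuantizer()
    # ASSUMPTION: the knob that selects 16-bit activations / 4-bit weights
    # (what --ptq 16a4w toggles) varies across ExecuTorch versions; check
    # backends/qualcomm/quantizer in your tree for the actual setter.
    # quantizer.set_<16a4w config>(...)
    graph = capture_pre_autograd_graph(model, calib_batches[0])
    prepared = prepare_pt2e(graph, quantizer)
    # Calibrate by running representative inputs through the observed model.
    for batch in calib_batches:
        prepared(*batch)
    return convert_pt2e(prepared)

My hunch is that calibration matters here: if the dummy model is calibrated on random tokens only, a large 16a4w error would not be surprising, since 4-bit weights are quite sensitive to calibration quality.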
Full log:
opcode name target args kwargs
------------- ------------------------ --------------------------- ----------------------------- --------
placeholder arg55_1 arg55_1 () {}
get_attr lowered_module_0 lowered_module_0 () {}
call_function executorch_call_delegate executorch_call_delegate (lowered_module_0, arg55_1) {}
call_function getitem <built-in function getitem> (executorch_call_delegate, 0) {}
output output output ((getitem,),) {}
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
./dummy_llama2/dummy_llama2_qnn.pte: 1 file pushed. 12.8 MB/s (611280 bytes in 0.046s)
/opt/qcom/aistack/qnn/2.19.4.240226/lib/aar...pushed. 33.4 MB/s (1545776 bytes in 0.044s)
/opt/qcom/aistack/qnn/2.19.4.240226/lib/hex...pushed. 37.5 MB/s (7360784 bytes in 0.187s)
/opt/qcom/aistack/qnn/2.19.4.240226/lib/aar... pushed. 27.7 MB/s (290504 bytes in 0.010s)
/opt/qcom/aistack/qnn/2.19.4.240226/lib/aar...ushed. 38.1 MB/s (26628144 bytes in 0.666s)
/opt/qcom/aistack/qnn/2.19.4.240226/lib/aar... pushed. 14.1 MB/s (229024 bytes in 0.015s)
build_android/examples/qualcomm/qnn_executo...shed. 38.0 MB/s (385895152 bytes in 9.695s)
build_android/backends/qualcomm/libqnn_exec...pushed. 36.9 MB/s (8854840 bytes in 0.229s)
/home/anzen/Projects/executorch/dummy_llama... file pushed. 0.0 MB/s (14 bytes in 0.002s)
/home/anzen/Projects/executorch/dummy_llama... file pushed. 0.0 MB/s (12 bytes in 0.002s)
I 00:00:00.003985 executorch:qnn_executor_runner.cpp:81] Model file dummy_llama2_qnn.pte is loaded.
I 00:00:00.004075 executorch:qnn_executor_runner.cpp:90] Using method forward
I 00:00:00.004103 executorch:qnn_executor_runner.cpp:138] Setting up planned buffer 0, size 6160.
[INFO] [Qnn ExecuTorch]: create QNN Logger with log_level 2
[WARNING] [Qnn ExecuTorch]: <W> Initializing HtpProvider
[WARNING] [Qnn ExecuTorch]: <W> Function not called, PrepareLib isn't loaded!
[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
[INFO] [Qnn ExecuTorch]: Caching: Caching is in RESTORE MODE.
[WARNING] [Qnn ExecuTorch]: <W> sg_stubPtr is not null, skip loadRemoteSymbols
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> sg_stubPtr is not null, skip loadRemoteSymbols
[WARNING] [Qnn ExecuTorch]: <W> Function not called, PrepareLib isn't loaded!
[WARNING] [Qnn ExecuTorch]: <W> sg_stubPtr is not null, skip loadRemoteSymbols
[WARNING] [Qnn ExecuTorch]: <W> Function not called, PrepareLib isn't loaded!
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[INFO] [Qnn ExecuTorch]: Running level=3 optimization.
I 00:00:00.205369 executorch:qnn_executor_runner.cpp:161] Method loaded.
I 00:00:00.205507 executorch:qnn_executor_runner.cpp:166] Inputs prepared.
I 00:00:00.205798 executorch:qnn_executor_runner.cpp:171] Number of inputs: 1
I 00:00:00.206130 executorch:qnn_executor_runner.cpp:232] Perform 0 inference for warming up
I 00:00:00.206200 executorch:qnn_executor_runner.cpp:238] Start inference (0)
[WARNING] [Qnn ExecuTorch]: <W> sg_stubPtr is not null, skip loadRemoteSymbols
I 00:00:00.207318 executorch:qnn_executor_runner.cpp:256] 1 inference took 1.069000 ms, avg 1.069000 ms
I 00:00:00.207700 executorch:qnn_executor_runner.cpp:298] 1 inference took 1.069000 ms, avg 1.069000 ms
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> sg_stubPtr is not null, skip loadRemoteSymbols
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> sg_stubPtr is not null, skip loadRemoteSymbols
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[WARNING] [Qnn ExecuTorch]: <W> qnnOpPackageManager: hexagon unload op package function pointer is nullptr!
[WARNING] [Qnn ExecuTorch]: <W> Function not called, PrepareLib isn't loaded!
/data/local/tmp/executorch/dummy_llama2_qnn...ile pulled. 0.1 MB/s (6144 bytes in 0.087s)
is_close? False
x86_golden tensor([[[ 0.2713, 0.5471, -0.3194, ..., 0.1733, -0.7186, -1.1417],
[ 0.2635, 0.0273, -0.1612, ..., 1.2671, -1.4816, -0.6256],
[ 0.1451, -0.5109, 0.0358, ..., 0.4289, -0.3217, -1.4835]]],
grad_fn=<UnsafeViewBackward0>)
device_out tensor([[[-0.3499, -0.3881, 0.5011, ..., -0.2530, 0.3161, -0.0744],
[ 0.4127, -0.4308, -0.5663, ..., -0.3564, 0.0952, 0.7879],
[-0.2407, -0.5039, 0.3697, ..., -0.1345, 0.5565, 0.1253]]])
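For completeness, the is_close check appears to be a torch.allclose-style comparison. A small helper like this hypothetical sketch (golden and device_out stand in for the two tensors printed above; it is not part of the repo) quantifies the gap better than eyeballing values:

import torch

def report_error(golden: torch.Tensor, device_out: torch.Tensor) -> None:
    # Detach the golden tensor, which still carries autograd history
    # (note the grad_fn in the printout above).
    golden = golden.detach()
    print("allclose:", torch.allclose(golden, device_out, atol=1e-1))
    # Signal-to-quantization-noise ratio in dB; higher means the device
    # output tracks the golden output more closely.
    noise = golden - device_out
    sqnr = 10 * torch.log10(golden.pow(2).mean() / noise.pow(2).mean())
    print(f"SQNR: {sqnr.item():.2f} dB")
    # Cosine similarity over flattened logits; values near 1.0 mean the
    # outputs at least point in the same direction.
    cos = torch.nn.functional.cosine_similarity(
        golden.flatten(), device_out.flatten(), dim=0
    )
    print(f"cosine similarity: {cos.item():.4f}")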