Dear @cccclai, @shewu-quic, @chunit-quic
I'm trying to quantize the dummy Llama 2 model and run it on my Qualcomm device:
python ./examples/qualcomm/scripts/dummy_llama2.py --model SM8550 --device *** -b build_android --ptq 16a4w
However, the output of the quantized model is far from the x86 golden reference; see the is_close comparison at the end of the full log below.
Is this a known issue, or did I do something wrong? Are there guidelines or a strategy for quantizing these models to 16a4w while keeping accuracy at a reasonable level? I would be grateful for any insight. For what it's worth, the non-quantized model produces accurate results.
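For reference, here is roughly the PTQ flow I understand the script to be running. This is a minimal sketch, assuming the QnnQuantizer import path and the pre-autograd capture API from this PyTorch/ExecuTorch era; the exact setter that selects the 16a4w config is version-dependent, so it is left as a commented assumption:

import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer

def quantize_16a4w(model: torch.nn.Module, calib_batches):
    quantizer = QnnQuantizer()
    # ASSUMPTION: the knob that selects 16-bit activations / 4-bit weights
    # (what --ptq 16a4w toggles) varies across ExecuTorch versions; check
    # backends/qualcomm/quantizer in your tree for the actual setter.
    # quantizer.set_<16a4w config>(...)
    graph = capture_pre_autograd_graph(model, calib_batches[0])
    prepared = prepare_pt2e(graph, quantizer)
    # Calibrate by running representative inputs through the observed model.
    for batch in calib_batches:
        prepared(*batch)
    return convert_pt2e(prepared)

My hunch is that calibration matters here: if the dummy model is calibrated on random tokens only, a large 16a4w error would not be surprising, since 4-bit weights are quite sensitive to calibration quality.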
Full log:
opcode name target args kwargs
------------- ------------------------ --------------------------- ----------------------------- --------
placeholder arg55_1 arg55_1 () {}
get_attr lowered_module_0 lowered_module_0 () {}
call_function executorch_call_delegate executorch_call_delegate (lowered_module_0, arg55_1) {}
call_function getitem <built-in function getitem> (executorch_call_delegate, 0) {}
output output output ((getitem,),) {}
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
./dummy_llama2/dummy_llama2_qnn.pte: 1 file pushed. 12.8 MB/s (611280 bytes in 0.046s)
/opt/qcom/aistack/qnn/2.19.4.240226/lib/aar...pushed. 33.4 MB/s (1545776 bytes in 0.044s)
/opt/qcom/aistack/qnn/2.19.4.240226/lib/hex...pushed. 37.5 MB/s (7360784 bytes in 0.187s)
/opt/qcom/aistack/qnn/2.19.4.240226/lib/aar... pushed. 27.7 MB/s (290504 bytes in 0.010s)
/opt/qcom/aistack/qnn/2.19.4.240226/lib/aar...ushed. 38.1 MB/s (26628144 bytes in 0.666s)
/opt/qcom/aistack/qnn/2.19.4.240226/lib/aar... pushed. 14.1 MB/s (229024 bytes in 0.015s)
build_android/examples/qualcomm/qnn_executo...shed. 38.0 MB/s (385895152 bytes in 9.695s)
build_android/backends/qualcomm/libqnn_exec...pushed. 36.9 MB/s (8854840 bytes in 0.229s)
/home/anzen/Projects/executorch/dummy_llama... file pushed. 0.0 MB/s (14 bytes in 0.002s)
/home/anzen/Projects/executorch/dummy_llama... file pushed. 0.0 MB/s (12 bytes in 0.002s)
I 00:00:00.003985 executorch:qnn_executor_runner.cpp:81] Model file dummy_llama2_qnn.pte is loaded.
I 00:00:00.004075 executorch:qnn_executor_runner.cpp:90] Using method forward
I 00:00:00.004103 executorch:qnn_executor_runner.cpp:138] Setting up planned buffer 0, size 6160.
[INFO] [Qnn ExecuTorch]: create QNN Logger with log_level 2
[WARNING] [Qnn ExecuTorch]: <W> Initializing HtpProvider
[WARNING] [Qnn ExecuTorch]: <W> Function not called, PrepareLib isn't loaded!
[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
[INFO] [Qnn ExecuTorch]: Caching: Caching is in RESTORE MODE.
[WARNING] [Qnn ExecuTorch]: <W> sg_stubPtr is not null, skip loadRemoteSymbols
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> sg_stubPtr is not null, skip loadRemoteSymbols
[WARNING] [Qnn ExecuTorch]: <W> Function not called, PrepareLib isn't loaded!
[WARNING] [Qnn ExecuTorch]: <W> sg_stubPtr is not null, skip loadRemoteSymbols
[WARNING] [Qnn ExecuTorch]: <W> Function not called, PrepareLib isn't loaded!
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[INFO] [Qnn ExecuTorch]: Running level=3 optimization.
I 00:00:00.205369 executorch:qnn_executor_runner.cpp:161] Method loaded.
I 00:00:00.205507 executorch:qnn_executor_runner.cpp:166] Inputs prepared.
I 00:00:00.205798 executorch:qnn_executor_runner.cpp:171] Number of inputs: 1
I 00:00:00.206130 executorch:qnn_executor_runner.cpp:232] Perform 0 inference for warming up
I 00:00:00.206200 executorch:qnn_executor_runner.cpp:238] Start inference (0)
[WARNING] [Qnn ExecuTorch]: <W> sg_stubPtr is not null, skip loadRemoteSymbols
I 00:00:00.207318 executorch:qnn_executor_runner.cpp:256] 1 inference took 1.069000 ms, avg 1.069000 ms
I 00:00:00.207700 executorch:qnn_executor_runner.cpp:298] 1 inference took 1.069000 ms, avg 1.069000 ms
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> sg_stubPtr is not null, skip loadRemoteSymbols
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[WARNING] [Qnn ExecuTorch]: <W> sg_stubPtr is not null, skip loadRemoteSymbols
[WARNING] [Qnn ExecuTorch]: <W> This META does not have Alloc2 Support
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[WARNING] [Qnn ExecuTorch]: <W> qnnOpPackageManager: hexagon unload op package function pointer is nullptr!
[WARNING] [Qnn ExecuTorch]: <W> Function not called, PrepareLib isn't loaded!
/data/local/tmp/executorch/dummy_llama2_qnn...ile pulled. 0.1 MB/s (6144 bytes in 0.087s)
is_close? False
x86_golden tensor([[[ 0.2713, 0.5471, -0.3194, ..., 0.1733, -0.7186, -1.1417],
[ 0.2635, 0.0273, -0.1612, ..., 1.2671, -1.4816, -0.6256],
[ 0.1451, -0.5109, 0.0358, ..., 0.4289, -0.3217, -1.4835]]],
grad_fn=<UnsafeViewBackward0>)
device_out tensor([[[-0.3499, -0.3881, 0.5011, ..., -0.2530, 0.3161, -0.0744],
[ 0.4127, -0.4308, -0.5663, ..., -0.3564, 0.0952, 0.7879],
[-0.2407, -0.5039, 0.3697, ..., -0.1345, 0.5565, 0.1253]]])
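For completeness, the is_close check appears to be a torch.allclose-style comparison. A small helper like this hypothetical sketch (golden and device_out stand in for the two tensors printed above; it is not part of the repo) quantifies the gap better than eyeballing values:

import torch

def report_error(golden: torch.Tensor, device_out: torch.Tensor) -> None:
    # Detach the golden tensor, which still carries autograd history
    # (note the grad_fn in the printout above).
    golden = golden.detach()
    print("allclose:", torch.allclose(golden, device_out, atol=1e-1))
    # Signal-to-quantization-noise ratio in dB; higher means the device
    # output tracks the golden output more closely.
    noise = golden - device_out
    sqnr = 10 * torch.log10(golden.pow(2).mean() / noise.pow(2).mean())
    print(f"SQNR: {sqnr.item():.2f} dB")
    # Cosine similarity over flattened logits; values near 1.0 mean the
    # outputs at least point in the same direction.
    cos = torch.nn.functional.cosine_similarity(
        golden.flatten(), device_out.flatten(), dim=0
    )
    print(f"cosine similarity: {cos.item():.4f}")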