
[QCOM] [Llama] the size of w4a16 quantized Llama 3.2 1B Pte is too large #10226

Open
@tiger-of-shawn

Description

Using the latest ExecuTorch codebase, I export the PTE file:

The resulting file size is:

-rw-rw-r-- 1 2.9G Apr 16 06:53 test.pte

while the size of the float model checkpoint is:

-rw-rw-r-- 1 2.4G Oct 23 03:12 assets/models/Llama-3.2-1B/original/consolidated.00.pth

The conversion script is:

# Export the Llama model
function export_llama {
    model_path="$1"
    # 16-bit activation / 4-bit weight PT2E quantization (qnn_16a4w)
    python -m examples.models.llama.export_llama \
        -t "$model_path/original/tokenizer.model" \
        --checkpoint "$model_path/original/consolidated.00.pth" \
        -p "$model_path/original/params.json" \
        --disable_dynamic_shape \
        --qnn \
        --pt2e_quantize qnn_16a4w \
        --model llama3_2 \
        -d fp32 \
        --use_kv_cache \
        --num_sharding 1 \
        --soc_model SM8650 \
        --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
        -v \
        --output_name="test.pte"
}

Why is the PTE file larger than the float model?
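
For context, a rough back-of-envelope estimate (my own numbers, not from the export log: ~1.24B parameters, 4-bit packed weights, one fp16 scale and fp16 zero-point per group of 128) puts the expected quantized payload well under 1 GB:

# Hypothetical size estimate for 4-bit-quantized Llama 3.2 1B.
# Assumptions (not from the issue): ~1.24B params, group size 128,
# fp16 scale + fp16 zero-point per group.
params = 1.24e9
weight_bytes = params * 4 / 8        # 4 bits per weight
scale_bytes = (params / 128) * 4     # 2 bytes scale + 2 bytes zero-point per group
print(f"expected ~{(weight_bytes + scale_bytes) / 1e9:.2f} GB")  # ~0.66 GB vs. the observed 2.9G

So 2.9G looks like some tensors are being serialized at higher precision or duplicated, rather than packed at 4 bits.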


When I use the v0.4 ExecuTorch codebase to generate the PTE with the same configuration, the PTE's size is normal:

-rw-rw-r-- 1 1.1G Apr 16 06:57 output.pte
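
If it helps triage, one quick probe (an ad-hoc script of mine, not an ExecuTorch tool) is to compare how compressible the two files are: densely packed 4-bit weights barely deflate, while fp32 tensors or padding compress noticeably better, so a much lower ratio on the 2.9G file would point at unpacked or duplicated data:

import sys
import zlib

# Ad-hoc compressibility probe (hypothetical helper, not part of ExecuTorch):
# sample a few 1 MiB chunks of a .pte and report the mean deflate ratio.
def compress_ratio(path, chunk=1 << 20, samples=8):
    with open(path, "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        ratios = []
        for i in range(samples):
            f.seek(size * i // samples)
            data = f.read(chunk)
            ratios.append(len(zlib.compress(data)) / len(data))
    return sum(ratios) / len(ratios)

print(compress_ratio(sys.argv[1]))  # usage: python probe.py test.pte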

cc @cccclai @winskuo-quic @shewu-quic @cbilgin @larryliu0820 @mergennachin @helunwencser @jackzhxng

Labels

module: llm (Issues related to LLM examples and apps, and to the extensions/llm/ code)
module: qnn (Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/)
partner: qualcomm (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Qualcomm)
