
Increased memory usage for int8/int4 weight only quantization compared to gpt-fast #346


Description

@HDCharles

If you compare peak memory usage between gpt-fast and torchao's quantization APIs, the torchao APIs have much higher peak memory usage (though perf is the same).

With the new benchmark code this can be seen in https://github.com/pytorch/ao/blob/main/torchao/_models/llama/benchmark_results.txt

Lines 5, 7 and 6 correspond to no, int8wo, and int4wo quant on llama2-7b, which show peak memory of 13.88, 14.50 and 15.92 GB respectively (perf: 105.02, 147.03, 199.81 tok/s).
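For reference, a minimal sketch of how peak memory can be tracked around generation using standard PyTorch CUDA memory stats (assuming a CUDA device; `run_generation` is a hypothetical stand-in for the benchmark's decode loop, not the actual code in torchao/_models/llama/generate.py):

import torch

def report_peak_memory(run_generation):
    # `run_generation` is a hypothetical callable wrapping the decode loop;
    # the real numbers above come from the benchmark script itself.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    out = run_generation()
    torch.cuda.synchronize()
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"peak memory: {peak_gb:.2f} GB")
    return out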

Meanwhile, for gpt-fast, if I run

export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --compile
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int4 --groupsize 64
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model_int4.g64.pth --compile
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth --compile

we see peak memory of 13.88, 7.74 and 4.48 GB for no, int8wo, and int4wo quant on llama2-7b respectively (perf: 105.09, 150.58, 204 tok/s).
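Side by side (same model, numbers from above):

quant     torchao peak mem (GB) / tok/s     gpt-fast peak mem (GB) / tok/s
none      13.88 / 105.02                    13.88 / 105.09
int8wo    14.50 / 147.03                     7.74 / 150.58
int4wo    15.92 / 199.81                     4.48 / 204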

@cpuhrsch @jerryzh168 @msaroufim @supriyar
