If you compare peak memory usage between gpt-fast and torchao's quantization APIs, the torchao APIs have much higher peak memory usage (though perf is the same).
With the new benchmark code we can see this in https://github.com/pytorch/ao/blob/main/torchao/_models/llama/benchmark_results.txt
Lines 5, 7 and 6 correspond to no, int8wo, and int4wo quant on llama2-7b, which show peak mems of 13.88, 14.50 and 15.92 GB respectively (perf: 105.02, 147.03, 199.81 tok/s).
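For reference, this is roughly how the peak_mem numbers are read off; a minimal sketch using torch.cuda's memory stats (the exact measurement lives in the linked benchmark script; `load_model` and `generate` here are placeholders, not functions from either repo):

```python
import torch

def measure_peak_mem_gb(load_model, generate):
    # Sketch of measuring the peak GPU memory of a generation run.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    model = load_model()   # checkpoint load (+ quantization, if any)
    generate(model)        # compiled decode loop

    torch.cuda.synchronize()
    # max_memory_reserved is the allocator's high-water mark, i.e. the
    # kind of number reported as peak_mem above.
    return torch.cuda.max_memory_reserved() / 1e9
```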
Meanwhile, for gpt-fast, if I run:
```sh
export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --compile
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int4 --groupsize 64
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model_int4.g64.pth --compile
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth --compile
```
we see peak mems of 13.88, 7.74, and 4.48 GB for no, int8wo, and int4wo quant on llama2-7b respectively (perf: 105.09, 150.58, 204 tok/s).
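One difference between the two paths, which may matter here: gpt-fast quantizes the checkpoint offline with quantize.py and then loads already-quantized weights, while the torchao flow loads the full-precision model first and quantizes it in-process. A minimal sketch of the torchao side (assuming the `quantize_` API from `torchao.quantization`; exact entry points may differ across torchao versions, and a tiny stand-in model is used instead of llama2-7b):

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

# Stand-in for llama2-7b: the point is the ordering, not the model.
# Full-precision weights land on the GPU first, then get quantized
# in-process (unlike gpt-fast's offline quantize.py step).
model = nn.Sequential(
    nn.Linear(4096, 4096, bias=False),
    nn.Linear(4096, 4096, bias=False),
).to(torch.bfloat16).to("cuda")

quantize_(model, int8_weight_only())  # swap Linear weights for int8 tensors

# int4 weight-only with group size 64 would be:
# quantize_(model, int4_weight_only(group_size=64))

model = torch.compile(model)
```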