
GPTQ Quantization (3-bit and 4-bit) #9

Closed
@MarkSchmidty

Description

4-bit quantization tends to come at the cost of some output quality loss. GPTQ is a state-of-the-art post-training quantization method that results in negligible output quality loss compared with prior 4-bit (and 3-bit/2-bit) quantization methods, and even compared with uncompressed fp16 inference.
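For context, here is a minimal sketch (Python/NumPy, not llama.cpp's or GPTQ's actual code) of plain round-to-nearest 4-bit quantization, the simple baseline that GPTQ improves on by choosing quantized weights to minimize layer-wise output error using second-order information:

```python
# Round-to-nearest (RTN) symmetric 4-bit quantization of a weight row.
# GPTQ goes beyond this by adjusting remaining weights to compensate for
# the error each quantized weight introduces.
import numpy as np

def quantize_rtn_4bit(w: np.ndarray):
    """Quantize a float32 weight row to signed 4-bit integers plus a scale."""
    scale = np.abs(w).max() / 7.0                      # map max magnitude to +/-7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_rtn_4bit(w)
print("max abs reconstruction error:", np.abs(w - dequantize(q, s)).max())
```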


It would be good to see benchmarks of the existing implementation. The current 4-bit quantization may introduce substantial quality loss, or it may not; we'd need benchmarks to know.
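To make the ask concrete, here is a hedged sketch of a perplexity benchmark using Hugging Face transformers; the checkpoint name and evaluation text file are placeholders, not anything specified in this issue. Lower perplexity on the same text means less quality loss from quantization.

```python
# Sketch of a perplexity benchmark; compare an fp16 model against its
# quantized counterpart on the same evaluation text.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "decapoda-research/llama-7b-hf"   # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.eval()

text = open("wikitext2_test.txt").read()     # placeholder eval text
ids = tok(text, return_tensors="pt").input_ids

nll_sum, n_tokens, max_len = 0.0, 0, 2048
for start in range(0, ids.size(1), max_len):
    chunk = ids[:, start:start + max_len]
    if chunk.size(1) < 2:
        continue
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss   # mean NLL over predicted tokens
    nll_sum += loss.item() * (chunk.size(1) - 1)
    n_tokens += chunk.size(1) - 1

print("perplexity:", math.exp(nll_sum / n_tokens))
```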

The related project GPTQ-for-LLaMA has some benchmarks available for its implementation.

References:
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
The case for 4-bit precision: k-bit Inference Scaling Laws

Related work:
https://github.com/qwopqwop200/GPTQ-for-LLaMA/
