
2-bit integer quantization #456

Closed

Description

@ggerganov

Add Q2_0 and Q2_1 quantization support to ggml:

  • Follow the existing Q4_0 and Q4_1 implementations
  • Implement reference scalar quantization and dequantization routines (a possible shape is sketched after this list)
  • I suspect we might have to use QK == 16 in this case to compensate for the additional accuracy loss
  • Add SIMD support for a specific architecture - investigate the best strategy for the ggml_vec_dot_q2() computation (a scalar baseline follows the sketch below)
  • No need to implement ggml_vec_mad_q2() - the ggml_vec_mad_* routines will be deprecated soon
  • Compute perplexity scores
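
To make the first two items concrete, here is a minimal sketch of what a Q2_0 block and its reference scalar routines could look like if they follow the Q4_0 pattern (one fp32 scale per block, 2-bit quants packed four per byte). The names (QK2_0, block_q2_0, quantize_row_q2_0_reference, dequantize_row_q2_0) and the exact mapping onto the signed 2-bit range are assumptions for illustration, not a final layout:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

#define QK2_0 16

// hypothetical 2-bit block: one fp32 scale + QK2_0 x 2-bit quants
typedef struct {
    float   d;              // scale
    uint8_t qs[QK2_0 / 4];  // 2-bit quants, packed 4 per byte
} block_q2_0;

// reference scalar quantization, mirroring the Q4_0 reference routine
static void quantize_row_q2_0_reference(const float * x, block_q2_0 * y, int k) {
    assert(k % QK2_0 == 0);
    const int nb = k / QK2_0;

    for (int i = 0; i < nb; i++) {
        // absmax of the block
        float amax = 0.0f;
        for (int l = 0; l < QK2_0; l++) {
            const float v = fabsf(x[i*QK2_0 + l]);
            if (v > amax) amax = v;
        }

        // symmetric mapping analogous to Q4_0 (there d = amax/((1 << 3) - 1));
        // with 2 bits this uses only the levels {-1, 0, 1} - the best mapping
        // for 2 bits is exactly the kind of thing that needs experimenting
        const float d  = amax;
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        y[i].d = d;
        memset(y[i].qs, 0, sizeof(y[i].qs));

        for (int l = 0; l < QK2_0; l++) {
            int q = (int) roundf(x[i*QK2_0 + l]*id);
            if (q < -2) q = -2;
            if (q >  1) q =  1;
            // store with an offset of 2 so the value fits in unsigned 2 bits
            y[i].qs[l/4] |= (uint8_t) ((q + 2) << (2*(l % 4)));
        }
    }
}

// reference scalar dequantization
static void dequantize_row_q2_0(const block_q2_0 * x, float * y, int k) {
    assert(k % QK2_0 == 0);
    const int nb = k / QK2_0;

    for (int i = 0; i < nb; i++) {
        for (int l = 0; l < QK2_0; l++) {
            const int q = (x[i].qs[l/4] >> (2*(l % 4))) & 0x3;
            y[i*QK2_0 + l] = x[i].d * (float)(q - 2);
        }
    }
}
```

A scalar baseline for the ggml_vec_dot_q2() computation would then unpack both operands and accumulate the integer products per block - this inner loop is what a SIMD version would vectorize (again, the name and signature are assumed):

```c
// scalar reference dot product of two Q2_0 rows of length n
static float ggml_vec_dot_q2_0_ref(int n, const block_q2_0 * x, const block_q2_0 * y) {
    assert(n % QK2_0 == 0);
    const int nb = n / QK2_0;

    float sumf = 0.0f;
    for (int i = 0; i < nb; i++) {
        int sumi = 0;
        for (int l = 0; l < QK2_0; l++) {
            const int xq = ((x[i].qs[l/4] >> (2*(l % 4))) & 0x3) - 2;
            const int yq = ((y[i].qs[l/4] >> (2*(l % 4))) & 0x3) - 2;
            sumi += xq*yq;
        }
        sumf += x[i].d * y[i].d * (float) sumi;
    }
    return sumf;
}
```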

The expected model sizes for 7B and QK == 16 are:

  • Q2_0 - 3.2 GB

For QK == 32 we have:

  • Q2_0 - 2.4 GB
  • Q2_1 - 3.2 GB
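
For reference, these sizes follow from the per-block overhead, assuming one fp32 scale per block for Q2_0 and an additional fp32 min for Q2_1 (the same block layout as Q4_0/Q4_1) and roughly 6.7B quantized weights in the 7B model:

  • Q2_0, QK == 16: (16*2 + 32) bits / 16 weights = 4 bits per weight → ~3.2 GB
  • Q2_0, QK == 32: (32*2 + 32) bits / 32 weights = 3 bits per weight → ~2.4 GB
  • Q2_1, QK == 32: (32*2 + 32 + 32) bits / 32 weights = 4 bits per weight → ~3.2 GB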

Before you send me papers that show 2-bit quantization does not work - no need. I want to have this supported anyway. I have something in mind. The effort needed to add this support is so small that there is no reason not to do it.
