
2-bit integer quantization #456

Closed

Description

@ggerganov

Add Q2_0 and Q2_1 quantization support to ggml:

  • Follow the existing Q4_0 and Q4_1 implementations
  • Implement reference scalar quantization and dequantization routines (a possible shape is sketched after this list)
  • I suspect we might have to use QK == 16 in this case to compensate for the additional accuracy loss
  • Add SIMD support for a specific architecture - investigate the best strategy for the ggml_vec_dot_q2() computation (a scalar baseline follows the sketch below)
  • No need to implement ggml_vec_mad_q2() - the ggml_vec_mad_* routines will be deprecated soon
  • Compute perplexity scores
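
To make the first two items concrete, here is a minimal sketch of what a Q2_0 block and its reference scalar routines could look like if they follow the Q4_0 pattern (one fp32 scale per block, 2-bit quants packed four per byte). The names (QK2_0, block_q2_0, quantize_row_q2_0_reference, dequantize_row_q2_0) and the exact mapping onto the signed 2-bit range are assumptions for illustration, not a final layout:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

#define QK2_0 16

// hypothetical 2-bit block: one fp32 scale + QK2_0 x 2-bit quants
typedef struct {
    float   d;              // scale
    uint8_t qs[QK2_0 / 4];  // 2-bit quants, packed 4 per byte
} block_q2_0;

// reference scalar quantization, mirroring the Q4_0 reference routine
static void quantize_row_q2_0_reference(const float * x, block_q2_0 * y, int k) {
    assert(k % QK2_0 == 0);
    const int nb = k / QK2_0;

    for (int i = 0; i < nb; i++) {
        // absmax of the block
        float amax = 0.0f;
        for (int l = 0; l < QK2_0; l++) {
            const float v = fabsf(x[i*QK2_0 + l]);
            if (v > amax) amax = v;
        }

        // symmetric mapping analogous to Q4_0 (there d = amax/((1 << 3) - 1));
        // with 2 bits this uses only the levels {-1, 0, 1} - the best mapping
        // for 2 bits is exactly the kind of thing that needs experimenting
        const float d  = amax;
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        y[i].d = d;
        memset(y[i].qs, 0, sizeof(y[i].qs));

        for (int l = 0; l < QK2_0; l++) {
            int q = (int) roundf(x[i*QK2_0 + l]*id);
            if (q < -2) q = -2;
            if (q >  1) q =  1;
            // store with an offset of 2 so the value fits in unsigned 2 bits
            y[i].qs[l/4] |= (uint8_t) ((q + 2) << (2*(l % 4)));
        }
    }
}

// reference scalar dequantization
static void dequantize_row_q2_0(const block_q2_0 * x, float * y, int k) {
    assert(k % QK2_0 == 0);
    const int nb = k / QK2_0;

    for (int i = 0; i < nb; i++) {
        for (int l = 0; l < QK2_0; l++) {
            const int q = (x[i].qs[l/4] >> (2*(l % 4))) & 0x3;
            y[i*QK2_0 + l] = x[i].d * (float)(q - 2);
        }
    }
}
```

A scalar baseline for the ggml_vec_dot_q2() computation would then unpack both operands and accumulate the integer products per block - this inner loop is what a SIMD version would vectorize (again, the name and signature are assumed):

```c
// scalar reference dot product of two Q2_0 rows of length n
static float ggml_vec_dot_q2_0_ref(int n, const block_q2_0 * x, const block_q2_0 * y) {
    assert(n % QK2_0 == 0);
    const int nb = n / QK2_0;

    float sumf = 0.0f;
    for (int i = 0; i < nb; i++) {
        int sumi = 0;
        for (int l = 0; l < QK2_0; l++) {
            const int xq = ((x[i].qs[l/4] >> (2*(l % 4))) & 0x3) - 2;
            const int yq = ((y[i].qs[l/4] >> (2*(l % 4))) & 0x3) - 2;
            sumi += xq*yq;
        }
        sumf += x[i].d * y[i].d * (float) sumi;
    }
    return sumf;
}
```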

The expected model sizes for 7B and QK == 16 are:

  • Q2_0 - 3.2 GB

For QK == 32 we have:

  • Q2_0 - 2.4 GB
  • Q2_1 - 3.2 GB
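
For reference, these sizes follow from the per-block overhead, assuming one fp32 scale per block for Q2_0 and an additional fp32 min for Q2_1 (the same block layout as Q4_0/Q4_1) and roughly 6.7B quantized weights in the 7B model:

  • Q2_0, QK == 16: (16*2 + 32) bits / 16 weights = 4 bits per weight → ~3.2 GB
  • Q2_0, QK == 32: (32*2 + 32) bits / 32 weights = 3 bits per weight → ~2.4 GB
  • Q2_1, QK == 32: (32*2 + 32 + 32) bits / 32 weights = 4 bits per weight → ~3.2 GB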

Before you send me papers that show 2-bit quantization does not work - no need. I want to have this supported anyway. I have something in mind. The effort needed to add this support is so small that there is no reason not to do it.
