Pulling new quantization format Q4_1_O into upstream ggml #89
Description
When developing rwkv.cpp, I've discovered that the existing quantization formats `Q4_0` and `Q4_1` break RWKV (that is, perplexity becomes 10x higher and the output is garbage). I've documented my observations in this issue. It looks like this is caused both by outliers in weights and by outliers in activations.

To solve this, I've created a new format, `Q4_1_O`. Commit in rwkv.cpp. Comparisons.
Most important things about the format:

- it is based on `Q4_1`
- it stores `min` & `delta` values in `FP16`, not `FP32`
- per 32-element block, it losslessly stores a single absmax `FP16` value (called the "outlier") and its index in the block; all other values are quantized as if there was no outlier
- matmul is done in `FP32`, that is, I dequantize the matrix and multiply it by activations already in `FP32`
- per-token latency is the same as `FP32`, 40% slower than `FP16` (on my machine)
- perplexity is, as expected with any quantization, slightly higher than `FP16`, but the principle "it's better to use a quantized X+1 model than an `FP16` X model" holds

TL;DR: store a single outlier value per block unquantized; dot in FP32.
Recently, it became clear that my `ggml` fork and upstream `ggml` (in `llama.cpp`/here) have begun to diverge greatly: Code difference is getting more between ggml and rwkv.cpp.

I would like to keep my interventions in my copy of `ggml` as small as possible, so that I can pull the latest optimizations/fixes without needing to apply all my changes again.
Specifically, I ask: does it sound like the `Q4_1_O` format belongs in upstream `ggml`? If so, I can create a PR here.