Pulling new quantization format Q4_1_O into upstream ggml #89
Description
When developing rwkv.cpp, I've discovered that the existing quantization formats `Q4_0` and `Q4_1` break RWKV (that is, perplexity becomes 10x higher and the output is garbage). I've documented my observations in this issue. It looks like this is caused both by outliers in weights and by outliers in activations.

To solve this, I've created a new format, `Q4_1_O`. Commit in rwkv.cpp. Comparisons.
Most important things about the format:

- it is based on `Q4_1`
- it stores `min` & `delta` values in `FP16`, not `FP32`
- per 32-element block, it losslessly stores a single absmax `FP16` value (called the "outlier") and its index in the block; all other values are quantized as if there was no outlier
- matmul is done in `FP32`, that is, I dequantize the matrix and multiply it by activations already in `FP32`
- per-token latency is the same as `FP32`, 40% slower than `FP16` (on my machine)
- perplexity is, as expected with any quantization, slightly higher than `FP16`, but the principle "it's better to use a quantized X+1 model than an `FP16` X model" holds

TL;DR: store a single outlier value per block unquantized; dot in FP32.
Recently, it became clear that my `ggml` fork and upstream `ggml` (in `llama.cpp`/here) have begun to diverge greatly: Code difference is getting more between ggml and rwkv.cpp.

I would like to keep my interventions in my copy of `ggml` as small as possible, so that I can pull the latest optimizations/fixes without needing to apply all my changes again.
Specifically, I ask: does it sound like the `Q4_1_O` format belongs in upstream `ggml`? If so, I can create a PR here.