
Pulling new quantization format Q4_1_O into upstream ggml #89

Closed as not planned
@saharNooby

Description

When developing rwkv.cpp, I've discovered that the existing quantization formats Q4_0 and Q4_1 break RWKV (that is, perplexity becomes 10x higher and the output is garbage). I've documented my observations in this issue. It looks like this is caused both by outliers in the weights and by outliers in the activations.

To solve this, I've created a new format Q4_1_O. Commit in rwkv.cpp. Comparisons.

Most important things about the format:

  • it is based on Q4_1
  • it stores min & delta values in FP16, not FP32
  • per 32-element block, it losslessly stores a single absmax FP16 value (called "outlier") and its index in the block; all other values are quantized as if there was no outlier
  • matmul is done in FP32, that is, I dequantize the matrix and multiply it by the activations already in FP32
  • per-token latency is the same as FP32, which is 40% slower than FP16 (on my machine)
  • perplexity is, as expected with any quantization, slightly higher than FP16, but the principle "it's better to use a quantized X+1 model than an FP16 X model" holds

TL;DR: store single outlier value per block unquantized; dot in FP32.
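To make the layout concrete, here is a minimal C sketch of the idea. This is not the actual rwkv.cpp code: the struct layout, field names, and the `QK4_1_O` constant are illustrative assumptions; only the FP16 conversion helpers (`ggml_fp32_to_fp16` / `ggml_fp16_to_fp32`) are real ggml API.

```c
#include <float.h>
#include <math.h>
#include <stdint.h>

#include "ggml.h" // ggml_fp16_t, ggml_fp32_to_fp16, ggml_fp16_to_fp32

#define QK4_1_O 32 // illustrative name for the block size

// Hypothetical block layout; field names are illustrative.
typedef struct {
    ggml_fp16_t d;               // delta (scale), stored in FP16
    ggml_fp16_t m;               // min, stored in FP16
    ggml_fp16_t outlier_value;   // absmax element, kept losslessly in FP16
    uint16_t    outlier_index;   // position of the outlier within the block
    uint8_t     qs[QK4_1_O / 2]; // 4-bit quants, two per byte
} block_q4_1_o;

static void quantize_block_q4_1_o(const float * x, block_q4_1_o * y) {
    // 1. Find the absmax element: it becomes the outlier and is stored as-is.
    int oi = 0;
    for (int i = 1; i < QK4_1_O; i++) {
        if (fabsf(x[i]) > fabsf(x[oi])) oi = i;
    }
    y->outlier_value = ggml_fp32_to_fp16(x[oi]);
    y->outlier_index = (uint16_t) oi;

    // 2. Compute min/delta over the remaining values, as if there was no outlier.
    float min =  FLT_MAX;
    float max = -FLT_MAX;
    for (int i = 0; i < QK4_1_O; i++) {
        if (i == oi) continue;
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }
    const float d  = (max - min) / 15.0f; // 4 bits -> 16 levels
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    y->d = ggml_fp32_to_fp16(d);
    y->m = ggml_fp32_to_fp16(min);

    // 3. Quantize everything to 4 bits; the outlier's slot is a don't-care,
    //    since it is overwritten on dequantization.
    for (int i = 0; i < QK4_1_O; i += 2) {
        const uint8_t q0 = (uint8_t) fminf(15.0f, fmaxf(0.0f, roundf((x[i]     - min) * id)));
        const uint8_t q1 = (uint8_t) fminf(15.0f, fmaxf(0.0f, roundf((x[i + 1] - min) * id)));
        y->qs[i / 2] = q0 | (q1 << 4);
    }
}

static void dequantize_block_q4_1_o(const block_q4_1_o * x, float * y) {
    const float d = ggml_fp16_to_fp32(x->d);
    const float m = ggml_fp16_to_fp32(x->m);
    for (int i = 0; i < QK4_1_O; i += 2) {
        const uint8_t b = x->qs[i / 2];
        y[i]     = (b & 0x0F) * d + m;
        y[i + 1] = (b >>   4) * d + m;
    }
    // Restore the outlier losslessly, replacing whatever was in its slot.
    y[x->outlier_index] = ggml_fp16_to_fp32(x->outlier_value);
}
```

The dot product then dequantizes each block into a small FP32 buffer and accumulates against the activations in FP32, which is why per-token latency matches the FP32 path rather than the other quantized formats.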


Recently, it became clear that my ggml fork and upstream ggml (in llama.cpp/here) have begun to diverge significantly; see "Code difference is getting more between ggml and rwkv.cpp".

I would like to keep the interventions in my copy of ggml as small as possible, so that I can pull the latest optimizations/fixes without needing to reapply all my changes.

Specifically, I ask: does it sound like the Q4_1_O format belongs in upstream ggml? If so, I can create a PR here.
