
[ARM] Support 8bit/4bit weights decompression for Matmul primitive #2081

Open
dmitry-gorokhov opened this issue Sep 4, 2024 · 4 comments
Labels: enhancement (A feature or an optimization request), help wanted, platform:cpu-aarch64 (Codeowner: @oneapi-src/onednn-cpu-aarch64)

Comments

@dmitry-gorokhov
Contributor

Problem statement

Latency-oriented LLM workloads are memory bound: inference speed is limited by access to the model weights through DDR. That is why the major optimization technique is weights compression (4-bit weight compression can bring up to 4x better latency compared with bf16/fp16 weights).

Preferred solution

oneDNN has already extended the x64 brgemm Matmul primitive (8-bit, 4-bit) to support the following decompression math:

  1. Decompress a block of weights into a temporary buffer (via brgemm_matmul_copy_b): w_fp = (w_compressed - zp) * scale (see the reference sketch after this list).
  2. Call the regular floating-point Matmul on the decompressed weight block.
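
For illustration, here is a minimal scalar sketch of that math, assuming plain row-major K x N u8 weights and per-output-channel scales/zero points. The buffer names and layout are illustrative assumptions, not oneDNN internals:

```cpp
#include <cstdint>

// Step 1: decompress a weight block into a temporary fp32 buffer:
//   w_fp = (w_compressed - zp) * scale
// zp and scale are per output channel (per n) in this sketch.
void decompress_block(const uint8_t *w_u8, const uint8_t *zp, const float *scale,
        float *w_fp, int K, int N) {
    for (int k = 0; k < K; ++k)
        for (int n = 0; n < N; ++n)
            w_fp[k * N + n] = (float(w_u8[k * N + n]) - float(zp[n])) * scale[n];
}

// Step 2: regular fp32 matmul on the decompressed block, C[MxN] = A[MxK] * W[KxN].
void matmul_f32(const float *a, const float *w_fp, float *c, int M, int K, int N) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k)
                acc += a[m * K + k] * w_fp[k * N + n];
            c[m * N + n] = acc;
        }
}
```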

Since a floating-point brgemm Matmul is already implemented for aarch64 (at least with SVE), the proposal is to extend it to support compressed weights, in the same way it is done for x64.

The request is to support the following options (a usage sketch follows the list):

  1. i4/u4/i8/u8 weights + fp32/fp16/bf16 activations.
  2. An additional input for scales (per-output-channel values for int8, grouped for int4). Data type: fp32/fp16.
  3. An optional zero-point input (per-output-channel values for int8, grouped for int4). It can have the same data type as the weights, but it can also be converted to fp32/fp16 if the implementation prefers that.
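
For reference, a minimal sketch of how a user might describe such a matmul through the oneDNN C++ API. This assumes the oneDNN 3.x attribute API (set_fpmath_mode with apply_to_int, plus the set_scales/set_zero_points overloads that take group dims); exact signatures, data types, and supported combinations should be checked against the installed release:

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    // Hypothetical shapes: M x K f32 activations, K x N compressed u8 weights.
    const memory::dim M = 1, K = 4096, N = 4096;

    engine eng(engine::kind::cpu, 0);

    memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::u8, memory::format_tag::any);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    primitive_attr attr;
    // Per-output-channel (per-N) scales and zero points for int8 weights;
    // int4 weights would additionally use group dims along K.
    attr.set_scales(DNNL_ARG_WEIGHTS, /*mask=*/1 << 1, /*groups=*/{},
            memory::data_type::f32);
    attr.set_zero_points(DNNL_ARG_WEIGHTS, /*mask=*/1 << 1, /*groups=*/{},
            memory::data_type::u8);
    // Opt into up-converting the integer weights for floating-point math.
    attr.set_fpmath_mode(fpmath_mode::f16, /*apply_to_int=*/true);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    return 0;
}
```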
@dmitry-gorokhov added the enhancement label on Sep 4, 2024
@vpirogov added the help wanted and platform:cpu-aarch64 labels on Sep 4, 2024
@mgouicem
Contributor

mgouicem commented Oct 4, 2024

@theComputeKid

@theComputeKid
Contributor

Thanks. I was expecting this to eventually be requested.

@jondea
Contributor

jondea commented Oct 10, 2024

What's the value proposition of decompressing to fp16/fp32 rather than just doing the matmul in int8? Wouldn't you expect an int8 matmul to have 2-4x the throughput?

@dzarukin
Contributor

@jondea, it's more about the price of enabling that int8 matmul. Since the activations are in fp16/fp32, to get int8 computations one needs to quantize them. That means on-the-fly quantization, which is a quite complex technique for the library and for users to adopt.

By contrast, up-converting the weights decreases the effective compute power but doesn't require extra actions from the user; it works out of the box just by setting a single attribute (fpmath-mode=DT:true). This is the core proposition: better bandwidth and performance with a single attribute from the user's perspective.
