[ARM] Support 8bit/4bit weights decompression for Matmul primitive #2081
Labels: enhancement (A feature or an optimization request), help wanted, platform:cpu-aarch64
Codeowner: @oneapi-src/onednn-cpu-aarch64
Problem statement
Latency-oriented LLM workloads are memory bound: inference speed is limited by reading the model weights from DDR memory. That is why the major optimization technique is weight compression; 4-bit weight compression can bring up to 4x better latency compared to bf16/fp16 weights.
Preferred solution
oneDNN has already extended the x64 BRGEMM-based Matmul primitive to support 8-bit and 4-bit compressed weights, decompressing them on the fly: the integer weights are dequantized as weights_f32 = scale * (weights_int - zero_point) before the floating-point computation.
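The original issue referenced the decompression math with an equation or image that did not survive into this text. As a rough illustration only, here is a minimal sketch of the assumed dequantization with per-group scales and zero points for int8 and packed int4 weights; the helper names (`dequantize_s8`, `dequantize_u4`) are hypothetical and not part of the oneDNN API:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed decompression math:
//   w_f32[k][n] = scale[g][n] * (w_int[k][n] - zero_point[g][n])
// where g = k / group_size selects the quantization group along K.

// int8 weights (K x N, row-major), one scale/zero-point pair per (group, column).
std::vector<float> dequantize_s8(const std::vector<int8_t> &w,
                                 const std::vector<float> &scales,        // (K/group) x N
                                 const std::vector<int8_t> &zero_points,  // (K/group) x N
                                 size_t K, size_t N, size_t group_size) {
    std::vector<float> out(K * N);
    for (size_t k = 0; k < K; ++k) {
        const size_t g = k / group_size;
        for (size_t n = 0; n < N; ++n)
            out[k * N + n] = scales[g * N + n]
                    * (float(w[k * N + n]) - float(zero_points[g * N + n]));
    }
    return out;
}

// uint4 weights packed two adjacent row-major elements per byte (low nibble first).
std::vector<float> dequantize_u4(const std::vector<uint8_t> &packed,      // (K*N + 1) / 2 bytes
                                 const std::vector<float> &scales,        // (K/group) x N
                                 const std::vector<uint8_t> &zero_points, // (K/group) x N
                                 size_t K, size_t N, size_t group_size) {
    std::vector<float> out(K * N);
    for (size_t idx = 0; idx < K * N; ++idx) {
        const uint8_t byte = packed[idx / 2];
        const uint8_t q = (idx % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        const size_t k = idx / N, n = idx % N, g = k / group_size;
        out[idx] = scales[g * N + n] * (float(q) - float(zero_points[g * N + n]));
    }
    return out;
}
```

In the actual x64 implementation the dequantization is fused into the BRGEMM kernel rather than materialized as an f32 copy of the weights; the sketch above is only meant to pin down the arithmetic.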
Since a floating-point BRGEMM Matmul is already implemented for aarch64 (at least with SVE), the proposal is to extend it to support compressed weights, in the same way it is done for x64.
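For reference, this is roughly how a caller configures the existing x64 decompression path (f32 activations, int8 weights, per-output-channel weight scales/zero points) through the oneDNN 3.x C++ API; an aarch64 extension would presumably accept the same attributes. Treat it as a sketch: the two-argument `set_fpmath_mode(mode, apply_to_int)` overload, the mask values, and whether weight zero points are accepted in this exact configuration are assumptions based on recent oneDNN releases, not statements from this issue.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

// Sketch: f32 src/dst, s8 weights decompressed inside the matmul primitive.
// Dims: src [M, K], weights [K, N], dst [M, N].
matmul make_compressed_matmul(const engine &eng, memory::dim M, memory::dim K,
                              memory::dim N) {
    memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s8, memory::format_tag::any);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    primitive_attr attr;
    // Allow integer weights to be implicitly up-converted to the fp math
    // data type (assumed to be the knob the x64 decompression path keys off).
    attr.set_fpmath_mode(fpmath_mode::f16, /*apply_to_int=*/true);
    // Per-output-channel (dimension N) scales and zero points for the weights.
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, /*mask=*/1 << 1);
    attr.set_zero_points_mask(DNNL_ARG_WEIGHTS, /*mask=*/1 << 1);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    return matmul(pd);
}
```

At execution time the scales and zero points would then be supplied as `DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS` and `DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_WEIGHTS` arguments alongside the usual src/weights/dst memories.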
The request is to support the following options: