New/other Matrix multiplication algorithm implementation #1971

vineel96 · 2024-06-20T06:20:00Z

Hello,

The current implementation for matrix multiplication uses BRGEMM algorithm. Is there any implementation of "Low Rank Approximation approach" for matrix multiplication in oneDNN? Is there any experiments done by intel team for "Low Rank Approximation" implementation and found any bottlenecks for this algorithm compared to BRGEMM?
Any experiments done to replace broadcast kernel of BRGEMM algorithm with other kernels? Any performance comparison results for the same?
Thanks.

mgouicem · 2024-06-20T08:20:47Z

Hi @vineel96

I guess you are referring to sampling algorithms right? oneDNN typically does not do such things and implements algorithms that are equivalent (in terms of operations) to a naive Matrix Multiplication. Regarding brgemm, it is not an algorithm per-se, but mostly a flexible low level abstraction for assembly kernels.
BRGEMM has multiple strategies depending on problem shapes and ISA (e.g. AMX kernels do not rely on broadcast strategy). For implementations working on vector registers, we indeed use broadcast kernels (either implicit or explicit broadcasts). I think we did some experiments at some point with dot-product based approach (with horizontal reduction), but performance was typically low for problem shapes that are encountered in Deep Learning.

vineel96 · 2024-06-20T08:38:06Z

@mgouicem thanks for the information.

Yes. Is there a chance to implement lossless compression methods for matrices which reduces the GFLOP for matrix multiplication as a whole which helps in reducing its timing in onednn?
Any scope to further improve the MatMul by reducing its overall GFLOP? like introducing new algorithms ?

vpirogov · 2024-06-21T17:56:18Z

When it comes to lossless compression oneDNN supports sparse tensors with two packing schemes:

Compressed sparse row (CSR) widely used in scientific applications. Support for Coordinate format (COO) is coming shortly.
Hardware-specific packed formats.
You can find more information in the Developer Guide.

Another compression approach involves converting one of the matrices to low precision (int8 or int4). This is lossy compression method, but it's widely used to accelerate large language models. See GPT-Q RFC and matmul weights decompression example for more details.

If you have a proof of concept for other techniques please share.

vineel96 · 2024-06-22T07:57:02Z

Hi @vpirogov,
Thanks for the reply.

Do COO gives performance boost only if DL model uses sparse tensors? But most of the DL models from hugging face will not be using sparse tensors right? Also, does anyone is already implementing this work or can anyone take it up to complete?
Also anyone is working on GPT-Quantization (GPT-Q) enablement? If not, anyone can take it to complete?
Currently I am looking into lossless compression methods for matrix multiplication where we either reduce total FLOP of matmul or increase FLOP per sec as it scales across no of threads. I am looking for papers for this regard and found one (https://arxiv.org/pdf/2203.14540) dont know how it helps. Any links/papers/suggestions in this regard is appreciated.
Thanks.

vpirogov · 2024-06-27T17:10:24Z

Do COO gives performance boost only if DL model uses sparse tensors? But most of the DL models from hugging face will not be using sparse tensors right? Also, does anyone is already implementing this work or can anyone take it up to complete?

Sparse storage formats like COO and CSR require relatively high levels of zeroes in weights tensor to bring performance value. Think ~80% or more. In context of LLMs hardware specific weight compression on SPR is demonstrated to bring performance benefits on some models by OpenVINO.

Also anyone is working on GPT-Quantization (GPT-Q) enablement? If not, anyone can take it to complete?

GPT-Q is supported in oneDNN v3.5, which has int8 weights support on CPUs with Intel AMX and int8/int4 support on Intel GPUs. See release notes.

Currently I am looking into lossless compression methods for matrix multiplication where we either reduce total FLOP of matmul or increase FLOP per sec as it scales across no of threads. I am looking for papers for this regard and found one (https://arxiv.org/pdf/2203.14540) dont know how it helps. Any links/papers/suggestions in this regard is appreciated.

The paper seems to offer a generalization of CSR format that exploits data duplication in addition to dropping zeroes. I would expect it to be a marginal improvement over classical CSR storage schemes.

vineel96 · 2024-06-28T03:47:01Z

Thanks @vpirogov

For COO rfc , does anyone is already implementing this work or can anyone take it up to complete it?
Also is it allowed in oneDNN to keep extra checks on input matrix, say whether its diagonal matrix or some pattern matrix and based on that run an optimized algorithm? Similar to this paper: https://arxiv.org/abs/2208.07129 (Fast hardware-aware matrix-free algorithm for higher-order finite-element discretized matrix multivector products on distributed systems)

vpirogov · 2024-06-28T21:18:41Z

@vineel96, we are working on the initial COO support and GPU optimized version towards the next release, oneDNN v3.6. Feel free to contribute optimized implementation for your favorite platform.

Choosing algorithm based on the input matrix structure is possible with oneDNN, though this check would have to be done during primitive execution and may negatively affect performance.

vineel96 · 2024-07-02T04:41:33Z

@vpirogov
COO support and GPT-Q support is there for ARM cpu's? or anyone should to take up this task?

vpirogov · 2024-07-09T21:09:25Z

No, neither of these are supported on Arm CPUs.

@jondea, @milpuz01, do you have anything in the works for GPT-Q or sparse formats?

vineel96 · 2024-08-28T02:52:56Z

@vpirogov , any one worked and completed this RFC : https://github.com/oneapi-src/oneDNN/tree/rfcs/rfcs/20200812-multidim-matmul ? or its still need to be implemented? Do option 1 gives any performance boost compared to brgemm? Any comments on this RFC?

vpirogov · 2024-08-29T17:30:26Z

@vineel96, the RFC you are referring to is related to API. We extended matmul primitive on all platforms to support batched case. Performance is implementation specific and comparison to BRGEMM makes no sense here.

vineel96 added the question label Jun 20, 2024

vineel96 changed the title ~~Matrix multiplication algorithm implementation~~ New/other Matrix multiplication algorithm implementation Jun 20, 2024

vpirogov assigned vpirogov and mgouicem Jun 21, 2024

vpirogov added enhancement A feature or an optimization request platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 help wanted and removed question labels Jul 10, 2024

vpirogov unassigned vpirogov and mgouicem Jul 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New/other Matrix multiplication algorithm implementation #1971

New/other Matrix multiplication algorithm implementation #1971

vineel96 commented Jun 20, 2024

mgouicem commented Jun 20, 2024

vineel96 commented Jun 20, 2024

vpirogov commented Jun 21, 2024

vineel96 commented Jun 22, 2024 •

edited

Loading

vpirogov commented Jun 27, 2024

vineel96 commented Jun 28, 2024 •

edited

Loading

vpirogov commented Jun 28, 2024

vineel96 commented Jul 2, 2024

vpirogov commented Jul 9, 2024

vineel96 commented Aug 28, 2024

vpirogov commented Aug 29, 2024

New/other Matrix multiplication algorithm implementation #1971

New/other Matrix multiplication algorithm implementation #1971

Comments

vineel96 commented Jun 20, 2024

mgouicem commented Jun 20, 2024

vineel96 commented Jun 20, 2024

vpirogov commented Jun 21, 2024

vineel96 commented Jun 22, 2024 • edited Loading

vpirogov commented Jun 27, 2024

vineel96 commented Jun 28, 2024 • edited Loading

vpirogov commented Jun 28, 2024

vineel96 commented Jul 2, 2024

vpirogov commented Jul 9, 2024

vineel96 commented Aug 28, 2024

vpirogov commented Aug 29, 2024

vineel96 commented Jun 22, 2024 •

edited

Loading

vineel96 commented Jun 28, 2024 •

edited

Loading