
How to modify oneDNN to enable GEMM operation acceleration on your own hardware #2114

Closed
nanzh-19 opened this issue Sep 24, 2024 · 7 comments
Labels: platform:cpu-aarch64 (Codeowner: @oneapi-src/onednn-cpu-aarch64), question

Comments

@nanzh-19

My use case is inference acceleration on CPU using TensorFlow Serving, and my hardware architecture is AArch64 (ARMv8). I've noticed that with oneDNN enabled, the performance bottleneck is GEMM. I want to create a fused operator for GEMM and ReLU. Which parts of the code should I modify to improve performance? Thank you for your assistance!

@mgouicem added the platform:cpu-aarch64 (Codeowner: @oneapi-src/onednn-cpu-aarch64) label on Sep 24, 2024
@mgouicem
Contributor

Hi @nanzh-19, you can fuse ReLU with the matmul primitive at the oneDNN API level by using post-ops. You can find a full example of MatMul + ReLU here.
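For reference, a minimal sketch of what that looks like with the oneDNN 3.x C++ API (shapes and data types below are placeholders, not a recommendation):

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Placeholder GEMM sizes: dst[M, N] = src[M, K] * weights[K, N].
    const memory::dim M = 128, K = 256, N = 512;
    memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // Attach ReLU as an eltwise post-op so it is fused into the matmul
    // instead of being executed as a separate primitive.
    post_ops po;
    po.append_eltwise(algorithm::eltwise_relu, /*alpha=*/0.f, /*beta=*/0.f);
    primitive_attr attr;
    attr.set_post_ops(po);

    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
    auto prim = matmul(pd);

    memory src_mem(src_md, eng), wei_mem(wei_md, eng), dst_mem(dst_md, eng);
    // ... fill src_mem and wei_mem with your data ...
    prim.execute(strm, {{DNNL_ARG_SRC, src_mem},
                        {DNNL_ARG_WEIGHTS, wei_mem},
                        {DNNL_ARG_DST, dst_mem}});
    strm.wait();
    return 0;
}
```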

Tagging @milpuz01 @cfRod @jondea for guidance on TensorFlow integration.

@nanzh-19
Author

Hi @mgouicem, thank you for your response! We are aiming to optimize inference on a machine with an unreleased ARM architecture. It seems that oneDNN might not have specific information about our hardware, which could explain why we aren't achieving optimal performance. My understanding is that oneDNN is primarily optimized for existing ARM architectures. Is this correct? I appreciate your insights!

@theComputeKid
Contributor

@nanzh-19: as @mgouicem mentioned, oneDNN can pass a fused GEMM and ReLU down to ACL, where it is executed as a single optimised operation.

In addition to his example, you can pass activation info in GEMMInfo here: https://github.com/ARM-software/ComputeLibrary/blob/de7288cb71e6b9190f52e50a44ed68c309e4a041/arm_compute/function_info/GEMMInfo.h#L86

And then specify ReLU as here: https://github.com/ARM-software/ComputeLibrary/blob/de7288cb71e6b9190f52e50a44ed68c309e4a041/arm_compute/function_info/ActivationLayerInfo.h#L49
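If you want to experiment at the ACL level directly, a minimal sketch could look like the following; it assumes the `set_activation_info` setter from the GEMMInfo.h linked above, and the tensor shapes are placeholders:

```cpp
#include "arm_compute/core/TensorInfo.h"
#include "arm_compute/function_info/ActivationLayerInfo.h"
#include "arm_compute/function_info/GEMMInfo.h"
#include "arm_compute/runtime/NEON/functions/NEGEMM.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

int main() {
    // Placeholder GEMM sizes: d[M, N] = a[M, K] * b[K, N].
    const unsigned int M = 128, K = 256, N = 512;

    // ACL TensorShape is (width, height), i.e. (cols, rows).
    Tensor a, b, d;
    a.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::F32));
    b.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::F32));
    d.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::F32));

    // Describe the fused activation (alpha/beta are unused for plain ReLU).
    ActivationLayerInfo relu_info(ActivationLayerInfo::ActivationFunction::RELU);

    // Attach the activation to the GEMM descriptor so ACL can fuse it into
    // the arm_gemm kernel rather than running a separate activation layer.
    GEMMInfo gemm_info;
    gemm_info.set_activation_info(relu_info);

    NEGEMM gemm;
    gemm.configure(&a, &b, nullptr, &d, /*alpha=*/1.f, /*beta=*/0.f, gemm_info);

    a.allocator()->allocate();
    b.allocator()->allocate();
    d.allocator()->allocate();

    // ... fill a and b with your data ...
    gemm.run();
    return 0;
}
```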

Even if you have unreleased hardware, it is important to note that we optimise for architectural features such as SVE, NEON, and SME rather than for a specific machine description or vendor, so I believe you should still get good performance.

Hope that helps; feel free to ask if you have more questions.

@nanzh-19
Author

nanzh-19 commented Sep 24, 2024

Thank you for your comment. I have the following questions:

We believe that oneDNN has not been thoroughly optimized for our machine. For example, oneDNN is not aware of our machine's memory hierarchy, which may lead to suboptimal matrix blocking for GEMM. We would therefore like to optimize at the level of oneDNN's code.

Our reason for believing oneDNN is not optimized for our machine comes from comparing x86 servers with our own: when measuring inference performance on a single NUMA node, our machine's throughput degraded significantly once oneDNN was enabled.

[Figure: inference throughput comparison between the x86 servers and our machine on a single NUMA node.]

@mgouicem
Contributor

@nanzh-19 there could be multiple things at play here, and log files would be helpful if you can share any (see the note on ONEDNN_VERBOSE below).
In general, here are a few things to check:

  • Did you build oneDNN with ACL, or did you use oneDNN as part of a framework (and if the latter, which one)?
  • Your OS might have to include support for your custom hardware. In particular, the aarch64 jitted implementations get the system topology from a mix of hwcaps and system files; see the Linux code here.
  • Which threading runtime did you use?
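To generate such a log, you can set the `ONEDNN_VERBOSE=1` environment variable when running your workload: oneDNN then prints one line per primitive execution, including which implementation was dispatched (for an ACL-backed matmul you would expect to see an ACL implementation name in that field). Comparing that output between your x86 and AArch64 runs should show exactly which kernels each machine is executing.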

@nanzh-19
Author

nanzh-19 commented Sep 27, 2024

Thank you for your comment, @mgouicem. I've identified the reason for the performance discrepancy: on the ARM architecture, ACL's arm_gemm is being called, while on the x86 architecture, brgemm is used. My current question is how to modify TensorFlow's calls so that brgemm can be used on the ARM architecture as well.

@theComputeKid
Contributor

Depending on how you got or built TensorFlow, I believe the aarch64 build ships a much older version of oneDNN than the x86 build. If you want to investigate further, you could also try running benchdnn directly against the latest oneDNN from main and see the difference for your use case.
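For example, a matmul with a fused ReLU post-op can be timed with something like `./benchdnn --matmul --dt=f32 --attr-post-ops=relu 256x512:512x1024`; the sizes here are placeholders, so check the benchdnn documentation for the exact problem-descriptor syntax of your oneDNN version. Running it under `ONEDNN_VERBOSE=1` also shows which implementation gets picked.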
