How to modify oneDNN to enable GEMM operation acceleration on your own hardware #2114
Hi @mgouicem. Thank you for your response! We are aiming to optimize inference on an unpublished ARM architecture machine. It seems that oneDNN might not have specific information about our hardware, which could explain why we aren't achieving optimal performance. My understanding is that oneDNN is primarily optimized for existing ARM architectures. Is this correct? I appreciate your insights!
@nanzh-19: as @mgouicem mentioned, oneDNN can pass a fused GEMM and ReLU down to ACL, where it can execute an optimised operation. In addition to his example, you can pass activation info in GEMMInfo here:

https://github.com/ARM-software/ComputeLibrary/blob/de7288cb71e6b9190f52e50a44ed68c309e4a041/arm_compute/function_info/GEMMInfo.h#L86

and then specify ReLU as here:

https://github.com/ARM-software/ComputeLibrary/blob/de7288cb71e6b9190f52e50a44ed68c309e4a041/arm_compute/function_info/ActivationLayerInfo.h#L49

Even if you have unreleased hardware, it is important to note that we optimise for architectural features such as SVE/NEON/SME, rather than for a machine description or vendor, so you should still get good performance. Hope that helps, feel free to inquire more.
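A minimal sketch of how that fusion looks when calling ACL directly, assuming a recent Compute Library build and an f32 problem (the tensor shapes, alpha/beta values, and the `main()` scaffolding are illustrative, not taken from this thread):

```cpp
// Sketch: fused GEMM + ReLU via Compute Library's NEGEMM.
#include "arm_compute/core/Types.h"
#include "arm_compute/function_info/ActivationLayerInfo.h"
#include "arm_compute/function_info/GEMMInfo.h"
#include "arm_compute/runtime/NEON/functions/NEGEMM.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

int main()
{
    // A is M x K, B is K x N, dst is M x N.
    // ACL's TensorShape is {width, height}, i.e. {columns, rows}.
    Tensor a, b, dst;
    a.allocator()->init(TensorInfo(TensorShape(512U, 256U), 1, DataType::F32));   // K=512, M=256
    b.allocator()->init(TensorInfo(TensorShape(128U, 512U), 1, DataType::F32));   // N=128, K=512
    dst.allocator()->init(TensorInfo(TensorShape(128U, 256U), 1, DataType::F32)); // N=128, M=256

    // Attach the ReLU activation to GEMMInfo so the library can fuse it
    // into the GEMM instead of running a separate activation pass.
    GEMMInfo gemm_info{};
    gemm_info.set_activation_info(
        ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU));

    NEGEMM gemm;
    gemm.configure(&a, &b, nullptr /*bias*/, &dst, 1.0f /*alpha*/, 0.0f /*beta*/, gemm_info);

    a.allocator()->allocate();
    b.allocator()->allocate();
    dst.allocator()->allocate();

    // ... fill a and b with input data ...

    gemm.run(); // executes the GEMM with ReLU applied to the output
    return 0;
}
```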
Thank you for your comment. I have the following questions. We believe that oneDNN has not been thoroughly optimized for our machine. For example, oneDNN is not aware of our machine's memory hierarchy, which may lead to suboptimal matrix blocking for GEMM. Therefore, we would like to optimize starting from oneDNN's code. Our reason for believing oneDNN is under-optimized for our machine is a comparison between x86 servers and our own servers: when observing inference performance on a single NUMA node, we found that our machine's performance degraded significantly once oneDNN was enabled.
@nanzh-19 there could be multiple things at play here, and log files might be helpful (if you can share any).
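One easy way to produce such logs is oneDNN's verbose mode, which prints one line per primitive creation/execution, including the implementation that was dispatched, so you can see directly which kernel each GEMM goes to (`./your_inference_app` below is a placeholder for your own binary):

```sh
# Each line starts with "onednn_verbose" and names the chosen implementation.
ONEDNN_VERBOSE=1 ./your_inference_app 2>&1 | grep onednn_verbose
```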
Thank you for @mgouicem's comment. I've identified the reason for the performance discrepancy: on the ARM architecture, ACL's arm_gemm is being called, while on the x86 architecture brgemm is used. My current issue is how to modify TensorFlow's calls so that brgemm can be used on the ARM architecture as well.
Depending on how you got/built TensorFlow, I believe aarch64 ships a much older version of oneDNN than x86. If you want to investigate further, you could also try running benchdnn directly against the latest oneDNN from main and see the difference for your use case.
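For reference, a benchdnn invocation along these lines exercises a matmul with a fused ReLU post-op in performance mode (the problem shape is made up for illustration, and option spellings vary between oneDNN versions, so check the benchdnn docs for your build):

```sh
# Build oneDNN from main, then from the build directory:
cd tests/benchdnn
./benchdnn --matmul --mode=p --dt=f32 --attr-post-ops=relu 256x512:512x128
```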
My use case is inference acceleration on a CPU using TensorFlow Serving, and my hardware architecture is AArch64 (ARMv8). Currently, I've noticed that with oneDNN enabled, the performance bottleneck is in GEMM. I want to create a fused operator for GEMM and ReLU. Which parts of the code should I modify to improve performance? Thank you for your assistance!