Add build option for ARM NCHWc kernels #26171
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Added a comment regarding the performance of NCHWc ARM kernels and their default state.
Hi all, I have noticed a unit test failure, likely associated with this PR, when using KleidiAI on a Mac M4. Interestingly, the failure only happens when the test runs as part of the full onnxruntime_test_all suite; run in isolation, it passes. This points to some state that is not being reset between tests. Unit test name: NchwcOptimizerTests.ConvNoBiasAddFusion. Reproduce instructions: ./onnxruntime_test_all (shows the test failure)
Thanks @damdoo01-arm! Hi @Rohanjames1997, could you please take a look when you get a chance? Our partners from ARM recently encountered the above test failure, which seems to originate from the NCHWc ARM64 support (#25580). Thanks!
Hi @damdoo01-arm, thanks for reporting! I tried reproducing it, but I don't have the same setup. So a few questions:
Also, any idea why the CI did not catch this? @hariharans29 🤔
@hariharans29, please ensure you update the README or other documentation so that it is clear to all how to enable this amazing feature. Thanks!
Hi @Rohanjames1997 - If I were to take an educated guess, I think this will only repro on a machine that supports SME2 (such as the Mac M4), not on just any build with KleidiAI enabled. This is the PR that introduced the KleidiAI SME2 Conv kernels for ARM64: https://github.com/microsoft/onnxruntime/pull/25187/files#diff-ae80f8c17f8c3c31a01bff6f1058df55c4287ce3f6741a4bb73df3a24253b7c0. Perhaps there is an edge case to be accounted for somewhere at the boundary of the two PRs. Unfortunately, that is all I can think of right now.
"any idea why the CI did not catch this?"
We will document it and announce it in the next release. For now, enabling it is as simple as building from main with the build flag added in this PR.
Hi @Rohanjames1997,
Thanks @damdoo01-arm. Is the test failing only on an SME2-capable machine, as @hariharans29 suggested? I couldn't reproduce this on a Neoverse-V1 or a V2 machine.
Apologies for the delay, @Rohanjames1997. Since I have an M4, I can attempt to diagnose and fix it. I'll post here with any updates. Damien.
Description
Add a build option for new kernels introduced in #25580
Motivation and Context
This enables building ORT with NCHWc ARM kernels.
At the time of writing, it is turned OFF by default because it performs worse than the "regular" NCHW kernels
at smaller thread counts. However, it gives a non-negligible speed-up at higher thread counts on supporting
ARM platforms.
Once the gap is closed for smaller thread counts, it can be turned on by default.