[MLAS] Add 8-bit weights ARM64 Gemm implementation by hariharans29 · Pull Request #25110 · microsoft/onnxruntime

hariharans29 · 2025-06-18T23:24:34Z

Description

Enable 8-bit weights Gemm on ARM64 via MLAS

1. Supports 2 flavors of the 8-bit Gemm kernel - one uses vdotq (U8U8) and the other uses vusdotq (U8S8) on platforms where I8MM is supported.
2. Provides access to these new MLAS Gemm kernels via the MatmulNBits contrib operator
3. Tests:

MLAS
3 new sets of tests:
- SQ8BitQuantA: Tests the dynamic activation quantization MLAS kernel (fp32 -> uint8_t or fp32 -> int8_t on I8MM platforms)
- SQ8BitPrepack: Tests the prepacking of the weights for the 8-bit Gemm kernels
- SQ8BitGemm: Tests the 8-bit Gemm kernels

MatmulNBits contrib tests
- Enables the 8-bit Gemm tests on ARM64 (previously only enabled on x86)
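
For readers unfamiliar with the two kernel flavors, the following is a minimal scalar sketch of what one 32-bit lane of an unsigned-by-unsigned (U8U8) and an unsigned-by-signed (U8S8) 8-bit dot product accumulates. It is not the MLAS kernel from this PR: the function names `DotLaneU8U8`, `DotLaneU8S8`, and `DotU8S8` are hypothetical, and the sketch does not model how the PR assigns activations and weights to the two operands, nor any scales or zero points.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of one 32-bit lane of a U8U8 dot product:
// four u8 * u8 products are added to an unsigned 32-bit accumulator.
// (Hypothetical helper for illustration, not an MLAS function.)
uint32_t DotLaneU8U8(uint32_t acc, const uint8_t a[4], const uint8_t b[4]) {
    for (int i = 0; i < 4; ++i) {
        acc += static_cast<uint32_t>(a[i]) * static_cast<uint32_t>(b[i]);
    }
    return acc;
}

// Scalar model of one 32-bit lane of a U8S8 dot product:
// four u8 * s8 products are added to a signed 32-bit accumulator.
// (Hypothetical helper for illustration, not an MLAS function.)
int32_t DotLaneU8S8(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    for (int i = 0; i < 4; ++i) {
        acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
    }
    return acc;
}

// Dot product over k elements built from the U8S8 lane model.
// Assumes k is a multiple of 4, mirroring the 4-element lane grouping.
int32_t DotU8S8(const uint8_t* a, const int8_t* b, size_t k) {
    assert(k % 4 == 0);
    int32_t acc = 0;
    for (size_t i = 0; i < k; i += 4) {
        acc = DotLaneU8S8(acc, a + i, b + i);
    }
    return acc;
}
```

The practical difference between the two flavors in this model is only the signedness of the second operand and of the accumulator; everything else (grouping by four, widening to 32 bits) is the same.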

Motivation and Context

Enable 8-bit weights Gemm on ARM64 via MLAS

Based on work and contribution by @fajin-corp

Phi-4-mini-instruct perf numbers (before and after this change):

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/test/mlas/unittest/test_sq8bitgemm.cpp

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

hariharans29 · 2025-09-03T04:24:21Z

Pending final perf validation and accuracy verification after addressing PR comments

onnxruntime/core/mlas/lib/qnbitgemm.cpp

onnxruntime/core/mlas/inc/mlas_qnbit.h

onnxruntime/core/mlas/lib/platform.cpp

onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2.cpp

hariharans29 · 2025-09-04T21:51:14Z

Merging as the React Native pipeline failure is unrelated to this change

@fajin-corp

### Description

Enable 8-bit weights Gemm on ARM64 via MLAS

1. Supports 2 flavors of the 8-bit Gemm kernel - one uses `vdotq` (U8U8) and the other uses `vusdotq` (U8S8) on platforms where I8MM is supported.
2. Provides access to these new MLAS Gemm kernels via the `MatmulNBits` contrib operator
3. Tests:

   **MLAS**

   3 new sets of tests:
   - `SQ8BitQuantA`: Tests the dynamic activation quantization MLAS kernel (`fp32 -> uint8_t` or `fp32 -> int8_t` on I8MM platforms)
   - `SQ8BitPrepack`: Tests the prepacking of the weights for the 8-bit Gemm kernels
   - `SQ8BitGemm`: Tests the 8-bit Gemm kernels

   **MatmulNBits contrib tests**
   - Enables the 8-bit Gemm tests on ARM64 (previously only enabled on x86)

### Motivation and Context

Enable 8-bit weights Gemm on ARM64 via MLAS

Based on work and contribution by @fajin-corp

Phi-4-mini-instruct perf numbers (before and after this change):

<img width="593" height="179" alt="image" src="https://github.com/user-attachments/assets/d81b9059-b8db-407c-8c0f-527099f9358c" />

---------

Co-authored-by: Jing Fang <fajin@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

### Description

Cherry-pick the following PRs: #25943 #25937 #25917 #25909 #25898 #25897 #25888 #25881 #25830 #25619 #25575 #25572 #25558 #25530 #25474 #25455 #25110

Also two dependent PRs for qMoE cpu: #25877 #25822

---------

Co-authored-by: xiaomsft <136376084+xiaomsft@users.noreply.github.com>
Co-authored-by: Xiaoyan Hu <xiaoh@microsoft.com>
Co-authored-by: Akshay Sonawane <111780983+apsonawane@users.noreply.github.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
Co-authored-by: Pradeep Sakhamoori <psakhamoori@microsoft.com>
Co-authored-by: mingyue <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Emmanuel <91394589+kobby-kobbs@users.noreply.github.com>
Co-authored-by: Emmanuel Assumang <eassumang@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: praneshgo <pranesh.iitp@gmail.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Jing Fang <fajin@microsoft.com>
Co-authored-by: Ishwar Raut <iraut@nvidia.com>

…ts buffer (#25971)

### Description

The memory alignment for the pre-packed weights buffer was accidentally changed for 8-bit Gemms on x86 while adding support for the equivalent ARM64 8-bit Gemm kernel in #25110. Depending on the platform, this change in alignment could cause either a perf penalty or a seg-fault when the corresponding aligned data load instruction executes in the Gemm kernel. This change fixes it, adds back a couple of tests to the MLAS 8-bit Gemm test suite, and fixes a minor nit in the test file.

### Motivation and Context

Resolve packaging pipeline crash

…ts buffer (#25971)

### Description

The memory alignment for the pre-packed weights buffer was accidentally changed for 8-bit Gemms on x86 while adding support for the equivalent ARM64 8-bit Gemm kernel in #25110. Depending on the platform, this change in alignment could cause either a perf penalty or a seg-fault when the corresponding aligned data load instruction executes in the Gemm kernel. This change fixes it, adds back a couple of tests to the MLAS 8-bit Gemm test suite, and fixes a minor nit in the test file.

### Motivation and Context

Resolve packaging pipeline crash

(cherry picked from commit 96f4595)
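
The alignment issue described in #25971 can be illustrated with a small, generic sketch of how a buffer pointer is checked against, and rounded up to, a required alignment before aligned loads are issued. This is not the code from either PR: `IsAligned` and `AlignUp` are hypothetical helper names, and the 64-byte and 32-byte values used in the usage note are assumptions, since the actual alignment required by the x86 kernel is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if p is a multiple of `alignment` bytes.
// `alignment` must be a non-zero power of two.
// (Hypothetical helper for illustration, not an MLAS function.)
bool IsAligned(const void* p, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<uintptr_t>(p) & (alignment - 1)) == 0;
}

// Rounds p up to the next address that is a multiple of `alignment` bytes.
// The caller must have over-allocated by at least `alignment - 1` bytes.
// (Hypothetical helper for illustration, not an MLAS function.)
unsigned char* AlignUp(unsigned char* p, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const uintptr_t addr = reinterpret_cast<uintptr_t>(p);
    const uintptr_t mask = static_cast<uintptr_t>(alignment) - 1;
    const uintptr_t aligned = (addr + mask) & ~mask;
    return p + (aligned - addr);
}
```

As a usage example under those assumptions: given `alignas(64) unsigned char buf[256];`, `IsAligned(buf + 1, 64)` is false and `AlignUp(buf + 1, 64)` returns `buf + 64`. A kernel that issues aligned loads on a pointer for which `IsAligned` is false is the kind of situation the description above associates with a perf penalty or a seg-fault.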

fajin-corp and others added 11 commits April 28, 2025 23:29

finished prepack

b52a1ce

changed interface to support blocksum2

0523106

finished quantb for quant a unsigned

fd92ab8

finished quantize a

ed5cf8d

finished Q8Int8GemmR2xC8Neon

b9b9691

finished kernels

685baff

fixed build

6747330

passed prepack

b087317

finished ut for quant a

196c04c

fixed build

353d460

Merge remote-tracking branch 'origin/main' into hari/matmul8bits_arm

4d62e32

Rebasing

github-actions bot reviewed Jun 18, 2025

onnxruntime/test/mlas/unittest/test_sq8bitgemm.cpp

hariharans29 and others added 9 commits June 18, 2025 20:24

Comment out some 4 bit tests

e88e32d

Apple I8MM check

58011b0

Tests

acc4b81

Tests 2

2700493

Update onnxruntime/test/mlas/unittest/test_sq8bitgemm.cpp

76de326

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Changes

159d4d3

Fixes

e4bc74e

Re-enable 4 bit tests

e92055b

Stage

94f3022

github-actions bot reviewed Jun 25, 2025

hariharans29 added 3 commits June 25, 2025 01:25

Some tests work

61c1872

Git attempt

16da92b

Lint attempt

3ce481d

github-actions bot reviewed Jun 25, 2025

hariharans29 and others added 4 commits June 25, 2025 12:34

Update onnxruntime/test/mlas/unittest/test_sq8bitgemm.cpp

29f66bd

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

More changesc

987574b

Merge branch 'hari/matmul8bits_arm' of https://github.com/microsoft/o…

d921b06

…nnxruntime into hari/matmul8bits_arm

Fix tests

cf92e6f

jywu-msft previously approved these changes Sep 3, 2025

hariharans29 changed the title ~~[MLAS] Add 8-bit weights ARM64 Gemm implementation~~ [DO NOT MERGE][MLAS] Add 8-bit weights ARM64 Gemm implementation Sep 3, 2025

hariharans29 marked this pull request as draft September 3, 2025 04:28

More fixes

eefa72c

hariharans29 dismissed jywu-msft’s stale review via eefa72c September 3, 2025 06:58

hariharans29 marked this pull request as ready for review September 3, 2025 08:21

hariharans29 changed the title ~~[DO NOT MERGE][MLAS] Add 8-bit weights ARM64 Gemm implementation~~ [MLAS] Add 8-bit weights ARM64 Gemm implementation Sep 3, 2025

edgchen1 reviewed Sep 4, 2025

edgchen1 previously approved these changes Sep 4, 2025

PR comments

e1da3d5

hariharans29 dismissed edgchen1’s stale review via e1da3d5 September 4, 2025 18:21

Missed out on one

77dff22

edgchen1 previously approved these changes Sep 4, 2025

Remove guards

7404cb3

hariharans29 dismissed edgchen1’s stale review via 7404cb3 September 4, 2025 19:35

Merge remote-tracking branch 'origin/main' into hari/matmul8bits_arm

edb3d72

edgchen1 approved these changes Sep 4, 2025

hariharans29 merged commit 31dcc60 into main Sep 4, 2025
90 of 92 checks passed

hariharans29 deleted the hari/matmul8bits_arm branch September 4, 2025 21:51

tianleiwu mentioned this pull request Sep 4, 2025

cherry picks for 1.23.0 release #25959

Merged

tianleiwu added the cherry-picked label (Cherry-picked for a cherrypicks branch) and removed the release:1.23.0 label Sep 4, 2025

hariharans29 mentioned this pull request Sep 6, 2025

Revert accidental memory alignment change for x86 for prepacked weights buffer #25971

Merged


6 participants
