Add ARM SDPA layer implementation by Abandon-ht · Pull Request #6698 · Tencent/ncnn

Abandon-ht · 2026-05-03T13:52:35Z

Implement SDPA_arm for aarch64 platform by reusing Gemm_arm and Softmax_arm sub-layers.

Support fp16 storage via ASIMDHP detection
Support int8 scale term passthrough
Support kv cache and attention mask
Reference sdpa_x86.cpp architecture

Implement SDPA_arm for aarch64 platform by reusing Gemm_arm and Softmax_arm sub-layers. - Support fp16 storage via ASIMDHP detection - Support int8 scale term passthrough - Support kv cache and attention mask - Reference sdpa_x86.cpp architecture

tencent-adm · 2026-05-03T13:52:55Z

Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

codecov-commenter · 2026-05-04T06:42:00Z

Codecov Report

❌ Patch coverage is 99.04762% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.58%. Comparing base (d0d5063) to head (6e6bef7).

Files with missing lines	Patch %	Lines
src/layer/arm/sdpa_arm.cpp	99.04%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6698      +/-   ##
==========================================
- Coverage   93.95%   93.58%   -0.38%     
==========================================
  Files         933      934       +1     
  Lines      299730   299822      +92     
==========================================
- Hits       281624   280593    -1031     
- Misses      18106    19229    +1123

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

This PR adds an aarch64/ARM CPU implementation of the SDPA (scaled dot-product attention) layer by composing existing sub-layers (Gemm + Softmax), mirroring the existing x86 SDPA implementation and supporting optional attention mask + KV-cache.

Changes:

Introduce SDPA_arm layer class and lifecycle (create/destroy pipeline).
Implement SDPA_arm::forward() using Gemm for Q*K^T and Attn*V, plus Softmax for attention normalization.
Add ARM-specific option handling for fp16 storage (ASIMDHP) and int8 scale-term passthrough.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
src/layer/arm/sdpa_arm.h	Declares the new ARM SDPA layer and its Gemm/Softmax sub-layer pointers.
src/layer/arm/sdpa_arm.cpp	Implements ARM SDPA pipeline setup/teardown and the forward path with mask + KV-cache handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+
+    const int num_heads_per_group = num_heads / num_group;
+
+    Mat qk_cross(dst_seqlen, src_seqlen, num_heads, elemsize, opt.workspace_allocator);


+        return retqk;
+
+    Mat& top_blob = top_blobs[0];
+    top_blob.create(out_embed_dim, src_seqlen, num_heads, elemsize, opt.blob_allocator);


nihui · 2026-05-04T14:45:26Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e6bef7d3a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-04T14:52:06Z

+
+    const int num_heads_per_group = num_heads / num_group;
+
+    Mat qk_cross(dst_seqlen, src_seqlen, num_heads, elemsize, opt.workspace_allocator);


Allocate qk_cross as fp32 for Gemm output

qk_gemm is configured with output_elemtype = fp32, but qk_cross is allocated with query.elemsize. On ARM FP16 paths (use_fp16_storage enabled), this makes qk_cross.channel(i) 16-bit while Gemm needs 32-bit, so Gemm reallocates an internal output Mat instead of writing into the provided channel view; qk_cross then remains uninitialized before softmax, producing incorrect attention scores for fp16 inference.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-04T14:52:06Z

+        return retqk;
+
+    Mat& top_blob = top_blobs[0];
+    top_blob.create(out_embed_dim, src_seqlen, num_heads, elemsize, opt.blob_allocator);


Preallocate top_blob with fp32 elemsize

qkv_gemm is also configured to output fp32, but top_blob is created with query.elemsize. When the query is fp16, each top_blob.channel(i) view has 16-bit elemsize, so Gemm cannot reuse it and writes into a detached allocation; the original top_blob channels are not populated, yielding incorrect final SDPA outputs on fp16-capable ARM targets.

Useful? React with 👍 / 👎.

Add ARM SDPA layer implementation

f37865c

Implement SDPA_arm for aarch64 platform by reusing Gemm_arm and Softmax_arm sub-layers. - Support fp16 storage via ASIMDHP detection - Support int8 scale term passthrough - Support kv cache and attention mask - Reference sdpa_x86.cpp architecture

github-actions Bot added the arm label May 3, 2026

apply code-format changes

6e6bef7

nihui closed this May 4, 2026

nihui reopened this May 4, 2026

nihui requested a review from Copilot May 4, 2026 12:16

Copilot started reviewing on behalf of nihui May 4, 2026 12:17 View session

Copilot AI reviewed May 4, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ARM SDPA layer implementation#6698

Add ARM SDPA layer implementation#6698
Abandon-ht wants to merge 2 commits into
Tencent:masterfrom
Abandon-ht:master

Abandon-ht commented May 3, 2026

Uh oh!

tencent-adm commented May 3, 2026

Uh oh!

codecov-commenter commented May 4, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

nihui commented May 4, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 4, 2026

Uh oh!

chatgpt-codex-connector Bot May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants


		const int num_heads_per_group = num_heads / num_group;

		Mat qk_cross(dst_seqlen, src_seqlen, num_heads, elemsize, opt.workspace_allocator);

Conversation

Abandon-ht commented May 3, 2026

Uh oh!

tencent-adm commented May 3, 2026

Uh oh!

codecov-commenter commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

nihui commented May 4, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

codecov-commenter commented May 4, 2026 •

edited

Loading