Skip to content

Add ARM SDPA layer implementation#6698

Open
Abandon-ht wants to merge 2 commits into
Tencent:masterfrom
Abandon-ht:master
Open

Add ARM SDPA layer implementation#6698
Abandon-ht wants to merge 2 commits into
Tencent:masterfrom
Abandon-ht:master

Conversation

@Abandon-ht
Copy link
Copy Markdown
Contributor

Implement SDPA_arm for aarch64 platform by reusing Gemm_arm and Softmax_arm sub-layers.

  • Support fp16 storage via ASIMDHP detection
  • Support int8 scale term passthrough
  • Support kv cache and attention mask
  • Reference sdpa_x86.cpp architecture

Implement SDPA_arm for aarch64 platform by reusing Gemm_arm and
Softmax_arm sub-layers.
- Support fp16 storage via ASIMDHP detection
- Support int8 scale term passthrough
- Support kv cache and attention mask
- Reference sdpa_x86.cpp architecture
@github-actions github-actions Bot added the arm label May 3, 2026
@tencent-adm
Copy link
Copy Markdown
Member

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@nihui nihui closed this May 4, 2026
@nihui nihui reopened this May 4, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 4, 2026

Codecov Report

❌ Patch coverage is 99.04762% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.58%. Comparing base (d0d5063) to head (6e6bef7).

Files with missing lines Patch % Lines
src/layer/arm/sdpa_arm.cpp 99.04% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6698      +/-   ##
==========================================
- Coverage   93.95%   93.58%   -0.38%     
==========================================
  Files         933      934       +1     
  Lines      299730   299822      +92     
==========================================
- Hits       281624   280593    -1031     
- Misses      18106    19229    +1123     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an aarch64/ARM CPU implementation of the SDPA (scaled dot-product attention) layer by composing existing sub-layers (Gemm + Softmax), mirroring the existing x86 SDPA implementation and supporting optional attention mask + KV-cache.

Changes:

  • Introduce SDPA_arm layer class and lifecycle (create/destroy pipeline).
  • Implement SDPA_arm::forward() using Gemm for Q*K^T and Attn*V, plus Softmax for attention normalization.
  • Add ARM-specific option handling for fp16 storage (ASIMDHP) and int8 scale-term passthrough.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/layer/arm/sdpa_arm.h Declares the new ARM SDPA layer and its Gemm/Softmax sub-layer pointers.
src/layer/arm/sdpa_arm.cpp Implements ARM SDPA pipeline setup/teardown and the forward path with mask + KV-cache handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


const int num_heads_per_group = num_heads / num_group;

Mat qk_cross(dst_seqlen, src_seqlen, num_heads, elemsize, opt.workspace_allocator);
return retqk;

Mat& top_blob = top_blobs[0];
top_blob.create(out_embed_dim, src_seqlen, num_heads, elemsize, opt.blob_allocator);
@nihui
Copy link
Copy Markdown
Member

nihui commented May 4, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e6bef7d3a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


const int num_heads_per_group = num_heads / num_group;

Mat qk_cross(dst_seqlen, src_seqlen, num_heads, elemsize, opt.workspace_allocator);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Allocate qk_cross as fp32 for Gemm output

qk_gemm is configured with output_elemtype = fp32, but qk_cross is allocated with query.elemsize. On ARM FP16 paths (use_fp16_storage enabled), this makes qk_cross.channel(i) 16-bit while Gemm needs 32-bit, so Gemm reallocates an internal output Mat instead of writing into the provided channel view; qk_cross then remains uninitialized before softmax, producing incorrect attention scores for fp16 inference.

Useful? React with 👍 / 👎.

return retqk;

Mat& top_blob = top_blobs[0];
top_blob.create(out_embed_dim, src_seqlen, num_heads, elemsize, opt.blob_allocator);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preallocate top_blob with fp32 elemsize

qkv_gemm is also configured to output fp32, but top_blob is created with query.elemsize. When the query is fp16, each top_blob.channel(i) view has 16-bit elemsize, so Gemm cannot reuse it and writes into a detached allocation; the original top_blob channels are not populated, yielding incorrect final SDPA outputs on fp16-capable ARM targets.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants