Add ARM SDPA layer implementation#6698
Conversation
Implement SDPA_arm for aarch64 platform by reusing Gemm_arm and Softmax_arm sub-layers. - Support fp16 storage via ASIMDHP detection - Support int8 scale term passthrough - Support kv cache and attention mask - Reference sdpa_x86.cpp architecture
|
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6698 +/- ##
==========================================
- Coverage 93.95% 93.58% -0.38%
==========================================
Files 933 934 +1
Lines 299730 299822 +92
==========================================
- Hits 281624 280593 -1031
- Misses 18106 19229 +1123 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR adds an aarch64/ARM CPU implementation of the SDPA (scaled dot-product attention) layer by composing existing sub-layers (Gemm + Softmax), mirroring the existing x86 SDPA implementation and supporting optional attention mask + KV-cache.
Changes:
- Introduce
SDPA_armlayer class and lifecycle (create/destroy pipeline). - Implement
SDPA_arm::forward()using Gemm forQ*K^TandAttn*V, plus Softmax for attention normalization. - Add ARM-specific option handling for fp16 storage (ASIMDHP) and int8 scale-term passthrough.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/layer/arm/sdpa_arm.h | Declares the new ARM SDPA layer and its Gemm/Softmax sub-layer pointers. |
| src/layer/arm/sdpa_arm.cpp | Implements ARM SDPA pipeline setup/teardown and the forward path with mask + KV-cache handling. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| const int num_heads_per_group = num_heads / num_group; | ||
|
|
||
| Mat qk_cross(dst_seqlen, src_seqlen, num_heads, elemsize, opt.workspace_allocator); |
| return retqk; | ||
|
|
||
| Mat& top_blob = top_blobs[0]; | ||
| top_blob.create(out_embed_dim, src_seqlen, num_heads, elemsize, opt.blob_allocator); |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6e6bef7d3a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| const int num_heads_per_group = num_heads / num_group; | ||
|
|
||
| Mat qk_cross(dst_seqlen, src_seqlen, num_heads, elemsize, opt.workspace_allocator); |
There was a problem hiding this comment.
Allocate qk_cross as fp32 for Gemm output
qk_gemm is configured with output_elemtype = fp32, but qk_cross is allocated with query.elemsize. On ARM FP16 paths (use_fp16_storage enabled), this makes qk_cross.channel(i) 16-bit while Gemm needs 32-bit, so Gemm reallocates an internal output Mat instead of writing into the provided channel view; qk_cross then remains uninitialized before softmax, producing incorrect attention scores for fp16 inference.
Useful? React with 👍 / 👎.
| return retqk; | ||
|
|
||
| Mat& top_blob = top_blobs[0]; | ||
| top_blob.create(out_embed_dim, src_seqlen, num_heads, elemsize, opt.blob_allocator); |
There was a problem hiding this comment.
Preallocate top_blob with fp32 elemsize
qkv_gemm is also configured to output fp32, but top_blob is created with query.elemsize. When the query is fp16, each top_blob.channel(i) view has 16-bit elemsize, so Gemm cannot reuse it and writes into a detached allocation; the original top_blob channels are not populated, yielding incorrect final SDPA outputs on fp16-capable ARM targets.
Useful? React with 👍 / 👎.
Implement SDPA_arm for aarch64 platform by reusing Gemm_arm and Softmax_arm sub-layers.