feat(deepseek-v4): harden attention fast paths #93
Merged
Port DeepSeek V4 sparse attention hardening, local prefill top-k support, MXFP4 gather workspace reuse, and packed FP8 QKV DeepGEMM default fallback.

Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
lightseek-bot approved these changes on May 12, 2026
jasl added a commit to jasl/tokenspeed that referenced this pull request on May 12, 2026
…ightseekorg#93)

Squashes 19 SM12x commits (preserved at branch `codex/ds4-sm12x-poc-prerebase`) onto upstream/main, picking up lightseekorg#93 (DSv4 attention fast-path hardening), lightseekorg#94, lightseekorg#97, lightseekorg#99 and the intervening commits.

Original commit chain (newest first):
  ec18ff5 docs(ds4-sm12x): record T3-α prototype landing + 6% e2e regression
  97e63c9 feat(sm12x-moe): wire persistent dispatch through sm12x_mxfp4_moe_forward
  762705e fix(sm12x-moe): re-export new tensorcore wrappers from ops package init
  d7fc65e feat(sm12x-moe): wire tensorcore MoE forward orchestrator
  069dc17 feat(sm12x-moe): add tensorcore W2 GEMM + weighted-reduce kernels
  a4b813e feat(sm12x-moe): add tensorcore W13 GEMM kernel (T3-α step 1)
  2302fb0 docs(ds4-sm12x): record T1-α graph-capture island
  d3212ac fix(deepseek-v4): drop default max-topk gate on SM12x sparse-MLA fast path
  57534ba fix(deepseek-v4): make sparse-MLA prefill workspace capture-safe
  aa88a47 fix(deepseek-v4): vectorize indexer FP8 cache read for graph capture
  84773aa refactor(deepseek-v4): remove Triton siblings from SM12x output projection
  2b4fbf9 feat(deepseek-v4): add SM12x CUDA fused inverse-RoPE + FP8 quant kernel
  4ba3229 docs(rejected-experiments): record DSv4 einsum multi-token tile failure
  0831acd feat(deepseek-v4): port DeepGEMM SM120 FP8 einsum for output projection
  b2572c4 docs(rejected-experiments): record DSv4 output-proj B_TOKEN=16 tile failure
  aff5c53 fix(deepseek-v4): gate SM12x CUDA output-projection to tokens==1 only
  b20104a feat(platform): tighten SM12x scope to SM120 + SM121 whitelist
  7eb2845 feat(deepseek-v4): add SM12x CUDA attention output-projection einsum
  486ccc1 Add DeepSeek V4 SM12x PoC runtime and kernels

Conflicts resolved (3 files):
- test/runtime/test_deepseek_v4_config.py: merge imports -- keep both DeepseekV4Attention (upstream) and DeepseekV4Indexer (SM12x), plus _mhc_post_reference / _mhc_pre_reference that the SM12x tests still rely on.
- tokenspeed-kernel/.../csrc/deepseek_v4_attention_binding.cu: keep both upstream's deepseek_v4_indexer_topk_prefill forward decl + FFI export and the SM12x kernel forward decls/exports added by the PoC.
- tokenspeed-kernel/.../cuda/deepseek_v4_attention.py: keep both upstream's `indexer_topk_prefill` Python wrapper and the SM12x helper functions (mhc pre/post, sparse MLA fp8 cache bindings, inverse-RoPE grouped, etc.).

Follow-ups:
- Re-build the kernel on the SM120 workstation and verify focused regression suites still green.
- The T3-α diagnostic loop (~6% end-to-end persistent-vs-warp regression) resumes after this rebase lands.

Signed-off-by: jasl <jasl9187@hotmail.com>
jasl added a commit to jasl/tokenspeed that referenced this pull request on May 12, 2026
…fill gate

Upstream PR lightseekorg#93 added a ``num_reqs = int(metadata.seq_lens.numel())`` check to ``forward_deepseek_v4_prefill`` before deciding between the single-chunk path and the per-chunk loop. The SM12x backend test's ``_metadata`` helper was created before this gate existed and only set ``forward_mode``.

Add a single-request ``seq_lens=tensor([2])`` so the test keeps exercising the single-chunk path that still routes through ``_prefill_workspace`` + ``_forward_sparse_mla_reference`` (the methods the test mocks).

Signed-off-by: jasl <jasl9187@hotmail.com>
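The gate described in this commit message can be sketched roughly as follows. This is an illustration only, not the upstream code: `FakeSeqLens`, `choose_prefill_path`, and the two path labels are hypothetical names, and the rule that exactly one request selects the single-chunk path is an assumption inferred from the message. Only the expression `int(metadata.seq_lens.numel())` and the single-request fixture with length 2 come from the text above.

```python
from types import SimpleNamespace


class FakeSeqLens:
    """Hypothetical stand-in for a 1-D tensor; only numel() is modeled."""

    def __init__(self, values):
        self._values = list(values)

    def numel(self):
        # Number of elements, i.e. one entry per request in the batch.
        return len(self._values)


def choose_prefill_path(metadata):
    # Expression quoted from the commit message.
    num_reqs = int(metadata.seq_lens.numel())
    # Assumption: exactly one request selects the single-chunk path,
    # anything else goes through the per-chunk loop.
    return "single_chunk" if num_reqs == 1 else "per_chunk_loop"


# Mirrors the updated test fixture: one request with sequence length 2.
single = SimpleNamespace(forward_mode="prefill", seq_lens=FakeSeqLens([2]))
# A two-request batch, for contrast.
batched = SimpleNamespace(forward_mode="prefill", seq_lens=FakeSeqLens([2, 5]))
```

Under these assumptions, a metadata object that sets only `forward_mode` has no `seq_lens` for the gate to read, which is why the fixture needs the extra field.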
jasl added a commit to jasl/tokenspeed that referenced this pull request on May 12, 2026
… probe

A1 cached-scratch hypothesis was tested (commit 6245872) and ruled out: persistent_A1 = 16.12 / 57.76 ms vs persistent = 16.10 / 57.80 ms, effectively identical. The 6% end-to-end regression is therefore not graph-pool fragmentation and lives in the non-MoE kernels surrounding the orchestrator -- which the isolated 30-layer MoE-only graph bench also implicates (persistent IS faster in isolation).

Verdict: keep the tensorcore stack in tree, ship with `warp` as default, unblock the throughput goal by pivoting to other levers. The next-step section now lists T2-α (sparse-MLA SWA/C4/C128 split) as the highest-priority lever, with a DeepGEMM SF-transformation fallback probe right after (the post-rebase log shows DeepGEMM falling back to reference FP8 linear per layer, which may explain part of the 17.73 → 17.13 tok/s warp regression after upstream PR lightseekorg#93 landed). T3-α profile-driven diagnosis and W2-bias fuse stay on the queue but behind the higher-leverage attention work.

Signed-off-by: jasl <jasl9187@hotmail.com>
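The figures quoted in this commit message can be put on a common scale with a short calculation. The helper name `relative_change` and the variable names are hypothetical; the message does not say what the two millisecond figures in each pair measure, so they are labelled only as first and second.

```python
def relative_change(before, after):
    """Return (after - before) / before, expressed as a percentage."""
    return (after - before) / before * 100.0


# persistent -> persistent_A1, first and second timing figure (ms).
a1_first = relative_change(16.10, 16.12)
a1_second = relative_change(57.80, 57.76)

# warp throughput before and after upstream PR #93 landed (tok/s).
warp_after_rebase = relative_change(17.73, 17.13)

print(f"A1 first figure:  {a1_first:+.2f}%")
print(f"A1 second figure: {a1_second:+.2f}%")
print(f"warp throughput:  {warp_after_rebase:+.2f}%")
```

Both A1 deltas come out at roughly a tenth of a percent, in opposite directions, which matches the "effectively identical" reading. The warp throughput drop is about 3.4%.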
jasl added a commit to jasl/tokenspeed that referenced this pull request on May 12, 2026
Upstream PR lightseekorg#93 added a pre-flight DeepGEMM ``fp8_gemm_nt`` call to ``DeepseekV4Attention._compute_qr_kv``: on success it replaces the reference FP8 linear path, on failure it logs a WARNING per layer and falls back. DeepGEMM does not support SM120/SM121 yet (see PR ``deepseek-ai/DeepGEMM#324`` + ``reference_deepgemm_sm120`` memory), so on the RTX Pro 6000 workstation every layer fires:

    DeepSeek V4 DeepGEMM FP8 linear failed; falling back to reference FP8 linear.
    reason=RuntimeError: Assertion error (csrc/apis/layout.hpp:59): Unknown SF transformation

The existing per-layer ``_deepseek_v4_deep_gemm_linear_disabled`` flag already catches this for steady-state replay, but it costs one failed call + one WARNING per layer at boot. Mirror the pattern used by ``_deepseek_v4_deepgemm_fp4_indexer_enabled_for_platform``: short-circuit ``_deepseek_v4_get_fp8_linear_deep_gemm`` to ``None`` on SM12x so the platform never tries the DeepGEMM path. Non-SM12x platforms keep the new fast path.

Signed-off-by: jasl <jasl9187@hotmail.com>
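A minimal sketch of the two mechanisms this message describes: a per-layer disabled flag that trips after the first failure, and a platform short-circuit that avoids the failed call altogether. Everything here is hypothetical scaffolding (`SM12X_CAPABILITIES`, `get_fp8_linear_deep_gemm`, `Layer`, and passing the compute capability as a tuple); it does not reproduce the real tokenspeed or DeepGEMM code.

```python
import logging

logger = logging.getLogger("deepseek_v4_sketch")

# Assumption: SM120 and SM121 expressed as (major, minor) compute capability.
SM12X_CAPABILITIES = {(12, 0), (12, 1)}


def get_fp8_linear_deep_gemm(capability, deep_gemm_fn):
    """Return the DeepGEMM callable, or None on SM12x (hypothetical helper)."""
    if capability in SM12X_CAPABILITIES:
        return None  # Platform short-circuit: never attempt DeepGEMM.
    return deep_gemm_fn


class Layer:
    def __init__(self, capability, deep_gemm_fn):
        self._deep_gemm = get_fp8_linear_deep_gemm(capability, deep_gemm_fn)
        # Starts disabled when the platform gate returned None.
        self._deep_gemm_disabled = self._deep_gemm is None

    def linear(self, x, reference_fn):
        if not self._deep_gemm_disabled:
            try:
                return self._deep_gemm(x)
            except RuntimeError as exc:
                # Per-layer fallback: warn once, then stay on the reference path.
                logger.warning("DeepGEMM FP8 linear failed; falling back. reason=%s", exc)
                self._deep_gemm_disabled = True
        return reference_fn(x)
```

With only the per-layer flag, each layer pays one failed call and one warning before settling on the reference path. With the platform gate in front, an SM12x layer starts out disabled and pays neither.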
Summary
Validation