feat(deepseek-v4): harden attention fast paths by dongjiyingdjy · Pull Request #93 · lightseekorg/tokenspeed

dongjiyingdjy · 2026-05-12T09:15:31Z

Summary

- add and wire the DeepSeek V4 prefill indexer top-k CUDA path
- harden DeepSeek V4 sparse attention and prefill metadata handling
- add QKV FP8 DeepGEMM fallback handling and packed UE8M0 activation-scale support
- tighten CUDA/DeepGEMM build discovery to current CUDA/Python environment paths
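
For context on "packed UE8M0": UE8M0 is usually described as an unsigned, exponent-only 8-bit scale format in which a byte `e` stands for the power of two `2^(e - 127)`, and "packed" commonly means several such bytes share one 32-bit word. The sketch below illustrates that idea in pure Python. The function names, the round-up policy, the clamp range, and the little-endian byte order are assumptions made for illustration and are not taken from this PR.

```python
import math
from typing import List

UE8M0_BIAS = 127  # assumed exponent bias, as in common E8M0 descriptions


def encode_ue8m0(scale: float) -> int:
    """Round a positive scale UP to a power of two; return its biased exponent byte."""
    if not scale > 0.0 or math.isinf(scale):
        raise ValueError("scale must be a positive finite number")
    mantissa, exponent = math.frexp(scale)  # scale == mantissa * 2**exponent, 0.5 <= mantissa < 1
    # An exact power of two has mantissa 0.5; anything else rounds up to 2**exponent.
    power = exponent - 1 if mantissa == 0.5 else exponent
    # Assumed clamp: keep the byte in 0..254 and leave 255 unused.
    return max(0, min(254, power + UE8M0_BIAS))


def decode_ue8m0(byte: int) -> float:
    """Turn a biased exponent byte back into the power-of-two scale it encodes."""
    return math.ldexp(1.0, byte - UE8M0_BIAS)


def pack_ue8m0(four_bytes: List[int]) -> int:
    """Pack four exponent bytes into one 32-bit word (little-endian order is assumed)."""
    if len(four_bytes) != 4:
        raise ValueError("expected exactly four bytes")
    word = 0
    for index, byte in enumerate(four_bytes):
        word |= (byte & 0xFF) << (8 * index)
    return word


def unpack_ue8m0(word: int) -> List[int]:
    """Inverse of pack_ue8m0."""
    return [(word >> (8 * index)) & 0xFF for index in range(4)]
```

Rounding up rather than to nearest is one possible policy: it keeps the decoded scale at least as large as the requested one, so values divided by it cannot exceed the range the original scale was chosen for.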

Validation

- pre-commit run --all-files
- GSM8K 50 samples on DeepSeek-V4-Flash: strict exact_match 0.96 +/- 0.028, flexible exact_match 0.96 +/- 0.028
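
As a sanity check on the reported numbers, the `+/- 0.028` figure is consistent with a standard error of the mean over 50 binary outcomes. The sketch below assumes 48 of 50 samples were scored correct (which gives 0.96) and that the harness reports sample standard deviation divided by the square root of the sample count. Both are assumptions, since the PR does not state how the interval was computed.

```python
import math
import statistics

# Assumption: 48 correct answers out of 50 samples yields exact_match = 0.96.
outcomes = [1.0] * 48 + [0.0] * 2

accuracy = statistics.mean(outcomes)
# Standard error of the mean: sample standard deviation / sqrt(n).
stderr = statistics.stdev(outcomes) / math.sqrt(len(outcomes))

print(f"exact_match = {accuracy:.2f} +/- {stderr:.3f}")
```

With only 50 samples the interval is wide, so a score of 0.96 mainly shows that the change did not break accuracy rather than measuring a small quality difference.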

…ightseekorg#93)

Squashes 19 SM12x commits (preserved at branch `codex/ds4-sm12x-poc-prerebase`) onto upstream/main, picking up lightseekorg#93 (DSv4 attention fast-path hardening), lightseekorg#94, lightseekorg#97, lightseekorg#99 and the intervening commits.

Original commit chain (newest first):

ec18ff5 docs(ds4-sm12x): record T3-α prototype landing + 6% e2e regression
97e63c9 feat(sm12x-moe): wire persistent dispatch through sm12x_mxfp4_moe_forward
762705e fix(sm12x-moe): re-export new tensorcore wrappers from ops package init
d7fc65e feat(sm12x-moe): wire tensorcore MoE forward orchestrator
069dc17 feat(sm12x-moe): add tensorcore W2 GEMM + weighted-reduce kernels
a4b813e feat(sm12x-moe): add tensorcore W13 GEMM kernel (T3-α step 1)
2302fb0 docs(ds4-sm12x): record T1-α graph-capture island
d3212ac fix(deepseek-v4): drop default max-topk gate on SM12x sparse-MLA fast path
57534ba fix(deepseek-v4): make sparse-MLA prefill workspace capture-safe
aa88a47 fix(deepseek-v4): vectorize indexer FP8 cache read for graph capture
84773aa refactor(deepseek-v4): remove Triton siblings from SM12x output projection
2b4fbf9 feat(deepseek-v4): add SM12x CUDA fused inverse-RoPE + FP8 quant kernel
4ba3229 docs(rejected-experiments): record DSv4 einsum multi-token tile failure
0831acd feat(deepseek-v4): port DeepGEMM SM120 FP8 einsum for output projection
b2572c4 docs(rejected-experiments): record DSv4 output-proj B_TOKEN=16 tile failure
aff5c53 fix(deepseek-v4): gate SM12x CUDA output-projection to tokens==1 only
b20104a feat(platform): tighten SM12x scope to SM120 + SM121 whitelist
7eb2845 feat(deepseek-v4): add SM12x CUDA attention output-projection einsum
486ccc1 Add DeepSeek V4 SM12x PoC runtime and kernels

Conflicts resolved (3 files):

- test/runtime/test_deepseek_v4_config.py: merge imports -- keep both DeepseekV4Attention (upstream) and DeepseekV4Indexer (SM12x), plus _mhc_post_reference / _mhc_pre_reference that the SM12x tests still rely on.
- tokenspeed-kernel/.../csrc/deepseek_v4_attention_binding.cu: keep both upstream's deepseek_v4_indexer_topk_prefill forward decl + FFI export and the SM12x kernel forward decls/exports added by the PoC.
- tokenspeed-kernel/.../cuda/deepseek_v4_attention.py: keep both upstream's `indexer_topk_prefill` Python wrapper and the SM12x helper functions (mhc pre/post, sparse MLA fp8 cache bindings, inverse-RoPE grouped, etc.).

Follow-ups:

- Re-build the kernel on the SM120 workstation and verify focused regression suites still green.
- The T3-α diagnostic loop (~6% end-to-end persistent-vs-warp regression) resumes after this rebase lands.

Signed-off-by: jasl <jasl9187@hotmail.com>

…fill gate

Upstream PR lightseekorg#93 added a ``num_reqs = int(metadata.seq_lens.numel())`` check to ``forward_deepseek_v4_prefill`` before deciding between the single-chunk path and the per-chunk loop. The SM12x backend test's ``_metadata`` helper was created before this gate existed and only set ``forward_mode``. Add a single-request ``seq_lens=tensor([2])`` so the test keeps exercising the single-chunk path that still routes through ``_prefill_workspace`` + ``_forward_sparse_mla_reference`` (the methods the test mocks).

Signed-off-by: jasl <jasl9187@hotmail.com>
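
The gate described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `FakeTensor` stands in for `torch.Tensor` so the sketch runs without PyTorch, `choose_prefill_path` and `make_test_metadata` are hypothetical names, and the rule that exactly one request selects the single-chunk path is an assumption inferred from the commit message.

```python
from types import SimpleNamespace


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor that exposes only numel()."""

    def __init__(self, values):
        self._values = list(values)

    def numel(self) -> int:
        return len(self._values)


def choose_prefill_path(metadata) -> str:
    # Same expression as the check quoted in the commit message.
    num_reqs = int(metadata.seq_lens.numel())
    # Assumption: one request takes the single-chunk path, otherwise the per-chunk loop runs.
    return "single_chunk" if num_reqs == 1 else "per_chunk_loop"


def make_test_metadata():
    # The forward_mode value is a placeholder; seq_lens mirrors tensor([2]) from the commit.
    return SimpleNamespace(forward_mode="prefill", seq_lens=FakeTensor([2]))


print(choose_prefill_path(make_test_metadata()))
```

A test helper that sets only `forward_mode` has no `seq_lens` for the gate to inspect, which is why the helper needed the extra field once the check was introduced.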

… probe

A1 cached-scratch hypothesis was tested (commit 6245872) and ruled out: persistent_A1 = 16.12 / 57.76 ms vs persistent = 16.10 / 57.80 ms, effectively identical. The 6% end-to-end regression is therefore not graph-pool fragmentation and lives in the non-MoE kernels surrounding the orchestrator -- which the isolated 30-layer MoE-only graph bench also implicates (persistent IS faster in isolation).

Verdict: keep the tensorcore stack in tree, ship with `warp` as default, unblock the throughput goal by pivoting to other levers.

The next-step section now lists T2-α (sparse-MLA SWA/C4/C128 split) as the highest-priority lever, with a DeepGEMM SF-transformation fallback probe right after (the post-rebase log shows DeepGEMM falling back to reference FP8 linear per layer, which may explain part of the 17.73 → 17.13 tok/s warp regression after upstream PR lightseekorg#93 landed). T3-α profile-driven diagnosis and W2-bias fuse stay on the queue but behind the higher-leverage attention work.

Signed-off-by: jasl <jasl9187@hotmail.com>
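
To put the figures quoted in the commit message above on a common scale, the relative changes can be computed directly. The helper name `relative_change_pct` is hypothetical, and the sketch does not assume what the two timings in each pair measure, since the commit message does not say.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Signed change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100.0


# persistent -> persistent_A1, for the first and second reported timing (ms).
a1_first = relative_change_pct(16.10, 16.12)
a1_second = relative_change_pct(57.80, 57.76)

# warp throughput before and after the rebase (tok/s).
warp_throughput = relative_change_pct(17.73, 17.13)

print(f"A1 vs persistent: {a1_first:+.2f}% / {a1_second:+.2f}%")
print(f"warp throughput:  {warp_throughput:+.2f}%")
```

The two A1 deltas come out near a tenth of a percent in opposite directions, which supports reading them as noise, while the warp throughput change is roughly 3.4%.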

Upstream PR lightseekorg#93 added a pre-flight DeepGEMM ``fp8_gemm_nt`` call to ``DeepseekV4Attention._compute_qr_kv``: on success it replaces the reference FP8 linear path, on failure it logs a WARNING per layer and falls back. DeepGEMM does not support SM120/SM121 yet (see PR ``deepseek-ai/DeepGEMM#324`` + ``reference_deepgemm_sm120`` memory), so on the RTX Pro 6000 workstation every layer fires:

    DeepSeek V4 DeepGEMM FP8 linear failed; falling back to reference FP8 linear. reason=RuntimeError: Assertion error (csrc/apis/layout.hpp:59): Unknown SF transformation

The existing per-layer ``_deepseek_v4_deep_gemm_linear_disabled`` flag already catches this for steady-state replay, but it costs one failed call + one WARNING per layer at boot. Mirror the pattern used by ``_deepseek_v4_deepgemm_fp4_indexer_enabled_for_platform``: short-circuit ``_deepseek_v4_get_fp8_linear_deep_gemm`` to ``None`` on SM12x so the platform never tries the DeepGEMM path. Non-SM12x platforms keep the new fast path.

Signed-off-by: jasl <jasl9187@hotmail.com>
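
The two mechanisms in the commit message above, a per-layer "disable after first failure" flag and a platform short-circuit that avoids the first failure entirely, might look roughly like this. It is a sketch under stated assumptions: `get_fp8_linear_deep_gemm`, `LayerSketch`, and the `(major, minor)` capability encoding are hypothetical, and the real functions in the repository take different arguments.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger("deepseek_v4_sketch")

# Assumption: SM120 and SM121 are identified by (major, minor) compute capability.
SM12X_CAPABILITIES = {(12, 0), (12, 1)}


def get_fp8_linear_deep_gemm(
    capability: Tuple[int, int], deep_gemm_fn: Callable
) -> Optional[Callable]:
    """Return None on SM12x so callers never attempt the DeepGEMM path there."""
    if capability in SM12X_CAPABILITIES:
        return None
    return deep_gemm_fn


class LayerSketch:
    """One layer with a DeepGEMM fast path and a reference fallback."""

    def __init__(self, capability, deep_gemm_fn, reference_fn):
        self._deep_gemm = get_fp8_linear_deep_gemm(capability, deep_gemm_fn)
        self._reference = reference_fn
        self._deep_gemm_disabled = False  # per-layer flag, set after the first failure

    def linear(self, x):
        if self._deep_gemm is not None and not self._deep_gemm_disabled:
            try:
                return self._deep_gemm(x)
            except RuntimeError as exc:
                logger.warning(
                    "DeepGEMM FP8 linear failed; falling back to reference FP8 linear. reason=%s",
                    exc,
                )
                self._deep_gemm_disabled = True
        return self._reference(x)
```

With the flag alone, an unsupported platform still pays one failed call and one warning per layer. With the short-circuit, the fast-path function is never installed on SM12x, so neither cost occurs.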

dongjiyingdjy added 2 commits May 12, 2026 09:11

feat(deepseek-v4): harden attention fast paths

b24ac2e

Port DeepSeek V4 sparse attention hardening, local prefill top-k support, MXFP4 gather workspace reuse, and packed FP8 QKV DeepGEMM default fallback. Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

fix(deepseek-v4): tighten fast path hardening

19b19da

Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

dongjiyingdjy requested a review from a team as a code owner May 12, 2026 09:15

lightseek-bot approved these changes May 12, 2026

lightseek-bot merged commit 683df07 into main May 12, 2026
53 of 54 checks passed

lightseek-bot deleted the pr-stack/v4-pr4 branch May 12, 2026 18:11

lightseek-bot merged 2 commits into main from pr-stack/v4-pr4
