feat(deepseek-v4): harden attention fast paths #93

Merged
lightseek-bot merged 2 commits into main from pr-stack/v4-pr4 on May 12, 2026

@dongjiyingdjy
Contributor

Summary

  • add and wire the DeepSeek V4 prefill indexer top-k CUDA path
  • harden DeepSeek V4 sparse attention and prefill metadata handling
  • add QKV FP8 DeepGEMM fallback handling and packed UE8M0 activation-scale support
  • tighten CUDA/DeepGEMM build discovery to current CUDA/Python environment paths

Validation

  • pre-commit run --all-files
  • GSM8K 50 samples on DeepSeek-V4-Flash: strict exact_match 0.96 +/- 0.028, flexible exact_match 0.96 +/- 0.028

Port DeepSeek V4 sparse attention hardening, local prefill top-k support, MXFP4 gather workspace reuse, and packed FP8 QKV DeepGEMM default fallback.

Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
@dongjiyingdjy dongjiyingdjy requested a review from a team as a code owner May 12, 2026 09:15
@lightseek-bot lightseek-bot merged commit 683df07 into main May 12, 2026
53 of 54 checks passed
@lightseek-bot lightseek-bot deleted the pr-stack/v4-pr4 branch May 12, 2026 18:11
jasl added a commit to jasl/tokenspeed that referenced this pull request May 12, 2026
…ightseekorg#93)

Squashes 19 SM12x commits (preserved at branch
`codex/ds4-sm12x-poc-prerebase`) onto upstream/main, picking up lightseekorg#93
(DSv4 attention fast-path hardening), lightseekorg#94, lightseekorg#97, lightseekorg#99 and the
intervening commits.

Original commit chain (newest first):
  ec18ff5 docs(ds4-sm12x): record T3-α prototype landing + 6% e2e regression
  97e63c9 feat(sm12x-moe): wire persistent dispatch through sm12x_mxfp4_moe_forward
  762705e fix(sm12x-moe): re-export new tensorcore wrappers from ops package init
  d7fc65e feat(sm12x-moe): wire tensorcore MoE forward orchestrator
  069dc17 feat(sm12x-moe): add tensorcore W2 GEMM + weighted-reduce kernels
  a4b813e feat(sm12x-moe): add tensorcore W13 GEMM kernel (T3-α step 1)
  2302fb0 docs(ds4-sm12x): record T1-α graph-capture island
  d3212ac fix(deepseek-v4): drop default max-topk gate on SM12x sparse-MLA fast path
  57534ba fix(deepseek-v4): make sparse-MLA prefill workspace capture-safe
  aa88a47 fix(deepseek-v4): vectorize indexer FP8 cache read for graph capture
  84773aa refactor(deepseek-v4): remove Triton siblings from SM12x output projection
  2b4fbf9 feat(deepseek-v4): add SM12x CUDA fused inverse-RoPE + FP8 quant kernel
  4ba3229 docs(rejected-experiments): record DSv4 einsum multi-token tile failure
  0831acd feat(deepseek-v4): port DeepGEMM SM120 FP8 einsum for output projection
  b2572c4 docs(rejected-experiments): record DSv4 output-proj B_TOKEN=16 tile failure
  aff5c53 fix(deepseek-v4): gate SM12x CUDA output-projection to tokens==1 only
  b20104a feat(platform): tighten SM12x scope to SM120 + SM121 whitelist
  7eb2845 feat(deepseek-v4): add SM12x CUDA attention output-projection einsum
  486ccc1 Add DeepSeek V4 SM12x PoC runtime and kernels

Conflicts resolved (3 files):
  - test/runtime/test_deepseek_v4_config.py:
      merge imports -- keep both DeepseekV4Attention (upstream) and
      DeepseekV4Indexer (SM12x), plus _mhc_post_reference /
      _mhc_pre_reference that the SM12x tests still rely on.
  - tokenspeed-kernel/.../csrc/deepseek_v4_attention_binding.cu:
      keep both upstream's deepseek_v4_indexer_topk_prefill forward
      decl + FFI export and the SM12x kernel forward decls/exports
      added by the PoC.
  - tokenspeed-kernel/.../cuda/deepseek_v4_attention.py:
      keep both upstream's `indexer_topk_prefill` Python wrapper and
      the SM12x helper functions (mhc pre/post, sparse MLA fp8 cache
      bindings, inverse-RoPE grouped, etc.).

Follow-ups:
  - Rebuild the kernel on the SM120 workstation and verify that the
    focused regression suites are still green.
  - The T3-α diagnostic loop (~6% end-to-end persistent-vs-warp
    regression) resumes after this rebase lands.
Signed-off-by: jasl <jasl9187@hotmail.com>
jasl added a commit to jasl/tokenspeed that referenced this pull request May 12, 2026
…fill gate

Upstream PR lightseekorg#93 added a ``num_reqs = int(metadata.seq_lens.numel())`` check
to ``forward_deepseek_v4_prefill`` before deciding between the single-chunk
path and the per-chunk loop. The SM12x backend test's ``_metadata`` helper
was created before this gate existed and only set ``forward_mode``.

Add a single-request ``seq_lens=tensor([2])`` so the test keeps exercising
the single-chunk path that still routes through ``_prefill_workspace`` +
``_forward_sparse_mla_reference`` (the methods the test mocks).

Signed-off-by: jasl <jasl9187@hotmail.com>
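
The gate described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `FakeTensor`, `Metadata`, and `choose_prefill_path` are hypothetical stand-ins, and the `num_reqs == 1` condition is an assumption about how the gate selects the single-chunk path. Only the `int(metadata.seq_lens.numel())` expression comes from the commit message.

```python
from dataclasses import dataclass
from typing import List, Optional


class FakeTensor:
    """Tiny stand-in for a 1-D tensor; only numel() is modelled."""

    def __init__(self, values: List[int]):
        self.values = list(values)

    def numel(self) -> int:
        return len(self.values)


@dataclass
class Metadata:
    """Hypothetical metadata object; the real one carries more fields."""

    forward_mode: str
    seq_lens: Optional[FakeTensor] = None


def choose_prefill_path(metadata: Metadata) -> str:
    """Pick a prefill path from the number of requests (assumed rule)."""
    if metadata.seq_lens is None:
        # A helper that only sets forward_mode fails here, which is why
        # the test helper needed a seq_lens value after the gate landed.
        raise ValueError("metadata.seq_lens is required by the prefill gate")
    num_reqs = int(metadata.seq_lens.numel())
    # Assumption: exactly one request takes the single-chunk path.
    return "single_chunk" if num_reqs == 1 else "per_chunk_loop"


# A single request of length 2, mirroring seq_lens=tensor([2]) in the test.
single = Metadata(forward_mode="prefill", seq_lens=FakeTensor([2]))
batched = Metadata(forward_mode="prefill", seq_lens=FakeTensor([2, 5, 7]))
print(choose_prefill_path(single))   # single_chunk
print(choose_prefill_path(batched))  # per_chunk_loop
```

Under these assumptions, a helper that sets only `forward_mode` no longer reaches the single-chunk path, while a one-element `seq_lens` does.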
jasl added a commit to jasl/tokenspeed that referenced this pull request May 12, 2026
… probe

A1 cached-scratch hypothesis was tested (commit 6245872) and ruled out:
persistent_A1 = 16.12 / 57.76 ms vs persistent = 16.10 / 57.80 ms,
effectively identical. The 6% end-to-end regression is therefore not
graph-pool fragmentation and lives in the non-MoE kernels surrounding
the orchestrator -- which the isolated 30-layer MoE-only graph bench
also implicates (persistent IS faster in isolation).

Verdict: keep the tensorcore stack in tree, ship with `warp` as default,
unblock the throughput goal by pivoting to other levers. The next-step
section now lists T2-α (sparse-MLA SWA/C4/C128 split) as the
highest-priority lever, with a DeepGEMM SF-transformation fallback
probe right after (the post-rebase log shows DeepGEMM falling back to
reference FP8 linear per layer, which may explain part of the
17.73 → 17.13 tok/s warp regression after upstream PR lightseekorg#93 landed).
T3-α profile-driven diagnosis and W2-bias fuse stay on the queue but
behind the higher-leverage attention work.

Signed-off-by: jasl <jasl9187@hotmail.com>
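
For scale, the figures quoted in the commit above can be turned into relative changes. The short sketch below uses only the numbers from the message; `relative_change` is a helper introduced here for illustration, and the two A1 values are compared pairwise without assuming what each one measures.

```python
def relative_change(before: float, after: float) -> float:
    """Return (after - before) / before as a fraction."""
    return (after - before) / before


# warp throughput before and after upstream PR #93 (tok/s).
warp_change = relative_change(17.73, 17.13)

# persistent vs persistent_A1, first and second quoted values.
a1_first = relative_change(16.10, 16.12)
a1_second = relative_change(57.80, 57.76)

print(f"warp tok/s change:      {warp_change:+.2%}")  # -3.38%
print(f"A1 first value change:  {a1_first:+.2%}")     # +0.12%
print(f"A1 second value change: {a1_second:+.2%}")    # -0.07%
```

The A1 differences are roughly a tenth of a percent in either direction, consistent with the "effectively identical" verdict, while the warp drop is about 3.4%.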
jasl added a commit to jasl/tokenspeed that referenced this pull request May 12, 2026
Upstream PR lightseekorg#93 added a pre-flight DeepGEMM ``fp8_gemm_nt`` call to
``DeepseekV4Attention._compute_qr_kv``: on success it replaces the
reference FP8 linear path, on failure it logs a WARNING per layer and
falls back. DeepGEMM does not support SM120/SM121 yet (see PR
``deepseek-ai/DeepGEMM#324`` + ``reference_deepgemm_sm120`` memory),
so on the RTX Pro 6000 workstation every layer fires:

    DeepSeek V4 DeepGEMM FP8 linear failed; falling back to reference
    FP8 linear. reason=RuntimeError: Assertion error
    (csrc/apis/layout.hpp:59): Unknown SF transformation

The existing per-layer ``_deepseek_v4_deep_gemm_linear_disabled`` flag
already catches this for steady-state replay, but it costs one failed
call + one WARNING per layer at boot. Mirror the pattern used by
``_deepseek_v4_deepgemm_fp4_indexer_enabled_for_platform``: short-
circuit ``_deepseek_v4_get_fp8_linear_deep_gemm`` to ``None`` on SM12x
so the platform never tries the DeepGEMM path. Non-SM12x platforms
keep the new fast path.

Signed-off-by: jasl <jasl9187@hotmail.com>
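
The two mechanisms described in the commit above, the per-layer disable flag and the platform short-circuit, can be sketched as below. This is an illustration under stated assumptions rather than the project's code: `get_fp8_linear_deep_gemm`, `AttentionLayerSketch`, the `loader` callback, and the capability tuples are hypothetical, and the real functions take different arguments.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger("deepseek_v4_sketch")

# Assumption: SM120 and SM121 are represented as (major, minor) tuples.
SM12X_CAPABILITIES = frozenset({(12, 0), (12, 1)})


def get_fp8_linear_deep_gemm(
    capability: Tuple[int, int],
    loader: Callable[[], Callable[[float], float]],
) -> Optional[Callable[[float], float]]:
    """Return the fast-path callable, or None on SM12x without loading it."""
    if capability in SM12X_CAPABILITIES:
        return None  # Short-circuit: never attempt the unsupported path.
    return loader()


class AttentionLayerSketch:
    """Tries the fast path once, then sticks to the reference path on failure."""

    def __init__(self, fast_linear, reference_linear):
        self.fast_linear = fast_linear
        self.reference_linear = reference_linear
        # With the short-circuit, SM12x layers start out disabled.
        self.deep_gemm_linear_disabled = fast_linear is None

    def linear(self, x: float) -> float:
        if not self.deep_gemm_linear_disabled:
            try:
                return self.fast_linear(x)
            except RuntimeError as exc:
                # One failed call and one WARNING per layer, then never again.
                logger.warning("fast FP8 linear failed; falling back: %s", exc)
                self.deep_gemm_linear_disabled = True
        return self.reference_linear(x)
```

Without the short-circuit, every layer pays for one failed call and one warning before its flag flips. With it, the flag is set at construction time and the fast path is never attempted on SM12x.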