[Question] DSA VS MLA Prefill Benchmark On H100 #149

@ZavierXing

  1. Motivation

In the DeepSeek-V3.2 paper, the figure "Inference costs of DeepSeek-V3.1-Terminus and DeepSeek-V3.2-Exp on H800 clusters" shows DSA outperforming MLA in the prefilling stage once the sequence length exceeds ~12K tokens.

However, in my profiling on H100 (PCIe/SXM), I cannot reproduce this crossover point.

[Two images attached]
  2. Micro-benchmark Implementation

DSA path: deep_gemm.fp8_mqa_logits -> mock_topk_index -> flash_mla_sparse_fwd.
MLA path: standard FlashAttention-3.

Key Parameters:

  • DSA Config: qk_dim=576, v_dim=512, n_heads_q=128, n_heads_k=1
  • Index Config: qk_dim=128, n_heads_q=64, n_heads_k=1
  • MLA Config: qk_dim=192, v_dim=192, n_heads_q=128, n_heads_kv=128
  • Precision: torch.bfloat16 (main attention) / e4m3 (index)
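
For reference, a back-of-the-envelope FLOP count for the two paths gives a rough idea of where a crossover could appear on paper. This is only a sketch and is not taken from the benchmark script: the function names are hypothetical, causal masking is assumed, and top_k=2048 is an assumption because the value is not listed in the parameters above. It counts only the QK^T and PV matmul FLOPs and ignores the top-k selection itself, memory traffic, and the FP8 vs BF16 throughput difference.

```python
# Rough matmul-FLOP model for the DSA and MLA prefill paths.
# Assumptions (not from the benchmark script): causal masking, top_k = 2048.

def dense_attn_flops(seq_len, n_heads, qk_dim, v_dim):
    """Causal dense attention: query i (0-based) attends to i + 1 keys."""
    pairs = seq_len * (seq_len + 1) // 2
    return 2 * n_heads * (qk_dim + v_dim) * pairs


def sparse_attn_flops(seq_len, n_heads, qk_dim, v_dim, top_k):
    """Causal sparse attention: query i attends to min(i + 1, top_k) keys."""
    if seq_len <= top_k:
        pairs = seq_len * (seq_len + 1) // 2
    else:
        pairs = top_k * (top_k + 1) // 2 + (seq_len - top_k) * top_k
    return 2 * n_heads * (qk_dim + v_dim) * pairs


def dsa_flops(seq_len, top_k):
    # Index config from the issue: qk_dim=128, n_heads_q=64 (scores only, no V).
    indexer = dense_attn_flops(seq_len, n_heads=64, qk_dim=128, v_dim=0)
    # DSA config from the issue: qk_dim=576, v_dim=512, n_heads_q=128.
    main = sparse_attn_flops(seq_len, n_heads=128, qk_dim=576, v_dim=512,
                             top_k=top_k)
    return indexer + main


def mla_flops(seq_len):
    # MLA config from the issue: qk_dim=192, v_dim=192, n_heads_q=128.
    return dense_attn_flops(seq_len, n_heads=128, qk_dim=192, v_dim=192)


def crossover_length(top_k, max_len=1 << 20):
    """Smallest seq_len at which the DSA FLOP count drops below MLA's."""
    for seq_len in range(1, max_len):
        if dsa_flops(seq_len, top_k) < mla_flops(seq_len):
            return seq_len
    return None


if __name__ == "__main__":
    TOP_K = 2048  # assumed value, see the note above
    print("FLOP break-even sequence length:", crossover_length(TOP_K))
    for s in (4096, 8192, 16384, 32768, 65536):
        print(s, round(dsa_flops(s, TOP_K) / mla_flops(s), 3))
```

Under these assumptions the FLOP break-even falls between 12K and 14K tokens. A measured crossover far beyond that range might therefore point to kernel efficiency or the top-k step rather than raw arithmetic, though this model is too coarse to say which.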

Benchmark script: https://github.com/ZavierXing/FlashMLA/blob/bench/benchmark/bench_dsa.py

  • GPU: NVIDIA H100 NVL
  • CUDA Version: 12.6
  • PyTorch Version: 2.6.0
  • tilelang Version: 0.1.7
