[Benchmark] Add Mamba2_chunk_scan benchmark #1109
Conversation
**Walkthrough**

This PR adds a comprehensive benchmark suite for Mamba2's chunk-scan operation. It introduces a documentation file describing the Mamba2_chunk_scan benchmark with environment details, reproduction steps, and results, alongside a Python benchmark script that implements a reference program, a configuration generator, and an autotuned chunk-scanning kernel with tiling and staging support.
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant main as Main Script
    participant config as get_configs()
    participant autotune as Autotuner
    participant kernel as chunk_scan_fwd
    participant exec as Kernel Execution
    main->>config: Generate parameter combinations
    config-->>main: List of (block_M, block_N, block_K, block_Dstate, num_stages)
    main->>autotune: Invoke autotuner with configs
    autotune->>kernel: Test each config
    kernel->>exec: Execute tiled kernel
    exec-->>kernel: Measure latency
    kernel-->>autotune: Return timing
    autotune-->>main: Best config & TFLOPs
    main->>main: Print results
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
**Pre-merge checks and finishing touches**

- ❌ Failed checks (1 warning)
- ✅ Passed checks (2 passed)
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run the project's formatting checks before pushing. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Actionable comments posted: 2
🧹 Nitpick comments (3)
benchmark/mamba2/benchmark_mamba_chunk_scan.py (3)
**4-4**: Replace star import with explicit import.

The star import makes it unclear which names are being used and can cause namespace conflicts. Since only the `autotune` decorator is used, import it explicitly. Apply this diff:

```diff
-from tilelang.autotuner import *
+from tilelang.autotuner import autotune
```
**90-90**: Add a comment documenting the constant.

The constant `p = 1.44269504` (log₂(e)) is used to convert natural-logarithm operations to base-2 exponentials via `exp2`. A brief comment would improve code clarity. Apply this diff:

```diff
-p = 1.44269504
+p = 1.44269504  # log2(e), used for converting exp(x) to exp2(x * log2(e))
```
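As a quick sanity check on the identity behind this constant, the following standalone snippet (illustrative only, not part of the benchmark) verifies that `exp(x)` equals `exp2(x * log2(e))`:

```python
import math

p = 1.44269504  # log2(e), truncated to float32-friendly precision
x = 0.73
# exp(x) == 2 ** (x * log2(e)); exp2 is the cheaper form on GPU hardware
assert math.isclose(math.exp(x), 2.0 ** (x * p), rel_tol=1e-7)
```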
**215-215**: Add a comment explaining the FLOPs formula.

The FLOPs calculation is complex and would benefit from a brief explanation of what each term represents (the causal CB@X GEMM and the C@prev_states GEMM). Apply this diff:

```diff
+# FLOPs: causal CB@X GEMM (~0.5 factor for triangular) + C@prev_states GEMM
 total_flops = 2 * batch * seq_len * chunk_size * heads * dim * 0.5 + 2 * batch * seq_len * heads * dim * dstate
```
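To make the formula concrete, it can be evaluated for a hypothetical problem size (the dimensions below are illustrative, not the benchmark's defaults):

```python
# Hypothetical dimensions, for illustration only.
batch, seq_len, chunk_size = 8, 4096, 256
heads, dim, dstate = 8, 64, 128

# Causal intra-chunk CB@X GEMM; ~0.5 factor for the lower triangle.
intra = 2 * batch * seq_len * chunk_size * heads * dim * 0.5
# Inter-chunk C@prev_states GEMM.
inter = 2 * batch * seq_len * heads * dim * dstate

total_flops = intra + inter
print(f"{total_flops / 1e12:.4f} TFLOPs per forward pass")  # ~0.0086
```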
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)

- `benchmark/mamba2/mamba_benchmark_result.png` is excluded by `!**/*.png`
📒 Files selected for processing (2)

- `benchmark/mamba2/README.md` (1 hunks)
- `benchmark/mamba2/benchmark_mamba_chunk_scan.py` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchmark/mamba2/benchmark_mamba_chunk_scan.py (10)

- tilelang/autotuner/tuner.py (1): `autotune` (722-815)
- tilelang/jit/__init__.py (1): `jit` (237-310)
- tilelang/transform/pass_config.py (1): `PassConfigKey` (6-144)
- tilelang/language/allocate.py (2): `alloc_fragment` (56-67), `alloc_shared` (24-39)
- tilelang/language/annotations.py (1): `annotate_layout` (24-35)
- tilelang/layout/swizzle.py (1): `make_swizzled_layout` (10-18)
- tilelang/language/builtin.py (1): `no_set_max_nreg` (174-177)
- tilelang/language/parallel.py (1): `Parallel` (8-28)
- tilelang/language/pipeline.py (1): `Pipelined` (9-46)
- tilelang/language/tir/op.py (1): `if_then_else` (2906-2936)
🪛 Ruff (0.14.1)
benchmark/mamba2/benchmark_mamba_chunk_scan.py
4-4: from tilelang.autotuner import * used; unable to detect undefined names
(F403)
25-25: Unpacked variable batch is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
25-25: Unpacked variable headdim is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
67-67: autotune may be undefined, or defined from star imports
(F405)
🔇 Additional comments (3)
benchmark/mamba2/benchmark_mamba_chunk_scan.py (2)
**57-64**: LGTM! The config generator correctly produces a Cartesian product of tuning parameters for the autotuner (see the sketch below).
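For readers unfamiliar with the pattern, here is a minimal sketch of such a generator; the parameter values are hypothetical, not the benchmark's actual search space:

```python
import itertools

def get_configs():
    # Hypothetical tuning ranges; the real script defines its own.
    block_M = [64, 128]
    block_N = [32, 64]
    block_K = [64, 128]
    block_Dstate = [128]
    num_stages = [2, 3, 4]
    # Cartesian product: every combination is one autotuner candidate.
    return [
        dict(block_M=m, block_N=n, block_K=k, block_Dstate=d, num_stages=s)
        for m, n, k, d, s in itertools.product(
            block_M, block_N, block_K, block_Dstate, num_stages)
    ]
```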
**67-200**: LGTM on kernel structure! The kernel is well-designed with proper memory hierarchy usage (fragment, shared, and global memory), pipelining for latency hiding, and correct synchronization. The computation logic for chunk scanning with decay and causal masking appears sound; a simplified sketch of that computation follows.
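To make the chunk-scan computation concrete, below is a much-simplified single-head, single-chunk NumPy sketch of the two terms involved (the causal intra-chunk GEMM with decay, plus the carried-in state contribution). It assumes the standard Mamba2 chunked formulation and is not the kernel itself:

```python
import numpy as np

# Hypothetical sizes: one head, one chunk.
L, D, N = 64, 64, 16              # chunk length, head dim, state dim
rng = np.random.default_rng(0)
C = rng.standard_normal((L, N))   # per-token C projections
B = rng.standard_normal((L, N))   # per-token B projections
x = rng.standard_normal((L, D))   # inputs for this chunk
dt = rng.random(L)                # per-token step sizes
dA_cs = np.cumsum(-rng.random(L)) # cumulative decay, in log space
prev_state = rng.standard_normal((N, D))  # state from earlier chunks

# Term 1: causal intra-chunk GEMM. decay[i, j] = exp(dA_cs[i] - dA_cs[j]),
# masked to the lower triangle (token i only sees tokens j <= i).
decay = np.exp(dA_cs[:, None] - dA_cs[None, :])
scores = np.tril((C @ B.T) * decay * dt[None, :])
intra = scores @ x

# Term 2: carried-in state, decayed from the start of the chunk.
inter = (C * np.exp(dA_cs)[:, None]) @ prev_state

out = intra + inter  # (L, D) chunk output
```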
benchmark/mamba2/README.md (1)

**48-53**: No action needed. The image file `mamba_benchmark_result.png` exists at `benchmark/mamba2/mamba_benchmark_result.png`. The figure reference in the README is valid, and there are no broken links.
```markdown
## Results

| Seq_len| Latency (s) | Throughput (TFLOPs) |
```
Correct the latency units in the table header.

The table header says "Latency (s)" (seconds), but the values shown (0.169, 0.329, etc.) and the reproduction code (line 33) indicate milliseconds. This inconsistency could confuse users. Apply this diff:

```diff
-| Seq_len| Latency (s) | Throughput (TFLOPs) |
+| Seq_len| Latency (ms) | Throughput (TFLOPs) |
```
```html
<a href="mamba_benchmark_result.png">
  <img src="mamba_benchmark_result.png" alt="Mamba2_chunk_scan Performance Comparison on H100">
</a>
<figcaption style="text-align: center;">Performance comparison across compilers on NVIDIA H100</figcaption>
```
Fix GPU model inconsistency.

The figure caption mentions "NVIDIA H100" but the Environment section (line 8) specifies "NVIDIA H800 SXM". Ensure the GPU model is consistent throughout the document. Apply this diff:

```diff
-  <figcaption style="text-align: center;">Performance comparison across compilers on NVIDIA H100</figcaption>
+  <figcaption style="text-align: center;">Performance comparison across compilers on NVIDIA H800</figcaption>
```
This pull request introduces a new benchmark for the Mamba2 chunk scan kernel, including both documentation and a Python benchmarking script. The main focus is to provide reproducible performance measurements and a reference implementation for the chunk scan operation, as well as a configurable autotuned kernel for throughput evaluation.
Benchmark documentation and script:

- Added `benchmark/mamba2/README.md`, detailing throughput results, environment setup, and reproduction instructions for the Mamba2 chunk scan kernel.

Benchmarking and kernel implementation:

- Added `benchmark_mamba_chunk_scan.py`, which includes:
  - A reference program (`ref_program`) for correctness comparison.
  - An autotuned kernel (`chunk_scan_fwd`) using TileLang for efficient chunk scan computation, supporting various block sizes and stages.
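The script's exact comparison harness is not shown here; as a generic illustration of how a kernel is typically validated against a reference program (the function names and tolerances below are assumptions, not the script's actual API):

```python
import torch

def check_against_ref(kernel_fn, ref_fn, *inputs, rtol=1e-2, atol=1e-2):
    # Run both the optimized kernel and the reference implementation,
    # then compare elementwise within the given tolerances.
    out = kernel_fn(*inputs)
    ref = ref_fn(*inputs)
    torch.testing.assert_close(out, ref, rtol=rtol, atol=atol)
```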
New Features
Documentation