
Conversation

@chengyupku
Contributor

@chengyupku commented Oct 22, 2025

This pull request introduces a new benchmark for the Mamba2 chunk scan kernel, including both documentation and a Python benchmarking script. The main focus is to provide reproducible performance measurements and a reference implementation for the chunk scan operation, as well as a configurable autotuned kernel for throughput evaluation.

Benchmark documentation and script:

  • Added a comprehensive benchmark report in benchmark/mamba2/README.md, detailing throughput results, environment setup, and reproduction instructions for the Mamba2 chunk scan kernel.

Benchmarking and kernel implementation:

  • Introduced benchmark_mamba_chunk_scan.py, which includes:
    • A reference implementation of the chunk scan operation (ref_program) for correctness comparison (a hedged sketch of the operation follows this list).
    • An autotuned, configurable kernel (chunk_scan_fwd) using TileLang for efficient chunk scan computation, supporting various block sizes and stages.
    • Command-line interface for running benchmarks with different parameters, reporting latency and throughput, and printing the best configuration found.
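
For readers unfamiliar with the operation being benchmarked, below is a minimal PyTorch sketch of a Mamba2-style chunk scan: intra-chunk causal scores with decay masking plus an inter-chunk contribution from carried states. It is illustrative only; the tensor names, shapes, and decay convention are assumptions and not necessarily what ref_program in this PR implements.

```python
# Hedged sketch of a Mamba2-style chunk scan reference, written against assumed
# shapes; it is not necessarily identical to ref_program in this PR.
import torch

def chunk_scan_ref(cb, x, dt, dA_cumsum, C, prev_states):
    # cb          : (b, nchunks, ngroups, chunk, chunk)   per-chunk C @ B^T scores
    # x           : (b, nchunks, chunk, heads, headdim)   chunked inputs
    # dt          : (b, heads, nchunks, chunk)            per-step dt
    # dA_cumsum   : (b, heads, nchunks, chunk)            cumulative decay within each chunk
    # C           : (b, nchunks, chunk, ngroups, dstate)  output projection
    # prev_states : (b, nchunks, heads, headdim, dstate)  states carried in from earlier chunks
    _, _, ngroups, chunk, _ = cb.shape
    heads = x.shape[-2]
    # Broadcast group-level tensors across the heads that share each group
    # (assumes heads is a multiple of ngroups).
    cb = cb.repeat_interleave(heads // ngroups, dim=2)                     # (b, c, h, L, S)
    C = C.repeat_interleave(heads // ngroups, dim=3)                       # (b, c, L, h, n)
    # Decay from source step s to target step l, causally masked, scaled by dt at s.
    decay = torch.exp(dA_cumsum[..., :, None] - dA_cumsum[..., None, :])   # (b, h, c, L, S)
    causal = torch.tril(torch.ones(chunk, chunk, dtype=torch.bool, device=x.device))
    decay = decay.masked_fill(~causal, 0) * dt[..., None, :]
    scores = cb * decay.permute(0, 2, 1, 3, 4)                             # (b, c, h, L, S)
    # Intra-chunk contribution: masked, decayed scores applied to x.
    out = torch.einsum("bchls,bcshd->bclhd", scores, x)
    # Inter-chunk contribution: carried states projected by C, decayed from the chunk start.
    state_decay = torch.exp(dA_cumsum).permute(0, 2, 3, 1)                 # (b, c, L, h)
    out = out + torch.einsum("bclhn,bchdn->bclhd", C, prev_states) * state_decay[..., None]
    return out                                                             # (b, c, L, h, d)
```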

Summary by CodeRabbit

  • New Features

    • Introduces a benchmarking tool for the Mamba2 chunk-scan operation, with automated parameter tuning, configurable block and stage sizes, and hardware-accelerated execution.
  • Documentation

    • Provides environment and setup details, step-by-step reproduction instructions with example parameters, baseline latency and throughput results across sequence lengths, and a reference results figure.

@coderabbitai
Contributor

coderabbitai bot commented Oct 22, 2025

Walkthrough

This PR adds a comprehensive benchmark suite for Mamba2's chunk-scan operation. It introduces a documentation file describing the Mamba2_chunk_scan benchmark with environment details, reproduction steps, and results, alongside a Python benchmark script that implements a reference program, configuration generator, and an autotuned chunk-scanning kernel with tiling and staging support.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Mamba2 Chunk-Scan Benchmark: benchmark/mamba2/README.md, benchmark/mamba2/benchmark_mamba_chunk_scan.py | Adds documentation and a benchmark script for Mamba2 chunk-scanning attention: a reference implementation of the dense attention-like computation with decay masking, a configuration-generation utility, an autotuned forward kernel with tiled and staged execution, and measurement infrastructure. |

Sequence Diagram

sequenceDiagram
    participant main as Main Script
    participant config as get_configs()
    participant autotune as Autotuner
    participant kernel as chunk_scan_fwd
    participant exec as Kernel Execution

    main->>config: Generate parameter combinations
    config-->>main: List of (block_M, block_N, block_K, block_Dstate, num_stages)
    main->>autotune: Invoke autotuner with configs
    autotune->>kernel: Test each config
    kernel->>exec: Execute tiled kernel
    exec-->>kernel: Measure latency
    kernel-->>autotune: Return timing
    autotune-->>main: Best config & TFLOPs
    main->>main: Print results
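
As a rough illustration of the first step in the diagram, a config generator such as get_configs() typically just enumerates the Cartesian product of the tuning axes shown above. A minimal sketch, assuming dictionary-style configs and purely hypothetical candidate values:

```python
# Hedged sketch of a config generator; the axis values here are assumptions,
# not the candidates actually tuned over in benchmark_mamba_chunk_scan.py.
import itertools

def get_configs():
    block_M = [64, 128]
    block_N = [32, 64, 128]
    block_K = [64, 128]
    block_Dstate = [128]
    num_stages = [1, 2, 3]
    # One dict per combination; an autotuner can splat these into the kernel factory.
    return [
        dict(block_M=m, block_N=n, block_K=k, block_Dstate=d, num_stages=s)
        for m, n, k, d, s in itertools.product(block_M, block_N, block_K, block_Dstate, num_stages)
    ]
```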

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hopping through benchmarks with Mamba so fleet,
Chunk-scanning attention, a performance feat!
Autotuning kernels in tiles neatly laid,
Latency whispers, throughput displayed! 🚀

Pre-merge checks

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title "[Benchmark] Add Mamba2_chunk_scan benchmark" aligns with the changeset: the PR adds a Mamba2 chunk-scan benchmark consisting of a README and a Python benchmarking script with an autotuned kernel and a reference implementation. The title is concise, specific about what is added, and the [Benchmark] prefix aids scanning the commit history.


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

Contributor

@coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
benchmark/mamba2/benchmark_mamba_chunk_scan.py (3)

4-4: Replace star import with explicit import.

The star import makes it unclear which names are being used and can cause namespace conflicts. Since only the autotune decorator is used, import it explicitly.

Apply this diff:

-from tilelang.autotuner import *
+from tilelang.autotuner import autotune

90-90: Add a comment documenting the constant.

The constant p = 1.44269504 (log₂(e)) is used to convert natural logarithm operations to base-2 exponentials via exp2. A brief comment would improve code clarity.

Apply this diff:

-    p = 1.44269504
+    p = 1.44269504  # log2(e), used for converting exp(x) to exp2(x * log2(e))
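
As a side note, the identity behind that constant is exp(x) = 2^(x * log2(e)); a quick self-contained check (a sketch, not project code):

```python
import math

x = 0.73                 # arbitrary test value
p = 1.44269504           # log2(e)
assert abs(math.exp(x) - 2 ** (x * p)) < 1e-6
```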

215-215: Add a comment explaining the FLOPs formula.

The FLOPs calculation is complex and would benefit from a brief explanation of what each term represents (causal CB@X GEMM and C@prev_states GEMM).

Apply this diff:

+    # FLOPs: causal CB@X GEMM (~0.5 factor for triangular) + C@prev_states GEMM
     total_flops = 2 * batch * seq_len * chunk_size * heads * dim * 0.5 + 2 * batch * seq_len * heads * dim * dstate
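
To make the two terms concrete, here is an illustrative evaluation of the formula with hypothetical problem sizes (assumptions, not the PR's benchmark defaults):

```python
# Hypothetical sizes chosen only to show the magnitude of each term.
batch, seq_len, chunk_size, heads, dim, dstate = 8, 4096, 256, 64, 64, 128
intra = 2 * batch * seq_len * chunk_size * heads * dim * 0.5   # causal CB@X GEMM (~0.5 for triangular)
inter = 2 * batch * seq_len * heads * dim * dstate             # C@prev_states GEMM
total_flops = intra + inter
print(f"{total_flops / 1e12:.3f} TFLOP per forward pass")
```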
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e28433e and 18e5176.

⛔ Files ignored due to path filters (1)
  • benchmark/mamba2/mamba_benchmark_result.png is excluded by !**/*.png
📒 Files selected for processing (2)
  • benchmark/mamba2/README.md (1 hunks)
  • benchmark/mamba2/benchmark_mamba_chunk_scan.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchmark/mamba2/benchmark_mamba_chunk_scan.py (10)
tilelang/autotuner/tuner.py (1)
  • autotune (722-815)
tilelang/jit/__init__.py (1)
  • jit (237-310)
tilelang/transform/pass_config.py (1)
  • PassConfigKey (6-144)
tilelang/language/allocate.py (2)
  • alloc_fragment (56-67)
  • alloc_shared (24-39)
tilelang/language/annotations.py (1)
  • annotate_layout (24-35)
tilelang/layout/swizzle.py (1)
  • make_swizzled_layout (10-18)
tilelang/language/builtin.py (1)
  • no_set_max_nreg (174-177)
tilelang/language/parallel.py (1)
  • Parallel (8-28)
tilelang/language/pipeline.py (1)
  • Pipelined (9-46)
tilelang/language/tir/op.py (1)
  • if_then_else (2906-2936)
🪛 Ruff (0.14.1)
benchmark/mamba2/benchmark_mamba_chunk_scan.py

4-4: from tilelang.autotuner import * used; unable to detect undefined names

(F403)


25-25: Unpacked variable batch is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


25-25: Unpacked variable headdim is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


67-67: autotune may be undefined, or defined from star imports

(F405)
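
For reference, the fix Ruff suggests for the two RUF059 findings is simply to mark unused unpacked names with a leading underscore; a hedged sketch, since the actual tuple on line 25 may differ:

```python
# Hypothetical unpacking resembling the flagged line; names and values are assumptions.
shape = (8, 4096, 64, 64, 128)
_batch, seq_len, heads, _headdim, dstate = shape  # leading "_" marks intentionally unused names
```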

🔇 Additional comments (3)
benchmark/mamba2/benchmark_mamba_chunk_scan.py (2)

57-64: LGTM!

The config generator correctly produces a Cartesian product of tuning parameters for the autotuner.


67-200: LGTM on kernel structure!

The kernel is well-designed with proper memory hierarchy usage (fragment, shared, and global memory), pipelining for latency hiding, and correct synchronization. The computation logic for chunk scanning with decay and causal masking appears sound.

benchmark/mamba2/README.md (1)

48-53: No action needed.

The image file mamba_benchmark_result.png exists at benchmark/mamba2/mamba_benchmark_result.png. The figure reference in the README is valid, and there are no broken links.


## Results

| Seq_len| Latency (s) | Throughput (TFLOPs) |
Contributor


⚠️ Potential issue | 🟡 Minor

Correct the latency units in the table header.

The table header indicates "Latency (s)" (seconds), but the values shown (0.169, 0.329, etc.) and the reproduction code (line 33) indicate these are milliseconds. This inconsistency could confuse users.

Apply this diff:

-| Seq_len| Latency (s) | Throughput (TFLOPs) |
+| Seq_len| Latency (ms) | Throughput (TFLOPs) |
🤖 Prompt for AI Agents
In benchmark/mamba2/README.md around line 39, the table header currently says "Latency (s)", but the values and the reproduction code indicate milliseconds; update the header cell to "Latency (ms)" so the units match the values shown.

<a href="mamba_benchmark_result.png">
<img src="mamba_benchmark_result.png" alt="Mamba2_chunk_scan Performance Comparison on H100">
</a>
<figcaption style="text-align: center;">Performance comparison across compilers on NVIDIA H100</figcaption>
Contributor


⚠️ Potential issue | 🟡 Minor

Fix GPU model inconsistency.

The figure caption mentions "NVIDIA H100" but the Environment section (line 8) specifies "NVIDIA H800 SXM". Ensure the GPU model is consistent throughout the document.

Apply this diff:

-  <figcaption style="text-align: center;">Performance comparison across compilers on NVIDIA H100</figcaption>
+  <figcaption style="text-align: center;">Performance comparison across compilers on NVIDIA H800</figcaption>
🤖 Prompt for AI Agents
In benchmark/mamba2/README.md around line 52, the figure caption currently says
"NVIDIA H100" which conflicts with the Environment section at line 8 that
specifies "NVIDIA H800 SXM"; update the caption to match the Environment section
by replacing "NVIDIA H100" with "NVIDIA H800 SXM" so the GPU model is consistent
throughout the document.

@chengyupku merged commit 717f7b5 into tile-ai:main on Oct 22, 2025
7 checks passed
