
Conversation

@chengyupku
Contributor

@chengyupku commented Oct 22, 2025

This pull request introduces a new benchmark for the Mamba2 chunk scan kernel, including both documentation and a Python benchmarking script. The main focus is to provide reproducible performance measurements and a reference implementation for the chunk scan operation, as well as a configurable autotuned kernel for throughput evaluation.

Benchmark documentation and script:

  • Added a comprehensive benchmark report in benchmark/mamba2/README.md, detailing throughput results, environment setup, and reproduction instructions for the Mamba2 chunk scan kernel.

Benchmarking and kernel implementation:

  • Introduced benchmark_mamba_chunk_scan.py, which includes:
    • A reference implementation of the chunk scan operation (ref_program) for correctness comparison (a hedged sketch of the operation follows this list).
    • An autotuned, configurable kernel (chunk_scan_fwd) using TileLang for efficient chunk scan computation, supporting various block sizes and stages.
    • Command-line interface for running benchmarks with different parameters, reporting latency and throughput, and printing the best configuration found.
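
For readers unfamiliar with the operation being benchmarked, below is a minimal PyTorch sketch of a Mamba2-style chunk scan: intra-chunk causal scores with decay masking plus an inter-chunk contribution from carried states. It is illustrative only; the tensor names, shapes, and decay convention are assumptions and not necessarily what ref_program in this PR implements.

```python
# Hedged sketch of a Mamba2-style chunk scan reference, written against assumed
# shapes; it is not necessarily identical to ref_program in this PR.
import torch

def chunk_scan_ref(cb, x, dt, dA_cumsum, C, prev_states):
    # cb          : (b, nchunks, ngroups, chunk, chunk)   per-chunk C @ B^T scores
    # x           : (b, nchunks, chunk, heads, headdim)   chunked inputs
    # dt          : (b, heads, nchunks, chunk)            per-step dt
    # dA_cumsum   : (b, heads, nchunks, chunk)            cumulative decay within each chunk
    # C           : (b, nchunks, chunk, ngroups, dstate)  output projection
    # prev_states : (b, nchunks, heads, headdim, dstate)  states carried in from earlier chunks
    _, _, ngroups, chunk, _ = cb.shape
    heads = x.shape[-2]
    # Broadcast group-level tensors across the heads that share each group
    # (assumes heads is a multiple of ngroups).
    cb = cb.repeat_interleave(heads // ngroups, dim=2)                     # (b, c, h, L, S)
    C = C.repeat_interleave(heads // ngroups, dim=3)                       # (b, c, L, h, n)
    # Decay from source step s to target step l, causally masked, scaled by dt at s.
    decay = torch.exp(dA_cumsum[..., :, None] - dA_cumsum[..., None, :])   # (b, h, c, L, S)
    causal = torch.tril(torch.ones(chunk, chunk, dtype=torch.bool, device=x.device))
    decay = decay.masked_fill(~causal, 0) * dt[..., None, :]
    scores = cb * decay.permute(0, 2, 1, 3, 4)                             # (b, c, h, L, S)
    # Intra-chunk contribution: masked, decayed scores applied to x.
    out = torch.einsum("bchls,bcshd->bclhd", scores, x)
    # Inter-chunk contribution: carried states projected by C, decayed from the chunk start.
    state_decay = torch.exp(dA_cumsum).permute(0, 2, 3, 1)                 # (b, c, L, h)
    out = out + torch.einsum("bclhn,bchdn->bclhd", C, prev_states) * state_decay[..., None]
    return out                                                             # (b, c, L, h, d)
```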

Summary by CodeRabbit

  • New Features

    • Introduces a benchmarking tool for the Mamba2 chunk-scan operation, with automated parameter tuning, configurable block and stage sizes, and hardware-accelerated execution.
  • Documentation

    • Provides environment and setup details, step-by-step reproduction instructions with example parameters, baseline latency and throughput results across sequence lengths, and a reference results figure.

@coderabbitai
Contributor

coderabbitai bot commented Oct 22, 2025

Walkthrough

This PR adds a comprehensive benchmark suite for Mamba2's chunk-scan operation. It introduces a documentation file describing the Mamba2_chunk_scan benchmark with environment details, reproduction steps, and results, alongside a Python benchmark script that implements a reference program, configuration generator, and an autotuned chunk-scanning kernel with tiling and staging support.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Mamba2 Chunk-Scan Benchmark: benchmark/mamba2/README.md, benchmark/mamba2/benchmark_mamba_chunk_scan.py | Adds documentation and a benchmark script for Mamba2 chunk-scanning attention: a reference implementation of the dense attention-like computation with decay masking, a configuration-generation utility, an autotuned forward kernel with tiled and staged execution, and measurement infrastructure. |

Sequence Diagram

sequenceDiagram
    participant main as Main Script
    participant config as get_configs()
    participant autotune as Autotuner
    participant kernel as chunk_scan_fwd
    participant exec as Kernel Execution

    main->>config: Generate parameter combinations
    config-->>main: List of (block_M, block_N, block_K, block_Dstate, num_stages)
    main->>autotune: Invoke autotuner with configs
    autotune->>kernel: Test each config
    kernel->>exec: Execute tiled kernel
    exec-->>kernel: Measure latency
    kernel-->>autotune: Return timing
    autotune-->>main: Best config & TFLOPs
    main->>main: Print results
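
As a rough illustration of the first step in the diagram, a config generator such as get_configs() typically just enumerates the Cartesian product of the tuning axes shown above. A minimal sketch, assuming dictionary-style configs and purely hypothetical candidate values:

```python
# Hedged sketch of a config generator; the axis values here are assumptions,
# not the candidates actually tuned over in benchmark_mamba_chunk_scan.py.
import itertools

def get_configs():
    block_M = [64, 128]
    block_N = [32, 64, 128]
    block_K = [64, 128]
    block_Dstate = [128]
    num_stages = [1, 2, 3]
    # One dict per combination; an autotuner can splat these into the kernel factory.
    return [
        dict(block_M=m, block_N=n, block_K=k, block_Dstate=d, num_stages=s)
        for m, n, k, d, s in itertools.product(block_M, block_N, block_K, block_Dstate, num_stages)
    ]
```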

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hopping through benchmarks with Mamba so fleet,
Chunk-scanning attention, a performance feat!
Autotuning kernels in tiles neatly laid,
Latency whispers, throughput displayed! 🚀

Pre-merge checks

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title "[Benchmark] Add Mamba2_chunk_scan benchmark" aligns with the changeset: the PR adds a Mamba2 chunk-scan benchmark consisting of a README and a Python benchmarking script with an autotuned kernel and a reference implementation. The title is concise, specific about what is added, and the [Benchmark] prefix aids scanning the commit history.


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

Contributor

@coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
benchmark/mamba2/benchmark_mamba_chunk_scan.py (3)

4-4: Replace star import with explicit import.

The star import makes it unclear which names are being used and can cause namespace conflicts. Since only the autotune decorator is used, import it explicitly.

Apply this diff:

-from tilelang.autotuner import *
+from tilelang.autotuner import autotune

90-90: Add a comment documenting the constant.

The constant p = 1.44269504 (log₂(e)) is used to convert natural logarithm operations to base-2 exponentials via exp2. A brief comment would improve code clarity.

Apply this diff:

-    p = 1.44269504
+    p = 1.44269504  # log2(e), used for converting exp(x) to exp2(x * log2(e))
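
As a side note, the identity behind that constant is exp(x) = 2^(x * log2(e)); a quick self-contained check (a sketch, not project code):

```python
import math

x = 0.73                 # arbitrary test value
p = 1.44269504           # log2(e)
assert abs(math.exp(x) - 2 ** (x * p)) < 1e-6
```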

215-215: Add a comment explaining the FLOPs formula.

The FLOPs calculation is complex and would benefit from a brief explanation of what each term represents (causal CB@X GEMM and C@prev_states GEMM).

Apply this diff:

+    # FLOPs: causal CB@X GEMM (~0.5 factor for triangular) + C@prev_states GEMM
     total_flops = 2 * batch * seq_len * chunk_size * heads * dim * 0.5 + 2 * batch * seq_len * heads * dim * dstate
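
To make the two terms concrete, here is an illustrative evaluation of the formula with hypothetical problem sizes (assumptions, not the PR's benchmark defaults):

```python
# Hypothetical sizes chosen only to show the magnitude of each term.
batch, seq_len, chunk_size, heads, dim, dstate = 8, 4096, 256, 64, 64, 128
intra = 2 * batch * seq_len * chunk_size * heads * dim * 0.5   # causal CB@X GEMM (~0.5 for triangular)
inter = 2 * batch * seq_len * heads * dim * dstate             # C@prev_states GEMM
total_flops = intra + inter
print(f"{total_flops / 1e12:.3f} TFLOP per forward pass")
```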
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e28433e and 18e5176.

⛔ Files ignored due to path filters (1)
  • benchmark/mamba2/mamba_benchmark_result.png is excluded by !**/*.png
📒 Files selected for processing (2)
  • benchmark/mamba2/README.md (1 hunks)
  • benchmark/mamba2/benchmark_mamba_chunk_scan.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchmark/mamba2/benchmark_mamba_chunk_scan.py (10)
tilelang/autotuner/tuner.py (1)
  • autotune (722-815)
tilelang/jit/__init__.py (1)
  • jit (237-310)
tilelang/transform/pass_config.py (1)
  • PassConfigKey (6-144)
tilelang/language/allocate.py (2)
  • alloc_fragment (56-67)
  • alloc_shared (24-39)
tilelang/language/annotations.py (1)
  • annotate_layout (24-35)
tilelang/layout/swizzle.py (1)
  • make_swizzled_layout (10-18)
tilelang/language/builtin.py (1)
  • no_set_max_nreg (174-177)
tilelang/language/parallel.py (1)
  • Parallel (8-28)
tilelang/language/pipeline.py (1)
  • Pipelined (9-46)
tilelang/language/tir/op.py (1)
  • if_then_else (2906-2936)
🪛 Ruff (0.14.1)
benchmark/mamba2/benchmark_mamba_chunk_scan.py

4-4: from tilelang.autotuner import * used; unable to detect undefined names

(F403)


25-25: Unpacked variable batch is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


25-25: Unpacked variable headdim is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


67-67: autotune may be undefined, or defined from star imports

(F405)
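
For reference, the fix Ruff suggests for the two RUF059 findings is simply to mark unused unpacked names with a leading underscore; a hedged sketch, since the actual tuple on line 25 may differ:

```python
# Hypothetical unpacking resembling the flagged line; names and values are assumptions.
shape = (8, 4096, 64, 64, 128)
_batch, seq_len, heads, _headdim, dstate = shape  # leading "_" marks intentionally unused names
```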

🔇 Additional comments (3)
benchmark/mamba2/benchmark_mamba_chunk_scan.py (2)

57-64: LGTM!

The config generator correctly produces a Cartesian product of tuning parameters for the autotuner.


67-200: LGTM on kernel structure!

The kernel is well-designed with proper memory hierarchy usage (fragment, shared, and global memory), pipelining for latency hiding, and correct synchronization. The computation logic for chunk scanning with decay and causal masking appears sound.

benchmark/mamba2/README.md (1)

48-53: No action needed.

The image file mamba_benchmark_result.png exists at benchmark/mamba2/mamba_benchmark_result.png. The figure reference in the README is valid, and there are no broken links.


## Results

| Seq_len| Latency (s) | Throughput (TFLOPs) |
Contributor


⚠️ Potential issue | 🟡 Minor

Correct the latency units in the table header.

The table header indicates "Latency (s)" (seconds), but the values shown (0.169, 0.329, etc.) and the reproduction code (line 33) indicate these are milliseconds. This inconsistency could confuse users.

Apply this diff:

-| Seq_len| Latency (s) | Throughput (TFLOPs) |
+| Seq_len| Latency (ms) | Throughput (TFLOPs) |
🤖 Prompt for AI Agents
In benchmark/mamba2/README.md around line 39, the table header currently says "Latency (s)", but the values and the reproduction code indicate milliseconds; update the header cell to "Latency (ms)" so the units match the values shown.

<a href="mamba_benchmark_result.png">
<img src="mamba_benchmark_result.png" alt="Mamba2_chunk_scan Performance Comparison on H100">
</a>
<figcaption style="text-align: center;">Performance comparison across compilers on NVIDIA H100</figcaption>
Contributor


⚠️ Potential issue | 🟡 Minor

Fix GPU model inconsistency.

The figure caption mentions "NVIDIA H100" but the Environment section (line 8) specifies "NVIDIA H800 SXM". Ensure the GPU model is consistent throughout the document.

Apply this diff:

-  <figcaption style="text-align: center;">Performance comparison across compilers on NVIDIA H100</figcaption>
+  <figcaption style="text-align: center;">Performance comparison across compilers on NVIDIA H800</figcaption>
🤖 Prompt for AI Agents
In benchmark/mamba2/README.md around line 52, the figure caption currently says
"NVIDIA H100" which conflicts with the Environment section at line 8 that
specifies "NVIDIA H800 SXM"; update the caption to match the Environment section
by replacing "NVIDIA H100" with "NVIDIA H800 SXM" so the GPU model is consistent
throughout the document.

@chengyupku merged commit 717f7b5 into tile-ai:main on Oct 22, 2025
7 checks passed
