
Conversation

@chengyupku
Contributor

@chengyupku chengyupku commented Oct 22, 2025

This pull request improves both the Python example for linear attention and the CUDA device code for loading matrix data. The main changes enable fast math optimizations, silence type-checker warnings in the Python kernel definition, and update CUDA device function annotations for better inlining and performance.

Python Example Improvements:

  • Enabled fast math optimizations in the chunk_scan_fwd function by passing TL_ENABLE_FAST_MATH: True to the tilelang.jit decorator.
  • Added # type: ignore comments to the tensor arguments of the main kernel function to suppress static type-checker warnings on the T.Tensor annotations.
  • Inserted a call to T.no_set_max_nreg() in the kernel to lift the maximum-register cap and potentially improve register usage (a minimal sketch of these changes appears after this list).
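
A minimal sketch of how these pieces fit together, assuming the pass_configs dict accepts the PassConfigKey.TL_ENABLE_FAST_MATH key directly; the factory name, shapes, dtypes, and block sizes below are placeholders and the kernel body is elided (the real code is examples/linear_attention/example_mamba_chunk_scan.py):

```python
import tilelang
import tilelang.language as T
from tilelang.transform.pass_config import PassConfigKey


# Hypothetical, reduced kernel factory; the exact spelling of the
# pass-config key in the real example may differ.
@tilelang.jit(pass_configs={PassConfigKey.TL_ENABLE_FAST_MATH: True})
def chunk_scan_fwd(batch, seqlen, dim, block_M=64, threads=128):

    @T.prim_func
    def main(
            x: T.Tensor((batch, seqlen, dim), "float16"),  # type: ignore
            y: T.Tensor((batch, seqlen, dim), "float16"),  # type: ignore
    ):
        with T.Kernel(T.ceildiv(seqlen, block_M), batch, threads=threads) as (bx, by):
            # Skip emitting the max-register-count hint so register
            # allocation is left entirely to the compiler.
            T.no_set_max_nreg()
            # ... chunk-scan body elided ...

    return main
```

Fast math relaxes IEEE floating-point semantics in the generated CUDA, which is why the review below asks for a numerical-drift check against the reference implementation.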

CUDA Device Code Updates:

  • Changed the annotation on all ptx_ldmatrix_* functions in ldsm.h from TL_DEVICE_NOINLINE to TL_DEVICE, allowing the compiler to inline these small loaders and potentially improve performance (an illustrative sketch follows below).
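
An illustrative before/after of that qualifier change (a sketch only: TL_DEVICE is confirmed later in this thread to expand to __forceinline__ __device__, but the TL_DEVICE_NOINLINE expansion and the loader's argument list shown here are assumptions):

```cuda
// Sketch of the qualifier change in src/tl_templates/cuda/ldsm.h.
#define TL_DEVICE __forceinline__ __device__
#define TL_DEVICE_NOINLINE __noinline__ __device__  // assumed expansion

// Before the PR the loader was kept out-of-line:
//   TL_DEVICE_NOINLINE void ptx_ldmatrix_x1(void const* smem_ptr, void* local_ptr);
// After the PR the same (unchanged) body is force-inlined into the caller:
TL_DEVICE void ptx_ldmatrix_x1(void const* smem_ptr, void* local_ptr) {
  // body unchanged: the inline "ldmatrix.sync.aligned..." PTX is elided here
}
```

Because these loaders are thin wrappers around a single ldmatrix PTX instruction, forcing them inline removes call overhead without growing code size by much.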

Summary by CodeRabbit

  • Optimization & Performance
    • Enabled aggressive compiler optimizations to improve kernel execution speed and computational efficiency
    • Optimized register allocation strategies and memory access patterns for more efficient resource utilization
    • Modified compiler directives to enable enhanced inline optimization opportunities, potentially improving overall performance for computation-heavy workloads

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Oct 22, 2025

Walkthrough

These changes involve compiler optimization adjustments and function signature reformatting. The Mamba chunk scan example enables fast-math optimization and adjusts register usage, while six CUDA matrix-load functions have NOINLINE qualifiers removed to permit compiler inlining.

Changes

  • Mamba chunk scan JIT optimization (examples/linear_attention/example_mamba_chunk_scan.py): Added pass_configs with TL_ENABLE_FAST_MATH: True to the JIT decorator; expanded the main function signature to a multi-line format with explicit parameter annotations and per-parameter # type: ignore comments; inserted a T.no_set_max_nreg() call to adjust register usage.
  • CUDA matrix-load function inlining (src/tl_templates/cuda/ldsm.h): Removed the NOINLINE qualifier from six ptx_ldmatrix_* functions (x1, x2, x4, x1_trans, x2_trans, x4_trans), changing their signatures from TL_DEVICE_NOINLINE to TL_DEVICE to enable compiler inlining. Function bodies remain unchanged.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

The Python file contains straightforward configuration and formatting changes with a new register-tuning call. The CUDA header changes are highly repetitive—the same qualifier removal applied identically across six similar functions. While the changes span different files and domains, their homogeneous nature and narrow scope minimize review complexity.

Poem

🐰 A sprig of optimization grows,
Fast math blooms where the kernel goes,
No inlines fade, let compilers decide,
Registers rest with graceful pride.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The pull request title "[Refactor] Use forceinline in ldmatrix and update mamba scan kernel" names the two components being modified (the ldmatrix functions and the mamba scan kernel) and accurately describes the ldmatrix change: forceinline enabled by removing the NOINLINE qualifier. The mamba portion is described only generically as "update", but the title is still specific enough that a teammate scanning history would understand that optimization-related changes were made to these two areas. Mentioning the fast-math enablement would make it more comprehensive, but the current title clearly communicates the scope and nature of the changes.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
examples/linear_attention/example_mamba_chunk_scan.py (2)

98-108: Prefer targeted type ignores or a typing shim over blanket # type: ignore.

Replace broad ignores with per-tool codes (e.g., # type: ignore[arg-type] for mypy or # pyright: ignore[reportArgumentType]), or introduce lightweight aliases/stubs for T.Tensor to satisfy the checker without suppressing everything.
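
For instance (a hypothetical signature, shown only to illustrate the comment syntax; the correct error code is whichever one the checker actually reports for the T.Tensor annotations):

```python
from __future__ import annotations  # keep annotations lazy for this standalone sketch

import tilelang.language as T


def main(
        # Blanket ignore: silences every checker and every error code on the line.
        x: T.Tensor((1, 64, 64), "float16"),  # type: ignore
        # Targeted mypy ignore, as suggested above; other diagnostics stay active.
        y: T.Tensor((1, 64, 64), "float16"),  # type: ignore[arg-type]
        # Pyright-specific ignore with an explicit rule name.
        z: T.Tensor((1, 64, 64), "float16"),  # pyright: ignore[reportArgumentType]
):
    ...
```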


145-146: Disabling max register cap: measure reg count, spills, and occupancy.

This can improve ILP but also reduce occupancy; please capture ptxas registers/spills and Nsight occupancy before/after. Consider making this tunable (e.g., autotune knob) if results are workload-dependent.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5683e6a and 76e31cb.

📒 Files selected for processing (2)
  • examples/linear_attention/example_mamba_chunk_scan.py (3 hunks)
  • src/tl_templates/cuda/ldsm.h (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/linear_attention/example_mamba_chunk_scan.py (3)
tilelang/jit/__init__.py (1)
  • jit (237-310)
tilelang/transform/pass_config.py (1)
  • PassConfigKey (6-144)
tilelang/language/builtin.py (1)
  • no_set_max_nreg (174-177)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Quick Lint
🔇 Additional comments (2)
examples/linear_attention/example_mamba_chunk_scan.py (1)

74-79: Fast-math enablement looks good; validate numerical drift.

Good perf-oriented change; please re-run the reference checks and capture max |diff|/relative error across several shapes to ensure rtol=1e-2, atol=1e-2 still holds. If borderline, gate via a flag/env for easy rollback.
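
A minimal sketch of such a check, assuming torch tensors from the fast-math and reference paths (names are placeholders; the example's own reference comparison remains the authoritative check):

```python
import torch


def report_drift(out_fast: torch.Tensor, out_ref: torch.Tensor) -> None:
    """Report max absolute/relative error and enforce the example's tolerances."""
    diff = (out_fast.float() - out_ref.float()).abs()
    rel = diff / out_ref.float().abs().clamp_min(1e-6)
    print(f"max |diff| = {diff.max().item():.3e}, max rel err = {rel.max().item():.3e}")
    torch.testing.assert_close(out_fast, out_ref, rtol=1e-2, atol=1e-2)
```

Running this over a few (batch, seqlen, dim) shapes before and after enabling fast math gives the drift numbers the comment asks for.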

src/tl_templates/cuda/ldsm.h (1)

7-63: No issues found—TL_DEVICE correctly includes __forceinline__.

The verification confirms that TL_DEVICE in both src/tl_templates/cuda/common.h and src/tl_templates/hip/common.h is properly defined as __forceinline__ __device__. The ldmatrix loaders in ldsm.h will be inlined as intended, and the code requires no changes.

@chengyupku chengyupku merged commit 8a5eb56 into tile-ai:main Oct 22, 2025
7 checks passed