
Conversation

@chengyupku
Contributor

@chengyupku chengyupku commented Oct 22, 2025

This pull request improves both the Python example for linear attention and the CUDA device code for loading matrix data. The main changes enable fast math optimizations, silence type-checker warnings in the Python kernel definition, and update CUDA device function annotations for better inlining and performance.

Python Example Improvements:

  • Enabled fast math optimizations in the chunk_scan_fwd function by passing TL_ENABLE_FAST_MATH: True to the tilelang.jit decorator.
  • Added # type: ignore comments to the tensor arguments of the main kernel function to suppress static type-checker warnings on the T.Tensor annotations.
  • Inserted a call to T.no_set_max_nreg() in the kernel to lift the maximum-register cap and potentially improve register usage (a minimal sketch of these changes appears after this list).
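
A minimal sketch of how these pieces fit together, assuming the pass_configs dict accepts the PassConfigKey.TL_ENABLE_FAST_MATH key directly; the factory name, shapes, dtypes, and block sizes below are placeholders and the kernel body is elided (the real code is examples/linear_attention/example_mamba_chunk_scan.py):

```python
import tilelang
import tilelang.language as T
from tilelang.transform.pass_config import PassConfigKey


# Hypothetical, reduced kernel factory; the exact spelling of the
# pass-config key in the real example may differ.
@tilelang.jit(pass_configs={PassConfigKey.TL_ENABLE_FAST_MATH: True})
def chunk_scan_fwd(batch, seqlen, dim, block_M=64, threads=128):

    @T.prim_func
    def main(
            x: T.Tensor((batch, seqlen, dim), "float16"),  # type: ignore
            y: T.Tensor((batch, seqlen, dim), "float16"),  # type: ignore
    ):
        with T.Kernel(T.ceildiv(seqlen, block_M), batch, threads=threads) as (bx, by):
            # Skip emitting the max-register-count hint so register
            # allocation is left entirely to the compiler.
            T.no_set_max_nreg()
            # ... chunk-scan body elided ...

    return main
```

Fast math relaxes IEEE floating-point semantics in the generated CUDA, which is why the review below asks for a numerical-drift check against the reference implementation.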

CUDA Device Code Updates:

  • Changed the annotation on all ptx_ldmatrix_* functions in ldsm.h from TL_DEVICE_NOINLINE to TL_DEVICE, allowing the compiler to inline these small loaders and potentially improve performance (an illustrative sketch follows below).
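
An illustrative before/after of that qualifier change (a sketch only: TL_DEVICE is confirmed later in this thread to expand to __forceinline__ __device__, but the TL_DEVICE_NOINLINE expansion and the loader's argument list shown here are assumptions):

```cuda
// Sketch of the qualifier change in src/tl_templates/cuda/ldsm.h.
#define TL_DEVICE __forceinline__ __device__
#define TL_DEVICE_NOINLINE __noinline__ __device__  // assumed expansion

// Before the PR the loader was kept out-of-line:
//   TL_DEVICE_NOINLINE void ptx_ldmatrix_x1(void const* smem_ptr, void* local_ptr);
// After the PR the same (unchanged) body is force-inlined into the caller:
TL_DEVICE void ptx_ldmatrix_x1(void const* smem_ptr, void* local_ptr) {
  // body unchanged: the inline "ldmatrix.sync.aligned..." PTX is elided here
}
```

Because these loaders are thin wrappers around a single ldmatrix PTX instruction, forcing them inline removes call overhead without growing code size by much.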

Summary by CodeRabbit

  • Optimization & Performance
    • Enabled aggressive compiler optimizations to improve kernel execution speed and computational efficiency
    • Optimized register allocation strategies and memory access patterns for more efficient resource utilization
    • Modified compiler directives to enable enhanced inline optimization opportunities, potentially improving overall performance for computation-heavy workloads

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Oct 22, 2025

Walkthrough

These changes involve compiler optimization adjustments and function signature reformatting. The Mamba chunk scan example enables fast-math optimization and adjusts register usage, while six CUDA matrix-load functions have NOINLINE qualifiers removed to permit compiler inlining.

Changes

  • Mamba chunk scan JIT optimization (examples/linear_attention/example_mamba_chunk_scan.py): Added pass_configs with TL_ENABLE_FAST_MATH: True to the JIT decorator; expanded the main function signature to a multi-line format with explicit parameter annotations and per-parameter # type: ignore comments; inserted a T.no_set_max_nreg() call to adjust register usage.
  • CUDA matrix-load function inlining (src/tl_templates/cuda/ldsm.h): Removed the NOINLINE qualifier from six ptx_ldmatrix_* functions (x1, x2, x4, x1_trans, x2_trans, x4_trans), changing their signatures from TL_DEVICE_NOINLINE to TL_DEVICE to enable compiler inlining. Function bodies remain unchanged.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

The Python file contains straightforward configuration and formatting changes with a new register-tuning call. The CUDA header changes are highly repetitive—the same qualifier removal applied identically across six similar functions. While the changes span different files and domains, their homogeneous nature and narrow scope minimize review complexity.

Poem

🐰 A sprig of optimization grows,
Fast math blooms where the kernel goes,
No inlines fade, let compilers decide,
Registers rest with graceful pride.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The pull request title "[Refactor] Use forceinline in ldmatrix and update mamba scan kernel" names the two components being modified (the ldmatrix functions and the mamba scan kernel) and accurately describes the ldmatrix change: forceinline enabled by removing the NOINLINE qualifier. The mamba portion is described only generically as "update", but the title is still specific enough that a teammate scanning history would understand that optimization-related changes were made to these two areas. Mentioning the fast-math enablement would make it more comprehensive, but the current title clearly communicates the scope and nature of the changes.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
examples/linear_attention/example_mamba_chunk_scan.py (2)

98-108: Prefer targeted type ignores or a typing shim over blanket # type: ignore.

Replace broad ignores with per-tool codes (e.g., # type: ignore[arg-type] for mypy or # pyright: ignore[reportArgumentType]), or introduce lightweight aliases/stubs for T.Tensor to satisfy the checker without suppressing everything.
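
For instance (a hypothetical signature, shown only to illustrate the comment syntax; the correct error code is whichever one the checker actually reports for the T.Tensor annotations):

```python
from __future__ import annotations  # keep annotations lazy for this standalone sketch

import tilelang.language as T


def main(
        # Blanket ignore: silences every checker and every error code on the line.
        x: T.Tensor((1, 64, 64), "float16"),  # type: ignore
        # Targeted mypy ignore, as suggested above; other diagnostics stay active.
        y: T.Tensor((1, 64, 64), "float16"),  # type: ignore[arg-type]
        # Pyright-specific ignore with an explicit rule name.
        z: T.Tensor((1, 64, 64), "float16"),  # pyright: ignore[reportArgumentType]
):
    ...
```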


145-146: Disabling max register cap: measure reg count, spills, and occupancy.

This can improve ILP but also reduce occupancy; please capture ptxas registers/spills and Nsight occupancy before/after. Consider making this tunable (e.g., autotune knob) if results are workload-dependent.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5683e6a and 76e31cb.

📒 Files selected for processing (2)
  • examples/linear_attention/example_mamba_chunk_scan.py (3 hunks)
  • src/tl_templates/cuda/ldsm.h (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/linear_attention/example_mamba_chunk_scan.py (3)
tilelang/jit/__init__.py (1)
  • jit (237-310)
tilelang/transform/pass_config.py (1)
  • PassConfigKey (6-144)
tilelang/language/builtin.py (1)
  • no_set_max_nreg (174-177)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Quick Lint
🔇 Additional comments (2)
examples/linear_attention/example_mamba_chunk_scan.py (1)

74-79: Fast-math enablement looks good; validate numerical drift.

Good perf-oriented change; please re-run the reference checks and capture max |diff|/relative error across several shapes to ensure rtol=1e-2, atol=1e-2 still holds. If borderline, gate via a flag/env for easy rollback.
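
A minimal sketch of such a check, assuming torch tensors from the fast-math and reference paths (names are placeholders; the example's own reference comparison remains the authoritative check):

```python
import torch


def report_drift(out_fast: torch.Tensor, out_ref: torch.Tensor) -> None:
    """Report max absolute/relative error and enforce the example's tolerances."""
    diff = (out_fast.float() - out_ref.float()).abs()
    rel = diff / out_ref.float().abs().clamp_min(1e-6)
    print(f"max |diff| = {diff.max().item():.3e}, max rel err = {rel.max().item():.3e}")
    torch.testing.assert_close(out_fast, out_ref, rtol=1e-2, atol=1e-2)
```

Running this over a few (batch, seqlen, dim) shapes before and after enabling fast math gives the drift numbers the comment asks for.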

src/tl_templates/cuda/ldsm.h (1)

7-63: No issues found—TL_DEVICE correctly includes __forceinline__.

The verification confirms that TL_DEVICE in both src/tl_templates/cuda/common.h and src/tl_templates/hip/common.h is properly defined as __forceinline__ __device__. The ldmatrix loaders in ldsm.h will be inlined as intended, and the code requires no changes.

@chengyupku chengyupku merged commit 8a5eb56 into tile-ai:main Oct 22, 2025
7 checks passed