[Refactor] Use forceinline in ldmatrix and update mamba scan kernel #1104
        
Conversation
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run  We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Walkthrough
These changes involve compiler optimization adjustments and function signature reformatting. The Mamba chunk scan example enables fast-math optimization and adjusts register usage, while six CUDA matrix-load functions have the NOINLINE qualifier removed to permit compiler inlining.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
The Python file contains straightforward configuration and formatting changes with a new register-tuning call. The CUDA header changes are highly repetitive: the same qualifier removal is applied identically across six similar functions. While the changes span different files and domains, their homogeneous nature and narrow scope minimize review complexity.
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
 ✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
examples/linear_attention/example_mamba_chunk_scan.py (2)
98-108: Prefer targeted type ignores or a typing shim over a blanket `# type: ignore`.
Replace broad ignores with per-tool codes (e.g., `# type: ignore[arg-type]` for mypy or `# pyright: ignore[reportArgumentType]`), or introduce lightweight aliases/stubs for `T.Tensor` to satisfy the checker without suppressing everything.
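A minimal sketch of the targeted-suppression style, assuming a tilelang-style kernel signature (the function name, shapes, and dtype here are illustrative, not taken from the PR):

```python
from __future__ import annotations

import tilelang.language as T

# Illustrative signature only: the point is scoping each suppression to one
# checker and one error code instead of a file-wide "# type: ignore".
def main(
    x: T.Tensor((1024, 64), "float16"),  # type: ignore[arg-type]
    y: T.Tensor((1024, 64), "float16"),  # pyright: ignore[reportArgumentType]
) -> None:
    ...
```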
145-146: Disabling the max register cap: measure register count, spills, and occupancy.
This can improve ILP but can also reduce occupancy; please capture ptxas registers/spills and Nsight occupancy before/after. Consider making this tunable (e.g., an autotune knob) if results are workload-dependent.
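One way to capture the ptxas statistics requested above is to recompile the generated CUDA source with verbose ptxas output. A minimal sketch, assuming nvcc is on PATH and the generated kernel source has been dumped to kernel.cu (the filename and the sm_90 target are placeholders):

```python
import subprocess

# --ptxas-options=-v makes ptxas report per-kernel register usage and spills,
# e.g. "ptxas info : Used 168 registers, 0 bytes spill stores, 0 bytes spill loads".
result = subprocess.run(
    ["nvcc", "-arch=sm_90", "-cubin", "--ptxas-options=-v", "kernel.cu", "-o", "kernel.cubin"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stderr)  # ptxas writes its statistics to stderr
```

Comparing these numbers (plus Nsight Compute occupancy) with and without the `T.no_set_max_nreg()` call would give the before/after data the comment asks for.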
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- examples/linear_attention/example_mamba_chunk_scan.py (3 hunks)
- src/tl_templates/cuda/ldsm.h (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/linear_attention/example_mamba_chunk_scan.py (3)
- tilelang/jit/__init__.py (1): jit (237-310)
- tilelang/transform/pass_config.py (1): PassConfigKey (6-144)
- tilelang/language/builtin.py (1): no_set_max_nreg (174-177)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Quick Lint
🔇 Additional comments (2)
examples/linear_attention/example_mamba_chunk_scan.py (1)
74-79: Fast-math enablement looks good; validate numerical drift.
Good perf-oriented change; please re-run the reference checks and capture the max |diff| / relative error across several shapes to ensure rtol=1e-2, atol=1e-2 still holds. If borderline, gate it via a flag/env for easy rollback.
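A minimal sketch of such a drift check, assuming the kernel output and the reference are available as torch tensors (the names out and ref are illustrative):

```python
import torch

def report_drift(out: torch.Tensor, ref: torch.Tensor, rtol: float = 1e-2, atol: float = 1e-2) -> None:
    # Compare in float32 so the error statistics are not themselves rounded away.
    out32, ref32 = out.float(), ref.float()
    abs_err = (out32 - ref32).abs()
    rel_err = abs_err / ref32.abs().clamp_min(1e-6)
    print(f"max |diff| = {abs_err.max().item():.3e}, max rel err = {rel_err.max().item():.3e}")
    assert torch.allclose(out32, ref32, rtol=rtol, atol=atol), "fast-math drift exceeds tolerance"
```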
src/tl_templates/cuda/ldsm.h (1)
7-63: No issues found: `TL_DEVICE` correctly includes `__forceinline__`.
The verification confirms that `TL_DEVICE` in both `src/tl_templates/cuda/common.h` and `src/tl_templates/hip/common.h` is properly defined as `__forceinline__ __device__`. The ldmatrix loaders in `ldsm.h` will be inlined as intended, and the code requires no changes.
This pull request introduces improvements to both the Python example for linear attention and the CUDA device code for loading matrix data. The main changes focus on enabling fast math optimizations, improving type safety in the Python kernel definition, and updating CUDA device function annotations for better inlining and performance.
Python Example Improvements:
- Enabled fast math in the `chunk_scan_fwd` function by passing `TL_ENABLE_FAST_MATH: True` to the `tilelang.jit` decorator (a sketch of this usage follows the list).
- Added `# type: ignore` comments to the tensor arguments of the `main` kernel function for better type safety and compatibility.
- Added a `T.no_set_max_nreg()` call in the kernel to potentially improve register usage.
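A minimal sketch of what the decorator change amounts to, assuming tilelang.jit accepts a pass_configs mapping and that the fast-math key is exposed as PassConfigKey.TL_ENABLE_FAST_MATH (the parameter name, import path, illustrative signature, and elided kernel body are assumptions to check against tilelang/jit/__init__.py and tilelang/transform/pass_config.py):

```python
import tilelang
from tilelang.transform.pass_config import PassConfigKey  # assumed import path

# Assumed decorator wiring: enable fast-math codegen for this kernel only,
# keeping the setting easy to flip off if numerical drift becomes a problem.
@tilelang.jit(pass_configs={PassConfigKey.TL_ENABLE_FAST_MATH: True})
def chunk_scan_fwd(batch: int, seqlen: int, chunk_size: int, nheads: int):
    ...  # kernel construction elided; see the example file for the real body
```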
CUDA Device Code Updates:
- Changed the `ptx_ldmatrix_*` functions in `ldsm.h` from `TL_DEVICE_NOINLINE` to `TL_DEVICE`, allowing the compiler to inline these functions and potentially improve performance.