[Language] Expose T.get_warp_idx_sync and T.shuffle_elect for efficient thread election
#989
Conversation
Walkthrough
Adds four new TileLang (TL) intrinsics for lane/warp queries (get_lane_idx, get_warp_idx_sync, get_warp_idx, get_warp_group_idx) across C++ builtin registration, header declarations, CUDA codegen emission, CUDA device intrinsics, Python wrappers, and tests, plus a small index-handling tweak in a JIT adapter.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Py as Python API (tilelang.language.builtin)
participant TL as TL Op Registry
participant IR as Lowering / CallNode
participant CG as CUDA Codegen (codegen_cuda.cc)
participant CU as CUDA Device Helpers (tl_templates/cuda/intrin.h)
Py->>TL: call get_lane_idx()/get_warp_idx(_sync)/get_warp_group_idx()
TL->>IR: produce Call to tl.* builtin (0 args, kPure)
IR->>CG: lower/visit CallNode for tl.* intrinsic
CG->>CU: emit tl::get_* device call (with optional params)
CU-->>Py: runtime returns lane/warp/group index value
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Actionable comments posted: 0
🧹 Nitpick comments (1)
tilelang/language/builtin.py (1)
350-371: Document the compile-time constant requirement.
The thread_extent parameter is used as a C++ template argument in the codegen (tl::tl_shuffle_elect<thread_extent>()), which requires a compile-time constant. However, the current documentation doesn't explicitly state this constraint. Consider adding a note to clarify this requirement:
def shuffle_elect(thread_extent: int) -> PrimExpr:
    """Elect exactly one lane within a logical thread group.

    Parameters
    ----------
    thread_extent : int
        Size (in threads) of the group in which a single lane should be elected.
        Passing 0 elects a single lane in the entire thread block.
+       Must be a compile-time constant (integer literal).

Alternatively, you could add runtime validation:
def shuffle_elect(thread_extent: int) -> PrimExpr:
    """..."""
    if not isinstance(thread_extent, int):
        raise TypeError(f"thread_extent must be an integer literal, got {type(thread_extent)}")
    return tir.call_intrin("bool", tir.op.Op.get("tl.tl_shuffle_elect"), thread_extent)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- src/op/builtin.cc (1 hunks)
- src/op/builtin.h (1 hunks)
- src/target/codegen_cuda.cc (1 hunks)
- src/tl_templates/cuda/intrin.h (2 hunks)
- tilelang/language/builtin.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
tilelang/language/builtin.py (1)
tilelang/language/tir/op.py (1)
call_intrin (119-144)
src/op/builtin.cc (1)
tilelang/language/builtin.py (4)
get_lane_idx (283-297), get_warp_idx_sync (300-314), get_warp_idx (317-330), get_warp_group_idx (333-347)
src/op/builtin.h (1)
tilelang/language/builtin.py (4)
get_lane_idx (283-297), get_warp_idx_sync (300-314), get_warp_idx (317-330), get_warp_group_idx (333-347)
src/tl_templates/cuda/intrin.h (2)
src/tl_templates/cuda/gemm_sm90.h (1)
tl (231-385)
tilelang/language/builtin.py (4)
get_lane_idx (283-297), get_warp_idx_sync (300-314), get_warp_idx (317-330), get_warp_group_idx (333-347)
src/target/codegen_cuda.cc (1)
tilelang/language/builtin.py (4)
get_lane_idx (283-297), get_warp_idx_sync (300-314), get_warp_idx (317-330), get_warp_group_idx (333-347)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format-check
🔇 Additional comments (5)
src/tl_templates/cuda/intrin.h (1)
13-21: LGTM! Well-structured additions.The four new index helpers are correctly placed outside the SM90+ conditional block, making them available across all CUDA architectures. The delegation to CUTLASS canonical functions is appropriate, and the namespace structure is preserved correctly.
src/op/builtin.h (1)
361-391: LGTM!
The four new intrinsic declarations follow the existing pattern and are well-documented. The positioning between warpgroup_wait() and wait_wgmma() is logical and maintains consistency with the file structure.
src/target/codegen_cuda.cc (1)
1971-1978: LGTM!
The CUDA codegen correctly emits the tl::get_*() calls for the new intrinsics. The implementation is straightforward and consistent with other zero-argument intrinsic patterns in the codebase.
src/op/builtin.cc (1)
221-239: LGTM! Marking intrinsics as pure is appropriate.
All four intrinsics are correctly registered with zero inputs and the kPure effect kind. While get_warp_idx_sync uses __shfl_sync under the hood (which has implicit warp synchronization), marking it as kPure is reasonable because:
- The function is deterministic (same inputs → same outputs)
- It doesn't modify observable state or memory
- The warp-level synchronization is implicit in CUDA's execution model
- This aligns with how similar warp-level operations are typically classified
tilelang/language/builtin.py (1)
283-347: LGTM! Well-documented index query functions.
The four new index query functions (get_lane_idx, get_warp_idx_sync, get_warp_idx, get_warp_group_idx) are well-implemented with comprehensive docstrings, clear examples, and detailed implementation notes explaining the underlying CUDA/CUTLASS mechanisms.
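For context, a minimal sketch of how such a wrapper can lower to the registered TL builtin via tir.call_intrin (illustrative only; the actual file also normalizes an optional warp_size argument in the later revision of this PR):

# Illustrative sketch, not the exact body of tilelang/language/builtin.py:
# lower a zero-argument call to the registered "tl.get_warp_idx_sync" builtin.
from tvm import tir
from tvm.ir import Op


def get_warp_idx_sync() -> tir.PrimExpr:
    """Return the warp index of the calling thread (uniform across the warp)."""
    return tir.call_intrin("int32", Op.get("tl.get_warp_idx_sync"))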
Some CUDA kernels treat, e.g., 5 or 6 warps as a warp group, so I suggest making the index calculation more general: for example, take the warp size (AMD uses 64 threads per warp) and the warp-group size as parameters, with arch-specific default values.
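To make the suggestion concrete, the generalized index math looks roughly like this (plain-Python sketch with hypothetical defaults; the revised intrinsics in this PR expose warp_size / warps_per_group arguments directly):

# Plain-Python sketch of the generalized index math (hypothetical helper, not library code).
def thread_indices(thread_id: int, warp_size: int = 32, warps_per_group: int = 4):
    """Return (lane, warp, warp_group) for a flat thread id.

    warp_size would default to 32 on NVIDIA and 64 on AMD; warps_per_group is the
    arch-specific group size (4 warps = 128 threads for Hopper-style warp groups).
    """
    lane_idx = thread_id % warp_size
    warp_idx = thread_id // warp_size
    warp_group_idx = thread_id // (warp_size * warps_per_group)
    return lane_idx, warp_idx, warp_group_idx


# Example: thread 200 on AMD (64-wide warps, 2 warps per group) -> lane 8, warp 3, group 1
assert thread_indices(200, warp_size=64, warps_per_group=2) == (8, 3, 1)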
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tilelang/jit/adapter/base.py (1)
33-39: Fix off-by-one bounds checks and error messages in _legalize_result_idx

- In the int branch (around line 24), change the upper bound check to >= len(params) and update the message:

-    if result_idx > len(params) or result_idx < -len(params):
+    if result_idx >= len(params) or result_idx < -len(params):
         ...
-        f"result_idx should be an integer between {-len(params) - 1} and {len(params) - 1}"
+        f"result_idx should be an integer between {-len(params)} and {len(params) - 1}"

- In the list branch (around line 33), adjust only the message:

-        f"result_idx should be an integer between {-len(params) - 1} and {len(params) - 1}"
+        f"result_idx should be an integer between {-len(params)} and {len(params) - 1}"
🧹 Nitpick comments (3)
src/tl_templates/cuda/intrin.h (1)
51-60: Consider parameterizing warp size in tl_shuffle_elect for consistency
Elsewhere you allow non-32 warp sizes (HIP/AMD 64), but tl_shuffle_elect still divides by 32. Consider using detail::default_warp_size() (or a template param) instead of 32 to keep behavior consistent across vendors.
tilelang/language/builtin.py (1)
407-429: Enforce that thread_extent is a compile-time constant
Add a check in shuffle_elect (tilelang/language/builtin.py) to require that thread_extent is a tir.IntImm (e.g. assert isinstance(thread_extent, tir.IntImm)), so non-literal arguments error out before C++ codegen.
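A minimal sketch of that guard (illustrative; assumes the tvm.tir types already imported by the module):

# Sketch of a compile-time-constant guard for shuffle_elect.
from tvm import tir


def _check_thread_extent(thread_extent) -> int:
    """Require an int literal or tir.IntImm so the value can become a C++ template argument."""
    if isinstance(thread_extent, tir.IntImm):
        return int(thread_extent.value)
    if isinstance(thread_extent, int):
        return thread_extent
    raise TypeError(f"thread_extent must be a compile-time constant, got {type(thread_extent)}")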
testing/python/language/test_tilelang_language_get_warp_info.py (1)
160-208: LGTM: Comprehensive test coverage.
The test suite covers both default configurations and custom parameter overrides for all five intrinsics. The @tilelang.testing.requires_cuda decorator correctly ensures tests only run on CUDA-capable hardware. Consider adding brief docstrings to document what each test validates, especially for tests like test_shuffle_elect_block_leader, where the behavior being tested (electing a block leader with thread_extent=0) may not be immediately obvious. Example:
@tilelang.testing.requires_cuda
def test_shuffle_elect_block_leader():
    """Test shuffle_elect with thread_extent=0 elects a single block-wide leader."""
    run_shuffle_elect(num_threads=128, thread_extent=0)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- src/op/builtin.cc (1 hunks)
- src/op/builtin.h (1 hunks)
- src/target/codegen_cuda.cc (1 hunks)
- src/tl_templates/cuda/intrin.h (2 hunks)
- testing/python/language/test_tilelang_language_get_warp_info.py (1 hunks)
- tilelang/jit/adapter/base.py (1 hunks)
- tilelang/language/builtin.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/op/builtin.cc
🧰 Additional context used
🧬 Code graph analysis (6)
tilelang/jit/adapter/base.py (1)
tilelang/jit/kernel.py (1)
params (466-467)
src/op/builtin.h (1)
tilelang/language/builtin.py (4)
get_lane_idx (297-319), get_warp_idx_sync (322-343), get_warp_idx (346-367), get_warp_group_idx (370-404)
src/tl_templates/cuda/intrin.h (1)
tilelang/language/builtin.py (4)
get_lane_idx (297-319), get_warp_idx_sync (322-343), get_warp_idx (346-367), get_warp_group_idx (370-404)
testing/python/language/test_tilelang_language_get_warp_info.py (3)
tilelang/utils/target.py (1)
check_hip_availability (33-43)
tilelang/language/builtin.py (5)
get_lane_idx (297-319), get_warp_idx_sync (322-343), get_warp_idx (346-367), get_warp_group_idx (370-404), shuffle_elect (407-428)
tilelang/jit/adapter/base.py (1)
get_kernel_source (51-52)
src/target/codegen_cuda.cc (1)
tilelang/language/builtin.py (4)
get_lane_idx (297-319), get_warp_idx_sync (322-343), get_warp_idx (346-367), get_warp_group_idx (370-404)
tilelang/language/builtin.py (2)
src/op/builtin.h (1)
tvm (13-506)
tilelang/language/tir/op.py (1)
call_intrin (119-144)
🪛 Ruff (0.13.3)
testing/python/language/test_tilelang_language_get_warp_info.py
135-135: Avoid specifying long messages outside the exception class
(TRY003)
143-143: Avoid specifying long messages outside the exception class
(TRY003)
tilelang/language/builtin.py
25-25: Avoid specifying long messages outside the exception class
(TRY003)
401-402: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (13)
src/tl_templates/cuda/intrin.h (1)
13-60: Defaults and helpers look good
Arch-aware defaults, argument sanitization, and linear index math are fine.
src/op/builtin.h (1)
361-392: LGTM: Op declarations match codegen and Python API
New builtin op declarations align with emitters and wrappers.
tilelang/language/builtin.py (5)
14-26: LGTM: argument normalization helper
Accepts int/PrimExpr, returns IntImm or None; straightforward.
297-320: get_lane_idx wrapper is correct
Lowers with optional warp_size; docs match behavior.
322-344: get_warp_idx_sync wrapper is correct
Matches intended lowering; optional warp_size supported.
346-368: get_warp_idx wrapper is correct
Consistent with codegen/device side.
370-405: get_warp_group_idx wrapper enforces valid args; good
Argument gating is sensible; lowering matches codegen expectation.
testing/python/language/test_tilelang_language_get_warp_info.py (6)
1-10: LGTM: Clean imports and architecture-aware setup.
The imports are well-organized, and the HIP availability check enables cross-platform testing. The default warps-per-group value (4) aligns with NVIDIA's typical warp-group configuration.
12-22: LGTM: Architecture-aware parameter resolution.
The helper functions correctly provide architecture-specific defaults (32 for NVIDIA, 64 for AMD), allowing tests to run across different GPU vendors while supporting explicit overrides.
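A sketch of what such helpers can look like (hypothetical names; the test file's actual helpers may differ):

# Hypothetical helpers mirroring the tests' architecture-aware defaults.
from tilelang.utils.target import check_hip_availability

DEFAULT_WARPS_PER_GROUP = 4


def resolve_warp_size(explicit=None) -> int:
    """Use the explicit override if given, else 64 on AMD (HIP) and 32 on NVIDIA."""
    if explicit is not None:
        return explicit
    return 64 if check_hip_availability() else 32


def resolve_warps_per_group(explicit=None) -> int:
    return explicit if explicit is not None else DEFAULT_WARPS_PER_GROUP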
24-86: LGTM: Well-structured kernel builders.
The JIT-decorated kernel builders follow a consistent pattern and correctly exercise each intrinsic (T.get_lane_idx, T.get_warp_idx_sync, T.get_warp_idx, T.get_warp_group_idx, T.shuffle_elect). The parameter forwarding is correct.
89-119: LGTM: Reference calculations are correct.
The reference tensors for lane and warp index tests correctly implement the expected semantics:

- Lane index: thread_id % warp_size
- Warp index (sync and non-sync): thread_id // warp_size

The use of torch.testing.assert_close for validation is appropriate.
122-138: LGTM: Warp group index calculation is correct.
The reference calculation thread_id // (warp_size * warps_per_group) correctly implements the expected warp-group indexing semantics. The defensive check for threads_per_group <= 0 is reasonable, though unlikely to trigger in normal usage.
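For reference, the expected tensors can be built directly from a flat thread-id range (sketch only, assuming a CUDA device and the default NVIDIA configuration):

# Sketch of the reference index tensors the tests compare against.
import torch

num_threads, warp_size, warps_per_group = 128, 32, 4
tid = torch.arange(num_threads, device="cuda", dtype=torch.int64)

lane_ref = tid % warp_size                        # thread_id % warp_size
warp_ref = tid // warp_size                       # thread_id // warp_size
group_ref = tid // (warp_size * warps_per_group)  # thread_id // (warp_size * warps_per_group)

# e.g. torch.testing.assert_close(kernel_output.cpu(), lane_ref.to(torch.int32).cpu())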
210-211: LGTM: Standard test runner setup.
The main block correctly invokes the test framework, and the commented debug line is useful for manual testing during development.
} else if (op->op.same_as(tl::get_lane_idx())) {
  ICHECK_LE(op->args.size(), 1)
      << "tl.get_lane_idx expects at most one argument <warp_size>.";
  os << "tl::get_lane_idx(";
  if (!op->args.empty()) {
    os << PrintExpr(op->args[0]);
  }
  os << ")";
} else if (op->op.same_as(tl::get_warp_idx_sync())) {
  ICHECK_LE(op->args.size(), 1)
      << "tl.get_warp_idx_sync expects at most one argument <warp_size>.";
  os << "tl::get_warp_idx_sync(";
  if (!op->args.empty()) {
    os << PrintExpr(op->args[0]);
  }
  os << ")";
} else if (op->op.same_as(tl::get_warp_idx())) {
  ICHECK_LE(op->args.size(), 1)
      << "tl.get_warp_idx expects at most one argument <warp_size>.";
  os << "tl::get_warp_idx(";
  if (!op->args.empty()) {
    os << PrintExpr(op->args[0]);
  }
  os << ")";
} else if (op->op.same_as(tl::get_warp_group_idx())) {
  ICHECK_LE(op->args.size(), 2)
      << "tl.get_warp_group_idx expects <warp_size, warps_per_group>.";
  os << "tl::get_warp_group_idx(";
  for (size_t i = 0; i < op->args.size(); ++i) {
    if (i != 0) {
      os << ", ";
    }
    os << PrintExpr(op->args[i]);
  }
  os << ")";
Add missing include for tl intrinsics to avoid undefined references
Emission for tl::get_lane_idx / tl::get_warp_idx(_sync) / tl::get_warp_group_idx looks correct and arity checks are fine. However, there’s no include for the tl intrinsics header; this can fail to compile when these symbols aren’t already brought in indirectly.
Include the header alongside other tl headers in Finish():
decl_stream << "#include <tl_templates/cuda/gemm.h>\n";
if (enable_sparse_gemm_) {
decl_stream << "#include <tl_templates/cuda/gemm_sp.h>\n";
}
decl_stream << "#include <tl_templates/cuda/copy.h>\n";
decl_stream << "#include <tl_templates/cuda/reduce.h>\n";
decl_stream << "#include <tl_templates/cuda/ldsm.h>\n";
decl_stream << "#include <tl_templates/cuda/threadblock_swizzle.h>\n";
decl_stream << "#include <tl_templates/cuda/debug.h>\n";
+ decl_stream << "#include <tl_templates/cuda/intrin.h>\n";
decl_stream << "#ifdef ENABLE_BF16\n";📝 Committable suggestion
🤖 Prompt for AI Agents
In src/target/codegen_cuda.cc around lines 1971 to 2005, the code emits calls to
tl::get_lane_idx, tl::get_warp_idx(_sync) and tl::get_warp_group_idx but the
translation layer intrinsics header isn’t being included in Finish(), which can
cause undefined reference/compile errors; add an #include for the tl intrinsics
header (the header that declares
tl::get_lane_idx/get_warp_idx/get_warp_group_idx) alongside the other tl headers
in the Finish() function so the symbols are declared when these calls are
emitted.
def run_shuffle_elect(num_threads: int = 128, thread_extent: int = 64):
    if thread_extent < 0:
        raise ValueError("thread_extent must be non-negative.")
    kernel = _shuffle_elect_kernel(num_threads, thread_extent)
    A = kernel()
    print(kernel.get_kernel_source())
    print(A)
    indices = torch.arange(num_threads, device=A.device, dtype=torch.int64)
    if thread_extent == 0:
        mask = indices == 0
    elif thread_extent > 0:
        mask = (indices % thread_extent) == 0
    else:
        mask = torch.zeros_like(indices, dtype=torch.bool)
    ref = mask.to(dtype=A.dtype, device=A.device)
    torch.testing.assert_close(A.cpu(), ref.cpu())
    return A
Remove unreachable dead code.
Lines 153-154 are unreachable because:
- Line 142 raises an exception if
thread_extent < 0 - Line 149 handles
thread_extent == 0 - Line 151 handles
thread_extent > 0
The else branch can never execute.
Apply this diff to remove the dead code:
if thread_extent == 0:
mask = indices == 0
elif thread_extent > 0:
mask = (indices % thread_extent) == 0
- else:
- mask = torch.zeros_like(indices, dtype=torch.bool)
ref = mask.to(dtype=A.dtype, device=A.device)
🧰 Tools
🪛 Ruff (0.13.3)
143-143: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_get_warp_info.py around lines
141 to 157, remove the unreachable final else branch (lines 153-154) that sets
mask to torch.zeros_like(...) because thread_extent < 0 is already prevented by
the early ValueError and the other branches handle ==0 and >0; simply delete
that else block so mask is only set in the existing thread_extent == 0 and
thread_extent > 0 branches (or convert the second branch to an else if you
prefer), leaving mask always defined before building ref.
@codex review
💡 Codex Review
tilelang/src/tl_templates/cuda/intrin.h
Lines 100 to 103 in 65b77aa
return __shfl_sync(0xffffffff,             // full warp mask
                   (threadIdx.x / 32) %
                       (thread_extent / 32), // warp index within group
                   0                         // take the value from lane 0
tl_shuffle_elect
The general branch of tl_shuffle_elect divides by (thread_extent / 32) without validating that thread_extent is at least one warp and divisible by 32. When a caller passes any thread_extent in the range 1–31, the expression instantiates the template with a zero denominator, causing a compile-time error. For thread extents such as 48 or 96, integer division truncates to 1 or 3 warps, so the code elects one lane per physical warp instead of one per 48- or 96-thread group, yielding two or three leaders per logical group. The Python API exposes shuffle_elect without documenting these constraints, so valid-looking inputs produce either build failures or incorrect leaders.
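One way to surface these constraints early is to validate thread_extent on the Python side before lowering (sketch only, assuming 32-thread warps; the real fix might instead generalize the device-side template as suggested in the earlier review comments):

# Sketch: reject thread extents tl_shuffle_elect cannot handle correctly.
def _validate_shuffle_elect_extent(thread_extent: int, warp_size: int = 32) -> None:
    if thread_extent == 0:
        return  # 0 means "elect one lane in the whole thread block"
    if thread_extent < warp_size or thread_extent % warp_size != 0:
        raise ValueError(
            f"thread_extent must be 0 or a positive multiple of {warp_size}, "
            f"got {thread_extent}")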
This pull request introduces new thread and warp identification intrinsics for CUDA backends, making it easier to query thread indices, warp indices, and warp group indices in TileLang programs. The changes span the TileLang Python API, CUDA code generation, and the underlying C++ operators and device intrinsics.
New CUDA thread and warp index intrinsics:
Python API additions:
tilelang/language/builtin.py: get_lane_idx, get_warp_idx_sync, get_warp_idx, and get_warp_group_idx, each returning the corresponding thread/warp/group index as a PrimExpr. These functions provide clear documentation and usage examples, and are lowered to new CUDA intrinsics.
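As a usage sketch (hypothetical kernel, loosely modeled on the new tests; exact builder signatures and thread-binding details may differ):

# Hypothetical usage of the new index intrinsics in a TileLang kernel.
import tilelang
import tilelang.language as T


@tilelang.jit(out_idx=-1)
def build_warp_info_kernel(num_threads: int = 128):

    @T.prim_func
    def main(A: T.Tensor((num_threads,), "int32")):
        with T.Kernel(1, threads=num_threads):
            for i in T.Parallel(num_threads):
                # Each thread records the warp it belongs to (uniform within a warp).
                A[i] = T.get_warp_idx_sync()

    return main


# kernel = build_warp_info_kernel()
# out = kernel()  # expected: out[i] == i // 32 on NVIDIA GPUs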
C++ operator and codegen support:
Registered the new operators in src/op/builtin.cc and declared them in src/op/builtin.h, marking them as pure (no side effects) and requiring zero inputs. [1] [2] Updated src/target/codegen_cuda.cc to emit calls to the new CUDA helpers for these operators, ensuring correct code generation.
Device-side CUDA implementations:
Added device helpers in src/tl_templates/cuda/intrin.h, each forwarding to the corresponding CUTLASS canonical index function, providing accurate thread/warp/group identification on supported hardware. [1] [2]
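And a sketch of the thread-election use case that motivates T.shuffle_elect (hypothetical kernel; in real code the elected lane would typically perform a single-lane operation such as a barrier init or a descriptor prefetch):

# Hypothetical kernel electing one lane per 64-thread group.
import tilelang
import tilelang.language as T


@tilelang.jit(out_idx=-1)
def build_elect_kernel(num_threads: int = 128, thread_extent: int = 64):

    @T.prim_func
    def main(A: T.Tensor((num_threads,), "int32")):
        with T.Kernel(1, threads=num_threads):
            for i in T.Parallel(num_threads):
                A[i] = 0
                # Exactly one lane per `thread_extent`-sized group takes this branch.
                if T.shuffle_elect(thread_extent):
                    A[i] = 1

    return main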
Summary by CodeRabbit
New Features
Bug Fixes / Enhancements
Tests