[Language] Introduce StridedTensor to support non-contiguous torch inputs
#722
Conversation
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
Walkthrough
Adds stride-aware tensors across language and JIT, extends dynamic-symbol mappings to include strides, enforces static stride/contiguity checks in the Cython wrapper, adds CUDA BufferLoad codegen, tightens loop vectorization, reorders an optimization pass, and improves Cython build caching with race-aware locking.
Sequence Diagram(s)
sequenceDiagram
participant FS as CacheDir
participant Proc as Process
participant Lock as FileLock
participant CC as CCompiler
Proc->>FS: compute md5, check cached lib
alt cache miss
Proc->>Lock: acquire lock
Proc->>FS: re-check md5/cached lib
alt compiled by other
Proc-->>Proc: use cached lib
else need compile
Proc->>CC: compile Cython JIT adapter
CC-->>Proc: built lib
Proc->>FS: store lib + metadata
end
Proc->>Lock: release lock
else cache hit
Proc-->>Proc: use cached lib
end
sequenceDiagram
participant Caller as User
participant CW as CythonKernelWrapper
participant T as Tensor
Caller->>CW: forward(ins, outs, symbols)
CW->>CW: _check_buffer_device/_check_buffer_dtype
CW->>CW: _check_static_shape
CW->>CW: _check_static_strides
CW->>CW: _check_static_contiguous
loop build dynamic args
CW->>T: resolve (ref_id, buf_idx, dim)
alt ref_id == 0
T-->>CW: shape[dim]
else ref_id == 1
T-->>CW: stride(dim)
end
CW-->>CW: append arg
end
CW-->>Caller: launch kernel with args
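The loop in the diagram above resolves each dynamic symbol against the tensor it was bound to. A minimal Python sketch of that resolution rule (names such as build_dynamic_args and dynamic_symbolic_map are hypothetical, used only for illustration):

def build_dynamic_args(dynamic_symbolic_map, tensors):
    # dynamic_symbolic_map: {var: (ref_id, buffer_index, dim)}
    args = []
    for var, (ref_id, buf_idx, dim) in dynamic_symbolic_map.items():
        t = tensors[buf_idx]
        # ref_id == 0 -> take the shape extent, ref_id == 1 -> take the stride
        args.append(t.shape[dim] if ref_id == 0 else t.stride(dim))
    return args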
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–90 minutes
Summary of Changes
Hello @LeiWang1999, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the TileLang framework by introducing explicit support for non-contiguous PyTorch tensor inputs through a new StridedTensor type. This enables more flexible memory layouts for inputs, crucial for optimizing operations on views or slices of tensors. The changes span across the language interface, JIT compilation, and backend code generation to correctly handle and vectorize operations on these strided tensors, ensuring efficient execution.
Highlights
- StridedTensor Introduction: A new T.StridedTensor type is added to the TileLang language, allowing users to define tensors with custom strides, thereby supporting non-contiguous memory layouts.
- Dynamic Stride Support in JIT: The JIT compilation adapters (both ctypes and Cython) are updated to correctly extract and pass dynamic stride information from PyTorch tensors to the compiled kernels, enabling runtime flexibility for non-contiguous inputs.
- Enhanced Buffer Handling and Vectorization: The backend code generation for CUDA and loop vectorization logic are improved to properly account for and utilize stride information when accessing buffer elements, ensuring correct and efficient memory operations.
- Refactored JIT Adapter for Comprehensive Buffer Info: The JIT adapter's internal logic for processing static buffer information is refactored to gather not only shapes but also strides and contiguity status, providing a more complete understanding of tensor properties.
- New Test Cases for Strided Tensors: Dedicated test cases are added to validate the functionality of StridedTensor with various non-contiguous scenarios, ensuring the new features work as expected.
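To make the highlights concrete, here is a minimal sketch of how the new type is meant to be used, modeled on the stride-aware copy test added in this PR (the factory name and kernel body are illustrative, not the exact test code):

import torch
import tilelang
import tilelang.language as T

def make_strided_copy(M, N, NN, block_M, block_N, dtype="float16"):

    @T.prim_func
    def copy(
            A: T.StridedTensor((M, N), (NN, 1), dtype),  # row stride NN, last dim contiguous
            B: T.Tensor((M, N), dtype),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_N), dtype)
            T.copy(A[by * block_M, bx * block_N], A_shared)
            T.copy(A_shared, B[by * block_M, bx * block_N])

    return copy

kernel = tilelang.compile(
    make_strided_copy(1024, 256, 512, 128, 128),
    out_idx=[1],
    target="cuda",
    pass_configs={
        tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
        tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
    })
a = torch.randn(1024, 512, device="cuda", dtype=torch.float16)
b = kernel(a[:, :256])                              # non-contiguous view, stride (512, 1)
torch.testing.assert_close(b, a[:, :256], rtol=1e-2, atol=1e-2)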
Code Review
This pull request introduces support for non-contiguous tensors by adding a StridedTensor type. This is a significant feature that touches many parts of the codebase, including the language frontend (tilelang/language/proxy.py), transformation passes (src/transform/loop_vectorize.cc), code generation (src/target/codegen_cuda.cc), and runtime adapters (tilelang/jit/adapter). New tests have been added to validate the functionality. My review focuses on code correctness and maintainability. I've found a few issues, including some duplicated code and incorrect type hints, which should be addressed. Overall, this is a great addition to the project.
      return adapter

-     def _process_dynamic_symbolic(self):
+     def _process_dynamic_symbolic(self) -> Dict[tir.Var, Tuple[int, int]]:
The type hint for _process_dynamic_symbolic is incorrect. The function is documented to return a tuple of (id, buffer_index, dimension) and the implementation returns a tuple of three elements, but the type hint is Dict[tir.Var, Tuple[int, int]]. It should be Dict[tir.Var, Tuple[int, int, int]] to match the implementation and documentation.
-     def _process_dynamic_symbolic(self) -> Dict[tir.Var, Tuple[int, int]]:
+     def _process_dynamic_symbolic(self) -> Dict[tir.Var, Tuple[int, int, int]]:
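For reference, a small illustration of the corrected return shape (the symbol names and example values below are hypothetical):

from typing import Dict, Tuple
from tvm import tir

n = tir.Var("n", "int32")   # dynamic shape symbol
s = tir.Var("s", "int32")   # dynamic stride symbol

# Each dynamic symbol maps to (ref_id, buffer_index, dimension),
# where ref_id 0 means "shape extent" and ref_id 1 means "stride".
dynamic_symbolic_map: Dict[tir.Var, Tuple[int, int, int]] = {
    n: (0, 0, 1),  # n is buffer 0's extent along dim 1
    s: (1, 0, 0),  # s is buffer 0's stride along dim 0
}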
// arith::ModularSet me = arith::Analyzer().modular_set(ramp->base);
// The condition: {k * coeff + base} divisible by the alignment for any k
// if (me->coeff % op->dtype.lanes() == 0 && me->base % op->dtype.lanes()
// == 0) {
//   can_vector_load = true;
// }
skip for now for dynamic strides.
Docstrings generation was requested by @LeiWang1999.
* #722 (comment)
The following files were modified:
* `examples/warp_specialize/example_warp_specialize_flashmla.py`
* `setup.py`
* `src/target/codegen_cuda.cc`
* `src/tl_templates/cuda/copy_sm90.h`
* `src/tl_templates/hip/reduce.h`
* `src/transform/loop_vectorize.cc`
* `testing/python/language/test_tilelang_language_copy.py`
* `tilelang/engine/phase.py`
* `tilelang/jit/adapter/ctypes/adapter.py`
* `tilelang/jit/adapter/cython/adapter.py`
* `tilelang/jit/adapter/wrapper.py`
* `tilelang/language/proxy.py`
* `tilelang/language/tir/entry.py`
Note: Generated docstrings for this pull request at #723
Actionable comments posted: 7
🔭 Outside diff range comments (2)
src/transform/loop_vectorize.cc (1)
162-167: Use Buffer::ElemOffset for dynamic-shape offset computation
The Buffer API already provides a flattened element offset via ElemOffset(indices), which returns a single PrimExpr. You can replace the current OffsetOf(...).back() usage to simplify the code and avoid the extra .back(). The threshold guard on vector_size_ should remain to ensure the vector load size does not exceed the hardware limit.
Please update the following locations:
- src/transform/loop_vectorize.cc (line 165)
- src/transform/loop_vectorize_dynamic.cc (line 188)
- src/transform/atomicadd_vectorize.cc (line 127)
Apply this diff in each file:
- PrimExpr offset = buffer.OffsetOf(indices).back();
+ PrimExpr offset = buffer.ElemOffset(indices);
setup.py (1)
687-743: Fix lockfile misuse: do not unlink while the file descriptor is locked; use a stable lock path
Current flow unlinks the lock file inside the locked section. On POSIX, unlinking while holding a lock does not release the lock but allows another process to create a new inode and acquire a separate lock, defeating mutual exclusion. Also, using a code-hash-specific lock path can allow concurrent compiles of the same source if md5 changes are written before compile completes.
- Use a stable lock file (e.g., in cache_dir) so all processes coordinate on the same lock for this artifact.
- Do not unlink the lock file; just release the lock by closing the FD.
Apply this diff to harden the locking:
- cache_path = cache_dir / f"{code_hash}.so"
- lock_file = cache_path.with_suffix('.lock')
+ cache_path = cache_dir / f"{code_hash}.so"  # kept for potential future use
+ lock_file = cache_dir / "cython_wrapper.lock"
@@
  with open(lock_file, 'w') as lock:
      fcntl.flock(lock.fileno(), fcntl.LOCK_EX)
@@
- finally:
-     if lock_file.exists():
-         lock_file.unlink()
+ finally:
+     # The 'with' context will close the FD and release the lock.
+     # Keep the lock file on disk to coordinate future processes.
+     pass
Optional follow-ups (can be separate PRs):
- Use subprocess.check_call([...]) instead of os.system(...) and guard cc is None for robust error handling.
- Rename md5_path to hash_path (it stores a SHA-256 digest) for clarity.
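A minimal sketch of the locking discipline this comment recommends (function name and call shape are illustrative, not the actual setup.py code):

import fcntl
from pathlib import Path

def with_build_lock(cache_dir: Path, build):
    # One stable lock file per artifact; never unlinked, so every process
    # coordinates on the same inode. Closing the fd releases the lock.
    lock_file = cache_dir / "cython_wrapper.lock"
    with open(lock_file, "w") as lock:
        fcntl.flock(lock.fileno(), fcntl.LOCK_EX)
        try:
            # Re-check the cache here before compiling: another process may
            # have built the library while we were waiting on the lock.
            return build()
        finally:
            fcntl.flock(lock.fileno(), fcntl.LOCK_UN)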
🧹 Nitpick comments (7)
src/tl_templates/hip/reduce.h (1)
25-26: thread_offset is unused in the implementation — either use it or drop it.
The template parameter thread_offset is declared but never applied in the reduction logic below. If AllReduce is intended to operate on subgroups starting at a non-zero offset within a block, accessing red_buf and the partner index should account for thread_offset. Otherwise, consider removing the parameter to avoid confusion.
Suggested adjustments (outside the selected lines) if you intend to support an offset subgroup:
// In the "shared-memory" path:
red_buf[threadIdx.x - thread_offset] = x;
x = Reducer()(x, red_buf[(threadIdx.x - thread_offset) ^ offset]);
// Optionally, add a precondition (compile-time or runtime) to ensure threadIdx.x >= thread_offset
// and that the subgroup is laid out as expected.
Alternatively, if offset subgroups aren't used on HIP, remove thread_offset from the template and its recursion to keep the HIP and CUDA backends intentional rather than accidentally divergent.
Please confirm whether any current HIP call sites rely on a non-zero thread_offset. If not, I can prepare a follow-up patch to remove the parameter for clarity.
src/target/codegen_cuda.cc (2)
1710-1722: Gate vector loads on proven alignment to avoid misaligned vector accesses
Currently, any Ramp(base, 1, lanes) is considered vectorizable without alignment proof. On CUDA, vector types often require alignment; misaligned vector loads can generate suboptimal or even problematic codegen. Re-introduce a conservative alignment check using the analyzer.
- ICHECK(ramp);
- can_vector_load = true;
- // arith::ModularSet me = arith::Analyzer().modular_set(ramp->base);
- // The condition: {k * coeff + base} divisible by the alignment for any k
- // if (me->coeff % op->dtype.lanes() == 0 && me->base % op->dtype.lanes()
- // == 0) {
- //   can_vector_load = true;
- // }
+ ICHECK(ramp);
+ // Enable vector load only if base is aligned to the vector width.
+ // This avoids emitting potentially misaligned vector loads.
+ arith::Analyzer analyzer;
+ PrimExpr zero = make_const(ramp->base.dtype(), 0);
+ PrimExpr mod = FloorMod(ramp->base, make_const(ramp->base.dtype(), lanes));
+ if (analyzer.CanProve(mod == zero)) {
+   can_vector_load = true;
+ }
1705-1706: Nit: fix typo in comment
"delcare type" → "declare type"
- // delcare type.
+ // declare type.
src/transform/loop_vectorize.cc (1)
245-255: Good alignment preconditions; consider simplification to reduce false negatives.
The new checks for extent divisibility and base-offset divisibility are correct and will prevent misaligned vector loads. To improve robustness, consider simplifying the substituted expressions prior to FloorMod to avoid rejecting vectorization due to algebraic form.
Apply this minimal tweak:
- if (!analyzer->CanProveEqual(
-         FloorMod(Substitute(expr, {{var, 0}}), target_vectorized_size), 0)) {
+ if (!analyzer->CanProveEqual(
+         FloorMod(analyzer->Simplify(Substitute(expr, {{var, 0}})),
+                  target_vectorized_size),
+         0)) {
    return false;
  }
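For intuition, the two preconditions described above can be stated over concrete integers as a plain divisibility check (a Python sketch, not the pass's C++ code):

def can_vectorize(extent: int, base_offset: int, vec: int) -> bool:
    # Both the loop extent and the offset at the loop start must be
    # multiples of the vector width, otherwise a vector load could
    # straddle an unaligned boundary.
    return extent % vec == 0 and base_offset % vec == 0

print(can_vectorize(extent=128, base_offset=64, vec=8))   # True
print(can_vectorize(extent=128, base_offset=60, vec=8))   # False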
tilelang/language/tir/entry.py (1)
11-11: Document the new check_well_formed parameter in the docstring
The signature now annotates check_well_formed: bool = False; the docstring should reflect it to keep the public API self-documenting.
Apply this diff to extend the docstring:
  private : bool, optional
      Whether the function should be treated as private. A private function has no global symbol attribute;
      if the function is not private, it will have a global symbol matching the function name.
+ check_well_formed : bool, optional
+     If True, perform well-formedness checks during parsing. Defaults to False.
+     Passed through to the underlying parser via `parse(..., check_well_formed=...)`.
tilelang/jit/adapter/wrapper.py (1)
237-240: Add clarification comment for device_func usage.
The comment "QA(@LEI): Why don't use device_mod.params?" indicates potential confusion. Consider adding clarification about why prim_func is used instead of device_func for parameter processing.
Consider adding a more detailed explanation:
- # QA(@lei): Why don't use device_mod.params?
- # device func lack buffer map (to convert buffer handle to buffer)
+ # Note: We use prim_func.params instead of device_func.params because
+ # device_func lacks buffer_map which is required to convert buffer handles
+ # to buffers with shape and stride information.
tilelang/language/proxy.py (1)
164-173: Consider supporting non-contiguous last dimension for advanced use cases.
The TODO comment raises a valid point. While requiring the last dimension to be contiguous (stride=1) is reasonable for most cases, supporting non-contiguous layouts could enable more advanced memory access patterns.
Would you like me to help implement support for non-contiguous last dimensions or open an issue to track this enhancement?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (16)
- examples/warp_specialize/example_warp_specialize_flashmla.py (3 hunks)
- setup.py (4 hunks)
- src/target/codegen_cuda.cc (1 hunks)
- src/target/codegen_cuda.h (1 hunks)
- src/tl_templates/cuda/copy_sm90.h (1 hunks)
- src/tl_templates/hip/reduce.h (1 hunks)
- src/transform/loop_vectorize.cc (3 hunks)
- testing/python/language/test_tilelang_language_copy.py (2 hunks)
- tilelang/engine/phase.py (1 hunks)
- tilelang/jit/adapter/ctypes/adapter.py (2 hunks)
- tilelang/jit/adapter/cython/adapter.py (14 hunks)
- tilelang/jit/adapter/cython/cython_wrapper.pyx (6 hunks)
- tilelang/jit/adapter/wrapper.py (3 hunks)
- tilelang/language/__init__.py (1 hunks)
- tilelang/language/proxy.py (8 hunks)
- tilelang/language/tir/entry.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (11)
tilelang/language/__init__.py (1)
- tilelang/language/proxy.py (1): StridedTensor (246-247)
src/tl_templates/hip/reduce.h (1)
- src/tl_templates/cuda/reduce.h (2): tl (5-48), T (7-10)
src/target/codegen_cuda.h (1)
- src/target/codegen_cuda.cc (18): VisitExpr_ (725-802), VisitExpr_ (725-725), VisitExpr_ (928-1559), VisitExpr_ (928-928), VisitExpr_ (1676-1690), VisitExpr_ (1676-1676), VisitExpr_ (1692-1760), VisitExpr_ (1692-1693), VisitExpr_ (1762-1877), VisitExpr_ (1762-1763), VisitExpr_ (1928-1931), VisitExpr_ (1928-1929), op (129-144), op (129-129), op (1291-1293), op (1291-1291), op (1294-1296), op (1294-1294)
tilelang/engine/phase.py (2)
- src/transform/config_index_bitwidth.cc (2): ConfigIndexBitwidth (159-177), ConfigIndexBitwidth (159-159)
- tilelang/transform/__init__.py (1): ConfigIndexBitwidth (307-316)
src/tl_templates/cuda/copy_sm90.h (1)
- src/tl_templates/cuda/common.h (4): uint32_t (107-109), int (86-89), int (112-119), tl (207-253)
src/transform/loop_vectorize.cc (1)
- src/transform/loop_vectorize_dynamic.cc (2): indices (139-192), indices (139-139)
examples/warp_specialize/example_warp_specialize_flashmla.py (2)
- examples/gemm/example_gemm.py (1): gemm (9-25)
- src/op/gemm.cc (1): Lower (304-339)
tilelang/jit/adapter/ctypes/adapter.py (2)
- tilelang/jit/adapter/cython/adapter.py (2): _process_dynamic_symbolic (348-373), prim_func (492-494)
- tilelang/jit/adapter/nvrtc/adapter.py (2): _process_dynamic_symbolic (153-168), prim_func (244-246)
tilelang/jit/adapter/cython/adapter.py (3)
- tilelang/engine/param.py (1): KernelParam (12-103)
- tilelang/jit/adapter/base.py (1): BaseKernelAdapter (8-55)
- tilelang/jit/kernel.py (1): params (439-440)
testing/python/language/test_tilelang_language_copy.py (7)
- tilelang/transform/pass_config.py (1): PassConfigKey (6-81)
- tilelang/jit/adapter/ctypes/adapter.py (1): prim_func (266-268)
- tilelang/jit/adapter/cython/adapter.py (1): prim_func (492-494)
- tilelang/jit/adapter/wrapper.py (2): prim_func (557-567), prim_func (995-1005)
- tilelang/language/proxy.py (2): StridedTensor (246-247), Tensor (243-244)
- tilelang/jit/__init__.py (1): compile (32-81)
- tilelang/language/__init__.py (1): symbolic (74-75)
tilelang/jit/adapter/wrapper.py (5)
- tilelang/jit/adapter/ctypes/adapter.py (1): prim_func (266-268)
- tilelang/jit/adapter/cython/adapter.py (1): prim_func (492-494)
- tilelang/jit/adapter/nvrtc/adapter.py (1): prim_func (244-246)
- tilelang/jit/kernel.py (1): params (439-440)
- src/transform/lower_tile_op.cc (2): buffer (230-238), buffer (230-230)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: build-test-nvidia
- GitHub Check: bot-task
🔇 Additional comments (28)
examples/warp_specialize/example_warp_specialize_flashmla.py (5)
148-148: Reformat only; semantics preserved for right-hand Q·K^T GEMM. Keeping transpose_B=True and wg_wait=-1 matches surrounding calls and maintains accumulator behavior (no clear_accum here by design). LGTM.
151-151: Reformat only; PE Q·K^T GEMM remains correct. Dimensions align with transpose_B=True to produce [block_H, block_N]; behavior unchanged. LGTM.
254-254: Reformat only; semantics preserved for right-hand Q·K^T GEMM in second loop. Consistent with the first loop: accumulate into acc_s_1 without clear_accum and use wg_wait=-1. LGTM.
257-257: Reformat only; PE Q·K^T GEMM in second loop remains correct. transpose_B=True is still present; shapes and accumulation behavior unchanged. LGTM.
291-291: Reformat only; PV GEMM remains synchronous-as-queued with wg_wait=-1. No transpose needed here; shapes (acc_s_1_cast [H,N]) x (KV_shared_1_r [N,D]) -> acc_o_r [H,D] remain valid. LGTM.
src/tl_templates/hip/reduce.h (1)
25-26: Formatting-only change looks good. Splitting the template declaration onto two lines improves readability without functional impact. No objections.
src/target/codegen_cuda.h (1)
53-53: Header declaration for BufferLoad visitor looks correct and aligned with the cc implementation. Matches the new implementation in codegen_cuda.cc and keeps the visitor surface consistent.
src/target/codegen_cuda.cc (1)
1692-1760: Solid BufferLoadNode support with correct vectorization fallback
- Correctly rejects non-flat and predicated loads.
- Properly handles scalar, vectorizable (ramp) and non-vector paths; volatile and storage scopes are respected.
- The per-lane fallback is robust.
Minor refinements below on alignment gating and a small typo.
src/transform/loop_vectorize.cc (1)
266-269: LGTM: necessary simplify before Ramp inspection. Running analyzer->Simplify before examining the ramp is a good addition; it materially helps the pattern matcher in realistic IR.
tilelang/engine/phase.py (1)
128-131: Good move: run ConfigIndexBitwidth after FlattenBuffer. Reordering ensures indices are flattened before bitwidth config and legalization, which matches the pass intent and should improve consistency. Comment explains the dependency well.
tilelang/language/tir/entry.py (1)
1-1: Import of inspect is fine and safe. No issues with bringing inspect to the top; it's used immediately for outer_stack.
setup.py (2)
1-3: Top-level imports reorg looks good. Promoting fcntl/functools/hashlib to the top keeps imports centralized and avoids duplication.
695-701: Cython JIT adapter logs and re-check are clear. The improved messages and the in-lock re-check path make the flow easier to reason about and avoid redundant compilation. Exception message is clearer as well.
Also applies to: 718-719, 739-739
tilelang/language/__init__.py (1)
20-21: Expose StridedTensor in public API. Re-exporting StridedTensor is aligned with the new stride-aware features and matches test usage. Looks good.
testing/python/language/test_tilelang_language_copy.py (3)
31-33: Enum-based pass_config keys are correct. Switching to PassConfigKey (a str-backed Enum) modernizes the API while preserving compatibility with PassContext.
45-58: Stride-aware copy kernel definition looks correct. StridedTensor((M, N), (NN, 1), dtype) paired with a slice view a[:, :N] should yield the expected strides in PyTorch and exercise the strided path well.
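A quick illustration of why the slice view matches that declaration (plain PyTorch; sizes chosen arbitrarily):

import torch

M, N, NN = 64, 32, 48
a = torch.randn(M, NN)
view = a[:, :N]                 # non-contiguous view of the wider allocation
print(view.shape)               # torch.Size([64, 32])
print(view.stride())            # (48, 1): row stride is NN, last dim contiguous
print(view.is_contiguous())     # False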
85-86: Tests cover both concrete and symbolic NN. Good coverage to validate both static and dynamic stride cases.
tilelang/jit/adapter/ctypes/adapter.py (2)
170-183: Logic is correct for handling dynamic shape and stride variables. The implementation properly:
- First processes shape variables with id=0
- Then processes stride variables with id=1
- Excludes variables that are parameters or already processed
This ensures proper ordering and avoids duplicates.
241-245: LGTM! Dynamic symbolic handling correctly updated for strides. The code properly unpacks the 3-tuple format and differentiates between shape (ref_id=0) and stride (ref_id=1) references when building arguments.
tilelang/jit/adapter/wrapper.py (2)
487-509: LGTM! Stride support properly added to dynamic symbolic collection. The implementation correctly:
- Uses unique_push_back to avoid duplicates
- Processes shapes first, then strides (as documented)
- Maintains the order needed for runtime resolution
569-580: Well-structured device function property addition. The new device_func property properly handles multiple resolution strategies:
- Single global function
- Main entry point
- Function with tir.is_global_func attribute
This provides flexibility while maintaining clear precedence order.
tilelang/language/proxy.py (2)
145-151: Well-implemented default stride calculation. The _construct_strides method correctly computes contiguous strides in C-order (row-major) format. The logic properly accumulates strides from the last dimension backwards.
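A standalone sketch of that accumulation under the usual C-order definition (plain Python, not the proxy.py source):

def construct_strides(shape):
    # C-order (row-major) contiguous strides: accumulate from the last
    # dimension backwards, then reverse into dimension order.
    strides = []
    stride = 1
    for extent in reversed(shape):
        strides.append(stride)
        stride *= extent
    return tuple(reversed(strides))

print(construct_strides((4, 8, 16)))   # (128, 16, 1)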
290-294: LGTM! Clean API for creating tensors from pointers with stride support. The make_tensor function properly forwards the strides parameter to Tensor.from_ptr, maintaining consistency with the new stride-aware tensor interface.
405-437: Comprehensive static buffer information processing. The _process_static_buffer_infos method effectively:
- Extracts static shapes and strides
- Determines buffer contiguity
- Returns well-structured data for validation
The contiguity check logic is particularly well-implemented, correctly verifying that each stride equals the product of subsequent dimensions.
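Stated over concrete integers, the contiguity condition it verifies looks like this (a plain Python sketch, not the adapter code):

def is_contiguous(shape, strides):
    # C-order contiguity: each stride must equal the product of all
    # later extents, i.e. stride[i] == prod(shape[i+1:]).
    expected = 1
    for extent, stride in zip(reversed(shape), reversed(strides)):
        if stride != expected:
            return False
        expected *= extent
    return True

print(is_contiguous((64, 32), (32, 1)))   # True
print(is_contiguous((64, 32), (48, 1)))   # False: row stride padded to 48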
84-98: Well-designed cache management with file locking. The implementation properly handles concurrent access:
- Uses file locks to prevent race conditions
- Validates cached library size to detect incomplete files
- Cleans up incomplete files automatically
tilelang/jit/adapter/cython/cython_wrapper.pyx (3)
117-128: Good addition of stride validation. The _check_static_strides method properly validates that tensor strides match the expected static values, complementing the existing shape validation.
130-135: Appropriate contiguity enforcement for specific buffers. The _check_static_contiguous method correctly validates contiguity only for buffers marked as requiring it, rather than enforcing it globally.
233-237: LGTM! Correct handling of shape vs stride references. The code properly differentiates between ref_id values:
- ref_id == 0: Uses shape dimension
- ref_id != 0: Uses stride dimension
This aligns with the encoding scheme defined in the adapters.
  // Generate strides if not existed
  auto strides = buffer->strides;
  if (buffer->strides.size() == 0) {
    PrimExpr stride = 1;
    for (int i = indices.size() - 1; i >= 0; --i) {
      strides.push_back(stride);
      stride = stride * buffer->shape[i];
    }
    strides = Array<PrimExpr>{strides.rbegin(), strides.rend()};
  }

  // Generate and check element offset expression
  ICHECK(indices.size() == strides.size()) << "Invalid indices and strides";
  PrimExpr elem_offset = 0;
- PrimExpr stride = 1;
- for (int i = indices.size() - 1; i >= 0; --i) {
-   elem_offset = elem_offset + indices[i] * stride;
-   stride = stride * buffer->shape[i];
+ for (int i = 0; i < indices.size(); ++i) {
+   elem_offset += indices[i] * strides[i];
  }
🛠️ Refactor suggestion
Fix stride reversal and avoid constructing tvm::Array from iterators (compile-time bug).
Array<PrimExpr>{strides.rbegin(), strides.rend()} constructs an initializer_list from two iterator objects, not a reversed Array. This won’t compile and will break vectorization on strided buffers. Also, you don’t need to materialize/reverse strides when explicit strides are absent; you can compute elem_offset directly.
Refactor to compute elem_offset robustly for both explicit and implicit strides, and drop the invalid reverse:
Apply this diff:
- // Generate strides if not existed
- auto strides = buffer->strides;
- if (buffer->strides.size() == 0) {
- PrimExpr stride = 1;
- for (int i = indices.size() - 1; i >= 0; --i) {
- strides.push_back(stride);
- stride = stride * buffer->shape[i];
- }
- strides = Array<PrimExpr>{strides.rbegin(), strides.rend()};
- }
-
- // Generate and check element offset expression
- ICHECK(indices.size() == strides.size()) << "Invalid indices and strides";
- PrimExpr elem_offset = 0;
- for (int i = 0; i < indices.size(); ++i) {
- elem_offset += indices[i] * strides[i];
- }
+ // Compute element offset (supports both explicit and implicit strides)
+ PrimExpr elem_offset = 0;
+ if (buffer->strides.size() == 0) {
+ PrimExpr stride = 1;
+ for (int i = indices.size() - 1; i >= 0; --i) {
+ elem_offset += indices[i] * stride;
+ stride = stride * buffer->shape[i];
+ }
+ } else {
+ ICHECK(indices.size() == buffer->strides.size()) << "Invalid indices and strides";
+ for (int i = 0; i < indices.size(); ++i) {
+ elem_offset += indices[i] * buffer->strides[i];
+ }
+ }
🤖 Prompt for AI Agents
In src/transform/loop_vectorize.cc around lines 140 to 156, the code attempts to
build reversed strides via Array<PrimExpr>{strides.rbegin(), strides.rend()}
which is invalid and unnecessary; instead handle two cases when computing
elem_offset: if buffer->strides is non-empty use those strides directly,
otherwise compute the running stride on the fly from the buffer->shape (start
with stride=1 and process dimensions from last to first) and accumulate
indices[i]*current_stride into elem_offset without creating a temporary reversed
array. Keep the ICHECK that indices and strides align for the explicit-stride
path, and ensure the implicit-stride path computes elem_offset robustly using
shape-derived strides.
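For readers who want the logic of the suggested fix without the C++ details, the same offset computation can be sketched in Python (illustrative only; the names and the Python form are not part of the patch):

def elem_offset(indices, shape, strides=None):
    # Explicit-stride path: offset is the dot product of indices and strides.
    if strides:
        assert len(indices) == len(strides), "Invalid indices and strides"
        return sum(i * s for i, s in zip(indices, strides))
    # Implicit (row-major) path: derive strides from the shape on the fly.
    offset, stride = 0, 1
    for i, extent in zip(reversed(indices), reversed(shape)):
        offset += i * stride
        stride *= extent
    return offset

print(elem_offset((2, 3), (64, 32)))            # 2*32 + 3 = 67
print(elem_offset((2, 3), (64, 32), (48, 1)))   # 2*48 + 3 = 99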
    if isinstance(NN, int):
        assert NN > N, "NN must be greater than N"
    program = tilelang_copy_with_stride(M, N, NN, block_M, block_N, dtype)
    kernel = tilelang.compile(
        program,
        out_idx=[1],
        target="cuda",
        pass_configs={
            tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
            tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
        })
    if isinstance(NN, T.Var):
        NN = N * 2
    a = torch.randn(M, NN, device="cuda", dtype=getattr(torch, dtype))
    b = kernel(a[:, :N])
    torch.testing.assert_close(b, a[:, :N], rtol=1e-2, atol=1e-2)
💡 Verification agent
🧩 Analysis chain
Fix TypeError in isinstance check and allow NN == N
isinstance(NN, T.Var) will raise TypeError because T.Var is a function, not a type. Also, the assertion can safely allow NN == N.
Apply this diff:
- if isinstance(NN, int):
- assert NN > N, "NN must be greater than N"
+ if isinstance(NN, int):
+ assert NN >= N, "NN must be greater than or equal to N"
@@
- if isinstance(NN, T.Var):
- NN = N * 2
+ # If NN is symbolic (e.g., tvm.tir.Var), substitute a concrete allocation size.
+ if not isinstance(NN, int):
+ NN = N * 2
Optional: If you prefer an explicit type check, import tvm.tir and use isinstance(NN, tvm.tir.PrimExpr).
Run this quick scan to ensure similar mistaken isinstance(..., T.Var) patterns don’t exist elsewhere:
🏁 Script executed:
#!/bin/bash
rg -nP -C2 'isinstance\s*\(\s*[^,]+,\s*T\.Var\s*\)'
Length of output: 2709
🏁 Script executed:
#!/bin/bash
# Display import aliases for T in test_tilelang_language_copy.py
sed -n '1,50p' testing/python/language/test_tilelang_language_copy.py | sed -n '1,50p'
Length of output: 1713
🏁 Script executed:
#!/bin/bash
# Display import aliases for T in the relevant test files
sed -n '1,20p' testing/python/jit/test_tilelang_jit_gemm_ctypes.py
sed -n '1,20p' testing/python/jit/test_tilelang_jit_gemm_cython.py
Length of output: 753
Fix symbolic-dimension checks and allow NN == N
The isinstance(..., T.Var) guard will fail at runtime because T.Var is a constructor function, not a type. Also, an integer stride equal to N is valid. Update all three tests to:
- Use assert ... >= N instead of > N.
- Replace isinstance(x, T.Var) with if not isinstance(x, int) to catch symbolic cases.
Affected files:
- testing/python/language/test_tilelang_language_copy.py (lines 66–82)
- testing/python/jit/test_tilelang_jit_gemm_ctypes.py (lines 370–375)
- testing/python/jit/test_tilelang_jit_gemm_cython.py (lines 379–384 and 448–453)
Apply these diffs:
--- a/testing/python/language/test_tilelang_language_copy.py
+++ b/testing/python/language/test_tilelang_language_copy.py
@@ -66,7 +66,7 @@ def test_tilelang_copy_with_stride():
- if isinstance(NN, int):
- assert NN > N, "NN must be greater than N"
+ if isinstance(NN, int):
+ assert NN >= N, "NN must be greater than or equal to N"
@@ -75,7 +75,8 @@ def test_tilelang_copy_with_stride():
- if isinstance(NN, T.Var):
- NN = N * 2
+ # For any non-int (e.g., tvm.tir.Var), pick a concrete size
+ if not isinstance(NN, int):
+ NN = N * 2
--- a/testing/python/jit/test_tilelang_jit_gemm_ctypes.py
+++ b/testing/python/jit/test_tilelang_jit_gemm_ctypes.py
@@ -368,12 +368,12 @@ def test_matmul_ctypes():
- if isinstance(M, T.Var):
- M = 1024
+ if not isinstance(M, int):
+ M = 1024
- if isinstance(N, T.Var):
- N = 1024
+ if not isinstance(N, int):
+ N = 1024
- if isinstance(K, T.Var):
- K = 768
+ if not isinstance(K, int):
+ K = 768
--- a/testing/python/jit/test_tilelang_jit_gemm_cython.py
+++ b/testing/python/jit/test_tilelang_jit_gemm_cython.py
@@ -377,12 +377,12 @@ def test_matmul_cython():
- if isinstance(M, T.Var):
- M = 1024
+ if not isinstance(M, int):
+ M = 1024
- if isinstance(N, T.Var):
- N = 1024
+ if not isinstance(N, int):
+ N = 1024
- if isinstance(K, T.Var):
- K = 768
+ if not isinstance(K, int):
+ K = 768
@@ -446,12 +446,12 @@ def test_matmul_cython_outidx_minus1():
- if isinstance(M, T.Var):
- M = 1024
+ if not isinstance(M, int):
+ M = 1024
- if isinstance(N, T.Var):
- N = 1024
+ if not isinstance(N, int):
+ N = 1024
- if isinstance(K, T.Var):
- K = 768
+ if not isinstance(K, int):
+ K = 768
🤖 Prompt for AI Agents
testing/python/language/test_tilelang_language_copy.py around lines 66-82: the
code currently uses isinstance(NN, T.Var) (which is invalid because T.Var is a
constructor) and requires NN > N even though NN == N is valid; change the
integer check to assert NN >= N and replace the symbolic detection with if not
isinstance(NN, int) to catch non-integer (symbolic) cases; apply the same two
fixes to the other affected tests:
testing/python/jit/test_tilelang_jit_gemm_ctypes.py (lines 370-375) and
testing/python/jit/test_tilelang_jit_gemm_cython.py (lines 379-384 and 448-453).
# TODO(chenggang): remove this check or rewrite by ourselves?
if isinstance(tensor, torch.Tensor) and tensor._base is not None and not tensor.is_contiguous():
    base_tensor = tensor._base.as_strided(tensor._base.shape, tensor.stride())
    if torch._debug_has_internal_overlap(base_tensor):
        raise ValueError(f"Cannot use an overlapping tensor"
                         f"(shape={tensor.shape}, strides={tensor.stride()}, "
                         f"overlap={torch._debug_has_internal_overlap(base_tensor)}) as the kernel input")
💡 Verification agent
🧩 Analysis chain
Verify the overlap detection logic for safety.
The overlap detection using torch._debug_has_internal_overlap is important for safety, but using private PyTorch APIs (prefixed with _) may be fragile across versions.
Let me verify if this API is stable across PyTorch versions:
🌐 Web query:
Is torch._debug_has_internal_overlap a stable API in PyTorch for detecting tensor memory overlaps?
💡 Result:
torch._debug_has_internal_overlap is not a stable, public API in PyTorch; it is an internal debugging utility and should not be relied upon in production code for detecting tensor memory overlaps.
- In PyTorch, APIs with a leading underscore (_)—such as torch._debug_has_internal_overlap—are explicitly marked as private/internal and not guaranteed to remain stable across versions. These functions are subject to change or removal without notice[5].
- The [Performance Tuning Guide][5] notes that many PyTorch APIs are intended for debugging and should be disabled for regular training runs, further emphasizing these should not be considered stable or production-safe[5].
Essential Context
- If you need to detect overlapping memory in tensors, you should use the supported public API:
- torch.Tensor.has_storage_overlap() is a public, stable method for checking memory overlap between tensors as of recent PyTorch versions[10].
- Relying on any function in the torch._debug_* namespace is inherently unstable; these are neither documented for end-user use nor supported across releases.
Additional Notes
- Using internal APIs in production code can lead to silent breakage, bugs, or incompatibility when updating PyTorch.
- For production code, always prefer documented, public APIs that are versioned and maintained[10].
In summary: Do not use torch._debug_has_internal_overlap for detecting overlaps in production code—use torch.Tensor.has_storage_overlap() or other supported public methods instead[10][5].
Citations:
- 1: torch 2.5 slower than 2.4.1 ? pytorch/pytorch#138386
- 2: https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html
- 3: https://docs.pytorch.org/docs/stable/notes/cuda.html
- 4: https://docs.pytorch.org/docs/stable/torch.compiler_faq.html
- 5: https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html
- 6: https://discuss.pytorch.org/t/how-to-debug-causes-of-gpu-memory-leaks/6741
- 7: https://pytorch.org/docs/stable/distributed.html
- 8: https://www.digitalocean.com/community/tutorials/pytorch-memory-multi-gpu-debugging
- 9: https://docs.pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html?highlight=dataloader
- 10: https://pytorch.org/docs/stable/tensors.html
Replace private overlap detection with a supported public API
Using torch._debug_has_internal_overlap relies on an internal, unstable API. Swap in one of PyTorch’s public tensor‐overlap checks—e.g., Tensor.is_non_overlapping_and_dense() for internal overlap or Tensor.has_storage_overlap(other) for inter‐tensor overlap—to ensure future compatibility.
• In tilelang/jit/adapter/cython/cython_wrapper.pyx (lines 188–194):
base_tensor = tensor._base.as_strided(tensor._base.shape, tensor.stride())
- if torch._debug_has_internal_overlap(base_tensor):
+ if not base_tensor.is_non_overlapping_and_dense():
raise ValueError(
f"Cannot use an overlapping tensor"
f"(shape={tensor.shape}, strides={tensor.stride()}, "
- f"overlap={torch._debug_has_internal_overlap(base_tensor)}) as the kernel input")
+ f"overlap detected) as the kernel input")This replaces the private check with the stable is_non_overlapping_and_dense() method.
🤖 Prompt for AI Agents
In tilelang/jit/adapter/cython/cython_wrapper.pyx around lines 188–194, the code
uses the private API torch._debug_has_internal_overlap to detect internal
storage overlap; replace that with the public Tensor method
is_non_overlapping_and_dense(). Concretely, after constructing base_tensor, use
a public check like not base_tensor.is_non_overlapping_and_dense() (or
Tensor.has_storage_overlap when checking against another tensor) instead of
torch._debug_has_internal_overlap, and update the raised ValueError message to
report the overlap using that result (keep shape/strides info).
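If relying on any undocumented Tensor method is a concern, the internal-overlap condition can also be approximated by hand from shape and strides. A conservative sketch (hypothetical helper, not part of the PR or of PyTorch's API):

import torch

def may_have_internal_overlap(t: torch.Tensor) -> bool:
    # Conservative: a tensor may alias itself if some dim with size > 1 has
    # stride 0, or if a stride is too small to clear the span of the dims
    # with smaller strides (checked after sorting by stride).
    dims = [(s, z) for s, z in zip(t.stride(), t.shape) if z > 1]
    if any(s == 0 for s, _ in dims):
        return True
    dims.sort()
    min_needed = 1
    for stride, size in dims:
        if stride < min_needed:
            return True
        min_needed = stride * size
    return False

a = torch.randn(4, 8)
print(may_have_internal_overlap(a[:, :5]))               # False: plain slice
print(may_have_internal_overlap(a[:, :1].expand(4, 8)))  # True: stride-0 broadcast view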
def __getitem__(self, keys) -> tir.Buffer:
    if not isinstance(keys, tuple):
        return self(keys)
    if len(keys) >= 2 and not isinstance(keys[1], str):
        return self(keys)
    assert isinstance(keys, tuple)
    # Single argument (the shape)
    if all([type(s) not in (tuple, str, list) for s in keys]):
        keys = (keys,)
    return self(*keys)
💡 Verification agent
🧩 Analysis chain
Verify the tuple detection logic for shape handling.
The condition all([type(s) not in (tuple, str, list) for s in keys]) is checking types rather than using isinstance(). This could miss subclasses and may not handle all edge cases correctly.
Let me verify if this type checking approach is consistent with the rest of the codebase:
🏁 Script executed:
#!/bin/bash
# Check how type checking is done elsewhere in the codebase
rg -n "type\(.*\) (not )?in" --type pyLength of output: 138
Update tuple detection to use isinstance for correct subclass handling
The current check in tilelang/language/proxy.py:
if all([type(s) not in (tuple, str, list) for s in keys]):
    keys = (keys,)
only matches exact tuple, str, or list, and will miss any subclasses. To fix this, replace it with an isinstance-based check:
--- a/tilelang/language/proxy.py
+++ b/tilelang/language/proxy.py
@@ 113,118c113,118
- def __getitem__(self, keys) -> tir.Buffer:
- assert isinstance(keys, tuple)
- # Single argument (the shape)
- if all([type(s) not in (tuple, str, list) for s in keys]):
- keys = (keys,)
- return self(*keys)
+ def __getitem__(self, keys) -> tir.Buffer:
+ assert isinstance(keys, tuple)
+ # Single argument (the shape) – use isinstance to catch subclasses
+ if all(not isinstance(s, (tuple, str, list)) for s in keys):
+ keys = (keys,)
+ return self(*keys)
• This change ensures any subclass of tuple, str, or list will be properly detected.
• Removes the unnecessary list allocation in the comprehension.
🤖 Prompt for AI Agents
In tilelang/language/proxy.py around lines 113 to 118, the tuple detection uses
type(...) equality which misses subclasses and allocates an unnecessary list;
replace the check with an isinstance-based generator expression such as using
all(not isinstance(s, (tuple, str, list)) for s in keys) so subclasses of
tuple/str/list are detected and avoid creating a temporary list, then keep the
existing keys = (keys,) behavior when the condition is true.
Actionable comments posted: 1
🧹 Nitpick comments (4)
examples/analyze/example_gemm_analyze.py (1)
52-52: Invert mapping looks right; make HIP check robust and avoid AttributeError on older torch
The inversion to prefer CUDA when HIP is absent aligns with the rest of the PR. To be defensive across torch versions (some builds may not have torch.version.hip), use getattr to avoid an AttributeError.
Apply:
- cuda_device = CUDA("cuda") if torch.version.hip is None else CDNA("hip")
+ cuda_device = CUDA("cuda") if getattr(torch.version, "hip", None) is None else CDNA("hip")
Optional: consider centralizing this selection (used in multiple files) into a small helper (e.g., select_arch()) to avoid duplication and keep policy consistent.
benchmark/matmul/benchmark_matmul_intrinsic.py (1)
190-190: Consistent inversion; prefer getattr() for HIP detection to avoid AttributeError
This matches the PR's inversion pattern. Minor hardening: use getattr(torch.version, "hip", None) so CPU-only or unusual torch builds don't raise.
Apply:
- arch = CUDA("cuda") if torch.version.hip is None else CDNA("hip")
+ arch = CUDA("cuda") if getattr(torch.version, "hip", None) is None else CDNA("hip")
Also consider a shared helper for backend selection to keep all examples/benchmarks in sync.
examples/analyze/example_conv_analyze.py (1)
99-99: Backend switch LGTM; guard HIP detection with getattr for safety
The CUDA/HIP inversion is consistent. Use getattr to prevent a potential AttributeError on older torch.
Apply:
- cuda_device = CUDA("cuda") if torch.version.hip is None else CDNA("hip")
+ cuda_device = CUDA("cuda") if getattr(torch.version, "hip", None) is None else CDNA("hip")
If you anticipate CPU-only environments running this script, consider a friendlier error if neither CUDA nor HIP devices are available.
examples/gemm/example_gemm_autotune.py (1)
19-19: Arch selection flip is fine; use getattr() to avoid AttributeError and consider dedup via helper
The change is consistent with related files. Minor safety improvement:
- arch = CUDA("cuda") if torch.version.hip is None else CDNA("hip")
+ arch = CUDA("cuda") if getattr(torch.version, "hip", None) is None else CDNA("hip")
Repeated logic across examples/benchmarks suggests adding a small utility (e.g., select_arch()) to return a TileDevice based on availability/detection. That keeps behavior uniform and easier to evolve; see the sketch after this comment.
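A minimal sketch of such a helper (the function name, its placement, and the direct CUDA/CDNA import paths are assumptions for illustration):

import torch
from tilelang.carver.arch.cuda import CUDA
from tilelang.carver.arch.cdna import CDNA

def select_arch():
    # CUDA unless this is a ROCm/HIP build; torch.version.hip is only set on
    # ROCm wheels, and getattr guards builds where the attribute is absent.
    is_hip = getattr(torch.version, "hip", None) is not None
    return CDNA("hip") if is_hip else CUDA("cuda")

arch = select_arch()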
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- benchmark/matmul/benchmark_matmul.py (1 hunks)
- benchmark/matmul/benchmark_matmul_intrinsic.py (1 hunks)
- examples/analyze/example_conv_analyze.py (1 hunks)
- examples/analyze/example_gemm_analyze.py (1 hunks)
- examples/gemm/example_gemm_autotune.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
examples/gemm/example_gemm_autotune.py (3)
- tilelang/carver/template/base.py (1): arch (152-159)
- tilelang/carver/arch/cuda.py (1): CUDA (106-147)
- tilelang/carver/arch/cdna.py (1): CDNA (11-32)
benchmark/matmul/benchmark_matmul.py (3)
- tilelang/carver/template/base.py (1): arch (152-159)
- tilelang/carver/arch/cuda.py (1): CUDA (106-147)
- tilelang/carver/arch/cdna.py (1): CDNA (11-32)
benchmark/matmul/benchmark_matmul_intrinsic.py (3)
- tilelang/carver/template/base.py (1): arch (152-159)
- tilelang/carver/arch/cuda.py (1): CUDA (106-147)
- tilelang/carver/arch/cdna.py (1): CDNA (11-32)
examples/analyze/example_gemm_analyze.py (2)
- tilelang/carver/arch/cuda.py (1): CUDA (106-147)
- tilelang/carver/arch/cdna.py (1): CDNA (11-32)
examples/analyze/example_conv_analyze.py (2)
- tilelang/carver/arch/cuda.py (1): CUDA (106-147)
- tilelang/carver/arch/cdna.py (1): CDNA (11-32)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: build-test-nvidia
- GitHub Check: bot-task
  import torch

- arch = CDNA("cuda") if torch.version.hip is None else CUDA("hip")
+ arch = CUDA("cuda") if torch.version.hip is None else CDNA("hip")
💡 Verification agent
🧩 Analysis chain
Inverted arch selection is consistent; harden HIP detection and verify repo-wide consistency
The inversion matches other files in this PR. To avoid AttributeError on some torch builds, use getattr:
- arch = CUDA("cuda") if torch.version.hip is None else CDNA("hip")
+ arch = CUDA("cuda") if getattr(torch.version, "hip", None) is None else CDNA("hip")
Additionally, to ensure there are no missed spots still using the old mapping, you can scan the repo:
If any legacy patterns appear in contexts that should be updated, align them with this PR’s policy. Centralizing selection via a helper will also simplify future changes.
🏁 Script executed:
#!/bin/bash
# Find all inverted/new mapping occurrences
echo "Occurrences of new mapping (CUDA on non-HIP, CDNA on HIP):"
rg -nP '\bCUDA\("cuda"\)\s*if\s*getattr\(torch\.version,\s*"hip",\s*None\)\s*is\s*None\s*else\s*CDNA\("hip"\)' -C2 || true
rg -nP '\bCUDA\("cuda"\)\s*if\s*torch\.version\.hip\s*is\s*None\s*else\s*CDNA\("hip"\)' -C2 || true
# Find any remaining old mapping instances
echo
echo "Occurrences of old mapping (CDNA on non-HIP, CUDA on HIP):"
rg -nP '\bCDNA\("cuda"\)\s*if\s*torch\.version\.hip\s*is\s*None\s*else\s*CUDA\("hip"\)' -C2 || true
Length of output: 2545
Harden HIP detection by using getattr across all occurrences
To avoid AttributeError on torch builds that lack torch.version.hip, please replace raw attribute access with a safe getattr check in every spot where we currently do:
arch = CUDA("cuda") if torch.version.hip is None else CDNA("hip")
Please update the following locations:
• examples/gemm/example_gemm_autotune.py:19
• examples/analyze/example_conv_analyze.py:99
• examples/analyze/example_gemm_analyze.py:52
• benchmark/matmul/benchmark_matmul.py:56
• benchmark/matmul/benchmark_matmul_intrinsic.py:190
Example diff:
- arch = CUDA("cuda") if torch.version.hip is None else CDNA("hip")
+ arch = CUDA("cuda") if getattr(torch.version, "hip", None) is None else CDNA("hip")
For future maintainability, consider centralizing this logic into a small helper (e.g. def select_device(): ...) so any further changes only touch one place.
🤖 Prompt for AI Agents
In benchmark/matmul/benchmark_matmul.py around line 56 (and also update
examples/gemm/example_gemm_autotune.py:19,
examples/analyze/example_conv_analyze.py:99,
examples/analyze/example_gemm_analyze.py:52,
benchmark/matmul/benchmark_matmul_intrinsic.py:190), the code directly accesses
torch.version.hip which can raise AttributeError on some torch builds; replace
the raw attribute access with a safe getattr check (e.g. use
getattr(torch.version, "hip", None) is None to choose CUDA vs CDNA) in each of
those locations, and optionally factor this into a small helper function like
select_device() and call that helper from all these sites for maintainability.
Actionable comments posted: 0
♻️ Duplicate comments (1)
tilelang/language/proxy.py (1)
113-118: Update tuple detection to use isinstance for correct subclass handling. This is the same issue identified in a previous review. The current check uses type() equality which misses subclasses and creates an unnecessary list.
🧹 Nitpick comments (4)
tilelang/language/proxy.py (4)
56-68: Add strides parameter to docstring. The strides parameter is not documented in the method's docstring, which could lead to confusion for users of this API.
Apply this diff to add the missing parameter documentation:
  def from_ptr(self,
               pointer_var: Var,
               shape: tuple[PrimExpr, ...],
               dtype: str = "float32",
               strides: tuple[PrimExpr, ...] = None) -> Buffer:
      """Create a buffer from a pointer, shape, and data type.
      Args:
          pointer_var: The pointer variable
          shape: The shape of the buffer
          dtype: The data type of the buffer (default: float32)
+         strides: The strides of the buffer (default: None for contiguous)
      Returns:
          A buffer created from the given parameters
      """
      return match_buffer(pointer_var, shape, dtype=dtype, strides=strides)
123-135: Add strides parameter to docstring. Similar to the BufferProxy, the strides parameter is missing from the docstring.
Apply this diff to add the missing parameter documentation:
  def from_ptr(self,
               pointer_var: Var,
               shape: tuple[PrimExpr, ...],
               dtype: str = "float32",
               strides: tuple[PrimExpr, ...] = None) -> tir.Buffer:
      """Create a buffer from a pointer, shape, and data type.
      Args:
          pointer_var: The pointer variable
          shape: The shape of the buffer
          dtype: The data type of the buffer (default: float32)
+         strides: The strides of the buffer (default: None for contiguous)
      Returns:
          A buffer created from the given parameters
      """
      return match_buffer(pointer_var, shape, dtype=dtype, strides=strides)
163-178: Verify stride validation logic and improve error messages. The implementation looks good overall, but there are a few concerns:
- The validation bool(strides[-1] == 1) might not work correctly for all types of stride expressions.
- Error messages could be more descriptive.
Consider this improvement for more robust validation:
 def __call__(self, shape: Tuple[Any], strides: Tuple[Any], dtype: str = "float32") -> tir.Buffer:
     if len(shape) != len(strides):
-        raise ValueError("Invalid shape/strides' dimensions")
+        raise ValueError(f"Shape and strides must have same length: got {len(shape)} vs {len(strides)}")
-    if not bool(strides[-1] == 1):
+    # More robust check for last stride being 1
+    last_stride = strides[-1]
+    if not (last_stride == 1 or (hasattr(last_stride, 'value') and last_stride.value == 1)):
         # TODO(chenggang): shall we support non-contiguous even for the last dimension?
-        raise ValueError("The stride of the last dimension must be 1 (contiguous)")
+        raise ValueError(f"Last dimension stride must be 1 for contiguity, got {last_stride}")
     return super().__call__(shape, dtype=dtype, strides=strides)
244-246: Add strides parameter to TYPE_CHECKING docstring. The strides parameter is missing from the type stub's docstring for consistency.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
examples/fusedmoe/example_fusedmoe_tilelang.py (0 hunks)
tilelang/language/proxy.py (8 hunks)
💤 Files with no reviewable changes (1)
- examples/fusedmoe/example_fusedmoe_tilelang.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
tilelang/language/proxy.py (1)
tilelang/language/ast/ir.py (1)
match_buffer(244-339)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-amd
- GitHub Check: build-test-nvidia
🔇 Additional comments (6)
tilelang/language/proxy.py (6)
4-4: LGTM! The addition of the Tuple import is needed for the new type annotations in the stride-related functionality.
145-151: LGTM! Efficient stride computation implementation. The _construct_strides method correctly computes contiguous strides by working backwards through the shape dimensions. The implementation is efficient and mathematically correct; a standalone sketch of the computation follows.
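A minimal, standalone sketch of the row-major stride computation on plain Python integers (the actual _construct_strides operates on PrimExpr shapes inside the proxy, so this is illustrative only):

def construct_contiguous_strides(shape):
    """Row-major strides: the last dimension has stride 1, and each earlier
    dimension's stride is the product of all later extents."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)


assert construct_contiguous_strides((2, 16, 1)) == (16, 1, 1)
assert construct_contiguous_strides((4, 8)) == (8, 1)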
153-161: LGTM! Clean integration of default stride computation. The updated __call__ method correctly handles scalar/single-dimension inputs and automatically computes contiguous strides. The implementation maintains backward compatibility while adding the new stride functionality.
251-252: LGTM! Correct type stub for StridedTensor. The type stub correctly inherits from BaseTensor to provide proper type checking support.
264-264: LGTM! Runtime instantiation of StridedTensor. The runtime alias correctly instantiates StridedTensorProxy() for use outside type checking.
295-299: LGTM! Clean public API for tensor creation. The make_tensor function provides a clean public interface that forwards all parameters, including the new strides parameter, to the underlying implementation.
Actionable comments posted: 7
♻️ Duplicate comments (2)
src/target/codegen_cuda.cc (1)
1723-1735: Commented-out alignment check — keep disabled, but clarify with a TODO. There's a legacy, commented-out modular alignment check here. The previous review already flagged this as dead code; you deferred removal due to dynamic strides. Keep it disabled, but replace the block with a single TODO for future re-enablement under static alignment.
- // arith::ModularSet me = arith::Analyzer().modular_set(ramp->base);
- // The condition: {k * coeff + base} divisible by the alignment for any k
- // if (me->coeff % op->dtype.lanes() == 0 && me->base % op->dtype.lanes()
- // == 0) {
- // can_vector_load = true;
- // }
+ // TODO(LeiWang1999): Consider re-enabling modular alignment checks for vector loads
+ // when static alignment information is available; keep disabled for dynamic strides.

tilelang/jit/adapter/cython/cython_wrapper.pyx (1)
194-200: Replace private overlap check with a stable, public API. Using torch._debug_has_internal_overlap is fragile across PyTorch versions. Use is_non_overlapping_and_dense() on the constructed base view to detect internal overlap in a supported way. Apply this diff:
- # TODO(chenggang): remove this check or rewrite by ourselves?
- if isinstance(tensor, torch.Tensor) and tensor._base is not None and not tensor.is_contiguous():
-     base_tensor = tensor._base.as_strided(tensor._base.shape, tensor.stride())
-     if torch._debug_has_internal_overlap(base_tensor):
-         raise ValueError(f"Cannot use an overlapping tensor"
-                          f"(shape={tensor.shape}, strides={tensor.stride()}, "
-                          f"overlap={torch._debug_has_internal_overlap(base_tensor)}) as the kernel input")
+ # TODO(chenggang): remove this check or rewrite by ourselves?
+ if isinstance(tensor, torch.Tensor) and tensor._base is not None and not tensor.is_contiguous():
+     base_tensor = tensor._base.as_strided(tensor._base.shape, tensor.stride())
+     if not base_tensor.is_non_overlapping_and_dense():
+         raise ValueError(
+             f"Cannot use an overlapping/non-dense tensor as kernel input "
+             f"(shape={tuple(tensor.shape)}, strides={tuple(tensor.stride())})"
+         )

Note: If you intend to allow non-dense-but-non-overlapping views, refine this check to strictly test overlap only. A standalone illustration of the two kinds of views follows.
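A small, self-contained illustration of the distinction this check cares about, using only standard torch view APIs (it is not code from the wrapper itself):

import torch

base = torch.arange(16, dtype=torch.float32)

# Non-contiguous but non-overlapping: every other element of the base storage.
strided_view = base[::2]
assert not strided_view.is_contiguous()
assert strided_view.stride() == (2,)

# Overlapping: a zero stride makes different indices alias the same element,
# which is the situation the wrapper must reject for kernel inputs.
overlapping = base.as_strided(size=(4, 4), stride=(0, 1))
assert overlapping.stride() == (0, 1)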
🧹 Nitpick comments (5)
src/target/codegen_cuda.cc (1)
1718-1718: Typo in comment: "delcare" → "declare".
Apply this diff:
- // delcare type.
+ // declare type.

tilelang/jit/adapter/cython/cython_wrapper.pyx (4)
62-69: Harden setters with type validation and None defaults. Defensive checks here prevent subtle bugs from misconfigured adapters and allow callers to pass None to skip constraints.
Apply this diff:
 def set_static_strides_map(self, static_strides_map):
-    self.static_strides_map = static_strides_map
-    return self
+    if static_strides_map is None:
+        self.static_strides_map = {}
+    elif not isinstance(static_strides_map, dict):
+        raise TypeError("static_strides_map must be a dict")
+    else:
+        self.static_strides_map = static_strides_map
+    return self

 def set_static_contiguous_list(self, static_contiguous_list):
-    self.static_contiguous_list = static_contiguous_list
-    return self
+    if static_contiguous_list is None:
+        self.static_contiguous_list = []
+    elif not isinstance(static_contiguous_list, (list, tuple)):
+        raise TypeError("static_contiguous_list must be a list/tuple of (buffer_idx, param) pairs")
+    else:
+        self.static_contiguous_list = list(static_contiguous_list)
+    return self
134-142: Enrich contiguity error message with shape/stride context. Including shape/strides greatly speeds up diagnostics.
Apply this diff:
 for buffer_idx, param in self.static_contiguous_list:
     tensor = tensor_list[buffer_idx]
     if not isinstance(tensor, torch.Tensor):
         # otherwise, maybe torch.data_ptr() for T.ptr inputs
         continue
     if not tensor.is_contiguous():
-        raise ValueError(f"Expected parameter {param} to be a contiguous tensor")
+        raise ValueError(
+            f"Expected parameter {param} to be a contiguous tensor "
+            f"(shape={tuple(tensor.shape)}, strides={tuple(tensor.stride())})"
+        )
179-181: Break after resolving a dynamic shape symbol to avoid duplicate appends and wasted work. If multiple keys stringify to the same name (unlikely but possible), you'll append duplicates; breaking out also avoids unnecessary iteration.
Apply this diff:
- ref_id, ref_tensor_idx, ref_shape_idx = self.dynamic_symbolic_map[key]
- shape.append(tensor_list[ref_tensor_idx].shape[ref_shape_idx])
+ ref_id, ref_tensor_idx, ref_shape_idx = self.dynamic_symbolic_map[key]
+ shape.append(tensor_list[ref_tensor_idx].shape[ref_shape_idx])
+ break

Additionally, consider avoiding str() comparisons by keying dynamic_symbolic_map with the actual tir.Var or a stable identifier (e.g., s.name), if feasible.
239-244: Validate dynamic index bounds when emitting shape/stride arguments. If a map entry references an invalid dim, you'll get an unhelpful IndexError. Provide clear errors and normalize negative indices. Apply this diff:
- for _, (ref_id, buffer_idx, shape_idx) in self.dynamic_symbolic_map.items():
-     if ref_id == 0:
-         call_args.append(tensor_list[buffer_idx].shape[shape_idx])
-     else:
-         call_args.append(tensor_list[buffer_idx].stride(shape_idx))
+ for _, (ref_id, buffer_idx, shape_idx) in self.dynamic_symbolic_map.items():
+     t = tensor_list[buffer_idx]
+     if not isinstance(t, torch.Tensor):
+         raise ValueError(f"Dynamic symbol references non-tensor input at index {buffer_idx}")
+     ndim = t.dim()
+     idx = shape_idx if shape_idx >= 0 else shape_idx + ndim
+     if idx < 0 or idx >= ndim:
+         raise ValueError(
+             f"Dynamic symbol index out of range: idx={shape_idx}, ndim={ndim}, "
+             f"tensor shape={tuple(t.shape)}, strides={tuple(t.stride())}"
+         )
+     if ref_id == 0:
+         call_args.append(t.shape[idx])
+     else:
+         call_args.append(t.stride()[idx])
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
src/target/codegen_cuda.cc (1 hunks)
tilelang/jit/adapter/cython/cython_wrapper.pyx (6 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/target/codegen_cuda.cc (1)
src/target/codegen_hip.cc (24)
VisitExpr_ (604-632), VisitExpr_ (604-604), VisitExpr_ (752-963), VisitExpr_ (752-752), VisitExpr_ (1026-1039), VisitExpr_ (1026-1026), VisitExpr_ (1041-1156), VisitExpr_ (1041-1042), VisitExpr_ (1203-1206), VisitExpr_ (1203-1204), op (90-105), op (90-90), GetBufferRef (693-750), GetBufferRef (693-695), HandleVolatileLoads (1208-1222), HandleVolatileLoads (1208-1210), PrintStorageScope (572-582), PrintStorageScope (572-573), PrintType (179-422), PrintType (179-179), PrintVecElemLoad (460-509), PrintVecElemLoad (460-462), PrintVecElemLoadExpr (1224-1289), PrintVecElemLoadExpr (1224-1226)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-nvidia
- GitHub Check: build-test-amd
- GitHub Check: bot-task
🔇 Additional comments (2)
src/target/codegen_cuda.cc (2)
1705-1711: Preconditions for BufferLoad visitor — LGTM. Restricting to flat (1D), non-predicated loads is a reasonable, explicit constraint here. Good defensive checks.
1737-1741: FP4 lane special-case — prudent fallback. Disabling vector loads for multi-lane float4_e2m1fn avoids incomplete-byte hazards. Good defensive choice.
| std::string ref = GetVecLoad(op->dtype, op->buffer.get(), base.Eval()); | ||
| HandleVolatileLoads(ref, op, os); | ||
| } else { |
💡 Verification agent
🧩 Analysis chain
Confirm GetVecLoad availability
The vector path relies on GetVecLoad(dtype, buffer, base). Please confirm this overload is declared/defined for CUDA and matches the expected semantics (alignment, address space handling).
Run the following to verify declaration in the header and definitions in the codebase:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Searching for VisitExpr_ BufferLoad declaration in codegen_cuda.h"
rg -n -C2 -P 'class\s+CodeGenTileLangCUDA\b(?s).*?\bVisitExpr_\s*\(\s*const\s+BufferLoadNode\s*\*\s*op\s*,\s*std::ostream\s*&\s*os\)' -- src
echo
echo "Searching for GetVecLoad declarations/definitions"
rg -n -C3 -P '\bGetVecLoad\s*\(' -- src

Length of output: 303
To be sure the GetVecLoad overload exists (and isn’t just a false negative from the previous search), let’s search for every occurrence of GetVecLoad in all codegen files:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Searching for all GetVecLoad occurrences in codegen sources"
rg -n 'GetVecLoad' -g 'src/**/codegen_*.cc' -g 'src/**/codegen_*.h'

Length of output: 326
Let’s search for any GetVecLoad declarations or definitions across the entire src/ tree to ensure it’s not defined in another module:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Searching for explicit GetVecLoad definitions (e.g., std::string GetVecLoad(...))"
rg -nP 'std::string\s+GetVecLoad\s*\(' -- src
echo
echo "Searching for all GetVecLoad occurrences"
rg -n 'GetVecLoad' -- src

Length of output: 240
Missing GetVecLoad implementation in CUDA codegen
I searched the entire src/ tree and only found the call in codegen_cuda.cc (line 1743). There is no declaration or definition of any GetVecLoad overload for CUDA, so this reference will fail to compile.
• Add a declaration in src/target/codegen_cuda.h (e.g.:
std::string GetVecLoad(DataType dtype, const BufferNode* buffer, Expr base);)
• Provide the corresponding definition in src/target/codegen_cuda.cc, handling alignment and CUDA address spaces as needed.
This must be implemented before merging.
🤖 Prompt for AI Agents
In src/target/codegen_cuda.cc around lines 1743 to 1745, the call to
GetVecLoad(...) is unresolved because there is no CUDA-specific
declaration/definition; add a declaration in src/target/codegen_cuda.h with the
signature std::string GetVecLoad(DataType dtype, const BufferNode* buffer, Expr
base); and implement the matching definition in src/target/codegen_cuda.cc:
accept (DataType dtype, const BufferNode* buffer, Expr base), compute lane
width/alignment from dtype and buffer->dtype/strides, generate CUDA-appropriate
load code that respects buffer address spaces (global/shared/const) and
alignment/predication for unaligned or tail lanes, reuse the existing
GetVecStore/load patterns from other backends as a template, include the header
where needed, and run a build to ensure the symbol is resolved.
| value_temp << "(("; | ||
| if (buffer_var.get()->dtype.is_handle()) { | ||
| auto it = alloc_storage_scope_.find(buffer_var.get()); | ||
| if (it != alloc_storage_scope_.end()) { | ||
| PrintStorageScope(it->second, value_temp); | ||
| } | ||
| } | ||
| PrintType(elem_type, value_temp); | ||
| value_temp << "*)" << vid << ')'; |
🛠️ Refactor suggestion
Avoid printing storage scope in C-style casts unless it’s part of the type
In the per-lane fallback, the cast includes the storage scope unconditionally. Elsewhere (e.g., GetBufferRef) this is guarded by IsScopePartOfType() to avoid generating invalid types like (shared float*). Mirror that behavior here to prevent malformed code on targets where storage scope is not encoded in the type.
- if (!HandleTypeMatch(buffer_var.get(), elem_type)) {
- value_temp << "((";
- if (buffer_var.get()->dtype.is_handle()) {
- auto it = alloc_storage_scope_.find(buffer_var.get());
- if (it != alloc_storage_scope_.end()) {
- PrintStorageScope(it->second, value_temp);
- }
- }
- PrintType(elem_type, value_temp);
- value_temp << "*)" << vid << ')';
- } else {
- value_temp << vid;
- }
+ if (!HandleTypeMatch(buffer_var.get(), elem_type)) {
+ value_temp << "((";
+ if (buffer_var.get()->dtype.is_handle()) {
+ auto it = alloc_storage_scope_.find(buffer_var.get());
+ if (it != alloc_storage_scope_.end() && IsScopePartOfType()) {
+ PrintStorageScope(it->second, value_temp);
+ }
+ }
+ PrintType(elem_type, value_temp);
+ value_temp << "*)" << vid << ')';
+ } else {
+ value_temp << vid;
+ }

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| value_temp << "(("; | |
| if (buffer_var.get()->dtype.is_handle()) { | |
| auto it = alloc_storage_scope_.find(buffer_var.get()); | |
| if (it != alloc_storage_scope_.end()) { | |
| PrintStorageScope(it->second, value_temp); | |
| } | |
| } | |
| PrintType(elem_type, value_temp); | |
| value_temp << "*)" << vid << ')'; | |
| if (!HandleTypeMatch(buffer_var.get(), elem_type)) { | |
| value_temp << "(("; | |
| if (buffer_var.get()->dtype.is_handle()) { | |
| auto it = alloc_storage_scope_.find(buffer_var.get()); | |
| if (it != alloc_storage_scope_.end() && IsScopePartOfType()) { | |
| PrintStorageScope(it->second, value_temp); | |
| } | |
| } | |
| PrintType(elem_type, value_temp); | |
| value_temp << "*)" << vid << ')'; | |
| } else { | |
| value_temp << vid; | |
| } |
🤖 Prompt for AI Agents
In src/target/codegen_cuda.cc around lines 1753 to 1761, the C-style cast in the
per-lane fallback unconditionally prints the storage scope which can produce
invalid types like '(__shared__ float*)'; change the logic so you only emit the
storage scope when it is actually part of the buffer's type by adding a guard
that calls IsScopePartOfType(buffer_var->dtype) (or equivalent) before invoking
PrintStorageScope on the alloc_storage_scope_ entry; keep the existing
alloc_storage_scope_ lookup and only call PrintStorageScope if the entry exists
AND the scope is part of the type, otherwise skip printing the scope so the
resulting cast is valid.
| object dynamic_symbolic_map # Maps dynamic dimensions to their corresponding tensor indices | ||
| object buffer_device_map # Maps buffer variables to their corresponding devices | ||
| object buffer_dtype_map # Maps buffer variables to their corresponding dtypes | ||
| object static_shape_map # Maps buffer variables to their corresponding static shapes | ||
| object static_strides_map # Maps buffer variables to their corresponding static strides | ||
| object static_contiguous_list # A list contains contiguous buffers | ||
| object ptr_map # Maps pointer arguments to their corresponding buffer indices | ||
| list result_idx # Indices of output tensors in the params list | ||
| list params # List of parameter specifications (includes both inputs and outputs) | ||
| object lib # Reference to the compiled library containing the kernel |
💡 Verification agent
🧩 Analysis chain
Default-initialize new config fields or guard their usage to avoid AttributeError/TypeError at runtime
static_strides_map and static_contiguous_list are new and later used unconditionally. If an upstream adapter forgets to set them, forward() will raise on .items() or iteration. Initialize these to sane defaults in __cinit__ (e.g., {}/[]) or guard the checks (see diff on Lines 235–237).
Recommended initialization in __cinit__ (outside this hunk):
# In __cinit__
self.dynamic_symbolic_map = {}
self.buffer_device_map = {}
self.buffer_dtype_map = {}
self.static_shape_map = {}
self.static_strides_map = {}
self.static_contiguous_list = []
self.ptr_map = {}

🏁 Script executed:
#!/bin/bash
set -e
echo "=== __cinit__ definitions ==="
grep -R "def __cinit__" -n tilelang/jit/adapter/cython/cython_wrapper.pyx || echo "(none found)"
echo ""
echo "=== Usage of static_strides_map ==="
grep -R "static_strides_map" -n tilelang/jit/adapter/cython/cython_wrapper.pyx || echo "(none found)"
echo ""
echo "=== Usage of static_contiguous_list ==="
grep -R "static_contiguous_list" -n tilelang/jit/adapter/cython/cython_wrapper.pyx || echo "(none found)"
echo ""
echo "=== forward() method ==="
grep -R "def forward" -n tilelang/jit/adapter/cython/cython_wrapper.pyx || echo "(none found)"Length of output: 1389
🏁 Script executed:
#!/bin/bash
set -e
echo "=== Lines 1-80 (including __cinit__) ==="
sed -n '1,80p' tilelang/jit/adapter/cython/cython_wrapper.pyx || true
echo ""
echo "=== Lines 130-180 (forward and usage sites) ==="
sed -n '130,180p' tilelang/jit/adapter/cython/cython_wrapper.pyx || true

Length of output: 6306
Initialize all buffer‐config fields in __cinit__ to prevent runtime errors
static_strides_map and static_contiguous_list (and their sibling maps) are declared but never default‐initialized. If an upstream caller never invokes their setters, any unguarded access in forward() (e.g. .items() or iteration) will raise.
Pinpoint and patch in tilelang/jit/adapter/cython/cython_wrapper.pyx, inside __cinit__ (around line 29):
def __cinit__(self, result_idx, params, lib):
self.result_idx = result_idx
self.params = params
self.lib = lib
+ # Default‐initialize all maps/lists to sane defaults
+ self.dynamic_symbolic_map = {}
+ self.buffer_device_map = {}
+ self.buffer_dtype_map = {}
+ self.static_shape_map = {}
+ self.static_strides_map = {}
+ self.static_contiguous_list = []
+ self.ptr_map = {}
  # Existing dtype/shape conversion logic…

This ensures that later calls in forward() (and the internal _check_* helpers) won't encounter AttributeError or TypeError.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| object dynamic_symbolic_map # Maps dynamic dimensions to their corresponding tensor indices | |
| object buffer_device_map # Maps buffer variables to their corresponding devices | |
| object buffer_dtype_map # Maps buffer variables to their corresponding dtypes | |
| object static_shape_map # Maps buffer variables to their corresponding static shapes | |
| object static_strides_map # Maps buffer variables to their corresponding static strides | |
| object static_contiguous_list # A list contains contiguous buffers | |
| object ptr_map # Maps pointer arguments to their corresponding buffer indices | |
| list result_idx # Indices of output tensors in the params list | |
| list params # List of parameter specifications (includes both inputs and outputs) | |
| object lib # Reference to the compiled library containing the kernel | |
| def __cinit__(self, result_idx, params, lib): | |
| self.result_idx = result_idx | |
| self.params = params | |
| self.lib = lib | |
| # Default‐initialize all maps/lists to sane defaults | |
| self.dynamic_symbolic_map = {} | |
| self.buffer_device_map = {} | |
| self.buffer_dtype_map = {} | |
| self.static_shape_map = {} | |
| self.static_strides_map = {} | |
| self.static_contiguous_list = [] | |
| self.ptr_map = {} | |
| # Existing dtype/shape conversion logic… |
🤖 Prompt for AI Agents
In tilelang/jit/adapter/cython/cython_wrapper.pyx around lines 14–29 (inside
__cinit__), several buffer-config fields like static_strides_map and
static_contiguous_list (and their sibling maps) are declared but not
default-initialized; initialize all declared maps to empty dicts (e.g.
dynamic_symbolic_map, buffer_device_map, buffer_dtype_map, static_shape_map,
static_strides_map, ptr_map) and initialize list fields to empty lists
(static_contiguous_list, result_idx, params), and set lib to None so that
forward() and the _check_* helpers can safely call .items() or iterate without
risking AttributeError/TypeError.
| list param_dtypes # Cache for parameter dtypes | ||
| list param_shapes # Cache for parameter shapes as native Python lists | ||
| object get_current_device |
🛠️ Refactor suggestion
Map TVM dtypes to torch dtypes for param_dtypes; current cache likely holds non-torch dtypes
param_dtypes = [param.dtype for param in params] will carry TVM/Tile types. You later use these for torch.empty(..., dtype=...) and for dtype_to_ctype lookup. Without mapping, output allocation or scalar marshalling can fail.
Initialize with map_torch_type once to guarantee torch dtypes.
Apply in __cinit__ (outside this hunk):
# Convert TVM types to torch dtypes during initialization
self.param_dtypes = [map_torch_type(param.dtype) for param in params]

Also, consider documenting/handling scalar half/bfloat16 cases in your dtype_to_ctype mapping (ctypes has no native float16; see suggestions near call_args assembly). A sketch of such a dtype mapping is given below.
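A minimal sketch of a dtype-string-to-torch mapping in the spirit of map_torch_type (the real helper in tilelang may accept different input types and cover more dtypes; the name and table below are illustrative). For scalar half/bfloat16 arguments, a common workaround is to marshal the raw 16-bit pattern through ctypes.c_uint16, since ctypes has no native float16 type; that choice is an assumption here, not the wrapper's documented behavior.

import torch

_STR_TO_TORCH_DTYPE = {
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
    "float32": torch.float32,
    "float64": torch.float64,
    "int8": torch.int8,
    "int32": torch.int32,
    "int64": torch.int64,
    "uint8": torch.uint8,
    "bool": torch.bool,
}


def map_torch_type_sketch(dtype) -> torch.dtype:
    """Pass torch dtypes through; translate TVM-style dtype strings."""
    if isinstance(dtype, torch.dtype):
        return dtype
    return _STR_TO_TORCH_DTYPE[str(dtype)]


assert map_torch_type_sketch("bfloat16") is torch.bfloat16
assert map_torch_type_sketch(torch.float32) is torch.float32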
| cpdef void _check_static_shape(self, list tensor_list): | ||
| for param, (buffer_idx, shape_list) in self.static_shape_map.items(): | ||
| tensor = tensor_list[buffer_idx] | ||
| if isinstance(tensor, torch.Tensor): | ||
| for shape_idx, expected_shape in shape_list: | ||
| actual_shape = tensor.shape[shape_idx] | ||
| if actual_shape != expected_shape: | ||
| raise ValueError( | ||
| f"Static shape mismatch for parameter {param}: " | ||
| f"expected {expected_shape} at index {shape_idx}, " | ||
| f"got {actual_shape}" | ||
| ) | ||
| if not isinstance(tensor, torch.Tensor): | ||
| # otherwise, maybe torch.data_ptr() for T.ptr inputs | ||
| continue | ||
| for shape_idx, expected_shape in shape_list: | ||
| actual_shape = tensor.shape[shape_idx] | ||
| if actual_shape != expected_shape: | ||
| raise ValueError( | ||
| f"Static shape mismatch for parameter {param}: " | ||
| f"expected {expected_shape} at index {shape_idx}, " | ||
| f"got {actual_shape}" | ||
| ) | ||
|
|
🛠️ Refactor suggestion
Bounds-check static shape indices to avoid IndexError and improve error clarity
Accessing tensor.shape[shape_idx] will raise an unhelpful IndexError for invalid indices (especially negative). Add explicit bounds handling.
Apply this diff:
for param, (buffer_idx, shape_list) in self.static_shape_map.items():
tensor = tensor_list[buffer_idx]
if not isinstance(tensor, torch.Tensor):
# otherwise, maybe torch.data_ptr() for T.ptr inputs
continue
for shape_idx, expected_shape in shape_list:
- actual_shape = tensor.shape[shape_idx]
+ ndim = tensor.dim()
+ idx = shape_idx if shape_idx >= 0 else shape_idx + ndim
+ if idx < 0 or idx >= ndim:
+ raise ValueError(
+ f"Static shape index out of range for parameter {param}: "
+ f"index={shape_idx}, ndim={ndim}, shape={tuple(tensor.shape)}"
+ )
+ actual_shape = tensor.shape[idx]
if actual_shape != expected_shape:
raise ValueError(
f"Static shape mismatch for parameter {param}: "
- f"expected {expected_shape} at index {shape_idx}, "
- f"got {actual_shape}"
+ f"expected {expected_shape} at index {shape_idx} "
+ f"(resolved={idx}), got {actual_shape} (shape={tuple(tensor.shape)})"
)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cpdef void _check_static_shape(self, list tensor_list): | |
| for param, (buffer_idx, shape_list) in self.static_shape_map.items(): | |
| tensor = tensor_list[buffer_idx] | |
| if isinstance(tensor, torch.Tensor): | |
| for shape_idx, expected_shape in shape_list: | |
| actual_shape = tensor.shape[shape_idx] | |
| if actual_shape != expected_shape: | |
| raise ValueError( | |
| f"Static shape mismatch for parameter {param}: " | |
| f"expected {expected_shape} at index {shape_idx}, " | |
| f"got {actual_shape}" | |
| ) | |
| if not isinstance(tensor, torch.Tensor): | |
| # otherwise, maybe torch.data_ptr() for T.ptr inputs | |
| continue | |
| for shape_idx, expected_shape in shape_list: | |
| actual_shape = tensor.shape[shape_idx] | |
| if actual_shape != expected_shape: | |
| raise ValueError( | |
| f"Static shape mismatch for parameter {param}: " | |
| f"expected {expected_shape} at index {shape_idx}, " | |
| f"got {actual_shape}" | |
| ) | |
| cpdef void _check_static_shape(self, list tensor_list): | |
| for param, (buffer_idx, shape_list) in self.static_shape_map.items(): | |
| tensor = tensor_list[buffer_idx] | |
| if not isinstance(tensor, torch.Tensor): | |
| # otherwise, maybe torch.data_ptr() for T.ptr inputs | |
| continue | |
| for shape_idx, expected_shape in shape_list: | |
| ndim = tensor.dim() | |
| idx = shape_idx if shape_idx >= 0 else shape_idx + ndim | |
| if idx < 0 or idx >= ndim: | |
| raise ValueError( | |
| f"Static shape index out of range for parameter {param}: " | |
| f"index={shape_idx}, ndim={ndim}, shape={tuple(tensor.shape)}" | |
| ) | |
| actual_shape = tensor.shape[idx] | |
| if actual_shape != expected_shape: | |
| raise ValueError( | |
| f"Static shape mismatch for parameter {param}: " | |
| f"expected {expected_shape} at index {shape_idx} " | |
| f"(resolved={idx}), got {actual_shape} (shape={tuple(tensor.shape)})" | |
| ) |
🤖 Prompt for AI Agents
In tilelang/jit/adapter/cython/cython_wrapper.pyx around lines 104 to 118, the
code accesses tensor.shape[shape_idx] directly which can raise an IndexError for
invalid or negative indices; update the method to validate and normalize
shape_idx before indexing: compute ndim = len(tensor.shape), if shape_idx is
negative add ndim to normalize, then check 0 <= shape_idx < ndim and if out of
range raise a ValueError with a clear message showing param, provided index, and
tensor rank (do not let a raw IndexError escape); only after the bounds check
access tensor.shape[shape_idx] and keep the existing static-shape mismatch
ValueError for mismatched sizes.
| cpdef void _check_static_strides(self, list tensor_list): | ||
| for param, (buffer_idx, strides_list) in self.static_strides_map.items(): | ||
| tensor = tensor_list[buffer_idx] | ||
| if not isinstance(tensor, torch.Tensor): | ||
| # otherwise, maybe torch.data_ptr() for T.ptr inputs | ||
| continue | ||
| for stride_idx, expected_stride in strides_list: | ||
| actual_stride = tensor.stride(stride_idx) | ||
| if actual_stride != expected_stride: | ||
| raise ValueError( | ||
| f"Static stride mismatch for parameter {param}: " | ||
| f"expected {expected_stride} at index {stride_idx}, " | ||
| f"got {actual_stride}" | ||
| ) | ||
|
|
🛠️ Refactor suggestion
Bounds-check static stride indices and support negative indexing
Similar to shapes, tensor.stride(dim) with an invalid index raises IndexError. Normalize negative indices and validate bounds.
Apply this diff:
for param, (buffer_idx, strides_list) in self.static_strides_map.items():
tensor = tensor_list[buffer_idx]
if not isinstance(tensor, torch.Tensor):
# otherwise, maybe torch.data_ptr() for T.ptr inputs
continue
for stride_idx, expected_stride in strides_list:
- actual_stride = tensor.stride(stride_idx)
+ ndim = tensor.dim()
+ idx = stride_idx if stride_idx >= 0 else stride_idx + ndim
+ if idx < 0 or idx >= ndim:
+ raise ValueError(
+ f"Static stride index out of range for parameter {param}: "
+ f"index={stride_idx}, ndim={ndim}, strides={tuple(tensor.stride())}"
+ )
+ actual_stride = tensor.stride()[idx]
if actual_stride != expected_stride:
raise ValueError(
f"Static stride mismatch for parameter {param}: "
- f"expected {expected_stride} at index {stride_idx}, "
- f"got {actual_stride}"
+ f"expected {expected_stride} at index {stride_idx} "
+ f"(resolved={idx}), got {actual_stride} "
+ f"(strides={tuple(tensor.stride())})"
)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cpdef void _check_static_strides(self, list tensor_list): | |
| for param, (buffer_idx, strides_list) in self.static_strides_map.items(): | |
| tensor = tensor_list[buffer_idx] | |
| if not isinstance(tensor, torch.Tensor): | |
| # otherwise, maybe torch.data_ptr() for T.ptr inputs | |
| continue | |
| for stride_idx, expected_stride in strides_list: | |
| actual_stride = tensor.stride(stride_idx) | |
| if actual_stride != expected_stride: | |
| raise ValueError( | |
| f"Static stride mismatch for parameter {param}: " | |
| f"expected {expected_stride} at index {stride_idx}, " | |
| f"got {actual_stride}" | |
| ) | |
| cpdef void _check_static_strides(self, list tensor_list): | |
| for param, (buffer_idx, strides_list) in self.static_strides_map.items(): | |
| tensor = tensor_list[buffer_idx] | |
| if not isinstance(tensor, torch.Tensor): | |
| # otherwise, maybe torch.data_ptr() for T.ptr inputs | |
| continue | |
| for stride_idx, expected_stride in strides_list: | |
| ndim = tensor.dim() | |
| idx = stride_idx if stride_idx >= 0 else stride_idx + ndim | |
| if idx < 0 or idx >= ndim: | |
| raise ValueError( | |
| f"Static stride index out of range for parameter {param}: " | |
| f"index={stride_idx}, ndim={ndim}, strides={tuple(tensor.stride())}" | |
| ) | |
| actual_stride = tensor.stride()[idx] | |
| if actual_stride != expected_stride: | |
| raise ValueError( | |
| f"Static stride mismatch for parameter {param}: " | |
| f"expected {expected_stride} at index {stride_idx} " | |
| f"(resolved={idx}), got {actual_stride} " | |
| f"(strides={tuple(tensor.stride())})" | |
| ) |
🤖 Prompt for AI Agents
In tilelang/jit/adapter/cython/cython_wrapper.pyx around lines 119 to 133, the
stride index is used directly which can raise an unhandled IndexError for
out-of-range or negative values; normalize negative indices by adding
tensor.dim() and validate that 0 <= normalized_index < tensor.dim() before
calling tensor.stride(normalized_index), and if out of bounds raise a clear
IndexError indicating the parameter name, original stride_idx and valid range.
Ensure you compute tensor_dim = tensor.dim() once per tensor and use the
normalized index for the actual stride check.
| self._check_static_strides(tensor_list) | ||
| self._check_static_contiguous(tensor_list) | ||
|
|
🛠️ Refactor suggestion
Guard new static checks to handle absent configs gracefully
Avoids failures when these constraints are not configured by the caller.
Apply this diff:
- self._check_static_strides(tensor_list)
- self._check_static_contiguous(tensor_list)
+ if getattr(self, "static_strides_map", None) is not None:
+ self._check_static_strides(tensor_list)
+ if getattr(self, "static_contiguous_list", None) is not None:
+ self._check_static_contiguous(tensor_list)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| self._check_static_strides(tensor_list) | |
| self._check_static_contiguous(tensor_list) | |
| if getattr(self, "static_strides_map", None) is not None: | |
| self._check_static_strides(tensor_list) | |
| if getattr(self, "static_contiguous_list", None) is not None: | |
| self._check_static_contiguous(tensor_list) |
🤖 Prompt for AI Agents
In tilelang/jit/adapter/cython/cython_wrapper.pyx around lines 235-237, the
unguarded calls to self._check_static_strides(tensor_list) and
self._check_static_contiguous(tensor_list) will raise when the caller did not
configure static-constraint checks; guard those calls by first checking whether
the configuration or flags enabling static checks are present (e.g. a boolean
like self.static_checks_enabled or a config dict on self), or fall back to
checking the presence of the methods with hasattr/getattr; only call each
_check_* method when the corresponding config flag exists and is truthy (or the
method exists), otherwise skip them gracefully.
…ansformation and updating expected function signature to use match_buffer for better clarity.
Actionable comments posted: 1
🧹 Nitpick comments (3)
testing/python/transform/test_tilelang_transform_Inject_software_pipeline.py (3)
12-12: Stabilize test by applying the same pass chain to both original and expected. Adding LowerOpaqueBlock is fine, but hard-coding the lowered "expected" IR makes the test brittle to pass-order tweaks. Prefer keeping expected at the high-level form and running the same pass pipeline on both sides inside _check. You can refactor _check like this to reduce churn:

def _check(original, expected):

    def pipe(mod):
        mod = tl.transform.InjectSoftwarePipeline()(mod)
        mod = tl.transform.Simplify()(mod)
        mod = tl.transform.LowerOpaqueBlock()(mod)
        return mod

    mod_in = pipe(tvm.IRModule.from_expr(original.with_attr("global_symbol", "main")))
    mod_ex = pipe(tvm.IRModule.from_expr(expected.with_attr("global_symbol", "main")))
    tvm.ir.assert_structural_equal(mod_in["main"], mod_ex["main"], True)
47-47: Be explicit about dtype for the declared buffer. Defaulting to float32 is fine, but being explicit reduces ambiguity and prevents subtle mismatches if defaults change.

- B = T.decl_buffer((2, 16, 1), scope="shared")
+ B = T.decl_buffer((2, 16, 1), dtype="float32", scope="shared")
49-51: Zero-extent loop is a no-op; confirm it matches the injected pipeline shape. For extent 1, a zero-iteration main loop plus a drain step is plausible. Just ensure the injected IR indeed emits this pattern so the structural equality stays stable.
If the pass emits T.serial consistently, you may mirror that style:
- for i in range(0):
+ for i in T.serial(0, 0):
      B[i + 1, tx, 0] = A[tx, i + 1] * T.float32(2.0)
      C[tx, i] = B[i, tx, 0] + T.float32(1.0)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
testing/python/transform/test_tilelang_transform_Inject_software_pipeline.py(2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
testing/python/transform/test_tilelang_transform_Inject_software_pipeline.py (1)
tilelang/language/ast/ir.py (4)
handle (1467-1497), match_buffer (244-339), launch_thread (1208-1240), decl_buffer (1137-1205)
🔇 Additional comments (1)
testing/python/transform/test_tilelang_transform_Inject_software_pipeline.py (1)
44-45: Explicit strides look correct for a contiguous (16, 1) tensor. Using strides=(1, 1) matches the expected contiguous layout and aligns with the LowerOpaqueBlock-lowered form.
| def expected(A_handle: T.handle, C_handle: T.handle): | ||
| A = T.match_buffer(A_handle, (16, 1), strides=(1, 1)) | ||
| C = T.match_buffer(C_handle, (16, 1), strides=(1, 1)) | ||
| tx = T.launch_thread("threadIdx.x", 16) |
💡 Verification agent
🧩 Analysis chain
Incorrect use of T.launch_thread — it returns a frame, not the thread var.
Per tilelang.language.ast.ir.launch_thread (see ir.py lines 1207–1239), launch_thread returns a LaunchThreadFrame. Don’t assign it to tx. Instead, declare the thread var with T.env_thread and then call T.launch_thread(tx, extent).
Apply this fix:
- tx = T.launch_thread("threadIdx.x", 16)
+ tx = T.env_thread("threadIdx.x")
+ T.launch_thread(tx, 16)

To verify usage patterns across the repo:
🏁 Script executed:
#!/bin/bash
# Instances where launch_thread is assigned (likely incorrect)
rg -n -C3 -P '\b=\s*T\.launch_thread\s*\(' --type=py
# Common/canonical patterns
rg -n -C3 -P 'T\.env_thread\s*\(' --type=py
rg -n -C3 -P 'T\.launch_thread\s*\(' --type=py

Length of output: 16467
🏁 Script executed:
#!/bin/bash
# Show the definition of launch_thread in the IR
sed -n '1180,1270p' tilelang/language/ast/ir.py

Length of output: 2208
Fix incorrect T.launch_thread usage in Inject software pipeline test
The call to T.launch_thread returns a LaunchThreadFrame, not an iteration variable. Assigning it to tx leads to a frame being used in buffer indexing. Instead, bind the thread var via T.env_thread and then invoke T.launch_thread(tx, ...).
• File: testing/python/transform/test_tilelang_transform_Inject_software_pipeline.py, line 46
Apply:
- tx = T.launch_thread("threadIdx.x", 16)
+ tx = T.env_thread("threadIdx.x")
+ T.launch_thread(tx, 16)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| tx = T.launch_thread("threadIdx.x", 16) | |
| tx = T.env_thread("threadIdx.x") | |
| T.launch_thread(tx, 16) |
🤖 Prompt for AI Agents
In testing/python/transform/test_tilelang_transform_Inject_software_pipeline.py
around line 46, the code assigns tx = T.launch_thread("threadIdx.x", 16) but
T.launch_thread returns a LaunchThreadFrame not a thread iteration var; instead
create the thread variable with tx = T.env_thread("threadIdx.x") and then call
T.launch_thread(tx, 16) (so the ensuing buffer indexing uses the tx iteration
variable, not the frame).
* [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer (#714) * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107 * Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling - Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes. - Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management. - Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body. - Removed obsolete code and improved overall code clarity and maintainability. * lint fix * Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls - Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves. - Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations. * test fix * Enhance global read detection in pipeline planning - Updated the handling of global reads to account for condition expressions within IfThenElse nodes, ensuring accurate identification of global memory accesses. - Introduced a new flag to track whether the visitor is within a condition expression, improving the correctness of buffer access analysis. - Refactored the VisitStmt_ method to properly handle the structure of IfThenElse nodes, enhancing the clarity and maintainability of the code. * Add IndexLegalizer to enforce int64 for out-of-bound indices - Introduced the IndexLegalizer class to ensure that indices in BufferStore and BufferLoad nodes are promoted to int64 when they exceed their type bounds. - Refactored the Int64Promoter logic from flatten_buffer.cc into IndexLegalizer, improving code organization and reusability. - Updated the ConfigIndexBitwidth pass to apply IndexLegalizer after rewriting the body, enhancing the handling of index bitwidths in transformations. * [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced (#718) * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107 * Rename build-test job to build-test-nvidia and specify nvidia as a runner label in CI workflow. * Update CI workflow to specify 'nvidia' as an additional runner label for the format-check job. * fix: NVRTC backend (#717) * fix: NVRTC backend * fix: CI --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [CUDA] Init support for sm_120 (#716) * Init support for sm120 * fmt * resolve comments * unify mma gemm * fmt --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [CI] fix docs ci (#720) * [Chore] fix typos (#719) * chore: fix typos * chore: fix ruff * chore: fix clang-format * [CI][AMD] Add AMD GPU CI and fix some related bugs (#694) * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668) - Enhanced buffer index handling to address precision issues by removing redundant operations. - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. - Updated related documentation to reflect changes in buffer management practices. * Remove obsolete test script for AMD example, streamlining the examples directory. * Remove unused dtype_size variable in AMD example script to streamline code. * Add input configuration file and update AMD example script for enhanced flexibility - Introduced a new input.txt file for configurable parameters. 
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack. - Streamlined the main function for better clarity and organization. - Added a new test script to facilitate running the example with specified parameters. * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations - Deleted input.txt and test.sh files as they are no longer needed. - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance. - Reintroduced swizzle usage in the kernel for better performance. * Refactor AMD example script for FlashAttention-2 - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`. - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls. - Removed outdated comments and improved code organization for better readability. * Refactor formatting in AMD FlashAttention example script - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function. - Streamlined the `main` function parameter formatting for consistency. - Removed unnecessary blank lines to enhance overall code organization. * Update example_amd_flash_attn_fwd.py * Update AMD FlashAttention example and TVM submodule - Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support. - Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads. - Updated the TVM submodule to the latest commit for improved functionality. - Introduced a new test script `test.sh` to facilitate running the new example with specified parameters. * Add CI workflow for automated format checking and testing - Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests. - The workflow includes steps for setting up a Python environment, running format checks, and executing tests. - Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory. * Rename CI workflow from "CI" to "AMD CI" for clarity and specificity. * Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management. * Update AMD CI workflow to install pytest directly instead of using requirements-test.txt * Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt * Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation * Remove Torchaudio package copying from AMD CI workflow to streamline dependency management. * Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment. * Add installation of ROCm in AMD CI workflow - Included a step to execute the `install_rocm.sh` script for improved setup. - Removed unnecessary blank line for better readability in the workflow script. * Remove installation step for ROCm in AMD CI workflow to simplify the setup process. * Update AMD CI workflow to run specific test file with verbose output instead of all tests. 
* Add new tilelang built-in operations for AMD architecture - Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang. - Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects. * Enhance autotuner configurations and GEMM operations in AMD example - Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning. - Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications. * Update autotuner configurations in AMD example for enhanced performance - Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning. - Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns. * Enhance autotuner configurations and memory handling in AMD example - Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities. - Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations. * Refine autotuner configurations and memory usage in AMD example - Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning. - Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations. * Update autotuner configurations in AMD example for enhanced performance - Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities. - Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations. * Enhance autotuner configurations and GEMM operations in AMD example - Expanded thread counts in `get_configs` to include higher values for improved autotuning. - Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications. * Update AMD CI workflow and remove obsolete test script - Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu. - Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure. * Remove TVM subproject from 3rdparty directory * Refactor configuration generation and accumulation logic in AMD example - Reformatted the `get_configs` function for improved readability by aligning parameters. - Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions. * Enhance AMD CI workflow with additional logging and setup steps - Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements. - Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed. 
* Comment out package copying in AMD CI workflow to prevent potential issues during environment setup * Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps * Enhance BuildTileLangHIP function by adding whitespace for improved readability * Refactor kTVMGridConstant definition for clarity and remove unnecessary comment * Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a * lint fix * Update AMD CI workflow to use requirements-rocm.txt for dependency installation * fix ci * Remove dependency on format-check from AMD CI workflow * fix ci * fix ci * fix ci * Remove format-check job from AMD CI workflow * Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow * Add dependency on format-check job in AMD CI workflow * Add format-check job to AMD CI workflow * Update format-check job in AMD CI workflow to run on self-hosted environment * Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes * Update amd_ci.yml --------- Co-authored-by: xinxyxiao <xinyxiao@amd.com> Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy (#724) * [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy * [Typo] Correct architecture selection for CUDA and CDNA * [Refactor] Refactor CUDA code generation to simplify eviction policy handling (#721) * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107 * Refactor CUDA code generation to simplify eviction policy handling - Updated `VisitExpr_` methods in `codegen_cuda.cc` to use default eviction policy for `tma_load`, `tma_load_im2col`, and `tma_store` functions, reducing complexity. - Removed conditional assembly code for `EVICT_NORMAL` in `copy_sm90.h`, streamlining the assembly calls for tensor memory operations. * lint fix * [Language] Introduce `StridedTensor` to support non contigious torch inputs (#722) * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107 * Support strided tensors * Refactor target attribute helper functions for improved clarity * No code changes made in proxy.py and setup.py * lint fix * lint fix via gemini * lint fix * test fix * test fix * lint fix * Update wrapper.py * test fix * Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity. * lint fix --------- Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com> * [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper (#712) * bug fix and support gemm_sr fallback for hopper * Update gemm.cc --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * 📝 Add docstrings to `fix` (#726) Docstrings generation was requested by @LeiWang1999. 
* https://github.com/tile-ai/tilelang/pull/712#issuecomment-3190680851 The following files were modified: * `src/op/gemm.cc` * `src/tl_templates/cuda/gemm_sm90.h` * `src/transform/warp_specialized_rewriter.cc` Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * [CI] Fix AMD CI (#729) * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668) - Enhanced buffer index handling to address precision issues by removing redundant operations. - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. - Updated related documentation to reflect changes in buffer management practices. * Remove obsolete test script for AMD example, streamlining the examples directory. * Remove unused dtype_size variable in AMD example script to streamline code. * Add input configuration file and update AMD example script for enhanced flexibility - Introduced a new input.txt file for configurable parameters. - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack. - Streamlined the main function for better clarity and organization. - Added a new test script to facilitate running the example with specified parameters. * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations - Deleted input.txt and test.sh files as they are no longer needed. - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance. - Reintroduced swizzle usage in the kernel for better performance. * Refactor AMD example script for FlashAttention-2 - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`. - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls. - Removed outdated comments and improved code organization for better readability. * Refactor formatting in AMD FlashAttention example script - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function. - Streamlined the `main` function parameter formatting for consistency. - Removed unnecessary blank lines to enhance overall code organization. * Update example_amd_flash_attn_fwd.py * Enhance AMD example script and update CI workflows - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization. - Added new CI workflows for AMD and documentation publishing. - Updated various requirements files to include necessary dependencies. - Introduced new test cases and examples for better coverage and functionality. - Refactored existing code for improved readability and maintainability. * Remove redundant tool cache cleanup step in AMD CI workflow * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements. --------- Co-authored-by: xinxyxiao <xinyxiao@amd.com> Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> * [Feature] Low-bit twiddling dequantization and FP4 GEMM (#725) * [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16 * [Dequant] Add extern call and serial dequantization * [Dequant] Parallel Dequant wait for fence debug. 
* [Scale] Add scale matrix to mxfp4 gemm * [Remove] Remove fence-buggy example and some generated source cuda code * [MXFP4] Update initial version of MXFP4 GEMM * [Scale] Add scale to latest mxfp4 gemm * [Lint] * [BugFix] Load Scale, disabe TMA to recover performance * [Lint] * [Lint] * [Scale] Use L2 to hold Scale and enable TMA will slightly boost performance * [Lint] * Update example_dequant_gemm_bf16_fp4_hopper_serial.py * Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory. * Refactor dequantization examples for improved readability and consistency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding. * Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency. * lint fix * ci fix * Remove non-existent example * [BugFix] Add smem swizzle to recover performance of TMA * [BugFix] Enough reg for producer when threads=512 --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * 📝 Add docstrings to `mxfp4` (#732) * 📝 Add docstrings to `mxfp4` Docstrings generation was requested by @LeiWang1999. * https://github.com/tile-ai/tilelang/pull/725#issuecomment-3191656561 The following files were modified: * `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py` * `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py` * `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py` * `examples/dequantize_gemm/utils.py` * `examples/gemm/example_gemm_autotune.py` * `tilelang/intrinsics/utils.py` * `tilelang/language/__init__.py` * `tilelang/language/utils.py` * `tilelang/quantize/mxfp.py` * `tilelang/quantize/quantization.py` * [Lint] More accurate docstring * [Lint] --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: tzj-fxz <tzjfxz@gmail.com> * [Refactor] Refactor env into a more flexible version (#740) * Fix environment variable name for compilation print setting in `env.py` * Remove deprecated test file for warp specialized pass configuration and refactor environment variable access in `env.py` to utilize a centralized `EnvVar` class for better management and clarity. * lint fix * Refactor cache check to use `env.is_cache_enabled()` for consistency in `tuner.py` * [Enhancement] Add stride index validation in CythonKernelWrapper (#743) * Introduced an assertion to ensure that the stride index is within the valid range of tensor dimensions in `cython_wrapper.pyx`. * This change prevents potential out-of-bounds errors when accessing tensor dimensions, enhancing the robustness of the code. * [Bugfix]:Fix atomic add auto vectorize memory access out of bound error (#742) * [Bugfix]:Fix atomic add auto vectorize memory access out of bound error * Update atomicadd_vectorize.cc * format * 📝 Add docstrings to PR #744 (#745) * 📝 Add docstrings to `main` Docstrings generation was requested by @LeiWang1999. 
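The stride-index assertion added to `cython_wrapper.pyx` in #743 (listed above) is essentially a bounds check before `tensor.stride(dim)` is read for a dynamic symbol. A rough pure-Python equivalent of that guard is sketched below; the helper and argument names are illustrative, not the actual Cython code.

```python
# Illustrative only: mirrors the intent of the cython_wrapper.pyx check, not its exact code.
# `buffer_idx` and `stride_dim` come from the compile-time mapping of dynamic
# symbols to (tensor, dimension) pairs.
def resolve_stride(tensors, buffer_idx, stride_dim):
    tensor = tensors[buffer_idx]
    # Guard against reading a stride for a dimension the tensor does not have,
    # which would otherwise surface as an opaque error much later.
    assert 0 <= stride_dim < tensor.dim(), (
        f"stride index {stride_dim} out of range for {tensor.dim()}-D tensor")
    return tensor.stride(stride_dim)
```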
* https://github.com/tile-ai/tilelang/pull/742#issuecomment-3205103559 The following files were modified: * `src/transform/atomicadd_vectorize.cc` * lint fix --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Refactor] Refactor barrier management (#744) * Introduce Barrier * Enhance CUDA kernel with new barrier management and post-processing support - Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers. - Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure. - Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency. - Introduced additional print statements for debugging in the lowering phase of the TileLang engine. - Enhanced the overall structure and readability of the codebase. * Remove unused barrier handling code in CUDA and HIP code generators to streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic. * Enhance barrier management in TileLang - Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework. - Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory. - Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code. - Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine. - Removed deprecated memory scope handling code to enhance clarity and maintainability. * lint fix * lint fix * Remove `allocate_barrier` intrinsic and related code from TileLang to streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability. * Refactor logging in JITKernel to improve kernel compilation tracking - Removed unused import of `torch.backends` in the example file. - Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging. - Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function. * Refactor dequantization tests and update barrier function - Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite. - Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management. * Update CI configuration to increase pytest parallelism from 4 to 8 threads for improved test execution speed. * Fix typos in rasterization parameters and update import path for cached module - Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage. - Updated the import statement for the `cached` module to reflect the new path in the cache submodule. - Added `StridedTensor` import in the language module for enhanced tensor functionality. * Update ci.yml * [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746) * [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy * Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed. 
* Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping. * Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable. * lint fix * Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency. * remove bulk copy * Refactor copy and atomic add operations to support TMA lower configuration - Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations. - Modified `Lower` method in `Copy` to incorporate the new TMA configuration. - Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic. - Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity. - Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach. * Enhance TMA bulk copy logic in `LowerBulkCopy` method - Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling. - Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities. * lint fix * Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions. * Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions - Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations. - Added TODO comments to indicate the need for further improvements in shared memory handling. * Update `native_sparse_attention` function to include TMA configuration options - Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations. - Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1. * Refactor JIT decorator formatting in `native_sparse_attention` function - Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase. - No functional changes were made; this update focuses on code clarity and maintainability. * Enhance thread management and logging in TileLang compilation - Added a method to check if printing is enabled during compilation, improving control over logging behavior. - Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output. - Added comments to clarify the purpose of changes and improve code readability. * Add warp specialization scope and refactor register management in TileLang - Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management. - Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process. - Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management. 
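Several commits above work around shared-memory layout limitations by disabling TMA lowering and warp specialization through `pass_configs`. A hedged sketch of that compile call is below; the kernel is a toy stand-in and the config key strings are assumptions based on the commit text, so check `tilelang.PassConfigKey` for the canonical spellings.

```python
import tilelang
import tilelang.language as T

@T.prim_func
def reshape_func(A: T.Tensor((64, 64), "float32"), B: T.Tensor((4096,), "float32")):
    with T.Kernel(1, threads=128):
        A_s = T.alloc_shared((64, 64), "float32")
        T.copy(A, A_s)
        for i, j in T.Parallel(64, 64):
            B[i * 64 + j] = A_s[i, j]

# Key names follow the commit wording ("disable TMA lower and warp specialization");
# verify against tilelang.PassConfigKey before relying on them.
kernel = tilelang.compile(
    reshape_func,
    out_idx=[1],
    pass_configs={
        "tl.disable_tma_lower": True,
        "tl.disable_warp_specialized": True,
    },
)
```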
- Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process. * Refactor test for InjectSetMaxNReg pass in TileLang - Improved readability by restructuring conditional checks and assertions in the test cases. - Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic. - Ensured consistent formatting and spacing throughout the test functions for better maintainability. * Enhance bulk copy and store checks in `Copy` class - Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options. - Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations. - Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging. * lint fix * [Refactor] Merge ThreadPartialSync and ThreadStorageSync (#741) * Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency. * Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency. * Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability. * Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality. * typo fix * Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues. * lint fix * test fix * Enhance CI configuration by adding verbose output to pip install command for better visibility during installation. * use ninja instead of make * Add CMake configuration step for Ninja build system in setup.py * Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja. * Enhance CI configuration by adding verbose output to pytest commands for improved test visibility. * Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks. * Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly. * Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes. * Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification. 
Update buffer index calculations to ensure consistent simplification of range expressions. * bugfix * Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp. * Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts. * Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality. * [Enhancement] Optimize loop body handling in IR (#749) - Updated the loop body construction in `ir.cc` to conditionally include an output statement based on the analyzable condition of the `waves` variable. - This change enhances performance by avoiding unnecessary statement wrapping when the condition is met, improving the efficiency of loop execution. Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [MXFP4] Fix bugs and optimize exponential operation (#750) * [MXFP4] Fix bugs - Optimize exp2 with shift operation to boost performance - Fix bug of simple dequantization function call - Fix bug of scaling factor with bias * [Lint] --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h (#751) - Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations. - Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations. * [Enhancement] Add shape checking for reduce options (#748) * Add shape checking for reduce options * lint fix * Handle special case reducing into shape-1 tensor Allow reducing [X, d, Y] into [X, Y] or [X, 1, Y] --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Bugfix] Add missing FP8 header include (#752) * [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h - Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations. - Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations. Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Enhancement] Include cuda_fp8.h in gemm_sm90.h - Added the inclusion of the "cuda_fp8.h" header file to support new data formats in CUDA GEMM operations, enhancing compatibility with recent updates for fp8 types. Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * lint fix * [Refactor] Remove unused tl_shuffle_elect and related functions from common.h - Deleted the `tl_shuffle_elect` function and its associated comments to streamline the codebase. - Added inclusion of "intrin.h" for improved intrinsic support in CUDA operations. - Cleaned up the file by removing unnecessary template parameters and functions, enhancing clarity and maintainability. * lint fix * [Refactor] Update header inclusions in common.h and gemm_sm90.h - Removed the inclusion of "intrin.h" from common.h to streamline dependencies. - Added "intrin.h" inclusion in gemm_sm90.h to ensure intrinsic support for CUDA operations, enhancing functionality and maintainability. 
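The MXFP4 fix above replaces an exponential with a shift when applying scale factors: for an integer exponent, the power of two is just a biased exponent written into the float's exponent field. A generic, standalone illustration of the trick (not the kernel code itself):

```python
# Generic illustration of the exp2-by-shift idea: for integer e in the normal
# float32 range, 2**e is (e + 127) placed in the exponent bits, no exp2 call needed.
import struct

def exp2_int(e: int) -> float:
    assert -126 <= e <= 127, "only normal float32 exponents are handled here"
    bits = (e + 127) << 23              # biased exponent, zero mantissa, zero sign
    return struct.unpack("<f", struct.pack("<I", bits))[0]

assert exp2_int(0) == 1.0
assert exp2_int(5) == 32.0
assert exp2_int(-3) == 0.125
```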
* bug fix * [MXFP4] Add bias to MXFP4 GEMM kernel (#753) * [MXFP4] Add bias to gemm kernel * [Lint] * [Lint] Rename "bias" to "Bias" * [Bugfix][WS] Consider loop min extent when computing phase id (#754) * Update test parameters and remove debug print statement - Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution. - Removed a debug print statement from `phase.py` to clean up the code and enhance clarity. * Refactor loop stack management in warp_specialized_rewriter - Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability. - Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability. - Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations. * [Typo] Remove `disable_cache` in some tests (#755) * Update test parameters and remove debug print statement - Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution. - Removed a debug print statement from `phase.py` to clean up the code and enhance clarity. * Refactor loop stack management in warp_specialized_rewriter - Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability. - Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability. - Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations. * Remove unused `torch.backends` import and `tilelang.disable_cache()` calls from multiple test files to enhance code clarity and maintainability. * [README] Update GDN README for clarity and add acknowledgements (#758) - Improved formatting and clarity of the GDN kernel implementation description. - Updated requirement section to list dependencies in a clearer format. - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions. * cutlass v4.2.0 supporting cuda 13 (#760) * [Feature] Add 1D TMA support (#761) * [Feature] Add 1D TMA support - Check the contiguous conditions of 1D TMA copy - Add new interface and params order of `tma_load` and `tma_store` call - Add 1D `tma_store` interface in sm90 template - Add elementwise kernel for 1D TMA example * [Lint] * [BugFix] Add conditions for 1D TMA copy on non-swizzle shared tensors * [Lint] * [BugFix] 1D TMA load * [README] Update GDN README for clarity and add acknowledgements (#758) - Improved formatting and clarity of the GDN kernel implementation description. - Updated requirement section to list dependencies in a clearer format. - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions. 
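The 1D TMA commits above add a bulk-copy path for flat, contiguous shared-memory tiles; the kind of elementwise kernel they mention might look like the sketch below. Tile size and dtype are arbitrary, and whether the copy actually takes the TMA path depends on the contiguity checks the commit describes.

```python
import tilelang
import tilelang.language as T

N, BLK = 1 << 20, 4096

@T.prim_func
def scale_1d(A: T.Tensor((N,), "bfloat16"), B: T.Tensor((N,), "bfloat16")):
    with T.Kernel(T.ceildiv(N, BLK), threads=128) as bx:
        A_s = T.alloc_shared((BLK,), "bfloat16")
        # A contiguous 1-D block copy: the case the new 1-D TMA path targets.
        T.copy(A[bx * BLK], A_s)
        for i in T.Parallel(BLK):
            B[bx * BLK + i] = A_s[i] * 2.0
```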
* cutlass v4.2.0 supporting cuda 13 (#760) * [Lint] * [Lint] * [MXFP4] Add test for bf16&mxfp4 gemm * [BugFix] * [Lint] --------- Co-authored-by: Yu Cheng <54519279+chengyupku@users.noreply.github.com> Co-authored-by: Johnny <johnnync13@gmail.com> * [Example] Add vertical slash sparse attention pattern (#762) * upd sparse attn * lint * rename * update test file * update benchmark * lint * update benchmark * [Bugfix] Address PassContext contamination from CI and fix incorrect rewrites in warp specialized pass (#767) * fix ci and pass bug * fix * try * lint * [MXFP4] Add 1D TMA copy for Scale tensor in MXFP4 GEMM (#766) * [TMA] Add 1D TMA copy for Scale tensor * [Lint] * [Test] Add test for kernel * [BugFix] * hot fix blackwell (#768) * [Refactor] Refactor `Operator` into `TileOperator` and with tvm reflection (#763) * Refactor operator classes to inherit from TileOperator and update layout inference methods - Changed base class of several operator classes (AtomicAdd, Copy, Gemm, etc.) from Operator to TileOperator for better alignment with tile operations. - Updated InferLayout and Lower methods to use 'override' specifier for clarity and consistency. - Adjusted header inclusions to replace "op.h" with "operator.h" across multiple files for improved organization. - Added missing layout inference implementations for Fill and Conv2DIm2ColOp. - Removed deprecated op.cc and op.h files to streamline the codebase. * lint fix * Refactor operator classes to use Node pattern and improve memory management - Updated several operator classes (AtomicAdd, Copy, Gemm, etc.) to utilize the Node pattern for better memory management and encapsulation. - Changed constructors to initialize member variables through a node object, enhancing clarity and reducing direct member access. - Updated Clone methods to return TileOperator instances instead of unique pointers, aligning with the new design. - Refactored InferLayout and Lower methods to ensure consistency across operator implementations. - Adjusted header files to reflect the new class structure and removed deprecated code for a cleaner codebase. * Enhance Clone methods in AtomicAdd and Copy classes to support parallel operation cloning - Updated the Clone methods in AtomicAddNode and CopyNode to ensure that the parallel operation (par_op_) is properly cloned when defined, improving the integrity of cloned objects. - Refactored the FillNode class to use ParallelOp directly instead of std::make_unique, streamlining the creation of parallel operations. - Made minor adjustments in layout inference and other related methods for consistency and clarity. * Refactor FillNode::Lower method to remove unused global function call - Eliminated the call to the global function "tl.fill.lower" in the FillNode::Lower method, streamlining the code and improving clarity. - Retained the core functionality of the method while enhancing maintainability by reducing unnecessary dependencies. * [Reducer] Introduce `alloc_reducer` to separate inter and intra warp reduction (#757) * [Enhancement] Introduce finalize_reducer operator and layout reducer support - Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations. - Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management. - Updated `setup.py` to include logging for build directory paths, improving build process visibility. 
- Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations. - Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts. - Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance. * Refactor code formatting and improve readability in multiple files - Cleaned up whitespace in `setup.py` to enhance logging clarity. - Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability. - Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability. - Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity. * Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc` to enhance code clarity and maintainability. * [Enhancement] Disable reuse of small arrays in shared memory allocation - Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management. * Refactor `setup.py` to remove duplicate logging statements and enhance clarity. Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability. * Refactor `finalize_reducer` and `reduce` functions to remove redundant target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability. * bug fix * Add thread checks workaround for replicated cases * Remove the is_one check * fix lint error * lint fix * Update autotune tests to use smaller matrix sizes for improved performance and reliability * [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods - Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure. - Enhanced layout inference and cloning methods in FinalizeReducerOpNode. - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main. - Adjusted header inclusions for improved organization and clarity across multiple files. * [Refactor] Update atomic operations in common.h and modify test_example_flash_attention.py - Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently. - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization. * [Refactor] Simplify CopyNode::LowerBulkCopy logic and update test execution - Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations. - Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus. --------- Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com> Co-authored-by: Freebase6912 <amid-gauze-racing@duck.com> * 📝 Add docstrings to `pytile_0826` (#770) * 📝 Add docstrings to `pytile_0826` Docstrings generation was requested by @LeiWang1999. 
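Among the reducer work above, the atomic helpers in `common.h` gained max/min/load/store variants and better half/bfloat16 handling; from the DSL side the usual entry point is `T.atomic_add`, sketched here with an arbitrary histogram-style kernel. Shapes, bin count, and the int32 index dtype are illustrative, not taken from the repo's tests.

```python
import tilelang
import tilelang.language as T

@T.prim_func
def histogram(ids: T.Tensor((8192,), "int32"), hist: T.Tensor((256,), "float32")):
    with T.Kernel(T.ceildiv(8192, 256), threads=256) as bx:
        for i in T.Parallel(256):
            # Concurrent blocks may hit the same bin, so the update must be atomic.
            T.atomic_add(hist[ids[bx * 256 + i]], 1.0)
```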
* https://github.com/tile-ai/tilelang/pull/763#issuecomment-3224197814 The following files were modified: * `src/op/atomic_add.cc` * `src/op/atomic_add.h` * `src/op/copy.cc` * `src/op/copy.h` * `src/op/elem.cc` * `src/op/elem.h` * `src/op/gemm.cc` * `src/op/gemm.h` * `src/op/gemm_sp.cc` * `src/op/gemm_sp.h` * `src/op/operator.cc` * `src/op/operator.h` * `src/op/parallel.cc` * `src/op/parallel.h` * `src/op/reduce.cc` * `src/op/reduce.h` * `src/op/region.cc` * `src/op/region.h` * `src/transform/layout_inference.cc` * `src/transform/lower_tile_op.cc` * lint fix --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Bugfix]:Fix atomic add auto vectorize negative optimization (#765) * [Bugfix]:Fix atomic add auto vectorize negative optimization * fixbug * format * fix bug * 📝 Add docstrings to `reducer_0825` (#772) * 📝 Add docstrings to `reducer_0825` Docstrings generation was requested by @LeiWang1999. * https://github.com/tile-ai/tilelang/pull/757#issuecomment-3219088118 The following files were modified: * `setup.py` * `src/op/builtin.h` * `src/op/finalize_reducer.cc` * `src/op/finalize_reducer.h` * `src/op/parallel.cc` * `src/op/parallel.h` * `src/op/reduce.cc` * `src/target/codegen_cuda.cc` * `src/tl_templates/cuda/common.h` * `src/transform/layout_inference.cc` * `src/transform/layout_reducer.cc` * `src/transform/layout_reducer.h` * `src/transform/merge_shared_memory_allocations.cc` * `src/transform/storage_access.cc` * `src/transform/warp_specialized_rewriter.cc` * `testing/python/autotune/test_tilelang_autotune_with_inputs.py` * `tilelang/engine/phase.py` * `tilelang/language/customize.py` * `tilelang/language/reduce.py` * `tilelang/transform/__init__.py` * lint fix * lint fix --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * Allow fill global buffer (#774) * Allow fill global buffer * fix lint error * [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match (#771) * [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match * [Lint] * add bf16 exp fallback (#776) * [Lint] Introduce clang-tidy into format.sh (#777) * [Refactor] Update Clang-Tidy Checks and Improve Code Consistency - Enhanced .clang-tidy configuration by adding specific checks for better bug detection and performance optimization. - Refactored function signatures across multiple files to use `const` references for parameters, improving performance and code clarity. - Updated various methods to ensure consistent handling of parameters, particularly in `AddPredicate`, `Substitute`, and `PlanLoopPartition` functions. - Improved readability by replacing size checks with `empty()` method calls in several locations, ensuring clearer intent in the code. - General code cleanup and adherence to best practices for better maintainability. * [Refactor] Enhance Code Consistency and Clang-Tidy Configuration - Updated .clang-tidy configuration to include additional checks for improved code quality and performance. - Refactored function signatures across multiple files to use `const` references, enhancing performance and clarity. - Replaced size checks with `empty()` method calls in various locations for clearer intent. - Improved handling of parameters in several functions, ensuring consistent usage of `std::move` where applicable. 
- General code cleanup to adhere to best practices and improve maintainability. * [Refactor] Integrate Clang-Tidy Checks and Enhance Code Consistency - Added clang-tidy checks to the format script for improved code quality assurance. - Refactored function signatures across multiple files to consistently use `const` references, enhancing performance and clarity. - Updated the requirements-lint.txt file to include clang-tidy as a dependency. - General code cleanup to adhere to best practices and improve maintainability. * [CI] Update AMD CI Workflow to Include Build Directory Creation - Added steps to create a build directory and configure CMake with ROCm support during the format check process. - Ensured cleanup of the build directory after the format check to maintain a clean workspace. * [Refactor] Remove Unused Member Variables in AtomicAddNode and CopyNode - Removed the `args_` member variable from both `AtomicAddNode` and `CopyNode` classes to streamline the code and eliminate unnecessary data members. - This change enhances code clarity and maintainability by focusing on relevant attributes for each class. * [Refactor] Update Clang-Tidy Integration and Code Improvements - Modified the format script to include the `-fix` option in the clang-tidy command for automatic code fixes. - Refactored the `AtomicAddVectorizePlanner` class to improve variable handling and consistency, including changes to member variable types and function signatures. - Enhanced code clarity by removing unnecessary `std::move` calls and ensuring consistent usage of types across the class. - General code cleanup to adhere to best practices and improve maintainability. * [Refactor] Improve Parameter Handling and Consistency in AtomicAddVectorize - Updated function signatures in `AtomicAddVectorizePlanResult` and `AtomicAddVectorizeRewriter` to use `const` references and `std::move` for better performance and clarity. - Enhanced the `UpdateVectorSize` method to accept `const Array<PrimExpr>&` for improved efficiency. - General code cleanup to maintain consistency and adhere to best practices. * [CI] Add Git Submodule Initialization to CI Workflow - Included a step to initialize and update git submodules recursively in the CI workflow. - This change ensures that all necessary submodules are available during the format check process, improving build reliability. * [CI] Add Git Submodule Update Step to Format Check - Included a command to initialize and update git submodules recursively in the CI workflow during the format check process. - This enhancement ensures that all required submodules are available, contributing to improved build reliability. * [Refactor] Update Function Signatures in AtomicAddVectorize - Modified the `VectorizeAtomicAdd` function signature to use `const` references for `thread_var` and `thread_bounds`, enhancing performance and code clarity. - This change aligns with previous refactoring efforts to improve parameter handling and consistency across the codebase. * [Cache] Introduce detailed target information for the disk kernel cache (#780) * Fix type hint for target_host parameter in compile function to allow None value * Refactor target handling in compile function to utilize determine_target for improved clarity and consistency * Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code. 
* Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity. * Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling. * Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits. * [Example]Adds example for top-k operation (#775) * [Example]Adds example for top-k operation Adds an example demonstrating the top-k operation using tilelang * format * Adds topk tilelang example test * fix lint * [Math] Dispatch `T.rsqrt(x)` into cuda intrin instead of `1 / T.sqrt(x)` (#781) * Fix type hint for target_host parameter in compile function to allow None value * Refactor target handling in compile function to utilize determine_target for improved clarity and consistency * Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code. * Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity. * Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling. * Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits. * Add intrin_rule source files to CMakeLists.txt and implement hrsqrt function for half_t in common.h * lint fix * remove cmake dep in pyproject as it may lead to different cmake paths in diff stages * lint fix * Add cmake dependency to pyproject.toml and improve build logging in setup.py * [CI] Adds pytest-durations for test timing (#782) * [Ci] Adds pytest-durations for test timing Adds `pytest-durations` to the test requirements and configures pytest to display test durations. This helps in identifying slow-running tests and optimizing the test suite for faster feedback. * add amd ci durations * Removes flash_attn installation from CI * [Refactor] Support python reflection for tile operators (#783) * Implement Fill operator and related reflection methods in TileLang - Added Fill operator implementation in `fill.cc` and `fill.h` for element-wise filling of buffers. - Introduced reflection methods for Fill, AtomicAdd, Copy, Conv2DIm2Col, FinalizeReducer, Gemm, and Parallel operators to enhance introspection capabilities. - Updated relevant files to register reflection methods and ensure proper initialization in static blocks. - Removed outdated comments and unnecessary code in various operator files to improve clarity and maintainability. - Added new Python bindings for the Fill operator in `tilelang/ir/fill.py` and updated the module imports accordingly. * Refactor operator reflection methods and improve code clarity - Updated reflection methods for AtomicAdd, Copy, FinalizeReducer, Gemm, and Parallel operators to enhance readability by using `empty()` instead of size checks. 
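The `T.rsqrt` change above (#781) only affects lowering: the op now maps to the CUDA reciprocal square-root intrinsic, with an `hrsqrt` path for `half_t`, instead of expanding to `1 / T.sqrt(x)`. User code is unchanged, so the sketch below just shows a call site; shape and epsilon are arbitrary.

```python
import tilelang
import tilelang.language as T

@T.prim_func
def inv_norm(X: T.Tensor((1024,), "float32"), Y: T.Tensor((1024,), "float32")):
    with T.Kernel(1, threads=128) as bx:
        for i in T.Parallel(1024):
            # After #781 this is expected to lower to the rsqrt intrinsic rather than 1/sqrt(x).
            Y[i] = T.rsqrt(X[i] * X[i] + 1e-6)
```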
- Consolidated static initialization blocks for various operators to a single line for improved consistency. - Cleaned up whitespace and formatting in multiple files to adhere to coding standards and improve maintainability. - Added new Python bindings for operators in the `tilelang/ir` module, ensuring proper registration and organization of imports. * Refactor GEMM and AtomicAdd operations for improved clarity - Updated the `GetArchInt` function in `atomic_add.cc` to use `std::string` and `std::stoi` for better readability and type safety. - Removed unnecessary variables and comments in `gemm_sp.cc` and `gemm.cc` to streamline the `ComputeWarpPartition` method. - Cleaned up the `layout_reducer.cc` file by removing unused variable declarations, enhancing code clarity. - Added import for the `ir` module in `tilelang/__init__.py` to ensure proper organization of module imports. * Remove deprecated operator files from the tilelang IR module - Deleted files for Fill, AtomicAdd, Copy, Gemm, GemmSP, FinalizeReducer, Parallel, Reduce, and Region operators to streamline the codebase. - This cleanup enhances maintainability by removing unused code and improving overall organization of the module. * Refactor imports in tilelang IR module for improved organization - Updated import statements in `tilelang/ir.py` to reflect changes in the TVM library structure, enhancing clarity and maintainability of the codebase. * lint fix * Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability - Updated the `Gemm` and `GemmSP` classes to utilize a new `GemmWarpPolicy` object for warp partitioning, improving encapsulation and readability. - Removed deprecated `ComputeWarpPartition` methods and replaced them with calls to the new policy object, streamlining the code. - Cleaned up comments and unnecessary code in `gemm.cc`, `gemm_sp.cc`, and related header files to enhance overall clarity. - Introduced a new `GemmWarpPolicyNode` class to manage warp policy attributes and methods, facilitating better organization of related functionalities. - Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities. * Refactor Reduce operation to utilize ReduceType class for improved clarity and maintainability - Replaced multiple conditional checks for reduce types with a single ReduceType object, simplifying the code structure. - Introduced a new ReduceTypeNode class to encapsulate reduce type logic and methods, enhancing organization. - Updated MakeInitValue, MakeReduce, and Lower methods to leverage the new ReduceType class, improving readability. - Added Python bindings for the ReduceType class in tilelang IR module to ensure proper registration and usability. * comment * Refactor operator header files for improved readability - Cleaned up formatting and whitespace in `atomic_add.h`, `copy.h`, `fill.h`, `reduce.cc`, and `reduce.h` to enhance code clarity. - Consolidated comments and adjusted line breaks for better organization and maintainability across multiple operator definitions. * Refactor MakeReduce method in ReduceOpNode for clarity - Updated the parameter name in the MakeReduce method from `rhs` to `b` and assigned it to `rhs` for improved readability. - This change enhances the clarity of the method's purpose and aligns with the overall refactoring efforts in the Reduce operation. 
* Update Reduce operation type checks for consistency - Changed string comparisons for reduce types in the MakeReduce method from "abs_sum" to "abssum" and "abs_max" to "absmax" for uniformity. - This adjustment enhances the clarity and consistency of the reduce type handling in the codebase. * [AMD] Fix amd tir&add examples (#784) * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668) - Enhanced buffer index handling to address precision issues by removing redundant operations. - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. - Updated related documentation to reflect changes in buffer management practices. * Remove obsolete test script for AMD example, streamlining the examples directory. * Remove unused dtype_size variable in AMD example script to streamline code. * Add input configuration file and update AMD example script for enhanced flexibility - Introduced a new input.txt file for configurable parameters. - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack. - Streamlined the main function for better clarity and organization. - Added a new test script to facilitate running the example with specified parameters. * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations - Deleted input.txt and test.sh files as they are no longer needed. - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance. - Reintroduced swizzle usage in the kernel for better performance. * Refactor AMD example script for FlashAttention-2 - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`. - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls. - Removed outdated comments and improved code organization for better readability. * Refactor formatting in AMD FlashAttention example script - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function. - Streamlined the `main` function parameter formatting for consistency. - Removed unnecessary blank lines to enhance overall code organization. * Update example_amd_flash_attn_fwd.py * Enhance AMD example script and update CI workflows - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization. - Added new CI workflows for AMD and documentation publishing. - Updated various requirements files to include necessary dependencies. - Introduced new test cases and examples for better coverage and functionality. - Refactored existing code for improved readability and maintainability. * Remove redundant tool cache cleanup step in AMD CI workflow * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements. * Add new AMD FlashAttention example and test script - Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang. - Added `test.sh` script to facilitate running the new example with specified parameters. - Enhanced the overall structure and organization of the example for better clarity and usability. * Update configurations in `example_amd_flash_attn_fwd.py` for autotuner - Reduced the number of threads and `num_split_q` options for improved performance. 
- Adjusted `panel_size` options to streamline configuration settings. * Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217 * Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c * Add example for AMD Flash Attention backward pass implementation - Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang. - Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps. - Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters. - Included reference implementation for validation against PyTorch's attention mechanism. This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications. * Enhance AMD Flash Attention example with additional testing capabilities - Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation. - Improved the main function to allow for better parameter configuration and benchmarking. - Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example. This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications. * Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a * Refactor HIP intrinsic rules to CUDA - Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules. - Adjusted include paths for better organization and clarity in the code structure. * Update AMD CI workflow to uninstall specific PyTorch packages before installation - Removed the installation of `flash_attn==2.5.8` to streamline the CI process. - Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts. * Remove unused shared memory allocations in AMD Flash Attention backward example - Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance. - This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead. * Remove unnecessary pip uninstall command from AMD CI workflow - Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions. - This change simplifies the CI process and reduces potential overhead during package management. * Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules - Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity. - Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues. * Refactor formatting of HIP intrinsic rule registrations - Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining. - No functional changes were made; this update focuses on code style improvements to enhance maintainability. * Update file na…
/gemini summary