feat: add nvrtc execution backend #461
Conversation
Likely all tests passed; I'll dig deeper and benchmark this backend on my device later. Thanks @lucifer1004 for your huge contribution!
Improve error handling for missing cuda-python dependency in nvrtc backend. Raise ImportError with detailed installation instructions instead of logging a warning.
Enhance nvrtc backend error handling by introducing a flag to check for cuda-python availability. Raise ImportError with detailed installation instructions during initialization if the nvrtc backend is unavailable, improving user experience and clarity.
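A minimal sketch of the pattern these two commits describe: probe the optional `cuda-python` dependency at import time, remember the result in a flag, and raise an ImportError with installation guidance only when the nvrtc backend is actually requested. The module layout and message wording here are illustrative, not the exact tilelang implementation.

```python
# Hypothetical sketch; names and message text are illustrative.
try:
    from cuda import cuda, nvrtc  # provided by the optional cuda-python package
    CUDA_PYTHON_AVAILABLE = True
except ImportError:
    CUDA_PYTHON_AVAILABLE = False


class NVRTCKernelAdapter:
    def __init__(self, *args, **kwargs):
        # Fail loudly (and helpfully) only when the nvrtc backend is selected,
        # instead of logging a warning that is easy to miss.
        if not CUDA_PYTHON_AVAILABLE:
            raise ImportError(
                "The nvrtc execution backend requires the optional 'cuda-python' "
                "package. Install it with:\n"
                "    pip install cuda-python\n"
                "or fall back to execution_backend='cython'."
            )
        # ... normal initialization continues here ...
```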
Hi @lucifer1004, I've reviewed the proposal and overall LGTM. While nvrtc is promising, it will likely face GIL-related limitations in mainstream CUDA environments (<= 12.9) that prevent thread-level parallelization during autotuning. Therefore, I think we should default to Cython for now and make nvrtc the default only once we use process-level parallelism or CUDA toolkits >= 13.0. If users prioritize faster compilation, we can recommend nvrtc as an alternative. Would you be able to share some benchmark data from your earlier tests showing the compilation speed improvements?
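For reference, a hedged sketch of how a user might opt into the nvrtc backend explicitly while Cython remains the default. The kernel follows the project's GEMM quick-start; the exact `tilelang.compile` keyword names (`out_idx`, `execution_backend`) should be checked against the tilelang version in use.

```python
import tilelang
import tilelang.language as T

def matmul(M, N, K, block_M, block_N, block_K):
    @T.prim_func
    def main(
            A: T.Buffer((M, K), "float16"),
            B: T.Buffer((K, N), "float16"),
            C: T.Buffer((M, N), "float16"),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), "float16")
            B_shared = T.alloc_shared((block_K, block_N), "float16")
            C_local = T.alloc_fragment((block_M, block_N), "float")
            T.clear(C_local)
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, ko * block_K], A_shared)
                T.copy(B[ko * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])

    return main

func = matmul(1024, 1024, 1024, 128, 128, 32)

# Cython stays the default backend; nvrtc is an explicit opt-in (requires cuda-python).
kernel = tilelang.compile(func, out_idx=[-1], execution_backend="nvrtc")
```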
/gemini review
Code Review
This pull request introduces a new NVRTC (NVIDIA Runtime Compilation) execution backend, which is a significant and valuable addition. The changes are extensive, touching upon CUDA template headers, caching mechanisms, JIT compilation, and adapter/wrapper logic. The overall approach of using `cuda-python` for NVRTC compilation and kernel launching from a generated Python host script is modern and well-implemented.
The new `nvrtc_std.h` provides essential shims for the NVRTC environment. The `PyLibraryGenerator` and `NVRTCKernelAdapter` are well-structured to handle the specifics of the NVRTC backend, including the generation and caching of both CUBIN and Python host files.
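For readers unfamiliar with that flow, below is a minimal, self-contained sketch of NVRTC compilation and a driver-API launch via cuda-python. It is illustrative only: the kernel, architecture flag, and argument packing are assumptions, not the host script this PR generates.

```python
import numpy as np
from cuda import cuda, nvrtc  # pip install cuda-python

KERNEL = r"""
extern "C" __global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
"""

def check(err):
    # nvrtc*/cu* calls return (error, result...); assert success on the error enum.
    ok = (nvrtc.nvrtcResult.NVRTC_SUCCESS
          if isinstance(err, nvrtc.nvrtcResult) else cuda.CUresult.CUDA_SUCCESS)
    assert err == ok, err

# 1. Compile CUDA C++ to PTX with NVRTC (the arch flag is an assumption; match your GPU).
err, prog = nvrtc.nvrtcCreateProgram(KERNEL.encode(), b"scale.cu", 0, [], [])
check(err)
check(nvrtc.nvrtcCompileProgram(prog, 1, [b"--gpu-architecture=compute_80"])[0])
err, ptx_size = nvrtc.nvrtcGetPTXSize(prog)
check(err)
ptx = b" " * ptx_size
check(nvrtc.nvrtcGetPTX(prog, ptx)[0])

# 2. Load the module through the CUDA driver API and look up the kernel.
check(cuda.cuInit(0)[0])
err, dev = cuda.cuDeviceGet(0); check(err)
err, ctx = cuda.cuCtxCreate(0, dev); check(err)
ptx_buf = np.char.array(ptx)
err, module = cuda.cuModuleLoadData(ptx_buf.ctypes.data); check(err)
err, kernel = cuda.cuModuleGetFunction(module, b"scale"); check(err)

# 3. Allocate device memory, pack arguments as pointers, and launch.
host = np.arange(8, dtype=np.float32)
err, d_x = cuda.cuMemAlloc(host.nbytes); check(err)
check(cuda.cuMemcpyHtoD(d_x, host.ctypes.data, host.nbytes)[0])

arg_x = np.array([int(d_x)], dtype=np.uint64)   # device pointer
arg_s = np.array([2.0], dtype=np.float32)       # float scalar
arg_n = np.array([host.size], dtype=np.int32)   # int scalar
args = np.array([a.ctypes.data for a in (arg_x, arg_s, arg_n)], dtype=np.uint64)

err, stream = cuda.cuStreamCreate(0); check(err)
check(cuda.cuLaunchKernel(kernel, 1, 1, 1, 32, 1, 1, 0, stream, args.ctypes.data, 0)[0])
check(cuda.cuStreamSynchronize(stream)[0])
check(cuda.cuMemcpyDtoH(host.ctypes.data, d_x, host.nbytes)[0])
print(host)  # input scaled by 2.0
```

The backend in this PR wraps the same compile/load/launch sequence but, as noted above, additionally caches the compiled CUBIN and the generated Python host file so repeated runs skip recompilation.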
I've identified a couple of medium-severity issues related to documentation consistency and a typo, along with some minor stylistic points noted in the summary. Addressing these will improve the clarity and maintainability of the new backend.
Summary of Findings
- **Docstring/Signature Mismatch:** The `compile` function in `tilelang/jit/__init__.py` has a mismatch between the documented default for `execution_backend` ('nvrtc') and its actual signature default ('cython'); see the sketch after this list.
- **Typo in Constant Name:** The constant `KERNAL_LAUNCH_FUNC_PY` in `tilelang/jit/adapter/wrapper.py` appears to have a typo and should likely be `KERNEL_LAUNCH_FUNC_PY`.
- **Missing Newline at EOF:** The file `src/tl_templates/cuda/nvrtc_std.h` is missing a newline character at the end of the file. This was not commented on directly due to review settings (severity: low).
- **Missing Newline at EOF:** The file `tilelang/jit/adapter/__init__.py` is missing a newline character at the end of the file. This was not commented on directly due to review settings (severity: low).
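For the first finding, one way the default and docstring could be aligned is sketched below. This is a hedged illustration: the real parameter list and the set of backend options in `tilelang/jit/__init__.py` may differ.

```python
from typing import List, Literal, Optional, Union

def compile(
    func,
    out_idx: Optional[Union[List[int], int]] = None,
    execution_backend: Literal["dlpack", "ctypes", "cython", "nvrtc"] = "cython",
):
    """Compile a TileLang PrimFunc into a runnable kernel.

    Args:
        out_idx: Indices of output buffers returned by the kernel, if any.
        execution_backend: Backend used to wrap and launch the compiled kernel.
            Defaults to "cython"; pass "nvrtc" to opt into runtime compilation
            via cuda-python.
    """
    ...
```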
Merge Readiness
The pull request introduces a significant new feature and the core implementation looks solid. However, there are a couple of medium-severity issues (a docstring/signature mismatch and a typo in a constant name) that should be addressed to ensure clarity and maintainability. Once these are resolved, the PR should be in good shape for further review and merging. As an AI, I am not authorized to approve pull requests.
Update README.md to include the recent NVRTC backend addition, highlighting reduced compilation time for CUDA templates.
commit 6c97f8bcafd3fcffc938a32ba80e2ef644707b71 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Mon Jun 23 18:15:29 2025 +0800 [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy (#592) * [Enhancement] Improve memory access condition checks in GlobalMemChecker - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations. - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis. * lintfix * [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy - Implemented checks to ensure that the shared memory range and global range are legal during the bulk copy operation. - Added assertions to validate that the extents of global and shared ranges match, improving the robustness of memory access validation in the LowerBulkCopy function. commit a6e35dff911f90b2fd5d98dada138ac495f318b6 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Sun Jun 22 18:28:32 2025 +0800 [Enhancement] Improve memory access condition checks in GlobalMemChecker (#591) * [Enhancement] Improve memory access condition checks in GlobalMemChecker - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations. - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis. * lintfix commit 37ace3c816c1c39fc0ac60738c3c2c82022646fa Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Sat Jun 21 20:02:20 2025 +0800 [Refactor] Improve tensor shape compatibility checks in AutoTuner (#590) - Simplified the shape comparison logic in the AutoTuner class to enhance readability and maintainability. - Ensured that the shape compatibility checks are more concise while preserving functionality, contributing to overall code clarity. commit cae7122be0b278b6f9766094f527bcb679599508 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Sat Jun 21 14:43:30 2025 +0800 [Bugfix] Fix input tensor compatibility checks in AutoTuner (#588) * [Refactor] Remove cache existence check in kernel saving logic - Eliminated redundant checks for existing cache paths in `AutotuneResult` and `AutoTunerCache` classes, simplifying the kernel saving process. - Ensured that the cache directory is always created before saving kernel source code, improving reliability in kernel storage. * [Enhancement] Improve input tensor compatibility checks in AutoTuner - Enhanced the input tensor caching logic in the AutoTuner class to ensure compatibility between cached tensors and newly generated tensors during configuration trials. - Added detailed logging to warn users about potential mismatches in tensor properties, including shape and dtype, when caching is enabled. - Implemented a mechanism to regenerate input tensors if compatibility issues are detected, improving the robustness of the autotuning process. * [Refactor] Update L2 persistent map initialization in CUDA wrapper - Adjusted the L2 persistent map initialization function to use a consistent size parameter for cache limits and byte counts, improving clarity and reducing potential errors in memory management. 
- Simplified the formatting of the initialization function to enhance readability and maintainability of the code. * Update tilelang/autotuner/__init__.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> commit ed93aac3a8308d74f1eecde5bc5e2f0cf82a689e Author: botbw <wang1570@e.ntu.edu.sg> Date: Sat Jun 21 01:59:59 2025 +0800 [Bugfix] fix missing node in ws role maker (#587) commit 1406a113804623d3a08cfe2301d34316361b107b Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Fri Jun 20 12:58:59 2025 +0800 [Bugfix] Fix device type validation for input tensors (#586) * Enhancement: Update `pythonic_expr` to accept `tvm.tir.PrimExpr` and improve type handling - Modified the `pythonic_expr` function to check for `tvm.tir.PrimExpr` type, ensuring proper handling of expressions. - Refactored device and dtype checks in `CythonKernelWrapper` for better clarity and error messaging, enhancing robustness in tensor validation. * Enhancement: Refine `pythonic_expr` function to support additional expression types - Updated the `pythonic_expr` function to accept `tvm.tir.PrimExpr` and handle both integer and float immediate types, improving expression representation and type handling. commit 236e8e6ac29a2bebd52487452e79acad8fcdc1c4 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Fri Jun 20 12:58:01 2025 +0800 [Enhancement] align shared memory allocations (#583) * [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper - Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation. - Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter. - Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions. * [Feature] Implement dynamic shared memory allocation alignment - Added a new transformation pass `AlignDynamicSharedMemoryAllocations` to align dynamic shared memory allocations to specified byte boundaries, enhancing memory access efficiency. - Introduced a new utility class `TileLangAlignDynamicSharedMemoryAllocations` to handle the alignment logic for both allocation and buffer operations. - Updated the `LowerAndLegalize` function to apply the alignment transformation based on the target device's capabilities, ensuring compatibility with different architectures. * [Enhancement] Update dtype and argument defaults in GEMM autotuning example - Changed data type from `float16` to `bfloat16` for improved precision in computations. - Updated the default value of the `--with_roller` argument from `True` to `False` to modify the behavior of the autotuning process. * [Enhancement] Improve thread range computation in storage access - Added a new method `ComputeThreadRange` to calculate the range of threads for better access tracking. - Updated `AccessEntry` structure to include `thread_range`. - Modified various visitor methods to utilize `IRVisitorWithAnalyzer` for improved analysis during expression and statement visits. - Ensured thread range is computed and stored during buffer load and store operations, enhancing memory access efficiency. 
* [Refactor] Update comments for clarity in dynamic shared memory allocation alignment - Translated comments in `align_dynamic_shared_memory_allocations.cc` from Chinese to English for better understanding. - Removed an unnecessary call to `IRVisitorWithAnalyzer::VisitStmt_` in `storage_access.cc`. - Added a blank line for improved readability in `thread_storage_sync.cc`. * [Refactor] Enhance storage access analysis and thread range computation - Introduced `ExtractRealCondition` to improve condition handling in `IfThenElseNode` visits. - Updated `ComputeThreadRange` to use `Var` instead of `IterVar` for thread range mapping, enhancing clarity and consistency. - Wrapped statement visits in `With<arith::ConstraintContext>` to ensure proper analysis context during condition evaluations. * [Enhancement] Update default matrix dimensions in GEMM autotune example - Changed default values for matrix dimensions M, N, and K from 16384 to 4096 in `example_gemm_autotune.py` to facilitate quicker testing and benchmarking. * typo fix * enhancement * [Fix] Add conflict detection for buffer index size mismatch in thread storage sync - Implemented a check to return true if the sizes of previous and current buffer indices do not match, indicating a conflict. commit 50a4de4333bf93b2141099d45688afe6c94c1ccc Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Thu Jun 19 18:40:21 2025 +0800 [Bugfix] FIx autotuning params (#585) * [Enhancement] Update AutoTuner and Profiler for improved kernel handling and output validation - Modified AutoTuner to store cache in a dedicated "autotuner" directory. - Enhanced kernel source code saving logic in AutotuneResult and AutoTunerCache to check for None before writing. - Updated Profiler to handle None outputs gracefully during tensor comparisons, improving robustness in output validation. * lint fix * [Enhancement] Improve error handling and documentation in AutoTuner - Added traceback logging for exceptions during configuration testing to enhance debugging. - Expanded the AutoTuner class docstring to include detailed descriptions of new parameters for input tensor generation and validation, improving clarity for users. commit 246bba6a5b71d62362cd275fe07fa532c091f519 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Thu Jun 19 02:08:54 2025 +0800 [Enhancement] Update warp specialization checking (#580) * Fix L2 cache size calculation to handle symbolic expressions and ensure float conversion of hit ratios in annotation * [Enhancement] Update warp specialization check in phase.py * lint fix * [Enhancement] Add ContainsSeqStmt method to improve statement handling in merge_shared_memory_allocations.cc * [Refactor] Simplify memory copy operations in GEMM kernel tests - Updated memory copy operations in `test_tilelang_kernel_gemm.py` to use shared memory allocations for both A and B matrices, improving clarity and performance. - Adjusted the main execution block to include a new `run_gemm_rs` function call for testing, enhancing the test structure. * revert memory reuse pass. * revert the memory resue and thread sync pass/ * Update test_tilelang_kernel_gemm.py * Update test_tilelang_kernel_mha_bwd.py commit d6a977a68618da66891e5bee2ba3ba27fbaffb00 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Wed Jun 18 00:08:17 2025 +0800 [Enhancement] Update dtype handling in KernelParam and CythonKernelWrapper (#582) - Modified `KernelParam.from_var` to map Torch data types to a more appropriate format. 
- Enhanced `CythonKernelWrapper` to support additional tensor types and ensure proper conversion of tensor dtypes to C types, improving error handling for unsupported types. commit e7e5d2cee89bbb318c490f0ea03c95b92dadd669 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Tue Jun 17 17:18:50 2025 +0800 [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper (#581) - Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation. - Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter. - Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions. commit f596349270d923f8b41f65b6c5b91fce9d06d54a Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Mon Jun 16 23:51:56 2025 +0800 [Enhancement] Introduce wrapper util `pythonic_expr` to transform a PrimExpr into python string (#577) * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes. - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts. - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate. - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns. * lint fix * lint fix * rebase * rebase * typo * requirement fix * revert flash atten requirenemts commit 9e679dcf60e9f4bb98904e8f64a0007b2b39ff4d Author: 徐畅 <147292722+xuchangtolearn@users.noreply.github.com> Date: Mon Jun 16 21:25:47 2025 +0800 [BugFix] Fix precision issue in GQA decode when block_N exceeds seqlen/num_split (#575) * [CI] Add flash_decoding example to CI * Add output of ref latency * format example_gqa_decode.py * [BugFix] Fix precision issue in GQA decode when block_N exceeds seqlen/num_split * format example_gqa_decode.py commit 8e6c1d8ef88c2829fccd801c2ce0537955a178be Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Mon Jun 16 21:25:13 2025 +0800 Fix L2 cache size calculation to handle symbolic expressions and ensure float conversion of hit ratios in annotation (#576) commit 5d21b080fc2cbbadd2c9f6ef748fff963ef929b0 Author: Tong WU <109033598+Rachmanino@users.noreply.github.com> Date: Mon Jun 16 17:45:32 2025 +0800 [BugFix] Fix import error in nsa examples when `fla.__version__ >=0.2.1` (#579) * Update FLA import path for `prepare_token_indices` * Update FLA import path for `prepare_token_indices` * Compare versions via packaging.version.parse commit a5c67554b289ad065a5b5fd3edd3684837c0250e Author: Yu Cheng <54519279+chengyupku@users.noreply.github.com> Date: Mon Jun 16 17:45:03 2025 +0800 [CI] Modify test requirements and CI workflow (#578) * [CI] Modify test requirements and CI workflow - Replaced `flash-attn` with `packaging` and `wheel` in `requirements-test.txt` to ensure proper package management. - Updated the CI workflow to install `flash-attn` without build isolation, improving the installation process. * [CI] remove redundant packages * [CI] Update test requirements and CI workflow - Added `flash-attn` to `requirements-test.txt` to ensure it is included in the testing environment. 
- Modified the CI workflow to install packages from `requirements-test.txt` with `PIP_NO_BUILD_ISOLATION=1`, improving the installation process. commit d6f8ebb7943c1657f2fac478ad392c0057398db1 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Mon Jun 16 13:01:10 2025 +0800 [Refactor] Phaseout tf32 Casting from GEMM Templates (#573) * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes. - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts. - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate. - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns. * lint fix * [Refactor] Update GEMM Layout Functions and Inference Logic - Removed the `enable_padding` parameter from `makeGemmABLayout` to simplify its signature. - Introduced `makeGemmABLayoutHopper` for enhanced layout handling specific to Hopper architecture. - Updated layout inference in GEMM operations to utilize the new `makeGemmABLayoutHopper` function, improving clarity and maintainability in layout selection. - Adjusted related layout functions to ensure consistent behavior across different architectures. * [Refactor] Remove tf32 Casting Logic from GEMM Templates - Eliminated the `cast_float_to_tf32` function from `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` templates to streamline the code. - Removed conditional casting logic for float32 to tfloat32 conversion, enhancing clarity and maintainability. - Updated relevant sections in GEMM operations to reflect the removal of casting, ensuring consistent behavior across templates. - Adjusted tensor view handling to improve performance and accuracy in matrix operations. * Update bulk_copy.cc * Fix profiler initialization in GEMM test by removing TensorSupplyType argument for improved flexibility. commit bc7db9f42a48078ed7f1de6e9e388b33b21386f4 Author: aaababaaz <lhristo475@gmail.com> Date: Fri Jun 13 12:28:38 2025 +0000 [Doc] `README.md` deepseek_nsa Link correction (#571) commit b55d899969ebbcd76f1e955e93f2622640bc1efb Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Fri Jun 13 12:39:14 2025 +0800 [Enhancement] Include Metadata (LayoutMap etc.) into hashing (#570) - Modified the serialization of function scripts in both KernelCache and AutoTunerCache to include metadata by setting `show_meta=True` in `cloudpickle.dumps()`. This change enhances the hash key generation for kernel configurations, improving cache accuracy and consistency. commit 2d705e720c8a5d53764565e7af2cda6d115ea1f8 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Fri Jun 13 11:24:32 2025 +0800 Fix assertion in GQA backward example to ensure correct tensor comparison for gradient validation (#568) commit 30c8c50c014f3193743ffa3a0eb3e5f76d3a0106 Author: Leon Lu <gfvvz@163.com> Date: Fri Jun 13 11:24:04 2025 +0800 [Doc] Use the right date format (#569) commit 41c4a70be2d2bc5135f8417f0debbc4437106dbc Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Wed Jun 11 23:51:16 2025 +0800 [Feature] Implement Swizzle 32B (#566) * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes. 
- Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts. - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate. - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns. * lint fix * [Refactor] Update GEMM Layout Functions and Inference Logic - Removed the `enable_padding` parameter from `makeGemmABLayout` to simplify its signature. - Introduced `makeGemmABLayoutHopper` for enhanced layout handling specific to Hopper architecture. - Updated layout inference in GEMM operations to utilize the new `makeGemmABLayoutHopper` function, improving clarity and maintainability in layout selection. - Adjusted related layout functions to ensure consistent behavior across different architectures. * Update bulk_copy.cc * Update __init__.py commit a8e580ec8e3470c959e09454156be734b3372d3d Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Wed Jun 11 14:51:05 2025 +0800 [Bugfix] Add `__tune_params` into key hash for autotuning (#565) * [Enhancement] Update AutoTuner and Profiler for improved kernel handling and output validation - Modified AutoTuner to store cache in a dedicated "autotuner" directory. - Enhanced kernel source code saving logic in AutotuneResult and AutoTunerCache to check for None before writing. - Updated Profiler to handle None outputs gracefully during tensor comparisons, improving robustness in output validation. * lint fix commit e143fd3a35a4481979eacd93fc75fdbbc0677f38 Author: Yu Cheng <54519279+chengyupku@users.noreply.github.com> Date: Wed Jun 11 13:58:59 2025 +0800 [Refactor] Improve dtype handling in KernelParam class (#564) - Updated the dtype handling logic in the KernelParam class to enhance clarity and maintainability. The dtype string is now modified only if it starts with "torch.", simplifying the return statement for boolean type checks. commit 913780b08abbc568f151402cc39baccd93a979b2 Author: Yu Cheng <54519279+chengyupku@users.noreply.github.com> Date: Wed Jun 11 13:04:53 2025 +0800 [Feature] Introduce Persistent Loop and Update GEMM Example (#563) * [Feature] Added Support for Synchronizing Grids and Persistent Threadblock Transformation - Defined the sync_grid operation in builtin.cc and builtin.h, allowing synchronization of all threads within a grid. - Implemented support for sync_grid in codegen_cuda.cc, ensuring proper handling of this operation in the generated CUDA code. - Added the PersistThreadblock transformation, enabling the conversion of thread blocks to persistent thread blocks, enhancing support for persistent kernels. - Updated relevant documentation and comments to reflect the addition of new features and usage instructions. * [Example] Add MLA Decode With Persistent Threadblock Example * [Feature] Introduce Persistent Loop and Update GEMM Example - Added a new persistent loop construct in the TIR framework, enabling more efficient kernel execution. - Updated the GEMM example to utilize the new persistent primitive, enhancing performance for matrix multiplication. - Introduced a `loop_break` intrinsic for better control flow within persistent loops. - Updated relevant files to support the new features, including changes in code generation and language interface. 
* lint fix commit ed43770726bbfb8446c44566216726fc05b5b301 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Mon Jun 9 13:02:43 2025 +0800 [Enhancement] Optimize debug info for auto tuning (#560) * [Enhancement] Update AutoTuner and JIT compilation arguments * Added functionality to return compile arguments in the JIT implementation, enhancing the autotuner's caching capabilities. * Modified `CompileArgs` and `AutotuneResult` classes to support optional `out_idx` parameter, improving flexibility in compile argument handling. * Refactored the `_AutoTunerImplementation` to utilize the new compile arguments, ensuring better integration and performance during tuning processes. * Update tilelang/autotuner/param.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * remove redundant comments * Refactor kernel source retrieval and logging levels * Updated `AutotuneResult` to use `kernel.get_kernel_source()` instead of `kernel.adapter.get_kernel_source()`. * Changed logging level in `KernelCache` from `ERROR` to `DEBUG` for improved verbosity during kernel caching operations. * Removed unnecessary verbose logging in JIT compilation process to streamline output. * Merge branch 'main' of https://github.com/tile-ai/tilelang into bugfix_autotune_0604 * lint fix --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> commit 50fc410afc86de39525a5ed315190f7fe6cc4e1b Author: Yu Cheng <54519279+chengyupku@users.noreply.github.com> Date: Sat Jun 7 23:24:02 2025 +0800 [Feature] Support persistent kernels and add persistent GEMM examples (#559) * [Enhancement] Fix multi-version buffer index in nested-loop * [Feature] Support persistent kernels and add persistent GEMM example * lint fix * lint fix * [CI] Remove test_tilelang_transform_annotate_device_regions.py commit e00706136cd33d959ffc47f226075988300b0baa Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Sat Jun 7 13:34:31 2025 +0800 [Bugfix] Add tf32 casting to GEMM templates (#556) * Add tf32 casting functionality to GEMM templates - Introduced a `cast_float_to_tf32` function to convert float32 values to tfloat32 format across gemm_sm80, gemm_sm89, and gemm_sm90 templates. - Implemented conditional casting in relevant sections of the GEMM operations to ensure compatibility with tfloat32 types. - Enhanced the handling of tensor views to support the new casting logic, improving performance and accuracy in matrix operations. * lint fix * Refactor tfloat32 casting logic in GEMM templates - Replaced the `is_tfloat32` boolean with `need_tfloat32_cast` to improve clarity and accuracy in determining when to cast float32 to tfloat32. - Updated relevant sections in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to utilize the new casting logic, enhancing compatibility with tfloat32 types. - Ensured consistent application of casting across tensor views, improving performance and correctness in matrix operations. * Refactor GEMM template functions for improved readability - Simplified the function signature of `body_rs` in both `gemm_sm80` and `gemm_sm90` templates for better clarity. - Adjusted the casting logic in `gemm_sm90` to ensure consistent application of `cast_float_to_tf32` across tensor views, enhancing performance and maintainability. 
* Enhance tf32 casting logic in GEMM templates - Updated the `cast_float_to_tf32` function in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to conditionally apply the casting only if the input is finite, improving robustness. - Simplified the `need_tfloat32_cast` logic to clarify the conditions under which tfloat32 casting is required, enhancing code readability and maintainability. * Refactor GEMM template functions and layout inference logic - Removed the `cast_float_to_tf32` function from `gemm_sm90` and updated the `body_sr` function to streamline the casting process for tensor views, enhancing code clarity and maintainability. - Improved layout inference in `layout_inference.cc` by adding checks for the layout map's definition, ensuring robustness in handling layout annotations. - Simplified the handling of layout maps in the `annotate_layout` function, allowing for more flexible layout definitions and error handling. commit bcad41f773b6011c89ab648e8debfef8fd8350a0 Author: xs-keju <93414213+xs-keju@users.noreply.github.com> Date: Fri Jun 6 20:27:26 2025 +0800 [CI] Add CI test for flash_attention examples (#558) * [CI] Add CI test for flash_attention examples * Update example_gqa_fwd_bshd.py * Update example_mha_fwd_bshd_wgmma_pipelined.py * [CI] Added conditional annotations for tests in flash_attention * [CI] Added conditional annotations for tests in flash_attention --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> commit 87c8adbcf385a2415d8371c94bc0c8d620805856 Author: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com> Date: Thu Jun 5 23:21:39 2025 +0800 [Enhancement] Add nvrtc execution backend (#461) * [wip] feat: add nvrtc backend * [wip] fix: handle out_idx * [wip] refactor: move lib logic to libgen * feat: cache for nvrtc backend * fmt: run format * fix: handle cuda bindings import error * fix: handle cuda bindings import error * fix: handle cuda bindings import error * fix: handle cuda bindings import error * fix: get kernel source * refactor: speedup pyimport * Improve error handling for missing cuda-python dependency in nvrtc backend. Raise ImportError with detailed installation instructions instead of logging a warning. * Enhance nvrtc backend error handling by introducing a flag to check for cuda-python availability. Raise ImportError with detailed installation instructions during initialization if the nvrtc backend is unavailable, improving user experience and clarity. * Update README.md to include recent NVRTC Backend addition, highlighting reduced compilation time for CUDA templates. 
* fix tl_templates * ensure CUDA context --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> commit 4e7535a48262203f89f612113f281b4e8e1b4865 Author: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com> Date: Thu Jun 5 22:39:48 2025 +0800 [CI] Add FusedMoE example (#555) * [CI] Add FusedMoE example * Lint * Fix import bug * Fix comment bug * Update example_fusedmoe_torch.py --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> commit 8df4af8415373f9737e9cc4460625bc07bae1549 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Thu Jun 5 16:35:21 2025 +0800 [Release] Bump Version to 0.1.5 (#551) * Update VERSION to 0.1.5 * Add DEBUG_MODE support in setup.py and update CMake build type; enhance pypi.Dockerfile with git installation commit a32009bf1e314b514c07389123648ba19009f3a5 Author: alex_xiao <113411296+Alex4210987@users.noreply.github.com> Date: Wed Jun 4 22:24:47 2025 +0800 [CI]Add norm and layout_plot (#534) * [CI]Add norm and layout_plot * fix lint * Remove obsolete test files for RMS normalization and plot layout, streamlining the testing suite. * Add make_mma_load_base_layout function to create MMA result layouts - Introduced a new function `make_mma_load_base_layout` for generating layout functions for storing MMA results in fragment buffers. - Added detailed docstring explaining parameters, return values, and potential exceptions. - Implemented logic for handling different data types and matrix configurations, including assertions for input validation. - Defined internal functions for mapping fragment indices to threads and local indices, enhancing the layout functionality. * Enhance MMA load test with additional imports and functionality - Added imports for `tilelang.language`, `Literal`, `Callable`, `DataType`, `IndexMap`, and `get_mma_micro_size` to support extended functionality. - Improved the `make_mma_load_base_layout` function by ensuring it can handle various data types and configurations. - Updated the test function `test_mma_load_base_layout` to validate the layout for float16 matrix A. * Fix formatting in test_fragment_mma_load_a.py by adding a blank line for improved readability. * Add RMS normalization functions to test_rms_norm.py - Introduced `rms_norm` and `rms_norm_splitk` functions for RMS normalization, enhancing the testing capabilities. - Implemented kernel functions with shared memory allocation and parallel processing for improved performance. - Updated the test function to validate the new RMS normalization implementations. * Add reference program for RMS normalization in test_rms_norm.py - Introduced `ref_program` function to provide a reference implementation for RMS normalization. - This addition enhances the testing framework by allowing comparisons against a known reference output. * Enhance RMS normalization tests with additional imports and formatting - Added import for `tilelang.language` to support extended functionality in `test_rms_norm.py`. - Improved code readability by adding blank lines for better separation of code sections. * Update RMS normalization test parameters and enhance layout plotting - Increased matrix dimensions in `test_rms_norm` to 8192 for improved performance testing. - Removed obsolete test functions in `test_fragment_mma_load_a.py` to streamline the test suite. - Enhanced layout plotting functionality by ensuring proper visualization of base, warp, and block layouts in `test_fragment_mma_load_a.py`. 
* Refactor RMS normalization test parameters and improve layout plotting readability - Simplified the parameters in `test_rms_norm` by removing `blk_k` for clarity. - Enhanced code readability in `test_fragment_mma_load_a.py` by adjusting the formatting of the `block_layout` definition and removing the unused `warp_cols` variable. * Enhance RMS normalization with split-k implementation and additional profiling - Added a new function `test_rms_norm_splitk` to test the split-k variant of RMS normalization. - Updated the main RMS normalization script to include profiling for the split-k implementation. - Ensured all checks pass with appropriate latency measurements for both reference and tile-lang implementations. * Remove obsolete test file `test_fragment_mma_load_a.py` to streamline the test suite. * Refactor `rms_norm.py` to streamline benchmarking output and remove redundant code. Comment out the `plot_layout` call in `fragment_mma_load_a.py` for clarity. * Refactor `test_rms_norm.py` by removing redundant test function `test_rms_norm_splitk` to streamline the test suite and improve clarity. --------- Co-authored-by: Your Name <you@example.com> commit 7eef7f23a404eeceb675a9fb325cebf16918a625 Author: Tong WU <109033598+Rachmanino@users.noreply.github.com> Date: Wed Jun 4 22:21:00 2025 +0800 [CI] Add linear attention examples to CI (#552) * Add linear attention examples. * Add license * Remove comments * Run yapf and ruff commit 5faaaca915909e5965e1375dcfa7d01868c92943 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Wed Jun 4 20:52:34 2025 +0800 [Autotune] Remove the out_idx argument from the autotune cache (#553) * [Enhancement] Update AutoTuner and JIT compilation arguments * Added functionality to return compile arguments in the JIT implementation, enhancing the autotuner's caching capabilities. * Modified `CompileArgs` and `AutotuneResult` classes to support optional `out_idx` parameter, improving flexibility in compile argument handling. * Refactored the `_AutoTunerImplementation` to utilize the new compile arguments, ensuring better integration and performance during tuning processes. * Update tilelang/autotuner/param.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * remove redundant comments * Update tilelang/jit/__init__.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> commit e304fc6c91df9d04b82fd75d16ce48d6a967ddf5 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Wed Jun 4 17:18:45 2025 +0800 [Bugfix] Enhance layout inference pass for flexibility (#550) * Enhance Layout * strict update * lint fix * Refactor layout inference by removing unnecessary logging statements in `parallel.cc` and `layout_inference.cc`. This cleanup enhances code readability and reduces log clutter during layout inference steps. * lint fix * Refactor file copying logic in setup.py to simplify directory creation and file copying process. Removed unnecessary existence check before copying source files to the target directory. 
commit bc5995a82ec750628ce1afe7ef932f658b9fab34 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Wed Jun 4 14:22:24 2025 +0800 [AMD][Enhancement] Add support for Vectorized FP8 DataPacking (#542) * [Enhancement] Add support for new FP8 types in HIP code generation * Updated `PrintConst` function in `codegen_hip.cc` to handle `float8_e4m3fnuz` type. * Introduced new functions in `hip_fp8.h` for creating FP8 types, including `make_fp8_e4_4_t` and `make_fp8_e4_8_t`, enhancing type handling for FP8 data structures. * Improved overall compatibility and performance for FP8 data types in HIP. * workaround for competition * enhance autotune * autotune cache fix * Implement validation for unused keys in AutoTuner configuration * Added a check in the AutoTuner class to raise a ValueError if there are unused keys in the configuration, enhancing error handling and ensuring configuration integrity. * lint fix * revert changes of threads * Update pipelining in `example_mla_decode.py` to improve performance * Changed the number of stages in the pipelined loop from 0 to 2, enhancing the efficiency of the attention mechanism in the decoding process. * Enhance Cython kernel validation by adding tensor attribute checks * Updated the `CythonKernelWrapper` to include dedicated methods for validating tensor device, dtype, and static shape. * Modified the `forward` method to utilize these new validation methods, improving error handling and ensuring input integrity. * Updated the `lambda_forward` function in `CythonKernelAdapter` to reflect changes in validation parameters. commit 339dccc1ddff9d865c98f1f94293bd60ea6f084e Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Wed Jun 4 14:21:18 2025 +0800 [Refactor] Include several examples into ci (#531) * Remove unused 2D continuous cumulative sum example and related functions from the cumsum module. * lint fix * fix split k example * Enable cache disabling in gemm_streamk example and add validation checks in if_stmt_binding transformation * Update gemm_streamk example to use tilelang's cdiv function for block calculations and add copyright notice commit fe89c4e717df7cb312e765127a20fe7c15c7fcbc Author: botbw <wang1570@e.ntu.edu.sg> Date: Wed Jun 4 02:57:45 2025 +0800 [Refactor] set default build type to release if not provided (#548) commit e59e919b9e0ca832235b8c93dfb69410d03bc667 Author: Tong WU <109033598+Rachmanino@users.noreply.github.com> Date: Wed Jun 4 02:57:10 2025 +0800 [CI] Add hadamard example to CI (#549) * [CI] Add hadamard example to CI * Run yapf and ruff * Run yapf and ruff commit 4747edbf0174e36748b4ed0e0f0cfb748f32b203 Author: Wenhao Xie <wh.xie@outlook.com> Date: Mon Jun 2 14:24:06 2025 +0800 [Doc] Include DeepWiki badge in README (#541) * [Doc] Include DeepWiki badge in README * include more badges commit aee0bcf481fe9270197f0f24abddf83f7406e8ec Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Mon Jun 2 01:03:36 2025 +0800 [AMD] Support float8 matrix core (#537) * [Enhancement] Add support for FP8 types in CUDA and HIP code generation * Updated `GetFP8Type` function in `codegen_cuda.cc` and `codegen_hip.cc` to handle new FP8 types, including `kFloat8_e4m3fnuz`. * Introduced a new header file `hip_fp8.h` for FP8 type definitions in HIP. * Modified type mappings in `dlpack.py` and `mfma_macro_generator.py` to accommodate new FP8 types. * Enhanced type handling in `TLHIPSourceWrapper` and `tensor.py` for better integration with FP8 types. 
* Added necessary includes and logic to support FP8 in the code generation process, improving performance and compatibility with FP8 data types. * lint fix * Update src/target/codegen_hip.cc Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update tilelang/intrinsics/mfma_macro_generator.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * workaround * fix * Update submodule TVM to latest commit 587028ffebfff0ded520f8f90d62f0f6b165906c * bug fix * Refactor tilelang matrix multiplication to support transposition and packing options. Adjusted shared memory shapes and loading logic for A and B matrices. Updated test cases to validate new functionality. * Refactor assertion function for tilelang matrix multiplication to improve readability by formatting parameters and aligning code. Cleaned up whitespace in intrinsic layout functions for consistency. * Update bfloat16 type definitions in common.h and gemm.h for consistency. Changed __hip_bfloat16 to hip_bfloat16 and updated MfmaTraits specialization accordingly. * lint fix --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> commit f40164ac7e0892af1c5cc17a533038b9528f5f98 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Sat May 31 16:59:40 2025 +0800 [Bugfix] Fix a bug when simplifying warp combination for T.gemm (#540) commit db86ec4b849e66cda122056706ceeb3c27bf8341 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Fri May 30 00:57:34 2025 +0800 [Language] Support `T.annotate_l2_hit_ratio` via `cudaStreamSetAttribute` (#539) * Refactor OptimizeForTarget function by removing redundant buffer allocation step and cleaning up code * Removed the PlanAndUpdateBufferAllocationLocation step from the OptimizeForTarget function to streamline the optimization process. * Cleaned up unnecessary whitespace in the function for improved readability. * Enhanced the overall clarity and maintainability of the code. * Refactor AllocateNode handling in vectorize_loop.cc * Simplified the VisitStmt_ method for AllocateNode by removing the complex extent mutation logic. * Streamlined the allocation process to directly call the base class method, enhancing code clarity and maintainability. * Improved overall readability by eliminating unnecessary comments and code related to extent handling. * Remove `tl_kernel.c` file, eliminating the backward kernel implementation and associated error handling functions. This cleanup enhances code maintainability by removing unused components related to the backward kernel processing. * Add buffer allocation planning step in OptimizeForTarget function * Introduced the PlanAndUpdateBufferAllocationLocation step to the OptimizeForTarget function, enhancing the optimization process. * This addition improves the overall efficiency of buffer allocation during the target optimization phase, ensuring better resource management. * Update submodule TVM to latest commit db50d4e, ensuring alignment with upstream changes. * Add L2 persistent annotation support and related functionality * Introduced a new file `lower_l2_persistent_annotation.cc` to handle the lowering of L2 persistent annotations. * Added functions to annotate L2 hit ratios for buffers, ensuring compatibility with global buffer requirements. * Updated the `LowerAndLegalize` function to include the new L2 persistent map lowering step. 
* Enhanced CUDA driver with a function to retrieve the maximum size of the persisting L2 cache. * Modified the `TLCUDASourceWrapper` class to integrate L2 persistent map handling during kernel launches. These changes improve the framework's ability to manage L2 cache optimizations, enhancing performance for CUDA applications. * lint fix commit 77c9ab38a65c37cbf5408a51193399cd7644d14c Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Thu May 29 02:26:01 2025 +0800 [Refactor] Disable legacy vectorization for buffer allocation (#535) * Refactor OptimizeForTarget function by removing redundant buffer allocation step and cleaning up code * Removed the PlanAndUpdateBufferAllocationLocation step from the OptimizeForTarget function to streamline the optimization process. * Cleaned up unnecessary whitespace in the function for improved readability. * Enhanced the overall clarity and maintainability of the code. * Refactor AllocateNode handling in vectorize_loop.cc * Simplified the VisitStmt_ method for AllocateNode by removing the complex extent mutation logic. * Streamlined the allocation process to directly call the base class method, enhancing code clarity and maintainability. * Improved overall readability by eliminating unnecessary comments and code related to extent handling. * Remove `tl_kernel.c` file, eliminating the backward kernel implementation and associated error handling functions. This cleanup enhances code maintainability by removing unused components related to the backward kernel processing. * Add buffer allocation planning step in OptimizeForTarget function * Introduced the PlanAndUpdateBufferAllocationLocation step to the OptimizeForTarget function, enhancing the optimization process. * This addition improves the overall efficiency of buffer allocation during the target optimization phase, ensuring better resource management. commit cf7a28c1128768a1969afa75fc9de0d90f3e5b1b Author: yyttt6 <134183314+yyttt6@users.noreply.github.com> Date: Thu May 29 02:12:44 2025 +0800 [Refactor] add autotune example to convolution examples (#536) commit 36a21ef21d453ec9ab6743b8c0f6ec00be1a9ece Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Wed May 28 17:41:02 2025 +0800 [Autotune] Introduce cache mechanism for auto tuner (#527) * [Enhancement] Add commit ID to versioning and improve logging initialization * Updated `get_tilelang_version` to include an optional commit ID in the version string. * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process. * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash. * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity. * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging. * [Refactor] Enhance AutoTuner and JITKernel for improved performance and caching * Refactored the AutoTuner class to include new methods for setting compilation and profiling arguments, enhancing configurability. * Introduced caching mechanisms for tuning results, allowing for faster retrieval of previously computed configurations. * Updated JITKernel to store tuning results, including latency and configuration details, improving the kernel's performance tracking. * Added new methods for generating cache keys and saving/loading results to/from disk, streamlining the tuning process. 
* Enhanced the overall structure and readability of the autotuning logic, ensuring better maintainability and clarity. * Minor adjustments in related modules to support the new caching and profiling features. * [Refactor] Clean up code formatting and improve readability in AutoTuner and related modules * Consolidated import statements and removed unnecessary line breaks for better readability. * Standardized function argument formatting across the AutoTuner and CompileArgs classes. * Enhanced consistency in the use of whitespace and indentation throughout the codebase. * Minor adjustments in the Profiler and JITKernel classes to improve clarity and maintainability. * Ensured that all changes adhere to the project's coding style guidelines. * [Refactor] Remove redundant type hints in AutoTuner modules * Simplified import statements in `__init__.py` and `param.py` by removing unnecessary duplicate type hints for `Any`. * Improved code readability and maintainability by streamlining type imports across the AutoTuner module. * [Refactor] Update AutoTuner configuration for improved profiling and target detection * Enhanced the AutoTuner configuration across multiple examples by adding `set_profile_args` to better manage profiling settings. * Standardized the use of `target="auto"` in compile arguments to ensure automatic target detection. * Removed redundant target specifications in certain instances to streamline the configuration process. * Improved overall clarity and maintainability of the autotuning logic in various example scripts. * [Refactor] Simplify code formatting and improve readability in example scripts * Consolidated function argument formatting in `benchmark_mla_decode_amd_tilelang.py`, `example_elementwise_add.py`, and `performance.py` for better clarity. * Removed unnecessary line breaks and standardized argument placement across multiple files. * Enhanced overall code readability and maintainability in autotuning examples and performance scripts. * [Refactor] Update JIT decorator usage across multiple files * Removed redundant parameters from the JIT decorator in various benchmark and example scripts, simplifying the code. * Standardized the import of the JIT decorator from `tilelang`, enhancing consistency across the codebase. * Improved overall readability and maintainability by consolidating import statements and cleaning up function definitions. * [Refactor] Standardize JIT decorator formatting across benchmark and example scripts * Simplified the formatting of the JIT decorator in multiple files by removing unnecessary line breaks. * Enhanced code readability and consistency in the usage of the JIT decorator across benchmark and example scripts. * Improved overall maintainability by ensuring uniformity in function definitions and decorator usage. commit f81818d3fc1083149ecd65c64222384bf093e014 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Wed May 28 17:35:44 2025 +0800 [Refactor] Refactor convolution example to streamline configuration and remove unused code (#530) * Refactor convolution example to streamline configuration and remove unused code * Updated the `check_hopper` function to properly check for CUDA availability and compute capability. * Removed the `get_configs` and `get_best_config` functions, simplifying the example by eliminating unused autotuning logic. * Adjusted argument parsing in the `main` function to directly compile the convolution kernel without autotuning options. 
* Cleaned up the code for better readability and maintainability. * Update examples/convolution/example_convolution.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> commit b2f9b6b4ec16f106b02bf0d27168e43d3a4aab6d Author: Leslin <148342234+LeslinD@users.noreply.github.com> Date: Wed May 28 00:18:53 2025 +0800 [CI] Add gemm and gemm_fp8 example to CI (#516) * [CI] Add gemm and gemm_fp8 example to CI * Fix lint via format.sh * Resolved the issues with profiler API usage and parse_args commit 6c07f9e7fd2c2729961e21e5de74dc31d16c7801 Author: Yu Cheng <54519279+chengyupku@users.noreply.github.com> Date: Tue May 27 14:16:57 2025 +0800 [Enhancement] Add warp specialization attribute handling in IR and rewriter (#518) * Introduced an `AttrFrame` for warp specialization in the IR, enhancing the handling of warp-specific optimizations. * Refactored the `VisitStmt_` method in `warp_specialized_rewriter.cc` to check for the new warp specialization attribute, improving the detection of warp specialization conditions. * Removed outdated code related to condition checks in `IfThenElseNode`, streamlining the specialization logic. commit ecbc0be06e5b198bed624770d81be665df8affe0 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Tue May 27 00:54:17 2025 +0800 [Enhancement] Add commit ID to versioning and improve logging initialization (#524) * Updated `get_tilelang_version` to include an optional commit ID in the version string. * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process. * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash. * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity. * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging. commit 7aac423fe9cb87916eb4f69dc065e2d624f49bfb Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Mon May 26 20:32:12 2025 +0800 [Refactor] Reorganize Thread Synchronization Steps to make sure global synchronization can be correctly lowered (#521) * [Refactor] Reorganize Thread Synchronization Steps in OptimizeForTarget Function * Removed redundant thread synchronization steps for "global" and "shared" memory, streamlining the optimization process. * Reintroduced necessary synchronization for "shared" and "shared.dyn" after the injection of PTX async copy, ensuring correct memory access patterns. * Enhanced overall clarity and maintainability of the OptimizeForTarget function by restructuring the order of operations. * [Refactor] Reorder Thread Synchronization and PTX Async Copy in OptimizeForTarget Function * Removed redundant global thread synchronization step and adjusted the order of operations for shared memory synchronization. * Ensured that the PTX async copy injection occurs after the global thread sync, improving memory access validity. * Enhanced clarity and maintainability of the OptimizeForTarget function by restructuring synchronization steps. 
commit 5e1b83176122d0aaf1de77f58fc2f3036f8825ab Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Mon May 26 20:25:00 2025 +0800 [Enhancement] Add atomicAdd for FLOAT16x2 and FLOAT16x4 (#522) * [Enhancement] Add atomic addition functions for FLOAT16x2 and FLOAT16x4 in CUDA * Introduced `AtomicAddx2` and `AtomicAddx4` functions for performing atomic addition operations on double-width float types in CUDA. * Updated `customize.py` to include the new `atomic_addx4` function for external calls. * Modified `__init__.py` to export the new atomic addition function, ensuring accessibility in the module. * lint fix commit 10e12b4e0ccab939f6095efbe75c8949b885be72 Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Date: Mon May 26 17:06:38 2025 +0800 [Refactor] Replace default fp8 dtype with cute to perform fast cast (#520) * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516) * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions. * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity. * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process. * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability. * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies. * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions. * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise. * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios. * Add dynamic matrix multiplication example and corresponding test * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling. * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example. * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness. * lint fix * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device. * Updated the `__init__.py` file to include the new function in the module exports. * lint fix * Add global barrier state and expectation handling in CUDA code generation * Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator. * Updated `Finish` method to declare the global barrier state if needed. * Implemented handling for `EvaluateNode` to initialize the barrier expectation. 
* Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
* Enhanced CUDA FP8 type definitions for better alignment and structure.
* Enhance CUDA FP8 type handling and debug printing
* Updated `cuda_fp8.h` to replace NVidia's FP8 types with Cute's FP8 types for better compatibility and structure.
* Added specializations for `debug_print_var` and `debug_print_buffer_value` functions to support the new FP8 types, improving debugging capabilities for these data types.
* Updated `debug.h` to include the new `cuda_fp8.h` header for access to the FP8 type definitions.
* Refactor CUDA code generation to remove unnecessary managed qualifier for global barrier state
* Updated the `Finish` method in `codegen_cuda.cc` to declare the global barrier state without the `__managed__` qualifier, simplifying the declaration.
* Added a new `sync_global` function in `builtin.py` to synchronize all threads in a block, enhancing synchronization capabilities in the TileLang framework.
* Remove deprecated CUDA kernel and Python script for FP8 E4M3 casting
* Deleted the `cast_to_fp8_e4m3_kernel` CUDA kernel implementation and its corresponding Python script, streamlining the codebase by removing unused components related to FP8 E4M3 type casting.
* This cleanup enhances maintainability and reduces potential confusion regarding obsolete code.
* lint fix

commit 2c3dd212f42c1e47be6b66c58f2975a015fd44ba
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date: Sun May 25 20:29:05 2025 +0800

[Enhancement] Support auto synchronization for global memory access (#519)

* [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
* Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
* Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
* Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
* Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
* Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
* [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
* Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
* Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
* Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
* Add dynamic matrix multiplication example and corresponding test
* Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
* Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
* The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
* lint fix
* Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
* Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
* Updated the `__init__.py` file to include the new function in the module exports.
* lint fix
* Add global barrier state and expectation handling in CUDA code generation
* Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
* Updated `Finish` method to declare the global barrier state if needed.
* Implemented handling for `EvaluateNode` to initialize the barrier expectation.
* Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
* Enhanced CUDA FP8 type definitions for better alignment and structure.

commit 5d95d1d79ba45192f79b4184533660980c802471
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date: Sat May 24 21:57:52 2025 +0800

[Refactor] Support auto index bitwidth casting (#517)

* [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
* Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
* Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
* Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
* Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
* Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
* [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
* Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
* Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
* Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
* Add dynamic matrix multiplication example and corresponding test
* Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
* Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
* The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
* lint fix
* Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
* Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
* Updated the `__init__.py` file to include the new function in the module exports.
* lint fix

commit df28ad20ab6f57f08ea75ee091de2ea5a202cbe3
Author: Taoyu Zhu <z609495@gmail.com>
Date: Fri May 23 16:49:45 2025 +0800

Fix deepgemm example (#513)

* fix deepgemm example
* fix deepgemm example
* make format
* Update example_deepgemm_fp8_2xAcc.py

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

commit 015c14d3192153bbb878e5ce4829228d1ec4ced6
Author: Yu Cheng <54519279+chengyupku@users.noreply.github.com>
Date: Fri May 23 16:10:31 2025 +0800

[Dev] Add grouped GEMM backward example scripts (#515)

* Introduced `example_grouped_gemm_fwd.py` and `example_grouped_gemm_bwd.py` to demonstrate grouped matrix multiplication with forward and backward operations.
* Implemented functions for grouped GEMM, input construction, and validation against PyTorch's implementation.
* Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions.
* Included a test function to validate the functionality with various input scenarios.

commit b084d326269ebf2b51570ca0985cd7ee77adca35
Author: Yu Cheng <54519279+chengyupku@users.noreply.github.com>
Date: Fri May 23 15:32:02 2025 +0800

[Dev] Add grouped GEMM example with TileLang and PyTorch in…
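(As a side note on the `get_num_sms` helper mentioned in the log above, added in `cuda_driver.py`: the real implementation may query the CUDA driver API directly. A hedged, equivalent sketch using PyTorch, which the examples above already rely on for validation, could look like this.)

```python
# Hedged sketch: the actual cuda_driver.py implementation may call the CUDA
# driver API directly; this version reads PyTorch's device properties instead.
import torch


def get_num_sms(device_id: int = 0) -> int:
    """Return the number of streaming multiprocessors (SMs) on a CUDA device."""
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available on this machine")
    return torch.cuda.get_device_properties(device_id).multi_processor_count


if __name__ == "__main__":
    print(f"SMs on device 0: {get_num_sms(0)}")
```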
* [wip] feat: add nvrtc backend
* [wip] fix: handle out_idx
* [wip] refactor: move lib logic to libgen
* feat: cache for nvrtc backend
* fmt: run format
* fix: handle cuda bindings import error
* fix: handle cuda bindings import error
* fix: handle cuda bindings import error
* fix: handle cuda bindings import error
* fix: get kernel source
* refactor: speedup pyimport
* Improve error handling for missing cuda-python dependency in nvrtc backend. Raise ImportError with detailed installation instructions instead of logging a warning.
* Enhance nvrtc backend error handling by introducing a flag to check for cuda-python availability. Raise ImportError with detailed installation instructions during initialization if the nvrtc backend is unavailable, improving user experience and clarity.
* Update README.md to include recent NVRTC Backend addition, highlighting reduced compilation time for CUDA templates.
* fix tl_templates
* ensure CUDA context

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
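(Several of the commits above deal with failing loudly when `cuda-python` is missing. A minimal sketch of that pattern, assuming the `from cuda import cuda, nvrtc` import path provided by the cuda-python package; the exact module layout and message wording in this PR may differ.)

```python
# Hedged sketch of the "raise ImportError with installation instructions"
# pattern described above; names and message text are illustrative only.
try:
    from cuda import cuda, nvrtc  # provided by the cuda-python package
    CUDA_PYTHON_AVAILABLE = True
except ImportError:
    cuda = nvrtc = None
    CUDA_PYTHON_AVAILABLE = False


def ensure_cuda_python_available() -> None:
    """Raise a detailed ImportError when the nvrtc backend cannot be used."""
    if not CUDA_PYTHON_AVAILABLE:
        raise ImportError(
            "The nvrtc execution backend requires the `cuda-python` package.\n"
            "Install it with `pip install cuda-python`, or fall back to the "
            "default backend via execution_backend='cython'."
        )
```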
No description provided.