Merge remote-tracking branch 'upstream/main' into main #31
Conversation
… it from FlattenBuffer (tile-ai#714)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
* Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling
  - Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes.
  - Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management.
  - Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body.
  - Removed obsolete code and improved overall code clarity and maintainability.
* lint fix
* Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls
  - Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves.
  - Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations.
* test fix
* Enhance global read detection in pipeline planning
  - Updated the handling of global reads to account for condition expressions within IfThenElse nodes, ensuring accurate identification of global memory accesses.
  - Introduced a new flag to track whether the visitor is within a condition expression, improving the correctness of buffer access analysis.
  - Refactored the VisitStmt_ method to properly handle the structure of IfThenElse nodes, enhancing the clarity and maintainability of the code.
* Add IndexLegalizer to enforce int64 for out-of-bound indices
  - Introduced the IndexLegalizer class to ensure that indices in BufferStore and BufferLoad nodes are promoted to int64 when they exceed their type bounds.
  - Refactored the Int64Promoter logic from flatten_buffer.cc into IndexLegalizer, improving code organization and reusability.
  - Updated the ConfigIndexBitwidth pass to apply IndexLegalizer after rewriting the body, enhancing the handling of index bitwidths in transformations.
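The IndexLegalizer change above boils down to one rule: if a buffer's flat index can exceed the range of its index dtype, compute it in int64. A minimal plain-Python sketch of that rule (the names here are illustrative, not the actual pass API):

```python
# Hypothetical sketch of the int64-promotion rule: given a buffer shape,
# decide whether its largest flat index still fits in the current index
# dtype; if not, promote to int64.
INT32_MAX = 2**31 - 1

def legalize_index_dtype(shape, index_dtype="int32"):
    """Return the index dtype needed to address every element of `shape`."""
    max_flat_index = 1
    for dim in shape:
        max_flat_index *= dim
    max_flat_index -= 1  # largest valid flat offset
    if index_dtype == "int32" and max_flat_index > INT32_MAX:
        return "int64"  # index would overflow int32: promote
    return index_dtype
```

For example, a 65536x65536 buffer has a largest flat offset of 2^32 - 1, which overflows int32 and forces int64 indices.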
…tile-ai#718)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
* Rename build-test job to build-test-nvidia and specify nvidia as a runner label in CI workflow.
* Update CI workflow to specify 'nvidia' as an additional runner label for the format-check job.
* fix: NVRTC backend
* fix: CI

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* Init support for sm120
* fmt
* resolve comments
* unify mma gemm
* fmt

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* chore: fix typos
* chore: fix ruff
* chore: fix clang-format
* [Enhancement] Refactor buffer index handling for improved precision and clarity (tile-ai#668)
  - Enhanced buffer index handling to address precision issues by removing redundant operations.
  - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
  - Updated related documentation to reflect changes in buffer management practices.
* Remove obsolete test script for AMD example, streamlining the examples directory.
* Remove unused dtype_size variable in AMD example script to streamline code.
* Add input configuration file and update AMD example script for enhanced flexibility
  - Introduced a new input.txt file for configurable parameters.
  - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
  - Streamlined the main function for better clarity and organization.
  - Added a new test script to facilitate running the example with specified parameters.
* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
  - Deleted input.txt and test.sh files as they are no longer needed.
  - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
  - Reintroduced swizzle usage in the kernel for better performance.
* Refactor AMD example script for FlashAttention-2
  - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
  - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
  - Removed outdated comments and improved code organization for better readability.
* Refactor formatting in AMD FlashAttention example script
  - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
  - Streamlined the `main` function parameter formatting for consistency.
  - Removed unnecessary blank lines to enhance overall code organization.
* Update example_amd_flash_attn_fwd.py
* Update AMD FlashAttention example and TVM submodule
  - Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support.
  - Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads.
  - Updated the TVM submodule to the latest commit for improved functionality.
  - Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.
* Add CI workflow for automated format checking and testing
  - Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests.
  - The workflow includes steps for setting up a Python environment, running format checks, and executing tests.
  - Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.
* Rename CI workflow from "CI" to "AMD CI" for clarity and specificity.
* Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management.
* Update AMD CI workflow to install pytest directly instead of using requirements-test.txt
* Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt
* Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation
* Remove Torchaudio package copying from AMD CI workflow to streamline dependency management.
* Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.
* Add installation of ROCm in AMD CI workflow
  - Included a step to execute the `install_rocm.sh` script for improved setup.
  - Removed unnecessary blank line for better readability in the workflow script.
* Remove installation step for ROCm in AMD CI workflow to simplify the setup process.
* Update AMD CI workflow to run specific test file with verbose output instead of all tests.
* Add new tilelang built-in operations for AMD architecture
  - Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang.
  - Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects.
* Enhance autotuner configurations and GEMM operations in AMD example
  - Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning.
  - Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.
* Update autotuner configurations in AMD example for enhanced performance
  - Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning.
  - Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.
* Enhance autotuner configurations and memory handling in AMD example
  - Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities.
  - Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations.
* Refine autotuner configurations and memory usage in AMD example
  - Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning.
  - Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations.
* Update autotuner configurations in AMD example for enhanced performance
  - Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities.
  - Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations.
* Enhance autotuner configurations and GEMM operations in AMD example
  - Expanded thread counts in `get_configs` to include higher values for improved autotuning.
  - Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications.
* Update AMD CI workflow and remove obsolete test script
  - Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu.
  - Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure.
* Remove TVM subproject from 3rdparty directory
* Refactor configuration generation and accumulation logic in AMD example
  - Reformatted the `get_configs` function for improved readability by aligning parameters.
  - Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions.
* Enhance AMD CI workflow with additional logging and setup steps
  - Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements.
  - Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.
* Comment out package copying in AMD CI workflow to prevent potential issues during environment setup
* Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps
* Enhance BuildTileLangHIP function by adding whitespace for improved readability
* Refactor kTVMGridConstant definition for clarity and remove unnecessary comment
* Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
* lint fix
* Update AMD CI workflow to use requirements-rocm.txt for dependency installation
* fix ci
* Remove dependency on format-check from AMD CI workflow
* fix ci
* fix ci
* fix ci
* Remove format-check job from AMD CI workflow
* Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow
* Add dependency on format-check job in AMD CI workflow
* Add format-check job to AMD CI workflow
* Update format-check job in AMD CI workflow to run on self-hosted environment
* Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes
* Update amd_ci.yml

---------

Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
…nsorcore policy (tile-ai#724)

* [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy
* [Typo] Correct architecture selection for CUDA and CDNA
…handling (tile-ai#721)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
* Refactor CUDA code generation to simplify eviction policy handling
  - Updated `VisitExpr_` methods in `codegen_cuda.cc` to use default eviction policy for `tma_load`, `tma_load_im2col`, and `tma_store` functions, reducing complexity.
  - Removed conditional assembly code for `EVICT_NORMAL` in `copy_sm90.h`, streamlining the assembly calls for tensor memory operations.
* lint fix
…inputs (tile-ai#722)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
* Support strided tensors
* Refactor target attribute helper functions for improved clarity
* No code changes made in proxy.py and setup.py
* lint fix
* lint fix via gemini
* lint fix
* test fix
* test fix
* lint fix
* Update wrapper.py
* test fix
* Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity.
* lint fix

---------

Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
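"Support strided tensors" means the wrapper can no longer assume a contiguous row-major layout: an element's memory offset must be computed from the tensor's explicit strides. A small illustrative sketch of that addressing rule (not the project's actual wrapper code):

```python
# Strided-tensor addressing: the flat element offset is the dot product of
# the index tuple with the stride tuple. For a contiguous 2x3 row-major
# tensor the strides are (3, 1); a transposed view of the same storage
# has strides (1, 3) without any data movement.
def element_offset(indices, strides):
    """Flat storage offset of element `indices` in a strided tensor."""
    assert len(indices) == len(strides), "rank mismatch"
    return sum(i * s for i, s in zip(indices, strides))
```

With row-major strides `(3, 1)`, element `(1, 2)` lives at offset `1*3 + 2*1 = 5`; with the transposed strides `(1, 3)` the same index maps to offset `7`.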
…r fallback support for Hopper (tile-ai#712)

* bug fix and support gemm_sr fallback for hopper
* Update gemm.cc

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Docstrings generation was requested by @LeiWang1999.

* tile-ai#712 (comment)

The following files were modified:

* `src/op/gemm.cc`
* `src/tl_templates/cuda/gemm_sm90.h`
* `src/transform/warp_specialized_rewriter.cc`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* [Enhancement] Refactor buffer index handling for improved precision and clarity (tile-ai#668)
  - Enhanced buffer index handling to address precision issues by removing redundant operations.
  - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
  - Updated related documentation to reflect changes in buffer management practices.
* Remove obsolete test script for AMD example, streamlining the examples directory.
* Remove unused dtype_size variable in AMD example script to streamline code.
* Add input configuration file and update AMD example script for enhanced flexibility
  - Introduced a new input.txt file for configurable parameters.
  - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
  - Streamlined the main function for better clarity and organization.
  - Added a new test script to facilitate running the example with specified parameters.
* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
  - Deleted input.txt and test.sh files as they are no longer needed.
  - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
  - Reintroduced swizzle usage in the kernel for better performance.
* Refactor AMD example script for FlashAttention-2
  - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
  - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
  - Removed outdated comments and improved code organization for better readability.
* Refactor formatting in AMD FlashAttention example script
  - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
  - Streamlined the `main` function parameter formatting for consistency.
  - Removed unnecessary blank lines to enhance overall code organization.
* Update example_amd_flash_attn_fwd.py
* Enhance AMD example script and update CI workflows
  - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
  - Added new CI workflows for AMD and documentation publishing.
  - Updated various requirements files to include necessary dependencies.
  - Introduced new test cases and examples for better coverage and functionality.
  - Refactored existing code for improved readability and maintainability.
* Remove redundant tool cache cleanup step in AMD CI workflow
* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

---------

Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
* [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16
* [Dequant] Add extern call and serial dequantization
* [Dequant] Parallel Dequant wait for fence debug.
* [Scale] Add scale matrix to mxfp4 gemm
* [Remove] Remove fence-buggy example and some generated source cuda code
* [MXFP4] Update initial version of MXFP4 GEMM
* [Scale] Add scale to latest mxfp4 gemm
* [Lint]
* [BugFix] Load Scale, disable TMA to recover performance
* [Lint]
* [Lint]
* [Scale] Use L2 to hold Scale; enabling TMA slightly boosts performance
* [Lint]
* Update example_dequant_gemm_bf16_fp4_hopper_serial.py
* Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory.
* Refactor dequantization examples for improved readability and consistency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding.
* Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency.
* lint fix
* ci fix
* Remove non-existent example
* [BugFix] Add smem swizzle to recover performance of TMA
* [BugFix] Enough reg for producer when threads=512

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
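For context on the fp4-->bf16 dequantization work above: MXFP4 elements use the FP4 E2M1 encoding (1 sign bit, 2 exponent bits, 1 mantissa bit), whose eight magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6}. The CUDA kernel does this with bit twiddling on packed values; the table-based reference below just documents the mapping and is not the kernel's code:

```python
# Reference decoder for FP4 E2M1 (the MXFP4 element format).
# Code layout: bit 3 = sign, bits 2..0 index the magnitude table.
_E2M1_MAGNITUDE = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code to a Python float."""
    assert 0 <= code < 16
    sign = -1.0 if code & 0x8 else 1.0
    return sign * _E2M1_MAGNITUDE[code & 0x7]

def unpack_byte(b: int):
    """One byte packs two FP4 values; low nibble is the first element."""
    return decode_fp4_e2m1(b & 0xF), decode_fp4_e2m1(b >> 4)
```

So code `0b0111` decodes to 6.0 (the largest FP4 value) and `0b1010` to -1.0; a full MXFP4 block additionally multiplies 32 such elements by one shared power-of-two scale.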
* 📝 Add docstrings to `mxfp4`

Docstrings generation was requested by @LeiWang1999.

* tile-ai#725 (comment)

The following files were modified:

* `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py`
* `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py`
* `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py`
* `examples/dequantize_gemm/utils.py`
* `examples/gemm/example_gemm_autotune.py`
* `tilelang/intrinsics/utils.py`
* `tilelang/language/__init__.py`
* `tilelang/language/utils.py`
* `tilelang/quantize/mxfp.py`
* `tilelang/quantize/quantization.py`

* [Lint] More accurate docstring
* [Lint]

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
* Fix environment variable name for compilation print setting in `env.py`
* Remove deprecated test file for warp specialized pass configuration and refactor environment variable access in `env.py` to utilize a centralized `EnvVar` class for better management and clarity.
* lint fix
* Refactor cache check to use `env.is_cache_enabled()` for consistency in `tuner.py`
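The idea behind the centralized `EnvVar` class is to replace scattered `os.environ` lookups with one declared registry of variables and typed accessors like `is_cache_enabled()`. A minimal sketch of that pattern; the class name and accessor come from the commit, but the variable name `TILELANG_CACHE_ENABLED` and all details here are assumptions, not the project's actual `env.py`:

```python
import os

class EnvVar:
    """One declared environment variable with a default (illustrative)."""

    def __init__(self, key: str, default: str):
        self.key = key
        self.default = default

    def get(self) -> str:
        # Read fresh on each access so tests can flip the variable.
        return os.environ.get(self.key, self.default)

class Env:
    # Hypothetical variable name, chosen for illustration only.
    CACHE_ENABLED = EnvVar("TILELANG_CACHE_ENABLED", "1")

    def is_cache_enabled(self) -> bool:
        return self.CACHE_ENABLED.get() == "1"

env = Env()
```

Call sites then read `env.is_cache_enabled()` instead of re-deriving the flag from `os.environ`, so renaming a variable touches one line.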
…e-ai#743)

* Introduced an assertion to ensure that the stride index is within the valid range of tensor dimensions in `cython_wrapper.pyx`.
* This change prevents potential out-of-bounds errors when accessing tensor dimensions, enhancing the robustness of the code.
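The assertion described above is a plain range check before any stride lookup. Sketched in pure Python (the real check lives in Cython and the helper name here is hypothetical):

```python
def get_stride(strides, dim_index):
    """Return the stride for `dim_index`, rejecting out-of-range indices
    instead of reading past the end of the stride array."""
    ndim = len(strides)
    assert 0 <= dim_index < ndim, (
        f"stride index {dim_index} out of range for {ndim}-d tensor")
    return strides[dim_index]
```

Without the assert, an out-of-range `dim_index` would read arbitrary memory in the Cython layer rather than raising a clear error.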
…or (tile-ai#742)

* [Bugfix] Fix atomic add auto vectorize memory access out of bound error
* Update atomicadd_vectorize.cc
* format
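The out-of-bound error here is the classic vectorization hazard: when the element count is not a multiple of the vector width, a naively vectorized loop reads past the end of the buffer. The usual fix, sketched as a pure-Python stand-in for the generated code (not the actual pass output):

```python
def atomic_add_vectorized(dst, src, vec=4):
    """Accumulate src into dst: full vector-width chunks first, then a
    scalar tail so no access goes past the end of the buffers."""
    n = len(src)
    main = n - n % vec
    for i in range(0, main, vec):   # vectorized main loop (full chunks only)
        for j in range(vec):
            dst[i + j] += src[i + j]
    for i in range(main, n):        # scalar epilogue for the remainder
        dst[i] += src[i]
```

For `n = 7` and `vec = 4`, the main loop covers indices 0..3 and the tail covers 4..6; nothing touches index 7.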
* 📝 Add docstrings to `main`

Docstrings generation was requested by @LeiWang1999.

* tile-ai#742 (comment)

The following files were modified:

* `src/transform/atomicadd_vectorize.cc`

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* Introduce Barrier
* Enhance CUDA kernel with new barrier management and post-processing support
  - Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers.
  - Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure.
  - Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency.
  - Introduced additional print statements for debugging in the lowering phase of the TileLang engine.
  - Enhanced the overall structure and readability of the codebase.
* Remove unused barrier handling code in CUDA and HIP code generators to streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic.
* Enhance barrier management in TileLang
  - Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework.
  - Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory.
  - Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code.
  - Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine.
  - Removed deprecated memory scope handling code to enhance clarity and maintainability.
* lint fix
* lint fix
* Remove `allocate_barrier` intrinsic and related code from TileLang to streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability.
* Refactor logging in JITKernel to improve kernel compilation tracking
  - Removed unused import of `torch.backends` in the example file.
  - Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging.
  - Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function.
* Refactor dequantization tests and update barrier function
  - Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite.
  - Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management.
* Update CI configuration to increase pytest parallelism from 4 to 8 threads for improved test execution speed.
* Fix typos in rasterization parameters and update import path for cached module
  - Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage.
  - Updated the import statement for the `cached` module to reflect the new path in the cache submodule.
  - Added `StridedTensor` import in the language module for enhanced tensor functionality.
* Update ci.yml
… bulk copy (tile-ai#746)

* [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy
  - Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed.
  - Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
  - Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable.
* lint fix
* Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.
* remove bulk copy
* Refactor copy and atomic add operations to support TMA lower configuration
  - Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
  - Modified `Lower` method in `Copy` to incorporate the new TMA configuration.
  - Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
  - Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity.
  - Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.
* Enhance TMA bulk copy logic in `LowerBulkCopy` method
  - Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
  - Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.
* lint fix
* Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.
* Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions
  - Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
  - Added TODO comments to indicate the need for further improvements in shared memory handling.
* Update `native_sparse_attention` function to include TMA configuration options
  - Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations.
  - Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.
* Refactor JIT decorator formatting in `native_sparse_attention` function
  - Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase.
  - No functional changes were made; this update focuses on code clarity and maintainability.
* Enhance thread management and logging in TileLang compilation
  - Added a method to check if printing is enabled during compilation, improving control over logging behavior.
  - Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output.
  - Added comments to clarify the purpose of changes and improve code readability.
* Add warp specialization scope and refactor register management in TileLang
  - Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management.
  - Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process.
  - Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management.
  - Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.
* Refactor test for InjectSetMaxNReg pass in TileLang
  - Improved readability by restructuring conditional checks and assertions in the test cases.
  - Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic.
  - Ensured consistent formatting and spacing throughout the test functions for better maintainability.
* Enhance bulk copy and store checks in `Copy` class
  - Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options.
  - Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations.
  - Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.
* lint fix
* Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency.
* Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency.
* Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability.
* Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality.
* typo fix
* Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues.
* lint fix
* test fix
* Enhance CI configuration by adding verbose output to pip install command for better visibility during installation.
* use ninja instead of make
* Add CMake configuration step for Ninja build system in setup.py
* Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja.
* Enhance CI configuration by adding verbose output to pytest commands for improved test visibility.
* Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks.
* Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly.
* Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes.
* Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification. Update buffer index calculations to ensure consistent simplification of range expressions.
* bugfix
* Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp.
* Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts.
* Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality.
- Updated the loop body construction in `ir.cc` to conditionally include an output statement based on the analyzable condition of the `waves` variable.
- This change enhances performance by avoiding unnecessary statement wrapping when the condition is met, improving the efficiency of loop execution.

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [MXFP4] Fix bugs
  - Optimize exp2 with shift operation to boost performance
  - Fix bug of simple dequantization function call
  - Fix bug of scaling factor with bias
* [Lint]

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
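The "exp2 with shift" optimization above relies on a standard float trick: for an integer exponent e, 2^e can be materialized by shifting the biased exponent e + 127 directly into the exponent field of an IEEE 754 binary32 value, so no transcendental call is needed. A self-contained illustration (MXFP4 scales are powers of two, which is why this applies; this is a sketch of the trick, not the kernel's code):

```python
import struct

def exp2_via_shift(e: int) -> float:
    """2**e for integer e in the normal float32 range, built by shifting
    the biased exponent (e + 127) into bits 23..30 of a float32."""
    assert -126 <= e <= 127, "only normal float32 exponents handled here"
    bits = (e + 127) << 23  # sign = 0, mantissa = 0, exponent = e + 127
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

For example, e = 3 yields the bit pattern 0x41000000, i.e. 8.0, and e = -2 yields 0.25; the bias bug the commit mentions would correspond to getting the `+ 127` adjustment wrong.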
… gemm_sm90.h (tile-ai#751)

- Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
- Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.
* Add shape checking for reduce options
* lint fix
* Handle special case reducing into shape-1 tensor
  - Allow reducing [X, d, Y] into [X, Y] or [X, 1, Y]

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
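The shape rule described above is easy to state concretely: reducing dimension `dim` of an input shaped [X, d, Y] must produce either the squeezed shape [X, Y] or the keep-dims shape [X, 1, Y]. A small plain-Python checker capturing just that rule (a hypothetical helper, not the library's API):

```python
def check_reduce_shapes(in_shape, dim, out_shape):
    """True if out_shape is a valid result of reducing in_shape along dim:
    either the reduced axis is dropped, or it is kept with extent 1."""
    squeezed = list(in_shape[:dim]) + list(in_shape[dim + 1:])
    keepdims = list(in_shape[:dim]) + [1] + list(in_shape[dim + 1:])
    return list(out_shape) in (squeezed, keepdims)
```

So reducing a (2, 5, 3) tensor along dim 1 accepts outputs shaped (2, 3) or (2, 1, 3) and rejects anything else.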
* [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h
  - Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
  - Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.

  Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [Enhancement] Include cuda_fp8.h in gemm_sm90.h
  - Added the inclusion of the "cuda_fp8.h" header file to support new data formats in CUDA GEMM operations, enhancing compatibility with recent updates for fp8 types.

  Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* lint fix
* [Refactor] Remove unused tl_shuffle_elect and related functions from common.h
  - Deleted the `tl_shuffle_elect` function and its associated comments to streamline the codebase.
  - Added inclusion of "intrin.h" for improved intrinsic support in CUDA operations.
  - Cleaned up the file by removing unnecessary template parameters and functions, enhancing clarity and maintainability.
* lint fix
* [Refactor] Update header inclusions in common.h and gemm_sm90.h
  - Removed the inclusion of "intrin.h" from common.h to streamline dependencies.
  - Added "intrin.h" inclusion in gemm_sm90.h to ensure intrinsic support for CUDA operations, enhancing functionality and maintainability.
* bug fix
* [MXFP4] Add bias to gemm kernel
* [Lint]
* [Lint] Rename "bias" to "Bias"
…i#754)

* Update test parameters and remove debug print statement
  - Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
  - Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.
* Refactor loop stack management in warp_specialized_rewriter
  - Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
  - Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
  - Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.
* Remove unused `torch.backends` import and `tilelang.disable_cache()` calls from multiple test files to enhance code clarity and maintainability.
…ile-ai#1068)

* [Enhancement] Update async intrinsic handling in inject_fence_proxy
  - Added support for wgmma async intrinsics in the IsAsyncIntrinsic function.
  - Changed handling of unknown externs to treat them as Generic instead of Async, improving accuracy in proxy kind determination.
* test fix
* Update testing/python/transform/test_tilelang_transform_inject_fence_proxy.py

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* [Feature] Add GQA backward kernel with varlen input
* [Lint]
* [BugFix] Freeze the memory order of all atomic_add operations
* [Lint]
* [Lint]
* [BugFix] Use release order to boost performance
…e-ai#1081)

* [BugFix] Add memory order argument for non-vectorized atomic add
* [Lint]
* [BugFix] Memory order
* [Lint]
* [BugFix] Argument in cuda template
* [Lint]
… path (tile-ai#1086)

* refactor cython wrapper
* optimize
* fix installations
…1085)

* Improve target docs and helper messaging
  - add SUPPORTED_TARGETS metadata and expose describe_supported_targets()
  - relax target validation to accept option suffixes and upgrade error messages
  - document target usage and compute capability mapping in docs/get_started/targets.md
  - note preference for string targets when caching and link the new guide in docs/index.md
* remove american english spelling
…ests (tile-ai#1088)

* [Cleanup] Remove `tilelang.disable_cache()` calls from example scripts
* lint
* lint
…tile-ai#1089)

* Enable configurable StorageRewrite inplace detection
  - Add kStorageRewriteDetectInplace constant and register the flag with PassContext so C++ code no longer hard-codes the key.
  - Wire StorageRewrite to include TileLang builtin constants and honor the new config toggle when deciding inplace reuse.
  - Document the flag across Python surfaces (PassConfigKey, JIT/autotuner docs) with usage guidance and simplified IR examples.
* lint fix
* add test
* lint fix
* Carry the existing local-var initializer map into OpaqueBlockLower, reattaching it to generated Allocates and the PrimFunc attrs
  - thread the map through FlattenBuffer and StorageRewrite so flattened/merged allocations keep their tl.local_var_init annotations
  - teach annotation handling to accept scalar initializers, resolve buffers, and merge with existing state
* lint fix
* enhance
* lint fix
* lint fix
…1093)

* [Lint]
* [BugFix] Freeze the memory order of all atomic_add operations
* [Lint]
* [Atomic] Move on to regional atomic add
* [Lint]
* add alloc_reducer gemv example * test
…ile-ai#1044)

* [Lint] Retire `format.sh` and add `clang-tidy` to GHA workflow
* chore: update clang-tidy settings
* chore: upgrade clang-format and clang-tidy version
* lint: resolve clang-tidy errors
* [Maint] restore format.sh
* [CI] pre-commit autoupdate
* [Minor] fix `command -v` usage

…ile-ai#1102)

* [Maint] Remove pre-commit install in `format.sh`
* [Maint] Update uncommitted change detection command
* [Minor] update warning messages

* update rules
* ruff check
* other fixes
* fmt
* do not touch examples
* fmt
…ion dtype logic (tile-ai#1111)

* [Refactor] Improve scalar handling in CopyNode and update loop partition dtype logic
  - Refactored CopyNode::MakeSIMTLoop to handle scalar cases more efficiently by moving the scalar check to the end of the function.
  - Updated loop_partition.cc to set a default DataType for thread and vector extents, ensuring compatibility when loop_vars_ is empty.
* lint fix
* remove debug print
…-ai#1095)

* [Feature] Add vectorized float16 and float32 conversion support in CUDA codegen
  - Implemented handling for conversions between float16 and float32 types, specifically for vectorized operations using __half22float2 and __float22half2_rn.
  - Enhanced the existing code to support both directions of conversion based on the lane count.
  - Improved overall type handling in the VisitExpr_ method for better compatibility with TileLang.
* [Feature] Add float32 to float8 conversion support in CUDA codegen
  - Implemented handling for conversion from float32 to float8 (E4M3/E5M2) in the VisitExpr_ method.
  - Added vectorized conversion support using __nv_cvt_float2_to_fp8x2 for float2 to fp8x2 transformations.
  - Enhanced type handling for better compatibility with TileLang, particularly for float8 types.
* lint
* fix a bug
* [Enhancement] Support lanes=4 cases and add unit test for vectorized cast
* lint
* [Feature] Refactor bf16 conversion operations and remove legacy compile flags
* lint
Caution: Review failed. The pull request is closed.

Walkthrough

This pull request adds comprehensive project infrastructure, CI/CD workflows, build system improvements, extensive compiler configuration for code quality, and a large collection of neural network kernel examples. It includes configuration standardization, TVM integration refactoring, new attention mechanism implementations, and documentation improvements.
Sequence Diagram(s)

sequenceDiagram
actor Dev as Developer
participant GH as GitHub
participant CI as CI/CD<br/>Workflows
participant Docs as Docs<br/>Repo
Dev->>GH: Open PR / Comment /performance-report
rect rgba(100, 150, 255, 0.2)
Note over CI: New Performance Bot
GH->>CI: Trigger pr-perfbench-bot.yml
CI->>CI: Checkout PR merge ref
CI->>CI: Build merged & original versions
CI->>CI: Run performance benchmark
CI->>GH: Post results as comment
end
rect rgba(100, 200, 150, 0.2)
Note over CI: Distribution Workflow
GH->>CI: Trigger dist.yml (release/schedule)
CI->>CI: Build wheels (multi-platform)
CI->>CI: Upload artifacts
end
rect rgba(200, 150, 100, 0.2)
Note over CI: Docs Publishing
GH->>CI: Trigger publish-docs.yml
CI->>CI: Build documentation
CI->>Docs: Push to docs repo
Docs->>Dev: Docs available online
end
rect rgba(255, 150, 150, 0.2)
Note over GH: PR Automation
GH->>GH: Post PR reminder (pre-commit)
GH->>GH: Check issue template
end
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: The PR spans diverse domains—infrastructure/CI (medium complexity), build system refactoring (high complexity with TVM integration), extensive examples with intricate kernel logic (high density in attention/sparse operators), and configuration standardization (low per-file, high aggregate). While many changes are additive or configuration-based (reducing per-file cognitive load), the heterogeneity demands separate reasoning for CI workflows, CMake logic, and multiple kernel implementations across different attention mechanisms. The large example suite and algorithmic complexity in sparse MLA and attention variants increase cognitive overhead despite homogeneous patterns within cohorts.