
Conversation

@Alex4210987 (Collaborator) commented Oct 10, 2025

Summary by CodeRabbit

  • New Features

    • Expanded FlashAttention: forward/backward autotuning, reference verification, benchmarking, diagnostics, and updated CLI parameters for running kernels.
    • Broader FP8 support and new FP8 vector types for HIP.
    • New atomic-add API variants for HIP.
  • Performance

    • Multi-stage kernels and revised tiling/wiring to improve tuning and runtime efficiency.
  • Tests

    • Removed scripted AMD forward example runs.
  • Refactor

    • Simplified layout annotations in attention examples.

xinxyxiao and others added 30 commits July 29, 2025 03:26
…nd clarity (tile-ai#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.
…ed flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.
… example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.
- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.
- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.
- Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
- Added new CI workflows for AMD and documentation publishing.
- Updated various requirements files to include necessary dependencies.
- Introduced new test cases and examples for better coverage and functionality.
- Refactored existing code for improved readability and maintainability.
- Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
- Added `test.sh` script to facilitate running the new example with specified parameters.
- Enhanced the overall structure and organization of the example for better clarity and usability.
- Reduced the number of threads and `num_split_q` options for improved performance.
- Adjusted `panel_size` options to streamline configuration settings.
- Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
- Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
- Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
- Included reference implementation for validation against PyTorch's attention mechanism.

This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.
- Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
- Improved the main function to allow for better parameter configuration and benchmarking.
- Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.

This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.
- Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
- Adjusted include paths for better organization and clarity in the code structure.
…installation

- Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
- Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.
…rd example

- Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
- This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.
- Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
- This change simplifies the CI process and reduces potential overhead during package management.
- Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
- Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.
- Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
- No functional changes were made; this update focuses on code style improvements to enhance maintainability.
- Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
- Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.
xinyxiao and others added 9 commits October 10, 2025 05:35
…ple_amd_flash_attn_bwd.py for cleaner output.
…xample_amd_flash_attn_fwd.py by adding spaces for improved readability in configuration parameters and print statements.
- Reorganized and cleaned up code in codegen_hip.cc for better readability and maintainability.
- Enhanced handling of FP8 types, including additional formats and improved error logging for unsupported types.
- Updated AtomicAdd function in common.h to streamline its implementation.
- Refined the PrintVecElemLoadExpr method to handle volatile loads more effectively.
- Added function to manage the addition of new functions in the code generation process.
- Adjusted the indentation of the MFMA call code block in codegen_hip.cc for improved readability and consistency.
- Reintroduced necessary includes and reorganized code in codegen_hip.cc for improved structure and readability.
- Enhanced the GetFP8Type function to support additional FP8 formats and improved error handling for unsupported types.
- Updated PrintType and PrintVecElemLoadExpr methods to better manage type conversions and vector element loading.
- Refined the AddFunction method to streamline function addition in the code generation process.
@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

…bwd.py

- Updated the GEMM operation to use shared memory for improved performance.
- Adjusted parallelization parameters to enhance efficiency in the backward pass.
@coderabbitai coderabbitai bot (Contributor) commented Oct 10, 2025

Walkthrough

Adds comprehensive AMD FlashAttention forward/backward Tile-lang kernels, reference implementations, autotune configs, diagnostics, and CLI updates; adjusts forward numeric scaling; removes two scripted forward tests; removes some shared-memory layout annotations; and extends HIP FP8, atomic helpers, and GEMM template handling.

Changes

  • AMD FlashAttention (new kernels & tools): examples/amd/example_amd_flash_attn_bwd.py, examples/amd/example_amd_flash_attn_fwd.py. Adds ref_program, a tiled forward kernel (fast_flashattn) and backward kernel (flashattn_bwd), preprocess/postprocess helpers, autotune config providers (get_fwd_configs, get_bwd_configs), a debug_tensor_comparison diagnostic, a benchmark_function helper, and JIT/autotune decorators; changes forward scaling (exp2 → exp) and expands the autotune search space and tunable parameters.
  • AMD test script: examples/amd/test.sh. Removes two forward-pass test invocations: the Python forward example call and a tiled FMHA forward binary call.
  • FlashAttention example (layout annotations): examples/flash_attention/example_mha_bwd.py. Removes several explicit shared-memory annotation_layout calls for fragments (Q_shared, K_shared, dv_shared, dk_shared); computation and control flow are otherwise unchanged.
  • HIP codegen: src/target/codegen_hip.cc. Adds mapping and constant-print handling for additional FP8 variants (e.g., e4m3/e4m3fnuz), emits loop_break() lowering, treats no_set_max_nreg() as a HIP no-op, and adjusts MFMA emission formatting.
  • HIP templates, common utilities: src/tl_templates/hip/common.h. Adds a value-based AtomicAdd(T1 address, T2 val) overload and AtomicAddRet(T1 &ref, T2 val), which returns the atomic result; preserves the existing pointer-based overload.
  • HIP templates, GEMM: src/tl_templates/hip/gemm.h. Comments out the static_assert(!clear_accum) guard, allowing compilation when clear_accum=true while noting that the semantics must be handled manually and are otherwise unsupported.
  • HIP templates, FP8 types: src/tl_templates/hip/hip_fp8.h. Adds FP8 E5M2 aliases and vector types (fp8_e5_4_t, fp8_e5_8_t, fp8_e5_16_t) with constructors/conversions for HIP compatibility.
  • Testing, FP8 matmul cases: testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py. Adds tl_matmul tests covering the float8_e4m3fnuz input dtype across several M/N/K and packing/transposition variants.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant CLI as CLI/Caller
    participant AT as Autotuner
    participant JIT as JIT Compiler
    participant K as GPU Kernel (FWD)
    participant GPU as GPU

    CLI->>AT: fast_flashattn(args, configs)
    AT->>JIT: select & compile kernel variant
    JIT-->>AT: compiled kernel handle
    AT->>K: launch with Q, K, V and params
    K->>GPU: load tiles, compute Q·Kᵀ, apply mask (causal), softmax, multiply V
    GPU-->>K: tile results
    K-->>CLI: output + LSE/stats
sequenceDiagram
    autonumber
    participant CLI as CLI/Caller
    participant PRE as bwd_preprocess
    participant AT as Autotuner (BWD)
    participant JIT as JIT Compiler
    participant KB as GPU Kernel (BWD)
    participant POST as bwd_postprocess

    CLI->>PRE: flashattn_bwd_preprocess(inputs)
    PRE-->>CLI: auxiliaries (delta/LSE)
    CLI->>AT: flashattn_bwd(args, configs)
    AT->>JIT: compile selected bwd kernel
    JIT-->>AT: kernel handle
    AT->>KB: launch to compute dQ, dK, dV
    KB-->>CLI: gradients
    CLI->>POST: flashattn_bwd_postprocess(dQ,...)
    POST-->>CLI: final dQ

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • LeiWang1999

Poem

I twitch my ears at kernels new,
Tiles hop fast where gradients grew.
FP8 and atomics join the crew,
Autotune hums — benchmarks chew.
Carrots queued, the rabbits review. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title Check (❓ Inconclusive): The title “fix bug&add amd examples” is too vague and generic to convey the significant and varied changes in this PR, which include new AMD FlashAttention kernels, HIP codegen and FP8 support enhancements, example scripts, and new tests; it also does not say which bug is fixed. Resolution: revise the title to clearly reflect the primary changes, for example “Add AMD FlashAttention kernels with HIP FP8 support and fix codegen bug,” so reviewers can quickly grasp the PR’s focus.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd1564d and 5b6bcaa.

📒 Files selected for processing (1)
  • testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1)
testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py (1)
  • assert_tl_matmul_correctness (197-261)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-test-metal
  • GitHub Check: build-test-nvidia
  • GitHub Check: build-test-amd
🔇 Additional comments (2)
testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (2)

241-246: Approve test cases pending assertion fix.

The FP8 test cases are well-structured and provide good coverage:

  • Basic configuration (128×128×128, float16 output)
  • Larger sizes (128×256×256, float32 output)
  • k_pack variations (k_pack=2)
  • Transpose variations (b_transposed=False)

The test pattern is consistent with the existing int8 tests above. However, these tests will fail until the dtype assertion issue (lines 25-28) is resolved.


190-195: Verify PyTorch float8_e4m3fnuz dtype support on ROCm/HIP
Confirm in your ROCm-enabled PyTorch environment that torch.float8_e4m3fnuz exists (e.g., via getattr(torch, "float8_e4m3fnuz")) and that you can create a CUDA tensor with this dtype.
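
A minimal check along these lines (a sketch; it assumes a ROCm build of PyTorch recent enough to expose the fnuz FP8 dtypes, roughly 2.1 or newer):

import torch

# The fnuz FP8 variants are the ones ROCm uses; older builds may not expose them.
if hasattr(torch, "float8_e4m3fnuz") and torch.cuda.is_available():
    x = torch.zeros(16, dtype=torch.float8_e4m3fnuz, device="cuda")
    print("created", x.dtype, "tensor on", x.device)
else:
    print("float8_e4m3fnuz is not available in this PyTorch build")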


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

…lash_attn_bwd.py for improved code cleanliness.
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (8)
examples/amd/example_amd_flash_attn_fwd.py (2)

37-37: Autotune space widened: expect longer tuning; consider adding num_stages=2 for AMD LDS pipelining

Good to include 0 and 1. On some workloads, stages=2 helps hide memory latency. Consider adding 2, optionally gated by dim/seq_len to avoid blowing up the search space.


175-186: Stable softmax path switch to exp is correct; minor micro-optimization optional

The move from exp2 to exp with scaled m is numerically sound. You can precompute inv_sqrt_d = scale once and reuse; also fold the two multiplications into a single fused multiply-sub for readability:

  • sf = T.exp((m_prev[i] - m_i[i]) * scale)
  • acc_s[i, j] = T.exp((acc_s[i, j] - m_i[i]) * scale)

Current code is correct as-is.
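
For reference, the identity both expressions rely on, written in standard FlashAttention notation rather than the kernel's exact variable names (with s = 1/\sqrt{d} and block scores S = QK^{\top}):

m_i = \max\bigl(m_{i-1}, \operatorname{rowmax}(S_i)\bigr), \qquad
\alpha_i = \exp\bigl(s\,(m_{i-1} - m_i)\bigr), \qquad
p_{ij} = \exp\bigl(s\,(S_{ij} - m_i)\bigr),

l_i = \alpha_i\, l_{i-1} + \textstyle\sum_j p_{ij}, \qquad
O_i = \alpha_i\, O_{i-1} + p_i V_i .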

examples/amd/example_amd_flash_attn_bwd.py (5)

34-66: Config space parity with fwd example

Good coverage. To keep parity with the forward example file and avoid regressions, consider constraining threads/panel combos via simple heuristics (e.g., limit threads=1024 to block sizes ≥128) to reduce tuning time without losing best configs.
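
A rough sketch of that kind of pruning (thresholds are illustrative rather than tuned, prune_configs is a hypothetical helper, and the bwd config dicts are assumed to use the same keys as the forward ones):

def prune_configs(configs):
    pruned = []
    for cfg in configs:
        # Skip very wide thread blocks on small tiles; they rarely win and slow tuning.
        if cfg["threads"] >= 512 and min(cfg["block_M"], cfg["block_N"]) < 128:
            continue
        # Skip the largest panel sizes when the M tile is small.
        if cfg["panel_size"] >= 10 and cfg["block_M"] < 64:
            continue
        pruned.append(cfg)
    return pruned

configs = prune_configs(get_bwd_configs())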


249-273: Preprocess kernel is correct; consider explicit threads kwarg

The Delta = sum(O*dO, axis=dim) implementation is fine. For consistency and predictable occupancy, pass threads=128 (or tuned) to T.Kernel here.


276-296: Backward kernel math path LGTM; minor clarity tweak

  • P_acc = exp(qkT*sm_scale - lse) and causal masking are standard.
  • dP, dv, dk, dq accumulation and atomic adds look consistent.

Optional: rename qkT -> S_scaled for readability.

Also applies to: 321-365


386-445: Debug/benchmark helpers are useful; minor CUDA guard

You already guard synchronize; optionally also guard torch.cuda.manual_seed(42) and device selection in main by checking CUDA/HIP availability to make CPU fallback easier when running on non-GPU CI.
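
A small guard of that sort could look like this (a sketch; shapes and names are placeholders, not the script's actual ones):

import torch

batch, heads, seq_len, dim = 1, 8, 1024, 64
use_gpu = torch.cuda.is_available()  # True on ROCm builds as well
device = "cuda" if use_gpu else "cpu"

torch.manual_seed(42)
if use_gpu:
    torch.cuda.manual_seed(42)

Q = torch.randn(batch, seq_len, heads, dim, dtype=torch.float16, device=device)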


446-486: End-to-end flow is solid; add LSE check and deterministic seeds

Great validation flow. Two small improvements:

  • Validate LSE from fwd_kernel against ref_program to catch rare stability issues.
  • Set torch.backends.cuda.matmul.allow_tf32 and cudnn flags as needed for determinism during checks (keep them default for perf bench).

I can add these guards and checks if you want.

Also applies to: 488-586
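
One possible wiring for both suggestions (a sketch; the lse_tl/lse_ref names and the tolerances are placeholders, not the script's actual identifiers):

import torch

# Keep validation deterministic; restore the defaults for the performance benchmark.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

def check_lse(lse_tl, lse_ref, rtol=5e-2, atol=5e-2):
    # lse_tl: LSE emitted by the forward kernel; lse_ref: LSE from the reference program.
    torch.testing.assert_close(lse_tl.float(), lse_ref.float(), rtol=rtol, atol=atol)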

src/tl_templates/hip/hip_fp8.h (1)

54-64: Inconsistent API: fp8_e5_4_t is missing constructors and operators present in fp8_e4_4_t.

The fp8_e5_4_t struct only has:

  • Default constructor
  • Constructor from __hip_fp8x4_e5m2_fnuz
  • Conversion operator

However, fp8_e4_4_t (lines 24-40) additionally provides:

  • Constructor from float4
  • Assignment operator from native type

This API inconsistency could lead to confusion and limit usability of E5 types.

Consider adding the missing members to fp8_e5_4_t to match fp8_e4_4_t:

 struct fp8_e5_4_t {
   union {
     __hip_fp8x4_e5m2_fnuz data;
     struct {
       fp8_e5_t x, y, z, w;
     };
   };
   __device__ fp8_e5_4_t() = default;
   __device__ fp8_e5_4_t(const __hip_fp8x4_e5m2_fnuz &val) : data(val) {}
+  __device__ fp8_e5_4_t(const float4 &val) : data(val) {}
   __device__ operator __hip_fp8x4_e5m2_fnuz() const { return data; }
+  __device__ fp8_e5_4_t &operator=(const __hip_fp8x4_e5m2_fnuz &val) {
+    data = val;
+    return *this;
+  }
 };
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7cd0da9 and 8b345ae.

📒 Files selected for processing (8)
  • examples/amd/example_amd_flash_attn_bwd.py (2 hunks)
  • examples/amd/example_amd_flash_attn_fwd.py (3 hunks)
  • examples/amd/test.sh (0 hunks)
  • examples/flash_attention/example_mha_bwd.py (0 hunks)
  • src/target/codegen_hip.cc (4 hunks)
  • src/tl_templates/hip/common.h (1 hunks)
  • src/tl_templates/hip/gemm.h (1 hunks)
  • src/tl_templates/hip/hip_fp8.h (2 hunks)
💤 Files with no reviewable changes (2)
  • examples/amd/test.sh
  • examples/flash_attention/example_mha_bwd.py
🔇 Additional comments (11)
examples/amd/example_amd_flash_attn_fwd.py (1)

86-86: Scale change LGTM

Using natural exp with scale=1/sqrt(d) aligns with the reference softmax.
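
For reference, this is the textbook scaled-dot-product attention that the reference implements:

\mathrm{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V .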

examples/amd/example_amd_flash_attn_bwd.py (1)

68-88: Forward kernel (in bwd file) numerics and LSE emission look correct

  • Scaling applied before max and exp; mask preserves -inf; LSE written as log(l_i)+m_i with bounds check. This matches the reference.

If desired, add an assert_allclose against ref_program for LSE too to tighten validation (currently only outputs checked via profiler in main).

Also applies to: 98-107, 176-201, 212-222

src/tl_templates/hip/common.h (1)

119-121: LGTM!

The AtomicAddRet implementation correctly performs an atomic add on a reference and returns the result. This is a useful addition for cases where the return value is needed.

src/target/codegen_hip.cc (4)

44-54: LGTM: Extended FP8 type support.

The additions correctly map new FP8 variants (kFloat8_e4m3, kFloat8_e4m3b11fnuz, kFloat8_e5m2fnuz, kFloat8_e8m0fnu) to their respective HIP type representations. The consistent pattern of preserving vector-width suffixes ensures proper type handling across all FP8 variants.


55-55: LGTM: Improved error diagnostics.

Including the actual type in the error message will help developers quickly identify unsupported FP8 types during debugging.


966-972: LGTM: HIP-specific control flow additions.

The loop_break emits a standard break statement, and no_set_max_nreg correctly becomes a no-op for HIP since register management is handled differently than in CUDA. Both implementations are appropriate for the HIP backend.


1178-1179: LGTM: Consistent FP8 constant handling.

The extension of FP8 constant detection to include is_float8_e4m3 and is_float8_e4m3fn aligns with the FP8 type handling additions earlier in the file, ensuring all FP8 variants are properly recognized and emitted as fp8_e4_t constants.

src/tl_templates/hip/hip_fp8.h (3)

8-13: LGTM: Added FP8 E5 type aliases.

The additions correctly provide type aliases for E5M2 FP8 types, maintaining consistency with the existing E4M3 aliases. The commented note about E8M0 not being supported in the current HIP version is helpful for future reference.


24-40: LGTM: Enhanced fp8_e4_4_t wrapper API.

The additions provide a complete set of constructors and operators:

  • Default constructor for uninitialized instances
  • Copy constructor from native type for interoperability
  • Constructor from float4 for conversions
  • Conversion operator for seamless usage with native types
  • Assignment operator for value updates

These enhancements improve usability and type safety for FP8 operations.


66-74: LGTM: E5 vector type scaffolding.

The fp8_e5_8_t and fp8_e5_16_t structs follow the same composition pattern as their E4 counterparts, providing consistent vector type support across FP8 variants.

src/tl_templates/hip/gemm.h (1)

73-75: Verify that accumulator clearing is implemented when clear_accum=true.

The comment states "we'll handle it by manually clearing the accumulator," but reviewing the body() and body_rs() methods (lines 158-291), I don't see explicit accumulator initialization code that checks the clear_accum template parameter.

If clear_accum=true is now allowed but not implemented, this could lead to undefined behavior with uninitialized accumulators.

Run the following script to verify if there's accumulator clearing logic elsewhere or if this needs to be implemented:

Comment on lines +328 to +329
T.gemm(K_shared, q_shared, qkT, transpose_B=True, policy=T.GemmWarpPolicy.FullRow)


⚠️ Potential issue | 🔴 Critical

Fix policy reference: use imported GemmWarpPolicy instead of T.GemmWarpPolicy

You import GemmWarpPolicy, but call T.GemmWarpPolicy.*. Unless tilelang.language exposes GemmWarpPolicy, this will raise at JIT time. Align with the forward file usage.

Apply this diff:

-                T.gemm(K_shared, q_shared, qkT, transpose_B=True, policy=T.GemmWarpPolicy.FullRow)
+                T.gemm(K_shared, q_shared, qkT, transpose_B=True, policy=GemmWarpPolicy.FullRow)
@@
-                T.gemm(V_shared, do_shared, dP, transpose_B=True, policy=T.GemmWarpPolicy.FullRow)
+                T.gemm(V_shared, do_shared, dP, transpose_B=True, policy=GemmWarpPolicy.FullRow)
@@
-                T.gemm(p_cast, do_shared, dv, policy=T.GemmWarpPolicy.FullRow)
+                T.gemm(p_cast, do_shared, dv, policy=GemmWarpPolicy.FullRow)
@@
-                T.gemm(p_cast, q_shared, dk, policy=T.GemmWarpPolicy.FullRow)
+                T.gemm(p_cast, q_shared, dk, policy=GemmWarpPolicy.FullRow)

Also applies to: 343-347, 353-354

🤖 Prompt for AI Agents
In examples/amd/example_amd_flash_attn_bwd.py around lines 328-329 (and also
343-347, 353-354), the code references T.GemmWarpPolicy.* while GemmWarpPolicy
is imported directly; replace all uses of T.GemmWarpPolicy with the imported
GemmWarpPolicy (e.g., GemmWarpPolicy.FullRow) to match the forward file and
avoid JIT-time attribute errors, updating each affected invocation to use
GemmWarpPolicy.* consistently.

Comment on lines +113 to +117
// Overload for when the first argument is a value instead of a pointer
template <typename T1, typename T2>
TL_DEVICE void AtomicAdd(T1 address, T2 val) {
atomicAdd(reinterpret_cast<T1 *>(&address), static_cast<T1>(val));
}

⚠️ Potential issue | 🔴 Critical

Critical bug: Atomic operation on local copy is meaningless.

This overload takes address by value (a copy), then takes the address of that local copy to perform the atomic operation. Since the atomic operation is performed on a temporary local variable, the result is discarded when the function returns, making this operation meaningless.

If the intent is to provide a convenience wrapper for value-based atomics, the signature should take a reference or pointer instead. As written, this code will compile but produce incorrect behavior.

Apply this diff to remove the broken overload:

-// Overload for when the first argument is a value instead of a pointer
-template <typename T1, typename T2>
-TL_DEVICE void AtomicAdd(T1 address, T2 val) {
-  atomicAdd(reinterpret_cast<T1 *>(&address), static_cast<T1>(val));
-}

If a value-based atomic is truly needed, clarify the intended semantics and implement accordingly.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Overload for when the first argument is a value instead of a pointer
template <typename T1, typename T2>
TL_DEVICE void AtomicAdd(T1 address, T2 val) {
atomicAdd(reinterpret_cast<T1 *>(&address), static_cast<T1>(val));
}
🤖 Prompt for AI Agents
In src/tl_templates/hip/common.h around lines 113-117, the AtomicAdd overload
takes the address parameter by value and performs atomicAdd on the address of
that local copy, so the atomic operation targets a temporary and has no effect;
remove this broken overload or change its signature to accept a pointer or
reference to the target storage (e.g., T1* or T1&) and use that directly in
atomicAdd, or if a value-based atomic is intended, document and implement the
correct semantics (e.g., return the new value and operate on a shared location)
instead of operating on a local copy.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
examples/amd/example_amd_flash_attn_fwd.py (1)

31-63: Autotune search space broadened; consider tuning bounds for ROCm

num_stages now includes [0, 1]. Good for overlap coverage. If you observe compile/runtime instability on some HIP stacks, you may want to guard stage=1 via device checks or leave a note in the example.
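
A possible device gate (illustrative only; the device-name substrings and the choice to drop stage 1 elsewhere are assumptions, not measured guidance):

import torch

def allowed_num_stages():
    # On HIP builds, optionally restrict software pipelining to GPUs where it is
    # known to be stable; keep the full search space everywhere else.
    if torch.version.hip is not None and torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        if "MI300" not in name and "MI250" not in name:
            return [0]
    return [0, 1]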

examples/amd/example_amd_flash_attn_bwd.py (3)

328-347: Unify policy enum usage to avoid ambiguity

You import GemmWarpPolicy but also use T.GemmWarpPolicy in a few places. Prefer one. Using the imported GemmWarpPolicy everywhere avoids reliance on module re-exports.

-                T.gemm(K_shared, q_shared, qkT, transpose_B=True, policy=T.GemmWarpPolicy.FullRow)
+                T.gemm(K_shared, q_shared, qkT, transpose_B=True, policy=GemmWarpPolicy.FullRow)

-                T.gemm(V_shared, do_shared, dP, transpose_B=True, policy=T.GemmWarpPolicy.FullRow)
+                T.gemm(V_shared, do_shared, dP, transpose_B=True, policy=GemmWarpPolicy.FullRow)

-                T.gemm(p_cast, q_shared, dk, policy=T.GemmWarpPolicy.FullRow)
+                T.gemm(p_cast, q_shared, dk, policy=GemmWarpPolicy.FullRow)

Also applies to: 343-347, 353-358


256-271: Rename ambiguous parameter ‘O’ to satisfy linters and improve clarity

Variable name O triggers E741 (ambiguous name) and reduces readability.

-    def flash_bwd_prep(O: T.Tensor(shape, dtype), dO: T.Tensor(shape, dtype),
+    def flash_bwd_prep(Out: T.Tensor(shape, dtype), dO: T.Tensor(shape, dtype),
                        Delta: T.Tensor([batch, heads, seq_len], accum_dtype)):
@@
-                T.copy(O[bz, by * blk:(by + 1) * blk, bx, k * blk:(k + 1) * blk], o)
+                T.copy(Out[bz, by * blk:(by + 1) * blk, bx, k * blk:(k + 1) * blk], o)

Optionally mirror this rename at the call site for consistency in naming (argument name doesn’t need to match, so no functional impact).


518-534: Remove or underscore unused ‘*_mean_diff’ variables

Avoid RUF059 by ignoring the unused means.

-    dq_close, dq_max_diff, dq_mean_diff = debug_tensor_comparison(
+    dq_close, dq_max_diff, _dq_mean = debug_tensor_comparison(
         dQ_tl, q_ref.grad, "dQ", rtol=0.05, atol=0.05)
@@
-    dk_close, dk_max_diff, dk_mean_diff = debug_tensor_comparison(
+    dk_close, dk_max_diff, _dk_mean = debug_tensor_comparison(
         dK_tl.to(torch.float16), k_ref.grad, "dK", rtol=0.05, atol=0.05)
@@
-    dv_close, dv_max_diff, dv_mean_diff = debug_tensor_comparison(
+    dv_close, dv_max_diff, _dv_mean = debug_tensor_comparison(
         dV_tl.to(torch.float16), v_ref.grad, "dV", rtol=0.05, atol=0.05)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7cd0da9 and c34315c.

📒 Files selected for processing (8)
  • examples/amd/example_amd_flash_attn_bwd.py (2 hunks)
  • examples/amd/example_amd_flash_attn_fwd.py (3 hunks)
  • examples/amd/test.sh (0 hunks)
  • examples/flash_attention/example_mha_bwd.py (0 hunks)
  • src/target/codegen_hip.cc (4 hunks)
  • src/tl_templates/hip/common.h (1 hunks)
  • src/tl_templates/hip/gemm.h (1 hunks)
  • src/tl_templates/hip/hip_fp8.h (2 hunks)
💤 Files with no reviewable changes (2)
  • examples/flash_attention/example_mha_bwd.py
  • examples/amd/test.sh
🧰 Additional context used
🧬 Code graph analysis (3)
src/tl_templates/hip/hip_fp8.h (1)
src/tl_templates/cuda/cuda_fp8.h (1)
  • fp8_e5_2_t (44-47)
src/target/codegen_hip.cc (2)
tilelang/language/builtin.py (2)
  • loop_break (419-422)
  • no_set_max_nreg (160-163)
tilelang/language/customize.py (1)
  • loop_break (67-73)
examples/amd/example_amd_flash_attn_bwd.py (2)
examples/amd/example_amd_flash_attn_fwd.py (4)
  • ref_program (11-28)
  • fast_flashattn (68-205)
  • main (97-203)
  • main (208-238)
examples/attention_sink/example_gqa_sink_bwd_bhsd.py (7)
  • ref_program (403-447)
  • main (450-519)
  • flashattn_bwd_preprocess (148-173)
  • flash_bwd_prep (154-171)
  • flashattn_bwd (213-321)
  • flashattn_bwd_postprocess (188-205)
  • flash_bwd_post (194-203)
🪛 Ruff (0.13.3)
examples/amd/example_amd_flash_attn_bwd.py

257-257: Ambiguous variable name: O

(E741)


518-518: Unpacked variable dq_mean_diff is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


525-525: Unpacked variable dk_mean_diff is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


532-532: Unpacked variable dv_mean_diff is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-metal
  • GitHub Check: format-check
🔇 Additional comments (12)
src/tl_templates/hip/common.h (1)

119-121: LGTM!

The AtomicAddRet function correctly takes a reference and returns the old value from the atomic operation.

src/target/codegen_hip.cc (4)

44-55: LGTM!

The extended FP8 type handling correctly maps additional FP8 variants to their corresponding HIP types (fp8_e4_t, fp8_e5_t, fp8_e8_t). The error message now includes the type information for better diagnostics.


936-940: LGTM!

The reformatted mfma builtin call maintains the same behavior with improved readability.


966-972: LGTM!

The implementation correctly handles:

  • tl::loop_break() by emitting a break; statement
  • tl::no_set_max_nreg() as a no-op (HIP doesn't require explicit register management like CUDA)

Both behaviors are appropriate for the HIP backend.


1178-1179: LGTM!

The consolidated FP8 constant handling correctly combines multiple E4M3 variants (e4m3fnuz, e4m3, e4m3fn) that all map to the same fp8_e4_t type for HIP. This simplification reduces code duplication.

src/tl_templates/hip/hip_fp8.h (3)

8-13: LGTM!

The FP8 E5M2 typedefs are correctly defined using HIP's underlying types, and the comment appropriately documents that E8M0 types are not yet supported in the current HIP version.


53-74: LGTM!

The FP8 E5M2 vector types (fp8_e5_4_t, fp8_e5_8_t, fp8_e5_16_t) follow the same structure and pattern as the existing E4M3 types, providing consistent API and proper alignment attributes.


76-99: LGTM!

The commented-out E8M0 vector types provide a clear template for future implementation when HIP adds support for this FP8 variant.

examples/amd/example_amd_flash_attn_fwd.py (2)

86-86: Scale computation is correct

(1.0 / dim)**0.5 equals 1/sqrt(dim); matches the reference softmax scaling.


172-186: exp2 → exp change is consistent with the new scale

Using T.exp with the scale as written preserves correctness and removes the need for a log2(e) factor. Looks good.

If you previously cached autotune results, clear the cache to avoid mixing exp2/exp kernels during benchmarking.
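
The equivalence being relied on is simply

\exp(x) = \operatorname{exp2}\bigl(x \cdot \log_2 e\bigr), \qquad \log_2 e \approx 1.44269504,

so applying T.exp to the already-scaled argument makes the explicit conversion factor unnecessary.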

examples/amd/example_amd_flash_attn_bwd.py (2)

88-105: Forward kernel: LSE production and scaling look correct

You now emit LSE and compute it as log(l_i) + m_i in the scaled domain. The earlier multiply-by-scale before reduce_max and exp(acc_s - m_i) is consistent and numerically stable.

Please re-run forward correctness with a larger seq_len (e.g., 8192) to sanity-check stability on HIP: rtol=5e-2, atol=5e-2.

Also applies to: 207-223
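
For reference, the emitted quantity is the standard stabilized log-sum-exp (with m_i the running row maximum and l_i the accumulated denominator):

\mathrm{LSE}_i = \log \sum_j e^{s_{ij}} = m_i + \log \sum_j e^{s_{ij} - m_i} = m_i + \log l_i ,

which is exactly the log(l_i) + m_i written by the kernel.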


447-507: End-to-end wiring is clean and debuggable

Nice addition of reference checks, LSE plumbing, and separated pre/post backward stages; improves clarity and makes benchmarking straightforward.

Comment on lines +113 to +117
// Overload for when the first argument is a value instead of a pointer
template <typename T1, typename T2>
TL_DEVICE void AtomicAdd(T1 address, T2 val) {
atomicAdd(reinterpret_cast<T1 *>(&address), static_cast<T1>(val));
}

⚠️ Potential issue | 🔴 Critical

Critical logic error: taking address of pass-by-value parameter.

This overload takes T1 address by value and then uses &address to get a pointer. This retrieves the address of the local copy created for the function parameter, not the address of the caller's original variable. The atomic operation would be performed on the local copy, which is destroyed when the function returns, making the atomic operation completely ineffective.

Either remove this overload entirely if it's not needed, or if the intent was to support non-pointer types, the signature should be AtomicAdd(T1 &address, T2 val) (taking a reference).

Apply this diff to fix the signature to take a reference:

 // Overload for when the first argument is a value instead of a pointer
 template <typename T1, typename T2>
-TL_DEVICE void AtomicAdd(T1 address, T2 val) {
-  atomicAdd(reinterpret_cast<T1 *>(&address), static_cast<T1>(val));
+TL_DEVICE void AtomicAdd(T1 &address, T2 val) {
+  atomicAdd(&address, static_cast<T1>(val));
 }
🤖 Prompt for AI Agents
In src/tl_templates/hip/common.h around lines 113 to 117, the AtomicAdd overload
takes its first parameter by value and then uses &address, which takes the
address of the local copy instead of the caller's variable; change the function
signature to take the address parameter by reference (e.g., T1 &address) so
atomicAdd operates on the caller's memory (or remove the overload if not
needed), and update any callers if necessary to pass lvalues.

Comment on lines +73 to +75
// Note: clear_accum=true is not fully supported in HIP implementation
// but we'll handle it by manually clearing the accumulator
// static_assert(!clear_accum, "clear_accum=true is not supported yet");

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify that clear_accum=true is properly handled.

The static_assert has been removed and replaced with a comment stating that clear_accum will be "manually cleared." However, examining the body() (lines 158-232) and body_rs() (lines 234-291) methods, there is no visible code that clears the accumulator when clear_accum=true.

If clear_accum=true is passed, the accumulator should be zeroed before accumulation begins, but this implementation appears to rely on the caller to pre-zero C_local.

Run the following script to check if clear_accum is handled elsewhere or if callers are responsible for zeroing:


🏁 Script executed:

#!/bin/bash
# Search for clear_accum usage patterns and accumulator initialization
rg -n -C5 'clear_accum.*true|C_local.*=.*0|memset.*C_local' --type=cpp --type=h

Length of output: 3003


Restore or implement clear_accum handling
Either re-enable static_assert(!clear_accum) on lines 73–75 or add code to zero C_local when clear_accum=true, as it’s currently unhandled.

🤖 Prompt for AI Agents
In src/tl_templates/hip/gemm.h around lines 73–75, restore handling for the
clear_accum template flag: either re-enable the static_assert(!clear_accum) to
fail compilation when clear_accum=true, or implement the runtime/compile-time
zeroing of C_local when clear_accum==true. If choosing static_assert, add the
assert back exactly for the HIP path; if choosing implementation, add a guarded
branch (compile-time if constexpr or preprocessor guarded by the template param)
that writes zeros into the accumulator buffer C_local before accumulation
begins, ensuring the zeroing matches the accumulator tile shape and memory
layout and does not break existing non-clear behavior.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
examples/amd/example_amd_flash_attn_bwd.py (1)

1-601: Format examples/amd/example_amd_flash_attn_bwd.py using the project’s formatter (e.g., ruff --fix or black) to resolve CI formatting errors.

src/tl_templates/hip/gemm.h (1)

54-65: Add MfmaTraits specialization for fp8_e5_t
HIP codegen emits fp8_e5_t but src/tl_templates/hip/gemm.h only handles fp8_e4_t. Add the following under #if defined(HIP_FP8_ENABLED):

template <> struct MfmaTraits<fp8_e5_t> {
  template <typename AccType>
  static TL_DEVICE void mfma_op(const fp8_e5_t* b, const fp8_e5_t* a, AccType* c) {
    int64_t a_val = *reinterpret_cast<const int64_t*>(a);
    int64_t b_val = *reinterpret_cast<const int64_t*>(b);
    *c = __builtin_amdgcn_mfma_f32_16x16x32_fp8_fp8(b_val, a_val, *c, 0, 0, 0);
  }
};

Confirm the intrinsic is correct for E5 on your target compiler.

🧹 Nitpick comments (3)
examples/amd/example_amd_flash_attn_bwd.py (2)

257-258: Address ambiguous variable name.

Static analysis flags variable O as ambiguous (single uppercase letter). While this is idiomatic for mathematical notation in attention mechanisms, consider the project's style guidelines.

Based on learnings

If the codebase prefers descriptive names, apply this diff:

-    def flash_bwd_prep(O: T.Tensor(shape, dtype), dO: T.Tensor(shape, dtype),
+    def flash_bwd_prep(Output: T.Tensor(shape, dtype), dOutput: T.Tensor(shape, dtype),
                        Delta: T.Tensor([batch, heads, seq_len], accum_dtype)):

And update references within the function accordingly.


517-517: Remove unused unpacked variables.

Static analysis flags dq_mean_diff, dk_mean_diff, and dv_mean_diff as unpacked but unused.

Apply this diff to suppress warnings:

-    dq_close, dq_max_diff, dq_mean_diff = debug_tensor_comparison(
+    dq_close, dq_max_diff, _ = debug_tensor_comparison(
         dQ_tl, q_ref.grad, "dQ", rtol=0.05, atol=0.05)
     if dq_close:
         print("dQ is correct.")
     else:
         print("dQ mismatch detected.")
 
-    dk_close, dk_max_diff, dk_mean_diff = debug_tensor_comparison(
+    dk_close, dk_max_diff, _ = debug_tensor_comparison(
         dK_tl.to(torch.float16), k_ref.grad, "dK", rtol=0.05, atol=0.05)
     if dk_close:
         print("dK is correct.")
     else:
         print("dK mismatch detected.")
 
-    dv_close, dv_max_diff, dv_mean_diff = debug_tensor_comparison(
+    dv_close, dv_max_diff, _ = debug_tensor_comparison(
         dV_tl.to(torch.float16), v_ref.grad, "dV", rtol=0.05, atol=0.05)

Also applies to: 524-524, 531-531

src/tl_templates/hip/hip_fp8.h (1)

53-75: E5 FP8 vector wrappers: LGTM; consider parity with E4 helpers

Implementation mirrors E4 forms. For API parity, you may add an assignment operator from __hip_fp8x4_e5m2_fnuz like E4_4 has, but optional.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7cd0da9 and 8b345ae.

📒 Files selected for processing (8)
  • examples/amd/example_amd_flash_attn_bwd.py (2 hunks)
  • examples/amd/example_amd_flash_attn_fwd.py (3 hunks)
  • examples/amd/test.sh (0 hunks)
  • examples/flash_attention/example_mha_bwd.py (0 hunks)
  • src/target/codegen_hip.cc (4 hunks)
  • src/tl_templates/hip/common.h (1 hunks)
  • src/tl_templates/hip/gemm.h (1 hunks)
  • src/tl_templates/hip/hip_fp8.h (2 hunks)
💤 Files with no reviewable changes (2)
  • examples/amd/test.sh
  • examples/flash_attention/example_mha_bwd.py
🧰 Additional context used
🧬 Code graph analysis (3)
src/target/codegen_hip.cc (2)
tilelang/language/builtin.py (2)
  • loop_break (419-422)
  • no_set_max_nreg (160-163)
tilelang/language/customize.py (1)
  • loop_break (67-73)
src/tl_templates/hip/hip_fp8.h (1)
src/tl_templates/cuda/cuda_fp8.h (1)
  • fp8_e5_2_t (44-47)
examples/amd/example_amd_flash_attn_bwd.py (4)
examples/amd/example_amd_flash_attn_fwd.py (4)
  • ref_program (11-28)
  • fast_flashattn (68-205)
  • main (97-203)
  • main (208-238)
examples/attention_sink/example_mha_sink_fwd_bhsd.py (3)
  • ref_program (191-236)
  • main (128-185)
  • main (253-312)
examples/attention_sink/example_mha_sink_bwd_bhsd.py (8)
  • ref_program (399-444)
  • main (447-514)
  • get_bwd_configs (11-19)
  • flashattn_bwd_preprocess (145-170)
  • flash_bwd_prep (151-168)
  • flashattn_bwd (210-322)
  • flashattn_bwd_postprocess (185-202)
  • backward (367-391)
tilelang/tileop/gemm/gemm_base.py (3)
  • K (42-43)
  • k_pack (111-112)
  • policy (119-120)
🪛 GitHub Actions: CI Test on Metal
examples/amd/example_amd_flash_attn_bwd.py

[error] 1-1: Reformatted files. Please review and stage the changes.

🪛 Ruff (0.13.3)
examples/amd/example_amd_flash_attn_bwd.py

257-257: Ambiguous variable name: O

(E741)


517-517: Unpacked variable dq_mean_diff is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


524-524: Unpacked variable dk_mean_diff is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


531-531: Unpacked variable dv_mean_diff is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🔇 Additional comments (10)
examples/amd/example_amd_flash_attn_fwd.py (2)

37-37: LGTM! Autotune search space expanded.

Adding num_stages = [0, 1] increases the configuration search space, allowing the autotuner to explore pipelined and non-pipelined execution paths for better performance.


86-86: Consistent exp usage verified. All AMD attention examples use T.exp with scale = (1.0/dim)**0.5; no T.exp2 or hardcoded 1.44269504 found.

examples/amd/example_amd_flash_attn_bwd.py (6)

1-11: LGTM! Imports are appropriate.

The imports include necessary dependencies for the backward pass implementation, including numpy for tensor comparison utilities and time for benchmarking.


13-31: LGTM! Reference implementation returns LSE.

The reference implementation now returns both the output and log-sum-exp (LSE), which is required for the backward pass. The variable naming uses K_ref and V_ref for clarity after the repeat_interleave operation.


182-198: Guard against infinity before exponentiation.

The code correctly checks for -T.infinity values before calling T.exp, preventing numerical overflow and NaN propagation:

if m_prev[i] == -T.infinity(accum_dtype):
    scale_factor[i] = 0.0
else:
    scale_factor[i] = T.exp(m_prev[i] - m_i[i])

This defensive pattern is also applied at lines 194-197. Good practice for numerical stability.


276-365: LGTM! Backward kernel uses atomic operations correctly.

The backward pass implementation:

  1. Correctly computes dV and dK accumulations per block
  2. Uses T.atomic_add for dQ accumulation (line 359) to handle race conditions from multiple blocks writing to the same dQ elements
  3. Uses T.atomic_add for final dV and dK writes (lines 362-363) to handle multi-group scenarios

The use of accum_dtype (float32) for dQ accumulation buffer followed by postprocessing to float16 is a sound approach for numerical stability.


386-425: Comprehensive debug utility.

The debug_tensor_comparison function provides detailed diagnostics including max/mean differences, relative errors, NaN/Inf counts, and per-element analysis. This will be valuable for debugging numerical issues during development.


446-588: Comprehensive test harness with proper verification.

The main function provides:

  1. Forward pass autotuning and verification against PyTorch reference
  2. Backward pass autotuning with preprocessing and postprocessing
  3. Detailed gradient verification with configurable tolerances
  4. Performance benchmarking with proper synchronization
  5. Clear pass/fail reporting

The integration of all components (forward, backward, pre/post-processing) demonstrates a complete end-to-end workflow.

src/target/codegen_hip.cc (1)

966-973: loop_break and no_set_max_nreg lowering: LGTM

Emitting break; and treating no_set_max_nreg() as a no‑op on HIP is fine.

Ensure these intrinsics are only used in statement context (Evaluate) to avoid dangling expressions in value contexts.

src/tl_templates/hip/hip_fp8.h (1)

8-14: E5 FP8 aliases: LGTM

Adds E5M2 aliases consistent with HIP’s fnuz variants.

Confirm target HIP SDK provides __hip_fp8_e5m2_fnuz and __hip_fp8x2_e5m2_fnuz on all CI targets.

Comment on lines +34 to +65
def get_fwd_configs():
    block_M = [32, 64, 128, 256]
    block_N = [32, 64, 128, 256]
    threads = [128, 256, 512]
    num_split_q = [64, 128, 256]
    num_stages = [0, 1]
    enable_rasterization = [True]
    k_pack = [2]
    panel_size = [7, 8, 9, 10]
    qk_coalesced_width = [8]
    v_coalesced_width = [4]

    valid_configs = []

    for m, n, s, t, stages, r, k, p, qkw, vw in itertools.product(block_M, block_N, num_split_q,
                                                                  threads, num_stages,
                                                                  enable_rasterization, k_pack,
                                                                  panel_size, qk_coalesced_width,
                                                                  v_coalesced_width):
        valid_configs.append({
            "block_M": m,
            "block_N": n,
            "num_split_q": s,
            "threads": t,
            "num_stages": stages,
            "enable_rasterization": r,
            "k_pack": k,
            "panel_size": p,
            "qk_coalesced_width": qkw,
            "v_coalesced_width": vw,
        })
    return valid_configs

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Verify configuration consistency with forward pass.

The get_fwd_configs() function generates configurations for forward pass autotuning. Note that panel_size has been expanded to [7, 8, 9, 10] compared to the forward-only file.

Ensure these configurations align with the standalone forward example to maintain consistency across the codebase:


🏁 Script executed:

#!/bin/bash
# Description: Compare forward configs between fwd-only and fwd-bwd examples

echo "=== Forward-only example config ==="
rg -A 10 'def get_configs' examples/amd/example_amd_flash_attn_fwd.py

echo ""
echo "=== Forward-backward example config ==="
rg -A 10 'def get_fwd_configs' examples/amd/example_amd_flash_attn_bwd.py

Length of output: 758


Align panel_size between examples. get_fwd_configs() uses [7, 8, 9, 10], but get_configs() uses [7, 8]; update one to match the other.

🤖 Prompt for AI Agents
In examples/amd/example_amd_flash_attn_bwd.py around lines 34 to 65, panel_size
values differ between get_fwd_configs() ([7, 8, 9, 10]) and get_configs() ([7,
8]); make them consistent by updating the get_configs() panel_size list to match
get_fwd_configs() (use [7, 8, 9, 10]) so both functions produce the same
panel_size options.

Comment on lines +44 to 56
  } else if (type.code() == DataType::kFloat8_e4m3) {
    stream << "fp8_e4" << vec << "_t";
  } else if (type.code() == DataType::kFloat8_e4m3b11fnuz) {
    stream << "fp8_e4" << vec << "_t";
  } else if (type.code() == DataType::kFloat8_e5m2) {
    stream << "fp8_e5" << vec << "_t";
  } else if (type.code() == DataType::kFloat8_e5m2fnuz) {
    stream << "fp8_e5" << vec << "_t";
  } else if (type.code() == DataType::kFloat8_e8m0fnu) {
    stream << "fp8_e8" << vec << "_t";
  } else {
    LOG(FATAL) << "Unsupported FP8 type in HIP codegen";
    LOG(FATAL) << "Unsupported FP8 type in HIP codegen: " << type;
  }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

FP8 E8 mapping unsupported by header; and FP8 E5 constants not handled

  • GetFP8Type maps kFloat8_e8m0fnu → fp8_e8*_t, but hip_fp8.h explicitly comments E8 types as unsupported. This will fail to compile when used.
  • E5 mapping is added here, but PrintConst lacks E5 handling, leading to LOG(FATAL) for FloatImm FP8 E5.

Apply:

-  } else if (type.code() == DataType::kFloat8_e8m0fnu) {
-    stream << "fp8_e8" << vec << "_t";
-  } else {
+  } else {
     LOG(FATAL) << "Unsupported FP8 type in HIP codegen: " << type;

And extend PrintConst (see below comment) to support E5 FP8 constants.

Check if any E8 paths are exercised; if so, either add E8 types to hip_fp8.h guarded by feature detection or drop E8 emission entirely.


🏁 Script executed:

#!/bin/bash
set -e

# Locate hip_fp8.h
HIP_FILE=$(fd hip_fp8.h | head -n1)
echo "=== hip_fp8.h ($HIP_FILE) context ==="
rg -n -C2 'E8' "$HIP_FILE"
rg -n -C2 'unsupported' "$HIP_FILE"

# Inspect PrintConst in codegen_hip.cc for FP8 handling
echo "=== PrintConst FP8 cases in codegen_hip.cc ==="
rg -n -C3 'PrintConst' -m1 src/target/codegen_hip.cc | sed -n '1,20p'
rg -n -C3 'kFloat8' src/target/codegen_hip.cc

Length of output: 408


🏁 Script executed:

#!/bin/bash
set -e

# Locate PrintConst in the repository
echo "=== PrintConst definitions/usages ==="
rg -n 'PrintConst' -C3 .

# Show all FP8 kFloat8 cases with context in codegen_hip.cc
echo "=== FP8 cases in codegen_hip.cc ==="
rg -n -C10 'kFloat8' src/target/codegen_hip.cc

Length of output: 5602


🏁 Script executed:

#!/bin/bash
set -e
sed -n '1171,1271p' src/target/codegen_hip.cc

Length of output: 2899


Remove unsupported FP8 E8 case and add FP8 E5 constant support

  • Drop the kFloat8_e8m0fnu branch since hip_fp8.h (lines 11–13) comments out E8 aliases, causing build failures.
  • Extend PrintConst to handle E5 constants:
    if (op->dtype.is_float8_e5m2() || op->dtype.is_float8_e5m2fnuz()) {
      os << "fp8_e5_t";
      os << '(' << std::scientific << op->value << 'f' << ')';
      return;
    }
-  } else if (type.code() == DataType::kFloat8_e8m0fnu) {
-    stream << "fp8_e8" << vec << "_t";
-  } else {
+  } else {
     LOG(FATAL) << "Unsupported FP8 type in HIP codegen: " << type;
🤖 Prompt for AI Agents
In src/target/codegen_hip.cc around lines 44 to 56, remove the branch that
handles DataType::kFloat8_e8m0fnu (delete the else-if that streams "fp8_e8...")
because HIP headers comment out E8 aliases and this causes build failures;
additionally, extend the PrintConst handling for FP8 constants to detect e5
types (is_float8_e5m2 or is_float8_e5m2fnuz) and emit the constant as "fp8_e5_t"
constructed from the literal value (i.e., print "fp8_e5_t(...f)"), ensuring the
same formatting style as other float constant code paths so E5 constants are
supported.

Comment on lines +1178 to 1181
  } else if (op->dtype.is_float8_e4m3fnuz() || op->dtype.is_float8_e4m3() ||
             op->dtype.is_float8_e4m3fn()) {
    os << "fp8_e4_t";
    os << '(' << std::scientific << op->value << 'f' << ')';

⚠️ Potential issue | 🟠 Major

Add FP8 E5 constant emission to avoid fatal path

Currently only E4 FP8 constants are emitted. Add E5 branch to PrintConst.

-  } else if (op->dtype.is_float8_e4m3fnuz() || op->dtype.is_float8_e4m3() ||
-             op->dtype.is_float8_e4m3fn()) {
+  } else if (op->dtype.is_float8_e4m3fnuz() || op->dtype.is_float8_e4m3() ||
+             op->dtype.is_float8_e4m3fn()) {
     os << "fp8_e4_t";
     os << '(' << std::scientific << op->value << 'f' << ')';
     return;
+  } else if (op->dtype.is_float8_e5m2fnuz()
+             /* add is_float8_e5m2() if available in your DataType */) {
+    os << "fp8_e5_t";
+    os << '(' << std::scientific << op->value << 'f' << ')';
+    return;
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
} else if (op->dtype.is_float8_e4m3fnuz() || op->dtype.is_float8_e4m3() ||
op->dtype.is_float8_e4m3fn()) {
os << "fp8_e4_t";
os << '(' << std::scientific << op->value << 'f' << ')';
} else if (op->dtype.is_float8_e4m3fnuz() || op->dtype.is_float8_e4m3() ||
op->dtype.is_float8_e4m3fn()) {
os << "fp8_e4_t";
os << '(' << std::scientific << op->value << 'f' << ')';
return;
} else if (op->dtype.is_float8_e5m2fnuz()
/* add is_float8_e5m2() if available in your DataType */) {
os << "fp8_e5_t";
os << '(' << std::scientific << op->value << 'f' << ')';
return;
}
🤖 Prompt for AI Agents
In src/target/codegen_hip.cc around lines 1178 to 1181, the PrintConst handling
only emits FP8 E4 constants; add a parallel branch to handle FP8 E5 types. Add
an else-if after the existing E4 check that tests the E5 dtype predicates (the
analogous is_float8_e5... methods), and emit "fp8_e5_t" with the same formatting
as E4 (e.g. os << "fp8_e5_t"; os << '(' << std::scientific << op->value << 'f'
<< ')';). Ensure the new branch mirrors the E4 branch placement and formatting
so FP8 E5 constants are emitted instead of falling into the fatal path.

Comment on lines +113 to +121
// Overload for when the first argument is a value instead of a pointer
template <typename T1, typename T2>
TL_DEVICE void AtomicAdd(T1 address, T2 val) {
  atomicAdd(reinterpret_cast<T1 *>(&address), static_cast<T1>(val));
}

template <typename T1, typename T2> TL_DEVICE T1 AtomicAddRet(T1 &ref, T2 val) {
  return atomicAdd(&ref, static_cast<T1>(val));
}

⚠️ Potential issue | 🔴 Critical

Invalid AtomicAdd overload takes a value; atomics must target real memory

AtomicAdd(T1 address, T2 val) atomically updates a local copy via &address, not the intended memory; this is incorrect and can cause undefined behavior on HIP.

Remove this overload and, if needed, provide a reference overload (non‑returning) alongside AtomicAddRet:

-// Overload for when the first argument is a value instead of a pointer
-template <typename T1, typename T2>
-TL_DEVICE void AtomicAdd(T1 address, T2 val) {
-  atomicAdd(reinterpret_cast<T1 *>(&address), static_cast<T1>(val));
-}
+// Preferred non-returning reference overload (mirrors AtomicAddRet)
+template <typename T1, typename T2>
+TL_DEVICE void AtomicAdd(T1& ref, T2 val) {
+  atomicAdd(&ref, static_cast<T1>(val));
+}

Optionally constrain T1 to HIP‑supported atomic types via static_assert.

🤖 Prompt for AI Agents
In src/tl_templates/hip/common.h around lines 113 to 121, the overload
AtomicAdd(T1 address, T2 val) takes the address by value and does atomicAdd on
&address (a local copy) which is invalid; remove this value-taking overload,
replace with a non-returning reference overload AtomicAdd(T1 &address, T2 val)
that calls atomicAdd(&address, static_cast<T1>(val)), and (optionally) add a
static_assert or type trait to constrain T1 to HIP-supported atomic types to
catch misuse at compile time.

Comment on lines +73 to +75
// Note: clear_accum=true is not fully supported in HIP implementation
// but we'll handle it by manually clearing the accumulator
// static_assert(!clear_accum, "clear_accum=true is not supported yet");

⚠️ Potential issue | 🔴 Critical

Do not drop support guard without implementing clear_accum behavior

Removing the static_assert enables clear_accum=true, but no code zeros the accumulators, so the results are nondeterministic.

Add zeroing when clear_accum is true. Example insertion at the start of body() and body_rs() before the ki loop:

if constexpr (clear_accum) {
  // Zero every per-warp accumulator fragment before the ki loop runs.
  for (int i = 0; i < warp_rows; ++i) {
    for (int j = 0; j < warp_cols; ++j) {
      ((float32x4*)C_local)[i * warp_cols + j] = 0;
    }
  }
}

Optionally re‑enable the static_assert if you cannot guarantee correct zeroing on all code paths.

🤖 Prompt for AI Agents
In src/tl_templates/hip/gemm.h around lines 73-75, the static_assert for
clear_accum was removed but no accumulator zeroing was added, so enabling
clear_accum=true yields nondeterministic results; add a constexpr guard at the
start of both body() and body_rs() (before the ki loop) that zeroes the per-warp
accumulator memory when clear_accum is true (iterate warp_rows and warp_cols and
set the corresponding C_local entries to zero, e.g., by casting C_local to the
appropriate vector type and writing zeros), and if you cannot guarantee zeroing
on all code paths re-enable the static_assert to prevent enabling clear_accum
without proper initialization.

LeiWang1999 previously approved these changes Oct 10, 2025

@LeiWang1999 (Member) commented:

@Alex4210987 would be great if we can add some test case for fp8 dtype for hip backend.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
examples/amd/example_amd_flash_attn_bwd.py (3)

353-354: Fix policy reference: use imported GemmWarpPolicy instead of T.GemmWarpPolicy

Line 353 uses T.GemmWarpPolicy.FullRow. Change this to GemmWarpPolicy.FullRow for consistency with the import and forward kernel.

Apply this diff:

-                T.gemm(p_cast, q_shared, dk, policy=T.GemmWarpPolicy.FullRow)
+                T.gemm(p_cast, q_shared, dk, policy=GemmWarpPolicy.FullRow)

328-329: Fix policy reference: use imported GemmWarpPolicy instead of T.GemmWarpPolicy

Line 328 uses T.GemmWarpPolicy.FullRow, but GemmWarpPolicy is imported at line 5. Unless tilelang.language re-exports GemmWarpPolicy, this will raise an AttributeError at JIT time. The forward kernel (lines 173, 205) correctly uses the imported GemmWarpPolicy.

Apply this diff:

-                T.gemm(K_shared, q_shared, qkT, transpose_B=True, policy=T.GemmWarpPolicy.FullRow)
+                T.gemm(K_shared, q_shared, qkT, transpose_B=True, policy=GemmWarpPolicy.FullRow)

343-347: Fix policy reference: use imported GemmWarpPolicy instead of T.GemmWarpPolicy

Lines 343 and 346 use T.GemmWarpPolicy.FullRow. This needs to be changed to GemmWarpPolicy.FullRow to match the import at line 5 and the forward kernel usage.

Apply this diff:

-                T.gemm(V_shared, do_shared, dP, transpose_B=True, policy=T.GemmWarpPolicy.FullRow)
+                T.gemm(V_shared, do_shared, dP, transpose_B=True, policy=GemmWarpPolicy.FullRow)
 
                 T.copy(P_acc, p_cast)
-                T.gemm(p_cast, do_shared, dv, policy=T.GemmWarpPolicy.FullRow)
+                T.gemm(p_cast, do_shared, dv, policy=GemmWarpPolicy.FullRow)
🧹 Nitpick comments (3)
examples/amd/example_amd_flash_attn_bwd.py (3)

249-273: LGTM! Preprocessing logic is correct.

The preprocessing correctly computes delta = sum(O * dO) across the dimension axis, which is needed for the backward pass. The implementation matches established patterns in similar attention kernels.

Optional: Consider renaming variable O to avoid ambiguity.

Static analysis tools flag O as an ambiguous variable name (Ruff E741). While single-letter uppercase names are a common convention in attention implementations (matching mathematical notation), you could optionally rename O to output or O_tensor for improved clarity.
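For reference, the delta term can be cross-checked with a plain PyTorch computation; a minimal sketch, assuming O and dO are [batch, heads, seq_len, dim] tensors (names illustrative):

import torch

def reference_delta(O: torch.Tensor, dO: torch.Tensor) -> torch.Tensor:
    # delta[b, h, s] = sum over d of O[b, h, s, d] * dO[b, h, s, d]
    return (O.float() * dO.float()).sum(dim=-1)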


518-519: Fix unused variable: replace dq_mean_diff with underscore.

The dq_mean_diff variable is unpacked but never used. Replace it with _ to indicate it's intentionally unused.

Apply this diff:

-    dq_close, dq_max_diff, dq_mean_diff = debug_tensor_comparison(
+    dq_close, dq_max_diff, _ = debug_tensor_comparison(
         dQ_tl, q_ref.grad, "dQ", rtol=0.05, atol=0.05)

Also applies to: 525, 532


34-65: Align panel_size across examples
The forward-only example (example_amd_flash_attn_fwd.py) uses panel_size = [7, 8] while the forward-backward example (example_amd_flash_attn_bwd.py) uses [7, 8, 9, 10]. Update one to match the other or document why they differ.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b345ae and cd1564d.

📒 Files selected for processing (1)
  • examples/amd/example_amd_flash_attn_bwd.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/amd/example_amd_flash_attn_bwd.py (6)
examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (2)
  • main (141-204)
  • main (389-465)
examples/attention_sink/example_gqa_sink_bwd_bhsd.py (7)
  • main (450-519)
  • get_bwd_configs (11-19)
  • flashattn_bwd_preprocess (148-173)
  • flash_bwd_prep (154-171)
  • flashattn_bwd (213-321)
  • flashattn_bwd_postprocess (188-205)
  • flash_bwd_post (194-203)
tilelang/autotuner/tuner.py (1)
  • autotune (712-805)
tilelang/jit/__init__.py (1)
  • jit (244-317)
tilelang/language/__init__.py (1)
  • use_swizzle (99-105)
tilelang/profiler/__init__.py (1)
  • assert_allclose (76-145)
🪛 Ruff (0.14.0)
examples/amd/example_amd_flash_attn_bwd.py

257-257: Ambiguous variable name: O

(E741)


518-518: Unpacked variable dq_mean_diff is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


525-525: Unpacked variable dk_mean_diff is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


532-532: Unpacked variable dv_mean_diff is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: format-check
🔇 Additional comments (9)
examples/amd/example_amd_flash_attn_bwd.py (9)

5-10: LGTM! Import additions are appropriate.

The new imports (GemmWarpPolicy, itertools, partial, numpy, time) are all used appropriately throughout the file for GEMM policies, config generation, partial function application, numerical operations, and benchmarking.


13-31: LGTM! Reference implementation is correct.

The reference attention computation correctly implements:

  • Grouped attention via repeat_interleave
  • Proper scaling by 1/sqrt(dim)
  • Causal masking when enabled
  • Log-sum-exp computation for use in backward pass
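For illustration only (not the file's exact code), a grouped-attention reference along these lines can be sketched as follows; tensor layouts and names are assumptions:

import math
import torch
import torch.nn.functional as F

def ref_attention(Q, K, V, causal: bool, groups: int):
    # Q: [B, HQ, S, D]; K, V: [B, HK, S, D] with HQ = HK * groups
    K = K.repeat_interleave(groups, dim=1)  # grouped attention
    V = V.repeat_interleave(groups, dim=1)
    scores = Q @ K.transpose(-2, -1) * (1.0 / math.sqrt(Q.shape[-1]))
    if causal:
        S = Q.shape[-2]
        mask = torch.triu(torch.ones(S, S, dtype=torch.bool, device=Q.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    lse = torch.logsumexp(scores, dim=-1)  # saved for the backward pass
    return F.softmax(scores, dim=-1) @ V, lse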

68-223: LGTM! Forward kernel implementation is correct.

The forward pass kernel correctly implements FlashAttention-v2 with:

  • Proper online softmax computation with m_i and l_i tracking for numerical stability
  • Correct use of imported GemmWarpPolicy.FullRow for GEMM operations
  • Causal masking support
  • Grouped attention handling
  • Safe division by l_i with fallback to 1.0 when l_i <= 1e-6
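The m_i / l_i bookkeeping referred to above follows the standard online-softmax recurrence; a tile-level sketch in plain PyTorch terms (not the kernel code itself; scores are assumed pre-scaled by 1/sqrt(dim)):

import torch

def online_softmax_update(m_i, l_i, acc, scores, v_block):
    # scores: [block_M, block_N] logits for the current K/V tile
    m_new = torch.maximum(m_i, scores.max(dim=-1).values)  # running row max
    scale = torch.exp(m_i - m_new)                          # rescale previous state
    p = torch.exp(scores - m_new[:, None])                  # tile probabilities
    l_new = l_i * scale + p.sum(dim=-1)                     # running denominator
    acc = acc * scale[:, None] + p @ v_block                # running numerator
    return m_new, l_new, acc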

226-246: LGTM! Backward config generation is comprehensive.

The backward pass autotune configurations provide good search space coverage with various block sizes, thread counts, pipeline stages, and panel sizes.
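Such a search space is typically built as a Cartesian product of the tunable parameters; a minimal sketch with illustrative (not the file's actual) value lists:

import itertools

def get_bwd_configs():
    block_M = [32, 64, 128]
    block_N = [32, 64]
    threads = [128, 256]
    num_stages = [0, 1, 2]
    panel_size = [7, 8, 9, 10]
    return [
        dict(block_M=m, block_N=n, threads=t, num_stages=s, panel_size=p)
        for m, n, t, s, p in itertools.product(block_M, block_N, threads, num_stages, panel_size)
    ]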


368-383: LGTM! Postprocessing logic is correct.

The postprocessing correctly converts dQ from accumulator dtype (float32) to output dtype (float16) via a simple copy operation.


386-425: LGTM! Comprehensive tensor comparison utility.

The debug function provides detailed diagnostics including:

  • Absolute and relative differences
  • NaN and Inf detection
  • Statistics (max, mean, std)
  • Position of maximum difference

This is valuable for debugging attention implementations.
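A comparison utility along those lines can be sketched as follows (a simplified stand-in for the example's function, returning the same (close, max_diff, mean_diff) triple):

import torch

def debug_tensor_comparison(actual, expected, name, rtol=0.05, atol=0.05):
    diff = (actual.float() - expected.float()).abs()
    rel = diff / expected.float().abs().clamp_min(1e-8)
    print(f"[{name}] max_abs={diff.max().item():.3e} mean_abs={diff.mean().item():.3e} "
          f"max_rel={rel.max().item():.3e} "
          f"nan={torch.isnan(actual).any().item()} inf={torch.isinf(actual).any().item()}")
    close = torch.allclose(actual.float(), expected.float(), rtol=rtol, atol=atol)
    return close, diff.max().item(), diff.mean().item()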


428-444: LGTM! Benchmarking implementation is correct.

The benchmarking function correctly:

  • Uses warmup iterations to eliminate cold-start effects
  • Measures median latency (robust to outliers)
  • Includes CUDA synchronization for accurate GPU timing
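The pattern described above (warmup, device synchronization, median of repeated timings) can be sketched as follows (function and parameter names are illustrative):

import time
import torch

def benchmark(fn, warmup=10, iters=100):
    for _ in range(warmup):  # discard cold-start runs
        fn()
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        torch.cuda.synchronize()  # wait for the GPU before reading the clock
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2] * 1e3  # median latency in milliseconds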

447-589: LGTM! Main function provides comprehensive testing and benchmarking.

The main function correctly:

  • Performs autotuning for both forward and backward passes
  • Verifies correctness against reference implementation with appropriate tolerances
  • Benchmarks performance and computes speedup
  • Provides detailed diagnostics via debug_tensor_comparison

592-602: LGTM! CLI interface is well-designed.

The argument parser provides:

  • Clear help messages
  • Appropriate default values for testing (batch=1, heads=8, seq_len=1024, dim=64)
  • Support for causal attention and grouped attention
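A CLI with those defaults would look roughly like the following sketch (argument names are assumptions based on the description above):

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="FlashAttention forward/backward example")
    parser.add_argument("--batch", type=int, default=1, help="batch size")
    parser.add_argument("--heads", type=int, default=8, help="number of query heads")
    parser.add_argument("--seq_len", type=int, default=1024, help="sequence length")
    parser.add_argument("--dim", type=int, default=64, help="head dimension")
    parser.add_argument("--is_causal", action="store_true", help="enable causal masking")
    parser.add_argument("--groups", type=int, default=1, help="query heads per key/value head")
    return parser.parse_args()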

@LeiWang1999 LeiWang1999 changed the title fix bug&add amd examples [AMD] fix bug&add amd fp8 examples Oct 15, 2025
@LeiWang1999 LeiWang1999 merged commit 80665cd into tile-ai:main Oct 15, 2025
8 checks passed
RubiaCx added a commit to RubiaCx/tilelang that referenced this pull request Oct 20, 2025
commit b2acfc3
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Sun Oct 19 22:08:13 2025 +0800

    [Benchmark] Add matmul FP16 benchmark results (tile-ai#1067)

commit 17bd0a6
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Sun Oct 19 17:34:12 2025 +0800

    [Enhancement] Deprecate split&sum in attn bwd examples on Hopper and migrate to vectorized atomic add (tile-ai#1065)

commit ae9a6f0
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Sun Oct 19 15:45:58 2025 +0800

    [Refactor][Example] Update linear attention examples and add tests (tile-ai#1010)

    * [Refactor][Example] Update linear attention examples and add tests

    - Refactored the backward and forward linear attention kernels to use shared memory and atomic additions for improved performance.
    - Introduced L2 normalization in the main functions of both examples.
    - Added a new test suite for the linear attention examples to ensure correctness and performance.
    - Updated argument parsing in the main functions for better usability.

    * upd docstring for tma atomic add

    * lint

    * Add flash-linear-attention dependency to requirements.txt

    * Rename main function to chunk_linear_attn_bwd

    * Rename main function to chunk_linear_attn_fwd

    * chore

    ---------

    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

commit b7dfdb3
Author: Xuehai Pan <XuehaiPan@pku.edu.cn>
Date:   Sun Oct 19 12:16:41 2025 +0800

    [Misc] Add GitHub issue templates (tile-ai#1057)

commit fb8b3af
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Sun Oct 19 12:15:44 2025 +0800

    [Benchmark] Add H800 SXM Benchmark results (tile-ai#1063)

    * Add document PYTHONPATH build path

    * update fp8 benchmark result

    * remove redpath

    * remove path

    * tflops fix

commit 4ca6c13
Author: Yuqi Dong <134183314+yyttt6@users.noreply.github.com>
Date:   Sun Oct 19 02:43:00 2025 +0800

    [CI]:Reduce test shapes to avoid OOM errors during CI. (tile-ai#1060)

    * [CI]:Reduce test shapes to avoid OOM errors during CI.

    * rabbit

    * Increase number of processes for pytest from 2 to 4

    ---------

    Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

commit 759c2e3
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Sun Oct 19 00:35:06 2025 +0800

    [DOC] Add document for develop with PYTHONPATH (tile-ai#1062)

commit bf2de5b
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Sun Oct 19 00:21:59 2025 +0800

    Making version parser more robust against missing or unavailable metadata (tile-ai#1061)

commit 7211164
Author: Chaofan Lin <linchaofan@bytedance.com>
Date:   Fri Oct 17 20:56:01 2025 +0800

    [Refactor] Refactor Pass `LegalizeSafeMemoryAccess` to support recursive load/store rewrite (tile-ai#1050)

    * [Refactor] Refactor Pass `LegalizeSafeMemoryAccess` to support recursive load/store rewrite

    * lint

    * recursive collect conds for call_extern

    * fix name

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * lint

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * lint

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * address comment

    * rename pad_value to safe_value

    * lint

    * add oob store test

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * fix

    * fix

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit 278c0fb
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Fri Oct 17 18:32:43 2025 +0800

    [Enhancement] Introduce a workaround for layout inference for local buffer store (tile-ai#1055)

    * [Enhancement] Improve layout inference for local buffer handling in parallel operations

    * Added logic to check if a loop only manipulates "local" buffers, which affects thread binding decisions.
    * Updated the condition for determining parallel loop execution to account for local buffer stores.
    * Cleaned up comments for clarity and future considerations.

    * [Refactor] Clean up parallel loop condition formatting in layout inference

    * Reformatted the condition for determining parallel loop execution for better readability.
    * Maintained existing logic while enhancing code clarity for future modifications.

    ---------

    Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

commit 37b3dbd
Author: LJC00118 <77378439+LJC00118@users.noreply.github.com>
Date:   Fri Oct 17 17:15:59 2025 +0800

    [Enhancement] Improve CUDA compiler detection in CMake (tile-ai#1054)

    * improve CUDA compiler detection in CMake

    * Minor fix

commit 1281d6f
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Fri Oct 17 13:44:08 2025 +0800

    [CI] Disable autofix for pre-commit CI (tile-ai#1053)

commit 35cf888
Author: LJC00118 <77378439+LJC00118@users.noreply.github.com>
Date:   Fri Oct 17 13:43:08 2025 +0800

    [Enhancement] Remove constraint requiring last dimension stride to be 1 (tile-ai#1040)

    * remove last dimension stride must be 1 constraint

    * add vectorize test

    * minor fix

    * [Lint]: [pre-commit.ci] auto fixes [...]

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit fd1493b
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Fri Oct 17 11:34:35 2025 +0800

    Automatically initialize submodule if missing (tile-ai#1052)

commit cc00fb6
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Fri Oct 17 11:28:14 2025 +0800

    [Enhancement] Add support for symbolic dimensions in Cython kernel adapter and improve static shape validation in wrapper (tile-ai#1024)

    * [Enhancement] Add support for symbolic dimensions in Cython kernel adapter and improve static shape validation in wrapper

    * [BugFix] Fix shape mismatch and deprecate `T.if()` in fused_moe example

    * [Fix] Add `is_symbolic_expr` function to check for symbolic expressions in TIR

    - Introduced a new utility function `is_symbolic_expr` to determine if an expression is a symbolic expression, enhancing type checking capabilities.
    - Updated shape handling in `CythonKernelAdapter` to utilize the new function, improving handling for symbolic shapes.

commit a79bc5c
Author: Xuehai Pan <XuehaiPan@pku.edu.cn>
Date:   Thu Oct 16 20:38:23 2025 +0800

    [CI] Fix ROCm CI (tile-ai#1043)

    * [CI] fix ROCm CI

    * feat: add a hook to error out on no test runs

commit 1f4ffdb
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Thu Oct 16 17:53:45 2025 +0800

    [Bugfix] Improves compatibility when checking for MPS availability in different PyTorch builds. (tile-ai#1051)

commit e3742d3
Author: Yichen Yan <wenji.yyc@alibaba-inc.com>
Date:   Thu Oct 16 15:52:10 2025 +0800

    Allow mma gemm for all cuda (tile-ai#1047)

commit 0ff4f42
Author: Yuqi Dong <134183314+yyttt6@users.noreply.github.com>
Date:   Thu Oct 16 12:41:09 2025 +0800

    [Feature]: Add test for atomicadd auto vectorize and remove useless code (tile-ai#1019)

    * update

    * format

    * rabbit

commit bd1c7b3
Author: Yu Cheng <54519279+chengyupku@users.noreply.github.com>
Date:   Thu Oct 16 02:52:35 2025 +0800

    [Refactor] Use `has_simt_copy` to decide whether to insert `set_max_nreg` (tile-ai#982)

commit 8f001e0
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Thu Oct 16 01:10:28 2025 +0800

    [BugFix] Phaseout dependency of Triton in sink examples to make CI happy (tile-ai#1045)

    * [BugFix] Phaseout dependency of Triton in sink examples to make CI happy

    - Added `benchmark_gqa_sink_fwd.py` and `benchmark_mha_sink_fwd.py` to evaluate performance of GQA and MHA attention mechanisms using Triton.
    - Refactored existing attention sink implementations to remove Triton kernel definitions from the reference programs, streamlining the code.
    - Updated input generation and benchmarking logic to enhance configurability and performance measurement.
    - Improved overall structure and organization of the examples for better clarity and usability.

    * [Lint]: [pre-commit.ci] auto fixes [...]

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit 8ce2778
Author: Xuehai Pan <XuehaiPan@pku.edu.cn>
Date:   Wed Oct 15 22:12:41 2025 +0800

    [CI][Refactor] Merge test CI workflow files into one (tile-ai#973)

    * refactor: merge test CI workflow files into one

    * chore: set `UV_INDEX_STRATEGY=unsafe-best-match`

    * feat: add AST test with Python 3.8

    * feat: implement manual caching mechanism for self-hosted runners

    * refactor: simplify cache logic for self-hosted runners

    * chore: clear uv cache on failure

    * chore: print format.sh output to logs

    * chore: improve uv caching

    * chore: disable parallel test

    * chore: use `PYTHONDEVMODE=1` in CI

    * feat: enable coredump generation

    * fix: fix perfbench condition

    * Revert "feat: enable coredump generation"

    This reverts commit c52da65.

    * chore: move example CI down

    * Revert "chore: move example CI down"

    This reverts commit 9d8e650.

    * chore: skip example `test_example_mha_sink_bwd_bhsd`

    * chore: skip example `test_example_gqa_sink_bwd_bhsd`

    * fix: fix example argument passing

    * fix: loosen test criteria

    * chore: rename `CMAKE_CONFIGURE_OPTIONS` -> `CLANG_TIDY_CMAKE_OPTIONS` for clarity

    * feat: enable parallel testings

    * chore: update pytest options

    * remove skipped test as now been resolved

    * chore: empty commit to re-trigger ci

    * test for n 1

    * chore: remove ` --numprocesses=1` option in example

    * chore: disable failfast

    * chore: update cibw selection

    * fix: fix git submodule clone

    * chore: update cibw commands

    * fix: fix yapf multiprocessing

    * chore: setup ccache for CIBW on macOS only

    * chore: update comments

    * chore: update artifact listing

    * fix: do not fail if not found nvcc in PATH

    * fix: fix flash-attn installation

    * chore: update dist workflow trigger

    * chore: remove outdated comments

    * chore(workflows/dist): simplify build matrix strategy

    * fix: fix CUDA path finding

    * fix: fix CUDA path finding

    * chore: imcrease CI timeout

    * ci: disable failfast

    * fix: hide path prefix

    * chore: more verbose

    * chore: disable PR trigger for dist workflow

    * fix: seed for tests

    * fix: use nightly torch for ROCm tests

    * chore: enable PR trigger for dist workflow

    * chore: stop uploading debug wheels as artifacts in PR

    * chore: do not run workflows in forks

    * chore: housekeep requirements

    * chore: use Nightly-ROCm-6.3 for CI

    * chore: use Nightly-ROCm-6.4 for CI

    * Update ROCm toolkit version to 7.0

    * chore: restore previous rocm-ci.yml for test

    * fix: cleanup PYTHONPATH

    * chore: remove previous rocm-ci.yml

    * ci fix

    * chore: remove previous rocm-ci.yml

    * chore: enable parallel example run

    ---------

    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    Co-authored-by: alex_xiao <xinyuxiao2024@gmail.com>

commit 80665cd
Author: alex_xiao <xinyuxiao2024@gmail.com>
Date:   Wed Oct 15 21:17:14 2025 +0800

    fix bug&add amd examples (tile-ai#966)

    * [Enhancement] Refactor buffer index handling for improved precision and clarity (tile-ai#668)

    - Enhanced buffer index handling to address precision issues by removing redundant operations.
    - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
    - Updated related documentation to reflect changes in buffer management practices.

    * Remove obsolete test script for AMD example, streamlining the examples directory.

    * Remove unused dtype_size variable in AMD example script to streamline code.

    * Add input configuration file and update AMD example script for enhanced flexibility

    - Introduced a new input.txt file for configurable parameters.
    - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
    - Streamlined the main function for better clarity and organization.
    - Added a new test script to facilitate running the example with specified parameters.

    * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

    - Deleted input.txt and test.sh files as they are no longer needed.
    - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
    - Reintroduced swizzle usage in the kernel for better performance.

    * Refactor AMD example script for FlashAttention-2

    - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
    - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
    - Removed outdated comments and improved code organization for better readability.

    * Refactor formatting in AMD FlashAttention example script

    - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
    - Streamlined the `main` function parameter formatting for consistency.
    - Removed unnecessary blank lines to enhance overall code organization.

    * Update example_amd_flash_attn_fwd.py

    * Enhance AMD example script and update CI workflows

    - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
    - Added new CI workflows for AMD and documentation publishing.
    - Updated various requirements files to include necessary dependencies.
    - Introduced new test cases and examples for better coverage and functionality.
    - Refactored existing code for improved readability and maintainability.

    * Remove redundant tool cache cleanup step in AMD CI workflow

    * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

    * Add new AMD FlashAttention example and test script

    - Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
    - Added `test.sh` script to facilitate running the new example with specified parameters.
    - Enhanced the overall structure and organization of the example for better clarity and usability.

    * Update configurations in `example_amd_flash_attn_fwd.py` for autotuner

    - Reduced the number of threads and `num_split_q` options for improved performance.
    - Adjusted `panel_size` options to streamline configuration settings.

    * Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217

    * Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c

    * Add example for AMD Flash Attention backward pass implementation

    - Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
    - Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
    - Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
    - Included reference implementation for validation against PyTorch's attention mechanism.

    This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.

    * Enhance AMD Flash Attention example with additional testing capabilities

    - Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
    - Improved the main function to allow for better parameter configuration and benchmarking.
    - Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.

    This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.

    * Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

    * Refactor HIP intrinsic rules to CUDA

    - Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
    - Adjusted include paths for better organization and clarity in the code structure.

    * Update AMD CI workflow to uninstall specific PyTorch packages before installation

    - Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
    - Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.

    * Remove unused shared memory allocations in AMD Flash Attention backward example

    - Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
    - This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.

    * Remove unnecessary pip uninstall command from AMD CI workflow

    - Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
    - This change simplifies the CI process and reduces potential overhead during package management.

    * Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules

    - Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
    - Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.

    * Refactor formatting of HIP intrinsic rule registrations

    - Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
    - No functional changes were made; this update focuses on code style improvements to enhance maintainability.

    * Update file name and documentation for HIP intrinsic rules

    - Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
    - Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.

    * Enhance DispatchHIPShuffle function with clang-analyzer comments

    - Added NOLINTBEGIN and NOLINTEND comments to the DispatchHIPShuffle function to suppress clang-analyzer warnings related to inner pointer usage.
    - This change improves code clarity and maintains compliance with static analysis tools.

    * lint fix

    * fix

    * Enhance autotuner configurations in example_amd_flash_attn_fwd.py by adding new block sizes, stages, and panel sizes. Update test script to use relative Python path and adjust parameters for consistency.

    * Add backward attention example to test script

    - Extended the test.sh script to include a new backward attention example using example_amd_flash_attn_bwd.py.
    - Added parameters for batch size, context length, and head dimensions to ensure consistency with the forward example.
    - Updated the command for the backward tile example to match the new configuration.

    * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py

    - Introduced new functions for forward and backward configurations to enhance autotuning capabilities.
    - Updated the FlashAttention forward and backward functions to improve performance and maintainability.
    - Adjusted test script parameters for consistency and clarity, including the addition of group handling.
    - Enhanced the autotuner configurations by refining block sizes and stages for better performance tuning.
    - Updated the main function to reflect changes in parameter names and types for better usability.

    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

    - Updated the backward function to return additional outputs, including log-sum-exp (LSE) values for improved gradient calculations.
    - Refined autotuner configurations by adding new block sizes and adjusting parameters for better performance tuning.
    - Improved shared memory usage in the backward pass to optimize memory access patterns and enhance computational efficiency.
    - Updated the main function to reflect changes in parameter handling and ensure consistency with the forward pass.
    - Enhanced correctness checks in the main function to include LSE validation alongside gradient checks.

    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

    - Introduced a scaling factor for improved numerical stability in gradient calculations.
    - Optimized shared memory usage by adding new shared buffers for intermediate calculations.
    - Refined the handling of tensor fragments to improve performance and maintainability.
    - Updated the main function to ensure compatibility with the new output parameters for backward operations.
    - Removed unnecessary parameters from the test script to streamline execution.

    * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_mha_bwd.py

    - Updated the forward and backward functions to improve numerical stability and performance.
    - Enhanced shared memory usage by optimizing buffer allocations and reducing unnecessary parameters.
    - Adjusted autotuner configurations for better performance tuning and compatibility with new output parameters.
    - Added debugging and benchmarking functions for improved correctness verification and performance analysis.
    - Updated the main function to reflect changes in parameter handling and ensure consistency across examples.

    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

    - Updated scaling factor application for improved numerical stability in gradient calculations.
    - Refined tensor handling to ensure consistency with forward pass operations.
    - Optimized atomic operations for writing gradients to dK and dV using fp32 for better precision.
    - Adjusted comments for clarity and alignment with standard implementation practices.

    * Expand autotuner configurations in example_amd_flash_attn_bwd.py and update test.sh

    - Increased the range of block sizes and stages for forward and backward configurations to enhance performance tuning.
    - Adjusted the test script to include additional parameters for batch size and head dimensions, ensuring consistency with the forward example.
    - Improved comments for clarity and alignment with the updated configurations.

    * Enhance performance calculations and benchmarking in example_amd_flash_attn_bwd.py

    - Updated FLOPs calculation to account for both forward and backward passes, clarifying the total computational cost.
    - Modified benchmarking functions to evaluate the complete forward and backward performance of both reference and Tile-lang implementations.
    - Improved comments for better understanding of the performance metrics and implementation details.
    - Removed unnecessary parameter from test.sh to streamline execution.

    * Remove forward attention test commands from test.sh and retain backward attention execution for streamlined testing.

    * Refactor FlashAttention forward and backward implementations in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py

    - Updated the forward function to return both output and log-sum-exp (LSE) values for improved gradient calculations.
    - Enhanced autotuner configurations for forward pass, including new parameters for better performance tuning.
    - Refined scaling factor calculations for numerical stability in both forward and backward passes.
    - Improved comments and documentation for clarity and consistency across implementations.
    - Adjusted main function to reflect changes in parameter handling and ensure compatibility with new output requirements.

    * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py

    - Removed outdated comments and improved clarity in the code.
    - Enhanced the forward function to consistently return output and log-sum-exp (LSE) values.
    - Updated autotuner configurations to include new parameters for better performance tuning.
    - Refined tensor handling and scaling factor calculations for improved numerical stability.
    - Adjusted the main function to ensure compatibility with updated output requirements and parameter handling.

    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

    - Updated configuration parameters for backward calculations, including new options for block sizes, threads, and rasterization.
    - Added new parameters (k_pack, qk_coalesced_width, v_coalesced_width) to improve performance tuning and memory access patterns.
    - Modified tensor copy operations to utilize coalesced widths for optimized memory loads.
    - Enhanced GEMM operations with k_pack for improved computational efficiency.
    - Refined the configuration generation logic to accommodate the new parameters, ensuring comprehensive coverage for backward pass scenarios.

    * Refactor configuration and tensor operations in example_amd_flash_attn_bwd.py

    - Updated backward configuration parameters to include larger block sizes and a wider range of threads for enhanced performance tuning.
    - Removed unnecessary parameters (k_pack, qk_coalesced_width, v_coalesced_width) from function signatures and tensor operations to simplify the implementation.
    - Optimized tensor copy operations by eliminating coalesced width specifications, streamlining memory access patterns.
    - Adjusted GEMM operations to improve computational efficiency without the use of k_pack.

    * Enhance HIP code generation and FP8 type support

    - Added support for additional FP8 types (e4m3, e4m3b11fnuz, e5m2fnuz, e8m0) in codegen_hip.cc to improve compatibility.
    - Updated error logging to include unsupported FP8 type details for better debugging.
    - Implemented handling for loop break and no-op register management in HIP within VisitExpr_ method.
    - Introduced new FP8 vector types (e5 and e8) in hip_fp8.h for enhanced functionality.
    - Added overloads for AtomicAdd in common.h to support both pointer and value arguments.

    * Enhance FP8 type support and clarify accumulator handling in HIP

    - Expanded FP8 type support in codegen_hip.cc to include additional float8 formats.
    - Updated gemm.h to clarify the handling of the accumulator when clear_accum is true.
    - Added comments in hip_fp8.h to indicate that E8M0 types are not supported in the current HIP version.

    * Remove deprecated files and update print statements for clarity in example_amd_flash_attn_bwd.py

    * Update print statement formatting for clarity in example_amd_flash_attn_bwd.py

    * Remove redundant verification results summary print statement in example_amd_flash_attn_bwd.py for cleaner output.

    * Fix formatting inconsistencies in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py by adding spaces for improved readability in configuration parameters and print statements.

    * Refactor and enhance HIP code generation for improved FP8 support

    - Reorganized and cleaned up code in codegen_hip.cc for better readability and maintainability.
    - Enhanced handling of FP8 types, including additional formats and improved error logging for unsupported types.
    - Updated AtomicAdd function in common.h to streamline its implementation.
    - Refined the PrintVecElemLoadExpr method to handle volatile loads more effectively.
    - Added function to manage the addition of new functions in the code generation process.

    * Fix formatting issue in HIP code generation for MFMA call

    - Adjusted the indentation of the MFMA call code block in codegen_hip.cc for improved readability and consistency.

    * Refactor HIP code generation and enhance FP8 type handling

    - Reintroduced necessary includes and reorganized code in codegen_hip.cc for improved structure and readability.
    - Enhanced the GetFP8Type function to support additional FP8 formats and improved error handling for unsupported types.
    - Updated PrintType and PrintVecElemLoadExpr methods to better manage type conversions and vector element loading.
    - Refined the AddFunction method to streamline function addition in the code generation process.

    * Remove unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.

    * Refactor backward attention implementation in example_amd_flash_attn_bwd.py

    - Updated the GEMM operation to use shared memory for improved performance.
    - Adjusted parallelization parameters to enhance efficiency in the backward pass.

    * Fix formatting by removing an unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.

    * Add additional test cases for `assert_tl_matmul_correctness` with `float8_e4m3fnuz` and various configurations

    * Refactor test case formatting for `assert_tl_matmul_correctness` in `test_tilelang_gemm_mfma_intrinsic.py`

    ---------

    Co-authored-by: xinxyxiao <xinyxiao@amd.com>
    Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

commit b78d840
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Wed Oct 15 16:38:55 2025 +0800

    [Language] Expose `T.get_warp_idx_sync` and `T.shuffle_elect` for efficient thread election (tile-ai#989)

    * Expose CUDA warp/lane intrinsics in TileLang frontend

    * generalize warp indexing intrinsics and add coverage

    * [Lint]: [pre-commit.ci] auto fixes [...]

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit 32ddc1a
Author: LJC00118 <77378439+LJC00118@users.noreply.github.com>
Date:   Wed Oct 15 15:25:43 2025 +0800

    [CUDA] Add pack functions for FP8 types (tile-ai#967)

    * Remove an incorrect check

    * add fp8 pack function

    * code lint

    * minor fix

    * minor fix

    * minor fix

    * Minor fix

    * Minor fix

commit c67f73b
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Wed Oct 15 15:12:08 2025 +0800

    [Env] Optimize the mechanism for locating `TL_LIBS` (tile-ai#1038)

commit e539952
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Wed Oct 15 15:11:40 2025 +0800

    [TIR] Revert some changes of Pass `LowerIntrin` (tile-ai#1035)

    * keep >> instead of /

    * re think replicate

    * lint fix

    * handle const int buffers

    * rep fix

    ---------

    Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

commit 5767475
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Tue Oct 14 23:55:27 2025 +0800

    [CI] Disable buggy(maybe) warp specialized kernel ci test for H20 (tile-ai#1033)

commit eed320f
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Tue Oct 14 21:51:31 2025 +0800

    [Bugfix] Recover code for flexible parallel (tile-ai#1032)

    * recover flex parallel process

    * lint fix

    ---------

    Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

commit 1e8f0b1
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Tue Oct 14 17:26:23 2025 +0800

    [Enhancement] Update abs function for half_t and bfloat_t to use cutlass implementation (tile-ai#1023)

    * [Enhancement] Update abs function for half_t and bfloat_t to use cutlass implementation

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * optimize amd ci

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>