Conversation

@Rachmanino Rachmanino commented Oct 13, 2025

Fix #975.

This pull request refactors and improves the example implementations of fused chunked linear attention in both the forward (example_linear_attn_fwd.py) and backward (example_linear_attn_bwd.py) passes. The changes focus on making the TileLang kernels more idiomatic, improving correctness and performance, and enhancing testability and benchmarking. Additionally, a new test file is introduced for automated validation.

Refactoring and Kernel Improvements:

  • Rewrote the TileLang kernels for both forward and backward passes to use more idiomatic constructs (atomic_add for accumulation, shared fragments, and updated layouts), improving numerical correctness and performance.

  • Added normalization (l2norm_fwd) for input queries and keys, which is necessary for correct linear attention behavior (see the sketch below).
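
The normalization is just an L2 norm over the head dimension. A minimal PyTorch sketch of what l2norm_fwd computes (the helper name l2norm_ref and the eps value are illustrative; normalization along the last dimension is assumed):

    import torch

    def l2norm_ref(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # Normalize each head vector to unit L2 norm along the last dimension.
        return x / (x.float().norm(dim=-1, keepdim=True) + eps).to(x.dtype)

    q = torch.randn(1, 512, 16, 128, device='cuda', dtype=torch.float16)
    k = torch.randn_like(q)
    q, k = l2norm_ref(q), l2norm_ref(k)  # mirrors q, _ = l2norm_fwd(q); k, _ = l2norm_fwd(k)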

API and Usability Enhancements:

  • Simplified the main entry points for both forward and backward examples, removing legacy postprocessing and argument parsing logic, and making them callable from tests as well as the command line.

  • Replaced legacy reference implementations with new, explicit reference programs for both forward and backward passes, improving clarity and correctness of comparison.

Testing and Benchmarking:

  • Added a new test file test_linear_attn.py that automatically validates both forward and backward implementations using CUDA, improving reliability and enabling CI integration (a sketch of the module follows this list).

  • Improved benchmarking setup to use the cupti backend for more accurate GPU timing and clarified speedup reporting.
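
A sketch of the new test module, based on the review snippets later in this thread (the tilelang.testing.main() runner is assumed from the walkthrough description):

    # examples/linear_attention/test_linear_attn.py (sketch)
    import tilelang.testing

    import example_linear_attn_fwd
    import example_linear_attn_bwd


    @tilelang.testing.requires_cuda
    def test_example_linear_attn_fwd():
        # Runs the forward example end to end (build kernel, compare against ref_program).
        example_linear_attn_fwd.main()


    @tilelang.testing.requires_cuda
    def test_example_linear_attn_bwd():
        # Runs the backward example end to end (gradients vs. reference).
        example_linear_attn_bwd.main()


    if __name__ == "__main__":
        tilelang.testing.main()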

Documentation:

  • Updated docstrings to document new kernel arguments and options, such as use_tma for atomic adds in TileLang.


Summary by CodeRabbit

  • New Features

    • Fused linear-attention forward/backward with input normalization, reference programs, CLI-friendly main, and built-in numeric assertions for verification.
    • Public wrappers exposing simplified forward/backward and benchmarking entry points.
  • Refactor

    • Kernels reworked to use shared accumulation buffers with atomic accumulation; pipeline and API entry names streamlined; verification routed through reference paths.
  • Tests

    • Added CUDA-gated tests exercising forward and backward example flows.
  • Chores

    • Added runtime dependency: flash-linear-attention==0.3.2.
  • Style

    • Extended atomic API to support an optional TMA-enabled accumulation path.

- Refactored the backward and forward linear attention kernels to use shared memory and atomic additions for improved performance.
- Introduced L2 normalization in the main functions of both examples.
- Added a new test suite for the linear attention examples to ensure correctness and performance.
- Updated argument parsing in the main functions for better usability.
@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀


coderabbitai bot commented Oct 13, 2025

Walkthrough

Refactors forward/backward linear-attention examples to fused tl_fused_chunk_* kernels and wrappers, adds reference implementations and CUDA tests, switches per-block accumulation to shared buffers with atomic_add (optional TMA), normalizes Q/K for backward, updates CLI/benchmarking, and extends atomic_add API with a use_tma flag.

Changes

  • Linear attention — forward example (examples/linear_attention/example_linear_attn_fwd.py): Renamed kernel to tl_fused_chunk_fwd_kernel; added wrapper tl_fused_chunk_fwd and ref_program; changed accumulation to use o_shared + atomic_add; adjusted layout/pipelining and pass_config usage; added l2norm_fwd use, new main(B,S,H,D) and CLI parsing; updated verification and benchmarking to use Cupti.
  • Linear attention — backward example (examples/linear_attention/example_linear_attn_bwd.py): Replaced chunk_linear_attn_bwd_kernel with tl_fused_chunk_bwd_kernel; added tl_fused_chunk_bwd and ref_program; replaced per-tile gradient storage with shared buffers (dq_shared, dk_shared, dv_shared) and atomic_add into global dQ/dK/dV; adjusted NT/NK/NV calculations, accum_dtype handling, added l2norm_fwd normalization, assertion-based tests, and CLI main.
  • Examples tests (examples/linear_attention/test_linear_attn.py): New test module adding CUDA-guarded tests test_example_linear_attn_fwd() and test_example_linear_attn_bwd() that call the examples' main(); includes __main__ runner using tilelang.testing.
  • Atomic API (tilelang/language/atomic.py): Added use_tma: bool = False parameter to atomic_add and propagated it into the tl.atomicadd intrinsic to enable a TMA (cp.reduce) path when requested (sm90+).
  • Dependencies (requirements.txt): Added runtime dependency flash-linear-attention==0.3.2.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Test as test_linear_attn.py
  participant MainF as example_linear_attn_fwd.main
  participant NormF as l2norm_fwd
  participant Fwd as tl_fused_chunk_fwd(q,k,v)
  participant KernF as tl_fused_chunk_fwd_kernel
  participant O as outputs (o,h)
  participant RefF as ref_program(q,k,v)

  User->>Test: run test_example_linear_attn_fwd()
  Test->>MainF: call main()
  MainF->>NormF: normalize q,k
  MainF->>Fwd: tl_fused_chunk_fwd(q,k,v)
  Fwd->>KernF: launch fused kernel (pass_config)
  KernF->>O: atomic_add partials -> O (o_shared)
  KernF-->>Fwd: return O
  Test->>RefF: compute o_ref,h_ref
  Test->>Test: assert_close(O, o_ref), assert_close(h, h_ref)
sequenceDiagram
  autonumber
  actor User
  participant Test as test_linear_attn.py
  participant MainB as example_linear_attn_bwd.main
  participant NormB as l2norm_fwd
  participant Bwd as tl_fused_chunk_bwd(Q,K,V,dO)
  participant KernB as tl_fused_chunk_bwd_kernel
  participant Grad as gradients (dQ,dK,dV)
  participant RefB as ref_program(q,k,v,scale)

  User->>Test: run test_example_linear_attn_bwd()
  Test->>MainB: call main()
  MainB->>NormB: normalize Q,K
  MainB->>Bwd: tl_fused_chunk_bwd(Q,K,V,dO)
  Bwd->>KernB: launch fused backward kernel
  KernB->>Grad: atomic_add partials -> dQ,dK,dV (shared buffers)
  KernB-->>Bwd: return dQ,dK,dV
  Test->>RefB: compute reference grads
  Test->>Test: assert_close(dQ, ref_dq), assert_close(dK, ref_dk), assert_close(dV, ref_dv)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • LeiWang1999

Poem

In burrows of code I hop and chew,
Fused kernels weave gradients through.
Atoms add softly, TMA may sing,
Forward, backward — carrots bring! 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The pull request title "[Refactor][Example] Update linear attention examples and add tests" is well-aligned with the main changes in the changeset. The title clearly captures the primary objective: refactoring the linear attention example implementations (example_linear_attn_fwd.py and example_linear_attn_bwd.py) and adding new test coverage (test_linear_attn.py). Supporting changes include an atomic_add API extension (atomic.py) to enable TMA-based operations and a requirements.txt update to add the FLA dependency. The title is specific and descriptive enough that a reviewer scanning the history would understand the focus is on improving and testing the example implementations.
  • Linked Issues Check ✅ Passed: The pull request addresses the primary objective from issue #975 to resolve inconsistencies between TileLang's LinearAttention outputs and the FLA baseline. The changeset implements several key improvements: rewritten TileLang kernels using atomic_add and shared buffers for better numerical correctness, input normalization via l2norm_fwd, and new test infrastructure (test_linear_attn.py) with CUDA guards for reproducible validation. These changes directly align with the issue's requirement to fix correctness gaps so outputs and gradients match the FLA baseline. The PR comments indicate the refactoring has reduced error metrics for forward and backward passes, suggesting the objective has been achieved.
  • Out of Scope Changes Check ✅ Passed: All changes appear to be reasonably scoped to the stated objectives. The example file refactoring (example_linear_attn_fwd.py, example_linear_attn_bwd.py) directly addresses issue #975's requirement to fix output inconsistencies. The new test file (test_linear_attn.py) provides the reproducible test infrastructure requested in the objectives. The atomic.py API enhancement (adding use_tma parameter for TMA-based atomic operations) is a minimal infrastructure change that enables the refactored kernels to perform atomic accumulation more efficiently, supporting the correctness improvements. The requirements.txt addition of flash-linear-attention is necessary for the examples to function with the FLA reference implementation. Each change supports the core objective of fixing and validating LinearAttention correctness.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (5)
examples/linear_attention/example_linear_attn_fwd.py (3)

44-44: Fix E741 ambiguous name: rename O to a clearer identifier

Rename O to avoid the ambiguous capital “O” and improve readability.

-            O: T.Tensor([B, S, H, DV], accum_dtype),  # type: ignore
+            out_buf: T.Tensor([B, S, H, DV], accum_dtype),  # type: ignore
@@
-                T.atomic_add(
-                    O[i_b, i * chunk_size:(i + 1) * chunk_size, i_h, i_v * BV:(i_v + 1) * BV],
-                    o_shared)
+                T.atomic_add(
+                    out_buf[i_b, i * chunk_size:(i + 1) * chunk_size, i_h, i_v * BV:(i_v + 1) * BV],
+                    o_shared)

Note: out_idx=[4] remains correct (index refers to final_state).

Also applies to: 78-81


25-26: Type hint nit: scale should be Optional[float]

The default None conflicts with annotation float. Use Optional[float].

-    dtype: str = 'float16',
-    scale: float = None,
+    dtype: str = 'float16',
+    scale: Optional[float] = None,

77-82: Optionally plumb use_tma to atomic_add for SM90

Expose a use_tma: bool = False flag in the kernel and pass to T.atomic_add(..., use_tma=use_tma) when TMA lowering isn’t disabled. This allows easy toggling on Hopper without code edits.

examples/linear_attention/example_linear_attn_bwd.py (2)

22-24: Type hint nit: scale should be Optional[float]

Match the default None.

-    dtype: str = 'float16',
-    scale: float = None,
+    dtype: str = 'float16',
+    scale: Optional[float] = None,

95-98: Optional: Plumb use_tma to gradient atomics on Hopper

Expose use_tma: bool = False in the kernel and forward to both T.atomic_add calls for dQ, dK, dV when TMA lowering is enabled.

Also applies to: 134-141

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 340bfc5 and d26799e.

📒 Files selected for processing (4)
  • examples/linear_attention/example_linear_attn_bwd.py (6 hunks)
  • examples/linear_attention/example_linear_attn_fwd.py (4 hunks)
  • examples/linear_attention/test_linear_attn.py (1 hunks)
  • tilelang/language/atomic.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
examples/linear_attention/test_linear_attn.py (2)
examples/linear_attention/example_linear_attn_bwd.py (2)
  • main (38-142)
  • main (180-210)
examples/linear_attention/example_linear_attn_fwd.py (2)
  • main (40-84)
  • main (120-144)
examples/linear_attention/example_linear_attn_fwd.py (3)
tilelang/jit/__init__.py (1)
  • jit (244-317)
tilelang/jit/kernel.py (1)
  • out_idx (462-463)
tilelang/transform/pass_config.py (1)
  • PassConfigKey (6-104)
examples/linear_attention/example_linear_attn_bwd.py (3)
tilelang/jit/__init__.py (1)
  • jit (244-317)
tilelang/transform/pass_config.py (1)
  • PassConfigKey (6-104)
examples/linear_attention/example_linear_attn_fwd.py (1)
  • ref_program (97-117)
🪛 Ruff (0.13.3)
examples/linear_attention/example_linear_attn_fwd.py

44-44: Ambiguous variable name: O

(E741)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: format-check

Comment on lines +134 to +141
T.copy(dk, dk_shared)
T.atomic_add(
dK[i_b, start * chunk_size:(start + 1) * chunk_size, i_h,
i_k * BK:(i_k + 1) * BK], dk_shared)
T.copy(dv, dv_shared)
T.atomic_add(
dV[i_b, start * chunk_size:(start + 1) * chunk_size, i_h,
i_v * BV:(i_v + 1) * BV], dv_shared)

🛠️ Refactor suggestion | 🟠 Major

Gate benchmarks and keep test path fast; mirror forward changes

Prevent heavy benchmarking during tests and allow enabling via CLI.

-                T.atomic_add(
-                    dK[i_b, start * chunk_size:(start + 1) * chunk_size, i_h,
-                       i_k * BK:(i_k + 1) * BK], dk_shared)
+                T.atomic_add(
+                    dK[i_b, start * chunk_size:(start + 1) * chunk_size, i_h,
+                       i_k * BK:(i_k + 1) * BK], dk_shared)
@@
-def main(B=1, S=1024, H=16, D=128):
+def main(B=1, S=1024, H=16, D=128, run_bench: bool = False):
@@
-    print('Passed all tests!✅')
-
-    # Benchmark
-    q.grad = k.grad = v.grad = None
-    o_ref, _ = fused_chunk_linear_attn(q, k, v, output_final_state=True, normalize=False)
-    t1 = do_bench(
-        lambda: o_ref.backward(do, retain_graph=True), warmup=25, rep=100, backend='cupti')
-    t2 = do_bench(lambda: tl_fused_chunk_bwd(q, k, v, do), warmup=25, rep=100, backend='cupti')
-    print(f'Triton latency: {t1:.3f} ms')
-    print(f'TileLang latency: {t2:.3f} ms')
-    print(f'Speedup: {t1/t2:.3f}x')
+    print('Passed all tests!✅')
+    if run_bench:
+        # Benchmark
+        q.grad = k.grad = v.grad = None
+        o_ref, _ = fused_chunk_linear_attn(q, k, v, output_final_state=True, normalize=False)
+        t1 = do_bench(
+            lambda: o_ref.backward(do, retain_graph=True), warmup=25, rep=100, backend='cupti')
+        t2 = do_bench(lambda: tl_fused_chunk_bwd(q, k, v, do), warmup=25, rep=100, backend='cupti')
+        print(f'Triton latency: {t1:.3f} ms')
+        print(f'TileLang latency: {t2:.3f} ms')
+        print(f'Speedup: {t1/t2:.3f}x')
@@
-    main(args.B, args.S, args.H, args.D)
+    main(args.B, args.S, args.H, args.D, run_bench=True)

Also applies to: 180-209, 213-221

🤖 Prompt for AI Agents
In examples/linear_attention/example_linear_attn_bwd.py around lines 134-141
(and similarly lines 180-209, 213-221), the test path currently runs heavy gate
benchmarks; update the code to skip or short-circuit expensive benchmarking
during test runs and expose a CLI flag to enable full benchmarks. Modify the
logic to check a new command-line/single-run flag (e.g., --run-bench) or an
environment variable before executing benchmark code paths so tests use the
lightweight forward-mirror path by default; ensure the forward changes that
mirror behavior are applied consistently in the referenced blocks and that the
default test execution remains fast while full benchmarking is opt-in via the
CLI flag.

Comment on lines 120 to 135
def main(B=1, S=512, H=16, D=128):
q = torch.randn((B, S, H, D), device='cuda', dtype=torch.float16)
k = torch.randn((B, S, H, D), device='cuda', dtype=torch.float16)
v = torch.randn((B, S, H, D), device='cuda', dtype=torch.float16)

kernel = chunk_linear_attn_fwd_kernel(B, S, H, D, D)
o, h = postprocess(*kernel(q, k, v))
o_ref, h_ref = fused_chunk_linear_attn(q, k, v, output_final_state=True, normalize=False)
# qk norm is necessary for linear attn
q, _ = l2norm_fwd(q)
k, _ = l2norm_fwd(k)

o, h = tl_fused_chunk_fwd(q, k, v)
o_ref, h_ref = ref_program(q, k, v)

if torch.allclose(o, o_ref) and torch.allclose(h, h_ref):
print('Passed all tests!✅')
else:
print('Failed some tests!❌')
assert torch.allclose(o, o_ref, atol=1e-2, rtol=1e-2), f'o max err: {(o - o_ref).abs().max()}'
assert torch.allclose(h, h_ref, atol=1e-2, rtol=1e-2), f'h max err: {(h - h_ref).abs().max()}'
print('Passed all tests!✅')


🛠️ Refactor suggestion | 🟠 Major

Do not run benchmarks in tests; gate with a flag and keep tests fast

test_linear_attn.py calls main() which currently benchmarks (100 reps, cupti). This will slow CI or fail if CUPTI is unavailable. Gate benchmarking and allow smaller defaults.

-def main(B=1, S=512, H=16, D=128):
+def main(B=1, S=512, H=16, D=128, run_bench: bool = False):
@@
-    assert torch.allclose(h, h_ref, atol=1e-2, rtol=1e-2), f'h max err: {(h - h_ref).abs().max()}'
-    print('Passed all tests!✅')
-
-    t1 = do_bench(
-        lambda: fused_chunk_linear_attn(q, k, v, output_final_state=True, normalize=False),
-        warmup=25,
-        rep=100,
-        backend='cupti')
-    t2 = do_bench(lambda: tl_fused_chunk_fwd(q, k, v), warmup=25, rep=100, backend='cupti')
-    print(f'Triton latency: {t1:.3f} ms')
-    print(f'TileLang latency: {t2:.3f} ms')
-    print(f'Speedup: {t1/t2:.3f}x')
+    assert torch.allclose(h, h_ref, atol=1e-2, rtol=1e-2), f'h max err: {(h - h_ref).abs().max()}'
+    print('Passed all tests!✅')
+    if run_bench:
+        t1 = do_bench(
+            lambda: fused_chunk_linear_attn(q, k, v, output_final_state=True, normalize=False),
+            warmup=25, rep=100, backend='cupti')
+        t2 = do_bench(lambda: tl_fused_chunk_fwd(q, k, v), warmup=25, rep=100, backend='cupti')
+        print(f'Triton latency: {t1:.3f} ms')
+        print(f'TileLang latency: {t2:.3f} ms')
+        print(f'Speedup: {t1/t2:.3f}x')
@@
-    main(args.B, args.S, args.H, args.D)
+    main(args.B, args.S, args.H, args.D, run_bench=True)

Also applies to: 136-145

Comment on lines +3 to +4
import example_linear_attn_fwd
import example_linear_attn_bwd

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Use importable module paths and keep tests lightweight

  • Prefer absolute package imports to avoid path issues.
  • Call mains with small shapes and no benchmarking.
-import example_linear_attn_fwd
-import example_linear_attn_bwd
+from examples.linear_attention import example_linear_attn_fwd, example_linear_attn_bwd
@@
-@tilelang.testing.requires_cuda
-def test_example_linear_attn_fwd():
-    example_linear_attn_fwd.main()
+@tilelang.testing.requires_cuda
+def test_example_linear_attn_fwd():
+    example_linear_attn_fwd.main(B=1, S=256, H=8, D=64, run_bench=False)
@@
-@tilelang.testing.requires_cuda
-def test_example_linear_attn_bwd():
-    example_linear_attn_bwd.main()
+@tilelang.testing.requires_cuda
+def test_example_linear_attn_bwd():
+    example_linear_attn_bwd.main(B=1, S=256, H=8, D=64, run_bench=False)

If absolute imports fail, ensure examples/ is treated as a (namespace) package. Check for __init__.py and adjust as needed:

Also applies to: 8-10, 13-14


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "Checking package init files..."
fd -H "__init__.py" -a examples || true
echo "Listing test file location and siblings..."
ls -la examples/linear_attention | head -n 50

Length of output: 712


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "example_linear_attn_fwd.py main signature"
grep -nP "^def main" examples/linear_attention/example_linear_attn_fwd.py
echo "example_linear_attn_fwd.py parser arguments"
grep -R "add_argument" -n examples/linear_attention/example_linear_attn_fwd.py || true

echo
echo "example_linear_attn_bwd.py main signature"
grep -nP "^def main" -n examples/linear_attention/example_linear_attn_bwd.py
echo "example_linear_attn_bwd.py parser arguments"
grep -R "add_argument" -n examples/linear_attention/example_linear_attn_bwd.py || true

Length of output: 851


Use absolute imports and lightweight test parameters

  • Replace

    import example_linear_attn_fwd
    import example_linear_attn_bwd

    with

    from examples.linear_attention import example_linear_attn_fwd, example_linear_attn_bwd

    (add empty __init__.py in examples/ and examples/linear_attention/ if imports fail)

  • In each test, call main() with smaller shapes:

    example_linear_attn_fwd.main(B=1, S=256, H=8, D=64)
    example_linear_attn_bwd.main(B=1, S=256, H=8, D=64)
🤖 Prompt for AI Agents
In examples/linear_attention/test_linear_attn.py around lines 3 to 4, replace
the relative module imports with absolute imports and call the test mains with
smaller, lightweight shapes: change the imports to use from
examples.linear_attention import example_linear_attn_fwd,
example_linear_attn_bwd (add empty __init__.py files in examples/ and
examples/linear_attention/ if the import fails), and in each test invoke
example_linear_attn_fwd.main(B=1, S=256, H=8, D=64) and
example_linear_attn_bwd.main(B=1, S=256, H=8, D=64) so the tests run faster and
more reliably.

Comment on lines +131 to 133
use_tma (bool): If True, use TMA (cp.reduce) to perform the atomic add. This is available only for sm90+ (default False).
Returns:

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Expose architecture/disable‑TMA behavior in API and verify intrinsic signature

  • The new use_tma flag is great. Please clarify in the docstring that it’s ignored when TMA lowering is disabled (e.g., TL_DISABLE_TMA_LOWER=True) and/or when running on pre‑SM90 targets, rather than implying availability only. Optionally, guard or warn when use_tma=True under those conditions.
  • You added a third arg to tl.atomicadd intrinsic. Ensure all lowerings and call sites are updated to accept (value, dst, use_tma).

Run to confirm no remaining 2‑arg tl.atomicadd intrin calls and to locate TMA lowering:

Also applies to: 230-231


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "Searching for tl.atomicadd intrin call sites..."
rg -n -C2 "call_intrin\([^,]+,\s*op\.Op\.get\([\"']tl\.atomicadd[\"']\)" || true
echo "Searching for tl.atomicadd lowering/definitions..."
rg -n -C3 "tl\.atomicadd" || true
echo "Check presence of pass config controlling TMA..."
rg -n "TL_DISABLE_TMA_LOWER|tma_lower|TMA" tilelang || true

Length of output: 4356


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "Inspecting atomic_add signature in tilelang/language/atomic.py..."
rg -n -C2 "def atomic_add" tilelang/language/atomic.py || true

Length of output: 566


Update atomic_add API signature and docstrings

  • In tilelang/language/atomic.py, change
    def atomic_add(dst: Buffer, value: PrimExpr, memory_order: Optional[str] = None)
    to
    def atomic_add(dst: Buffer, value: PrimExpr, use_tma: bool = False, memory_order: Optional[str] = None)
    and align atomic_addx2/atomic_addx4 accordingly so the use_tma arg flows into the intrinsic call.
  • In its docstring, clarify that use_tma is ignored if TL_DISABLE_TMA_LOWER=True or on pre-SM90 targets; optionally emit a warning or error when use_tma=True under those conditions.
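
A minimal sketch of that signature and guard (only the parameter order, the (value, dst, use_tma) intrinsic argument order, and the docstring note come from the comments above; the _tma_supported() helper and the warning behavior are hypothetical):

    import warnings
    from typing import Optional

    def _tma_supported() -> bool:
        # Hypothetical capability check (e.g., target is sm90+ and TMA lowering is enabled).
        return False

    def atomic_add(dst, value, use_tma: bool = False, memory_order: Optional[str] = None):
        """Atomically add value into dst.

        use_tma: If True, lower to a TMA (cp.reduce) atomic add. Ignored when
            TL_DISABLE_TMA_LOWER=True or on pre-SM90 targets.
        """
        if use_tma and not _tma_supported():
            warnings.warn("use_tma=True requested but TMA lowering is unavailable; "
                          "falling back to the regular atomic path.")
            use_tma = False
        ...  # forward (value, dst, use_tma) into the tl.atomicadd intrinsic as today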

@LeiWang1999 (Member) commented:

It’s likely that the new test relies on fla. Should we skip the FLA reference, or install an identical fla version in CI?

print('Passed all tests!✅')
else:
print('Failed some tests!❌')
assert torch.allclose(o, o_ref, atol=1e-2, rtol=1e-2), f'o max err: {(o - o_ref).abs().max()}'


Hi, have you considered comparing accuracy using an approach similar to FlashAttention's tests: compare the numerical error of (PyTorch implementation in fp16 - PyTorch implementation in fp32) against that of (TileLang in fp16 - PyTorch implementation in fp32)? (A sketch of this check follows the references below.)

Ref:
Dao-AILab/flash-attention#16 (comment)
https://github.com/Dao-AILab/flash-attention/blob/v2.8.3/tests/test_flash_attn.py#L704
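
A rough sketch of that check for the forward example (ref_program and tl_fused_chunk_fwd are the functions from example_linear_attn_fwd.py; running ref_program on fp32 inputs, skipping the qk normalization, and the factor-of-2 tolerance are assumptions borrowed from the FlashAttention tests):

    import torch
    from example_linear_attn_fwd import ref_program, tl_fused_chunk_fwd

    q = torch.randn((1, 512, 16, 128), device='cuda', dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)

    o_ref32, _ = ref_program(q.float(), k.float(), v.float())  # fp32 PyTorch reference ("golden")
    o_ref16, _ = ref_program(q, k, v)                          # fp16 PyTorch reference
    o_tl, _ = tl_fused_chunk_fwd(q, k, v)                      # TileLang fp16 kernel

    ref_err = (o_ref16.float() - o_ref32).abs().max()
    tl_err = (o_tl.float() - o_ref32).abs().max()
    # Accept the kernel if its error vs. fp32 stays within roughly 2x the fp16 reference error.
    assert tl_err <= 2 * ref_err + 1e-5, f'tl err {tl_err:.3e} vs ref err {ref_err:.3e}'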

@Rachmanino Rachmanino (Collaborator, Author) commented Oct 16, 2025


Hi, thanks for your suggestion. I have personally observed that linear attention operators seem to have slightly larger numerical fluctuations than other attention kernels and are typically sensitive to the input distribution. I'll investigate the approach you've just mentioned and give it a try.

@Rachmanino Rachmanino (Collaborator, Author) commented:

Also, any suggestions or help on addressing the linear attention kernels' numerical issues are welcome 😉

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
requirements.txt (1)

10-10: Move flash-linear-attention out of runtime requirements
flash-linear-attention is only used by examples/benchmarks (no imports in src or testing/). Remove

- flash-linear-attention==0.3.2

from requirements.txt and add it to requirements-test-cuda.txt (or define an extras_require “fla” in pyproject.toml/setup.py). If you must keep it in runtime, relax to ~=0.3.2.
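
For illustration, the extras route could look roughly like this in setup.py (a sketch only; the actual packaging layout and extra name are up to the maintainers):

    from setuptools import setup

    setup(
        # ... existing project metadata ...
        extras_require={
            # Opt-in extra for the FLA-based examples/benchmarks:
            #   pip install tilelang[fla]
            "fla": ["flash-linear-attention~=0.3.2"],
        },
    )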

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d26799e and 6da0e2d.

📒 Files selected for processing (1)
  • requirements.txt (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)

LeiWang1999 previously approved these changes Oct 17, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
examples/linear_attention/example_linear_attn_bwd.py (1)

6-6: Gate benchmarks; make FLA optional; avoid top‑level import

  • Top‑level FLA import will fail in CI if FLA isn’t installed.
  • Benchmarks (CUPTI, 100 reps) run unconditionally and slow/break tests.
  • Add a flag to opt‑in, lazy‑import FLA, and skip if unavailable.

Apply:

-from fla.ops.linear_attn import fused_chunk_linear_attn  # We compare with FLA
+# Optional dependency: used only for benchmarking
+try:
+    from fla.ops.linear_attn import fused_chunk_linear_attn  # noqa: F401
+except Exception:
+    fused_chunk_linear_attn = None
-def main(B=1, S=1024, H=16, D=128):
+def main(B=1, S=1024, H=16, D=128, run_bench: bool = False):
+    torch.manual_seed(0)
@@
-    print('Passed all tests!✅')
-
-    # Benchmark
-    q.grad = k.grad = v.grad = None
-    o_ref, _ = fused_chunk_linear_attn(q, k, v, output_final_state=True, normalize=False)
-    t1 = do_bench(
-        lambda: o_ref.backward(do, retain_graph=True), warmup=25, rep=100, backend='cupti')
-    t2 = do_bench(lambda: tl_fused_chunk_bwd(q, k, v, do), warmup=25, rep=100, backend='cupti')
-    print(f'Triton latency: {t1:.3f} ms')
-    print(f'TileLang latency: {t2:.3f} ms')
-    print(f'Speedup: {t1/t2:.3f}x')
+    print('Passed all tests!✅')
+    if run_bench:
+        if fused_chunk_linear_attn is None:
+            print('FLA not available; skipping benchmark.')
+        else:
+            # Benchmark
+            q.grad = k.grad = v.grad = None
+            o_ref, _ = fused_chunk_linear_attn(q, k, v, output_final_state=True, normalize=False)
+            t1 = do_bench(
+                lambda: o_ref.backward(do, retain_graph=True), warmup=25, rep=100, backend='cupti')
+            t2 = do_bench(lambda: tl_fused_chunk_bwd(q, k, v, do),
+                          warmup=25, rep=100, backend='cupti')
+            print(f'Triton latency: {t1:.3f} ms')
+            print(f'TileLang latency: {t2:.3f} ms')
+            print(f'Speedup: {t1/t2:.3f}x')
 if __name__ == '__main__':
     parser = argparse.ArgumentParser()
     parser.add_argument('--B', type=int, default=8, help='Batch size')
     parser.add_argument('--S', type=int, default=1024, help='Seq len')
     parser.add_argument('--H', type=int, default=32, help='Num heads')
     parser.add_argument('--D', type=int, default=128, help='Head dim')
+    parser.add_argument('--run-bench', action='store_true',
+                        help='Enable benchmarks (requires FLA & CUPTI)')
     args = parser.parse_args()
 
-    main(args.B, args.S, args.H, args.D)
+    main(args.B, args.S, args.H, args.D, run_bench=args.run_bench)

Also applies to: 180-201, 202-211, 213-221

examples/linear_attention/example_linear_attn_fwd.py (1)

6-6: Gate benchmarks; make FLA optional; avoid top‑level import

  • Top‑level FLA import can break environments without FLA.
  • Benchmarks run by default; make them opt‑in.
  • Ensure CLI exposes a flag; tests stay fast.

Apply:

-from fla.ops.linear_attn import fused_chunk_linear_attn  # We compare with FLA
+# Optional dependency: used only for benchmarking
+try:
+    from fla.ops.linear_attn import fused_chunk_linear_attn  # noqa: F401
+except Exception:
+    fused_chunk_linear_attn = None
-def main(B=1, S=512, H=16, D=128):
+def main(B=1, S=512, H=16, D=128, run_bench: bool = False):
+    torch.manual_seed(0)
@@
-    assert torch.allclose(h, h_ref, atol=1e-2, rtol=1e-2), f'h max err: {(h - h_ref).abs().max()}'
-    print('Passed all tests!✅')
-    t1 = do_bench(
-        lambda: fused_chunk_linear_attn(q, k, v, output_final_state=True, normalize=False),
-        warmup=25,
-        rep=100,
-        backend='cupti')
-    t2 = do_bench(lambda: tl_fused_chunk_fwd(q, k, v), warmup=25, rep=100, backend='cupti')
-    print(f'Triton latency: {t1:.3f} ms')
-    print(f'TileLang latency: {t2:.3f} ms')
-    print(f'Speedup: {t1/t2:.3f}x')
+    assert torch.allclose(h, h_ref, atol=1e-2, rtol=1e-2), f'h max err: {(h - h_ref).abs().max()}'
+    print('Passed all tests!✅')
+    if run_bench:
+        if fused_chunk_linear_attn is None:
+            print('FLA not available; skipping benchmark.')
+        else:
+            t1 = do_bench(
+                lambda: fused_chunk_linear_attn(q, k, v, output_final_state=True, normalize=False),
+                warmup=25, rep=100, backend='cupti')
+            t2 = do_bench(lambda: tl_fused_chunk_fwd(q, k, v), warmup=25, rep=100, backend='cupti')
+            print(f'Triton latency: {t1:.3f} ms')
+            print(f'TileLang latency: {t2:.3f} ms')
+            print(f'Speedup: {t1/t2:.3f}x')
 if __name__ == '__main__':
     parser = argparse.ArgumentParser()
     parser.add_argument('--B', type=int, default=8, help='Batch size')
     parser.add_argument('--S', type=int, default=1024, help='Seq len')
     parser.add_argument('--H', type=int, default=32, help='Num heads')
     parser.add_argument('--D', type=int, default=128, help='Head dim')
+    parser.add_argument('--run-bench', action='store_true',
+                        help='Enable benchmarks (requires FLA & CUPTI)')
     args = parser.parse_args()
 
-    main(args.B, args.S, args.H, args.D)
+    main(args.B, args.S, args.H, args.D, run_bench=args.run_bench)

Also applies to: 120-135, 136-145, 147-155

🧹 Nitpick comments (3)
examples/linear_attention/example_linear_attn_bwd.py (1)

181-189: Seed RNG for reproducible tests

Add a fixed seed before creating random tensors to reduce flakiness.

-def main(B=1, S=1024, H=16, D=128, run_bench: bool = False):
-    q = torch.randn((B, S, H, D), device='cuda', dtype=torch.float16, requires_grad=True)
+def main(B=1, S=1024, H=16, D=128, run_bench: bool = False):
+    torch.manual_seed(0)
+    q = torch.randn((B, S, H, D), device='cuda', dtype=torch.float16, requires_grad=True)
examples/linear_attention/example_linear_attn_fwd.py (2)

44-45: Rename ambiguous variable O to out for clarity (E741)

Avoid single‑letter upper‑case names in Python.

-            O: T.Tensor([B, S, H, DV], accum_dtype),  # type: ignore
+            out: T.Tensor([B, S, H, DV], accum_dtype),  # type: ignore
@@
-                T.atomic_add(
-                    O[i_b, i * chunk_size:(i + 1) * chunk_size, i_h, i_v * BV:(i_v + 1) * BV],
-                    o_shared)
+                T.atomic_add(
+                    out[i_b, i * chunk_size:(i + 1) * chunk_size, i_h, i_v * BV:(i_v + 1) * BV],
+                    o_shared)

Also applies to: 78-80


121-129: Seed RNG for reproducible tests

Add a fixed seed before creating random tensors.

-def main(B=1, S=512, H=16, D=128, run_bench: bool = False):
-    q = torch.randn((B, S, H, D), device='cuda', dtype=torch.float16)
+def main(B=1, S=512, H=16, D=128, run_bench: bool = False):
+    torch.manual_seed(0)
+    q = torch.randn((B, S, H, D), device='cuda', dtype=torch.float16)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6da0e2d and b6fe638.

📒 Files selected for processing (2)
  • examples/linear_attention/example_linear_attn_bwd.py (6 hunks)
  • examples/linear_attention/example_linear_attn_fwd.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/linear_attention/example_linear_attn_bwd.py (5)
tilelang/jit/__init__.py (1)
  • jit (244-317)
tilelang/transform/pass_config.py (1)
  • PassConfigKey (6-104)
tilelang/language/allocate.py (2)
  • alloc_shared (21-36)
  • alloc_fragment (53-64)
tilelang/layout/swizzle.py (1)
  • make_swizzled_layout (10-18)
tilelang/language/copy.py (1)
  • copy (10-86)
examples/linear_attention/example_linear_attn_fwd.py (3)
tilelang/jit/__init__.py (1)
  • jit (244-317)
tilelang/transform/pass_config.py (1)
  • PassConfigKey (6-104)
tilelang/language/pipeline.py (1)
  • Pipelined (9-46)
🪛 Ruff (0.14.0)
examples/linear_attention/example_linear_attn_fwd.py

44-44: Ambiguous variable name: O

(E741)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)

@Rachmanino Rachmanino (Collaborator, Author) commented:

@cailun01 FYI

fwd abs err (torch ref in fp32 as golden)

o_ref_16 max err: 0.0011341571807861328
h_ref_16 max err: 0.024440765380859375
o_triton max err: 0.0005993843078613281
h_triton max err: 2.574920654296875e-05
tilelang o max err: 0.00041940435767173767
tilelang h max err: 2.574920654296875e-05

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
examples/linear_attention/example_linear_attn_bwd.py (1)

181-203: Gate benchmarks; move FLA import inside the bench; keep CI/tests fast and independent of FLA/CUPTI

Benchmarks run unconditionally and require CUPTI and FLA at import-time. This will slow or fail CI and tests. Add a run_bench flag (default False), import FLA lazily inside the bench block, and skip gracefully if unavailable. Mirrors prior review.

Apply this diff:

@@
-def main(B=1, S=1024, H=16, D=128):
+def main(B=1, S=1024, H=16, D=128, run_bench: bool = False):
@@
-    print('Passed all tests!✅')
-
-    # Benchmark
-    q.grad = k.grad = v.grad = None
-    o_ref, _ = fused_chunk_linear_attn(q, k, v, output_final_state=True, normalize=False)
-    t1 = do_bench(lambda: o_ref.backward(do, retain_graph=True), backend='cupti')
-    t2 = do_bench(lambda: tl_fused_chunk_bwd(q, k, v, do), backend='cupti')
-    print(f'Triton latency: {t1:.3f} ms')
-    print(f'TileLang latency: {t2:.3f} ms')
-    print(f'Speedup: {t1/t2:.3f}x')
+    print('Passed all tests!✅')
+    if run_bench:
+        try:
+            # Lazy import so examples/tests don't require FLA unless benchmarking.
+            from fla.ops.linear_attn import fused_chunk_linear_attn  # type: ignore
+        except Exception:
+            print('[bench] flash-linear-attention not installed; skipping benchmarks.')
+        else:
+            q.grad = k.grad = v.grad = None
+            o_ref, _ = fused_chunk_linear_attn(q, k, v, output_final_state=True, normalize=False)
+            t1 = do_bench(lambda: o_ref.backward(do, retain_graph=True),
+                          warmup=25, rep=100, backend='cupti')
+            t2 = do_bench(lambda: tl_fused_chunk_bwd(q, k, v, do),
+                          warmup=25, rep=100, backend='cupti')
+            print(f'Triton latency: {t1:.3f} ms')
+            print(f'TileLang latency: {t2:.3f} ms')
+            print(f'Speedup: {t1/t2:.3f}x')
@@
-if __name__ == '__main__':
+if __name__ == '__main__':
@@
-    main(args.B, args.S, args.H, args.D)
+    main(args.B, args.S, args.H, args.D, run_bench=True)

Also remove the top-level FLA benchmark import to avoid a hard dependency during tests:

@@
-from fla.ops.linear_attn import fused_chunk_linear_attn  # We compare with FLA

Also applies to: 204-211, 214-222, 1-9

examples/linear_attention/example_linear_attn_fwd.py (1)

122-146: Do not benchmark by default; lazy-import FLA for bench only

Bench code runs unconditionally and FLA is imported at module load. Gate with run_bench (default False), move the FLA import inside the bench, and keep examples test-friendly. Mirrors prior review.

@@
-def main(B=1, S=512, H=16, D=128):
+def main(B=1, S=512, H=16, D=128, run_bench: bool = False):
@@
-    t1 = do_bench(
-        lambda: fused_chunk_linear_attn(q, k, v, output_final_state=True, normalize=False),
-        backend='cupti')
-    t2 = do_bench(lambda: tl_fused_chunk_fwd(q, k, v), backend='cupti')
-    print(f'Triton latency: {t1:.3f} ms')
-    print(f'TileLang latency: {t2:.3f} ms')
-    print(f'Speedup: {t1/t2:.3f}x')
+    if run_bench:
+        try:
+            from fla.ops.linear_attn import fused_chunk_linear_attn  # type: ignore
+        except Exception:
+            print('[bench] flash-linear-attention not installed; skipping benchmarks.')
+        else:
+            t1 = do_bench(
+                lambda: fused_chunk_linear_attn(q, k, v, output_final_state=True, normalize=False),
+                warmup=25, rep=100, backend='cupti')
+            t2 = do_bench(lambda: tl_fused_chunk_fwd(q, k, v), warmup=25, rep=100, backend='cupti')
+            print(f'Triton latency: {t1:.3f} ms')
+            print(f'TileLang latency: {t2:.3f} ms')
+            print(f'Speedup: {t1/t2:.3f}x')
@@
-if __name__ == '__main__':
+if __name__ == '__main__':
@@
-    main(args.B, args.S, args.H, args.D)
+    main(args.B, args.S, args.H, args.D, run_bench=True)

Also drop the top-level FLA import:

@@
-from fla.ops.linear_attn import fused_chunk_linear_attn  # We compare with FLA

Also applies to: 147-155, 1-9

🧹 Nitpick comments (6)
examples/linear_attention/example_linear_attn_bwd.py (2)

7-9: Make FLA optional for normalization: add a local fallback for l2norm_fwd

Tests currently require FLA due to top-level l2norm_fwd import. Provide a lightweight fallback so examples run without FLA.

@@
-from fla.modules.l2norm import l2norm_fwd
+try:
+    from fla.modules.l2norm import l2norm_fwd  # type: ignore
+except Exception:
+    # Minimal fallback: returns (normalized, aux) with similar interface; aux is unused here.
+    def l2norm_fwd(x: torch.Tensor, eps: float = 1e-6):
+        n = torch.linalg.norm(x.float(), ord=2, dim=-1, keepdim=True)
+        y = x / (n + eps).to(dtype=x.dtype)
+        return y, (1.0 / (n + eps)).to(dtype=x.dtype)

Note: We only use index [0], so aux semantics won’t affect this example. Based on learnings.


31-37: More informative shape checks for chunking constraints

Replace bare assert with a clear error to aid users picking S/D that aren’t multiples of 64.

-    assert S % chunk_size == 0 and DK % BK == 0 and DV % BV == 0
+    if (S % chunk_size != 0) or (DK % BK != 0) or (DV % BV != 0):
+        raise ValueError(
+            f'Expect multiples: S({S}) % {chunk_size} == 0, '
+            f'DK({DK}) % {BK} == 0, DV({DV}) % {BV} == 0.'
+        )
examples/linear_attention/example_linear_attn_fwd.py (4)

40-46: Rename ambiguous variable O to Out (fixes Ruff E741, improves clarity)

Minor polish; avoids the ambiguous single-letter naming warning and reads better.

-    def fused_chunk_linear_attn_fwd(
+    def fused_chunk_linear_attn_fwd(
             Q: T.Tensor([B, S, H, DK], dtype),  # type: ignore
             K: T.Tensor([B, S, H, DK], dtype),  # type: ignore
             V: T.Tensor([B, S, H, DV], dtype),  # type: ignore
-            O: T.Tensor([B, S, H, DV], accum_dtype),  # type: ignore
+            Out: T.Tensor([B, S, H, DV], accum_dtype),  # type: ignore
             final_state: T.Tensor([B, H, DK, DV], accum_dtype)):  # type: ignore
@@
-                T.atomic_add(
-                    O[i_b, i * chunk_size:(i + 1) * chunk_size, i_h, i_v * BV:(i_v + 1) * BV],
-                    o_shared)
+                T.atomic_add(
+                    Out[i_b, i * chunk_size:(i + 1) * chunk_size, i_h, i_v * BV:(i_v + 1) * BV],
+                    o_shared)

As per static analysis hints.

Also applies to: 80-83


7-9: Optional: remove hard FLA dependency for normalization

Provide a local l2norm_fwd fallback so examples run even if FLA isn’t installed.

@@
-from fla.modules.l2norm import l2norm_fwd
+try:
+    from fla.modules.l2norm import l2norm_fwd  # type: ignore
+except Exception:
+    def l2norm_fwd(x: torch.Tensor, eps: float = 1e-6):
+        n = torch.linalg.norm(x.float(), ord=2, dim=-1, keepdim=True)
+        y = x / (n + eps).to(dtype=x.dtype)
+        return y, (1.0 / (n + eps)).to(dtype=x.dtype)

Based on learnings.


91-96: Output dtype choice

You allocate O as float32 and return float32, which is fine for accuracy comparisons. If you prefer API symmetry with inputs, consider returning o.to(q.dtype) in the wrapper.


80-83: Future: consider vectorized atomic/TMA reduce toggle

A use_tma/vectorized atomic toggle (SM90) would help perf tuning without changing examples. Expose as a flag and thread through to T.atomic_add/pass_configs when available.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6fe638 and 5ded38c.

📒 Files selected for processing (2)
  • examples/linear_attention/example_linear_attn_bwd.py (6 hunks)
  • examples/linear_attention/example_linear_attn_fwd.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/linear_attention/example_linear_attn_fwd.py (1)
tilelang/jit/__init__.py (1)
  • jit (244-317)
examples/linear_attention/example_linear_attn_bwd.py (5)
tilelang/transform/pass_config.py (1)
  • PassConfigKey (6-104)
tilelang/language/allocate.py (2)
  • alloc_shared (21-36)
  • alloc_fragment (53-64)
tilelang/layout/swizzle.py (1)
  • make_swizzled_layout (10-18)
tilelang/language/pipeline.py (1)
  • Pipelined (9-46)
tilelang/language/copy.py (1)
  • copy (10-86)
🪛 Ruff (0.14.0)
examples/linear_attention/example_linear_attn_fwd.py

44-44: Ambiguous variable name: O

(E741)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (1)
examples/linear_attention/example_linear_attn_bwd.py (1)

116-143: Kernel logic LGTM: reverse-time dK/dV accumulation and dh updates are consistent

The reverse pipelining, triangular masking, and atomic_add accumulation patterns look correct and mirror the forward path.

Please sanity-check perf on SM80/SM90 with and without swizzle to ensure no regressions under typical H=16/32, D=128, S=4k settings.

@LeiWang1999 LeiWang1999 merged commit ae9a6f0 into tile-ai:main Oct 19, 2025
6 checks passed
RubiaCx added a commit to RubiaCx/tilelang that referenced this pull request Oct 20, 2025
commit b2acfc3
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Sun Oct 19 22:08:13 2025 +0800

    [Benchmark] Add matmul FP16 benchmark results (tile-ai#1067)

commit 17bd0a6
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Sun Oct 19 17:34:12 2025 +0800

    [Enhancement] Deprecate split&sum in attn bwd examples on Hopper and migrate to vectorized atomic add (tile-ai#1065)

commit ae9a6f0
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Sun Oct 19 15:45:58 2025 +0800

    [Refactor][Example] Update linear attention examples and add tests (tile-ai#1010)

    * [Refactor][Example] Update linear attention examples and add tests

    - Refactored the backward and forward linear attention kernels to use shared memory and atomic additions for improved performance.
    - Introduced L2 normalization in the main functions of both examples.
    - Added a new test suite for the linear attention examples to ensure correctness and performance.
    - Updated argument parsing in the main functions for better usability.

    * upd docstring for tma atomic add

    * lint

    * Add flash-linear-attention dependency to requirements.txt

    * Rename main function to chunk_linear_attn_bwd

    * Rename main function to chunk_linear_attn_fwd

    * chore

    ---------

    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

commit b7dfdb3
Author: Xuehai Pan <XuehaiPan@pku.edu.cn>
Date:   Sun Oct 19 12:16:41 2025 +0800

    [Misc] Add GitHub issue templates (tile-ai#1057)

commit fb8b3af
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Sun Oct 19 12:15:44 2025 +0800

    [Benchmark] Add H800 SXM Benchmark results (tile-ai#1063)

    * Add document PYTHONPATH build path

    * update fp8 benchmark result

    * remove redpath

    * remove path

    * tflops fix

commit 4ca6c13
Author: Yuqi Dong <134183314+yyttt6@users.noreply.github.com>
Date:   Sun Oct 19 02:43:00 2025 +0800

    [CI]:Reduce test shapes to avoid OOM errors during CI. (tile-ai#1060)

    * [CI]:Reduce test shapes to avoid OOM errors during CI.

    * rabbit

    * Increase number of processes for pytest from 2 to 4

    ---------

    Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

commit 759c2e3
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Sun Oct 19 00:35:06 2025 +0800

    [DOC] Add document for develop with PYTHONPATH (tile-ai#1062)

commit bf2de5b
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Sun Oct 19 00:21:59 2025 +0800

    Making version parser more robust against missing or unavailable metadata (tile-ai#1061)

commit 7211164
Author: Chaofan Lin <linchaofan@bytedance.com>
Date:   Fri Oct 17 20:56:01 2025 +0800

    [Refactor] Refactor Pass `LegalizeSafeMemoryAccess` to support recursive load/store rewrite (tile-ai#1050)

    * [Refactor] Refactor Pass  to support recursive load/store rewrite

    * lint

    * recursive collect conds for call_extern

    * fix name

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * lint

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * lint

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * address comment

    * rename pad_value to safe_value

    * lint

    * add oob store test

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * fix

    * fix

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit 278c0fb
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Fri Oct 17 18:32:43 2025 +0800

    [Enhancement] Introduce a workaround for layout inference for local buffer store (tile-ai#1055)

    * [Enhancement] Improve layout inference for local buffer handling in parallel operations

    * Added logic to check if a loop only manipulates "local" buffers, which affects thread binding decisions.
    * Updated the condition for determining parallel loop execution to account for local buffer stores.
    * Cleaned up comments for clarity and future considerations.

    * [Refactor] Clean up parallel loop condition formatting in layout inference

    * Reformatted the condition for determining parallel loop execution for better readability.
    * Maintained existing logic while enhancing code clarity for future modifications.

    ---------

    Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

commit 37b3dbd
Author: LJC00118 <77378439+LJC00118@users.noreply.github.com>
Date:   Fri Oct 17 17:15:59 2025 +0800

    [Enhancement] Improve CUDA compiler detection in CMake (tile-ai#1054)

    * improve CUDA compiler detection in CMake

    * Minor fix

commit 1281d6f
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Fri Oct 17 13:44:08 2025 +0800

    [CI] Disable autofix for pre-commit CI (tile-ai#1053)

commit 35cf888
Author: LJC00118 <77378439+LJC00118@users.noreply.github.com>
Date:   Fri Oct 17 13:43:08 2025 +0800

    [Enhancement] Remove constraint requiring last dimension stride to be 1 (tile-ai#1040)

    * remove last dimension stride must be 1 constraint

    * add vectorize test

    * minor fix

    * [Lint]: [pre-commit.ci] auto fixes [...]

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit fd1493b
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Fri Oct 17 11:34:35 2025 +0800

    Automatically initialize submodule if missing (tile-ai#1052)

commit cc00fb6
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Fri Oct 17 11:28:14 2025 +0800

    [Enhancement] Add support for symbolic dimensions in Cython kernel adapter and improve static shape validation in wrapper (tile-ai#1024)

    * [Enhancement] Add support for symbolic dimensions in Cython kernel adapter and improve static shape validation in wrapper

    * [BugFix] Fix shape mismatch and deprecate `T.if()` in fused_moe example

    * [Fix] Add `is_symbolic_expr` function to check for symbolic expressions in TIR

    - Introduced a new utility function `is_symbolic_expr` to determine if an expression is a symbolic expression, enhancing type checking capabilities.
    - Updated shape handling in `CythonKernelAdapter` to utilize the new function, improving handling for symbolic shapes.

commit a79bc5c
Author: Xuehai Pan <XuehaiPan@pku.edu.cn>
Date:   Thu Oct 16 20:38:23 2025 +0800

    [CI] Fix ROCm CI (tile-ai#1043)

    * [CI] fix ROCm CI

    * feat: add a hook to error out on no test runs

commit 1f4ffdb
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Thu Oct 16 17:53:45 2025 +0800

    [Bugfix] Improves compatibility when checking for MPS availability in different PyTorch builds. (tile-ai#1051)

commit e3742d3
Author: Yichen Yan <wenji.yyc@alibaba-inc.com>
Date:   Thu Oct 16 15:52:10 2025 +0800

    Allow mma gemm for all cuda (tile-ai#1047)

commit 0ff4f42
Author: Yuqi Dong <134183314+yyttt6@users.noreply.github.com>
Date:   Thu Oct 16 12:41:09 2025 +0800

    [Feature]: Add test for atomicadd auto vectorize and remove useless code (tile-ai#1019)

    * update

    * format

    * rabbit

commit bd1c7b3
Author: Yu Cheng <54519279+chengyupku@users.noreply.github.com>
Date:   Thu Oct 16 02:52:35 2025 +0800

    [Refactor] Use `has_simt_copy` to decide whether to insert `set_max_nreg` (tile-ai#982)

commit 8f001e0
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Thu Oct 16 01:10:28 2025 +0800

    [BugFix] Phaseout dependency of Triton in sink examples to make CI happy (tile-ai#1045)

    * [BugFix] Phaseout dependency of Triton in sink examples to make CI happy

    - Added `benchmark_gqa_sink_fwd.py` and `benchmark_mha_sink_fwd.py` to evaluate performance of GQA and MHA attention mechanisms using Triton.
    - Refactored existing attention sink implementations to remove Triton kernel definitions from the reference programs, streamlining the code.
    - Updated input generation and benchmarking logic to enhance configurability and performance measurement.
    - Improved overall structure and organization of the examples for better clarity and usability.

    * [Lint]: [pre-commit.ci] auto fixes [...]

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit 8ce2778
Author: Xuehai Pan <XuehaiPan@pku.edu.cn>
Date:   Wed Oct 15 22:12:41 2025 +0800

    [CI][Refactor] Merge test CI workflow files into one (tile-ai#973)

    * refactor: merge test CI workflow files into one

    * chore: set `UV_INDEX_STRATEGY=unsafe-best-match`

    * feat: add AST test with Python 3.8

    * feat: implement manual caching mechanism for self-hosted runners

    * refactor: simplify cache logic for self-hosted runners

    * chore: clear uv cache on failure

    * chore: print format.sh output to logs

    * chore: improve uv caching

    * chore: disable parallel test

    * chore: use `PYTHONDEVMODE=1` in CI

    * feat: enable coredump generation

    * fix: fix perfbench condition

    * Revert "feat: enable coredump generation"

    This reverts commit c52da65.

    * chore: move example CI down

    * Revert "chore: move example CI down"

    This reverts commit 9d8e650.

    * chore: skip example `test_example_mha_sink_bwd_bhsd`

    * chore: skip example `test_example_gqa_sink_bwd_bhsd`

    * fix: fix example argument passing

    * fix: loosen test criteria

    * chore: rename `CMAKE_CONFIGURE_OPTIONS` -> `CLANG_TIDY_CMAKE_OPTIONS` for clarity

    * feat: enable parallel testings

    * chore: update pytest options

    * remove skipped test as now been resolved

    * chore: empty commit to re-trigger ci

    * test for n 1

    * chore: remove ` --numprocesses=1` option in example

    * chore: disable failfast

    * chore: update cibw selection

    * fix: fix git submodule clone

    * chore: update cibw commands

    * fix: fix yapf multiprocessing

    * chore: setup ccache for CIBW on macOS only

    * chore: update comments

    * chore: update artifact listing

    * fix: do not fail if not found nvcc in PATH

    * fix: fix flash-attn installation

    * chore: update dist workflow trigger

    * chore: remove outdated comments

    * chore(workflows/dist): simplify build matrix strategy

    * fix: fix CUDA path finding

    * fix: fix CUDA path finding

    * chore: imcrease CI timeout

    * ci: disable failfast

    * fix: hide path prefix

    * chore: more verbose

    * chore: disable PR trigger for dist workflow

    * fix: seed for tests

    * fix: use nightly torch for ROCm tests

    * chore: enable PR trigger for dist workflow

    * chore: stop uploading debug wheels as artifacts in PR

    * chore: do not run workflows in forks

    * chore: housekeep requirements

    * chore: use Nightly-ROCm-6.3 for CI

    * chore: use Nightly-ROCm-6.4 for CI

    * Update ROCm toolkit version to 7.0

    * chore: restore previous rocm-ci.yml for test

    * fix: cleanup PYTHONPATH

    * chore: remove previous rocm-ci.yml

    * ci fix

    * chore: remove previous rocm-ci.yml

    * chore: enable parallel example run

    ---------

    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    Co-authored-by: alex_xiao <xinyuxiao2024@gmail.com>

commit 80665cd
Author: alex_xiao <xinyuxiao2024@gmail.com>
Date:   Wed Oct 15 21:17:14 2025 +0800

    fix bug&add amd examples (tile-ai#966)

    * [Enhancement] Refactor buffer index handling for improved precision and clarity (tile-ai#668)

    - Enhanced buffer index handling to address precision issues by removing redundant operations.
    - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
    - Updated related documentation to reflect changes in buffer management practices.

    * Remove obsolete test script for AMD example, streamlining the examples directory.

    * Remove unused dtype_size variable in AMD example script to streamline code.

    * Add input configuration file and update AMD example script for enhanced flexibility

    - Introduced a new input.txt file for configurable parameters.
    - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
    - Streamlined the main function for better clarity and organization.
    - Added a new test script to facilitate running the example with specified parameters.

    * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

    - Deleted input.txt and test.sh files as they are no longer needed.
    - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
    - Reintroduced swizzle usage in the kernel for better performance.

    * Refactor AMD example script for FlashAttention-2

    - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
    - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
    - Removed outdated comments and improved code organization for better readability.

    * Refactor formatting in AMD FlashAttention example script

    - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
    - Streamlined the `main` function parameter formatting for consistency.
    - Removed unnecessary blank lines to enhance overall code organization.

    * Update example_amd_flash_attn_fwd.py

    * Enhance AMD example script and update CI workflows

    - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
    - Added new CI workflows for AMD and documentation publishing.
    - Updated various requirements files to include necessary dependencies.
    - Introduced new test cases and examples for better coverage and functionality.
    - Refactored existing code for improved readability and maintainability.

    * Remove redundant tool cache cleanup step in AMD CI workflow

    * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

    * Add new AMD FlashAttention example and test script

    - Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
    - Added `test.sh` script to facilitate running the new example with specified parameters.
    - Enhanced the overall structure and organization of the example for better clarity and usability.

    * Update configurations in `example_amd_flash_attn_fwd.py` for autotuner

    - Reduced the number of threads and `num_split_q` options for improved performance.
    - Adjusted `panel_size` options to streamline configuration settings.

    * Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217

    * Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c

    * Add example for AMD Flash Attention backward pass implementation

    - Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
    - Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
    - Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
    - Included reference implementation for validation against PyTorch's attention mechanism.

    This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.

    * Enhance AMD Flash Attention example with additional testing capabilities

    - Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
    - Improved the main function to allow for better parameter configuration and benchmarking.
    - Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.

    This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.

    * Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

    * Refactor HIP intrinsic rules to CUDA

    - Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
    - Adjusted include paths for better organization and clarity in the code structure.

    * Update AMD CI workflow to uninstall specific PyTorch packages before installation

    - Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
    - Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.

    * Remove unused shared memory allocations in AMD Flash Attention backward example

    - Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
    - This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.

    * Remove unnecessary pip uninstall command from AMD CI workflow

    - Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
    - This change simplifies the CI process and reduces potential overhead during package management.

    * Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules

    - Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
    - Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.

    * Refactor formatting of HIP intrinsic rule registrations

    - Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
    - No functional changes were made; this update focuses on code style improvements to enhance maintainability.

    * Update file name and documentation for HIP intrinsic rules

    - Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
    - Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.

    * Enhance DispatchHIPShuffle function with clang-analyzer comments

    - Added NOLINTBEGIN and NOLINTEND comments to the DispatchHIPShuffle function to suppress clang-analyzer warnings related to inner pointer usage.
    - This change improves code clarity and maintains compliance with static analysis tools.

    * lint fix

    * fix

    * Enhance autotuner configurations in example_amd_flash_attn_fwd.py by adding new block sizes, stages, and panel sizes. Update test script to use relative Python path and adjust parameters for consistency.

    * Add backward attention example to test script

    - Extended the test.sh script to include a new backward attention example using example_amd_flash_attn_bwd.py.
    - Added parameters for batch size, context length, and head dimensions to ensure consistency with the forward example.
    - Updated the command for the backward tile example to match the new configuration.

    * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py

    - Introduced new functions for forward and backward configurations to enhance autotuning capabilities.
    - Updated the FlashAttention forward and backward functions to improve performance and maintainability.
    - Adjusted test script parameters for consistency and clarity, including the addition of group handling.
    - Enhanced the autotuner configurations by refining block sizes and stages for better performance tuning.
    - Updated the main function to reflect changes in parameter names and types for better usability.

    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

    - Updated the backward function to return additional outputs, including log-sum-exp (LSE) values for improved gradient calculations.
    - Refined autotuner configurations by adding new block sizes and adjusting parameters for better performance tuning.
    - Improved shared memory usage in the backward pass to optimize memory access patterns and enhance computational efficiency.
    - Updated the main function to reflect changes in parameter handling and ensure consistency with the forward pass.
    - Enhanced correctness checks in the main function to include LSE validation alongside gradient checks.

    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

    - Introduced a scaling factor for improved numerical stability in gradient calculations.
    - Optimized shared memory usage by adding new shared buffers for intermediate calculations.
    - Refined the handling of tensor fragments to improve performance and maintainability.
    - Updated the main function to ensure compatibility with the new output parameters for backward operations.
    - Removed unnecessary parameters from the test script to streamline execution.

    * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_mha_bwd.py

    - Updated the forward and backward functions to improve numerical stability and performance.
    - Enhanced shared memory usage by optimizing buffer allocations and reducing unnecessary parameters.
    - Adjusted autotuner configurations for better performance tuning and compatibility with new output parameters.
    - Added debugging and benchmarking functions for improved correctness verification and performance analysis.
    - Updated the main function to reflect changes in parameter handling and ensure consistency across examples.

    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

    - Updated scaling factor application for improved numerical stability in gradient calculations.
    - Refined tensor handling to ensure consistency with forward pass operations.
    - Optimized atomic operations for writing gradients to dK and dV using fp32 for better precision.
    - Adjusted comments for clarity and alignment with standard implementation practices.

    * Expand autotuner configurations in example_amd_flash_attn_bwd.py and update test.sh

    - Increased the range of block sizes and stages for forward and backward configurations to enhance performance tuning.
    - Adjusted the test script to include additional parameters for batch size and head dimensions, ensuring consistency with the forward example.
    - Improved comments for clarity and alignment with the updated configurations.

    * Enhance performance calculations and benchmarking in example_amd_flash_attn_bwd.py

    - Updated FLOPs calculation to account for both forward and backward passes, clarifying the total computational cost.
    - Modified benchmarking functions to evaluate the complete forward and backward performance of both reference and Tile-lang implementations.
    - Improved comments for better understanding of the performance metrics and implementation details.
    - Removed unnecessary parameter from test.sh to streamline execution.

    * Remove forward attention test commands from test.sh and retain backward attention execution for streamlined testing.

    * Refactor FlashAttention forward and backward implementations in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py

    - Updated the forward function to return both output and log-sum-exp (LSE) values for improved gradient calculations.
    - Enhanced autotuner configurations for forward pass, including new parameters for better performance tuning.
    - Refined scaling factor calculations for numerical stability in both forward and backward passes.
    - Improved comments and documentation for clarity and consistency across implementations.
    - Adjusted main function to reflect changes in parameter handling and ensure compatibility with new output requirements.

    * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py

    - Removed outdated comments and improved clarity in the code.
    - Enhanced the forward function to consistently return output and log-sum-exp (LSE) values.
    - Updated autotuner configurations to include new parameters for better performance tuning.
    - Refined tensor handling and scaling factor calculations for improved numerical stability.
    - Adjusted the main function to ensure compatibility with updated output requirements and parameter handling.

    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

    - Updated configuration parameters for backward calculations, including new options for block sizes, threads, and rasterization.
    - Added new parameters (k_pack, qk_coalesced_width, v_coalesced_width) to improve performance tuning and memory access patterns.
    - Modified tensor copy operations to utilize coalesced widths for optimized memory loads.
    - Enhanced GEMM operations with k_pack for improved computational efficiency.
    - Refined the configuration generation logic to accommodate the new parameters, ensuring comprehensive coverage for backward pass scenarios.

    * Refactor configuration and tensor operations in example_amd_flash_attn_bwd.py

    - Updated backward configuration parameters to include larger block sizes and a wider range of threads for enhanced performance tuning.
    - Removed unnecessary parameters (k_pack, qk_coalesced_width, v_coalesced_width) from function signatures and tensor operations to simplify the implementation.
    - Optimized tensor copy operations by eliminating coalesced width specifications, streamlining memory access patterns.
    - Adjusted GEMM operations to improve computational efficiency without the use of k_pack.

    * Enhance HIP code generation and FP8 type support

    - Added support for additional FP8 types (e4m3, e4m3b11fnuz, e5m2fnuz, e8m0) in codegen_hip.cc to improve compatibility.
    - Updated error logging to include unsupported FP8 type details for better debugging.
    - Implemented handling for loop break and no-op register management in HIP within VisitExpr_ method.
    - Introduced new FP8 vector types (e5 and e8) in hip_fp8.h for enhanced functionality.
    - Added overloads for AtomicAdd in common.h to support both pointer and value arguments.

    * Enhance FP8 type support and clarify accumulator handling in HIP

    - Expanded FP8 type support in codegen_hip.cc to include additional float8 formats.
    - Updated gemm.h to clarify the handling of the accumulator when clear_accum is true.
    - Added comments in hip_fp8.h to indicate that E8M0 types are not supported in the current HIP version.

    * Remove deprecated files and update print statements for clarity in example_amd_flash_attn_bwd.py

    * Update print statement formatting for clarity in example_amd_flash_attn_bwd.py

    * Remove redundant verification results summary print statement in example_amd_flash_attn_bwd.py for cleaner output.

    * Fix formatting inconsistencies in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py by adding spaces for improved readability in configuration parameters and print statements.

    * Refactor and enhance HIP code generation for improved FP8 support

    - Reorganized and cleaned up code in codegen_hip.cc for better readability and maintainability.
    - Enhanced handling of FP8 types, including additional formats and improved error logging for unsupported types.
    - Updated AtomicAdd function in common.h to streamline its implementation.
    - Refined the PrintVecElemLoadExpr method to handle volatile loads more effectively.
    - Added function to manage the addition of new functions in the code generation process.

    * Fix formatting issue in HIP code generation for MFMA call

    - Adjusted the indentation of the MFMA call code block in codegen_hip.cc for improved readability and consistency.

    * Refactor HIP code generation and enhance FP8 type handling

    - Reintroduced necessary includes and reorganized code in codegen_hip.cc for improved structure and readability.
    - Enhanced the GetFP8Type function to support additional FP8 formats and improved error handling for unsupported types.
    - Updated PrintType and PrintVecElemLoadExpr methods to better manage type conversions and vector element loading.
    - Refined the AddFunction method to streamline function addition in the code generation process.

    * Remove unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.

    * Refactor backward attention implementation in example_amd_flash_attn_bwd.py

    - Updated the GEMM operation to use shared memory for improved performance.
    - Adjusted parallelization parameters to enhance efficiency in the backward pass.

    * Fix formatting by removing an unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.

    * Add additional test cases for `assert_tl_matmul_correctness` with `float8_e4m3fnuz` and various configurations

    * Refactor test case formatting for `assert_tl_matmul_correctness` in `test_tilelang_gemm_mfma_intrinsic.py`

    ---------

    Co-authored-by: xinxyxiao <xinyxiao@amd.com>
    Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
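
Several bullets in the commit above replace per-block partial dK/dV outputs with fp32 atomic accumulation. The toy kernel below shows that pattern on a simple column sum: many blocks add their partial results into one output buffer via `T.atomic_add`. Shapes, buffer names, and the `tilelang.jit` usage are modeled on the project's examples and are assumptions here, not code from the commit.

```python
import torch
import tilelang
import tilelang.language as T

@tilelang.jit
def atomic_colsum(M=1024, N=64, block_M=128):

    @T.prim_func
    def main(A: T.Tensor((M, N), "float32"), Out: T.Tensor((N,), "float32")):
        # One block per chunk of rows; all blocks race on Out, so the
        # accumulation has to be atomic (fp32, mirroring the dK/dV updates).
        with T.Kernel(T.ceildiv(M, block_M), threads=128) as bx:
            A_sh = T.alloc_shared((block_M, N), "float32")
            T.copy(A[bx * block_M, 0], A_sh)
            for i, j in T.Parallel(block_M, N):
                T.atomic_add(Out[j], A_sh[i, j])

    return main

kernel = atomic_colsum()
a = torch.randn(1024, 64, dtype=torch.float32, device="cuda")
out = torch.zeros(64, dtype=torch.float32, device="cuda")   # must start at zero
kernel(a, out)
torch.testing.assert_close(out, a.sum(dim=0), rtol=1e-3, atol=1e-3)
```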

commit b78d840
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Wed Oct 15 16:38:55 2025 +0800

    [Language] Expose `T.get_warp_idx_sync` and `T.shuffle_elect` for efficient thread election (tile-ai#989)

    * Expose CUDA warp/lane intrinsics in TileLang frontend

    * generalize warp indexing intrinsics and add coverage

    * [Lint]: [pre-commit.ci] auto fixes [...]

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
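
A minimal sketch of what the new intrinsics enable. The semantics assumed here (not spelled out in the commit) are that `T.get_warp_idx_sync()` returns the calling warp's index within the block and `T.shuffle_elect()` is true for exactly one lane of each warp, so the guarded store runs once per warp instead of once per thread.

```python
import tilelang
import tilelang.language as T

@tilelang.jit
def per_warp_flag(num_warps: int = 4):

    @T.prim_func
    def main(Flags: T.Tensor((num_warps,), "int32")):
        with T.Kernel(1, threads=num_warps * 32) as bx:
            warp_id = T.get_warp_idx_sync()   # assumed: warp index in [0, num_warps)
            if T.shuffle_elect():             # assumed: one elected lane per warp
                Flags[warp_id] = 1            # executed by a single thread per warp

    return main
```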

commit 32ddc1a
Author: LJC00118 <77378439+LJC00118@users.noreply.github.com>
Date:   Wed Oct 15 15:25:43 2025 +0800

    [CUDA] Add pack functions for FP8 types (tile-ai#967)

    * Remove an incorrect check

    * add fp8 pack function

    * code lint

    * minor fix

    * minor fix

    * minor fix

    * Minor fix

    * Minor fix
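
The pack functions themselves live in CUDA, but the underlying idea is easy to show from Python: several 1-byte FP8 values are reinterpreted as one wider word so a kernel can move or consume multiple FP8 lanes at once. The snippet below is only a conceptual illustration using PyTorch's `float8_e4m3fn` dtype (recent PyTorch required); the 4-to-1 packing and dtype choice are assumptions, not the commit's kernels.

```python
import torch

vals = torch.tensor([0.5, -1.0, 2.0, 0.25])   # exactly representable in e4m3
fp8 = vals.to(torch.float8_e4m3fn)             # 4 elements, 1 byte each
packed = fp8.view(torch.int32)                 # 1 element, 4 bytes: the packed word
print(hex(packed.item() & 0xFFFFFFFF))
unpacked = packed.view(torch.float8_e4m3fn).to(torch.float32)
print(unpacked)                                # round-trips the four FP8 values
```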

commit c67f73b
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Wed Oct 15 15:12:08 2025 +0800

    [Env] Optimize the mechanism for locating `TL_LIBS` (tile-ai#1038)

commit e539952
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Wed Oct 15 15:11:40 2025 +0800

    [TIR] Revert some changes of Pass `LowerIntrin` (tile-ai#1035)

    * keep >> instead of /

    * re think replicate

    * lint fix

    * handle const int buffers

    * rep fix

    ---------

    Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
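
The "keep >> instead of /" bullet is about lowering division by a power of two to a right shift. A tiny illustration of the equivalence the pass preserves, in plain Python:

```python
# For non-negative integers, floor division by 2**k equals a right shift by k,
# which is the cheaper form the pass keeps emitting.
for x in [0, 1, 7, 42, 12345]:
    for k in [1, 2, 5]:
        assert x // (1 << k) == x >> k

# Under C-style truncating division the two differ for negative dividends
# (-7 / 4 == -1 in C, while -7 >> 2 == -2), so the rewrite is only safe when the
# operand is known to be non-negative.
print("ok")
```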

commit 5767475
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Tue Oct 14 23:55:27 2025 +0800

    [CI] Disable buggy(maybe) warp specialized kernel ci test for H20 (tile-ai#1033)

commit eed320f
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Tue Oct 14 21:51:31 2025 +0800

    [Bugfix] Recover code for flexible parallel (tile-ai#1032)

    * recover flex parallel process

    * lint fix

    ---------

    Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

commit 1e8f0b1
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Tue Oct 14 17:26:23 2025 +0800

    [Enhancement] Update abs function for half_t and bfloat_t to use cutlass implementation (tile-ai#1023)

    * [Enhancement] Update abs function for half_t and bfloat_t to use cutlass implementation

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * optimize amd ci

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>