[Feature] Low-bit twiddling dequantization and FP4 GEMM #725
Conversation
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Deletes a legacy MXFP4 dequant/GEMM example, adds BF16+FP4 GEMM examples with dual dequant paths (extern twiddling and TIR), introduces MXFP intrinsics and a TIR FP4→BF16 helper, moves index utilities, adds example utils/tests, minor example edits, and a local CI test runner.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant ExampleMain as example_dequant_gemm::main
    participant TL as TileLang kernel
    participant Intrin as MXFP extern (twiddling)
    participant TIR as Simple TIR dequant
    participant Torch as Torch reference
    User->>ExampleMain: run(m,n,k, fast_dequant, tune)
    ExampleMain->>TL: build/run matmul kernel
    alt fast_dequant
        TL->>Intrin: request get_mxfp_intrin_group & extern func
        TL->>Intrin: call extern twiddling dequant on packed qB
    else simple
        TL->>TIR: run _tir_u8_to_f4_to_bf16 per-element dequant
    end
    TL->>ExampleMain: produce C
    ExampleMain->>Torch: compute reference (twiddling or simple)
    Torch-->>ExampleMain: C_ref
    ExampleMain->>User: validate & report perf
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run the required checks before submitting. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Summary of Changes
Hello @tzj-fxz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces significant performance enhancements for quantized models by integrating optimized low-bit dequantization and General Matrix Multiply (GEMM) kernels. Specifically, it focuses on operations involving FP4 and mixed-precision FP4 (MXFP4) with bfloat16 outputs. These changes are designed to leverage the capabilities of modern hardware, such as NVIDIA Hopper GPUs, to accelerate AI training and inference workloads.
Highlights
- Bit-Twiddling Dequantization: A new dequantization method for converting FP4 to bfloat16 is implemented using bit-twiddling techniques, leveraging custom CUDA assembly for optimized performance.
- New GEMM Kernels: Two new General Matrix Multiply (GEMM) kernels have been added: one for BF16xFP4 matrix multiplication and another for BF16xMXFP4, which includes support for scaling.
- MXFP Quantization Module: A new module, `tilelang/quantize/mxfp.py`, has been introduced to encapsulate the intrinsic group for mixed-precision floating-point (MXFP) operations, including the new bit-twiddling dequantization.
- Updated Autotuning and Benchmarking: The existing `example_dequant_gemm_mxfp4_hopper.py` has been refactored to utilize the `AutoTuner` class, simplifying the tuning and benchmarking process and improving code clarity.
- Explicit Dequantization Function: The `_tir_u8_to_f4_to_bf16` function has been added to `tilelang/quantize/quantization.py`, providing a low-level, explicit implementation for FP4 to bfloat16 dequantization within the TIR framework.
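Regarding the bit-twiddling and TIR dequantization highlights above, a minimal Python sketch of the per-nibble FP4 (E2M1: 1 sign, 2 exponent, 1 mantissa bit) to BF16 (1 sign, 8 exponent, 7 mantissa bits) mapping may help orient reviewers. It mirrors the +126 rebias used by the PR's TIR helper; the review threads below debate +120 vs +126 and zero handling, so treat this as illustrative, not authoritative:

```python
# Illustrative only: one FP4 (E2M1) nibble -> BF16 bit pattern.
# This is not the CUDA twiddling path added in this PR.
def fp4_nibble_to_bf16_bits(f4: int) -> int:
    s = (f4 >> 3) & 0x1   # sign bit
    e = (f4 >> 1) & 0x3   # 2-bit exponent field
    m = f4 & 0x1          # 1-bit mantissa
    e_bf16 = e + 126      # rebias into BF16's 8-bit exponent (as in _tir_u8_to_f4_to_bf16)
    # sign | exponent | mantissa placed at BF16's top mantissa bit
    return (s << 15) | (e_bf16 << 7) | (m << 6)
```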
Code Review
This pull request introduces low-bit dequantization using bit-twiddling for FP4 to BF16, along with new GEMM kernels and examples. The changes are significant and add valuable new features. My review focuses on correctness, maintainability, and potential bugs. I've identified a few critical issues, such as using an uninitialized shared memory buffer and using Python's built-in min function on TIR expressions, which will cause runtime errors. There are also several medium-severity issues like hardcoded paths, unused code, and incorrect comments that should be addressed to improve code quality.
```python
C_shared = T.alloc_shared((block_M, block_N), out_dtype)

T.annotate_layout({
    A_shared: tilelang.layout.make_swizzled_layout(A_shared),
```
The shared memory buffer Scale_shared is used without being initialized with data from the global Scale tensor, which will lead to incorrect results. You need to copy the scale data from global to shared memory within the pipelined loop.
Suggested change:

```python
A_shared: tilelang.layout.make_swizzled_layout(A_shared),
T.copy(B[bx * block_N, k * block_K // num_elems_per_byte], B_shared)
T.copy(Scale[bx * block_N, k * block_K // scale_size], Scale_shared)
```
```python
e_bf16 = e_f4 + tir.const(126, "uint16")
# Scale is the exponential part, within the representation of uint8
# To handle the overflow, we use the max function to limit the exponential part to 8 bits
e_bf16 = min(e_bf16 + scale, tir.const((1 << 8) - 1, "uint16"))
```
You are using Python's built-in min function on TIR expressions. This will raise a TypeError at runtime because min cannot handle PrimExpr objects. You should use tir.min instead.
Suggested change:

```diff
-e_bf16 = min(e_bf16 + scale, tir.const((1 << 8) - 1, "uint16"))
+e_bf16 = tir.min(e_bf16 + scale, tir.const((1 << 8) - 1, "uint16"))
```
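As a sanity check of why this matters, a small standalone snippet (assuming a TVM-based environment, as used by TileLang; names here are illustrative):

```python
from tvm import te, tir

x = te.var("x", dtype="uint16")
cap = tir.const(255, "uint16")
expr = tir.min(x, cap)  # builds a tir.Min node that TIR can lower
# Python's built-in min(x, cap) would need to evaluate bool(x < cap) on a
# symbolic expression, which PrimExpr does not support, hence the error.
```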
```python
    debug_root_path="/home/tzj/tilelang/examples/dequantize_gemm/",
    # pass_configs={tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True, tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True},
)
def kernel_func(block_M, block_N, block_K, num_stages, threads, split=1):
```
The comment states that local_compress_size is the same as vectorize_dequant_size. However, based on the calculations on lines 75-76 and the value on line 78, local_compress_size is 4 while vectorize_dequant_size is 8. The comment is incorrect and should be removed or corrected.
Suggested change:

```python
def kernel_func(block_M, block_N, block_K, num_stages, threads, split=1):
    # local_compress_size is used for allocating local thread buffer for compressed data
```
```python
local_size = MAX_TRANSACTION_SIZE_BITS // DataType(in_dtype).bits
local_compress_size = local_size // num_elems_per_byte
# local_compress_size is the same as vectorize_dequant_size
vectorize_dequant_size = 8
```
```python
# To use the autotuner, we need to return a PrimFunc
# So we need to comment out the @tilelang.jit decorator
@tilelang.jit(
    out_idx=[-1],
```
```python
sj = index_scale % (block_K // scale_size)
Scale_local_thread[0] = Scale_shared[si, sj]
```
This loop is applying scaling to the dequantized values. Using T.Parallel might not be the most efficient approach here. Given that the subsequent loop is vectorized, it's likely that this loop should also be T.vectorized for better performance and to ensure correct code generation for SIMD instructions.
Suggested change:

```python
for v in T.vectorized(0, vectorize_dequant_size):
    B_dequantize_local_thread[v] *= Scale_local_thread_exponent[0]
```
```python
for i in range(B.shape[0]):
    for j in range(B.shape[1]):
        B[i][j] = B[i][j] * (2**(Scale[i][j // 32] - 127))
C = torch.matmul(A.to(torch.float), B.T.to(torch.float))
C = C.to(torch.__getattribute__(dtypeC))
```
The correctness check (assert_allclose) and reference program benchmark are commented out. While this might be useful for debugging, they should be re-enabled to ensure the kernel produces correct results before merging.
Suggested change:

```python
for i in range(B.shape[0]):
    for j in range(B.shape[1]):
        B[i][j] = B[i][j] * (2**(Scale[i][j // 32] - 127))
C = torch.matmul(A.to(torch.float), B.T.to(torch.float))
C = C.to(torch.__getattribute__(dtypeC))
profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
print("All checks pass.")
latency = profiler.do_bench(ref_program, warmup=500)
print("Ref: {:.2f} ms".format(latency))
print("Ref: {:.2f} TFlops".format(total_flops / latency * 1e-9))
```
```diff
@@ -0,0 +1,94 @@
+from typing import Literal, Dict
+
+# Implementation asm for fp4 to bf16, using twiddling
```
```python
    This function is used to get the intrinsic group of the MXFP operation to avoid the overhead of fast decoding.
    MXFP is a type of logic operation that takes three inputs. The intrinsic group refers to the set of
    intrinsic operations that can be performed on these inputs. This function retrieves and returns this group.
    """
    assert out_dtype in ["float16", "bfloat16"
```
The docstring for get_mxfp_intrin_group appears to be copied from another module and is inaccurate. It describes MXFP as a three-input logic operation, but this function provides intrinsics for FP4 dequantization. The docstring should be updated to reflect the actual functionality.
Suggested change:

```python
"""
This function returns an intrinsic group for MXFP dequantization.
It provides the C source code and function name for a fast decoding
intrinsic, for example from FP4 to bfloat16.
"""
```
```python
val_concat = (val0.item() << 8) | val1.item()
mask = 0b1000000111000000
```
Actionable comments posted: 3
♻️ Duplicate comments (12)
tilelang/quantize/quantization.py (1)
29-49: Fix TIR expr ops and exponent mapping in `_tir_u8_to_f4_to_bf16`

Two correctness issues:
- Python min on PrimExpr (Line 43) must be tir.min. This will otherwise raise TypeError.
- The F4→BF16 exponent mapping should mirror the f4→f32 path (+120 with zero handling). Current +126 is inconsistent with _tir_u32_to_f4_to_f32 and likely wrong. Also cast val to uint16 before shifting to avoid dtype mismatch, and handle e_f4 == 0 as zero.
Apply this diff:
```diff
 def _tir_u8_to_f4_to_bf16(nbit: int, val: tir.PrimExpr, pos: tir.PrimExpr, scale: tir.PrimExpr, dtype: str):
     assert nbit == 4
     assert dtype == "bfloat16"
     assert val.dtype == "uint8"
-    mask = tir.const((1 << nbit) - 1, "uint16")
-    f4 = (val >> (pos.astype("uint16") * tir.const(nbit, "uint16"))) & mask
+    mask = tir.const((1 << nbit) - 1, "uint16")
+    # Upcast to avoid mixed-width shift ops on TIR
+    val_u16 = val.astype("uint16")
+    shift = pos.astype("uint16") * tir.const(nbit, "uint16")
+    f4 = (val_u16 >> shift) & mask
     s = f4 >> tir.const(3, "uint16")
     e_f4 = (f4 & tir.const(6, "uint16")) >> tir.const(1, "uint16")
-    # Exponential bias between f4 and bf16 is 2^(8-1) - 2^(2-1) = 126
-    e_bf16 = e_f4 + tir.const(126, "uint16")
-    # Scale is the exponential part, within the representation of uint8
-    # To handle the overflow, we use the max function to limit the exponential part to 8 bits
-    e_bf16 = min(e_bf16 + scale, tir.const((1 << 8) - 1, "uint16"))
+    # Map e_f4 -> bf16 exponent: e_bf16 = e_f4 + 120 (same as f4->f32), then apply exponent scaling
+    e_bf16 = (e_f4 | tir.const(120, "uint16")) + scale.astype("uint16")
+    # Clamp exponent into 8-bit range
+    e_bf16 = tir.min(e_bf16, tir.const((1 << 8) - 1, "uint16"))
     m_f4 = f4 & tir.const(1, "uint16")
-    val_bf16 = tir.reinterpret("bfloat16",
-                               ((((s << tir.const(8, "uint16")) | e_bf16) << tir.const(7, "uint16"))
-                                | (m_f4 << tir.const(6, "uint16"))).astype("uint16"))
-    return val_bf16
+    bf16_bits = ((((s << tir.const(8, "uint16")) | e_bf16) << tir.const(7, "uint16"))
+                 | (m_f4 << tir.const(6, "uint16"))).astype("uint16")
+    # Zero handling: e_f4 == 0 => 0.0 bf16 (aligns with other f4 conversion helpers)
+    bf16_bits = tir.Select(e_f4 == tir.const(0, "uint16"), tir.const(0, "uint16"), bf16_bits)
+    return tir.reinterpret("bfloat16", bf16_bits)
```

Follow-up:
- If scale semantics intend to affect zeros, adjust the Select accordingly. Otherwise this matches the repository’s existing f4 conversion behavior.
tilelang/quantize/mxfp.py (1)
62-66: Update inaccurate docstring

The docstring describes MXFP as a 3-input logic op; here we're returning metadata for FP4→BF16 decoding intrinsics. Please correct to avoid confusion.
Apply this diff:
- """ - This function is used to get the intrinsic group of the MXFP operation to avoid the overhead of fast decoding. - MXFP is a type of logic operation that takes three inputs. The intrinsic group refers to the set of - intrinsic operations that can be performed on these inputs. This function retrieves and returns this group. - """ + """ + Return the intrinsic (function name + C source) for MXFP dequantization. + + Currently supports FP4 → bfloat16 decoding, optionally via a twiddling-based + path on Hopper (sm_90+). + """examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper.py (1)
202-204: Remove hardcoded user-specific path

The `debug_root_path` contains a hardcoded, user-specific path which will cause issues for other users. It should be removed or made configurable.

```diff
 @tilelang.jit(
     out_idx=[-1],
-    debug_root_path="/home/tzj/tilelang/examples/dequantize_gemm/",
 )
```

examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py (6)
23-26: Remove unused function

The function `print_bit` is defined but never used.

```diff
-def print_bit(name, val):
-    val_cpu = val.cpu().item()
-    binary_repr = f'{val_cpu:032b}'
-    print(name, binary_repr)
```
31-32: Remove redundant `.view(torch.uint8)` calls

The `.view(torch.uint8)` calls are redundant because the assertions already ensure the tensors are of type `torch.uint8`.

```diff
-val0 = val0.view(torch.uint8)
-val1 = val1.view(torch.uint8)
```
110-110: Remove redundant assignment

The assignment `import_source = import_source` has no effect and should be removed.

```diff
-import_source = import_source
```
163-164: Consider using `T.vectorized` for consistency

The loop applying scaling uses `T.Parallel`, but the subsequent loop uses `T.vectorized`. For better consistency and potentially better code generation for SIMD instructions, consider using `T.vectorized` here as well.

```diff
-for v in T.Parallel(vectorize_dequant_size):
+for v in T.vectorized(0, vectorize_dequant_size):
     B_dequantize_local_thread[v] *= Scale_local_thread_exponent[0]
```
74-74: Remove hardcoded user-specific debug path

The `debug_root_path` contains a hardcoded, user-specific path that will cause issues for other users.

```diff
 @tilelang.jit(
     out_idx=[-1],
-    debug_root_path="/home/tzj/tilelang/examples/dequantize_gemm/",
     # pass_configs={tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True, tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True},
 )
```
219-223: Enable correctness check before merging

The correctness check and reference program benchmark are commented out. These should be re-enabled to ensure the kernel produces correct results.

```diff
-# profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
-# print("All checks pass.")
-# latency = profiler.do_bench(ref_program, warmup=500)
-# print("Ref: {:.2f} ms".format(latency))
-# print("Ref: {:.2f} TFlops".format(total_flops / latency * 1e-9))
+profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
+print("All checks pass.")
+latency = profiler.do_bench(ref_program, warmup=500)
+print("Ref: {:.2f} ms".format(latency))
+print("Ref: {:.2f} TFlops".format(total_flops / latency * 1e-9))
```

examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper_serial.py (3)
23-26: Remove unused function

The function `print_bit` is defined but never used.

```diff
-def print_bit(name, val):
-    val_cpu = val.cpu().item()
-    binary_repr = f'{val_cpu:032b}'
-    print(name, binary_repr)
```
31-32: Remove redundant `.view(torch.uint8)` calls

The `.view(torch.uint8)` calls are redundant as the assertions already ensure the tensors are of type `torch.uint8`.

```diff
-val0 = val0.view(torch.uint8)
-val1 = val1.view(torch.uint8)
```
101-101: Remove redundant assignment

The assignment `import_source = import_source` has no effect and should be removed.

```diff
-import_source = import_source
```
🧹 Nitpick comments (6)
examples/gemm/example_gemm_autotune.py (1)
19-20: Guard HIP/CUDA arch selection more defensively

Using torch.version.hip to choose CDNA vs CUDA works in ROCm builds, but it may still be set even if no HIP device is available—CDNA("hip") will then raise at runtime. Consider a defensive check to prefer CUDA if no ROCm device exists.
Apply this diff to make the selection more robust:
```diff
-arch = CDNA("hip") if torch.version.hip is not None else CUDA("cuda")
+# Prefer ROCm only when torch was built with HIP and a ROCm device is present.
+try:
+    use_rocm = (getattr(torch.version, "hip", None) is not None)
+    arch = CDNA("hip") if use_rocm else CUDA("cuda")
+except RuntimeError:
+    # Fallback to CUDA if HIP device is not available at runtime
+    arch = CUDA("cuda")
```

tilelang/quantize/mxfp.py (1)
5-48: Define uint alias inside the C source for portability

The CUDA snippet uses the non-standard alias uint. Define it to avoid compiler differences across toolchains.
Apply this diff to the snippet header:
```diff
 decode_f4_to_bf16_twiddling = """
+using uint = unsigned int;
 // N should be the number of elements processed by one thread
 template<typename T1, typename T2>
 __device__ void decode_fp4_to_bf16_twiddling(T1 *B_local, T2 *B_local_decode, const int N = 8) {
```

examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper.py (2)
3-3: Avoid wildcard imports

Using wildcard imports makes it difficult to track which symbols are imported and can lead to namespace pollution. Import only the required symbols explicitly.

```diff
-from tilelang.autotuner import *
+from tilelang.autotuner import AutoTuner
```
369-381: Inconsistent warmup/rep values between tune and non-tune paths

When `tune=False`, the code uses different warmup values (5 for ref, 3 for Tile-lang) compared to the tuning path (3 warmup, 10 rep). Consider standardizing these values for consistency.

```diff
-latency = profiler.do_bench(ref_program_scale, warmup=5)
+latency = profiler.do_bench(ref_program_scale, warmup=3, rep=10)
 print("Ref: {:.2f} ms".format(latency))
 print("Ref: {:.2f} TFlops".format(total_flops / latency * 1e-9))
-latency = profiler.do_bench(warmup=3, rep=5)
+latency = profiler.do_bench(warmup=3, rep=10)
```

examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py (1)
3-3: Avoid wildcard imports

Using wildcard imports makes it difficult to track which symbols are imported and can lead to namespace pollution. Consider importing only required symbols explicitly.

Since the file doesn't appear to use any symbols from `tilelang.autotuner`, you can remove this import entirely:

```diff
-from tilelang.autotuner import *
```

examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper_serial.py (1)
3-3: Remove unnecessary wildcard import

The file doesn't appear to use any symbols from `tilelang.autotuner`, so this import can be removed entirely.

```diff
-from tilelang.autotuner import *
```
📒 Files selected for processing (8)
- examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper_serial.py (1 hunks)
- examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper.py (6 hunks)
- examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py (1 hunks)
- examples/dequantize_gemm/test_example_dequantize_gemm.py (2 hunks)
- examples/gemm/example_gemm_autotune.py (2 hunks)
- tilelang/quantize/__init__.py (1 hunks)
- tilelang/quantize/mxfp.py (1 hunks)
- tilelang/quantize/quantization.py (1 hunks)
🔇 Additional comments (7)
examples/gemm/example_gemm_autotune.py (1)
10-10: Import reordering looks fine

Moving torch import below NoRasterization is harmless and keeps startup cost the same. No issues.
tilelang/quantize/__init__.py (1)
17-17: Re-export LGTM

Re-exporting get_mxfp_intrin_group is appropriate and keeps the public API cohesive.
examples/dequantize_gemm/test_example_dequantize_gemm.py (1)
26-30: New BF16 FP4 Hopper serial test looks good

CUDA+SM90 guards are appropriate; calling main() mirrors existing tests. No issues.
examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper.py (3)
315-315: Performance improvement: Direct GEMM without unnecessary copy

The removal of the intermediate `B_dequantize_prev_local` buffer and direct GEMM operation on `B_dequantize_local` is a good optimization that reduces memory operations in the main kernel path.
325-341: Good refactoring to programmatic AutoTuner usage

The migration from decorator-based autotuning to programmatic AutoTuner instantiation provides better control and clarity over the tuning workflow. The explicit configuration of compile and profile arguments makes the tuning process more transparent.
390-390: Good addition of contextual comment

The comment referencing the Triton benchmark scale provides helpful context for understanding the expected workload dimensions.
examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper_serial.py (1)
180-180: Good practice: Commented alternative implementation

The commented line showing an alternative conversion function provides useful context for future maintenance.
```python
@tilelang.jit(
    out_idx=[-1],
    debug_root_path="/home/tzj/tilelang/examples/dequantize_gemm/",
```
Remove hardcoded user-specific debug path
The debug_root_path contains a hardcoded, user-specific path that will cause issues for other users.
```diff
 @tilelang.jit(
     out_idx=[-1],
-    debug_root_path="/home/tzj/tilelang/examples/dequantize_gemm/",
     pass_configs={
         tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
         tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True
     },
 )
```
```python
# profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
# print("All checks pass.")
# latency = profiler.do_bench(ref_program, warmup=500)
# print("Ref: {:.2f} ms".format(latency))
# print("Ref: {:.2f} TFlops".format(total_flops / latency * 1e-9))
```
Enable correctness check before merging
The correctness check and reference program benchmark are commented out. These should be re-enabled to ensure the kernel produces correct results.
```diff
-# profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
-# print("All checks pass.")
-# latency = profiler.do_bench(ref_program, warmup=500)
-# print("Ref: {:.2f} ms".format(latency))
-# print("Ref: {:.2f} TFlops".format(total_flops / latency * 1e-9))
+profiler.assert_allclose(ref_program, rtol=0.01, atol=0.01)
+print("All checks pass.")
+latency = profiler.do_bench(ref_program, warmup=500)
+print("Ref: {:.2f} ms".format(latency))
+print("Ref: {:.2f} TFlops".format(total_flops / latency * 1e-9))
```
```python
dtype_map = {"float16": "f16", "bfloat16": "bf16"}
key = f"fp{source_bit}_to_{dtype_map[out_dtype]}"
if use_twiddling:
    key += "_twiddling"

import_c_map = {
    "fp4_to_bf16_twiddling": decode_f4_to_bf16_twiddling,
}

func_name = f"decode_fp{source_bit}_to_{dtype_map[out_dtype]}"
if use_twiddling:
    func_name += "_twiddling"

return {
    "func_name": func_name,
    "c_source": import_c_map[key],
}
```
🛠️ Refactor suggestion
Guard unsupported key paths and validate source_bit
As written, use_twiddling=False (the default) produces key='fp4_to_bf16' which isn’t present in import_c_map, leading to KeyError. Also, unsupported source_bit values aren’t validated. Add checks to fail fast with a clear message.
Apply this diff:
```diff
-    dtype_map = {"float16": "f16", "bfloat16": "bf16"}
-    key = f"fp{source_bit}_to_{dtype_map[out_dtype]}"
+    dtype_map = {"float16": "f16", "bfloat16": "bf16"}
+    assert source_bit == 4, f"Unsupported source_bit={source_bit}; only 4 is supported currently."
+    key = f"fp{source_bit}_to_{dtype_map[out_dtype]}"
     if use_twiddling:
         key += "_twiddling"
     import_c_map = {
         "fp4_to_bf16_twiddling": decode_f4_to_bf16_twiddling,
     }
     func_name = f"decode_fp{source_bit}_to_{dtype_map[out_dtype]}"
     if use_twiddling:
         func_name += "_twiddling"
-    return {
-        "func_name": func_name,
-        "c_source": import_c_map[key],
-    }
+    if key not in import_c_map:
+        raise NotImplementedError(
+            f"MXFP intrinsic '{key}' is not implemented. "
+            f"Try use_twiddling=True or extend import_c_map."
+        )
+    return {"func_name": func_name, "c_source": import_c_map[key]}
```
Actionable comments posted: 1
♻️ Duplicate comments (6)
examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py (6)
23-26: Remove unused print_bit helper

print_bit isn't used; drop it to keep the reference path lean.

```diff
-def print_bit(name, val):
-    val_cpu = val.cpu().item()
-    binary_repr = f'{val_cpu:032b}'
-    print(name, binary_repr)
```
31-32: Redundant dtype views

val0 and val1 are already torch.uint8 due to the asserts. The .view(torch.uint8) calls are no-ops.

```diff
-val0 = val0.view(torch.uint8)
-val1 = val1.view(torch.uint8)
```
93-98: Fix incorrect comment: local_compress_size ≠ vectorize_dequant_size

With BF16 (16 bits) and FP4 (2 elems/byte), local_size=8 and local_compress_size=4 while vectorize_dequant_size=8. The comment is misleading; reword it.

```diff
-# local_compress_size is the same as vectorize_dequant_size
+# local_compress_size is the per-thread count of compressed elements (bytes),
+# used to load the packed values before dequantization.
```
99-114: Tidy up MXFP intrin setup and improve assert messages

- `import_source = import_source` is a no-op; remove it.
- Make the assertion messages specific to aid debugging.

```diff
-assert import_source is not None, "mxfp_intrin_info is not found"
-assert func_name is not None, "mxfp_intrin_info is not found"
-import_source = import_source
+assert import_source is not None, "MXFP intrinsic C source not found"
+assert func_name is not None, "MXFP intrinsic function name not found"
```
147-151: Good: Scale is now copied to shared before use

This addresses the earlier bug about uninitialized Scale_shared. The shared scale load matches the B and A loads per K-tile.
181-189: Vectorize the scaling step and fuse with the store loop

Using T.Parallel here is suboptimal and may generate less efficient code than a vectorized loop. You can vectorize the multiply and store in a single pass.

```diff
-for v in T.Parallel(vectorize_dequant_size):
-    B_dequantize_local_thread[v] *= Scale_local_thread_exponent[0]
-for v in T.vectorized(0, vectorize_dequant_size):
-    index = i * threads * vectorize_dequant_size + tx * vectorize_dequant_size + v
-    vi = index // block_K
-    vj = index % block_K
-    # 127 = 2^7 - 1 is the exponent bias for bfloat16
-    B_dequantize_shared[vi, vj] = B_dequantize_local_thread[v]
+for v in T.vectorized(0, vectorize_dequant_size):
+    B_dequantize_local_thread[v] *= Scale_local_thread_exponent[0]
+    index = i * threads * vectorize_dequant_size + tx * vectorize_dequant_size + v
+    vi = index // block_K
+    vj = index % block_K
+    B_dequantize_shared[vi, vj] = B_dequantize_local_thread[v]
```
🧹 Nitpick comments (4)
examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py (4)
3-3: Avoid star import; it isn't used and trips linters

from tilelang.autotuner import * is unused here and violates F403; drop it to reduce namespace pollution.

```diff
-from tilelang.autotuner import *
```
12-19: Remove commented debug postproc hook and hardcoded paths

This dead/debug code shouldn't ship in examples; it also references a repo-local file path. Drop it or guard via an env flag if needed.

```diff
-# @register_cuda_postproc
-# def tilelang_callback_cuda_postproc(code, _):
-#     cuda_code = ""
-#     # with open("examples/dequantize_gemm/tilelang_jit_kernel_kernel_func_backup.c", "r") as f:
-#     with open("examples/dequantize_gemm/tilelang_jit_kernel_kernel_func_test.c", "r") as f:
-#         cuda_code = f.read()
-#     return cuda_code
```
47-50: Clarify the 2**126 scaling rationale in dequantization

Multiplying the bfloat16 value by 2**126 is non-obvious. Add a short docstring or inline comment explaining the FP4→BF16 exponent bias mapping and why this factor is applied (e.g., relocating the 3-bit exponent into BF16's biased exponent domain).

Would you confirm this matches the mapping used by decode_fp4_to_bf16_twiddling, so the CPU ref stays consistent with the extern path?
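For context, a hedged sketch of the bias arithmetic implied here (assuming FP4 is E2M1 with exponent bias 1 and BF16 uses the standard bias 127; the exact mapping in decode_fp4_to_bf16_twiddling should still be confirmed against the kernel):

```python
# Illustrative only: why a 2**126 factor can appear in FP4 -> BF16 dequantization.
# If the FP4 exponent field (bias 1) is copied into BF16's exponent field
# (bias 127) without rebiasing, the result is too small by 2**(127 - 1),
# so multiplying by 2**126 restores the intended magnitude.
fp4_bias, bf16_bias = 1, 127
correction = 2.0 ** (bf16_bias - fp4_bias)
assert correction == 2.0 ** 126
```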
168-169: Make integer cast explicit before subtracting 127

Subtracting 127 from a uint8 can be backend-dependent if promotion rules change; cast first to an integer type to avoid underflow ambiguity.

```diff
-Scale_local_thread_exponent[0] = T.exp2(
-    T.cast(Scale_local_thread[0] - 127, "float"))
+Scale_local_thread_exponent[0] = T.exp2(
+    T.cast(T.cast(Scale_local_thread[0], "int32") - 127, "float"))
```
📒 Files selected for processing (2)
- examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py (1 hunks)
- examples/gemm/example_gemm_autotune.py (2 hunks)
🔇 Additional comments (3)
examples/gemm/example_gemm_autotune.py (2)
10-10: Import reorder is fine

Moving torch import here is harmless and keeps device checks local. No functional change.
19-19: HIP/CUDA arch selection expression looks correct

Using torch.version.hip to pick between CUDA and CDNA at runtime retains prior behavior with cleaner syntax.
examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py (1)
221-229: Nice: Correctness check is enabled before benchmarking

assert_allclose is active with reasonable tolerances prior to measuring kernel latency. Good guardrail for the example.
```python
T.call_extern(
    func_name,
    T.address_of(B_local_thread[0]),
    T.address_of(B_dequantize_local_thread[0]),
    1,
    dtype=in_dtype,
)
```
Bug: Incorrect argument order in T.call_extern leads to TypeError
call_extern(dtype, func_name, *args) expects dtype first. Passing func_name first and then dtype=... provides two values for ‘dtype’ and will error. Reorder to pass dtype first.
```diff
-T.call_extern(
-    func_name,
-    T.address_of(B_local_thread[0]),
-    T.address_of(B_dequantize_local_thread[0]),
-    1,
-    dtype=in_dtype,
-)
+T.call_extern(
+    in_dtype,
+    func_name,
+    T.address_of(B_local_thread[0]),
+    T.address_of(B_dequantize_local_thread[0]),
+    1,
+)
```
Actionable comments posted: 0
♻️ Duplicate comments (6)
examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py (6)
23-26: Remove unused helper print_bit

The function is unused and adds noise to the example.

```diff
-def print_bit(name, val):
-    val_cpu = val.cpu().item()
-    binary_repr = f'{val_cpu:032b}'
-    print(name, binary_repr)
```
31-33: Redundant dtype views

val0 and val1 are already torch.uint8 due to the asserts; the extra .view(torch.uint8) calls are no-ops.

```diff
-val0 = val0.view(torch.uint8)
-val1 = val1.view(torch.uint8)
```
91-92: Fix incorrect comment: local_compress_size != vectorize_dequant_size

For bfloat16 with 4-bit storage, local_size=8 and num_elems_per_byte=2, hence local_compress_size=4 while vectorize_dequant_size=8.

```diff
-# local_compress_size is the same as vectorize_dequant_size
+# local_compress_size is the number of compressed elements handled per thread-vec step
```
104-108: Remove redundant self-assignment import_source = import_source

No-op; safe to delete.

```diff
 import_source = mxfp_intrin_info["c_source"]
 func_name = mxfp_intrin_info["func_name"]
 assert import_source is not None, "mxfp_intrin_info is not found"
 assert func_name is not None, "mxfp_intrin_info is not found"
-import_source = import_source
```
175-177: Vectorize the scaling loop for consistency with the following store loop

Using T.vectorized here typically yields better codegen and matches the next loop.

```diff
-for v in T.Parallel(vectorize_dequant_size):
+for v in T.vectorized(0, vectorize_dequant_size):
     B_dequantize_local_thread[v] *= Scale_local_thread_exponent[0]
```
165-172: Bug: Incorrect T.call_extern argument order (dtype must be first)

Signature is call_extern(dtype, func_name, *args). Passing func_name first and dtype=... last will error.

```diff
-T.call_extern(
-    func_name,
-    T.address_of(B_local_thread[0]),
-    T.address_of(B_dequantize_local_thread[0]),
-    1,
-    dtype=in_dtype,
-)
+T.call_extern(
+    in_dtype,
+    func_name,
+    T.address_of(B_local_thread[0]),
+    T.address_of(B_dequantize_local_thread[0]),
+    1,
+)
```

Optional: if the extern has no return, consider dtype="void" to better reflect intent.
🧹 Nitpick comments (8)
examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py (8)
3-3: Remove unused star import to avoid namespace pollution (Ruff F403)

from tilelang.autotuner import * isn't used in this example. Drop it to keep the example minimal and quiet static analysis.

```diff
-from tilelang.autotuner import *
```
49-50: Double-check exponent handling in CPU reference conversion

bf16_new = bf16_new.item() * (2**126) looks suspicious and can overflow easily. If the intention is to adjust exponent bias between FP4 and BF16, consider switching to the official utility used in this PR to avoid drift:

- Prefer using an existing decoder (e.g., tilelang.quantize.mxfp.decode_f4_to_bf16_twiddling if exposed) to keep the reference aligned with the kernel's extern.

Would you like me to refactor torch_convert_bit_twiddling to use the public decode helper if available?
119-127: Remove unused shared buffer Scale_shared or actually use it

Currently Scale_shared is allocated (Line 124) but never written nor read, tripping static analysis (F841) and wasting shared memory. Since you read Scale[...] directly, the simplest fix is to remove Scale_shared. Alternatively, copy Scale into shared inside the pipeline and read from it for better locality.

Option A (remove allocation):

```diff
-Scale_shared = T.alloc_shared(Scale_shared_shape, storage_dtype)
```

Option B (use shared scale; requires adding a copy and reading from shared):

- Add after Line 145:

```diff
+T.copy(Scale[bx * block_N, k * block_K // scale_size], Scale_shared)
```

- Replace Line 161:

```diff
-Scale_local_thread[0] = Scale[bx * block_N + si, k * block_K // scale_size + sj]
+Scale_local_thread[0] = Scale_shared[si, sj]
```

Choose one approach; I'd lean toward A for minimal change unless you're chasing perf.
threads * vectorize_dequant_sizeand byscale_size. Add asserts to prevent partial tiles and mismatched scale groups.assert K % (block_K * split) == 0 + assert (block_N * block_K) % (threads * vectorize_dequant_size) == 0, \ + "block_N*block_K must be divisible by threads*vectorize_dequant_size" + assert K % scale_size == 0, "K must be divisible by scale_size"
131-136: Consider enabling swizzled layout forB_dequantize_shared.You commented this out. Re-enabling may reduce bank conflicts and improve GEMM reads, depending on your layout strategy.
- # B_dequantize_shared: tilelang.layout.make_swizzled_layout(B_dequantize_shared), + B_dequantize_shared: tilelang.layout.make_swizzled_layout(B_dequantize_shared),
143-146: Optional: Prefetch Scale into shared to reduce global reads.If the dequant path is bandwidth-bound, consider copying
Scaleto shared alongsideAandBfor each k-iteration and reading scales from shared (see earlier comment).
12-19: Hopper proxy fence: wire in a postproc or IR pragma to enforce strict fence after dequant.Given the PR note about “strict proxy_fence after dequantization,” you currently have the postproc hook commented out. If correctness depends on the SM90 proxy fence after the extern dequant call, either:
- Inject an sm_90 fence via
register_cuda_postprocright after the generated extern call, or- Provide a dedicated TileLang pragma that lowers to
fence.proxy(if available in your stack).I can help produce a minimal postproc to insert
asm volatile("fence.proxy.global::cta;");after the extern call site.
198-206: Speed up the reference program with vectorized/broadcasted ops

The nested Python loops can be slow for bigger shapes. You can broadcast scales over the K-dimension and avoid Python loops.

Example change outside this hunk (conceptual):

- Expand Scale to per-element exponents by repeating each scale value scale_size times along K and then apply torch.exp2((Scale - 127).float()) to get the multiplier.
- Then B = torch_convert_bit_twiddling(qB); B *= expanded_multiplier.

I can draft an exact vectorized version if helpful; see the sketch below.
📒 Files selected for processing (1)
- examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py (1 hunks)
🔇 Additional comments (1)
examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py (1)
216-216: Good: Correctness check enabled

Keeping profiler.assert_allclose in the example is great. It will catch regressions in the extern dequant or scaling.
Actionable comments posted: 0
♻️ Duplicate comments (8)
examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py (8)
23-26: Remove unused debug function

The print_bit function is defined but never used in the code.

```diff
-def print_bit(name, val):
-    val_cpu = val.cpu().item()
-    binary_repr = f'{val_cpu:032b}'
-    print(name, binary_repr)
-
```
31-32: Remove redundant view operations

The .view(torch.uint8) calls are unnecessary since the assertions on lines 29-30 already ensure the tensors are of type torch.uint8.

```diff
-val0 = val0.view(torch.uint8)
-val1 = val1.view(torch.uint8)
```
105-105: Remove redundant self-assignment

The assignment import_source = import_source is redundant and serves no purpose.

```diff
-import_source = import_source
```
118-118: Remove unused variable allocation

The variable B_local_thread is allocated but never used in the code. Since this variable appears to be replaced by the actual usage pattern in the code, it should be removed:

```diff
-B_local_thread = T.alloc_local((local_compress_size,), storage_dtype)
```
- B_local_thread = T.alloc_local((local_compress_size,), storage_dtype)
172-173: Consider using T.vectorized for SIMD operationsUsing
T.Parallelfor element-wise scaling operations may not be optimal. Since the subsequent loop usesT.vectorized, this loop should also use vectorization for better SIMD performance.- for v in T.Parallel(vectorize_dequant_size): + for v in T.vectorized(0, vectorize_dequant_size): B_dequantize_local_thread[v] *= Scale_local_thread_exponent[0]
88-89: Fix misleading comment about variable sizesThe comment states that
local_compress_sizeequalsvectorize_dequant_size, butlocal_compress_sizeis 4 (line 87) whilevectorize_dequant_sizeis 8 (line 89).- # local_compress_size is the same as vectorize_dequant_size + # local_compress_size is used for allocating compressed data buffer
163-169: Fix incorrect T.call_extern argument orderThe
T.call_externfunction expectsdtypeas the first positional argument, but the current code passesfunc_namefirst and thendtypeas a keyword argument, which will cause a TypeError.T.call_extern( - func_name, + in_dtype, + func_name, T.address_of(B_local_thread[0]), T.address_of(B_dequantize_local_thread[0]), 1, - dtype=in_dtype, )
215-217: Re-enable reference benchmark for performance comparisonThe reference program benchmark is commented out, which prevents performance comparison. This should be re-enabled to validate the speedup achieved by the optimized kernel.
- # latency = profiler.do_bench(ref_program, warmup=500) - # print("Ref: {:.2f} ms".format(latency)) - # print("Ref: {:.2f} TFlops".format(total_flops / latency * 1e-9)) + latency = profiler.do_bench(ref_program, warmup=500) + print("Ref: {:.2f} ms".format(latency)) + print("Ref: {:.2f} TFlops".format(total_flops / latency * 1e-9))
🧹 Nitpick comments (4)
examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py (4)
3-3: Avoid wildcard imports for better code clarity

Wildcard imports make it difficult to track which symbols are being used and can lead to namespace pollution. Replace with explicit imports:

```diff
-from tilelang.autotuner import *
+from tilelang.autotuner import AutoTuner  # or whatever specific symbols you need
```
12-19: Remove commented-out debug code

This commented-out debug callback registration appears to be temporary development code that should be removed before merging.

```diff
-# @register_cuda_postproc
-# def tilelang_callback_cuda_postproc(code, _):
-#     cuda_code = ""
-#     # with open("examples/dequantize_gemm/tilelang_jit_kernel_kernel_func_backup.c", "r") as f:
-#     with open("examples/dequantize_gemm/tilelang_jit_kernel_kernel_func_test.c", "r") as f:
-#         cuda_code = f.read()
-#     return cuda_code
-
```
**224-225**: Uncomment larger problem size for realistic benchmarking

The larger problem size (16384, 8192, 8192) is commented out. For meaningful performance evaluation, consider using the larger size or making it configurable via command-line arguments.

```diff
-M, N, K = 256, 256, 256
-# M, N, K = 16384, 8192, 8192
+import sys
+if len(sys.argv) > 1 and sys.argv[1] == "--large":
+    M, N, K = 16384, 8192, 8192
+else:
+    M, N, K = 256, 256, 256
```
**141-141**: Load scale data into shared memory in the serial kernel

The serial example (`examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py`) currently reads `Scale` directly from global memory inside the inner loop (around lines 157-158), which can hurt performance. To improve memory coalescing:

- Allocate a shared buffer for scales (shape `(block_N, block_K // scale_size)`) alongside your existing `B_shared` allocation; a minimal allocation sketch follows this comment.
- Inside the pipelined loop, copy from global `Scale[…]` into `Scale_shared`, then from `Scale_shared` into your per-thread `Scale_local` fragment.

For comparison, see the non-serial version (`example_dequant_gemm_mxfp4_hopper.py`) at lines 124-131:

```diff
 T.copy(B[bx * block_N, k * block_K // num_elems_per_byte], B_shared)
 T.copy(B_shared, B_local)
+T.copy(Scale[bx * block_N, k * block_K // scale_size], Scale_shared)
+T.copy(Scale_shared, Scale_local)
 for i, j in T.Parallel(block_N, block_K):
     B_dequantize_local[i, j] = _tir_u8_to_f4_to_bf16(…)
```

Adding these copies in the serial example will reduce repeated global-memory reads of scale values.
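A minimal allocation sketch for the two scale buffers named above (shapes follow the comment; `storage_dtype` is assumed to be the example's uint8 storage type):

```python
# Shared staging buffer plus a per-thread fragment for the scales.
Scale_shared = T.alloc_shared((block_N, block_K // scale_size), storage_dtype)
Scale_local = T.alloc_fragment((block_N, block_K // scale_size), storage_dtype)
```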
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- `examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py` (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py (9)

- tilelang/engine/callback.py (1): `register_cuda_postproc` (6-14)
- tilelang/jit/__init__.py (1): `jit` (232-305)
- tilelang/jit/kernel.py (2): `out_idx` (435-436), `get_profiler` (349-365)
- tilelang/quantize/mxfp.py (1): `get_mxfp_intrin_group` (51-94)
- tilelang/language/__init__.py (1): `import_source` (181-183)
- tilelang/language/proxy.py (1): `Tensor` (210-211)
- tilelang/language/tir/op.py (1): `call_extern` (172-194)
- tilelang/utils/tensor.py (1): `TensorSupplyType` (11-18)
- tilelang/profiler/__init__.py (1): `assert_allclose` (76-137)
🪛 Ruff (0.12.2)
examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper_serial.py
3-3: from tilelang.autotuner import * used; unable to detect undefined names
(F403)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-nvidia
- GitHub Check: bot-task
Actionable comments posted: 1
♻️ Duplicate comments (5)
examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper_serial.py (5)
**14-18**: Unused helper function `print_bit` should be removed.

This function is defined but never used in the code.
**22-24**: Remove redundant `.view(torch.uint8)` calls.

The `.view(torch.uint8)` calls are redundant since the preceding assert statements already ensure these are `torch.uint8`.
**56-56**: Remove hardcoded user-specific debug path.

The `debug_root_path` contains a hardcoded, user-specific path that will cause issues for other users.
**92-92**: Remove redundant assignment.

The assignment `import_source = import_source` is redundant and has no effect.
**185-189**: Enable correctness check before merging.

The correctness check and reference program benchmark are commented out. These should be re-enabled to ensure the kernel produces correct results.
🧹 Nitpick comments (4)
examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper_serial.py (4)
**105-107**: Rename local thread buffers for clarity

The variables `B_local_thread` and `B_dequantize_local_thread` are in fact used in the dequantization loop (lines 137, 143, 153), so they should not be removed; however, the names could be clearer.

```diff
-B_local_thread = T.alloc_local((local_compress_size,), storage_dtype)
-B_dequantize_local_thread = T.alloc_local((local_size,), in_dtype)
+B_compressed_local = T.alloc_local((local_compress_size,), storage_dtype)
+B_dequantized_local = T.alloc_local((local_size,), in_dtype)
```

And update the corresponding usage:

```diff
-B_local_thread[v] = B_shared[vi, vj]
+B_compressed_local[v] = B_shared[vi, vj]
 T.call_extern(
     func_name,
-    T.address_of(B_local_thread[0]),
-    T.address_of(B_dequantize_local_thread[0]),
+    T.address_of(B_compressed_local[0]),
+    T.address_of(B_dequantized_local[0]),
     1,
     dtype=in_dtype,
 )
-B_dequantize_shared[vi, vj] = B_dequantize_local_thread[v]
+B_dequantize_shared[vi, vj] = B_dequantized_local[v]
```
**3-3**: Replace wildcard import with explicit imports.

The wildcard import `from tilelang.autotuner import *` makes it unclear which symbols are being imported and can lead to namespace pollution.

```diff
-from tilelang.autotuner import *
+from tilelang.autotuner import AutoTuner  # Add other specific imports as needed
```
**78-91**: Add error handling for missing MXFP intrinsics.

While assertions check for None values, consider providing more descriptive error messages to help with debugging when intrinsics are unavailable.

```diff
-assert import_source is not None, "mxfp_intrin_info is not found"
-assert func_name is not None, "mxfp_intrin_info is not found"
+if import_source is None:
+    raise RuntimeError(f"MXFP intrinsic source not found for {source_format} {num_bits}-bit to {in_dtype}")
+if func_name is None:
+    raise RuntimeError(f"MXFP intrinsic function name not found for {source_format} {num_bits}-bit to {in_dtype}")
```
**70-70**: Add a more descriptive error message for the divisibility constraint.

The assertion should provide clearer guidance on valid `block_K` and `split` values.

```diff
-assert K % (block_K * split) == 0
+assert K % (block_K * split) == 0, f"K ({K}) must be divisible by (block_K * split) ({block_K * split})"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper_serial.py` (1 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper_serial.py
3-3: from tilelang.autotuner import * used; unable to detect undefined names
(F403)
```python
for j in range(new_tensor.shape[1]):
    new_tensor[i][j] = _convert(tensor[i][j // 4 * 2], tensor[i][j // 4 * 2 + 1], j % 4)
```
Fix potential index out of bounds error in tensor conversion.

The indexing logic `tensor[i][j // 4 * 2]` and `tensor[i][j // 4 * 2 + 1]` assumes specific tensor dimensions but could cause out-of-bounds access. The logic should ensure `j // 4 * 2 + 1 < tensor.shape[1]`.
```diff
 for i in range(new_tensor.shape[0]):
     for j in range(new_tensor.shape[1]):
-        new_tensor[i][j] = _convert(tensor[i][j // 4 * 2], tensor[i][j // 4 * 2 + 1], j % 4)
+        byte_idx = (j // 4) * 2
+        if byte_idx + 1 < tensor.shape[1]:
+            new_tensor[i][j] = _convert(tensor[i][byte_idx], tensor[i][byte_idx + 1], j % 4)
+        else:
+            # Handle edge case for incomplete bytes
+            new_tensor[i][j] = _convert(tensor[i][byte_idx], torch.zeros(1, dtype=torch.uint8, device=tensor.device)[0], j % 4)
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for j in range(new_tensor.shape[1]): | |
| new_tensor[i][j] = _convert(tensor[i][j // 4 * 2], tensor[i][j // 4 * 2 + 1], j % 4) | |
| for i in range(new_tensor.shape[0]): | |
| for j in range(new_tensor.shape[1]): | |
| byte_idx = (j // 4) * 2 | |
| if byte_idx + 1 < tensor.shape[1]: | |
| new_tensor[i][j] = _convert( | |
| tensor[i][byte_idx], | |
| tensor[i][byte_idx + 1], | |
| j % 4 | |
| ) | |
| else: | |
| # Handle edge case for incomplete bytes | |
| new_tensor[i][j] = _convert( | |
| tensor[i][byte_idx], | |
| torch.zeros(1, dtype=torch.uint8, device=tensor.device)[0], | |
| j % 4 | |
| ) |
🤖 Prompt for AI Agents
In examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper_serial.py
around lines 47-48, the current index computation new_tensor[i][j] =
_convert(tensor[i][j // 4 * 2], tensor[i][j // 4 * 2 + 1], j % 4) can go out of
bounds; compute src_idx = (j // 4) * 2 and guard it against tensor.shape[1]
(e.g., if src_idx >= tensor.shape[1] set src_idx = tensor.shape[1] - 1, and if
src_idx + 1 >= tensor.shape[1] set second_idx = tensor.shape[1] - 1 else
second_idx = src_idx + 1), then call _convert with tensor[i][src_idx],
tensor[i][second_idx], j % 4 so you never access beyond the row length.
…equantize_gemm directory.
…ency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding.
Actionable comments posted: 4
♻️ Duplicate comments (2)
tilelang/quantize/mxfp.py (2)
**72-87**: Guard against KeyError for unsupported configurations

The code will raise a KeyError if `use_twiddling=False`, since only `"fp4_to_bf16_twiddling"` exists in `import_c_map`. Add validation to provide a clear error message.

```diff
 dtype_map = {"float16": "f16", "bfloat16": "bf16"}
+assert source_bit == 4, f"Unsupported source_bit={source_bit}; only 4 is supported currently."
 key = f"fp{source_bit}_to_{dtype_map[out_dtype]}"
 if use_twiddling:
     key += "_twiddling"
 import_c_map = {
     "fp4_to_bf16_twiddling": decode_f4_to_bf16_twiddling,
 }
 func_name = f"decode_fp{source_bit}_to_{dtype_map[out_dtype]}"
 if use_twiddling:
     func_name += "_twiddling"
+if key not in import_c_map:
+    raise NotImplementedError(
+        f"MXFP intrinsic '{key}' is not implemented. "
+        f"Available keys: {list(import_c_map.keys())}. "
+        f"Try use_twiddling=True or implement the missing intrinsic."
+    )
 return {
     "func_name": func_name,
     "c_source": import_c_map[key],
 }
```
**58-62**: Update misleading docstring

The docstring describes MXFP as a "three-input logic operation", which doesn't accurately describe FP4 dequantization functionality.

```diff
 """
-This function is used to get the intrinsic group of the MXFP operation to avoid the overhead of fast decoding.
-MXFP is a type of logic operation that takes three inputs. The intrinsic group refers to the set of
-intrinsic operations that can be performed on these inputs. This function retrieves and returns this group.
+This function returns an intrinsic group for MXFP dequantization.
+
+It provides the C source code and function name for a fast decoding
+intrinsic that converts from FP4 to bfloat16 format using bit-twiddling
+operations when enabled.
 """
```
🧹 Nitpick comments (6)
examples/dequantize_gemm/utils.py (3)
**27-28**: Potential precision loss in bf16 multiplication

The operation `bf16_new.item() * (2**126)` converts to Python float and back, which could introduce precision issues. Consider performing the scaling in the tensor domain.

```diff
-# Add bias for change from fp4 to bf16
-bf16_new = bf16_new.item() * (2**126)
-return bf16_new
+# Add bias for change from fp4 to bf16
+bf16_new = bf16_new * torch.tensor(2**126, dtype=torch.bfloat16, device=val0.device)
+return bf16_new.item()
```
**33-36**: Improve performance with vectorized operations

The nested loops are inefficient for large tensors. Consider vectorizing the conversion logic, or at least document why element-wise processing is necessary. The current implementation uses nested Python loops, which can be slow for large tensors. Would you like me to help create a more efficient vectorized version using PyTorch operations, or is element-wise processing required for validation purposes? A lookup-table sketch follows.
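If vectorization is pursued, one safe route is a 16-entry lookup table per FP4 code. A sketch under the assumption that `qB` packs two FP4 codes per byte (low nibble first) and that `lut` is a hypothetical precomputed bfloat16 decode table:

```python
import torch

def torch_convert_lut(qB: torch.Tensor, lut: torch.Tensor) -> torch.Tensor:
    # qB: (N, K // 2) uint8; lut: (16,) bfloat16, one decoded value per FP4 code.
    lo = (qB & 0xF).long()          # low nibble -> first element of each pair
    hi = ((qB >> 4) & 0xF).long()   # high nibble -> second element
    codes = torch.stack((lo, hi), dim=-1).reshape(qB.shape[0], -1)  # (N, K)
    return lut[codes]               # a single gather replaces the Python loops
```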
**50-51**: Use torch.clamp instead of Python min for tensor operations

Using Python's `min` function might not be the most efficient for tensor operations. Consider using `torch.clamp` for better performance.

```diff
-e_f16 = min(e_f16 + scale, (1 << 8) - 1)
+e_f16 = torch.clamp(e_f16 + scale, max=(1 << 8) - 1)
```

examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py (2)
**75-75**: Remove redundant assignment

Line 75 assigns `import_source` to itself, which is unnecessary.

```diff
-import_source = import_source
```

**90-90**: Remove duplicate redundant assignment

Line 90 assigns `import_source` to itself again, which is unnecessary.

```diff
-import_source = import_source
```

examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py (1)
**75-75**: Remove redundant assignment

Line 75 assigns `import_source` to itself, which is unnecessary.

```diff
-import_source = import_source
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py` (1 hunks)
- `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py` (1 hunks)
- `examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper.py` (0 hunks)
- `examples/dequantize_gemm/utils.py` (1 hunks)
- `maint/scripts/run_local_ci_test.sh` (1 hunks)
- `tilelang/quantize/mxfp.py` (1 hunks)
💤 Files with no reviewable changes (1)
- examples/dequantize_gemm/example_dequant_gemm_mxfp4_hopper.py
🧰 Additional context used
🧬 Code Graph Analysis (3)
examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py (4)

- examples/dequantize_gemm/utils.py (2): `torch_convert_bit_twiddling` (4-36), `torch_convert` (39-66)
- tilelang/quantize/mxfp.py (1): `get_mxfp_intrin_group` (51-87)
- tilelang/language/__init__.py (1): `import_source` (181-183)
- tilelang/profiler/__init__.py (1): `assert_allclose` (76-137)

examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py (9)

- examples/dequantize_gemm/utils.py (2): `torch_convert_bit_twiddling` (4-36), `torch_convert` (39-66)
- tilelang/language/tir/op.py (1): `call_extern` (172-194)
- tilelang/autotuner/tuner.py (1): `autotune` (697-790)
- tilelang/jit/__init__.py (1): `jit` (232-305)
- tilelang/quantize/mxfp.py (1): `get_mxfp_intrin_group` (51-87)
- tilelang/language/__init__.py (1): `import_source` (181-183)
- tilelang/language/proxy.py (1): `Tensor` (210-211)
- tilelang/utils/tensor.py (1): `TensorSupplyType` (11-18)
- tilelang/profiler/__init__.py (1): `assert_allclose` (76-137)

tilelang/quantize/mxfp.py (1)

- tilelang/language/ast/ir.py (1): `func_name` (206-214)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: build-test-nvidia
- GitHub Check: bot-task
🔇 Additional comments (2)
maint/scripts/run_local_ci_test.sh (1)
**1-1**: Executable permissions confirmed

The script `maint/scripts/run_local_ci_test.sh` already has the correct executable permissions (`-rwxr-xr-x`). No further changes are needed.

examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py (1)
**153-153**: Simple dequant path intentionally bypasses scaling

The `simple_dequant_bf16_fp4` macro is a minimal example that hard-codes the scale to 0 (per the `# No scale for test` comment). There are no `Scale` or `scale_size` definitions in this file, and the full mxfp4 example already demonstrates how to apply the tensor's scale. The suggested change to pull in `Scale[i, j // scale_size]` isn't applicable here. Likely an incorrect or invalid review comment.
```python
A: T.Tensor(A_shape, in_dtype),
B: T.Tensor(B_shape, storage_dtype),
C: T.Tensor((M, N), out_dtype),
```
Missing Scale tensor parameter

The `main` function signature doesn't include a `Scale` tensor parameter, but the mxfp4 variant has it; this inconsistency between the two examples suggests missing functionality. Add it to support proper FP4 dequantization with scaling:
```diff
 def main(
         A: T.Tensor(A_shape, in_dtype),
         B: T.Tensor(B_shape, storage_dtype),
+        Scale: T.Tensor(Scale_shape, storage_dtype),
         C: T.Tensor((M, N), out_dtype),
 ):
```

`Scale_shape` also needs to be defined earlier:

```diff
 B_shape = (N, QK)
+Scale_shape = (N, K // 32)  # Assuming scale_size=32 as default
 A_shared_shape = (block_M, block_K)
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| A: T.Tensor(A_shape, in_dtype), | |
| B: T.Tensor(B_shape, storage_dtype), | |
| C: T.Tensor((M, N), out_dtype), | |
| B_shape = (N, QK) | |
| Scale_shape = (N, K // 32) # Assuming scale_size=32 as default | |
| A_shared_shape = (block_M, block_K) | |
| def main( | |
| A: T.Tensor(A_shape, in_dtype), | |
| B: T.Tensor(B_shape, storage_dtype), | |
| Scale: T.Tensor(Scale_shape, storage_dtype), | |
| C: T.Tensor((M, N), out_dtype), | |
| ): |
🤖 Prompt for AI Agents
In examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py around lines
162-164, the function signature is missing the Scale tensor parameter present in
the mxfp4 variant; add a Scale entry to the signature (e.g., Scale:
T.Tensor(Scale_shape, storage_dtype or appropriate scale dtype) alongside A, B,
C) and ensure you define Scale_shape earlier in the file (matching the expected
scale dimensions for the FP4 dequantization) so both variants are consistent.
```python
if fast_dequant:
    profiler.assert_allclose(ref_program_twiddling, rtol=0.01, atol=0.01)
else:
    profiler.assert_allclose(ref_program_simple, rtol=0.01, atol=0.01)
```
💡 Verification agent
🧩 Analysis chain
Inconsistent reference implementations between files

This file uses scaling with `(2**(Scale[i][j // 32] - 127))` in both reference implementations, while `example_dequant_gemm_bf16_fp4_hopper.py` doesn't use scaling at all. This inconsistency could lead to validation failures.
🏁 Script executed:

```bash
#!/bin/bash
# Compare the reference implementations between the two files
echo "=== Reference implementations in bf16_mxfp4_hopper.py ==="
rg -A5 "def ref_program" examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py
echo -e "\n=== Reference implementations in bf16_fp4_hopper.py ==="
rg -A5 "def ref_program" examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py
```

Length of output: 1315
Ensure consistent reference implementations across dequantize examples

The two example scripts use different reference-program logic:

- In `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py`, both `ref_program_twiddling` and `ref_program_simple` accept a `Scale` argument and apply `B[i][j] *= 2**(Scale[i][j//32] - 127)`.
- In `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py`, the corresponding functions drop the `Scale` parameter entirely and perform only a plain matmul (`C = torch.matmul(...)`).

This mismatch will yield different outputs and can break your validation. To resolve, pick one of the following:

- Add scaling to `ref_program_twiddling` and `ref_program_simple` in `example_dequant_gemm_bf16_fp4_hopper.py` (accept `Scale` and apply the same factor), or
- Remove scaling from the mxfp4 implementation so both examples compute the same reference.

Files needing fixes:

- examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py (`ref_program_twiddling`, `ref_program_simple`)
- (Optional) examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py if you decide to drop scaling there instead.
🤖 Prompt for AI Agents
In examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py (around the
ref_program_twiddling and ref_program_simple definitions), the reference
implementations currently omit the Scale-based dequantization and so diverge
from the mxfp4 example; update both ref_program_twiddling and ref_program_simple
to accept a Scale parameter and apply the same per-element scaling B[i][j] *=
2**(Scale[i][j//32] - 127) before/while performing the matmul, and ensure any
calls to these functions in the file pass the Scale tensor; alternatively, if
you prefer the opposite approach, remove the Scale handling from
examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py so both
examples compute the same reference—pick one consistent approach and apply it
across the two files.
```bash
cd $ROOT_DIR
```
🛠️ Refactor suggestion
Quote the variable to prevent word splitting
The `$ROOT_DIR` variable should be quoted to handle paths with spaces correctly.

```diff
-cd $ROOT_DIR
+cd "$ROOT_DIR"
```
In maint/scripts/run_local_ci_test.sh around lines 7 to 8, the cd command uses
an unquoted $ROOT_DIR which can break on paths with spaces; update the command
to quote the variable (use cd "$ROOT_DIR") so the shell treats the path as a
single argument and avoid word-splitting or globbing issues.
```bash
python -m pytest -n 4 .
cd ..
```
Add error handling for test failures
The script doesn't check the exit status of pytest commands. If tests fail, the script continues execution which could mask failures in CI.
```diff
 # Run pytest in parallel (4 workers) for all tests in the examples directory
 cd examples
 python -m pytest -n 4 .
+EXAMPLES_EXIT_CODE=$?
 cd ..
 # Run pytest in parallel (4 workers) for all tests in the testing/python directory
 cd testing/python
 python -m pytest -n 4 .
+TESTING_EXIT_CODE=$?
 cd ..
+
+# Exit with non-zero if any tests failed
+if [ $EXAMPLES_EXIT_CODE -ne 0 ] || [ $TESTING_EXIT_CODE -ne 0 ]; then
+    echo "Tests failed!"
+    exit 1
+fi
+
+echo "All tests passed!"
+exit 0
```

Also applies to: 19-20
🤖 Prompt for AI Agents
In maint/scripts/run_local_ci_test.sh around lines 14-15 (and similarly 19-20),
the script runs pytest but doesn't handle failures; update the script to detect
pytest's exit status and stop the script with a non-zero exit code when tests
fail — either enable strict error handling (e.g., set -e or set -euo pipefail at
the top) or run pytest, capture its exit code ($?), and if non-zero call exit
with that code so CI correctly fails on test failures.
…antization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tilelang/language/utils.py (1)
**50-70**: `linear_index`: improve the validation message for even-argument inputs

Right now, even-argument inputs (e.g., 2 args) fail with a generic stride-count error. Provide a clearer error for this frequent misuse. Apply this diff to fail earlier with a precise message:

```diff
 n = len(args)
 if n == 0:
     raise ValueError("At least one index is required")
 if n == 1:
     return args[0]
 # The first part is indices, the second part is strides (starting from the second dimension)
 # A simpler way: the number of strides = total number of arguments - number of indices
 # Actually, the args are designed as indices... + strides..., and the number of strides = number of indices - 1
+if n % 2 == 0:
+    raise ValueError(
+        "Invalid arguments: for multiple coordinates, pass (coords..., strides...) "
+        "with total args = 2 * num_coords - 1 (e.g., linear_index(i, j, stride_j))."
+    )
 num_coords = (n + 1) // 2
 coords = args[:num_coords]
 strides = args[num_coords:]
 if len(strides) != len(coords) - 1:
     raise ValueError("Stride count must be one less than coordinate count")
```

examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py (1)
**196-215**: Optional: consider using `linear_index` for symmetry and readability

You could also express the reverse mapping `(vi, vj) -> flat index` with `T.linear_index` once it's exported, for symmetry. Not necessary, but it can make index math more declarative when both directions are present in the same kernel.
If you prefer symmetry, an example pattern for the first loop (not a required change):

- Keep `T.index_to_coordinates` for scatter as is.
- For subsequent code that computes flat offsets from `(i, j)`, consider:

```python
linear = T.linear_index(i, j, block_K // num_elems_per_byte)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
- `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py` (2 hunks)
- `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py` (1 hunks)
- `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py` (1 hunks)
- `tilelang/intrinsics/utils.py` (0 hunks)
- `tilelang/language/__init__.py` (1 hunks)
- `tilelang/language/print.py` (1 hunks)
- `tilelang/language/utils.py` (1 hunks)
💤 Files with no reviewable changes (1)
- tilelang/intrinsics/utils.py
✅ Files skipped from review due to trivial changes (1)
- tilelang/language/print.py
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py
- examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py (1)

- tilelang/language/utils.py (1): `index_to_coordinates` (5-32)

tilelang/language/__init__.py (1)

- tilelang/language/utils.py (1): `index_to_coordinates` (5-32)
🪛 GitHub Actions: CI Test on AMD
examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py
[error] 1-1: TileLang wheel build failed. cmake --build . --config Release -j returned non-zero exit status 1.
tilelang/language/utils.py
[error] 1-1: TileLang wheel build failed. cmake --build . --config Release -j returned non-zero exit status 1.
tilelang/language/__init__.py
[error] 1-1: TileLang wheel build failed. cmake --build . --config Release -j returned non-zero exit status 1.
🔇 Additional comments (3)
tilelang/language/__init__.py (1)
**1-185**: General note

No other issues spotted here related to the new import. The rest of the file remains unaffected.
examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py (2)
**202-204**: Good migration to `T.index_to_coordinates`

Switching to `T.index_to_coordinates` keeps usage consistent with the new public API, and the function semantics fit this loop's flattening pattern.

**214-216**: Good migration to `T.index_to_coordinates` (second site)

Likewise correct here for the dequantized buffer writeback; a minimal sketch of the pattern follows.
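For reference, a sketch of the pattern at these call sites (buffer and index names are illustrative):

```python
# Map a flat per-thread element index back to 2-D shared-memory coordinates.
vi, vj = T.index_to_coordinates(idx, B_dequantize_shared_shape)
B_dequantize_shared[vi, vj] = B_dequantize_local_thread[v]
```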
```python
from .builtin import *  # noqa: F401
from .memscope import *  # noqa: F401
from .utils import index_to_coordinates  # noqa: F401
```
🛠️ Refactor suggestion
Also re-export linear_index to T namespace
You've introduced `linear_index` in `language/utils.py`; re-exporting it alongside `index_to_coordinates` makes it available as `T.linear_index`, keeping the API consistent. Apply this diff:

```diff
-from .utils import index_to_coordinates  # noqa: F401
+from .utils import index_to_coordinates, linear_index  # noqa: F401
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from .utils import index_to_coordinates # noqa: F401 | |
| from .utils import index_to_coordinates, linear_index # noqa: F401 |
🤖 Prompt for AI Agents
In tilelang/language/__init__.py around line 71, re-export the newly added
linear_index from language/utils.py so it’s available as T.linear_index; update
the import on that line to import linear_index alongside index_to_coordinates
(and keep the noqa comment) so the package API exposes linear_index the same way
as index_to_coordinates.
```python
def index_to_coordinates(index, shape) -> list[PrimExpr]:
    """
    Convert a flat (linear) index to multi-dimensional coordinates for a given shape.

    Example:
        shape = (4, 5, 6)
        index = 53
        index_to_coordinates(53, (4, 5, 6)) -> [1, 3, 5]
        # Explanation:
        # 53 // (5*6) = 1   (1st coordinate)
        # 53 % (5*6) = 23
        # 23 // 6 = 3       (2nd coordinate)
        # 23 % 6 = 5        (3rd coordinate)

    Args:
        index (int): The flat index to convert.
        shape (tuple or list of int): The shape of the multi-dimensional array.

    Returns:
        list: A list of coordinates corresponding to each dimension.
    """
    coordinates = []
    dims = len(shape)
    for i in range(dims):
        coordinates.append(index % shape[dims - i - 1])
        index = index // shape[dims - i - 1]
    coordinates.reverse()
    return coordinates
```
🛠️ Refactor suggestion
index_to_coordinates: clarify supported types and guard against empty shapes
The implementation works with both Python ints and PrimExpr; the docstring currently states int-only. Also, an empty shape silently returns `[]`, which is unlikely desired.
Apply this diff to clarify the docstring and add an explicit check:
```diff
 def index_to_coordinates(index, shape) -> list[PrimExpr]:
     """
-    Convert a flat (linear) index to multi-dimensional coordinates for a given shape.
+    Convert a flat (linear) index to multi-dimensional coordinates for a given shape.
+    Works with Python integers and tvm.tir PrimExpr.
@@
-    Args:
-        index (int): The flat index to convert.
-        shape (tuple or list of int): The shape of the multi-dimensional array.
+    Args:
+        index (int | PrimExpr): The flat index to convert.
+        shape (tuple | list[int | PrimExpr]): The shape of the multi-dimensional array (row-major).
@@
-    coordinates = []
-    dims = len(shape)
+    coordinates = []
+    dims = len(shape)
+    if dims == 0:
+        raise ValueError("shape must have at least one dimension")
     for i in range(dims):
         coordinates.append(index % shape[dims - i - 1])
         index = index // shape[dims - i - 1]
     coordinates.reverse()
     return coordinates
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def index_to_coordinates(index, shape) -> list[PrimExpr]: | |
| """ | |
| Convert a flat (linear) index to multi-dimensional coordinates for a given shape. | |
| Example: | |
| shape = (4, 5, 6) | |
| index = 53 | |
| index_to_coordinates(53, (4, 5, 6)) -> [1, 3, 5] | |
| # Explanation: | |
| # 53 // (5*6) = 1 (1st coordinate) | |
| # 53 % (5*6) = 23 | |
| # 23 // 6 = 3 (2nd coordinate) | |
| # 23 % 6 = 5 (3rd coordinate) | |
| Args: | |
| index (int): The flat index to convert. | |
| shape (tuple or list of int): The shape of the multi-dimensional array. | |
| Returns: | |
| list: A list of coordinates corresponding to each dimension. | |
| """ | |
| coordinates = [] | |
| dims = len(shape) | |
| for i in range(dims): | |
| coordinates.append(index % shape[dims - i - 1]) | |
| index = index // shape[dims - i - 1] | |
| coordinates.reverse() | |
| return coordinates | |
| def index_to_coordinates(index, shape) -> list[PrimExpr]: | |
| """ | |
| Convert a flat (linear) index to multi-dimensional coordinates for a given shape. | |
| Works with Python integers and tvm.tir PrimExpr. | |
| Example: | |
| shape = (4, 5, 6) | |
| index = 53 | |
| index_to_coordinates(53, (4, 5, 6)) -> [1, 3, 5] | |
| # Explanation: | |
| # 53 // (5*6) = 1 (1st coordinate) | |
| # 53 % (5*6) = 23 | |
| # 23 // 6 = 3 (2nd coordinate) | |
| # 23 % 6 = 5 (3rd coordinate) | |
| Args: | |
| index (int | PrimExpr): The flat index to convert. | |
| shape (tuple | list[int | PrimExpr]): The shape of the multi-dimensional array (row-major). | |
| Returns: | |
| list: A list of coordinates corresponding to each dimension. | |
| """ | |
| coordinates = [] | |
| dims = len(shape) | |
| if dims == 0: | |
| raise ValueError("shape must have at least one dimension") | |
| for i in range(dims): | |
| coordinates.append(index % shape[dims - i - 1]) | |
| index = index // shape[dims - i - 1] | |
| coordinates.reverse() | |
| return coordinates |
🤖 Prompt for AI Agents
In tilelang/language/utils.py around lines 5 to 33, update the docstring to
state that index may be an int or a PrimExpr and that shape must be a non-empty
tuple or list of ints/PrimExprs, and add an explicit guard that raises a
ValueError if shape is empty; keep the existing algorithm but validate inputs at
the top (check shape is list/tuple and non-empty, optionally coerce or
type-check elements) and document the supported types in the Args and Returns
sections so callers know the function accepts both Python ints and PrimExprs.
```python
    Usage examples:
        linear_index(i) -> i
        linear_index(i, j) -> i * stride + j
        linear_index(i, j, stride_j) -> i * stride_j + j
        linear_index(i, j, k, stride_j, stride_k)
            -> i * stride_j * stride_k + j * stride_k + k

    Example for index = i * threads * local_size + tx * local_size + v:
        Suppose you have i, tx, v as coordinates, and threads, local_size as strides:
        linear_index(i, tx, v, threads, local_size) == i * threads * local_size + tx * local_size + v
    """
```
linear_index docstring contradicts the implementation for the 2-argument case

The example `linear_index(i, j) -> i * stride + j` is not supported by the current logic (the function only succeeds for 1, 3, 5, … arguments). This is confusing and can lead to misuse. Apply this diff to remove the unsupported example and clarify the expected calling convention:
```diff
 Usage examples:
     linear_index(i) -> i
-    linear_index(i, j) -> i * stride + j
-    linear_index(i, j, stride_j) -> i * stride_j + j
+    # For n coordinates (n >= 2), provide (n - 1) strides immediately after coordinates:
+    linear_index(i, j, stride_j) -> i * stride_j + j
     linear_index(i, j, k, stride_j, stride_k)
         -> i * stride_j * stride_k + j * stride_k + k
@@
-    linear_index(i, tx, v, threads, local_size) == i * threads * local_size + tx * local_size + v
+    linear_index(i, tx, v, threads, local_size) == i * threads * local_size + tx * local_size + v
+
+    Notes:
+    - The function expects an odd number of arguments for multi-dimensional indices:
+      total_args = 2 * num_coordinates - 1. If you pass only coordinates without strides,
+      the function will raise a ValueError.
```
+ the function will raise a ValueError.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Usage examples: | |
| linear_index(i) -> i | |
| linear_index(i, j) -> i * stride + j | |
| linear_index(i, j, stride_j) -> i * stride_j + j | |
| linear_index(i, j, k, stride_j, stride_k) | |
| -> i * stride_j * stride_k + j * stride_k + k | |
| Example for index = i * threads * local_size + tx * local_size + v: | |
| Suppose you have i, tx, v as coordinates, and threads, local_size as strides: | |
| linear_index(i, tx, v, threads, local_size) == i * threads * local_size + tx * local_size + v | |
| """ | |
| Usage examples: | |
| linear_index(i) -> i | |
| # For n coordinates (n >= 2), provide (n - 1) strides immediately after coordinates: | |
| linear_index(i, j, stride_j) -> i * stride_j + j | |
| linear_index(i, j, k, stride_j, stride_k) | |
| -> i * stride_j * stride_k + j * stride_k + k | |
| Example for index = i * threads * local_size + tx * local_size + v: | |
| Suppose you have i, tx, v as coordinates, and threads, local_size as strides: | |
| linear_index(i, tx, v, threads, local_size) == i * threads * local_size + tx * local_size + v | |
| Notes: | |
| - The function expects an odd number of arguments for multi-dimensional indices: | |
| total_args = 2 * num_coordinates - 1. If you pass only coordinates without strides, | |
| the function will raise a ValueError. | |
| """ |
🤖 Prompt for AI Agents
In tilelang/language/utils.py around lines 39-49, the docstring incorrectly
claims a two-argument form linear_index(i, j) -> i * stride + j which the
implementation does not support; update the docstring to remove that unsupported
example and instead state the correct calling convention: the function accepts a
single index or an odd number of arguments where you pass m indices plus m-1
stride values (total args = 2*m-1), e.g. linear_index(i) -> i and
linear_index(i, j, stride_j) -> i * stride_j + j and linear_index(i, j, k,
stride_j, stride_k) -> i * stride_j * stride_k + j * stride_k + k; adjust the
example for threads/local_size to use the 5-argument form linear_index(i, tx, v,
threads, local_size) accordingly.
Actionable comments posted: 0
🧹 Nitpick comments (5)
tilelang/language/__init__.py (1)
**71-71**: Re-export looks good; consider also re-exporting `linear_index` for consistency.

`linear_index` lives alongside `index_to_coordinates` in utils and is likely to be used by kernels; exporting both improves the T API ergonomics. Apply:

```diff
-from .utils import index_to_coordinates  # noqa: F401
+from .utils import index_to_coordinates, linear_index  # noqa: F401
```

tilelang/language/utils.py (4)
**1-1**: Remove unused import to satisfy linters.

`tvm` isn't used in this module.

```diff
-from tilelang import tvm as tvm
```
**19-25**: Doc type hints should reflect PrimExpr acceptance.

The function works with PrimExpr; the doc currently states plain int. Clarify to avoid confusion.

```diff
-    Args:
-        index (int): The flat index to convert.
-        shape (tuple or list of int): The shape of the multi-dimensional array.
+    Args:
+        index (int | PrimExpr): The flat index to convert.
+        shape (tuple | list[int | PrimExpr]): The shape of the multi-dimensional array.
```
**35-49**: Fix incorrect usage example for the two-argument form.

`linear_index(i, j)` without a stride raises ValueError (by design). Remove or correct the example.

```diff
-    linear_index(i, j) -> i * stride + j
-    linear_index(i, j, stride_j) -> i * stride_j + j
+    linear_index(i, j, stride_j) -> i * stride_j + j
```
**35-71**: API looks good; consider adding unit tests for invertibility and error paths.

- Verify `index_to_coordinates` followed by recomposition equals the original index for several shapes.
- Verify `linear_index` computes the expected value and raises for invalid arg counts.

I can draft targeted tests (pytest) covering typical shapes, PrimExpr cases, and the ValueError branches if you'd like; a sketch follows.
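A minimal pytest sketch of those two properties (module path follows `tilelang/language/utils.py`; plain-int inputs only, PrimExpr cases would need TVM structural equality):

```python
import pytest
from tilelang.language.utils import index_to_coordinates, linear_index

@pytest.mark.parametrize("shape", [(4, 5, 6), (2, 3), (7,)])
def test_index_round_trip(shape):
    total = 1
    for s in shape:
        total *= s
    for flat in range(total):
        coords = index_to_coordinates(flat, shape)
        # Recompose with the trailing dimensions as strides.
        assert linear_index(*coords, *shape[1:]) == flat

def test_linear_index_rejects_missing_strides():
    with pytest.raises(ValueError):
        linear_index(1, 2)  # two coordinates, no stride
```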
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
- `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py` (2 hunks)
- `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py` (1 hunks)
- `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py` (1 hunks)
- `tilelang/intrinsics/utils.py` (0 hunks)
- `tilelang/language/__init__.py` (1 hunks)
- `tilelang/language/print.py` (1 hunks)
- `tilelang/language/utils.py` (1 hunks)
💤 Files with no reviewable changes (1)
- tilelang/intrinsics/utils.py
✅ Files skipped from review due to trivial changes (1)
- tilelang/language/print.py
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py
- examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py (1)

- tilelang/language/utils.py (1): `index_to_coordinates` (5-32)

tilelang/language/__init__.py (1)

- tilelang/language/utils.py (1): `index_to_coordinates` (5-32)
🔇 Additional comments (2)
examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py (2)
202-202: Switching to T.index_to_coordinates is correct and aligns with the new re-export.The usage and row-major mapping are correct for B_shared_shape. Destructuring the returned list into (vi, vj) is fine.
**214-216**: Verified: all calls to `index_to_coordinates` are correctly namespaced

No deprecated imports of `index_to_coordinates` were found, and the example still uses `T.index_to_coordinates` consistently for both `B_shared_shape` and `B_dequantize_shared_shape`.

- examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py: lines 202 and 214
- Library code in `tilelang/language/utils.py` and `tilelang/language/print.py` uses raw calls internally as expected

No further changes needed.
Actionable comments posted: 5
♻️ Duplicate comments (1)
examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py (1)
**257-260**: Keep FP4 vs MXFP4 reference logic consistent across examples.

This file correctly applies exponent-based scaling in its reference implementations. Ensure the parallel FP4 Hopper example also applies the same scaling logic (or both drop it) to avoid validation mismatches.

I can open a follow-up PR to update `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py` so `ref_program_*` accept/apply Scale consistently. Want me to proceed?
🧹 Nitpick comments (2)
examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py (2)
**7-7**: Import path may break when running examples from repo root.

`from utils import ...` assumes `examples/dequantize_gemm` is on `sys.path`. Tests or users running from repo root will often lack that. Prefer a package-relative import or guarded fallback.

Option A (package import): turn `examples/dequantize_gemm` into a package by adding `__init__.py`, then:

```diff
-from utils import torch_convert_bit_twiddling, torch_convert
+from examples.dequantize_gemm.utils import torch_convert_bit_twiddling, torch_convert
```

Option B (runtime path tweak as a fallback before the import):

```python
import os, sys

_here = os.path.dirname(__file__)
if _here not in sys.path:
    sys.path.append(_here)
from utils import torch_convert_bit_twiddling, torch_convert
```

Please confirm how your tests import the example; I can align the import accordingly.
**86-91**: Redundant self-assignment of `import_source`.

`import_source = import_source` is a no-op and can be removed for clarity. Apply this diff:

```diff
-import_source = import_source
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py` (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py (6)

- examples/dequantize_gemm/utils.py (2): `torch_convert_bit_twiddling` (4-36), `torch_convert` (39-66)
- tilelang/quantize/mxfp.py (1): `get_mxfp_intrin_group` (51-87)
- tilelang/language/__init__.py (1): `import_source` (182-184)
- tilelang/language/allocate.py (3): `alloc_local` (38-49), `alloc_fragment` (52-63), `alloc_shared` (20-35)
- tilelang/language/pipeline.py (1): `Pipelined` (9-46)
- tilelang/profiler/__init__.py (1): `assert_allclose` (76-137)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: format-check
- GitHub Check: bot-task
```python
# Scale is the exponential part, within the representation of uint8
# To handle the overflow, we use the max function to limit the exponential part to 8 bits
e_bf16 = T.min(e_bf16 + scale, tir.const((1 << 8) - 1, "uint16"))
m_f4 = f4 & tir.const(1, "uint16")
val_bf16 = tir.reinterpret("bfloat16",
                           ((((s << tir.const(8, "uint16")) | e_bf16) << tir.const(7, "uint16"))
                            | (m_f4 << tir.const(6, "uint16"))).astype("uint16"))
```
Cast Scale to uint16 to avoid mixed-dtype arithmetic in TIR.

`e_bf16` is uint16 while `scale` is uint8. Adding them without casting can cause TIR type errors. Cast `scale` to uint16 before the addition. Apply this diff:
```diff
-# Scale is the exponential part, within the representation of uint8
-# To handle the overflow, we use the max function to limit the exponential part to 8 bits
-e_bf16 = T.min(e_bf16 + scale, tir.const((1 << 8) - 1, "uint16"))
+# Scale is the exponential part, within the representation of uint8.
+# Cast to uint16 to match e_bf16, then clamp to 8-bit exponent range.
+e_bf16 = T.min(e_bf16 + T.cast(scale, "uint16"), tir.const((1 << 8) - 1, "uint16"))
```
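For intuition, here is the same bit construction in a standalone Python sketch (it mirrors the s/e/m layout of the quoted TIR snippet; the input code and scale values are illustrative):

```python
def fp4_to_bf16_bits(f4: int, scale: int) -> int:
    # f4 is a 4-bit code: 1 sign bit, 2 exponent bits, 1 mantissa bit.
    s = (f4 >> 3) & 0x1
    e_f4 = (f4 >> 1) & 0x3
    m = f4 & 0x1
    e_bf16 = min(e_f4 + scale, (1 << 8) - 1)    # clamp to the 8-bit bf16 exponent field
    return ((s << 8 | e_bf16) << 7) | (m << 6)  # bf16 bits: sign, exponent, top mantissa bit

bits = fp4_to_bf16_bits(0b1011, scale=126)
print(f"{bits:016b}")  # 1011111111000000 -> sign=1, exp=127, mantissa=0b1000000 => -1.5
```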
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Scale is the exponential part, within the representation of uint8 | |
| # To handle the overflow, we use the max function to limit the exponential part to 8 bits | |
| e_bf16 = T.min(e_bf16 + scale, tir.const((1 << 8) - 1, "uint16")) | |
| m_f4 = f4 & tir.const(1, "uint16") | |
| val_bf16 = tir.reinterpret("bfloat16", | |
| ((((s << tir.const(8, "uint16")) | e_bf16) << tir.const(7, "uint16")) | |
| | (m_f4 << tir.const(6, "uint16"))).astype("uint16")) | |
| # Scale is the exponential part, within the representation of uint8. | |
| # Cast to uint16 to match e_bf16, then clamp to 8-bit exponent range. | |
| e_bf16 = T.min(e_bf16 + T.cast(scale, "uint16"), tir.const((1 << 8) - 1, "uint16")) | |
| m_f4 = f4 & tir.const(1, "uint16") | |
| val_bf16 = tir.reinterpret("bfloat16", | |
| ((((s << tir.const(8, "uint16")) | e_bf16) << tir.const(7, "uint16")) | |
| | (m_f4 << tir.const(6, "uint16"))).astype("uint16")) |
🤖 Prompt for AI Agents
In examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py around
lines 21-27, e_bf16 is uint16 while scale is uint8 causing mixed-dtype TIR
arithmetic; cast scale to uint16 before the addition and ensure the T.min
operands are both uint16 so the addition and min operate on the same dtype (also
verify the constant passed to T.min is uint16).
```python
assert K % (block_K * split) == 0
```
🛠️ Refactor suggestion
Guard against invalid scale tiling with explicit divisibility checks.
You index `Scale` using `k * block_K // scale_size` and compute the per-thread scale based on `(scale_size // num_elems_per_byte)`. Add asserts to avoid silent mis-indexing when shapes aren't aligned.
Apply this diff:
```diff
 assert K % (block_K * split) == 0
+assert K % scale_size == 0, "K must be divisible by scale_size"
+assert block_K % scale_size == 0, "block_K must be divisible by scale_size"
+assert (scale_size % num_elems_per_byte) == 0, (
+    "scale_size must be a multiple of elements per byte "
+    f"(num_bits={num_bits} -> {num_elems_per_byte} elems/byte)"
+)
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| assert K % (block_K * split) == 0 | |
| assert K % (block_K * split) == 0 | |
| assert K % scale_size == 0, "K must be divisible by scale_size" | |
| assert block_K % scale_size == 0, "block_K must be divisible by scale_size" | |
| assert (scale_size % num_elems_per_byte) == 0, ( | |
| "scale_size must be a multiple of elements per byte " | |
| f"(num_bits={num_bits} -> {num_elems_per_byte} elems/byte)" | |
| ) |
🤖 Prompt for AI Agents
In examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py around
lines 75-76, add explicit divisibility asserts to prevent silent mis-indexing of
Scale: keep the existing assert that K is divisible by (block_K * split), and
also assert that scale_size is divisible by num_elems_per_byte and that the
number of K-blocks (K // block_K) is divisible by (scale_size //
num_elems_per_byte); these checks ensure k * block_K // scale_size and
per-thread scale computed from (scale_size // num_elems_per_byte) cannot produce
out-of-range indices.
```python
Scale_local_thread_exponent[0] = T.exp2(
    T.cast(Scale_local_thread[0] - 127, "float"))
```
Use a concrete floating dtype for the cast (float32) to avoid an invalid TIR dtype.

`"float"` is not a standard dtype for TIR. Use `"float32"` to ensure proper codegen. Apply this diff:
```diff
-Scale_local_thread_exponent[0] = T.exp2(
-    T.cast(Scale_local_thread[0] - 127, "float"))
+Scale_local_thread_exponent[0] = T.exp2(
+    T.cast(Scale_local_thread[0] - 127, "float32"))
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Scale_local_thread_exponent[0] = T.exp2( | |
| T.cast(Scale_local_thread[0] - 127, "float")) | |
| Scale_local_thread_exponent[0] = T.exp2( | |
| T.cast(Scale_local_thread[0] - 127, "float32")) |
🤖 Prompt for AI Agents
In examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py around
lines 125 to 126, the cast uses the nonstandard TIR dtype "float" which can
cause invalid TIR; change the cast target to the concrete dtype "float32" so the
expression becomes T.cast(Scale_local_thread[0] - 127, "float32") and keep the
surrounding T.exp2(...) call unchanged.
```python
@T.macro
def simple_dequant_bf16_fp4(B_shared, B_dequantize_shared, Scale, k):
    B_local = T.alloc_fragment(B_shared_shape, storage_dtype)
    B_dequantize_local = T.alloc_fragment(B_dequantize_shared_shape, in_dtype)

    bx = T.get_block_binding(0)
    T.copy(B_shared, B_local)
    for i, j in T.Parallel(block_N, block_K):
        B_dequantize_local[i, j] = _tir_u8_to_f4_to_bf16(
            num_bits,
            B_local[i, j // num_elems_per_byte],
            j % num_elems_per_byte,
            Scale[bx * block_N + i, k * block_K // scale_size +
                  j // scale_size],  # Scale is the exponential part, within the representation of uint8
            dtype=in_dtype,
        )
    T.copy(B_dequantize_local, B_dequantize_shared)

return simple_dequant_bf16_fp4
```
🛠️ Refactor suggestion
Fix dtype mismatches in simple dequant path (alloc + converter arg).
- `B_dequantize_local` should be bf16, not `in_dtype`.
- `_tir_u8_to_f4_to_bf16` asserts `dtype == "bfloat16"`, but `dtype=in_dtype` currently passes `"fp4"` and will assert-fail.
Apply this diff:
```diff
 @T.macro
 def simple_dequant_bf16_fp4(B_shared, B_dequantize_shared, Scale, k):
     B_local = T.alloc_fragment(B_shared_shape, storage_dtype)
-    B_dequantize_local = T.alloc_fragment(B_dequantize_shared_shape, in_dtype)
+    B_dequantize_local = T.alloc_fragment(B_dequantize_shared_shape, out_dtype)
     bx = T.get_block_binding(0)
     T.copy(B_shared, B_local)
     for i, j in T.Parallel(block_N, block_K):
         B_dequantize_local[i, j] = _tir_u8_to_f4_to_bf16(
             num_bits,
             B_local[i, j // num_elems_per_byte],
             j % num_elems_per_byte,
             Scale[bx * block_N + i, k * block_K // scale_size +
                   j // scale_size],  # Scale is the exponential part, within the representation of uint8
-            dtype=in_dtype,
+            dtype=out_dtype,
         )
     T.copy(B_dequantize_local, B_dequantize_shared)
```
T.copy(B_dequantize_local, B_dequantize_shared)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @T.macro | |
| def simple_dequant_bf16_fp4(B_shared, B_dequantize_shared, Scale, k): | |
| B_local = T.alloc_fragment(B_shared_shape, storage_dtype) | |
| B_dequantize_local = T.alloc_fragment(B_dequantize_shared_shape, in_dtype) | |
| bx = T.get_block_binding(0) | |
| T.copy(B_shared, B_local) | |
| for i, j in T.Parallel(block_N, block_K): | |
| B_dequantize_local[i, j] = _tir_u8_to_f4_to_bf16( | |
| num_bits, | |
| B_local[i, j // num_elems_per_byte], | |
| j % num_elems_per_byte, | |
| Scale[ | |
| bx * block_N + i, k * block_K // scale_size + j // | |
| scale_size], # Scale is the exponential part, within the representation of uint8 | |
| dtype=in_dtype, | |
| ) | |
| T.copy(B_dequantize_local, B_dequantize_shared) | |
| return simple_dequant_bf16_fp4 | |
| @T.macro | |
| def simple_dequant_bf16_fp4(B_shared, B_dequantize_shared, Scale, k): | |
| B_local = T.alloc_fragment(B_shared_shape, storage_dtype) | |
| B_dequantize_local = T.alloc_fragment(B_dequantize_shared_shape, out_dtype) | |
| bx = T.get_block_binding(0) | |
| T.copy(B_shared, B_local) | |
| for i, j in T.Parallel(block_N, block_K): | |
| B_dequantize_local[i, j] = _tir_u8_to_f4_to_bf16( | |
| num_bits, | |
| B_local[i, j // num_elems_per_byte], | |
| j % num_elems_per_byte, | |
| Scale[ | |
| bx * block_N + i, k * block_K // scale_size + j // | |
| scale_size], # Scale is the exponential part, within the representation of uint8 | |
| dtype=out_dtype, | |
| ) | |
| T.copy(B_dequantize_local, B_dequantize_shared) | |
| return simple_dequant_bf16_fp4 |
🤖 Prompt for AI Agents
In examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py around
lines 152-171, the local dequantized fragment and the converter call use
in_dtype ("fp4") which triggers an assertion in _tir_u8_to_f4_to_bf16; change
B_dequantize_local to be allocated as bfloat16 (not in_dtype) and pass
dtype="bfloat16" (or the bfloat16 variable) into the _tir_u8_to_f4_to_bf16 call
so the fragment type and converter dtype match.
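For context on why the destination fragment must be bf16: the helper decodes an OCP-style FP4 (e2m1) nibble and rebuilds the corresponding bf16 bit pattern, so its result is inherently a bf16 value. A minimal self-contained sketch of the e2m1 value mapping (an illustration of the format, not code from this PR):

```python
def fp4_e2m1_to_float(nibble: int) -> float:
    # e2m1 layout: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit
    s = -1.0 if (nibble >> 3) & 0x1 else 1.0
    e = (nibble >> 1) & 0x3
    m = nibble & 0x1
    if e == 0:  # subnormal encodings: 0.0 and +/-0.5
        return s * 0.5 * m
    return s * (1.0 + 0.5 * m) * 2.0 ** (e - 1)

# The 16 code points cover +/- {0, 0.5, 1, 1.5, 2, 3, 4, 6}
print(sorted(fp4_e2m1_to_float(v) for v in range(16)))
```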
```python
def ref_program_twiddling(A, qB, Scale):
    dtypeC = "bfloat16"
    B = torch_convert_bit_twiddling(qB)
    for i in range(B.shape[0]):
        for j in range(B.shape[1]):
            B[i][j] = B[i][j] * (2**(Scale[i][j // 32] - 127))
    C = torch.matmul(A.to(torch.float), B.T.to(torch.float))
    C = C.to(torch.__getattribute__(dtypeC))
    return C


def ref_program_simple(A, qB, Scale):
    dtypeC = "bfloat16"
    B = torch_convert(qB)
    for i in range(B.shape[0]):
        for j in range(B.shape[1]):
            B[i][j] = B[i][j] * (2**(Scale[i][j // 32] - 127))
    C = torch.matmul(A.to(torch.float), B.T.to(torch.float))
    C = C.to(torch.__getattribute__(dtypeC))
    return C
```
Make reference scaling respect the runtime scale_size (avoid hard-coded 32).
Both reference functions hardcode j // 32. This breaks validation when scale_size != 32.
Apply these diffs:
```diff
@@
 def ref_program_twiddling(A, qB, Scale):
     dtypeC = "bfloat16"
     B = torch_convert_bit_twiddling(qB)
+    # Derive scale_size from tensor shapes: Scale.shape[1] == K // scale_size
+    scale_size = B.shape[1] // Scale.shape[1]
     for i in range(B.shape[0]):
         for j in range(B.shape[1]):
-            B[i][j] = B[i][j] * (2**(Scale[i][j // 32] - 127))
+            B[i][j] = B[i][j] * (2**(Scale[i][j // scale_size] - 127))
     C = torch.matmul(A.to(torch.float), B.T.to(torch.float))
     C = C.to(torch.__getattribute__(dtypeC))
     return C
@@
 def ref_program_simple(A, qB, Scale):
     dtypeC = "bfloat16"
     B = torch_convert(qB)
+    # Derive scale_size from tensor shapes
+    scale_size = B.shape[1] // Scale.shape[1]
     for i in range(B.shape[0]):
         for j in range(B.shape[1]):
-            B[i][j] = B[i][j] * (2**(Scale[i][j // 32] - 127))
+            B[i][j] = B[i][j] * (2**(Scale[i][j // scale_size] - 127))
     C = torch.matmul(A.to(torch.float), B.T.to(torch.float))
     C = C.to(torch.__getattribute__(dtypeC))
     return C
```
🤖 Prompt for AI Agents
In examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py around
lines 210 to 229, the reference functions hard-code j // 32 when indexing Scale,
which breaks validation for other scale sizes; compute scale_size = B.shape[1] //
Scale.shape[1] (or accept scale_size as an argument) at the start of each function
and replace j // 32 with j // scale_size so scaling uses the runtime scale_size;
keep the rest of the logic unchanged.
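Once scale_size is derived from the shapes, the double loop can also be vectorized. A sketch, assuming B is the already-unpacked [N, K] tensor (e.g. the output of torch_convert) and Scale holds the per-block biased uint8 exponents:

```python
import torch

def apply_mx_scales(B: torch.Tensor, Scale: torch.Tensor) -> torch.Tensor:
    # One uint8 exponent (bias 127) per scale_size consecutive elements along K
    scale_size = B.shape[1] // Scale.shape[1]
    factors = torch.pow(2.0, Scale.to(torch.float) - 127.0)
    return B * factors.repeat_interleave(scale_size, dim=1)

# Example shapes: N=4, K=64 with 32-element scale blocks
B = torch.randn(4, 64)
Scale = torch.randint(120, 134, (4, 2), dtype=torch.uint8)
scaled = apply_mx_scales(B, Scale)
```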
Actionable comments posted: 1
♻️ Duplicate comments (5)
examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py (5)

75-76: Add divisibility asserts for scale tiling to avoid mis-indexing.

Guard against invalid (scale_size, num_bits, block_K) combinations that can silently mis-index Scale.

```diff
-    assert K % (block_K * split) == 0
+    assert K % (block_K * split) == 0
+    assert K % scale_size == 0, "K must be divisible by scale_size"
+    assert block_K % scale_size == 0, "block_K must be divisible by scale_size"
+    assert (scale_size % num_elems_per_byte) == 0, (
+        "scale_size must be a multiple of elements per byte "
+        f"(num_bits={num_bits} -> {num_elems_per_byte} elems/byte)"
+    )
```

21-23: Fix mixed-dtype addition in TIR (uint8 + uint16) by casting scale to uint16.

e_bf16 is uint16 while scale is uint8. Adding them as-is can cause TIR type errors or implicit promotions. Cast scale to uint16 before the addition.

```diff
-    # Scale is the exponential part, within the representation of uint8
-    # To handle the overflow, we use the max function to limit the exponential part to 8 bits
-    e_bf16 = T.min(e_bf16 + scale, tir.const((1 << 8) - 1, "uint16"))
+    # Scale is the exponential part, within the representation of uint8.
+    # Cast to uint16 to match e_bf16, then clamp to 8-bit exponent range.
+    e_bf16 = T.min(e_bf16 + T.cast(scale, "uint16"), tir.const((1 << 8) - 1, "uint16"))
```

125-126: Use a concrete TIR dtype: "float32" instead of "float".

"float" is not a valid dtype; this can cause invalid TIR/codegen issues.

```diff
-    Scale_local_thread_exponent[0] = T.exp2(
-        T.cast(Scale_local_thread[0] - 127, "float"))
+    Scale_local_thread_exponent[0] = T.exp2(
+        T.cast(Scale_local_thread[0] - 127, "float32"))
```

148-171: Fix dtype mismatches in simple dequant path and avoid inner shadowing.

- The local dequantized fragment must be BF16 (the dequant result), not "fp4". _tir_u8_to_f4_to_bf16 asserts dtype == "bfloat16"; passing the inner in_dtype="fp4" will assert-fail at compile time.
- Also rename the helper params to avoid shadowing and confusion.

```diff
-    def get_simple_dequant_func(in_dtype="fp4", out_dtype="bfloat16"):
-        assert in_dtype in ["fp4"]
-        assert out_dtype in ["bfloat16"]
+    def get_simple_dequant_func(src_quant_dtype="fp4", dequant_dtype="bfloat16"):
+        assert src_quant_dtype in ["fp4"]
+        assert dequant_dtype in ["bfloat16"]
@@
         def simple_dequant_bf16_fp4(B_shared, B_dequantize_shared, Scale, k):
             B_local = T.alloc_fragment(B_shared_shape, storage_dtype)
-            B_dequantize_local = T.alloc_fragment(B_dequantize_shared_shape, in_dtype)
+            B_dequantize_local = T.alloc_fragment(B_dequantize_shared_shape, dequant_dtype)
@@
                 B_dequantize_local[i, j] = _tir_u8_to_f4_to_bf16(
                     num_bits,
                     B_local[i, j // num_elems_per_byte],
                     j % num_elems_per_byte,
                     Scale[
                         bx * block_N + i, k * block_K // scale_size + j //
-                        scale_size],  # Scale is the exponential part, within the representation of uint8
-                    dtype=in_dtype,
+                        scale_size],  # Scale is the exponential part, within the representation of uint8
+                    dtype=dequant_dtype,
                 )
```

212-231: Derive scale_size at runtime in reference functions (avoid hard-coded 32).

Hard-coding j // 32 breaks validation for other scale_size values. Compute scale_size from tensor shapes (same fix as the diffs in the main comment above).
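Background on the clamped e_bf16 + scale line in the diffs above: multiplying a bf16 value by 2^(scale - 127) amounts to integer addition on its 8-bit exponent field, which is why the helper clamps the sum to 255. A runnable plain-Python illustration of that equivalence (a sketch that ignores subnormal and overflow edge cases):

```python
def scale_bf16_bits(x_bits: int, scale_u8: int) -> int:
    # bf16 layout: 1 sign bit | 8 exponent bits (bias 127) | 7 mantissa bits
    exp = (x_bits >> 7) & 0xFF
    exp = min(exp + (scale_u8 - 127), 0xFF)  # add the unbiased scale, clamp to 8 bits
    return (x_bits & 0x807F) | (exp << 7)

one = 0x3F80  # bf16 bit pattern of 1.0
print(hex(scale_bf16_bits(one, 129)))  # 0x4080, the bf16 pattern of 4.0 (= 1.0 * 2^2)
```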
🧹 Nitpick comments (2)
examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py (2)

90-90: Remove no-op self-assignment.

import_source = import_source is a no-op and can be removed.

```diff
-    import_source = import_source
```

92-111: Avoid shadowing in/out dtype names in helper; clarify parameter intent.

The inner helper's parameters in_dtype/out_dtype shadow the outer kernel's in_dtype/out_dtype, which makes it easy to misuse. Rename for clarity and derive local_size from the dequantized dtype.

```diff
-def get_fast_dequant_twiddling_func(in_dtype="fp4", out_dtype="bfloat16"):
-    assert in_dtype in ["fp4"]
-    assert out_dtype in ["bfloat16"]
+def get_fast_dequant_twiddling_func(src_quant_dtype="fp4", dequant_dtype="bfloat16"):
+    assert src_quant_dtype in ["fp4"]
+    assert dequant_dtype in ["bfloat16"]
@@
-    local_size = MAX_TRANSACTION_SIZE_BITS // DataType(out_dtype).bits
+    local_size = MAX_TRANSACTION_SIZE_BITS // DataType(dequant_dtype).bits
@@
-    B_dequantize_local_thread = T.alloc_local((local_size,), out_dtype)
+    B_dequantize_local_thread = T.alloc_local((local_size,), dequant_dtype)
```
in_dtype/out_dtypeshadow the outer kernel’sin_dtype/out_dtype, which makes it easy to misuse. Rename for clarity and derivelocal_sizefrom the dequantized dtype.-def get_fast_dequant_twiddling_func(in_dtype="fp4", out_dtype="bfloat16"): - assert in_dtype in ["fp4"] - assert out_dtype in ["bfloat16"] +def get_fast_dequant_twiddling_func(src_quant_dtype="fp4", dequant_dtype="bfloat16"): + assert src_quant_dtype in ["fp4"] + assert dequant_dtype in ["bfloat16"] @@ - local_size = MAX_TRANSACTION_SIZE_BITS // DataType(out_dtype).bits + local_size = MAX_TRANSACTION_SIZE_BITS // DataType(dequant_dtype).bits @@ - B_dequantize_local_thread = T.alloc_local((local_size,), out_dtype) + B_dequantize_local_thread = T.alloc_local((local_size,), dequant_dtype)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py (9)
examples/dequantize_gemm/utils.py (2)
  torch_convert_bit_twiddling (4-36)
  torch_convert (39-66)
tilelang/language/tir/op.py (1)
  call_extern (172-194)
tilelang/quantize/mxfp.py (1)
  get_mxfp_intrin_group (51-87)
tilelang/language/__init__.py (1)
  import_source (182-184)
tilelang/language/allocate.py (2)
  alloc_local (38-49)
  alloc_shared (20-35)
tilelang/language/builtin.py (1)
  no_set_max_nreg (145-148)
tilelang/language/pipeline.py (1)
  Pipelined (9-46)
tilelang/utils/tensor.py (1)
  TensorSupplyType (11-18)
tilelang/profiler/__init__.py (1)
  assert_allclose (76-137)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: build-test-nvidia
- GitHub Check: bot-task
🔇 Additional comments (2)
examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py (2)
1-9: AI summary is inconsistent with the actual file contents.

This file is not a placeholder; it contains a full BF16×MXFP4 dequantized GEMM example with kernels, macros, and references. Please disregard the per-file AI summary that mentions a placeholder docstring.

180-208: Good kernel structure and memory orchestration.

Solid use of shared fragments, swizzled layouts, pipelining, and a clear separation of dequant and GEMM paths. The transpose_B=True path aligns with the dequantized B layout.
```python
                T.call_extern(
                    func_name,
                    T.address_of(B_local_thread[0]),
                    T.address_of(B_dequantize_local_thread[0]),
                    1,
                    dtype=out_dtype,
                )
```
Fix T.call_extern signature order; dtype must be the first positional arg.
The current call passes func_name as the first positional argument, and also passes dtype as a keyword, which will raise a Python TypeError. Put the return dtype first (use "void" for side-effect-only calls), then the function name, then args.
```diff
-                T.call_extern(
-                    func_name,
-                    T.address_of(B_local_thread[0]),
-                    T.address_of(B_dequantize_local_thread[0]),
-                    1,
-                    dtype=out_dtype,
-                )
+                T.call_extern(
+                    "void",  # return dtype for extern call
+                    func_name,
+                    T.address_of(B_local_thread[0]),
+                    T.address_of(B_dequantize_local_thread[0]),
+                    1,
+                )
```
🤖 Prompt for AI Agents
In examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py around
lines 129 to 135, the T.call_extern invocation uses the wrong argument order and
a dtype keyword which will raise a TypeError; change the call to pass the return
dtype first (use "void" for a side-effect-only call), then func_name, then the
argument addresses and other positional args, and remove the dtype= keyword so
the call signature is: dtype, func_name, <args>.
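As a quick sanity check of the convention, TVM's underlying tir.call_extern (which the tilelang wrapper appears to mirror) likewise takes the return dtype as its first positional argument:

```python
from tvm import tir

x = tir.Var("x", "float32")
call = tir.call_extern("float32", "expf", x)  # dtype first, then symbol, then args
print(call)
```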
Docstrings generation was requested by @LeiWang1999.

* #725 (comment)

The following files were modified:

* `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py`
* `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py`
* `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py`
* `examples/dequantize_gemm/utils.py`
* `examples/gemm/example_gemm_autotune.py`
* `tilelang/intrinsics/utils.py`
* `tilelang/language/__init__.py`
* `tilelang/language/utils.py`
* `tilelang/quantize/mxfp.py`
* `tilelang/quantize/quantization.py`

Note: Generated docstrings for this pull request at #732
* 📝 Add docstrings to `mxfp4` Docstrings generation was requested by @LeiWang1999. * #725 (comment) The following files were modified: * `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py` * `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py` * `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py` * `examples/dequantize_gemm/utils.py` * `examples/gemm/example_gemm_autotune.py` * `tilelang/intrinsics/utils.py` * `tilelang/language/__init__.py` * `tilelang/language/utils.py` * `tilelang/quantize/mxfp.py` * `tilelang/quantize/quantization.py` * [Lint] More accurate docstring * [Lint] --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
* [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer (#714) * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107 * Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling - Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes. - Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management. - Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body. - Removed obsolete code and improved overall code clarity and maintainability. * lint fix * Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls - Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves. - Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations. * test fix * Enhance global read detection in pipeline planning - Updated the handling of global reads to account for condition expressions within IfThenElse nodes, ensuring accurate identification of global memory accesses. - Introduced a new flag to track whether the visitor is within a condition expression, improving the correctness of buffer access analysis. - Refactored the VisitStmt_ method to properly handle the structure of IfThenElse nodes, enhancing the clarity and maintainability of the code. * Add IndexLegalizer to enforce int64 for out-of-bound indices - Introduced the IndexLegalizer class to ensure that indices in BufferStore and BufferLoad nodes are promoted to int64 when they exceed their type bounds. - Refactored the Int64Promoter logic from flatten_buffer.cc into IndexLegalizer, improving code organization and reusability. - Updated the ConfigIndexBitwidth pass to apply IndexLegalizer after rewriting the body, enhancing the handling of index bitwidths in transformations. * [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced (#718) * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107 * Rename build-test job to build-test-nvidia and specify nvidia as a runner label in CI workflow. * Update CI workflow to specify 'nvidia' as an additional runner label for the format-check job. * fix: NVRTC backend (#717) * fix: NVRTC backend * fix: CI --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [CUDA] Init support for sm_120 (#716) * Init support for sm120 * fmt * resolve comments * unify mma gemm * fmt --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [CI] fix docs ci (#720) * [Chore] fix typos (#719) * chore: fix typos * chore: fix ruff * chore: fix clang-format * [CI][AMD] Add AMD GPU CI and fix some related bugs (#694) * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668) - Enhanced buffer index handling to address precision issues by removing redundant operations. - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. - Updated related documentation to reflect changes in buffer management practices. * Remove obsolete test script for AMD example, streamlining the examples directory. * Remove unused dtype_size variable in AMD example script to streamline code. * Add input configuration file and update AMD example script for enhanced flexibility - Introduced a new input.txt file for configurable parameters. 
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack. - Streamlined the main function for better clarity and organization. - Added a new test script to facilitate running the example with specified parameters. * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations - Deleted input.txt and test.sh files as they are no longer needed. - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance. - Reintroduced swizzle usage in the kernel for better performance. * Refactor AMD example script for FlashAttention-2 - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`. - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls. - Removed outdated comments and improved code organization for better readability. * Refactor formatting in AMD FlashAttention example script - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function. - Streamlined the `main` function parameter formatting for consistency. - Removed unnecessary blank lines to enhance overall code organization. * Update example_amd_flash_attn_fwd.py * Update AMD FlashAttention example and TVM submodule - Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support. - Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads. - Updated the TVM submodule to the latest commit for improved functionality. - Introduced a new test script `test.sh` to facilitate running the new example with specified parameters. * Add CI workflow for automated format checking and testing - Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests. - The workflow includes steps for setting up a Python environment, running format checks, and executing tests. - Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory. * Rename CI workflow from "CI" to "AMD CI" for clarity and specificity. * Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management. * Update AMD CI workflow to install pytest directly instead of using requirements-test.txt * Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt * Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation * Remove Torchaudio package copying from AMD CI workflow to streamline dependency management. * Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment. * Add installation of ROCm in AMD CI workflow - Included a step to execute the `install_rocm.sh` script for improved setup. - Removed unnecessary blank line for better readability in the workflow script. * Remove installation step for ROCm in AMD CI workflow to simplify the setup process. * Update AMD CI workflow to run specific test file with verbose output instead of all tests. 
* Add new tilelang built-in operations for AMD architecture - Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang. - Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects. * Enhance autotuner configurations and GEMM operations in AMD example - Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning. - Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications. * Update autotuner configurations in AMD example for enhanced performance - Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning. - Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns. * Enhance autotuner configurations and memory handling in AMD example - Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities. - Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations. * Refine autotuner configurations and memory usage in AMD example - Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning. - Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations. * Update autotuner configurations in AMD example for enhanced performance - Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities. - Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations. * Enhance autotuner configurations and GEMM operations in AMD example - Expanded thread counts in `get_configs` to include higher values for improved autotuning. - Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications. * Update AMD CI workflow and remove obsolete test script - Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu. - Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure. * Remove TVM subproject from 3rdparty directory * Refactor configuration generation and accumulation logic in AMD example - Reformatted the `get_configs` function for improved readability by aligning parameters. - Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions. * Enhance AMD CI workflow with additional logging and setup steps - Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements. - Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed. 
* Comment out package copying in AMD CI workflow to prevent potential issues during environment setup * Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps * Enhance BuildTileLangHIP function by adding whitespace for improved readability * Refactor kTVMGridConstant definition for clarity and remove unnecessary comment * Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a * lint fix * Update AMD CI workflow to use requirements-rocm.txt for dependency installation * fix ci * Remove dependency on format-check from AMD CI workflow * fix ci * fix ci * fix ci * Remove format-check job from AMD CI workflow * Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow * Add dependency on format-check job in AMD CI workflow * Add format-check job to AMD CI workflow * Update format-check job in AMD CI workflow to run on self-hosted environment * Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes * Update amd_ci.yml --------- Co-authored-by: xinxyxiao <xinyxiao@amd.com> Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy (#724) * [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy * [Typo] Correct architecture selection for CUDA and CDNA * [Refactor] Refactor CUDA code generation to simplify eviction policy handling (#721) * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107 * Refactor CUDA code generation to simplify eviction policy handling - Updated `VisitExpr_` methods in `codegen_cuda.cc` to use default eviction policy for `tma_load`, `tma_load_im2col`, and `tma_store` functions, reducing complexity. - Removed conditional assembly code for `EVICT_NORMAL` in `copy_sm90.h`, streamlining the assembly calls for tensor memory operations. * lint fix * [Language] Introduce `StridedTensor` to support non contigious torch inputs (#722) * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107 * Support strided tensors * Refactor target attribute helper functions for improved clarity * No code changes made in proxy.py and setup.py * lint fix * lint fix via gemini * lint fix * test fix * test fix * lint fix * Update wrapper.py * test fix * Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity. * lint fix --------- Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com> * [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper (#712) * bug fix and support gemm_sr fallback for hopper * Update gemm.cc --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * 📝 Add docstrings to `fix` (#726) Docstrings generation was requested by @LeiWang1999. 
* https://github.com/tile-ai/tilelang/pull/712#issuecomment-3190680851 The following files were modified: * `src/op/gemm.cc` * `src/tl_templates/cuda/gemm_sm90.h` * `src/transform/warp_specialized_rewriter.cc` Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * [CI] Fix AMD CI (#729) * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668) - Enhanced buffer index handling to address precision issues by removing redundant operations. - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. - Updated related documentation to reflect changes in buffer management practices. * Remove obsolete test script for AMD example, streamlining the examples directory. * Remove unused dtype_size variable in AMD example script to streamline code. * Add input configuration file and update AMD example script for enhanced flexibility - Introduced a new input.txt file for configurable parameters. - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack. - Streamlined the main function for better clarity and organization. - Added a new test script to facilitate running the example with specified parameters. * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations - Deleted input.txt and test.sh files as they are no longer needed. - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance. - Reintroduced swizzle usage in the kernel for better performance. * Refactor AMD example script for FlashAttention-2 - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`. - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls. - Removed outdated comments and improved code organization for better readability. * Refactor formatting in AMD FlashAttention example script - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function. - Streamlined the `main` function parameter formatting for consistency. - Removed unnecessary blank lines to enhance overall code organization. * Update example_amd_flash_attn_fwd.py * Enhance AMD example script and update CI workflows - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization. - Added new CI workflows for AMD and documentation publishing. - Updated various requirements files to include necessary dependencies. - Introduced new test cases and examples for better coverage and functionality. - Refactored existing code for improved readability and maintainability. * Remove redundant tool cache cleanup step in AMD CI workflow * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements. --------- Co-authored-by: xinxyxiao <xinyxiao@amd.com> Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> * [Feature] Low-bit twiddling dequantization and FP4 GEMM (#725) * [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16 * [Dequant] Add extern call and serial dequantization * [Dequant] Parallel Dequant wait for fence debug. 
* [Scale] Add scale matrix to mxfp4 gemm * [Remove] Remove fence-buggy example and some generated source cuda code * [MXFP4] Update initial version of MXFP4 GEMM * [Scale] Add scale to latest mxfp4 gemm * [Lint] * [BugFix] Load Scale, disabe TMA to recover performance * [Lint] * [Lint] * [Scale] Use L2 to hold Scale and enable TMA will slightly boost performance * [Lint] * Update example_dequant_gemm_bf16_fp4_hopper_serial.py * Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory. * Refactor dequantization examples for improved readability and consistency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding. * Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency. * lint fix * ci fix * Remove non-existent example * [BugFix] Add smem swizzle to recover performance of TMA * [BugFix] Enough reg for producer when threads=512 --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * 📝 Add docstrings to `mxfp4` (#732) * 📝 Add docstrings to `mxfp4` Docstrings generation was requested by @LeiWang1999. * https://github.com/tile-ai/tilelang/pull/725#issuecomment-3191656561 The following files were modified: * `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py` * `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py` * `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py` * `examples/dequantize_gemm/utils.py` * `examples/gemm/example_gemm_autotune.py` * `tilelang/intrinsics/utils.py` * `tilelang/language/__init__.py` * `tilelang/language/utils.py` * `tilelang/quantize/mxfp.py` * `tilelang/quantize/quantization.py` * [Lint] More accurate docstring * [Lint] --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: tzj-fxz <tzjfxz@gmail.com> * [Refactor] Refactor env into a more flexible version (#740) * Fix environment variable name for compilation print setting in `env.py` * Remove deprecated test file for warp specialized pass configuration and refactor environment variable access in `env.py` to utilize a centralized `EnvVar` class for better management and clarity. * lint fix * Refactor cache check to use `env.is_cache_enabled()` for consistency in `tuner.py` * [Enhancement] Add stride index validation in CythonKernelWrapper (#743) * Introduced an assertion to ensure that the stride index is within the valid range of tensor dimensions in `cython_wrapper.pyx`. * This change prevents potential out-of-bounds errors when accessing tensor dimensions, enhancing the robustness of the code. * [Bugfix]:Fix atomic add auto vectorize memory access out of bound error (#742) * [Bugfix]:Fix atomic add auto vectorize memory access out of bound error * Update atomicadd_vectorize.cc * format * 📝 Add docstrings to PR #744 (#745) * 📝 Add docstrings to `main` Docstrings generation was requested by @LeiWang1999. 
* https://github.com/tile-ai/tilelang/pull/742#issuecomment-3205103559 The following files were modified: * `src/transform/atomicadd_vectorize.cc` * lint fix --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Refactor] Refactor barrier management (#744) * Introduce Barrier * Enhance CUDA kernel with new barrier management and post-processing support - Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers. - Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure. - Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency. - Introduced additional print statements for debugging in the lowering phase of the TileLang engine. - Enhanced the overall structure and readability of the codebase. * Remove unused barrier handling code in CUDA and HIP code generators to streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic. * Enhance barrier management in TileLang - Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework. - Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory. - Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code. - Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine. - Removed deprecated memory scope handling code to enhance clarity and maintainability. * lint fix * lint fix * Remove `allocate_barrier` intrinsic and related code from TileLang to streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability. * Refactor logging in JITKernel to improve kernel compilation tracking - Removed unused import of `torch.backends` in the example file. - Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging. - Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function. * Refactor dequantization tests and update barrier function - Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite. - Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management. * Update CI configuration to increase pytest parallelism from 4 to 8 threads for improved test execution speed. * Fix typos in rasterization parameters and update import path for cached module - Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage. - Updated the import statement for the `cached` module to reflect the new path in the cache submodule. - Added `StridedTensor` import in the language module for enhanced tensor functionality. * Update ci.yml * [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746) * [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy * Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed. 
* Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping. * Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable. * lint fix * Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency. * remove bulk copy * Refactor copy and atomic add operations to support TMA lower configuration - Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations. - Modified `Lower` method in `Copy` to incorporate the new TMA configuration. - Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic. - Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity. - Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach. * Enhance TMA bulk copy logic in `LowerBulkCopy` method - Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling. - Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities. * lint fix * Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions. * Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions - Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations. - Added TODO comments to indicate the need for further improvements in shared memory handling. * Update `native_sparse_attention` function to include TMA configuration options - Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations. - Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1. * Refactor JIT decorator formatting in `native_sparse_attention` function - Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase. - No functional changes were made; this update focuses on code clarity and maintainability. * Enhance thread management and logging in TileLang compilation - Added a method to check if printing is enabled during compilation, improving control over logging behavior. - Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output. - Added comments to clarify the purpose of changes and improve code readability. * Add warp specialization scope and refactor register management in TileLang - Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management. - Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process. - Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management. 
- Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process. * Refactor test for InjectSetMaxNReg pass in TileLang - Improved readability by restructuring conditional checks and assertions in the test cases. - Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic. - Ensured consistent formatting and spacing throughout the test functions for better maintainability. * Enhance bulk copy and store checks in `Copy` class - Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options. - Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations. - Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging. * lint fix * [Refactor] Merge ThreadPartialSync and ThreadStorageSync (#741) * Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency. * Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency. * Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability. * Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality. * typo fix * Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues. * lint fix * test fix * Enhance CI configuration by adding verbose output to pip install command for better visibility during installation. * use ninja instead of make * Add CMake configuration step for Ninja build system in setup.py * Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja. * Enhance CI configuration by adding verbose output to pytest commands for improved test visibility. * Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks. * Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly. * Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes. * Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification. 
Update buffer index calculations to ensure consistent simplification of range expressions. * bugfix * Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp. * Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts. * Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality. * [Enhancement] Optimize loop body handling in IR (#749) - Updated the loop body construction in `ir.cc` to conditionally include an output statement based on the analyzable condition of the `waves` variable. - This change enhances performance by avoiding unnecessary statement wrapping when the condition is met, improving the efficiency of loop execution. Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [MXFP4] Fix bugs and optimize exponential operation (#750) * [MXFP4] Fix bugs - Optimize exp2 with shift operation to boost performance - Fix bug of simple dequantization function call - Fix bug of scaling factor with bias * [Lint] --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h (#751) - Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations. - Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations. * [Enhancement] Add shape checking for reduce options (#748) * Add shape checking for reduce options * lint fix * Handle special case reducing into shape-1 tensor Allow reducing [X, d, Y] into [X, Y] or [X, 1, Y] --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Bugfix] Add missing FP8 header include (#752) * [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h - Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations. - Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations. Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Enhancement] Include cuda_fp8.h in gemm_sm90.h - Added the inclusion of the "cuda_fp8.h" header file to support new data formats in CUDA GEMM operations, enhancing compatibility with recent updates for fp8 types. Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * lint fix * [Refactor] Remove unused tl_shuffle_elect and related functions from common.h - Deleted the `tl_shuffle_elect` function and its associated comments to streamline the codebase. - Added inclusion of "intrin.h" for improved intrinsic support in CUDA operations. - Cleaned up the file by removing unnecessary template parameters and functions, enhancing clarity and maintainability. * lint fix * [Refactor] Update header inclusions in common.h and gemm_sm90.h - Removed the inclusion of "intrin.h" from common.h to streamline dependencies. - Added "intrin.h" inclusion in gemm_sm90.h to ensure intrinsic support for CUDA operations, enhancing functionality and maintainability. 
* bug fix * [MXFP4] Add bias to MXFP4 GEMM kernel (#753) * [MXFP4] Add bias to gemm kernel * [Lint] * [Lint] Rename "bias" to "Bias" * [Bugfix][WS] Consider loop min extent when computing phase id (#754) * Update test parameters and remove debug print statement - Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution. - Removed a debug print statement from `phase.py` to clean up the code and enhance clarity. * Refactor loop stack management in warp_specialized_rewriter - Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability. - Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability. - Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations. * [Typo] Remove `disable_cache` in some tests (#755) * Update test parameters and remove debug print statement - Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution. - Removed a debug print statement from `phase.py` to clean up the code and enhance clarity. * Refactor loop stack management in warp_specialized_rewriter - Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability. - Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability. - Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations. * Remove unused `torch.backends` import and `tilelang.disable_cache()` calls from multiple test files to enhance code clarity and maintainability. * [README] Update GDN README for clarity and add acknowledgements (#758) - Improved formatting and clarity of the GDN kernel implementation description. - Updated requirement section to list dependencies in a clearer format. - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions. * cutlass v4.2.0 supporting cuda 13 (#760) * [Feature] Add 1D TMA support (#761) * [Feature] Add 1D TMA support - Check the contiguous conditions of 1D TMA copy - Add new interface and params order of `tma_load` and `tma_store` call - Add 1D `tma_store` interface in sm90 template - Add elementwise kernel for 1D TMA example * [Lint] * [BugFix] Add conditions for 1D TMA copy on non-swizzle shared tensors * [Lint] * [BugFix] 1D TMA load * [README] Update GDN README for clarity and add acknowledgements (#758) - Improved formatting and clarity of the GDN kernel implementation description. - Updated requirement section to list dependencies in a clearer format. - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions. 
* cutlass v4.2.0 supporting cuda 13 (#760) * [Lint] * [Lint] * [MXFP4] Add test for bf16&mxfp4 gemm * [BugFix] * [Lint] --------- Co-authored-by: Yu Cheng <54519279+chengyupku@users.noreply.github.com> Co-authored-by: Johnny <johnnync13@gmail.com> * [Example] Add vertical slash sparse attention pattern (#762) * upd sparse attn * lint * rename * update test file * update benchmark * lint * update benchmark * [Bugfix] Address PassContext contamination from CI and fix incorrect rewrites in warp specialized pass (#767) * fix ci and pass bug * fix * try * lint * [MXFP4] Add 1D TMA copy for Scale tensor in MXFP4 GEMM (#766) * [TMA] Add 1D TMA copy for Scale tensor * [Lint] * [Test] Add test for kernel * [BugFix] * hot fix blackwell (#768) * [Refactor] Refactor `Operator` into `TileOperator` and with tvm reflection (#763) * Refactor operator classes to inherit from TileOperator and update layout inference methods - Changed base class of several operator classes (AtomicAdd, Copy, Gemm, etc.) from Operator to TileOperator for better alignment with tile operations. - Updated InferLayout and Lower methods to use 'override' specifier for clarity and consistency. - Adjusted header inclusions to replace "op.h" with "operator.h" across multiple files for improved organization. - Added missing layout inference implementations for Fill and Conv2DIm2ColOp. - Removed deprecated op.cc and op.h files to streamline the codebase. * lint fix * Refactor operator classes to use Node pattern and improve memory management - Updated several operator classes (AtomicAdd, Copy, Gemm, etc.) to utilize the Node pattern for better memory management and encapsulation. - Changed constructors to initialize member variables through a node object, enhancing clarity and reducing direct member access. - Updated Clone methods to return TileOperator instances instead of unique pointers, aligning with the new design. - Refactored InferLayout and Lower methods to ensure consistency across operator implementations. - Adjusted header files to reflect the new class structure and removed deprecated code for a cleaner codebase. * Enhance Clone methods in AtomicAdd and Copy classes to support parallel operation cloning - Updated the Clone methods in AtomicAddNode and CopyNode to ensure that the parallel operation (par_op_) is properly cloned when defined, improving the integrity of cloned objects. - Refactored the FillNode class to use ParallelOp directly instead of std::make_unique, streamlining the creation of parallel operations. - Made minor adjustments in layout inference and other related methods for consistency and clarity. * Refactor FillNode::Lower method to remove unused global function call - Eliminated the call to the global function "tl.fill.lower" in the FillNode::Lower method, streamlining the code and improving clarity. - Retained the core functionality of the method while enhancing maintainability by reducing unnecessary dependencies. * [Reducer] Introduce `alloc_reducer` to separate inter and intra warp reduction (#757) * [Enhancement] Introduce finalize_reducer operator and layout reducer support - Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations. - Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management. - Updated `setup.py` to include logging for build directory paths, improving build process visibility. 
* [Reducer] Introduce `alloc_reducer` to separate inter and intra warp reduction (#757)
  * [Enhancement] Introduce the finalize_reducer operator and layout reducer support
    - Added a `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, enabling efficient reduction operations.
    - Implemented layout inference for local.reducer buffers, improving the handling of layout mappings and reducing buffer-management complexity.
    - Updated `setup.py` to log build directory paths, improving build-process visibility.
    - Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations.
    - Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts.
    - Cleaned up test cases by removing unnecessary cache disabling and tuning test parameters for better performance.
  * Refactor code formatting and improve readability in multiple files
    - Cleaned up whitespace in `setup.py`, reformatted `AtomicMax` and `AtomicMin` in `common.h`, restructured `debug_print_var` in `debug.h`, and broke long lines in `atomic_add` in `customize.py`.
  * Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc`.
  * [Enhancement] Disable reuse of small arrays in shared memory allocation
    - Prevented reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, so they are lowered to registers in LLVM for improved performance and memory management.
  * Refactor `setup.py` to remove duplicate logging statements, and expand the `finalize_reducer` documentation in `reduce.py` with detailed parameter and return descriptions.
  * Refactor `finalize_reducer` and `reduce` to remove redundant target checks, retaining only the `TargetIsHopper` check.
  * Add a thread-check workaround for replicated cases and remove the is_one check.
  * Update the autotune tests to use smaller matrix sizes for improved performance and reliability.
  * [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods
    - Updated the constructor and method signatures for consistency with the new TileOperator structure, and enhanced layout inference and cloning in FinalizeReducerOpNode.
    - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main.
  * [Refactor] Update the atomic operations in common.h
    - Enhanced atomic Add, Min, and Max to handle half and bfloat16 types more efficiently.
  * [Refactor] Simplify the CopyNode::LowerBulkCopy logic and update test execution
    - Removed redundant checks for contiguous memory access, streamlining the TMA copy path.
    - Updated test_tilelang_kernel_gemm.py to run the i8i8i32 tensor-op test directly instead of the main testing function.

  ---------

  Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
  Co-authored-by: Freebase6912 <amid-gauze-racing@duck.com>
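The reducer work in #757 splits a reduction into an intra-warp part, accumulated into a dedicated reducer buffer, and an inter-warp combine performed by `finalize_reducer`. A rough usage sketch follows; the `T.alloc_reducer` / `T.finalize_reducer` spellings and signatures here are assumptions inferred from the commit messages, not verified API.

```python
import tilelang.language as T


# Rough sketch (signatures assumed): per-row sum over K, with the
# cross-warp combine deferred to finalize_reducer.
def row_sum(M, K, block_M=128, block_K=128):

    @T.prim_func
    def main(A: T.Tensor((M, K), "float32"), Out: T.Tensor((M,), "float32")):
        with T.Kernel(T.ceildiv(M, block_M), threads=128) as bx:
            A_frag = T.alloc_fragment((block_M, block_K), "float32")
            acc = T.alloc_reducer((block_M,), "float32")  # assumed API
            T.clear(acc)
            for k in T.Pipelined(T.ceildiv(K, block_K)):
                T.copy(A[bx * block_M, k * block_K], A_frag)
                for i, j in T.Parallel(block_M, block_K):
                    acc[i] += A_frag[i, j]  # intra-warp partial sums
            T.finalize_reducer(acc)  # assumed API: combine partials across warps
            T.copy(acc, Out[bx * block_M])

    return main
```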
* 📝 Add docstrings to `pytile_0826` (#770)
  * Docstrings generation was requested by @LeiWang1999 in
    https://github.com/tile-ai/tilelang/pull/763#issuecomment-3224197814
  * The following files were modified:
    - `src/op/atomic_add.cc`, `src/op/atomic_add.h`
    - `src/op/copy.cc`, `src/op/copy.h`
    - `src/op/elem.cc`, `src/op/elem.h`
    - `src/op/gemm.cc`, `src/op/gemm.h`
    - `src/op/gemm_sp.cc`, `src/op/gemm_sp.h`
    - `src/op/operator.cc`, `src/op/operator.h`
    - `src/op/parallel.cc`, `src/op/parallel.h`
    - `src/op/reduce.cc`, `src/op/reduce.h`
    - `src/op/region.cc`, `src/op/region.h`
    - `src/transform/layout_inference.cc`, `src/transform/lower_tile_op.cc`

  ---------

  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
  Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Bugfix] Fix atomic add auto-vectorize negative optimization (#765)
* 📝 Add docstrings to `reducer_0825` (#772)
  * Docstrings generation was requested by @LeiWang1999 in
    https://github.com/tile-ai/tilelang/pull/757#issuecomment-3219088118
  * The following files were modified:
    - `setup.py`
    - `src/op/builtin.h`
    - `src/op/finalize_reducer.cc`, `src/op/finalize_reducer.h`
    - `src/op/parallel.cc`, `src/op/parallel.h`
    - `src/op/reduce.cc`
    - `src/target/codegen_cuda.cc`
    - `src/tl_templates/cuda/common.h`
    - `src/transform/layout_inference.cc`
    - `src/transform/layout_reducer.cc`, `src/transform/layout_reducer.h`
    - `src/transform/merge_shared_memory_allocations.cc`
    - `src/transform/storage_access.cc`
    - `src/transform/warp_specialized_rewriter.cc`
    - `testing/python/autotune/test_tilelang_autotune_with_inputs.py`
    - `tilelang/engine/phase.py`
    - `tilelang/language/customize.py`, `tilelang/language/reduce.py`
    - `tilelang/transform/__init__.py`

  ---------

  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
  Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* Allow fill global buffer (#774)
* [BugFix] Refactor the op check in the LowerTileOp pass using a member function instead of string matching (#771)
* add bf16 exp fallback (#776)
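On #765: the pass in question vectorizes `T.atomic_add` calls over adjacent elements, and the fix keeps it from firing where vectorization is a pessimization. The user-facing pattern it operates on looks roughly like this sketch; the shapes and split strategy are illustrative.

```python
import tilelang.language as T


# Sketch: parallel accumulation into a global buffer. Adjacent j lanes touch
# adjacent addresses, which is the pattern the atomic-add vectorizer targets.
def accumulate(M, N, block_N=256):

    @T.prim_func
    def main(Src: T.Tensor((M, N), "float32"), Dst: T.Tensor((N,), "float32")):
        with T.Kernel(T.ceildiv(N, block_N), threads=128) as bx:
            for i, j in T.Parallel(M, block_N):
                T.atomic_add(Dst[bx * block_N + j], Src[i, bx * block_N + j])

    return main
```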
* [Lint] Introduce clang-tidy into format.sh (#777)
  * [Refactor] Update Clang-Tidy Checks and Improve Code Consistency
    - Enhanced the .clang-tidy configuration with additional checks for better bug detection and performance optimization.
    - Refactored function signatures across multiple files to take `const` references, with consistent parameter handling in `AddPredicate`, `Substitute`, and `PlanLoopPartition`, and consistent use of `std::move` where applicable.
    - Replaced size checks with `empty()` calls in several locations for clearer intent, plus general cleanup for maintainability.
  * [Refactor] Integrate Clang-Tidy Checks and Enhance Code Consistency
    - Added clang-tidy checks to the format script, listed clang-tidy in requirements-lint.txt, and later enabled the `-fix` option for automatic fixes.
  * [CI] Update the AMD CI workflow
    - Added steps to create a build directory and configure CMake with ROCm support during the format check, with cleanup afterwards to keep the workspace clean.
    - Added git submodule initialization/update steps to the CI workflow and the format check, so all required submodules are available.
  * [Refactor] Remove unused member variables in AtomicAddNode and CopyNode
    - Removed the `args_` member variable from both classes to streamline the code.
  * [Refactor] Improve parameter handling and consistency in AtomicAddVectorize
    - Refactored the `AtomicAddVectorizePlanner` class and the `AtomicAddVectorizePlanResult` / `AtomicAddVectorizeRewriter` signatures to use `const` references and `std::move`.
    - Changed `UpdateVectorSize` to accept `const Array<PrimExpr>&`, and `VectorizeAtomicAdd` to take `thread_var` and `thread_bounds` by `const` reference.
* [Cache] Introduce detailed target information for the disk kernel cache (#780)
  * Fix the type hint for the `target_host` parameter of the compile function to allow None.
  * Refactor target handling in the compile function to use `determine_target`; ensure the target is converted to a string in param.py, and use `determine_target` in tuner.py as well.
  * Update `PrintConst` in codegen_cuda.cc to emit hexfloat for bfloat16 and float8/float4 types, with scientific-notation comments for readability, and simplify `PrintType` by dropping unnecessary failure conditions for floating-point types with more than 4 lanes.
  * Enhance benchmark_matmul.py to print Reference TFlops only when ref_latency is not None.
  * Remove the automatic commit-and-push step from the AMD and NVIDIA CI workflows to avoid unnecessary commits.
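The practical effect of #780 is that the on-disk kernel cache now keys on detailed target information, so compiling the same source for a different target misses the cache instead of returning a stale artifact. A sketch of the compile-side usage; the `tilelang.compile` call follows the public docs, but treat the exact signature as an assumption.

```python
import tilelang
import tilelang.language as T


# Toy kernel; any @T.prim_func interacts with the cache the same way.
def fill_one(N, block_N=256):

    @T.prim_func
    def main(A: T.Tensor((N,), "float32")):
        with T.Kernel(T.ceildiv(N, block_N), threads=128) as bx:
            for i in T.Parallel(block_N):
                A[bx * block_N + i] = 1.0

    return main


# The target participates in the disk-cache key, so switching targets triggers
# a fresh build rather than a stale cache hit. (Signature assumed.)
kernel = tilelang.compile(fill_one(1 << 20), target="cuda")
```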
* [Example] Adds example for top-k operation (#775)
  * Adds an example demonstrating the top-k operation using tilelang, plus a matching test.
* [Math] Dispatch `T.rsqrt(x)` into a cuda intrinsic instead of `1 / T.sqrt(x)` (#781)
  * Add the intrin_rule source files to CMakeLists.txt and implement an `hrsqrt` function for half_t in common.h.
  * Remove the cmake dependency from pyproject, as it may lead to different cmake paths in different build stages; later re-add it alongside improved build logging in setup.py.
* [CI] Adds pytest-durations for test timing (#782)
  * Adds `pytest-durations` to the test requirements and configures pytest to display test durations, helping identify slow-running tests and speed up feedback; enables durations on the AMD CI as well.
  * Removes the flash_attn installation from CI.
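After #781, `T.rsqrt(x)` lowers to the hardware rsqrt intrinsics (with an `hrsqrt` path for `half_t`) rather than being expanded to `1 / T.sqrt(x)`. A toy elementwise kernel that exercises the dispatch; shapes and names are illustrative.

```python
import tilelang.language as T


def rsqrt_kernel(N, block_N=256):

    @T.prim_func
    def main(A: T.Tensor((N,), "float16"), B: T.Tensor((N,), "float16")):
        with T.Kernel(T.ceildiv(N, block_N), threads=128) as bx:
            for i in T.Parallel(block_N):
                # Lowers to the CUDA rsqrt intrinsic instead of 1/sqrt(x).
                B[bx * block_N + i] = T.rsqrt(A[bx * block_N + i])

    return main
```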
* [Refactor] Support python reflection for tile operators (#783)
  * Implement the Fill operator and related reflection methods in TileLang
    - Added the Fill operator implementation in `fill.cc` and `fill.h` for element-wise filling of buffers.
    - Introduced reflection methods for the Fill, AtomicAdd, Copy, Conv2DIm2Col, FinalizeReducer, Gemm, and Parallel operators to enhance introspection, registered them, and ensured proper initialization in static blocks.
    - Added Python bindings for the Fill operator in `tilelang/ir/fill.py` and updated the module imports accordingly.
  * Refactor the operator reflection methods and improve code clarity
    - Used `empty()` instead of size checks in the reflection methods for AtomicAdd, Copy, FinalizeReducer, Gemm, and Parallel.
    - Consolidated the static initialization blocks for various operators onto single lines for consistency.
    - Added Python bindings for operators in the `tilelang/ir` module, ensuring proper registration and import organization.
  * Refactor the GEMM and AtomicAdd operations for improved clarity
    - Updated `GetArchInt` in `atomic_add.cc` to use `std::string` and `std::stoi` for better readability and type safety.
    - Removed unnecessary variables and comments in the `ComputeWarpPartition` methods of `gemm.cc` and `gemm_sp.cc`, and removed unused variable declarations in `layout_reducer.cc`.
    - Imported the `ir` module in `tilelang/__init__.py`.
  * Remove the deprecated per-operator binding files (Fill, AtomicAdd, Copy, Gemm, GemmSP, FinalizeReducer, Parallel, Reduce, Region) from the tilelang IR module, and update the imports in `tilelang/ir.py` to reflect the TVM library structure.
  * Refactor the GEMM and GEMM-SP operations to enhance clarity and maintainability
    - Updated the `Gemm` and `GemmSP` classes to use a new `GemmWarpPolicy` object for warp partitioning, replacing the deprecated `ComputeWarpPartition` methods.
    - Introduced a `GemmWarpPolicyNode` class to manage warp-policy attributes and methods, and updated the reflection methods to cover the new structure.
  * Refactor the Reduce operation to use a ReduceType class for improved clarity
    - Replaced multiple conditional checks for reduce types with a single ReduceType object, backed by a new ReduceTypeNode class.
    - Updated MakeInitValue, MakeReduce, and Lower to leverage the new class, and added Python bindings for ReduceType in the tilelang IR module.
  * Refactor the operator header files for readability
    - Cleaned up formatting and whitespace in `atomic_add.h`, `copy.h`, `fill.h`, `reduce.cc`, and `reduce.h`.
  * Refactor the MakeReduce method in ReduceOpNode for clarity
    - Renamed the parameter from `rhs` to `b` and assigned it to `rhs`, clarifying the method's intent.
  * Update the Reduce operation type checks for consistency
    - Changed the reduce-type strings in MakeReduce from "abs_sum" to "abssum" and "abs_max" to "absmax" for uniformity.
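The `ReduceType` refactor in #783 is internal, but it backs the familiar frontend reductions. For orientation, a row-max kernel in the style of the flash-attention examples (names and shapes illustrative):

```python
import tilelang.language as T


def row_max(M, N, block_M=64):

    @T.prim_func
    def main(A: T.Tensor((M, N), "float32"), Out: T.Tensor((M,), "float32")):
        with T.Kernel(T.ceildiv(M, block_M), threads=128) as bx:
            A_frag = T.alloc_fragment((block_M, N), "float32")
            m_frag = T.alloc_fragment((block_M,), "float32")
            T.copy(A[bx * block_M, 0], A_frag)
            # dim=1 reduces across columns; maps to the "max" ReduceType internally.
            T.reduce_max(A_frag, m_frag, dim=1)
            T.copy(m_frag, Out[bx * block_M])

    return main
```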
* [AMD] Fix amd tir & add examples (#784)
  * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
    - Enhanced buffer index handling to address precision issues by removing redundant operations.
    - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection, and updated the related documentation.
  * Remove the obsolete test script for the AMD example and the unused dtype_size variable, streamlining the examples directory.
  * Add an input configuration file and update the AMD example script for enhanced flexibility
    - Introduced a new input.txt file for configurable parameters, and modified example_amd_flash_attn_fwd.py to allow a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
    - Streamlined the main function and added a new test script to run the example with specified parameters.
  * Remove the input configuration file and obsolete test script; enhance the AMD example with swizzle layout annotations
    - Deleted the input.txt and test.sh files, which are no longer needed.
    - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, avoiding bank conflicts, and reintroduced swizzle usage in the kernel for better performance.
  * Refactor the AMD FlashAttention-2 example script
    - Renamed `get_v2_configs` to `get_configs`, `fast_flashattn_v2` to `fast_flashattn`, and `main_v2` to `main`, adjusting the corresponding calls.
    - Removed outdated comments, fixed line breaks and indentation in `fast_flashattn`, and improved overall code organization.
  * Enhance the AMD example script and update CI workflows
    - Added new CI workflows for AMD and documentation publishing, updated the requirements files, and introduced new test cases and examples for better coverage.
  * Remove the redundant tool-cache cleanup step in the AMD CI workflow, and remove the `torch` dependency from `requirements-rocm.txt`.
  * Add a new AMD FlashAttention example and test script
    - Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang, plus a `test.sh` script to run it with specified parameters.
  * Update the autotuner configurations in `example_amd_flash_attn_fwd.py`
    - Reduced the number of threads and `num_split_q` options for improved performance.
    - Adjusted the `panel_size` options to streamline configuration settings.
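The repeated configuration updates above are all edits to an autotuner search space of roughly this shape. The keys mirror the parameters named in the commits, but the concrete values below are hypothetical; the decorator wiring that consumes these configs follows the public autotune examples and is omitted here.

```python
import itertools


# Hypothetical search space: fewer thread and num_split_q options,
# trimmed panel_size choices, per the commit messages.
def get_configs():
    iter_params = dict(
        block_M=[64, 128],
        block_N=[64, 128],
        num_split_q=[1, 2],
        threads=[256],
        num_stages=[1, 2],
        enable_rasterization=[True, False],
        k_pack=[1, 2],
        panel_size=[7, 8],
    )
    return [
        dict(zip(iter_params, values))
        for values in itertools.product(*iter_params.values())
    ]
```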
  * Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217, then to 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c
  * Add an example for the AMD Flash Attention backward pass
    - Introduced `example_amd_flash_attn_bwd.py`, demonstrating the forward and backward Flash Attention passes in TileLang.
    - Implemented JIT-compiled functions for both passes, including preprocessing and postprocessing steps.
    - Added a main function for testing and benchmarking with configurable parameters, plus a reference implementation for validation against PyTorch's attention mechanism.
  * Enhance the AMD Flash Attention example with additional testing capabilities
    - Extended `example_amd_flash_attn_bwd.py` with more comprehensive tests, better parameter configuration for benchmarking, and validation checks against PyTorch's attention mechanism.
  * Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
  * Refactor the HIP intrinsic rules
    - Renamed `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules, and adjusted the include paths for better organization.
  * Update the AMD CI workflow's package management
    - Removed the `flash_attn==2.5.8` installation to streamline CI; added, and later removed as unnecessary, a step uninstalling `torch`, `torchvision`, and `torchaudio` before installing pre-release versions.
  * Remove unused shared memory allocations in the AMD Flash Attention backward example
    - Eliminated the `dv_shared` and `dk_shared` allocations in `example_amd_flash_attn_bwd.py`, reducing memory overhead in the backward pass.
  * Refactor the DispatchHIPWarpActiveMask function in the HIP intrinsic rules
    - Used std::string concatenation for the 16-bit case and added a null check on the CallNode pointer to prevent potential dereferencing issues.
  * Refactor the formatting of the HIP intrinsic rule registrations (aligned TVM_REGISTER_OP method chaining; no functional changes).
  * Update file na…
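Both AMD attention examples validate against PyTorch; the reference check is the standard pattern and worth showing, since the MXFP4 GEMM tests in this PR use the same structure. A minimal sketch, with illustrative tolerances:

```python
import torch
import torch.nn.functional as F


def check_against_torch(q, k, v, out_tilelang, causal=True):
    # Reference path: PyTorch's fused attention on the same inputs.
    ref = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    torch.testing.assert_close(out_tilelang, ref, rtol=1e-2, atol=1e-2)
```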
Known issue: `example_dequant_gemm_mxfp4_hopper.py` should be further debugged for a strict proxy fence (`proxy_fence`) after dequant.
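For context on that note: the dequant step converts packed FP4 (e2m1) values to bf16 in-kernel via bit twiddling, and the strict proxy fence is about ordering those shared-memory writes before the subsequent GEMM. For anyone reproducing the numerics, a plain NumPy reference of the e2m1 decode follows; the in-tree helper is `_tir_u8_to_f4_to_bf16`, and this standalone model only spells out the format (the low-nibble-first packing is an assumption, since the kernel's twiddling path uses its own interleaved layout).

```python
import numpy as np


def fp4_e2m1_to_float(nibble: int) -> float:
    """Decode one FP4 e2m1 value: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit."""
    s = (nibble >> 3) & 1
    e = (nibble >> 1) & 3
    m = nibble & 1
    if e == 0:
        mag = 0.5 * m                            # subnormal: 0.0 or 0.5
    else:
        mag = (1.0 + 0.5 * m) * 2.0 ** (e - 1)   # normal: 1.0 .. 6.0
    return -mag if s else mag


def dequant_fp4_packed(packed: np.ndarray) -> np.ndarray:
    """Unpack two FP4 values per uint8 (low nibble first, assumed) via a LUT."""
    lut = np.array([fp4_e2m1_to_float(v) for v in range(16)], dtype=np.float32)
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    out = np.stack([lut[lo], lut[hi]], axis=-1)
    return out.reshape(*packed.shape[:-1], packed.shape[-1] * 2)


# The sixteen representable values are ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}.
assert dequant_fp4_packed(np.array([0x32], dtype=np.uint8)).tolist() == [1.0, 1.5]
```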
Summary by CodeRabbit

New Features
Tests
Refactor
Chores
Removed