Skip to content

Conversation

@Rachmanino
Copy link
Collaborator

@Rachmanino Rachmanino commented Oct 19, 2025

Fix #1002

  • Added support for wgmma async intrinsics in IsAsyncIntrinsic function.
  • Changed handling of unknown externs to treat them as Generic instead of Async, improving accuracy in proxy kind determination.
    This pull request updates the logic for identifying and handling asynchronous intrinsic calls in the inject_fence_proxy transformation. The main improvements are to more accurately classify certain intrinsic operations and to refine the handling of unknown external calls, reducing unnecessary conservatism.

Key changes include:

Intrinsic classification improvements:

  • Added explicit recognition of tl_gemm and tl_gemm_sp as asynchronous intrinsics in IsAsyncIntrinsic, ensuring these are properly treated as async operations.

Extern call handling refinement:

  • Changed the default handling of unknown external calls from Async to Generic in ProxyFenceInjector, based on the guarantee that GEMM operations are always represented as intrinsic calls and not as externs. This prevents over-insertion of fences for extern calls that are not actually asynchronous.

Summary by CodeRabbit

  • Bug Fixes

    • Improved compiler handling of certain matrix-multiply intrinsics: refined async classification and mapping of unknown external calls for more reliable builds.
  • Tests

    • Updated tests to match intrinsic-call behavior changes, added debug output of transformed modules, and removed an obsolete test case.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 19, 2025

Walkthrough

Marks tile-level wgmma intrinsics tl_gemm and tl_gemm_sp as async, changes unknown call_extern fallback mapping from Async to Generic, and updates tests to use the tl.tl_gemm intrinsic instead of extern calls. No public API changes.

Changes

Cohort / File(s) Summary
Async intrinsic & extern handling
src/transform/inject_fence_proxy.cc
Extended IsAsyncIntrinsic to treat wgmma intrinsics tl_gemm and tl_gemm_sp as async; changed unknown call_extern fallback from Async to Generic with comments explaining gemm/gemm_sp are represented as call_intrin; no exported API changes.
TileLang transform tests
testing/python/transform/test_tilelang_transform_inject_fence_proxy.py
Replaced external gemm call sites with intrinsic tl.tl_gemm via tir.op.Op.get("tl.tl_gemm"), updated GEMM template and call args accordingly, removed test_wgmma_after_descriptor and related scaffolding, and adjusted minor formatting.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Traversal as IR traversal
    participant Intrinsic as IsAsyncIntrinsic
    participant ExternMap as Extern mapper

    Note over Traversal: Inspect call nodes (call_intrin / call_extern)
    Traversal->>Intrinsic: is call_intrin name == tl_gemm / tl_gemm_sp?
    alt intrinsic is wgmma
        Intrinsic-->>Traversal: Async
    else not wgmma
        Traversal->>ExternMap: is call_extern known?
        alt known extern
            ExternMap-->>Traversal: MappedKind (Async/Generic/etc.)
        else unknown extern
            ExternMap-->>Traversal: Generic (fallback)
        end
    end
    Note over Traversal: Resolved kind drives fence-proxy insertion decision
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • LeiWang1999

Poem

🐰 I hop through code with whiskers bright,
tl_gemm hops async into the night.
Unknown externs find a gentler lane,
fences placed where logic deigns.
Carrots, prints, and tests take flight 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "[Enhancement] Update async intrinsic handling in inject_fence_proxy" directly and accurately reflects the main objective of the pull request. It clearly specifies the component being modified (inject_fence_proxy) and the primary change (updating async intrinsic handling). The title is concise, avoids vague terminology, and provides sufficient clarity for someone reviewing the commit history to understand the nature of the change without needing to examine the full diff.
Linked Issues Check ✅ Passed The pull request successfully addresses the core coding requirements from issue #1002. The changes to IsAsyncIntrinsic to explicitly recognize tl_gemm and tl_gemm_sp as async intrinsics directly address the need to properly classify GEMM operations [#1002]. The modification to treat unknown external calls as Generic instead of Async directly implements the solution to prevent spurious fence insertion for non-async externs [#1002]. The test file updates reflect the new approach of representing GEMM as intrinsic calls rather than extern calls, which aligns with the stated objective of consistent GEMM representation [#1002].
Out of Scope Changes Check ✅ Passed The primary code changes in inject_fence_proxy.cc are directly aligned with the PR objectives of improving async intrinsic identification and reducing unnecessary fence insertion. The test file modifications update GEMM call representations from extern to intrinsic, which is a necessary adjustment to validate the new behavior. The removal of the test_wgmma_after_descriptor test function appears to be a related cleanup of legacy test scaffolding that tested the previous extern-based GEMM handling, as noted in the PR comments about legacy tests still representing GEMM as extern. All material changes are focused on achieving the stated goal of fixing issue #1002.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3858df and 0360e4c.

📒 Files selected for processing (1)
  • testing/python/transform/test_tilelang_transform_inject_fence_proxy.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/transform/test_tilelang_transform_inject_fence_proxy.py (2)
tilelang/language/tir/op.py (1)
  • call_intrin (119-144)
src/transform/inject_fence_proxy.cc (16)
  • op (145-159)
  • op (145-145)
  • op (176-201)
  • op (176-176)
  • op (203-225)
  • op (203-203)
  • op (227-231)
  • op (227-227)
  • op (233-242)
  • op (233-233)
  • op (244-250)
  • op (244-244)
  • op (252-257)
  • op (252-252)
  • op (259-269)
  • op (259-259)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
🔇 Additional comments (2)
testing/python/transform/test_tilelang_transform_inject_fence_proxy.py (2)

49-50: Consistent intrinsic usage in expected output.

The after() function correctly shows the same intrinsic call with matching template parameters. The test properly validates that T.fence_proxy_async() (line 48) is inserted before the async GEMM intrinsic, which is the expected behavior per the PR objectives.


34-35: Template parameter change raises concerns about PR scope and intentionality.

The conversion to call_intrin is correct for the PR objectives; however, verification reveals a critical inconsistency:

  • This file only: Uses call_intrin with template <128, 128, 32, ...> (13 parameters)
  • All other test files (test_tilelang_transform_warp_specialized.py, test_tilelang_transform_multi_version_buffer.py, test_tilelang_transform_inject_set_max_nreg.py): Still use call_extern with template <64, 64, 32, 4, 1, 0, 0> (7 parameters)

This dual difference suggests either:

  1. The PR is incomplete—other test files also need call_externcall_intrin migration and template updates, or
  2. The template parameter change (64→128 dimensions, 7→13 parameters) is deliberate and specific to the fence proxy test

Please clarify which scenario applies and confirm the template parameter change is intentional.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

* Added support for wgmma async intrinsics in IsAsyncIntrinsic function.
* Changed handling of unknown externs to treat them as Generic instead of Async, improving accuracy in proxy kind determination.
@Rachmanino
Copy link
Collaborator Author

It seems that we have to fix legacy tests, where gemm is still represented as extern🤔

@LeiWang1999
Copy link
Member

I'll take a look

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
testing/python/transform/test_tilelang_transform_inject_fence_proxy.py (1)

34-38: Use T.evaluate for consistency (optional).

The switch to tl.tl_gemm intrinsic looks right for async handling. For consistency with other spots in this file (e.g., Lines 97–101 and 131), consider wrapping the intrinsic in T.evaluate.

Apply this minimal change:

-            T.call_intrin("handle", tir.op.Op.get("tl.tl_gemm"),
+            T.evaluate(T.call_intrin("handle", tir.op.Op.get("tl.tl_gemm"),
                           "tl::gemm_ss<128, 128, 32, 4, 1, 0, 0, 0, 32, 128, 0, 0, true>",
                           T.tvm_access_ptr(T.type_annotation("float16"), A_shared.data, 0, 2048, 1),
                           T.tvm_access_ptr(T.type_annotation("float16"), B_shared.data, 0, 2048, 1),
-                          T.tvm_access_ptr(T.type_annotation("float32"), C_local.data, 0, 32, 3))
+                          T.tvm_access_ptr(T.type_annotation("float32"), C_local.data, 0, 32, 3)))

Optionally add a sibling test using tl_gemm_sp to lock in its async classification as well.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aad2051 and c3858df.

📒 Files selected for processing (1)
  • testing/python/transform/test_tilelang_transform_inject_fence_proxy.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/transform/test_tilelang_transform_inject_fence_proxy.py (2)
tilelang/language/tir/op.py (1)
  • call_intrin (119-144)
src/transform/inject_fence_proxy.cc (16)
  • op (145-159)
  • op (145-145)
  • op (176-201)
  • op (176-176)
  • op (203-225)
  • op (203-203)
  • op (227-231)
  • op (227-227)
  • op (233-242)
  • op (233-233)
  • op (244-250)
  • op (244-244)
  • op (252-257)
  • op (252-252)
  • op (259-269)
  • op (259-259)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Quick Lint
🔇 Additional comments (1)
testing/python/transform/test_tilelang_transform_inject_fence_proxy.py (1)

49-53: After-path matches intended fence insertion before async tl_gemm.

The updated intrinsic call in the “after” IR matches the transform’s new behavior and keeps the expected fence placement. LGTM.

@Rachmanino
Copy link
Collaborator Author

I'll take a look

really appreciate your help

…proxy.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@LeiWang1999 LeiWang1999 merged commit bb8b3cd into tile-ai:main Oct 20, 2025
6 checks passed
@Rachmanino Rachmanino deleted the proxy branch October 20, 2025 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Generating useless tl::fence_proxy_async()

2 participants