
Conversation

Collaborator

@Rachmanino Rachmanino commented Nov 22, 2025

As titled.

Summary by CodeRabbit

  • New Features

    • Added five warp-level reduction ops: warp_reduce_sum, warp_reduce_max, warp_reduce_min, warp_reduce_bitand, warp_reduce_bitor.
    • Exposed these ops in the public Python API for use in kernels.
    • CUDA code generation now emits warp-level reduction calls for these ops.
  • Tests

    • New unit tests validating sum, max, min, bitwise-AND, and bitwise-OR warp reductions on CUDA.


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

Contributor

coderabbitai bot commented Nov 22, 2025

Walkthrough

Adds five warp-level reduction intrinsics (warp_reduce_sum, warp_reduce_max, warp_reduce_min, warp_reduce_bitand, warp_reduce_bitor) across C++ op registration, CUDA codegen, CUDA device templates, and the Python tilelang API.

Changes

  • Op declaration & registration (src/op/builtin.h, src/op/builtin.cc): Added five new public Op accessors and registration entries: tl.warp_reduce_sum, tl.warp_reduce_max, tl.warp_reduce_min, tl.warp_reduce_bitand, tl.warp_reduce_bitor. Each is defined with a single input and marked opaque via TCallEffectKind.
  • CUDA codegen (src/target/codegen_cuda.cc): Extended CallNode handling to emit calls to tl::warp_reduce_* functions for the new TL intrinsics; falls back to the base CodeGenC when unmatched.
  • CUDA device templates (src/tl_templates/cuda/reduce.h): Added a generic warp_reduce<T, ReduceOp> using xor-based shuffles and five specialized wrappers: warp_reduce_sum, warp_reduce_max, warp_reduce_min, warp_reduce_bitand, warp_reduce_bitor (TL_DEVICE).
  • Python API (tilelang/language/reduce.py, tilelang/language/__init__.py): Added Python helpers warp_reduce_sum, warp_reduce_max, warp_reduce_min, warp_reduce_bitand, warp_reduce_bitor (accepting tir.PrimExpr) and exported them from the package.
  • Tests (testing/python/language/test_tilelang_language_warp_reduce.py): New tests exercising all five warp reductions via a Torch CUDA kernel, including correctness checks against reference reductions.

Sequence Diagram(s)

sequenceDiagram
    participant Py as Python API
    participant TL as TL Intrinsic (Op)
    participant CG as CodeGenCUDA
    participant Dev as CUDA device template
    participant GPU as GPU warp lanes

    Note over Py,TL: high-level flow for a warp reduction intrinsic
    Py->>TL: construct CallNode for tl.warp_reduce_*
    TL->>CG: lowering of CallNode
    CG->>Dev: emit call to tl::warp_reduce_* in generated CUDA
    Dev->>GPU: perform xor-shuffle reduction across lanes
    GPU-->>Dev: reduced lane result
    Dev-->>CG: emitted expression/value
    CG-->>TL: lowered result
    TL-->>Py: value used in Python-level expression
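
To make the flow above concrete, here is a minimal device-side sketch of the kind of code the last two hops produce and consume. The kernel, buffer names, and include are illustrative assumptions; only the tl::warp_reduce_sum helper itself comes from this PR.

```cuda
// Illustrative sketch only: roughly what lowered device code looks like once
// CodeGenCUDA emits a tl::warp_reduce_* call. Kernel and names are assumed.
#include "reduce.h"  // assumed: the TL CUDA reduce templates are on the include path

__global__ void warp_sum_example(const float *__restrict__ x,
                                 float *__restrict__ y) {
  float v = x[threadIdx.x];    // one element per lane of a 32-thread warp
  v = tl::warp_reduce_sum(v);  // xor-shuffle butterfly across all 32 lanes
  y[threadIdx.x] = v;          // every lane now holds the same warp-wide sum
}
```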

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

  • Pay attention to:
    • src/tl_templates/cuda/reduce.h — correctness of shuffle offsets, initial value/op identity, and type handling.
    • src/target/codegen_cuda.cc — matching intrinsic names/arity and generated call syntax.
    • src/op/builtin.* — consistent registration metadata and effect kind.
    • Tests — verify kernel correctness and edge cases for bitwise ops.

Possibly related PRs

Suggested reviewers

  • LeiWang1999

Poem

🐰
I hop through lanes of CUDA light,
XOR whispers through the night,
Sum and max, min and bit,
I nudge each lane to share a fit,
Rabbit cheers — reductions bright! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 46.67%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title '[Feat] Support warp reduce' directly and clearly describes the main change: adding support for warp reduction operations across five new reduction functions (sum, max, min, bitand, bitor).


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
src/target/codegen_cuda.cc (1)

2612-2621: CUDA mapping for tl.warp_reduce_* intrinsics looks consistent

The new cases correctly lower the TIR intrinsics to tl::warp_reduce_* device helpers and match the one-argument registration in builtin.cc. If you want additional safety, you could add an ICHECK_EQ(op->args.size(), 1U); check in each branch, but it is not strictly necessary given set_num_inputs(1).
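
For illustration, the suggested guard could sit at the top of each branch roughly as follows. This is a fragment, not the actual codegen_cuda.cc source; the Op accessor name and stream handling follow the usual TVM CodeGenC pattern and are assumptions.

```cpp
// Hypothetical shape of one branch with the optional arity check; not copied
// from the PR. Assumes the tl::warp_reduce_sum() Op accessor from builtin.h.
} else if (op->op.same_as(tl::warp_reduce_sum())) {
  ICHECK_EQ(op->args.size(), 1U);  // intrinsic is registered with one input
  os << "tl::warp_reduce_sum(";
  this->PrintExpr(op->args[0], os);
  os << ")";
}
```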

tilelang/language/__init__.py (1)

56-73: Warp-reduce helpers are correctly re-exported

The additional imports from .reduce cleanly expose the new warp_reduce_* helpers at the tilelang.language level and match the implementations in reduce.py. Ruff’s RUF100 about # noqa: F401 is just a config mismatch; if you care about it, you could switch these to bare # noqa (or enable F401 in Ruff), but keeping them as-is is consistent with the rest of this module.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 470eb74 and 86658ec.

📒 Files selected for processing (6)
  • src/op/builtin.cc (1 hunks)
  • src/op/builtin.h (1 hunks)
  • src/target/codegen_cuda.cc (1 hunks)
  • src/tl_templates/cuda/reduce.h (1 hunks)
  • tilelang/language/__init__.py (1 hunks)
  • tilelang/language/reduce.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
src/target/codegen_cuda.cc (1)
tilelang/language/reduce.py (5)
  • warp_reduce_sum (330-347)
  • warp_reduce_max (350-367)
  • warp_reduce_min (370-387)
  • warp_reduce_bitand (390-407)
  • warp_reduce_bitor (410-427)
src/tl_templates/cuda/reduce.h (1)
tilelang/language/reduce.py (5)
  • warp_reduce_sum (330-347)
  • warp_reduce_max (350-367)
  • warp_reduce_min (370-387)
  • warp_reduce_bitand (390-407)
  • warp_reduce_bitor (410-427)
src/op/builtin.h (1)
tilelang/language/reduce.py (5)
  • warp_reduce_sum (330-347)
  • warp_reduce_max (350-367)
  • warp_reduce_min (370-387)
  • warp_reduce_bitand (390-407)
  • warp_reduce_bitor (410-427)
tilelang/language/__init__.py (1)
tilelang/language/reduce.py (5)
  • warp_reduce_sum (330-347)
  • warp_reduce_max (350-367)
  • warp_reduce_min (370-387)
  • warp_reduce_bitand (390-407)
  • warp_reduce_bitor (410-427)
tilelang/language/reduce.py (1)
tilelang/language/tir/op.py (1)
  • call_intrin (120-145)
src/op/builtin.cc (1)
tilelang/language/reduce.py (5)
  • warp_reduce_sum (330-347)
  • warp_reduce_max (350-367)
  • warp_reduce_min (370-387)
  • warp_reduce_bitand (390-407)
  • warp_reduce_bitor (410-427)
🪛 Ruff (0.14.5)
tilelang/language/__init__.py

68-68: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


69-69: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


70-70: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


71-71: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


72-72: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Quick Lint
🔇 Additional comments (3)
src/op/builtin.cc (1)

344-368: Builtin registration for tl.warp_reduce_* is coherent

The five new intrinsics are registered consistently with the existing TL ops: correct naming, one input each, and kOpaque effect kind, which is a reasonable choice for warp-synchronous operations.
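
For readers unfamiliar with the pattern, a one-input opaque registration of this shape looks roughly as follows. This is a sketch using plain TVM registration calls; the actual builtin.cc may route this through a TileLang-specific macro with the same effect.

```cpp
// Sketch of a one-input, opaque TIR op registration in TVM style; the real
// file may use a project macro that expands to an equivalent registration.
TVM_REGISTER_OP("tl.warp_reduce_sum")
    .set_num_inputs(1)
    .set_attr<TCallEffectKind>("TCallEffectKind",
                               Integer(CallEffectKind::kOpaque));
```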

src/op/builtin.h (1)

574-598: Header declarations for warp_reduce_* match registrations and usage

The five new TVM_DLL declarations are correctly named, documented, and aligned with the corresponding definitions in builtin.cc and the uses in CUDA codegen and Python bindings.
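
As a reference point, such accessor declarations typically take the form below (sketch only; namespaces and per-op doc comments are abbreviated, not copied from builtin.h):

```cpp
// Sketch of the accessor style for TL builtin ops; the real header documents
// each op and declares all five reductions.
TVM_DLL const Op& warp_reduce_sum();
TVM_DLL const Op& warp_reduce_max();
TVM_DLL const Op& warp_reduce_min();
```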

src/tl_templates/cuda/reduce.h (1)

254-288: Warp-level reduction helpers are well-integrated

The generic warp_reduce and the five specialized warp_reduce_* wrappers reuse the existing reducer functors and follow the same shuffle/mask conventions as the rest of this header. This gives a clear, composable warp-reduction primitive for the CUDA backend.
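
To illustrate the delegation pattern: the actual wrappers reuse reducer functors already defined earlier in this header, whose names are not reproduced here; the lambdas below are stand-ins for those functors.

```cpp
// Illustration only: each wrapper forwards to the generic warp_reduce with a
// binary op. reduce.h uses its existing reducer functors rather than lambdas.
template <typename T>
TL_DEVICE T warp_reduce_sum_sketch(T value) {
  return warp_reduce(value, [](T a, T b) { return a + b; });
}

template <typename T>
TL_DEVICE T warp_reduce_max_sketch(T value) {
  return warp_reduce(value, [](T a, T b) { return a > b ? a : b; });
}
```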

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/tl_templates/cuda/reduce.h (1)

253-262: Use tl::shfl_xor_sync for consistency with the rest of the file.

The warp_reduce implementation uses __shfl_xor_sync directly, but the rest of this file consistently uses tl::shfl_xor_sync (lines 75, 95, 207, 217, 233, 243). The butterfly reduction pattern is correct, but maintaining consistency with the namespace prefix improves code uniformity.

Apply this diff:

 template <typename T, typename ReduceOp>
 TL_DEVICE T warp_reduce(T value, ReduceOp op) {
   constexpr uint32_t mask = 0xffffffff;
-  value = op(value, __shfl_xor_sync(mask, value, 16));
-  value = op(value, __shfl_xor_sync(mask, value, 8));
-  value = op(value, __shfl_xor_sync(mask, value, 4));
-  value = op(value, __shfl_xor_sync(mask, value, 2));
-  value = op(value, __shfl_xor_sync(mask, value, 1));
+  value = op(value, tl::shfl_xor_sync(mask, value, 16));
+  value = op(value, tl::shfl_xor_sync(mask, value, 8));
+  value = op(value, tl::shfl_xor_sync(mask, value, 4));
+  value = op(value, tl::shfl_xor_sync(mask, value, 2));
+  value = op(value, tl::shfl_xor_sync(mask, value, 1));
   return value;
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86658ec and bed2858.

📒 Files selected for processing (4)
  • src/op/builtin.cc (1 hunks)
  • src/op/builtin.h (1 hunks)
  • src/tl_templates/cuda/reduce.h (1 hunks)
  • tilelang/language/reduce.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/op/builtin.h
🧰 Additional context used
🧬 Code graph analysis (3)
src/tl_templates/cuda/reduce.h (1)
tilelang/language/reduce.py (5)
  • warp_reduce_sum (330-343)
  • warp_reduce_max (346-359)
  • warp_reduce_min (362-375)
  • warp_reduce_bitand (378-391)
  • warp_reduce_bitor (394-407)
tilelang/language/reduce.py (1)
tilelang/language/tir/op.py (1)
  • call_intrin (120-145)
src/op/builtin.cc (1)
tilelang/language/reduce.py (5)
  • warp_reduce_sum (330-343)
  • warp_reduce_max (346-359)
  • warp_reduce_min (362-375)
  • warp_reduce_bitand (378-391)
  • warp_reduce_bitor (394-407)
🔇 Additional comments (3)
src/tl_templates/cuda/reduce.h (1)

264-282: LGTM!

The five warp reduction wrappers are well-structured and correctly delegate to the generic warp_reduce template with the appropriate reduction operators.

src/op/builtin.cc (1)

344-367: LGTM!

The five warp reduction intrinsics are correctly registered following the established pattern. Using CallEffectKind::kOpaque is appropriate since these operations involve warp-level thread synchronization via shuffle instructions.

tilelang/language/reduce.py (1)

330-407: LGTM!

All five warp reduction functions are consistently implemented with clear docstrings. The parameter naming is correct throughout (the previous docstring inconsistency has been addressed). Each function properly delegates to the corresponding registered TL intrinsic.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
testing/python/language/test_tilelang_language_warp_reduce.py (4)

1-32: Kernel factory correctly wires warp reductions; confirm closure handling and warp-size assumption

The kernel setup looks good: each thread loads x[tx], applies the selected T.warp_reduce_* op, and writes the warp-wide result back, with an upfront assert guarding unsupported reduce_op values. Two subtle points to double-check:

  1. Using the Python variable reduce_op inside @T.prim_func assumes the TileLang/TIR front-end correctly captures closure constants and resolves the if at compile time; if closures aren’t supported here, this could fail during script parsing or try to generate an invalid string comparison on device.
  2. The kernel is hard-coded to 32 threads and a length-32 tensor, effectively assuming a 32-lane warp; that’s fine for current CUDA targets, but you may need to revisit this if you want the same test to cover backends with different warp widths.

35-41: Sum test is sound; consider determinism and extra dtype coverage

This is a clean end-to-end check for warp_reduce_sum, and torch.testing.assert_close is appropriate for floating-point sums. If you want to tighten things up, you could seed the RNG (e.g., via torch.manual_seed or any existing tilelang.testing utility) for deterministic inputs, and optionally add a second case for another supported dtype (e.g., float16 or int32) once the intrinsic is confirmed to support it.


43-49: Remove debug print from test_warp_reduce_max

Printing kernel.get_kernel_source() on every test run will clutter CI logs and slow down large suites. Unless you explicitly rely on this output in automation, it’s better to drop or gate it behind a debug flag.

You can simplify the test with:

-    print(kernel.get_kernel_source())

60-68: Bitwise-AND test is correct; consider making the reference more explicit

The sequential reduction with & over the CUDA int32 tensor is a valid reference and matches the intended warp behavior. If you ever want to decouple the reference from device semantics, you could compute it on CPU via .item() and plain Python ints, but for bitwise ops the current approach is already precise.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bed2858 and 5dc3cf3.

📒 Files selected for processing (1)
  • testing/python/language/test_tilelang_language_warp_reduce.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/language/test_tilelang_language_warp_reduce.py (2)
tilelang/language/allocate.py (1)
  • alloc_local (45-56)
tilelang/language/reduce.py (5)
  • warp_reduce_sum (330-343)
  • warp_reduce_max (346-359)
  • warp_reduce_min (362-375)
  • warp_reduce_bitand (378-391)
  • warp_reduce_bitor (394-407)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
🔇 Additional comments (3)
testing/python/language/test_tilelang_language_warp_reduce.py (3)

52-57: Min test mirrors sum/max and looks correct

The structure matches the sum/max tests and should effectively validate warp_reduce_min (including that all lanes see the same reduced value). No issues from a correctness perspective.


71-79: Bitwise-OR test matches the AND test pattern and looks good

Same pattern as the bitwise-AND test; it should reliably catch regressions in warp_reduce_bitor and ensure all lanes receive the same reduced value. No changes needed.


82-83: Main guard integration with tilelang.testing

The if __name__ == "__main__": tilelang.testing.main() guard is a nice touch for running this file directly, and shouldn’t interfere with normal test discovery.

@LeiWang1999 LeiWang1999 merged commit caa6dd3 into tile-ai:main Nov 24, 2025
13 of 18 checks passed
