# [Feat][Doc] Enhance TileLang with NVSHMEM support and build system improvements #18
## Conversation
…provements

- Added support for NVSHMEM in the environment configuration, enabling distributed memory operations.
- Introduced custom build targets for CUDA extensions in the CMake configuration.
- Updated installation documentation to reflect the new repository name and NVSHMEM usage.
- Created a new guide for running examples with and without NVSHMEM.
- Refactored various components to utilize NVSHMEM for distributed tensor management and memory allocation.
- Improved error handling in the allocator for better memory management.
Caution: Review failed. The pull request is closed.

**Walkthrough**

Adds CMake custom targets to build two PyTorch extensions in-place and wires them into the main build. Updates installation and distributed run docs, introduces `TILELANG_USE_NVSHMEM` and `USE_NVSHMEM`, refactors NVSHMEM gating across components, updates the distributed launcher, adjusts CUDA binding imports, adds a `host_table` init helper, adds an allocator overflow check, and adds distributed example tests and minor example defaults.
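For orientation, a minimal sketch of the boolean env-flag convention the walkthrough refers to (the parsing rule matches the `tilelang/env.py` diff quoted later in this review; the `env_flag` helper name is illustrative, not part of the PR):

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    # "1", "true", "on" (case-insensitive) all enable the feature.
    return os.environ.get(name, default).lower() in ("1", "true", "on")

USE_DISTRIBUTED = env_flag("TILELANG_USE_DISTRIBUTED")
USE_NVSHMEM = env_flag("TILELANG_USE_NVSHMEM")
```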
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
  autonumber
  actor Dev as Developer
  participant CMake as CMake
  participant AllocExt as alloc_cuda setup.py
  participant IPCExt as ipc_ext setup.py
  participant Tile as tilelang target
  Dev->>CMake: cmake --build
  CMake->>AllocExt: ${Python_EXECUTABLE} setup.py build_ext --inplace (WORKDIR ALLOC_CUDA_EXT_DIR)
  CMake->>IPCExt: ${Python_EXECUTABLE} setup.py build_ext --inplace (WORKDIR IPC_EXT_DIR)
  Note over AllocExt,IPCExt: in-place build of PyTorch extensions
  AllocExt-->>CMake: success
  IPCExt-->>CMake: success
  CMake->>Tile: build tilelang after deps
```

```mermaid
sequenceDiagram
  autonumber
  actor User
  participant Sh as launch.sh
  participant Py as Python
  participant TorchRun as torch.distributed.run
  participant App as user script
  User->>Sh: GPUS=N ./launch.sh python app.py
  Sh->>Sh: export TILELANG_USE_NVSHMEM=1
  Sh->>Py: ${PYTHON_EXEC} -m torch.distributed.run ...
  Py->>TorchRun: start launcher
  TorchRun->>App: spawn ranks
```

```mermaid
sequenceDiagram
  autonumber
  participant Env as env.py
  participant Wrapper as TLCUDASourceWrapper
  participant Libgen as libgen
  participant Cython as cython_wrapper
  participant NV as NVSHMEM
  Env-->>Wrapper: provide USE_DISTRIBUTED and USE_NVSHMEM
  alt use_distributed
    Wrapper->>Wrapper: append PREDEF_INIT_TABLE_FUNC (init_table)
  end
  alt USE_NVSHMEM
    Libgen->>NV: add NVSHMEM include/link flags
    Cython->>NV: create nvshmem tensor at runtime
  else
    Cython->>Torch: use torch.empty(...) fallback
  end
```
**Estimated code review effort:** 🎯 4 (Complex) | ⏱️ ~60 minutes
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run […]. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Actionable comments posted: 3
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
tilelang/distributed/pynvshmem/python/pynvshmem/__init__.py (1)
**20-28:** Fix CUDA error-string extraction (avoid printing raw tuples). cuda-python returns `(res, value)` for `cuGetErrorName`/`cuGetErrorString` in many versions; the current code prints the tuple object.
```diff
 def _CUDA_CHECK(err):
     if isinstance(err, cuda.CUresult):
         if err != cuda.CUresult.CUDA_SUCCESS:
-            raise RuntimeError(f"Cuda Error: {err}: {cuda.cuGetErrorName(err)}")
+            _res, name = cuda.cuGetErrorName(err)
+            _res2, desc = cuda.cuGetErrorString(err)
+            name = name.decode() if isinstance(name, (bytes, bytearray)) else name
+            desc = desc.decode() if isinstance(desc, (bytes, bytearray)) else desc
+            raise RuntimeError(f"CUDA Driver Error {int(err)} ({name}): {desc}")
     elif isinstance(err, cudart.cudaError_t):
         if err != cudart.cudaError_t.cudaSuccess:
-            raise RuntimeError(f"Cuda Error: {err}: {cudart.cudaGetErrorString(err)}")
+            msg = cudart.cudaGetErrorString(err)
+            if isinstance(msg, tuple):  # some bindings return (res, msg)
+                msg = msg[1]
+            msg = msg.decode() if isinstance(msg, (bytes, bytearray)) else msg
+            raise RuntimeError(f"CUDA Runtime Error {int(err)}: {msg}")
     else:
         raise RuntimeError(f"Unknown error type: {err}")
```
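As context for the fix, a standalone sketch of the tuple-return convention (assuming a recent cuda-python with the `cuda.bindings` layout; `describe_cuda_error` is an illustrative helper, not part of the PR):

```python
from cuda.bindings import driver as cuda  # assumes cuda-python >= 12.x layout

def describe_cuda_error(err) -> str:
    # cuGetErrorName/cuGetErrorString return (CUresult, bytes), not a bare string,
    # which is why naive f-string interpolation prints a raw tuple.
    _, name = cuda.cuGetErrorName(err)
    _, desc = cuda.cuGetErrorString(err)
    name = name.decode() if isinstance(name, (bytes, bytearray)) else name
    desc = desc.decode() if isinstance(desc, (bytes, bytearray)) else desc
    return f"{name}: {desc}"
```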
tilelang/distributed/utils.py (1)

**231-239:** Fix CUDA error-string extraction (avoid printing raw tuples). Same issue as in `pynvshmem._CUDA_CHECK`.
```diff
 def CUDA_CHECK(err):
     if isinstance(err, cuda.CUresult):
         if err != cuda.CUresult.CUDA_SUCCESS:
-            raise RuntimeError(f"Cuda Error: {err}: {cuda.cuGetErrorName(err)}")
+            _res, name = cuda.cuGetErrorName(err)
+            _res2, desc = cuda.cuGetErrorString(err)
+            name = name.decode() if isinstance(name, (bytes, bytearray)) else name
+            desc = desc.decode() if isinstance(desc, (bytes, bytearray)) else desc
+            raise RuntimeError(f"CUDA Driver Error {int(err)} ({name}): {desc}")
     elif isinstance(err, cudart.cudaError_t):
         if err != cudart.cudaError_t.cudaSuccess:
-            raise RuntimeError(f"Cuda Error: {err}: {cudart.cudaGetErrorString(err)}")
+            msg = cudart.cudaGetErrorString(err)
+            if isinstance(msg, tuple):
+                msg = msg[1]
+            msg = msg.decode() if isinstance(msg, (bytes, bytearray)) else msg
+            raise RuntimeError(f"CUDA Runtime Error {int(err)}: {msg}")
     else:
         raise RuntimeError(f"Unknown error type: {err}")
```
tilelang/env.py (1)

**212-226:** Gate NVSHMEM paths by `USE_NVSHMEM` (not `USE_DISTRIBUTED`). Currently NVSHMEM include/lib paths won't be set when only `TILELANG_USE_NVSHMEM=1` (and `TILELANG_USE_DISTRIBUTED=0`), breaking NVSHMEM builds/runs.
```diff
-    USE_DISTRIBUTED = EnvVar("TILELANG_USE_DISTRIBUTED", "0").get().lower() in ("1", "true", "on")
-    USE_NVSHMEM = EnvVar("TILELANG_USE_NVSHMEM", "0").get().lower() in ("1", "true", "on")
-    if USE_DISTRIBUTED:
+    USE_DISTRIBUTED = EnvVar("TILELANG_USE_DISTRIBUTED", "0").get().lower() in ("1", "true", "on")
+    USE_NVSHMEM = EnvVar("TILELANG_USE_NVSHMEM", "0").get().lower() in ("1", "true", "on")
+    if USE_NVSHMEM:
         if EnvVar("NVSHMEM_SRC", None).get() is not None:
             NVSHMEM_SRC = EnvVar("NVSHMEM_SRC", None).get()
         else:
             NVSHMEM_SRC = os.path.join(
                 os.path.dirname(os.path.abspath(__file__)), "..", "3rdparty", "nvshmem_src")
         NVSHMEM_INCLUDE_DIR: str = NVSHMEM_SRC + "/build/src/include"
         NVSHMEM_LIB_PATH: str = NVSHMEM_SRC + "/build/src/lib"
     else:
         NVSHMEM_INCLUDE_DIR = None
         NVSHMEM_LIB_PATH = None
```
tilelang/profiler/__init__.py (1)

**158-166:** Use `USE_NVSHMEM` consistently for distributed/NVSHMEM paths. These branches still use `USE_DISTRIBUTED`; switch to `USE_NVSHMEM` to match the new gating.
```diff
-        if env.USE_DISTRIBUTED:
+        if env.USE_NVSHMEM:
@@
-        if env.USE_DISTRIBUTED:
+        if env.USE_NVSHMEM:
@@
-        if env.USE_DISTRIBUTED:  # noqa: SIM108
+        if env.USE_NVSHMEM:  # noqa: SIM108
@@
-        if env.USE_DISTRIBUTED:
+        if env.USE_NVSHMEM:
@@
-        if env.USE_DISTRIBUTED:
+        if env.USE_NVSHMEM:
```

Also applies to: 246-251, 262-271, 310-320, 330-337
tilelang/jit/adapter/libgen.py (1)
**138-147:** Scope NVSHMEM flags to CUDA only and fix link-flag splitting. Applying NVSHMEM flags unconditionally breaks HIP/CPU builds when `TILELANG_USE_NVSHMEM=1`. Also, passing both `-lnvshmem_*` flags in a single list element prevents proper arg splitting.
Apply:
```diff
-        if env.USE_NVSHMEM:
+        if is_cuda_target(target) and env.USE_NVSHMEM:
             assert env.NVSHMEM_INCLUDE_DIR is not None, "env.NVSHMEM_INCLUDE_DIR is not set"
             assert env.NVSHMEM_LIB_PATH is not None, "env.NVSHMEM_LIB_PATH is not set"
-            command += ["-diag-suppress=20013"]
+            command += ["-diag-suppress=20013"]
             if not disable_rdc:
                 command += ["-rdc=true"]
             command += [
-                "-I" + env.NVSHMEM_INCLUDE_DIR, "-L" + env.NVSHMEM_LIB_PATH,
-                "-lnvshmem_host -lnvshmem_device"
+                "-I" + env.NVSHMEM_INCLUDE_DIR,
+                "-L" + env.NVSHMEM_LIB_PATH,
+                "-lnvshmem_host",
+                "-lnvshmem_device",
             ]
```
tilelang/distributed/launch.sh (1)

**36-49:** Build the command with an array and exec; preserve `"$@"` quoting and avoid fragile strings. The current string-based CMD drops original argument boundaries and can mis-handle spaces. Use an argv array and exec.
```diff
-PYTHON_EXEC="$(which python)"
-CMD="${PYTHON_EXEC} -m torch.distributed.run \
-    --node_rank=${node_rank} \
-    --nproc_per_node=${nproc_per_node} \
-    --nnodes=${nnodes} \
-    ${TILELANG_EXTRA_TORCHRUN_ARGS} ${additional_args} $@"
-
-if [ ${memcheck} -eq 1 ]; then
-  CMD="compute-sanitizer --tool memcheck ${CMD}"
-fi
-
-echo ${CMD}
-${CMD}
+PYTHON_EXEC="$(which python)"
+# Build argv
+args=( -m torch.distributed.run
+  --node_rank="${node_rank}"
+  --nproc_per_node="${nproc_per_node}"
+  --nnodes="${nnodes}"
+)
+# shell expansion for optional args (best-effort)
+args+=( ${TILELANG_EXTRA_TORCHRUN_ARGS} ${additional_args} )
+
+set -x
+if [ "${memcheck}" -eq 1 ]; then
+  exec compute-sanitizer --tool memcheck "${PYTHON_EXEC}" "${args[@]}" "$@"
+else
+  exec "${PYTHON_EXEC}" "${args[@]}" "$@"
+fi
```
tilelang/jit/adapter/wrapper.py (1)

**44-64:** `PREDEF_INIT_TABLE_FUNC`: remove always-true checks and the `std::` namespace; align with existing error handling.

- `if (error_buf)` is always true for a static array; drop it to avoid `-Waddress`.
- Use `snprintf` (as elsewhere) to avoid depending on `<cstdio>`/`std::snprintf`.
- Keep style consistent with other predefs.
Apply:
```diff
-PREDEF_INIT_TABLE_FUNC = """
-extern "C" int init_table(const void* host_table, size_t n) {{
-  if (error_buf) error_buf[0] = '\\0';
-
-  if (host_table == nullptr) {{
-    if (error_buf) std::snprintf(error_buf, 256, "host_table is null");
-    return -1;
-  }}
-  if (n == 0) {{
-    return 0;
-  }}
-
-  size_t bytes = n * sizeof(uint64_t);
-  cudaError_t err = cudaMemcpyToSymbol(meta_data, host_table, bytes, 0, cudaMemcpyHostToDevice);
-  if (err != cudaSuccess) {{
-    if (error_buf) std::snprintf(error_buf, 256, "cudaMemcpyToSymbol failed: %s", cudaGetErrorString(err));
-    return static_cast<int>(err);
-  }}
-  return 0;
-}}
-"""
+PREDEF_INIT_TABLE_FUNC = """
+extern "C" int init_table(const void* host_table, size_t n) {
+  error_buf[0] = '\\0';
+  if (host_table == nullptr) {
+    snprintf(error_buf, 256, "host_table is null");
+    return -1;
+  }
+  if (n == 0) {
+    return 0;
+  }
+  size_t bytes = n * sizeof(uint64_t);
+  cudaError_t err = cudaMemcpyToSymbol(meta_data, host_table, bytes, 0, cudaMemcpyHostToDevice);
+  if (err != cudaSuccess) {
+    snprintf(error_buf, 256, "cudaMemcpyToSymbol failed: %s", cudaGetErrorString(err));
+    return static_cast<int>(err);
+  }
+  return 0;
+}
+"""
```
🧹 Nitpick comments (9)
tilelang/distributed/pynvshmem/python/pynvshmem/__init__.py (2)
**4-5:** Make CUDA imports backward-compatible. cuda-python exposes both top-level (`cuda`, `cudart`) and `bindings.*` layouts across versions. Add a fallback to avoid import errors on older installs.
```diff
-from cuda.bindings import driver as cuda
-from cuda.bindings import runtime as cudart
+try:
+    from cuda.bindings import driver as cuda
+    from cuda.bindings import runtime as cudart
+except ImportError:
+    # Fallback for older cuda-python versions
+    from cuda import cuda, cudart
```
**60-79:** Guard against host tensors for stream writes. `cuStreamWriteValue{32,64}` requires a device pointer; add a fast check.
```diff
-    assert isinstance(tensor, torch.Tensor) and tensor.dtype in (torch.int32, torch.uint32), \
+    assert isinstance(tensor, torch.Tensor) and tensor.dtype in (torch.int32, torch.uint32), \
         f"tensor must be a torch.Tensor with 32-bit dtype, but got {tensor.dtype}"
+    assert tensor.is_cuda, "tensor must reside on a CUDA device"
 ...
-    assert isinstance(tensor, torch.Tensor) and tensor.dtype in (torch.int64, torch.uint64), \
+    assert isinstance(tensor, torch.Tensor) and tensor.dtype in (torch.int64, torch.uint64), \
         f"tensor must be a torch.Tensor with 64-bit dtype, but got {tensor.dtype}"
+    assert tensor.is_cuda, "tensor must reside on a CUDA device"
```

Also applies to: 82-101
tilelang/distributed/utils.py (2)
**8-9:** Make CUDA imports backward-compatible. Mirror the pynvshmem import fallback to cover older cuda-python versions.
```diff
-from cuda.bindings import driver as cuda
-from cuda.bindings import runtime as cudart
+try:
+    from cuda.bindings import driver as cuda
+    from cuda.bindings import runtime as cudart
+except ImportError:
+    from cuda import cuda, cudart
```
**11-11:** Default NVSHMEM init to `env.USE_NVSHMEM`. Keep behavior aligned with the new flag without forcing callers to pass it.
```diff
+from tilelang import env
 ...
-def init_distributed(return_tp_group=False, init_nvshmem=True):
+def init_distributed(return_tp_group=False, init_nvshmem: bool | None = None):
 @@
-    if init_nvshmem:
+    if init_nvshmem is None:
+        init_nvshmem = env.USE_NVSHMEM
+    if init_nvshmem:
         import pynvshmem
         pynvshmem.init_nvshmem_by_uniqueid(TP_GROUP)
```

Also applies to: 49-68
tilelang/utils/allocator.py (1)
**179-182:** Shorten error message and clamp available bytes (minor). Keeps logs tight and avoids negative "available" values on overflow.
```diff
-        if current_offset + bytes_alloc > self.size:
-            bytes_available = self.size - current_offset
-            raise MemoryError(f"Allocation failed: Requesting {bytes_alloc} bytes, but only "
-                              f"{bytes_available} bytes are available in the pre-allocated buffer "
-                              f"(total size: {self.size} bytes).")
+        if current_offset + bytes_alloc > self.size:
+            bytes_available = max(0, self.size - current_offset)
+            raise MemoryError(
+                f"Out of pre-allocated memory: need {bytes_alloc}B, have {bytes_available}B of {self.size}B."
+            )
```
tilelang/jit/adapter/libgen.py (1)

**139-140:** Prefer explicit exceptions over asserts for env validation. Asserts can be skipped with Python `-O`. Raise a RuntimeError/ValueError with actionable guidance.
```diff
-        assert env.NVSHMEM_INCLUDE_DIR is not None, "env.NVSHMEM_INCLUDE_DIR is not set"
-        assert env.NVSHMEM_LIB_PATH is not None, "env.NVSHMEM_LIB_PATH is not set"
+        if not env.NVSHMEM_INCLUDE_DIR:
+            raise RuntimeError("NVSHMEM enabled but NVSHMEM_INCLUDE_DIR is not set")
+        if not env.NVSHMEM_LIB_PATH:
+            raise RuntimeError("NVSHMEM enabled but NVSHMEM_LIB_PATH is not set")
```
tilelang/distributed/launch.sh (1)

**4-4:** Don't force NVSHMEM; default only if undefined. Let users override `TILELANG_USE_NVSHMEM` from the environment.
```diff
-export TILELANG_USE_NVSHMEM=1  # enable TileLang distributed mode
+export TILELANG_USE_NVSHMEM=${TILELANG_USE_NVSHMEM:=1}  # enable NVSHMEM by default; allow override
```
docs/get_started/run_example.md (2)

**17-21:** Add a multi-node usage note (NODES, NODE_RANK, MASTER_ADDR/PORT). A short snippet will help users scale beyond a single node.
````diff
 You can change GPUS to the number of local GPUs you want to use. The launcher will set the required environment variables and invoke `torch.distributed.run`.
+For multi-node runs, set:
+
+```bash
+NODES=<num_nodes> NODE_RANK=<this_node_rank> MASTER_ADDR=<host_of_rank0> \
+GPUS=<gpus_per_node> ./tilelang/distributed/launch.sh <your_script.py> [args...]
+```
````
**1-21:** Call out NVSHMEM prerequisites explicitly. Briefly state that NVSHMEM must be installed and `NVSHMEM_INCLUDE_DIR`/`NVSHMEM_LIB_PATH` set (or provided by default install paths).
```diff
 ## Examples using NVSHMEM APIs
 Use the provided launcher `tilelang/distributed/launch.sh` to start programs that use the NVSHMEM API. For example, to run with 2 GPUs:
+Prerequisites: NVSHMEM installed on the system and environment variables `NVSHMEM_INCLUDE_DIR` and `NVSHMEM_LIB_PATH` set if not in default locations.
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (12)
- CMakeLists.txt (1 hunks)
- docs/get_started/Installation.md (5 hunks)
- docs/get_started/run_example.md (1 hunks)
- tilelang/distributed/launch.sh (2 hunks)
- tilelang/distributed/pynvshmem/python/pynvshmem/__init__.py (1 hunks)
- tilelang/distributed/utils.py (1 hunks)
- tilelang/env.py (1 hunks)
- tilelang/jit/adapter/cython/cython_wrapper.pyx (2 hunks)
- tilelang/jit/adapter/libgen.py (1 hunks)
- tilelang/jit/adapter/wrapper.py (3 hunks)
- tilelang/profiler/__init__.py (1 hunks)
- tilelang/utils/allocator.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/jit/adapter/wrapper.py (1)
src/target/codegen_cuda.h (1)
`use_distributed` (23-29)
🪛 Ruff (0.12.2)
tilelang/utils/allocator.py
180-182: Avoid specifying long messages outside the exception class
(TRY003)
🪛 LanguageTool
docs/get_started/Installation.md
[grammar] ~117-~117: There might be a mistake here.
Context: ...ted/example_allgather.py)), you need to build NVSHMEM library for device-side code ge...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ... install the pynvshmem package, which provides wrapped host-side Python API for NVSHME...
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
🔇 Additional comments (6)
tilelang/utils/allocator.py (1)
**177-183:** Good pre-allocation bounds check. Prevents buffer overruns and yields a clear failure mode.
tilelang/jit/adapter/libgen.py (2)
**141-141:** Verify tool routing for diagnostic suppression. Elsewhere you use `-Xcudafe --diag_suppress=…`; here it's `-diag-suppress=20013` alone. Confirm nvcc actually honors this form; otherwise switch to the consistent form.
Option if needed:
```diff
-            command += ["-diag-suppress=20013"]
+            command += ["-Xcudafe", "--diag_suppress=20013"]
```
**142-143:** NVSHMEM typically requires RDC; consider forcing or failing fast. If NVSHMEM device libs are used, `-rdc=true` is usually required. Either force it under `USE_NVSHMEM` or raise a clear error when `TL_DISABLE_RDC` is set.
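A minimal sketch of the fail-fast option, assuming the surrounding `libgen.py` names (`env`, `disable_rdc`, `command`) from the diffs above; this is illustrative, not the PR's code:

```python
# Fail fast when NVSHMEM is enabled but RDC has been explicitly disabled.
if env.USE_NVSHMEM:
    if disable_rdc:
        raise RuntimeError(
            "NVSHMEM device libraries require relocatable device code; "
            "unset TL_DISABLE_RDC or disable TILELANG_USE_NVSHMEM.")
    command += ["-rdc=true"]  # force RDC for nvshmem_device linkage
```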
CMakeLists.txt (1)
225-247: Guard PyTorch extension targets behind USE_CUDA and enable incremental-style builds
- Wrap `build_alloc_cuda_ext` and `build_ipc_ext` in `if(USE_CUDA)…endif()` instead of `ALL`, and add `USES_TERMINAL` so they only run when CUDA is enabled and display output. Manually verify that `USE_CUDA` is declared (e.g. via `option(USE_CUDA "Enable CUDA extensions" ON)` or in an included config) and defaults as expected.
- Longer-term, convert to `add_custom_command(OUTPUT <built-.so> BYPRODUCTS <.so> …)` and have the custom target depend on those outputs to get true incremental rebuilds.
- Replace the `execute_process(COMMAND which python…)` Python lookup with `find_package(Python3 COMPONENTS Interpreter REQUIRED)` and use `${Python3_EXECUTABLE}` for more robust, CMake-native discovery.

```diff
-if(…)
-  set(ALLOC_CUDA_EXT_DIR ${PROJECT_SOURCE_DIR}/tilelang/utils/cpp)
-  add_custom_target(
-    build_alloc_cuda_ext ALL
-    COMMAND ${Python_EXECUTABLE} setup.py build_ext --inplace
-    WORKING_DIRECTORY ${ALLOC_CUDA_EXT_DIR}
-    COMMENT "Building alloc_cuda PyTorch extension (in-place)"
-  )
-  add_dependencies(tilelang build_alloc_cuda_ext)
-
-  set(IPC_EXT_DIR ${PROJECT_SOURCE_DIR}/tilelang/distributed/common)
-  add_custom_target(
-    build_ipc_ext ALL
-    COMMAND ${Python_EXECUTABLE} setup.py build_ext --inplace
-    WORKING_DIRECTORY ${IPC_EXT_DIR}
-    COMMENT "Building ipc_ext PyTorch extension (in-place)"
-  )
-  add_dependencies(tilelang build_ipc_ext)
-endif()
+if(USE_CUDA)
+  set(ALLOC_CUDA_EXT_DIR ${PROJECT_SOURCE_DIR}/tilelang/utils/cpp)
+  add_custom_target(
+    build_alloc_cuda_ext
+    COMMAND ${Python_EXECUTABLE} setup.py build_ext --inplace
+    WORKING_DIRECTORY ${ALLOC_CUDA_EXT_DIR}
+    COMMENT "Building alloc_cuda PyTorch extension (in-place)"
+    USES_TERMINAL
+  )
+  add_dependencies(tilelang build_alloc_cuda_ext)
+
+  set(IPC_EXT_DIR ${PROJECT_SOURCE_DIR}/tilelang/distributed/common)
+  add_custom_target(
+    build_ipc_ext
+    COMMAND ${Python_EXECUTABLE} setup.py build_ext --inplace
+    WORKING_DIRECTORY ${IPC_EXT_DIR}
+    COMMENT "Building ipc_ext PyTorch extension (in-place)"
+    USES_TERMINAL
+  )
+  add_dependencies(tilelang build_ipc_ext)
+endif()
```
tilelang/jit/adapter/cython/cython_wrapper.pyx (1)

**191-202:** Convert dtype and device for tensor creation
- Map `self.param_dtypes[i]` (from `param.dtype` in `__cinit__`) to a `torch.dtype` via `map_torch_type` before calling `pynvshmem.nvshmem_create_tensor` or `torch.empty` (docs.nvidia.com); see the sketch after this list.
- Replace `torch.cuda.current_device()` (an int) with `torch.device("cuda", self.get_current_device())` to supply a valid `device` to `torch.empty` (docs.pytorch.org).
- Optionally normalize `self.param_dtypes` in `__cinit__`: `self.param_dtypes = [map_torch_type(param.dtype) for param in params]`.
- Confirm the signature of `pynvshmem.nvshmem_create_tensor(shape, dtype)` and whether it requires a `device` parameter.
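A minimal sketch of the suggested normalization, assuming a simple string-keyed dtype map (`map_torch_type` and the mapping entries here are illustrative stand-ins for the project's helper):

```python
import torch

# Illustrative stand-in for the project's dtype-mapping helper.
_DTYPE_MAP = {"float16": torch.float16, "float32": torch.float32, "int32": torch.int32}

def map_torch_type(dtype_str: str) -> torch.dtype:
    return _DTYPE_MAP[dtype_str]

def make_param_tensor(shape, dtype_str, device_index):
    # torch.empty expects a torch.dtype and a torch.device, not a raw string/int.
    dtype = map_torch_type(dtype_str)
    device = torch.device("cuda", device_index)
    return torch.empty(*shape, dtype=dtype, device=device)
```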
tilelang/jit/adapter/wrapper.py (1)

**552-554:** Gate `init_table` injection on availability of `meta_data`.

The generated `init_table` code (wrapper.py:552-554) unconditionally calls `cudaMemcpyToSymbol(meta_data, …)` in distributed mode, but there's no check that `meta_data` is actually defined in every CUDA unit. Either:

- Add a codegen flag (e.g. `host_mod.has_meta_data_symbol`) to inject `PREDEF_INIT_TABLE_FUNC` only when `meta_data` is present
- Or emit the `extern __device__ __constant__ uint64_t meta_data[…]` declaration/definition in the generated module only when used

Verify that all distributed-mode CUDA units define `meta_data`, or implement one of these guards.
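A minimal sketch of the first option, assuming the wrapper assembles the generated host source as a string (`PREDEF_INIT_TABLE_FUNC` stands in for the C snippet from the PR; the gating condition is illustrative):

```python
# Shortened placeholder for the C snippet added by the PR.
PREDEF_INIT_TABLE_FUNC = '\nextern "C" int init_table(const void* host_table, size_t n) { /* ... */ }\n'

def maybe_append_init_table(host_code: str, use_distributed: bool) -> str:
    # Inject init_table only when the generated source actually defines meta_data.
    if use_distributed and "meta_data" in host_code:
        return host_code + PREDEF_INIT_TABLE_FUNC
    return host_code
```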
## To use NVSHMEM APIs

Before running the examples using NVSHMEM APIs (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to build NVSHMEM library for device-side code generation.

```bash
export NVSHMEM_SRC="your_custom_nvshmem_dir" # default to 3rdparty/nvshmem_src
cd tilelang/distributed
source build_nvshmem.sh
```

You also need to install the `pynvshmem` package, which provides wrapped host-side Python API for NVSHMEM.

```bash
pip install tilelang -f https://tile-ai.github.io/whl/nightly/cu121/
# or pip install tilelang --find-links https://tile-ai.github.io/whl/nightly/cu121/
cd ./pynvshmem
python setup.py install
export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/src/lib:$LD_LIBRARY_PATH"
```

> **Note:** Nightly builds contain the most recent code changes but may be less stable than official releases. They're ideal for testing new features or if you need a specific bugfix that hasn't been released yet.

Then you can test python import:

```bash
python -c "import pynvshmem"
```
💡 Verification agent
🧩 Analysis chain
NVSHMEM setup: fix path/install steps and tighten wording
- Paths/commands likely break as written:
- `cd ./pynvshmem` is ambiguous from `tilelang/distributed`; the Python package appears to live under `tilelang/distributed/pynvshmem/python`. Prefer an editable install via pip.
- The default `LD_LIBRARY_PATH` subdir is frequently `build/lib` (not `build/src/lib`) in NVSHMEM's CMake builds; please verify your build layout.
- Prefer `bash build_nvshmem.sh` unless the script exports env vars intentionally.
 
- Wording nits flagged by LanguageTool.
Apply:
````diff
-## To use NVSHMEM APIs
+## To use NVSHMEM APIs

-Before running the examples using NVSHMEM APIs (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to build NVSHMEM library for device-side code generation.
+Before running the examples using NVSHMEM APIs (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to build the NVSHMEM library for device-side code generation.

-```bash
-export NVSHMEM_SRC="your_custom_nvshmem_dir" # default to 3rdparty/nvshmem_src
-cd tilelang/distributed
-source build_nvshmem.sh
-```
+```bash
+# default to 3rdparty/nvshmem_src if not provided
+export NVSHMEM_SRC="${NVSHMEM_SRC:-3rdparty/nvshmem_src}"
+cd tilelang/distributed
+bash build_nvshmem.sh
+```

-You also need to install the `pynvshmem` package, which provides wrapped host-side Python API for NVSHMEM.
+You also need to install the `pynvshmem` package, which provides a host-side Python API wrapper for NVSHMEM.

 ```bash
-cd ./pynvshmem
-python setup.py install
-export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/src/lib:$LD_LIBRARY_PATH"
+cd pynvshmem/python
+python -m pip install -e .
+# Adjust the lib dir as produced by your NVSHMEM build:
+export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/lib:$LD_LIBRARY_PATH"
 ```

-Then you can test python import:
+Then test the Python import:

 ```bash
 python -c "import pynvshmem"
 ```
````
Would you confirm:
- the exact location of the Python package (does `tilelang/distributed/pynvshmem/python/setup.py` exist?), and
- the NVSHMEM build’s library output dir (`build/lib` vs `build/src/lib`) on your CI environment?
---
🏁 Script executed:
```shell
#!/usr/bin/env bash
# Locate the pynvshmem package directory and its setup.py
fd --max-depth 4 -t d pynvshmem
fd --max-depth 5 -t f setup.py | grep -i pynvshmem
```

Length of output: 158
docs/get_started/Installation.md: correct NVSHMEM paths and install steps
- Default `NVSHMEM_SRC` to `${NVSHMEM_SRC:-3rdparty/nvshmem_src}`
- Run the build script with `bash` in `tilelang/distributed`
- Install the Python API from `tilelang/distributed/pynvshmem` via editable pip
- Point `LD_LIBRARY_PATH` at the actual NVSHMEM build output (`build/lib`)
````diff
 ## To use NVSHMEM APIs
@@ -118,9 +118,12 @@
-export NVSHMEM_SRC="your_custom_nvshmem_dir" # default to 3rdparty/nvshmem_src
+export NVSHMEM_SRC="${NVSHMEM_SRC:-3rdparty/nvshmem_src}"
 cd tilelang/distributed
-source build_nvshmem.sh
+bash build_nvshmem.sh
 You also need to install the `pynvshmem` package…

-```bash
- cd ./pynvshmem
- python setup.py install
- export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/src/lib:$LD_LIBRARY_PATH"
-```
+```bash
+ cd pynvshmem
+ python -m pip install -e .
+ # Adjust according to your NVSHMEM build output:
+ export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/lib:$LD_LIBRARY_PATH"
+```
 Then test the import:
 ```bash
 python -c "import pynvshmem"
 ```
````

🧰 Tools
🪛 LanguageTool
[grammar] ~117-~117: There might be a mistake here.
Context: ...ted/example_allgather.py)), you need to build NVSHMEM library for device-side code ge...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ... install the pynvshmem package, which provides wrapped host-side Python API for NVSHME...
(QB_NEW_EN)
🤖 Prompt for AI Agents
In docs/get_started/Installation.md around lines 115 to 135, update the NVSHMEM
instructions: default NVSHMEM_SRC to ${NVSHMEM_SRC:-3rdparty/nvshmem_src}, run
the build script with bash from tilelang/distributed (bash build_nvshmem.sh),
install the Python API using an editable pip install from
tilelang/distributed/pynvshmem (python -m pip install -e .) instead of python
setup.py install, and set LD_LIBRARY_PATH to the actual NVSHMEM runtime output
directory (use $NVSHMEM_SRC/build/lib) so the example import works.
```python
if env.USE_NVSHMEM:
    import pynvshmem
```
🛠️ Refactor suggestion
Import pynvshmem defensively to surface actionable errors
If NVSHMEM is enabled but pynvshmem is missing, the import will crash at module import time without context. Wrap in try/except.
```diff
-if env.USE_NVSHMEM:
-    import pynvshmem
+if env.USE_NVSHMEM:
+    try:
+        import pynvshmem
+    except Exception as e:
+        raise ImportError("env.USE_NVSHMEM=1 but pynvshmem is not importable. "
+                          "Ensure you installed tilelang/distributed/pynvshmem and NVSHMEM libs are on LD_LIBRARY_PATH.") from e
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if env.USE_NVSHMEM:
    try:
        import pynvshmem
    except Exception as e:
        raise ImportError("env.USE_NVSHMEM=1 but pynvshmem is not importable. "
                          "Ensure you installed tilelang/distributed/pynvshmem and NVSHMEM libs are on LD_LIBRARY_PATH.") from e
```
🤖 Prompt for AI Agents
In tilelang/jit/adapter/cython/cython_wrapper.pyx around lines 11-13, the
unconditional import of pynvshmem will cause an opaque crash at module import
time if env.USE_NVSHMEM is true but the package is not installed; wrap the
import in a try/except ImportError block, and on failure raise a clear error
(RuntimeError or re-raised ImportError) that explains NVSHMEM is enabled, names
the missing package (pynvshmem), and instructs the user to install it or disable
USE_NVSHMEM so the failure is actionable.
Actionable comments posted: 0
♻️ Duplicate comments (2)
docs/get_started/Installation.md (2)
**119-124:** Fix NVSHMEM setup commands: avoid `pip install mpich`, default `NVSHMEM_SRC`, call the build script with `bash`.
- mpich is a system package; installing via pip is incorrect and will confuse users.
- Default NVSHMEM_SRC via parameter expansion.
- Prefer `bash build_nvshmem.sh` over sourcing unless the script sets env that must persist.

````diff
-```bash
-pip install mpich
-export NVSHMEM_SRC="your_custom_nvshmem_dir" # default to 3rdparty/nvshmem_src
-cd tilelang/distributed
-source build_nvshmem.sh
-```
+```bash
+# Install an MPI implementation if your launcher requires it (choose ONE):
+# Debian/Ubuntu: sudo apt-get update && sudo apt-get install -y mpich
+# Conda: conda install -c conda-forge mpich
+export NVSHMEM_SRC="${NVSHMEM_SRC:-3rdparty/nvshmem_src}"
+cd tilelang/distributed
+bash build_nvshmem.sh
+```
````

Run this script to verify the build script path and NVSHMEM lib dir guidance in the repo:

```shell
#!/usr/bin/env bash
set -euo pipefail
echo "Build script:"; fd -n '^build_nvshmem\.sh$' -a -H -I tilelang | sed 's/^/  /' || true
echo "pynvshmem dirs:"; fd -t d -n '^pynvshmem$' -a tilelang | sed 's/^/  /' || true
```
128-131: Install path and runtime library path are likely wrong; switch to editable pip and point LD_LIBRARY_PATH to build/lib.
- Prefer `python -m pip install -e .` over `setup.py`.
- The NVSHMEM runtime libs typically land under `build/lib`, not `build/src/lib`.
- Some repos place setup.py under `pynvshmem/` or `pynvshmem/python/`. Detect and cd accordingly.

````diff
-```bash
-cd ./pynvshmem
-python setup.py install
-export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/src/lib:$LD_LIBRARY_PATH"
-```
+```bash
+# From tilelang/distributed, pick the correct package root:
+if [ -f pynvshmem/setup.py ]; then
+  cd pynvshmem
+elif [ -f pynvshmem/python/setup.py ]; then
+  cd pynvshmem/python
+fi
+python -m pip install -e .
+# Adjust to your NVSHMEM build output:
+export LD_LIBRARY_PATH="${NVSHMEM_SRC}/build/lib:${LD_LIBRARY_PATH}"
+```
````
Quick check script to confirm the correct package path and pick the right LD_LIBRARY_PATH hint:
```shell
#!/usr/bin/env bash
set -euo pipefail
echo "Looking for pynvshmem setup.py candidates..."
fd -n -a -t f '^setup\.py$' tilelang | grep -i pynvshmem -n || true
echo "Probe NVSHMEM lib candidates under \$NVSHMEM_SRC:"
for d in build/lib build/src/lib lib; do
  [ -d "${NVSHMEM_SRC:-3rdparty/nvshmem_src}/$d" ] && echo "  found: $NVSHMEM_SRC/$d"
done
```
🧹 Nitpick comments (4)
docs/get_started/Installation.md (4)
**27-28:** Repo clone instructions look correct; optional speed-up. Commands are correct; optionally add `--depth 1` for faster clones in CI/docs.
```diff
-git clone --recursive https://github.com/tile-ai/tilescale
+git clone --recursive --depth 1 https://github.com/tile-ai/tilescale
```

Also applies to: 63-64, 103-104
**115-117:** Grammar/clarity: add an article before "NVSHMEM library". Small English fix.
```diff
-Before running the examples using NVSHMEM APIs (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to build NVSHMEM library for device-side code generation.
+Before running the examples using NVSHMEM APIs (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to build the NVSHMEM library for device-side code generation.
```
**125-125:** Wording: tighten the description of pynvshmem. Minor English fix.
```diff
-You also need to install the `pynvshmem` package, which provides wrapped host-side Python API for NVSHMEM.
+You also need to install the `pynvshmem` package, which provides a host-side Python API wrapper for NVSHMEM.
```
**133-136:** Grammar: streamline the test instruction.

```diff
-Then you can test python import:
+Then test the Python import:
```
📜 Review details
📒 Files selected for processing (1)
- docs/get_started/Installation.md (5 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/get_started/Installation.md
[grammar] ~117-~117: There might be a mistake here.
Context: ...ted/example_allgather.py)), you need to build NVSHMEM library for device-side code ge...
(QB_NEW_EN)
[grammar] ~125-~125: There might be a mistake here.
Context: ... install the pynvshmem package, which provides wrapped host-side Python API for NVSHME...
(QB_NEW_EN)
🔇 Additional comments (1)
docs/get_started/Installation.md (1)
**12-12:** Section intro reads well. No issues; clear structure for the three install methods.
Actionable comments posted: 0
♻️ Duplicate comments (4)
docs/get_started/Installation.md (4)
**125-125:** Wording: clarify the pynvshmem description.

```diff
-You also need to install the `pynvshmem` package, which provides wrapped host-side Python API for NVSHMEM.
+You also need to install the `pynvshmem` package, which provides a host-side Python API wrapper for NVSHMEM.
```
**133-136:** Tighten phrasing of the import test.

```diff
-Then you can test python import:
+Then test the Python import:
```
115-124: NVSHMEM section: command correctness and defaults.
- Add article “the” (“build the NVSHMEM library”).
- Default NVSHMEM_SRC with parameter expansion.
- Prefer `bash build_nvshmem.sh` over sourcing unless the script exports env vars.

```diff
-## To use NVSHMEM APIs
+## To use NVSHMEM APIs
@@
-Before running the examples using NVSHMEM APIs (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to build NVSHMEM library for device-side code generation.
+Before running examples with NVSHMEM (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to build the NVSHMEM library for device-side code generation.
@@
-pip install mpich
-export NVSHMEM_SRC="your_custom_nvshmem_dir" # default to 3rdparty/nvshmem_src
-cd tilelang/distributed
-source build_nvshmem.sh
+pip install mpich
+# Default to 3rdparty/nvshmem_src if not provided:
+export NVSHMEM_SRC="${NVSHMEM_SRC:-3rdparty/nvshmem_src}"
+cd tilelang/distributed
+bash build_nvshmem.sh
```
Note: if the script intentionally sets env vars, keep “source” and state why.
128-131: pynvshmem install path and NVSHMEM lib dir are likely wrong.
- The Python package usually lives under pynvshmem/python; prefer editable pip install.
- NVSHMEM's CMake output is typically `build/lib`, not `build/src/lib`.

```diff
-cd ./pynvshmem
-python setup.py install
-export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/src/lib:$LD_LIBRARY_PATH"
+cd pynvshmem/python
+python -m pip install -e .
+# Adjust to your NVSHMEM build output directory:
+export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/lib:$LD_LIBRARY_PATH"
```
If your build layout differs, document the exact lib path used in CI.
🧹 Nitpick comments (6)
docs/get_started/Installation.md (6)
9-11: Clarify supported toolchain versions (CUDA/Python/LLVM).
- Python 3.7 is EOL; recommend stating ≥3.8 or your tested matrix.
- “LLVM < 20 if you are using the bundled TVM submodule” needs an explicit tested version range.
- Specify minimal and tested CUDA versions (e.g., “min 12.1; tested 12.2/12.4”).
26-29: Repo clone: looks good; suggest shallow clone for speed (optional).
You can add `--depth 1` for faster CI/dev setup.

```diff
-git clone --recursive https://github.com/tile-ai/tilescale
+git clone --recursive --depth 1 https://github.com/tile-ai/tilescale
```
**63-65:** Duplicate clone block: mirror the shallow clone suggestion.

```diff
-git clone --recursive https://github.com/tile-ai/tilescale
+git clone --recursive --depth 1 https://github.com/tile-ai/tilescale
```
**78-82:** Config edits: avoid echoing commented examples into config. The current snippet appends only `USE_CUDA`; consider showing both options clearly and ensuring newline safety.
```diff
-# echo "set(USE_LLVM ON)" # set USE_LLVM to ON if using LLVM
-echo "set(USE_CUDA ON)" >> config.cmake
+# Enable desired backends:
+printf '%s\n' "set(USE_LLVM ON)" "set(USE_CUDA ON)" >> config.cmake
```
**103-105:** Mirror the shallow clone suggestion (optional).

```diff
-git clone --recursive https://github.com/tile-ai/tilescale
+git clone --recursive --depth 1 https://github.com/tile-ai/tilescale
```
**110-112:** Fix stray backticks in the command.

```diff
-bash install_cuda.sh
-# or bash `install_amd.sh` if you want to enable ROCm runtime
+bash install_cuda.sh
+# or:
+bash install_amd.sh  # enable ROCm runtime
```
📜 Review details
📒 Files selected for processing (1)
- docs/get_started/Installation.md (5 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/get_started/Installation.md
[grammar] ~9-~9: There might be a mistake here.
Context: ...on**: >= 3.7 - CUDA Version: >= 12.1 - LLVM: < 20 if you are using the bundle...
(QB_NEW_EN)
[grammar] ~117-~117: There might be a mistake here.
Context: ...ted/example_allgather.py)), you need to build NVSHMEM library for device-side code ge...
(QB_NEW_EN)
[grammar] ~125-~125: There might be a mistake here.
Context: ... install the pynvshmem package, which provides wrapped host-side Python API for NVSHME...
(QB_NEW_EN)
🔇 Additional comments (1)
docs/get_started/Installation.md (1)
**46-52:** Standardize the module name to "tilelang".

```diff
-   Update `PYTHONPATH` to include the `tile-lang` Python module:
+   Update `PYTHONPATH` to include the `tilelang` Python module:
```

Likely an incorrect or invalid review comment.
…r pull and push warp examples

- Modified `example_pull_warp.py` and `example_push_warp.py` to set a default value of 65536 for M if not provided in args.
- Added new test scripts `test_pull_warp.py` and `test_push_warp.py` to validate functionality using CUDA and multiprocessing.
Actionable comments posted: 0
🧹 Nitpick comments (4)
examples/distributed/example_pull_warp.py (1)
**40-40:** Make args handling robust when `args` lacks attribute `M`. If `args` is an empty Namespace (truthy) without `M`, this will raise AttributeError. Use `getattr` to cover None and missing attributes.
```diff
-    M = args.M if args else 65536
+    M = getattr(args, "M", 65536)
```

Optional: validate M > 0 to avoid empty tensors.
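A quick illustration of why `getattr` covers both failure modes (plain Python, no project-specific assumptions):

```python
import argparse

args = argparse.Namespace()       # truthy, but defines no attribute M
print(getattr(args, "M", 65536))  # -> 65536 (no AttributeError)

args = None                       # falsy: the old `args.M if args else 65536` also worked here
print(getattr(args, "M", 65536))  # -> 65536; getattr handles None with a default too
```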
examples/distributed/test_pull_warp.py (1)
**1-16:** Guard for multi-GPU availability to avoid spurious CI failures. The example asserts 2 ranks; skip when fewer than 2 CUDA devices are present.
```diff
 import tilelang.testing
 import torch
 import torch.multiprocessing
+import pytest
 import example_pull_warp

 @tilelang.testing.requires_cuda
 @tilelang.testing.requires_cuda_compute_version_ge(9, 0)
 def test_example_pull_warp():
+    if torch.cuda.device_count() < 2:
+        pytest.skip("requires >= 2 CUDA devices")
     torch.multiprocessing.spawn(example_pull_warp.main, args=(2, None), nprocs=2)
```
**40-40:** Harden defaulting of `M` to handle None or missing attributes. Same rationale as pull_warp: avoid AttributeError when `args` is truthy but lacks `M`.
```diff
-    M = args.M if args else 65536
+    M = getattr(args, "M", 65536)
```
**1-16:** Skip the test when fewer than 2 GPUs are available. Prevents failures on single-GPU runners while preserving coverage where applicable.
```diff
 import tilelang.testing
 import torch
 import torch.multiprocessing
+import pytest
 import example_push_warp

 @tilelang.testing.requires_cuda
 @tilelang.testing.requires_cuda_compute_version_ge(9, 0)
 def test_example_push_warp():
+    if torch.cuda.device_count() < 2:
+        pytest.skip("requires >= 2 CUDA devices")
     torch.multiprocessing.spawn(example_push_warp.main, args=(2, None), nprocs=2)
```
📜 Review details
📒 Files selected for processing (4)
- examples/distributed/example_pull_warp.py (1 hunks)
- examples/distributed/example_push_warp.py (1 hunks)
- examples/distributed/test_pull_warp.py (1 hunks)
- examples/distributed/test_push_warp.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/distributed/test_pull_warp.py (2)
tilelang/testing/__init__.py (1)
`requires_cuda_compute_version_ge` (95-96)

examples/distributed/example_pull_warp.py (2)

`main` (17-34)
`main` (39-78)
examples/distributed/test_push_warp.py (2)
tilelang/testing/__init__.py (1)
`requires_cuda_compute_version_ge` (95-96)

examples/distributed/example_push_warp.py (2)

`main` (17-34)
`main` (39-78)
- Moved NVSHMEM import and initialization inside a conditional block to ensure it only executes when `env.USE_NVSHMEM` is set to True, improving code clarity and preventing unnecessary imports.
Summary by CodeRabbit