# [Feat][Doc] Enhance TileLang with NVSHMEM support and build system improvements #18
## Conversation
…provements

- Added support for NVSHMEM in the environment configuration, enabling distributed memory operations.
- Introduced custom build targets for CUDA extensions in the CMake configuration.
- Updated installation documentation to reflect the new repository name and NVSHMEM usage.
- Created a new guide for running examples with and without NVSHMEM.
- Refactored various components to utilize NVSHMEM for distributed tensor management and memory allocation.
- Improved error handling in the allocator for better memory management.
Caution: Review failed. The pull request is closed.

**Walkthrough**

Adds CMake custom targets to build two PyTorch extensions in-place and wires them into the main build. Updates installation and distributed run docs, introduces `TILELANG_USE_NVSHMEM` and `USE_NVSHMEM`, refactors NVSHMEM gating across components, updates the distributed launcher, adjusts CUDA binding imports, adds a `host_table` init helper, adds an allocator overflow check, and adds distributed example tests and minor example defaults.
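For orientation, a minimal sketch of the boolean env-flag convention the walkthrough refers to (the parsing rule matches the `tilelang/env.py` diff quoted later in this review; the `env_flag` helper name is illustrative, not part of the PR):

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    # "1", "true", "on" (case-insensitive) all enable the feature.
    return os.environ.get(name, default).lower() in ("1", "true", "on")

USE_DISTRIBUTED = env_flag("TILELANG_USE_DISTRIBUTED")
USE_NVSHMEM = env_flag("TILELANG_USE_NVSHMEM")
```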
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
  autonumber
  actor Dev as Developer
  participant CMake as CMake
  participant AllocExt as alloc_cuda setup.py
  participant IPCExt as ipc_ext setup.py
  participant Tile as tilelang target
  Dev->>CMake: cmake --build
  CMake->>AllocExt: ${Python_EXECUTABLE} setup.py build_ext --inplace (WORKDIR ALLOC_CUDA_EXT_DIR)
  CMake->>IPCExt: ${Python_EXECUTABLE} setup.py build_ext --inplace (WORKDIR IPC_EXT_DIR)
  Note over AllocExt,IPCExt: in-place build of PyTorch extensions
  AllocExt-->>CMake: success
  IPCExt-->>CMake: success
  CMake->>Tile: build tilelang after deps
```

```mermaid
sequenceDiagram
  autonumber
  actor User
  participant Sh as launch.sh
  participant Py as Python
  participant TorchRun as torch.distributed.run
  participant App as user script
  User->>Sh: GPUS=N ./launch.sh python app.py
  Sh->>Sh: export TILELANG_USE_NVSHMEM=1
  Sh->>Py: ${PYTHON_EXEC} -m torch.distributed.run ...
  Py->>TorchRun: start launcher
  TorchRun->>App: spawn ranks
```

```mermaid
sequenceDiagram
  autonumber
  participant Env as env.py
  participant Wrapper as TLCUDASourceWrapper
  participant Libgen as libgen
  participant Cython as cython_wrapper
  participant NV as NVSHMEM
  Env-->>Wrapper: provide USE_DISTRIBUTED and USE_NVSHMEM
  alt use_distributed
    Wrapper->>Wrapper: append PREDEF_INIT_TABLE_FUNC (init_table)
  end
  alt USE_NVSHMEM
    Libgen->>NV: add NVSHMEM include/link flags
    Cython->>NV: create nvshmem tensor at runtime
  else
    Cython->>Torch: use torch.empty(...) fallback
  end
```
**Estimated code review effort:** 🎯 4 (Complex) | ⏱️ ~60 minutes
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run […]. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Actionable comments posted: 3
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
tilelang/distributed/pynvshmem/python/pynvshmem/__init__.py (1)
**20-28:** Fix CUDA error-string extraction (avoid printing raw tuples). cuda-python returns `(res, value)` for `cuGetErrorName`/`cuGetErrorString` in many versions; the current code prints the tuple object.
```diff
 def _CUDA_CHECK(err):
     if isinstance(err, cuda.CUresult):
         if err != cuda.CUresult.CUDA_SUCCESS:
-            raise RuntimeError(f"Cuda Error: {err}: {cuda.cuGetErrorName(err)}")
+            _res, name = cuda.cuGetErrorName(err)
+            _res2, desc = cuda.cuGetErrorString(err)
+            name = name.decode() if isinstance(name, (bytes, bytearray)) else name
+            desc = desc.decode() if isinstance(desc, (bytes, bytearray)) else desc
+            raise RuntimeError(f"CUDA Driver Error {int(err)} ({name}): {desc}")
     elif isinstance(err, cudart.cudaError_t):
         if err != cudart.cudaError_t.cudaSuccess:
-            raise RuntimeError(f"Cuda Error: {err}: {cudart.cudaGetErrorString(err)}")
+            msg = cudart.cudaGetErrorString(err)
+            if isinstance(msg, tuple):  # some bindings return (res, msg)
+                msg = msg[1]
+            msg = msg.decode() if isinstance(msg, (bytes, bytearray)) else msg
+            raise RuntimeError(f"CUDA Runtime Error {int(err)}: {msg}")
     else:
         raise RuntimeError(f"Unknown error type: {err}")
```
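As context for the fix, a standalone sketch of the tuple-return convention (assuming a recent cuda-python with the `cuda.bindings` layout; `describe_cuda_error` is an illustrative helper, not part of the PR):

```python
from cuda.bindings import driver as cuda  # assumes cuda-python >= 12.x layout

def describe_cuda_error(err) -> str:
    # cuGetErrorName/cuGetErrorString return (CUresult, bytes), not a bare string,
    # which is why naive f-string interpolation prints a raw tuple.
    _, name = cuda.cuGetErrorName(err)
    _, desc = cuda.cuGetErrorString(err)
    name = name.decode() if isinstance(name, (bytes, bytearray)) else name
    desc = desc.decode() if isinstance(desc, (bytes, bytearray)) else desc
    return f"{name}: {desc}"
```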
tilelang/distributed/utils.py (1)

**231-239:** Fix CUDA error-string extraction (avoid printing raw tuples). Same issue as in `pynvshmem._CUDA_CHECK`.
```diff
 def CUDA_CHECK(err):
     if isinstance(err, cuda.CUresult):
         if err != cuda.CUresult.CUDA_SUCCESS:
-            raise RuntimeError(f"Cuda Error: {err}: {cuda.cuGetErrorName(err)}")
+            _res, name = cuda.cuGetErrorName(err)
+            _res2, desc = cuda.cuGetErrorString(err)
+            name = name.decode() if isinstance(name, (bytes, bytearray)) else name
+            desc = desc.decode() if isinstance(desc, (bytes, bytearray)) else desc
+            raise RuntimeError(f"CUDA Driver Error {int(err)} ({name}): {desc}")
     elif isinstance(err, cudart.cudaError_t):
         if err != cudart.cudaError_t.cudaSuccess:
-            raise RuntimeError(f"Cuda Error: {err}: {cudart.cudaGetErrorString(err)}")
+            msg = cudart.cudaGetErrorString(err)
+            if isinstance(msg, tuple):
+                msg = msg[1]
+            msg = msg.decode() if isinstance(msg, (bytes, bytearray)) else msg
+            raise RuntimeError(f"CUDA Runtime Error {int(err)}: {msg}")
     else:
         raise RuntimeError(f"Unknown error type: {err}")
```
tilelang/env.py (1)

**212-226:** Gate NVSHMEM paths by `USE_NVSHMEM` (not `USE_DISTRIBUTED`). Currently NVSHMEM include/lib paths won't be set when only `TILELANG_USE_NVSHMEM=1` (and `TILELANG_USE_DISTRIBUTED=0`), breaking NVSHMEM builds/runs.
```diff
-    USE_DISTRIBUTED = EnvVar("TILELANG_USE_DISTRIBUTED", "0").get().lower() in ("1", "true", "on")
-    USE_NVSHMEM = EnvVar("TILELANG_USE_NVSHMEM", "0").get().lower() in ("1", "true", "on")
-    if USE_DISTRIBUTED:
+    USE_DISTRIBUTED = EnvVar("TILELANG_USE_DISTRIBUTED", "0").get().lower() in ("1", "true", "on")
+    USE_NVSHMEM = EnvVar("TILELANG_USE_NVSHMEM", "0").get().lower() in ("1", "true", "on")
+    if USE_NVSHMEM:
         if EnvVar("NVSHMEM_SRC", None).get() is not None:
             NVSHMEM_SRC = EnvVar("NVSHMEM_SRC", None).get()
         else:
             NVSHMEM_SRC = os.path.join(
                 os.path.dirname(os.path.abspath(__file__)), "..", "3rdparty", "nvshmem_src")
         NVSHMEM_INCLUDE_DIR: str = NVSHMEM_SRC + "/build/src/include"
         NVSHMEM_LIB_PATH: str = NVSHMEM_SRC + "/build/src/lib"
     else:
         NVSHMEM_INCLUDE_DIR = None
         NVSHMEM_LIB_PATH = None
```
tilelang/profiler/__init__.py (1)

**158-166:** Use `USE_NVSHMEM` consistently for distributed/NVSHMEM paths. These branches still use `USE_DISTRIBUTED`; switch to `USE_NVSHMEM` to match the new gating.
```diff
-        if env.USE_DISTRIBUTED:
+        if env.USE_NVSHMEM:
@@
-        if env.USE_DISTRIBUTED:
+        if env.USE_NVSHMEM:
@@
-        if env.USE_DISTRIBUTED:  # noqa: SIM108
+        if env.USE_NVSHMEM:  # noqa: SIM108
@@
-        if env.USE_DISTRIBUTED:
+        if env.USE_NVSHMEM:
@@
-        if env.USE_DISTRIBUTED:
+        if env.USE_NVSHMEM:
```

Also applies to: 246-251, 262-271, 310-320, 330-337
tilelang/jit/adapter/libgen.py (1)
**138-147:** Scope NVSHMEM flags to CUDA only and fix link-flag splitting. Applying NVSHMEM flags unconditionally breaks HIP/CPU builds when `TILELANG_USE_NVSHMEM=1`. Also, passing both `-lnvshmem_*` flags in a single list element prevents proper arg splitting.
Apply:
```diff
-        if env.USE_NVSHMEM:
+        if is_cuda_target(target) and env.USE_NVSHMEM:
             assert env.NVSHMEM_INCLUDE_DIR is not None, "env.NVSHMEM_INCLUDE_DIR is not set"
             assert env.NVSHMEM_LIB_PATH is not None, "env.NVSHMEM_LIB_PATH is not set"
-            command += ["-diag-suppress=20013"]
+            command += ["-diag-suppress=20013"]
             if not disable_rdc:
                 command += ["-rdc=true"]
             command += [
-                "-I" + env.NVSHMEM_INCLUDE_DIR, "-L" + env.NVSHMEM_LIB_PATH,
-                "-lnvshmem_host -lnvshmem_device"
+                "-I" + env.NVSHMEM_INCLUDE_DIR,
+                "-L" + env.NVSHMEM_LIB_PATH,
+                "-lnvshmem_host",
+                "-lnvshmem_device",
             ]
```
tilelang/distributed/launch.sh (1)

**36-49:** Build the command with an array and exec; preserve `"$@"` quoting and avoid fragile strings. The current string-based CMD drops original argument boundaries and can mis-handle spaces. Use an argv array and exec.
```diff
-PYTHON_EXEC="$(which python)"
-CMD="${PYTHON_EXEC} -m torch.distributed.run \
-    --node_rank=${node_rank} \
-    --nproc_per_node=${nproc_per_node} \
-    --nnodes=${nnodes} \
-    ${TILELANG_EXTRA_TORCHRUN_ARGS} ${additional_args} $@"
-
-if [ ${memcheck} -eq 1 ]; then
-  CMD="compute-sanitizer --tool memcheck ${CMD}"
-fi
-
-echo ${CMD}
-${CMD}
+PYTHON_EXEC="$(which python)"
+# Build argv
+args=( -m torch.distributed.run
+  --node_rank="${node_rank}"
+  --nproc_per_node="${nproc_per_node}"
+  --nnodes="${nnodes}"
+)
+# shell expansion for optional args (best-effort)
+args+=( ${TILELANG_EXTRA_TORCHRUN_ARGS} ${additional_args} )
+
+set -x
+if [ "${memcheck}" -eq 1 ]; then
+  exec compute-sanitizer --tool memcheck "${PYTHON_EXEC}" "${args[@]}" "$@"
+else
+  exec "${PYTHON_EXEC}" "${args[@]}" "$@"
+fi
```
tilelang/jit/adapter/wrapper.py (1)

**44-64:** `PREDEF_INIT_TABLE_FUNC`: remove always-true checks and the `std::` namespace; align with existing error handling.

- `if (error_buf)` is always true for a static array; drop it to avoid `-Waddress`.
- Use `snprintf` (as elsewhere) to avoid depending on `<cstdio>`/`std::snprintf`.
- Keep style consistent with other predefs.
Apply:
```diff
-PREDEF_INIT_TABLE_FUNC = """
-extern "C" int init_table(const void* host_table, size_t n) {{
-  if (error_buf) error_buf[0] = '\\0';
-
-  if (host_table == nullptr) {{
-    if (error_buf) std::snprintf(error_buf, 256, "host_table is null");
-    return -1;
-  }}
-  if (n == 0) {{
-    return 0;
-  }}
-
-  size_t bytes = n * sizeof(uint64_t);
-  cudaError_t err = cudaMemcpyToSymbol(meta_data, host_table, bytes, 0, cudaMemcpyHostToDevice);
-  if (err != cudaSuccess) {{
-    if (error_buf) std::snprintf(error_buf, 256, "cudaMemcpyToSymbol failed: %s", cudaGetErrorString(err));
-    return static_cast<int>(err);
-  }}
-  return 0;
-}}
-"""
+PREDEF_INIT_TABLE_FUNC = """
+extern "C" int init_table(const void* host_table, size_t n) {
+  error_buf[0] = '\\0';
+  if (host_table == nullptr) {
+    snprintf(error_buf, 256, "host_table is null");
+    return -1;
+  }
+  if (n == 0) {
+    return 0;
+  }
+  size_t bytes = n * sizeof(uint64_t);
+  cudaError_t err = cudaMemcpyToSymbol(meta_data, host_table, bytes, 0, cudaMemcpyHostToDevice);
+  if (err != cudaSuccess) {
+    snprintf(error_buf, 256, "cudaMemcpyToSymbol failed: %s", cudaGetErrorString(err));
+    return static_cast<int>(err);
+  }
+  return 0;
+}
+"""
```
🧹 Nitpick comments (9)
tilelang/distributed/pynvshmem/python/pynvshmem/__init__.py (2)
**4-5:** Make CUDA imports backward-compatible. cuda-python exposes both top-level (`cuda`, `cudart`) and `bindings.*` layouts across versions. Add a fallback to avoid import errors on older installs.
```diff
-from cuda.bindings import driver as cuda
-from cuda.bindings import runtime as cudart
+try:
+    from cuda.bindings import driver as cuda
+    from cuda.bindings import runtime as cudart
+except ImportError:
+    # Fallback for older cuda-python versions
+    from cuda import cuda, cudart
```
**60-79:** Guard against host tensors for stream writes. `cuStreamWriteValue{32,64}` requires a device pointer; add a fast check.
```diff
-    assert isinstance(tensor, torch.Tensor) and tensor.dtype in (torch.int32, torch.uint32), \
+    assert isinstance(tensor, torch.Tensor) and tensor.dtype in (torch.int32, torch.uint32), \
         f"tensor must be a torch.Tensor with 32-bit dtype, but got {tensor.dtype}"
+    assert tensor.is_cuda, "tensor must reside on a CUDA device"
 ...
-    assert isinstance(tensor, torch.Tensor) and tensor.dtype in (torch.int64, torch.uint64), \
+    assert isinstance(tensor, torch.Tensor) and tensor.dtype in (torch.int64, torch.uint64), \
         f"tensor must be a torch.Tensor with 64-bit dtype, but got {tensor.dtype}"
+    assert tensor.is_cuda, "tensor must reside on a CUDA device"
```

Also applies to: 82-101
tilelang/distributed/utils.py (2)
**8-9:** Make CUDA imports backward-compatible. Mirror the pynvshmem import fallback to cover older cuda-python versions.
```diff
-from cuda.bindings import driver as cuda
-from cuda.bindings import runtime as cudart
+try:
+    from cuda.bindings import driver as cuda
+    from cuda.bindings import runtime as cudart
+except ImportError:
+    from cuda import cuda, cudart
```
**11-11:** Default NVSHMEM init to `env.USE_NVSHMEM`. Keep behavior aligned with the new flag without forcing callers to pass it.
```diff
+from tilelang import env
 ...
-def init_distributed(return_tp_group=False, init_nvshmem=True):
+def init_distributed(return_tp_group=False, init_nvshmem: bool | None = None):
 @@
-    if init_nvshmem:
+    if init_nvshmem is None:
+        init_nvshmem = env.USE_NVSHMEM
+    if init_nvshmem:
         import pynvshmem
         pynvshmem.init_nvshmem_by_uniqueid(TP_GROUP)
```

Also applies to: 49-68
tilelang/utils/allocator.py (1)
**179-182:** Shorten error message and clamp available bytes (minor). Keeps logs tight and avoids negative "available" values on overflow.
```diff
-        if current_offset + bytes_alloc > self.size:
-            bytes_available = self.size - current_offset
-            raise MemoryError(f"Allocation failed: Requesting {bytes_alloc} bytes, but only "
-                              f"{bytes_available} bytes are available in the pre-allocated buffer "
-                              f"(total size: {self.size} bytes).")
+        if current_offset + bytes_alloc > self.size:
+            bytes_available = max(0, self.size - current_offset)
+            raise MemoryError(
+                f"Out of pre-allocated memory: need {bytes_alloc}B, have {bytes_available}B of {self.size}B."
+            )
```
tilelang/jit/adapter/libgen.py (1)

**139-140:** Prefer explicit exceptions over asserts for env validation. Asserts can be skipped with Python `-O`. Raise a RuntimeError/ValueError with actionable guidance.
```diff
-        assert env.NVSHMEM_INCLUDE_DIR is not None, "env.NVSHMEM_INCLUDE_DIR is not set"
-        assert env.NVSHMEM_LIB_PATH is not None, "env.NVSHMEM_LIB_PATH is not set"
+        if not env.NVSHMEM_INCLUDE_DIR:
+            raise RuntimeError("NVSHMEM enabled but NVSHMEM_INCLUDE_DIR is not set")
+        if not env.NVSHMEM_LIB_PATH:
+            raise RuntimeError("NVSHMEM enabled but NVSHMEM_LIB_PATH is not set")
```
tilelang/distributed/launch.sh (1)

**4-4:** Don't force NVSHMEM; default only if undefined. Let users override `TILELANG_USE_NVSHMEM` from the environment.
```diff
-export TILELANG_USE_NVSHMEM=1  # enable TileLang distributed mode
+export TILELANG_USE_NVSHMEM=${TILELANG_USE_NVSHMEM:=1}  # enable NVSHMEM by default; allow override
```
docs/get_started/run_example.md (2)

**17-21:** Add a multi-node usage note (NODES, NODE_RANK, MASTER_ADDR/PORT). A short snippet will help users scale beyond a single node.
````diff
 You can change GPUS to the number of local GPUs you want to use. The launcher will set the required environment variables and invoke `torch.distributed.run`.
+For multi-node runs, set:
+
+```bash
+NODES=<num_nodes> NODE_RANK=<this_node_rank> MASTER_ADDR=<host_of_rank0> \
+GPUS=<gpus_per_node> ./tilelang/distributed/launch.sh <your_script.py> [args...]
+```
````
**1-21:** Call out NVSHMEM prerequisites explicitly. Briefly state that NVSHMEM must be installed and `NVSHMEM_INCLUDE_DIR`/`NVSHMEM_LIB_PATH` set (or provided by default install paths).
```diff
 ## Examples using NVSHMEM APIs
 Use the provided launcher `tilelang/distributed/launch.sh` to start programs that use the NVSHMEM API. For example, to run with 2 GPUs:
+Prerequisites: NVSHMEM installed on the system and environment variables `NVSHMEM_INCLUDE_DIR` and `NVSHMEM_LIB_PATH` set if not in default locations.
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (12)
- CMakeLists.txt (1 hunks)
- docs/get_started/Installation.md (5 hunks)
- docs/get_started/run_example.md (1 hunks)
- tilelang/distributed/launch.sh (2 hunks)
- tilelang/distributed/pynvshmem/python/pynvshmem/__init__.py (1 hunks)
- tilelang/distributed/utils.py (1 hunks)
- tilelang/env.py (1 hunks)
- tilelang/jit/adapter/cython/cython_wrapper.pyx (2 hunks)
- tilelang/jit/adapter/libgen.py (1 hunks)
- tilelang/jit/adapter/wrapper.py (3 hunks)
- tilelang/profiler/__init__.py (1 hunks)
- tilelang/utils/allocator.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/jit/adapter/wrapper.py (1)
src/target/codegen_cuda.h (1)
`use_distributed` (23-29)
🪛 Ruff (0.12.2)
tilelang/utils/allocator.py
180-182: Avoid specifying long messages outside the exception class
(TRY003)
🪛 LanguageTool
docs/get_started/Installation.md
[grammar] ~117-~117: There might be a mistake here.
Context: ...ted/example_allgather.py)), you need to build NVSHMEM library for device-side code ge...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ... install the pynvshmem package, which provides wrapped host-side Python API for NVSHME...
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
🔇 Additional comments (6)
tilelang/utils/allocator.py (1)
**177-183:** Good pre-allocation bounds check. Prevents buffer overruns and yields a clear failure mode.
tilelang/jit/adapter/libgen.py (2)
**141-141:** Verify tool routing for diagnostic suppression. Elsewhere you use `-Xcudafe --diag_suppress=…`; here it's `-diag-suppress=20013` alone. Confirm nvcc actually honors this form; otherwise switch to the consistent form.
Option if needed:
```diff
-            command += ["-diag-suppress=20013"]
+            command += ["-Xcudafe", "--diag_suppress=20013"]
```
**142-143:** NVSHMEM typically requires RDC; consider forcing or failing fast. If NVSHMEM device libs are used, `-rdc=true` is usually required. Either force it under `USE_NVSHMEM` or raise a clear error when `TL_DISABLE_RDC` is set.
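A minimal sketch of the fail-fast option, assuming the surrounding `libgen.py` names (`env`, `disable_rdc`, `command`) from the diffs above; this is illustrative, not the PR's code:

```python
# Fail fast when NVSHMEM is enabled but RDC has been explicitly disabled.
if env.USE_NVSHMEM:
    if disable_rdc:
        raise RuntimeError(
            "NVSHMEM device libraries require relocatable device code; "
            "unset TL_DISABLE_RDC or disable TILELANG_USE_NVSHMEM.")
    command += ["-rdc=true"]  # force RDC for nvshmem_device linkage
```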
CMakeLists.txt (1)
225-247: Guard PyTorch extension targets behind USE_CUDA and enable incremental-style builds
- Wrap `build_alloc_cuda_ext` and `build_ipc_ext` in `if(USE_CUDA)…endif()` instead of `ALL`, and add `USES_TERMINAL` so they only run when CUDA is enabled and display output. Manually verify that `USE_CUDA` is declared (e.g. via `option(USE_CUDA "Enable CUDA extensions" ON)` or in an included config) and defaults as expected.
- Longer-term, convert to `add_custom_command(OUTPUT <built-.so> BYPRODUCTS <.so> …)` and have the custom target depend on those outputs to get true incremental rebuilds.
- Replace the `execute_process(COMMAND which python…)` Python lookup with `find_package(Python3 COMPONENTS Interpreter REQUIRED)` and use `${Python3_EXECUTABLE}` for more robust, CMake-native discovery.

```diff
-if(…)
-  set(ALLOC_CUDA_EXT_DIR ${PROJECT_SOURCE_DIR}/tilelang/utils/cpp)
-  add_custom_target(
-    build_alloc_cuda_ext ALL
-    COMMAND ${Python_EXECUTABLE} setup.py build_ext --inplace
-    WORKING_DIRECTORY ${ALLOC_CUDA_EXT_DIR}
-    COMMENT "Building alloc_cuda PyTorch extension (in-place)"
-  )
-  add_dependencies(tilelang build_alloc_cuda_ext)
-
-  set(IPC_EXT_DIR ${PROJECT_SOURCE_DIR}/tilelang/distributed/common)
-  add_custom_target(
-    build_ipc_ext ALL
-    COMMAND ${Python_EXECUTABLE} setup.py build_ext --inplace
-    WORKING_DIRECTORY ${IPC_EXT_DIR}
-    COMMENT "Building ipc_ext PyTorch extension (in-place)"
-  )
-  add_dependencies(tilelang build_ipc_ext)
-endif()
+if(USE_CUDA)
+  set(ALLOC_CUDA_EXT_DIR ${PROJECT_SOURCE_DIR}/tilelang/utils/cpp)
+  add_custom_target(
+    build_alloc_cuda_ext
+    COMMAND ${Python_EXECUTABLE} setup.py build_ext --inplace
+    WORKING_DIRECTORY ${ALLOC_CUDA_EXT_DIR}
+    COMMENT "Building alloc_cuda PyTorch extension (in-place)"
+    USES_TERMINAL
+  )
+  add_dependencies(tilelang build_alloc_cuda_ext)
+
+  set(IPC_EXT_DIR ${PROJECT_SOURCE_DIR}/tilelang/distributed/common)
+  add_custom_target(
+    build_ipc_ext
+    COMMAND ${Python_EXECUTABLE} setup.py build_ext --inplace
+    WORKING_DIRECTORY ${IPC_EXT_DIR}
+    COMMENT "Building ipc_ext PyTorch extension (in-place)"
+    USES_TERMINAL
+  )
+  add_dependencies(tilelang build_ipc_ext)
+endif()
```
tilelang/jit/adapter/cython/cython_wrapper.pyx (1)

**191-202:** Convert dtype and device for tensor creation
- Map `self.param_dtypes[i]` (from `param.dtype` in `__cinit__`) to a `torch.dtype` via `map_torch_type` before calling `pynvshmem.nvshmem_create_tensor` or `torch.empty` (docs.nvidia.com); see the sketch after this list.
- Replace `torch.cuda.current_device()` (an int) with `torch.device("cuda", self.get_current_device())` to supply a valid `device` to `torch.empty` (docs.pytorch.org).
- Optionally normalize `self.param_dtypes` in `__cinit__`: `self.param_dtypes = [map_torch_type(param.dtype) for param in params]`.
- Confirm the signature of `pynvshmem.nvshmem_create_tensor(shape, dtype)` and whether it requires a `device` parameter.
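A minimal sketch of the suggested normalization, assuming a simple string-keyed dtype map (`map_torch_type` and the mapping entries here are illustrative stand-ins for the project's helper):

```python
import torch

# Illustrative stand-in for the project's dtype-mapping helper.
_DTYPE_MAP = {"float16": torch.float16, "float32": torch.float32, "int32": torch.int32}

def map_torch_type(dtype_str: str) -> torch.dtype:
    return _DTYPE_MAP[dtype_str]

def make_param_tensor(shape, dtype_str, device_index):
    # torch.empty expects a torch.dtype and a torch.device, not a raw string/int.
    dtype = map_torch_type(dtype_str)
    device = torch.device("cuda", device_index)
    return torch.empty(*shape, dtype=dtype, device=device)
```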
tilelang/jit/adapter/wrapper.py (1)

**552-554:** Gate `init_table` injection on availability of `meta_data`.

The generated `init_table` code (wrapper.py:552-554) unconditionally calls `cudaMemcpyToSymbol(meta_data, …)` in distributed mode, but there's no check that `meta_data` is actually defined in every CUDA unit. Either:

- Add a codegen flag (e.g. `host_mod.has_meta_data_symbol`) to inject `PREDEF_INIT_TABLE_FUNC` only when `meta_data` is present
- Or emit the `extern __device__ __constant__ uint64_t meta_data[…]` declaration/definition in the generated module only when used

Verify that all distributed-mode CUDA units define `meta_data`, or implement one of these guards.
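A minimal sketch of the first option, assuming the wrapper assembles the generated host source as a string (`PREDEF_INIT_TABLE_FUNC` stands in for the C snippet from the PR; the gating condition is illustrative):

```python
# Shortened placeholder for the C snippet added by the PR.
PREDEF_INIT_TABLE_FUNC = '\nextern "C" int init_table(const void* host_table, size_t n) { /* ... */ }\n'

def maybe_append_init_table(host_code: str, use_distributed: bool) -> str:
    # Inject init_table only when the generated source actually defines meta_data.
    if use_distributed and "meta_data" in host_code:
        return host_code + PREDEF_INIT_TABLE_FUNC
    return host_code
```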
## To use NVSHMEM APIs

Before running the examples using NVSHMEM APIs (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to build NVSHMEM library for device-side code generation.

```bash
export NVSHMEM_SRC="your_custom_nvshmem_dir" # default to 3rdparty/nvshmem_src
cd tilelang/distributed
source build_nvshmem.sh
```

You also need to install the `pynvshmem` package, which provides wrapped host-side Python API for NVSHMEM.

```bash
pip install tilelang -f https://tile-ai.github.io/whl/nightly/cu121/
# or pip install tilelang --find-links https://tile-ai.github.io/whl/nightly/cu121/
cd ./pynvshmem
python setup.py install
export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/src/lib:$LD_LIBRARY_PATH"
```

> **Note:** Nightly builds contain the most recent code changes but may be less stable than official releases. They're ideal for testing new features or if you need a specific bugfix that hasn't been released yet.

Then you can test python import:

```bash
python -c "import pynvshmem"
```
💡 Verification agent
🧩 Analysis chain
NVSHMEM setup: fix path/install steps and tighten wording
- Paths/commands likely break as written:
- `cd ./pynvshmem` is ambiguous from `tilelang/distributed`; the Python package appears to live under `tilelang/distributed/pynvshmem/python`. Prefer an editable install via pip.
- The default `LD_LIBRARY_PATH` subdir is frequently `build/lib` (not `build/src/lib`) in NVSHMEM's CMake builds; please verify your build layout.
- Prefer `bash build_nvshmem.sh` unless the script exports env vars intentionally.
 
- Wording nits flagged by LanguageTool.
Apply:
````diff
-## To use NVSHMEM APIs
+## To use NVSHMEM APIs

-Before running the examples using NVSHMEM APIs (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to build NVSHMEM library for device-side code generation.
+Before running the examples using NVSHMEM APIs (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to build the NVSHMEM library for device-side code generation.

-```bash
-export NVSHMEM_SRC="your_custom_nvshmem_dir" # default to 3rdparty/nvshmem_src
-cd tilelang/distributed
-source build_nvshmem.sh
-```
+```bash
+# default to 3rdparty/nvshmem_src if not provided
+export NVSHMEM_SRC="${NVSHMEM_SRC:-3rdparty/nvshmem_src}"
+cd tilelang/distributed
+bash build_nvshmem.sh
+```

-You also need to install the `pynvshmem` package, which provides wrapped host-side Python API for NVSHMEM.
+You also need to install the `pynvshmem` package, which provides a host-side Python API wrapper for NVSHMEM.

 ```bash
-cd ./pynvshmem
-python setup.py install
-export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/src/lib:$LD_LIBRARY_PATH"
+cd pynvshmem/python
+python -m pip install -e .
+# Adjust the lib dir as produced by your NVSHMEM build:
+export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/lib:$LD_LIBRARY_PATH"
 ```

-Then you can test python import:
+Then test the Python import:

 ```bash
 python -c "import pynvshmem"
 ```
````
Would you confirm:
- the exact location of the Python package (does `tilelang/distributed/pynvshmem/python/setup.py` exist?), and
- the NVSHMEM build’s library output dir (`build/lib` vs `build/src/lib`) on your CI environment?
---
🏁 Script executed:
```shell
#!/usr/bin/env bash
# Locate the pynvshmem package directory and its setup.py
fd --max-depth 4 -t d pynvshmem
fd --max-depth 5 -t f setup.py | grep -i pynvshmem
```

Length of output: 158
docs/get_started/Installation.md: correct NVSHMEM paths and install steps
- Default `NVSHMEM_SRC` to `${NVSHMEM_SRC:-3rdparty/nvshmem_src}`
- Run the build script with `bash` in `tilelang/distributed`
- Install the Python API from `tilelang/distributed/pynvshmem` via editable pip
- Point `LD_LIBRARY_PATH` at the actual NVSHMEM build output (`build/lib`)
````diff
 ## To use NVSHMEM APIs
@@ -118,9 +118,12 @@
-export NVSHMEM_SRC="your_custom_nvshmem_dir" # default to 3rdparty/nvshmem_src
+export NVSHMEM_SRC="${NVSHMEM_SRC:-3rdparty/nvshmem_src}"
 cd tilelang/distributed
-source build_nvshmem.sh
+bash build_nvshmem.sh
 You also need to install the `pynvshmem` package…

-```bash
- cd ./pynvshmem
- python setup.py install
- export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/src/lib:$LD_LIBRARY_PATH"
-```
+```bash
+ cd pynvshmem
+ python -m pip install -e .
+ # Adjust according to your NVSHMEM build output:
+ export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/lib:$LD_LIBRARY_PATH"
+```
 Then test the import:
 ```bash
 python -c "import pynvshmem"
 ```
````

🧰 Tools
🪛 LanguageTool
[grammar] ~117-~117: There might be a mistake here.
Context: ...ted/example_allgather.py)), you need to build NVSHMEM library for device-side code ge...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ... install the pynvshmem package, which provides wrapped host-side Python API for NVSHME...
(QB_NEW_EN)
🤖 Prompt for AI Agents
In docs/get_started/Installation.md around lines 115 to 135, update the NVSHMEM
instructions: default NVSHMEM_SRC to ${NVSHMEM_SRC:-3rdparty/nvshmem_src}, run
the build script with bash from tilelang/distributed (bash build_nvshmem.sh),
install the Python API using an editable pip install from
tilelang/distributed/pynvshmem (python -m pip install -e .) instead of python
setup.py install, and set LD_LIBRARY_PATH to the actual NVSHMEM runtime output
directory (use $NVSHMEM_SRC/build/lib) so the example import works.
```python
if env.USE_NVSHMEM:
    import pynvshmem
```
🛠️ Refactor suggestion
Import pynvshmem defensively to surface actionable errors
If NVSHMEM is enabled but pynvshmem is missing, the import will crash at module import time without context. Wrap in try/except.
```diff
-if env.USE_NVSHMEM:
-    import pynvshmem
+if env.USE_NVSHMEM:
+    try:
+        import pynvshmem
+    except Exception as e:
+        raise ImportError("env.USE_NVSHMEM=1 but pynvshmem is not importable. "
+                          "Ensure you installed tilelang/distributed/pynvshmem and NVSHMEM libs are on LD_LIBRARY_PATH.") from e
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if env.USE_NVSHMEM:
    try:
        import pynvshmem
    except Exception as e:
        raise ImportError("env.USE_NVSHMEM=1 but pynvshmem is not importable. "
                          "Ensure you installed tilelang/distributed/pynvshmem and NVSHMEM libs are on LD_LIBRARY_PATH.") from e
```
🤖 Prompt for AI Agents
In tilelang/jit/adapter/cython/cython_wrapper.pyx around lines 11-13, the
unconditional import of pynvshmem will cause an opaque crash at module import
time if env.USE_NVSHMEM is true but the package is not installed; wrap the
import in a try/except ImportError block, and on failure raise a clear error
(RuntimeError or re-raised ImportError) that explains NVSHMEM is enabled, names
the missing package (pynvshmem), and instructs the user to install it or disable
USE_NVSHMEM so the failure is actionable.
Actionable comments posted: 0
♻️ Duplicate comments (2)
docs/get_started/Installation.md (2)
**119-124:** Fix NVSHMEM setup commands: avoid `pip install mpich`, default `NVSHMEM_SRC`, call the build script with `bash`.
- mpich is a system package; installing via pip is incorrect and will confuse users.
- Default NVSHMEM_SRC via parameter expansion.
- Prefer `bash build_nvshmem.sh` over sourcing unless the script sets env that must persist.

````diff
-```bash
-pip install mpich
-export NVSHMEM_SRC="your_custom_nvshmem_dir" # default to 3rdparty/nvshmem_src
-cd tilelang/distributed
-source build_nvshmem.sh
-```
+```bash
+# Install an MPI implementation if your launcher requires it (choose ONE):
+# Debian/Ubuntu: sudo apt-get update && sudo apt-get install -y mpich
+# Conda: conda install -c conda-forge mpich
+export NVSHMEM_SRC="${NVSHMEM_SRC:-3rdparty/nvshmem_src}"
+cd tilelang/distributed
+bash build_nvshmem.sh
+```
````

Run this script to verify the build script path and NVSHMEM lib dir guidance in the repo:

```shell
#!/usr/bin/env bash
set -euo pipefail
echo "Build script:"; fd -n '^build_nvshmem\.sh$' -a -H -I tilelang | sed 's/^/  /' || true
echo "pynvshmem dirs:"; fd -t d -n '^pynvshmem$' -a tilelang | sed 's/^/  /' || true
```
128-131: Install path and runtime library path are likely wrong; switch to editable pip and point LD_LIBRARY_PATH to build/lib.
- Prefer `python -m pip install -e .` over `setup.py`.
- The NVSHMEM runtime libs typically land under `build/lib`, not `build/src/lib`.
- Some repos place setup.py under `pynvshmem/` or `pynvshmem/python/`. Detect and cd accordingly.

````diff
-```bash
-cd ./pynvshmem
-python setup.py install
-export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/src/lib:$LD_LIBRARY_PATH"
-```
+```bash
+# From tilelang/distributed, pick the correct package root:
+if [ -f pynvshmem/setup.py ]; then
+  cd pynvshmem
+elif [ -f pynvshmem/python/setup.py ]; then
+  cd pynvshmem/python
+fi
+python -m pip install -e .
+# Adjust to your NVSHMEM build output:
+export LD_LIBRARY_PATH="${NVSHMEM_SRC}/build/lib:${LD_LIBRARY_PATH}"
+```
````
Quick check script to confirm the correct package path and pick the right LD_LIBRARY_PATH hint:
```shell
#!/usr/bin/env bash
set -euo pipefail
echo "Looking for pynvshmem setup.py candidates..."
fd -n -a -t f '^setup\.py$' tilelang | grep -i pynvshmem -n || true
echo "Probe NVSHMEM lib candidates under \$NVSHMEM_SRC:"
for d in build/lib build/src/lib lib; do
  [ -d "${NVSHMEM_SRC:-3rdparty/nvshmem_src}/$d" ] && echo "  found: $NVSHMEM_SRC/$d"
done
```
🧹 Nitpick comments (4)
docs/get_started/Installation.md (4)
**27-28:** Repo clone instructions look correct; optional speed-up. Commands are correct; optionally add `--depth 1` for faster clones in CI/docs.
```diff
-git clone --recursive https://github.com/tile-ai/tilescale
+git clone --recursive --depth 1 https://github.com/tile-ai/tilescale
```

Also applies to: 63-64, 103-104
**115-117:** Grammar/clarity: add an article before "NVSHMEM library". Small English fix.
```diff
-Before running the examples using NVSHMEM APIs (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to build NVSHMEM library for device-side code generation.
+Before running the examples using NVSHMEM APIs (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to build the NVSHMEM library for device-side code generation.
```
**125-125:** Wording: tighten the description of pynvshmem. Minor English fix.
```diff
-You also need to install the `pynvshmem` package, which provides wrapped host-side Python API for NVSHMEM.
+You also need to install the `pynvshmem` package, which provides a host-side Python API wrapper for NVSHMEM.
```
**133-136:** Grammar: streamline the test instruction.

```diff
-Then you can test python import:
+Then test the Python import:
```
📜 Review details
📒 Files selected for processing (1)
- docs/get_started/Installation.md (5 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/get_started/Installation.md
[grammar] ~117-~117: There might be a mistake here.
Context: ...ted/example_allgather.py)), you need to build NVSHMEM library for device-side code ge...
(QB_NEW_EN)
[grammar] ~125-~125: There might be a mistake here.
Context: ... install the pynvshmem package, which provides wrapped host-side Python API for NVSHME...
(QB_NEW_EN)
🔇 Additional comments (1)
docs/get_started/Installation.md (1)
**12-12:** Section intro reads well. No issues; clear structure for the three install methods.
Actionable comments posted: 0
♻️ Duplicate comments (4)
docs/get_started/Installation.md (4)
**125-125:** Wording: clarify the pynvshmem description.

```diff
-You also need to install the `pynvshmem` package, which provides wrapped host-side Python API for NVSHMEM.
+You also need to install the `pynvshmem` package, which provides a host-side Python API wrapper for NVSHMEM.
```
**133-136:** Tighten phrasing of the import test.

```diff
-Then you can test python import:
+Then test the Python import:
```
115-124: NVSHMEM section: command correctness and defaults.
- Add article “the” (“build the NVSHMEM library”).
- Default NVSHMEM_SRC with parameter expansion.
- Prefer `bash build_nvshmem.sh` over sourcing unless the script exports env vars.

```diff
-## To use NVSHMEM APIs
+## To use NVSHMEM APIs
@@
-Before running the examples using NVSHMEM APIs (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to build NVSHMEM library for device-side code generation.
+Before running examples with NVSHMEM (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to build the NVSHMEM library for device-side code generation.
@@
-pip install mpich
-export NVSHMEM_SRC="your_custom_nvshmem_dir" # default to 3rdparty/nvshmem_src
-cd tilelang/distributed
-source build_nvshmem.sh
+pip install mpich
+# Default to 3rdparty/nvshmem_src if not provided:
+export NVSHMEM_SRC="${NVSHMEM_SRC:-3rdparty/nvshmem_src}"
+cd tilelang/distributed
+bash build_nvshmem.sh
```
Note: if the script intentionally sets env vars, keep “source” and state why.
128-131: pynvshmem install path and NVSHMEM lib dir are likely wrong.
- The Python package usually lives under pynvshmem/python; prefer editable pip install.
- NVSHMEM's CMake output is typically `build/lib`, not `build/src/lib`.

```diff
-cd ./pynvshmem
-python setup.py install
-export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/src/lib:$LD_LIBRARY_PATH"
+cd pynvshmem/python
+python -m pip install -e .
+# Adjust to your NVSHMEM build output directory:
+export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/lib:$LD_LIBRARY_PATH"
```
If your build layout differs, document the exact lib path used in CI.
🧹 Nitpick comments (6)
docs/get_started/Installation.md (6)
9-11: Clarify supported toolchain versions (CUDA/Python/LLVM).
- Python 3.7 is EOL; recommend stating ≥3.8 or your tested matrix.
- “LLVM < 20 if you are using the bundled TVM submodule” needs an explicit tested version range.
- Specify minimal and tested CUDA versions (e.g., “min 12.1; tested 12.2/12.4”).
26-29: Repo clone: looks good; suggest shallow clone for speed (optional).
You can add `--depth 1` for faster CI/dev setup.

```diff
-git clone --recursive https://github.com/tile-ai/tilescale
+git clone --recursive --depth 1 https://github.com/tile-ai/tilescale
```
**63-65:** Duplicate clone block: mirror the shallow clone suggestion.

```diff
-git clone --recursive https://github.com/tile-ai/tilescale
+git clone --recursive --depth 1 https://github.com/tile-ai/tilescale
```
**78-82:** Config edits: avoid echoing commented examples into config. The current snippet appends only `USE_CUDA`; consider showing both options clearly and ensuring newline safety.
```diff
-# echo "set(USE_LLVM ON)" # set USE_LLVM to ON if using LLVM
-echo "set(USE_CUDA ON)" >> config.cmake
+# Enable desired backends:
+printf '%s\n' "set(USE_LLVM ON)" "set(USE_CUDA ON)" >> config.cmake
```
**103-105:** Mirror the shallow clone suggestion (optional).

```diff
-git clone --recursive https://github.com/tile-ai/tilescale
+git clone --recursive --depth 1 https://github.com/tile-ai/tilescale
```
**110-112:** Fix stray backticks in the command.

```diff
-bash install_cuda.sh
-# or bash `install_amd.sh` if you want to enable ROCm runtime
+bash install_cuda.sh
+# or:
+bash install_amd.sh  # enable ROCm runtime
```
📜 Review details
📒 Files selected for processing (1)
- docs/get_started/Installation.md (5 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/get_started/Installation.md
[grammar] ~9-~9: There might be a mistake here.
Context: ...on**: >= 3.7 - CUDA Version: >= 12.1 - LLVM: < 20 if you are using the bundle...
(QB_NEW_EN)
[grammar] ~117-~117: There might be a mistake here.
Context: ...ted/example_allgather.py)), you need to build NVSHMEM library for device-side code ge...
(QB_NEW_EN)
[grammar] ~125-~125: There might be a mistake here.
Context: ... install the pynvshmem package, which provides wrapped host-side Python API for NVSHME...
(QB_NEW_EN)
🔇 Additional comments (1)
docs/get_started/Installation.md (1)
**46-52:** Standardize the module name to "tilelang".

```diff
-   Update `PYTHONPATH` to include the `tile-lang` Python module:
+   Update `PYTHONPATH` to include the `tilelang` Python module:
```

Likely an incorrect or invalid review comment.
…r pull and push warp examples

- Modified `example_pull_warp.py` and `example_push_warp.py` to set a default value of 65536 for M if not provided in args.
- Added new test scripts `test_pull_warp.py` and `test_push_warp.py` to validate functionality using CUDA and multiprocessing.
Actionable comments posted: 0
🧹 Nitpick comments (4)
examples/distributed/example_pull_warp.py (1)
**40-40:** Make args handling robust when `args` lacks attribute `M`. If `args` is an empty Namespace (truthy) without `M`, this will raise AttributeError. Use `getattr` to cover None and missing attributes.
```diff
-    M = args.M if args else 65536
+    M = getattr(args, "M", 65536)
```

Optional: validate M > 0 to avoid empty tensors.
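A quick illustration of why `getattr` covers both failure modes (plain Python, no project-specific assumptions):

```python
import argparse

args = argparse.Namespace()       # truthy, but defines no attribute M
print(getattr(args, "M", 65536))  # -> 65536 (no AttributeError)

args = None                       # falsy: the old `args.M if args else 65536` also worked here
print(getattr(args, "M", 65536))  # -> 65536; getattr handles None with a default too
```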
examples/distributed/test_pull_warp.py (1)
**1-16:** Guard for multi-GPU availability to avoid spurious CI failures. The example asserts 2 ranks; skip when fewer than 2 CUDA devices are present.
```diff
 import tilelang.testing
 import torch
 import torch.multiprocessing
+import pytest
 import example_pull_warp

 @tilelang.testing.requires_cuda
 @tilelang.testing.requires_cuda_compute_version_ge(9, 0)
 def test_example_pull_warp():
+    if torch.cuda.device_count() < 2:
+        pytest.skip("requires >= 2 CUDA devices")
     torch.multiprocessing.spawn(example_pull_warp.main, args=(2, None), nprocs=2)
```
**40-40:** Harden defaulting of `M` to handle None or missing attributes. Same rationale as pull_warp: avoid AttributeError when `args` is truthy but lacks `M`.
```diff
-    M = args.M if args else 65536
+    M = getattr(args, "M", 65536)
```
**1-16:** Skip the test when fewer than 2 GPUs are available. Prevents failures on single-GPU runners while preserving coverage where applicable.
```diff
 import tilelang.testing
 import torch
 import torch.multiprocessing
+import pytest
 import example_push_warp

 @tilelang.testing.requires_cuda
 @tilelang.testing.requires_cuda_compute_version_ge(9, 0)
 def test_example_push_warp():
+    if torch.cuda.device_count() < 2:
+        pytest.skip("requires >= 2 CUDA devices")
     torch.multiprocessing.spawn(example_push_warp.main, args=(2, None), nprocs=2)
```
📜 Review details
📒 Files selected for processing (4)
- examples/distributed/example_pull_warp.py (1 hunks)
- examples/distributed/example_push_warp.py (1 hunks)
- examples/distributed/test_pull_warp.py (1 hunks)
- examples/distributed/test_push_warp.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/distributed/test_pull_warp.py (2)
tilelang/testing/__init__.py (1)
`requires_cuda_compute_version_ge` (95-96)

examples/distributed/example_pull_warp.py (2)

`main` (17-34)
`main` (39-78)
examples/distributed/test_push_warp.py (2)
tilelang/testing/__init__.py (1)
`requires_cuda_compute_version_ge` (95-96)

examples/distributed/example_push_warp.py (2)

`main` (17-34)
`main` (39-78)
- Moved NVSHMEM import and initialization inside a conditional block to ensure it only executes when `env.USE_NVSHMEM` is set to True, improving code clarity and preventing unnecessary imports.
Summary by CodeRabbit