Make a first pass at using NVSHMEM4Py for host-side library management, etc. #4
base: develop
Conversation
First pass at replacing custom bindings with nvshmem4py versions
```python
uid_bytes = nvshmem_comm_cuda.NVSHMEMCommWrapper.get_unique_id_bytes()
uid_gpu = uid_bytes.to(device)
dist.broadcast(uid_gpu, src=0)
# Set device current
```
This should really be a helper function because it's used in both the benchmarks and the library itself. I couldn't think of the best place to put it, though.
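For discussion, one possible shape for that helper (a minimal sketch; the name `init_nvshmem_via_uid` is made up, it assumes `torch.distributed` is already initialized, and it uses nvshmem4py's own `get_unique_id` plus an object broadcast instead of the byte-tensor broadcast above):

```python
import torch.distributed as dist
import nvshmem.core
from cuda.core.experimental import Device


def init_nvshmem_via_uid(local_device: int) -> Device:
    """Hypothetical helper collecting the repeated NVSHMEM init boilerplate."""
    rank, nranks = dist.get_rank(), dist.get_world_size()
    cuda_dev = Device(local_device)
    cuda_dev.set_current()  # should be idempotent
    # Rank 0 creates the unique ID; the others start from an empty one and
    # receive it via an object broadcast (assumes the UniqueID is picklable).
    if rank == 0:
        payload = [nvshmem.core.get_unique_id()]
    else:
        payload = [nvshmem.core.get_unique_id(empty=True)]
    dist.broadcast_object_list(payload, src=0)
    nvshmem.core.init(device=cuda_dev, uid=payload[0], rank=rank,
                      nranks=nranks, initializer_method="uid")
    return cuda_dev
```

Call sites in the benchmarks and the library would then shrink to `cuda_dev = init_nvshmem_via_uid(local_rank)`.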
```python
if comm_wrapper is not None:
    nvrar_tensor, nvrar_tensor_id = comm_wrapper.allocate_tensor(num_elems, dtype, device, nvshmem_comm_cuda.Protocol.LL8)
    # Allocate symmetric tensor via nvshmem4py and register with wrapper
    nvrar_tensor = nvshmem.tensor((num_elems,), dtype=dtype)
```
This is the first major difference. I couldn't think of a good way to handle the tensor_id bookkeeping purely in Python, so what I did is:
- replace tensor allocation with the nvshmem.core wrapper, and
- keep the rest of the process in your C code, renamed to register_external_tensor instead of allocate_tensor (see the sketch below).
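Concretely, the call site becomes a two-step (a sketch; `comm_wrapper`, `num_elems`, and `dtype` come from the surrounding code, and the Python binding is assumed to expose the renamed method):

```python
# Assumes the `nvshmem` name in the diff refers to nvshmem.core.
import nvshmem.core as nvshmem

# nvshmem4py owns the symmetric allocation now; the C++ side only
# assigns and tracks the tensor id.
nvrar_tensor = nvshmem.tensor((num_elems,), dtype=dtype)
nvrar_tensor_id = comm_wrapper.register_external_tensor(nvrar_tensor)
```

This replaces the old single call, where the custom binding both allocated the buffer and assigned the id.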
```python
# This should be idempotent
cuda_dev.set_current()
stream = torch.cuda.current_stream()
```
Here's the same boilerplate again; it would fold into the helper suggested above.
```cmake
# Allow user override via -DCUDA_CCCL_INCLUDE_DIR
set(CUDA_CCCL_INCLUDE_DIR "" CACHE PATH "Path to CUDA CCCL include directory (contains cuda/std)")
set(_CUDA_ROOT "")
if(DEFINED ENV{CUDA_HOME})
```
This is hacky and terrible and there is a better way to do it. In NVSHMEM's source, we handle it like this: https://github.com/NVIDIA/nvshmem/blob/2d7d25f0816235e3c2b51779571ec032606ea0dd/src/device/CMakeLists.txt#L188
```cpp
virtual void free_tensor(uint64_t id) = 0;
// Register an externally-allocated symmetric tensor (e.g., via nvshmem4py)
// Returns a newly assigned tensor id
virtual uint64_t register_external_tensor(torch::Tensor& t) = 0;
```
Here's the renaming I mentioned above.
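One thing the rename surfaces (a sketch, assuming the Python bindings mirror these C++ names; `comm_wrapper` is hypothetical here): since nvshmem4py owns the allocation, `deregister_tensor` presumably only needs to drop the id mapping and leave buffer lifetime to nvshmem4py.

```python
import torch
import nvshmem.core as nvshmem

t = nvshmem.tensor((1024,), dtype=torch.float32)  # allocated by nvshmem4py
tid = comm_wrapper.register_external_tensor(t)    # C++ side assigns the id
# ... run collectives against tid ...
comm_wrapper.deregister_tensor(tid)  # drop the id mapping only (still a TODO below)
del t  # assumption: nvshmem4py reclaims the symmetric buffer once the tensor is released
```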
```cpp
throw std::runtime_error("Failed to allocate signal memory");
uint64_t* seq_num_signal = nullptr;
// TODO:
if (steps_inter_ > 0) {
```
This is just here so my tests would pass on 1 node. Without the check, the single-node case breaks: steps_inter_ is 0, so we'd call calloc with a size of zero, and a zero-size calloc is allowed to return NULL, which the error check then misreads as an allocation failure.
```cpp
void RecursiveLL8Coll::deregister_tensor(uint64_t id) {
    // TODO: Implement

// TODO: Adding this so that I can test on 1-node. Is this valuable?
```
Same here
```cpp
if (!chunk_signal_) {
    throw std::runtime_error("Failed to allocate chunk signal memory");

// TODO: Adding this so that I can test on 1-node. Is this valuable?
if (steps_inter_ > 0) {
```
Same here
```python
uid_gpu = uid_bytes.to(f"cuda:{local_device}")
dist.broadcast(uid_gpu, src=0)
# Initialize NVSHMEM via nvshmem4py using UID method
cuda_dev = Device(local_device)
```
Same boilerplate.
Oh, somehow I missed the notification for this PR last week. I will look over the comments and changes and respond to them as soon as possible.
Posting this here just for discussion and out of my own interest. This PR migrates from custom C/Python bindings to nvshmem4py where it's easy/simple to do so. I'll leave comments with questions around certain specific areas of the code.