TL/NCCL: add user buffer registration via memmap #1260
wfaderhold21 wants to merge 2 commits into openucx:master
Conversation
| Filename | Overview |
|---|---|
| src/components/tl/nccl/tl_nccl_coll.c | Adds ucc_tl_nccl_lazy_register_memh and UBR-aware init_task logic. Correctly addresses prior TOCTOU, realloc-failure atomicity, and v-type buffer size bugs. The lock-release/reacquire pattern around the slow NCCL call is sound; the double-check re-scan is correct. No new critical issues found in this iteration. |
| src/components/tl/nccl/tl_nccl_context.c | Implements mem_map, mem_unmap, and memh_pack for NCCL UBR. The IMPORT path now reads from pack_buffer instead of treating memh->address/len as local addresses, and correctly uses memh->local_address/local_len. However, a silent skip when the packed NCCL entry size mismatches is error-prone, and the zero-length guard in IMPORT mode uses the wrong field (memh->len instead of memh->local_len). |
| src/components/tl/nccl/tl_nccl_team.c | Adds ucc_tl_nccl_team_deregister_memhs that correctly avoids holding spinlocks during slow NCCL calls by collecting pairs first. Critical issue: NULL slots are written but num_comms is never decremented, causing unbounded array growth with each team create/destroy cycle against a long-lived memh. |
| src/components/tl/nccl/tl_nccl.h | Adds ucc_tl_nccl_memh_data_t struct with spinlock, list link, and comm/handle bookkeeping arrays; adds ubr_available flag and memh_list/memh_lock to the context. Clean, well-structured additions. |
| src/components/tl/nccl/tl_nccl.c | Adds ENABLE_UBR ternary config option. Straightforward config table addition with correct UCC_CONFIG_TYPE_TERNARY type. |
| src/core/ucc_context.h | Adds local_address/local_len fields to ucc_mem_map_memh_t to carry the importer's local VA separately from the remote VA preserved in address/len. This is an internal struct change, so ABI compatibility is not a concern. The export path allocates with ucc_calloc, so the new fields are correctly zero-initialized on export. |
| src/core/ucc_context.c | Sets local_memh->local_address and local_memh->local_len from import params before TL mem_map calls. Correctly limits to params->segments[0] matching the single-segment export convention. Straightforward and low-risk. |
| test/gtest/core/test_mem_map.cc | Adds three new test cases: reverse unmap order, remote-address-preservation on import, and repeated export/import cycle stress. Tests are well-structured with appropriate GTEST_SKIP for unsupported configurations. |
Comments Outside Diff (2)
- src/components/tl/nccl/tl_nccl_context.c, lines 507-546: IMPORT pack-buffer parsing silently skips the NCCL entry when the packed size mismatches

  When the `strcmp(name, "nccl") == 0` branch is entered but `*psz != sizeof(void *) + sizeof(size_t)`, the `if` condition is false and the loop falls through to the `offset += ...` update, silently skipping the NCCL entry. After the loop, `m_data->address` is still `NULL`, causing the function to return `UCC_ERR_INVALID_PARAM` with no indication of why. This can happen if the exporter and importer were built against different NCCL TL pack formats. The parse should treat a size mismatch on the NCCL entry as a distinct failure case with a clear error message:

  ```c
  if (strcmp(name, "nccl") == 0) {
      if (*psz != sizeof(void *) + sizeof(size_t)) {
          tl_error(ctx->super.super.lib,
                   "NCCL UBR: unexpected packed size %zu (expected %zu)",
                   *psz, sizeof(void *) + sizeof(size_t));
          ucc_spinlock_destroy(&m_data->lock);
          ucc_free(m_data);
          return UCC_ERR_INVALID_PARAM;
      }
      /* unpack address and length */
      ...
      break;
  }
  ```
- src/components/tl/nccl/tl_nccl_context.c, lines 479-483: IMPORT mode zero-length check uses `memh->len` (remote VA length) instead of `memh->local_len` (local VA length)

  The early rejection at lines 486-490 (`if (memh->len == 0)`) guards against zero-length exports, which is correct for EXPORT mode. For IMPORT mode, however, the field that determines whether the local CUDA buffer is usable is `memh->local_len` (set by the core from `params->segments[0].len` just before calling into this function). A caller could legitimately import a non-zero-length remote buffer while providing a zero-length local segment (`params->segments[0].len == 0`), which would pass this check and only fail much later at the `if (!m_data->address || m_data->length == 0)` check, with a confusing `UCC_ERR_INVALID_PARAM` return and no error message. Consider also checking `memh->local_len` in IMPORT mode:

  ```c
  if (memh->len == 0 ||
      (mode == UCC_MEM_MAP_MODE_IMPORT && memh->local_len == 0)) {
      tl_debug(ctx->super.super.lib,
               "NCCL UBR: zero-length buffer, skipping mem_map");
      return UCC_ERR_NOT_SUPPORTED;
  }
  ```
Last reviewed commit: "TL/NCCL: add user bu..."
```c
    if (!new_comms) {
        tl_error(
            UCC_TL_TEAM_LIB(team),
            "failed to allocate memory for registered comms array");
        /* Buffer is registered but we can't track it - this is a problem */
        return UCC_ERR_NO_MEMORY;
    }
    m_data->registered_comms = new_comms;

    new_handles = (void **)ucc_realloc(
        m_data->nccl_handles, new_max * sizeof(void *), "nccl_handles");
    if (!new_handles) {
        tl_error(
            UCC_TL_TEAM_LIB(team),
            "failed to allocate memory for NCCL handles array");
        return UCC_ERR_NO_MEMORY;
```
resource leak: if ucc_realloc for new_comms succeeds but the second ucc_realloc for new_handles fails (line 240-244), the newly allocated new_comms is assigned to m_data->registered_comms (line 236) but the function returns UCC_ERR_NO_MEMORY without deregistering the NCCL buffer. This leaks the NCCL registration created at line 208.
Suggested change (the unchanged leading lines are omitted):

```c
    if (!new_handles) {
        /* Failed to grow handles array - must deregister to avoid leak */
        ncclCommDeregister(team->nccl_comm, nccl_handle);
        tl_error(
            UCC_TL_TEAM_LIB(team),
            "failed to allocate memory for NCCL handles array");
        return UCC_ERR_NO_MEMORY;
    }
```
```c
    status = ucc_tl_nccl_lazy_register_memh(
        coll_args->args.src.info.buffer,
        coll_args->args.src.info.count *
            ucc_dt_size(coll_args->args.src.info.datatype),
```
buffer size calculation is incorrect for variable-size collectives (ALLTOALLV, ALLGATHERV, GATHERV, SCATTERV). For these collectives, coll_args->args.src uses the info_v union member with counts* array, not info.count scalar. Accessing .info.count reads the counts pointer as an integer, resulting in incorrect buffer size. Need to check coll_args->args.coll_type and use appropriate buffer size calculation for each collective type.
Suggested change:

```c
    // TODO: Need to calculate correct buffer size based on coll_type
    // For variable-size collectives (ALLTOALLV, ALLGATHERV, etc),
    // must sum counts array instead of using single count value
    status = ucc_tl_nccl_lazy_register_memh(
        coll_args->args.src.info.buffer,
        coll_args->args.src.info.count *
            ucc_dt_size(coll_args->args.src.info.datatype),
```
```c
    status = ucc_tl_nccl_lazy_register_memh(
        coll_args->args.dst.info.buffer,
        coll_args->args.dst.info.count *
            ucc_dt_size(coll_args->args.dst.info.datatype),
```
same buffer size calculation bug for dst buffer - incorrect for variable-size collectives that use info_v.counts* array instead of info.count scalar
```c
    buf_start = (uintptr_t)buffer;
    buf_end   = buf_start + length;
```
check for potential integer overflow when buf_start + length exceeds UINTPTR_MAX. Should validate that length <= UINTPTR_MAX - buf_start before addition to prevent wraparound
Suggested change:

```c
    buf_start = (uintptr_t)buffer;
    if (length > UINTPTR_MAX - buf_start) {
        tl_error(UCC_TL_TEAM_LIB(team), "NCCL UBR: buffer size causes overflow");
        return UCC_ERR_INVALID_PARAM;
    }
    buf_end = buf_start + length;
```
```c
    /* If UBR is not available/disabled or no TL data, return empty pack */
    if (!ctx->ubr_available || !tl_h || !tl_h->tl_data) {
        tl_h->packed_size = 0;
        *pack_buffer = NULL;
        return UCC_OK;
```
Null deref on pack
In ucc_tl_nccl_memh_pack, the early-return condition includes !tl_h, but the block still does tl_h->packed_size = 0. If tl_h == NULL this will crash. This can trigger if the iface is ever invoked with a NULL tl_h (the function explicitly allows it in the condition), so the guard should avoid dereferencing tl_h in that path.
```diff
 ucc_status_t ucc_tl_nccl_mem_map(
     const ucc_base_context_t *context, ucc_mem_map_mode_t mode,
     ucc_mem_map_memh_t *memh, ucc_mem_map_tl_t *tl_h)
 {
-    return UCC_ERR_NOT_SUPPORTED;
+    ucc_tl_nccl_context_t *ctx = ucc_derived_of(context, ucc_tl_nccl_context_t);
+    ucc_tl_nccl_memh_data_t *m_data;
+
+    /* Check if UBR is available and enabled */
+    if (!ctx->ubr_available) {
+        tl_debug(
+            ctx->super.super.lib, "NCCL UBR not available, skipping mem_map");
+        return UCC_ERR_NOT_SUPPORTED;
+    }
+
+    /* Support both EXPORT and IMPORT modes for global memh */
+    if (mode != UCC_MEM_MAP_MODE_EXPORT && mode != UCC_MEM_MAP_MODE_IMPORT) {
+        tl_debug(ctx->super.super.lib,
+                 "NCCL UBR: unsupported mode %d", mode);
+        return UCC_ERR_NOT_SUPPORTED;
+    }
+
+    /* Reject zero-length buffers */
+    if (memh->len == 0) {
+        tl_debug(ctx->super.super.lib,
+                 "NCCL UBR: zero-length buffer, skipping mem_map");
+        return UCC_ERR_NOT_SUPPORTED;
+    }
+
+    /* Allocate TL-specific memory handle data */
+    m_data = (ucc_tl_nccl_memh_data_t *)ucc_calloc(
+        1, sizeof(ucc_tl_nccl_memh_data_t), "tl_nccl_memh_data");
+    if (!m_data) {
+        tl_error(
+            ctx->super.super.lib, "failed to allocate TL memory handle data");
+        return UCC_ERR_NO_MEMORY;
+    }
+
+    /* Store buffer information - registration will happen lazily on first use */
+    m_data->address = memh->address;
+    m_data->length = memh->len;
+    m_data->registered_comms = NULL;
+    m_data->nccl_handles = NULL;
+    m_data->num_comms = 0;
+    m_data->max_comms = 0;
+
+    /* Set TL handle data */
+    tl_h->tl_data = m_data;
+    strncpy(tl_h->tl_name, "nccl", UCC_MEM_MAP_TL_NAME_LEN - 1);
+    tl_h->tl_name[UCC_MEM_MAP_TL_NAME_LEN - 1] = '\0';
```
IMPORT memh uses garbage
ucc_tl_nccl_mem_map claims to support UCC_MEM_MAP_MODE_IMPORT, but it ignores the packed payload in memh->pack_buffer and instead stores memh->address/memh->len into m_data. In the core import path, those fields are explicitly treated as “likely garbage” for imported handles (src/core/ucc_context.c:1212-1214), so the imported NCCL TL handle will contain invalid address/length and later lazy registration may attempt ncclCommRegister on bogus pointers.
```c
    /* Verify that the entire buffer is within the registered memory region */
    buf_start = (uintptr_t)buffer;
    buf_end = buf_start + length;
    region_start = (uintptr_t)m_data->address;
    region_end = region_start + m_data->length;

    if (length > (UINTPTR_MAX - buf_start)) {
        tl_error(UCC_TL_TEAM_LIB(team), "NCCL UBR: buffer size causes overflow");
        return UCC_ERR_INVALID_PARAM;
    }
```
Overflow check too late
In ucc_tl_nccl_lazy_register_memh, buf_end is computed as buf_start + length before checking length <= UINTPTR_MAX - buf_start. If the addition overflows, buf_end is already wrapped/UB, and the subsequent bounds check can misbehave. The overflow guard needs to run before computing buf_end.
```c
    /* For NCCL UBR, we only store metadata (address/length) for lazy registration.
     * When ncclCommRegister is called later, it stores this metadata locally.
     * The NCCL communicator handles IPC handle exchange internally during collective
     * operations (via point-to-point proxy calls), so we don't need special IMPORT
     * handling. We can use memh->address/memh->len directly in both EXPORT and IMPORT
     * modes - the address should be valid in the current process context. */
    m_data->address = memh->address;
    m_data->length = memh->len;
```
For IMPORT mode, verify whether using memh->address/memh->len directly is correct. The core code treats these as "likely garbage" for imported handles (src/core/ucc_context.c:1212-1214), suggesting TLs should extract data from pack_buffer. If NCCL's IPC model truly makes local addresses valid for IMPORT mode (different from typical TL behavior), add documentation explaining this design choice. Otherwise, implement extraction from pack_buffer similar to the core's approach in ucc_context.c:1224-1238.
```c
    if (ucc_unlikely(status != UCC_OK)) {
        tl_error(
            team->context->lib,
            "NCCL UBR: lazy_register failed with status %d",
            status);
        ucc_mpool_put(task);
        return status;
    }
```
CUDA event leak on UBR registration failure
Both UBR error paths (src at line 357 and dst at line 395) call ucc_mpool_put(task) directly after a failed ucc_tl_nccl_lazy_register_memh. By this point in the function, task->completed may already hold a valid CUDA event (created at line 309 when sync_type == UCC_TL_NCCL_COMPLETION_SYNC_TYPE_EVENT). Putting the task back to the pool without destroying the event leaks it.
The fix is to destroy the event before returning, mirroring the cleanup in ucc_tl_nccl_free_task:
Suggested change:

```c
    if (ucc_unlikely(status != UCC_OK)) {
        tl_error(
            team->context->lib,
            "NCCL UBR: lazy_register failed with status %d",
            status);
        if (task->completed) {
            ucc_ec_destroy_event(task->completed, UCC_EE_CUDA_STREAM);
        }
        ucc_mpool_put(task);
        return status;
    }
```
The same fix is required for the dst-buffer error path at line 394–397.
```c
    region_start = (uintptr_t)m_data->address;
    region_end   = region_start + m_data->length;
```
Missing overflow check for region_end
An overflow guard was added for buf_start + length (line 166), but the symmetric computation region_start + m_data->length has no such check. If m_data->address + m_data->length wraps around UINTPTR_MAX, region_end becomes a small value. The subsequent bounds test buf_end > region_end would then be spuriously false for any legitimately-contained buffer, meaning a future out-of-bounds access would be missed. The same pattern used for the buffer should be applied here:
```c
    if (m_data->length > (UINTPTR_MAX - (uintptr_t)m_data->address)) {
        tl_error(UCC_TL_TEAM_LIB(team),
                 "NCCL UBR: registered region size causes overflow");
        return UCC_ERR_INVALID_PARAM;
    }
    region_start = (uintptr_t)m_data->address;
    region_end = region_start + m_data->length;
```

```c
    ucc_spin_lock(&ctx->memh_lock);
    ucc_list_for_each(m_data, &ctx->memh_list, list_elem) {
        ucc_spin_lock(&m_data->lock);
        for (i = 0; i < m_data->num_comms; i++) {
            if (m_data->registered_comms[i] == team->nccl_comm) {
                ncclCommDeregister(team->nccl_comm, m_data->nccl_handles[i]);
                /* Null the slot so mem_unmap skips it */
                m_data->registered_comms[i] = NULL;
                m_data->nccl_handles[i] = NULL;
                break;
            }
        }
        ucc_spin_unlock(&m_data->lock);
    }
    ucc_spin_unlock(&ctx->memh_lock);
```
Spinlock held during blocking NCCL deregistration calls
ctx->memh_lock is a spinlock, yet ncclCommDeregister (a potentially slow synchronous NCCL/CUDA call) is invoked on line 86 while both ctx->memh_lock and m_data->lock are still held. Any concurrent call to mem_map, mem_unmap, or lazy_register_memh that needs either of these locks will busy-spin for the entire duration of all deregistrations. In a multi-GPU setup with many registered buffers this can waste significant CPU cycles and may even deadlock CUDA if the same CPU thread is required for NCCL progress.
The fix is to collect the (comm, handle) pairs while holding the locks, then release both locks before calling ncclCommDeregister:
```c
/* collect pairs under lock */
ucc_spin_lock(&ctx->memh_lock);
ucc_list_for_each(m_data, &ctx->memh_list, list_elem) {
    ucc_spin_lock(&m_data->lock);
    for (i = 0; i < m_data->num_comms; i++) {
        if (m_data->registered_comms[i] == team->nccl_comm) {
            /* save for deregistration outside locks */
            pending[npending].comm = m_data->registered_comms[i];
            pending[npending].handle = m_data->nccl_handles[i];
            m_data->registered_comms[i] = NULL;
            m_data->nccl_handles[i] = NULL;
            npending++;
            break;
        }
    }
    ucc_spin_unlock(&m_data->lock);
}
ucc_spin_unlock(&ctx->memh_lock);

/* deregister outside of any spinlock */
for (i = 0; i < npending; i++) {
    ncclCommDeregister(pending[i].comm, pending[i].handle);
}
```

```c
    m_data->registered_comms = new_comms;

    new_handles = (void **)ucc_realloc(
        m_data->nccl_handles, new_max * sizeof(void *), "nccl_handles");
    if (!new_handles) {
        ucc_spin_unlock(&m_data->lock);
        tl_error(
            UCC_TL_TEAM_LIB(team),
            "failed to allocate memory for NCCL handles array");
        ncclCommDeregister(team->nccl_comm, nccl_handle);
        return UCC_ERR_NO_MEMORY;
    }
    m_data->nccl_handles = new_handles;
    m_data->max_comms = new_max;
```
m_data->max_comms not updated when nccl_handles realloc fails, leaving arrays inconsistent
When new_comms realloc succeeds (line 241-252), m_data->registered_comms is immediately updated to the larger buffer (line 253). If the subsequent new_handles realloc fails (line 255-264), the function returns UCC_ERR_NO_MEMORY without updating m_data->max_comms to new_max.
This leaves the struct in an inconsistent state:
- `m_data->registered_comms` has capacity `new_max`
- `m_data->nccl_handles` still has capacity `old_max`
- `m_data->max_comms` still reports `old_max`
On the next successful registration, both arrays are grown from old_max to 2 * old_max again (which equals new_max again), so correctness is self-recovering. However, the window of inconsistency – between the failed call and the next successful call – means any code that assumes registered_comms and nccl_handles have the same allocated capacity as max_comms will be wrong, making the logic harder to reason about.
The safest fix is to update m_data->max_comms = new_max immediately after the successful new_comms realloc, so the state always reflects the actual registered_comms capacity, even if nccl_handles is still at the old size.
b17ed9b to 8efb81b
Greptile encountered an error while reviewing this PR. Please reach out to support@greptile.com for assistance.

What
This PR implements NCCL User Buffer Registration (UBR) in TL/NCCL. It replaces the three stub functions (mem_map, mem_unmap, memh_pack) that previously returned UCC_ERR_NOT_SUPPORTED with an implementation that registers GPU memory buffers with NCCL communicators.
Why
NCCL UBR (available since NCCL 2.19.0) lets applications pre-register GPU buffers so NCCL can bypass its internal memory registration on each collective call. Without this, NCCL must re-pin and register the buffer every time it's used in a collective, which adds latency. This is particularly impactful for AI/ML workloads where the same buffers (e.g., model gradients) are used in AllReduce repeatedly across training steps. This PR wires UCC's existing mem_map API into NCCL's ncclCommRegister/ncclCommDeregister, enabling that optimization.
How
- mem_map only stores buffer metadata; no NCCL registration happens yet. The actual ncclCommRegister call is deferred to the first collective operation that uses the buffer (ucc_tl_nccl_lazy_register_memh in tl_nccl_coll.c). This avoids registering with comms that never use the buffer.
- Each memh carries bookkeeping arrays holding one entry per NCCL communicator that has registered the buffer, growing as needed.
- NCCL calls (ncclCommRegister/ncclCommDeregister) are made outside spinlocks to avoid holding them during potentially slow CUDA operations.
- Team destroy deregisters the handles belonging to the dying comm before calling ncclCommDestroy, preventing use-after-free in a later mem_unmap. The reverse order (mem_unmap before team_destroy) is also handled cleanly.
- memh_pack serializes the buffer address and length; mem_map in import mode unpacks the local address from the blob.
- The feature is guarded by version gating against NCCL 2.19.0 at compile time.