TL/NCCL: add user buffer registration via memmap #1260

Open
wfaderhold21 wants to merge 2 commits into openucx:master from wfaderhold21:topic/nccl-ub

Conversation


@wfaderhold21 wfaderhold21 commented Jan 30, 2026

What

This PR implements NCCL User Buffer Registration (UBR) in TL/NCCL. It replaces the three stub functions (mem_map, mem_unmap, memh_pack) that previously returned UCC_ERR_NOT_SUPPORTED with an implementation that registers GPU memory buffers with NCCL communicators.

Why

NCCL UBR (available since NCCL 2.19.0) lets applications pre-register GPU buffers so NCCL can bypass its internal memory registration on each collective call. Without this, NCCL must re-pin and register the buffer every time it's used in a collective, which adds latency. This is particularly impactful for AI/ML workloads where the same buffers (e.g., model gradients) are used in AllReduce repeatedly across training steps. This PR wires UCC's existing mem_map API into NCCL's ncclCommRegister/ncclCommDeregister, enabling that optimization.

How

  • Lazy registration: mem_map only allocates a ucc_tl_nccl_memh_data_t tracking struct — no NCCL call
    yet. The actual ncclCommRegister call is deferred to the first collective operation that uses the
    buffer (ucc_tl_nccl_lazy_register_memh in tl_nccl_coll.c). This avoids registering with comms that
    never use the buffer.
  • Per-comm tracking: Each memh_data maintains dynamic arrays of (ncclComm_t, handle) pairs — one entry
    per NCCL communicator that has registered the buffer — growing as needed.
  • Thread safety: Spinlocks guard both the per-handle arrays and the context-level handle list. The
    NCCL calls (ncclCommRegister/ncclCommDeregister) are made outside spinlocks to avoid holding them
    during potentially slow CUDA operations.
  • Clean teardown: team_destroy calls ucc_tl_nccl_team_deregister_memhs to null out all slots for the
    dying comm before calling ncclCommDestroy, preventing use-after-free in a later mem_unmap. The reverse
    order (mem_unmap before team_destroy) is also handled cleanly.
  • Export/import support: memh_pack packs the buffer address and length for cross-process sharing.
    mem_map in import mode unpacks the local address from the blob.
  • Config toggle: A new UCC_TL_NCCL_ENABLE_UBR config option (try/yes/no) controls the feature, with
    version gating against NCCL 2.19.0 at compile time.

@wfaderhold21 (Collaborator, Author)

Tested allgather performance on GAIA with 8 nodes, 8 PPN:
[attached chart: allgather_ubr_bandwidth_clean]


greptile-apps bot commented Jan 30, 2026

Greptile Summary

This PR implements NCCL User Buffer Registration (UBR) via the UCC mem map interface, enabling zero-copy collectives when NCCL ≥ 2.19.0 is available. Registration is intentionally lazy — buffers are registered with ncclCommRegister only at collective-init time — and the deregistration path is carefully restructured to avoid holding spinlocks across slow NCCL/CUDA calls. The import path is substantially improved over the original design: it now correctly parses the exported pack buffer and uses the new local_address/local_len fields in ucc_mem_map_memh_t to carry the importer's local VA independently from the remote VA.

Key changes:

  • tl_nccl_context.c: Full implementations of mem_map (EXPORT + IMPORT), mem_unmap, and memh_pack. IMPORT path reads from pack_buffer and correctly resolves local vs remote addresses.
  • tl_nccl_team.c: ucc_tl_nccl_team_deregister_memhs — collects (comm, handle) pairs under locks then deregisters outside the lock to avoid spinlock-during-NCCL-call. Contains a bug: NULL slots are written to registered_comms[] but num_comms is never decremented, causing arrays to grow unboundedly across team create/destroy cycles.
  • tl_nccl_coll.c: Lazy registration triggered at init_task time; correctly handles V-type (counts/displacements) vs flat buffer collectives. Prior realloc-atomicity bug addressed.
  • ucc_context.h/ucc_context.c: Adds local_address/local_len to ucc_mem_map_memh_t and populates them during ucc_mem_map_import.
  • test_mem_map.cc: Three new tests covering reverse unmap order, remote-address preservation on import, and stress cycling.

Confidence Score: 2/5

  • Not safe to merge — several correctness issues remain including unbounded array growth under repeated team create/destroy, a silent pack-buffer parse mismatch, and an incorrect zero-length guard in IMPORT mode.
  • The PR addresses a number of issues raised in previous review rounds (spinlock-during-NCCL-call, IMPORT address handling, realloc atomicity) but three new issues are confirmed: (1) NULL slots accumulate in registered_comms/nccl_handles without decrementing num_comms, causing unbounded array growth for long-lived memhs with many team create/destroy cycles; (2) the IMPORT pack-buffer parse silently skips the NCCL entry on a size mismatch making diagnosis extremely difficult; (3) the zero-length guard in IMPORT mode checks memh->len (the remote buffer length) rather than memh->local_len (the local buffer length). These issues lower confidence below merge threshold.
  • Affected files: src/components/tl/nccl/tl_nccl_team.c (null-slot accumulation in registered_comms) and src/components/tl/nccl/tl_nccl_context.c (silent pack-buffer skip, wrong zero-length field in IMPORT mode).

Important Files Changed

  • src/components/tl/nccl/tl_nccl_coll.c: Adds ucc_tl_nccl_lazy_register_memh and UBR-aware init_task logic. Correctly addresses prior TOCTOU, realloc-failure atomicity, and v-type buffer size bugs. The lock-release/reacquire pattern around the slow NCCL call is sound; the double-check re-scan is correct. No new critical issues found in this iteration.
  • src/components/tl/nccl/tl_nccl_context.c: Implements mem_map, mem_unmap, and memh_pack for NCCL UBR. The IMPORT path now reads from pack_buffer instead of treating memh->address/len as local addresses, and correctly uses memh->local_address/local_len. However, a silent skip when the packed NCCL entry size mismatches is error-prone, and the zero-length guard in IMPORT mode uses the wrong field (memh->len instead of memh->local_len).
  • src/components/tl/nccl/tl_nccl_team.c: Adds ucc_tl_nccl_team_deregister_memhs that correctly avoids holding spinlocks during slow NCCL calls by collecting pairs first. Critical issue: NULL slots are written but num_comms is never decremented, causing unbounded array growth with each team create/destroy cycle against a long-lived memh.
  • src/components/tl/nccl/tl_nccl.h: Adds ucc_tl_nccl_memh_data_t struct with spinlock, list link, and comm/handle bookkeeping arrays; adds ubr_available flag and memh_list/memh_lock to the context. Clean, well-structured additions.
  • src/components/tl/nccl/tl_nccl.c: Adds ENABLE_UBR ternary config option. Straightforward config table addition with correct UCC_CONFIG_TYPE_TERNARY type.
  • src/core/ucc_context.h: Adds local_address/local_len fields to ucc_mem_map_memh_t to carry the importer's local VA separately from the remote VA preserved in address/len. This is an internal struct change, so ABI compatibility is not a concern. The export path allocates with ucc_calloc, so the new fields are correctly zero-initialized on export.
  • src/core/ucc_context.c: Sets local_memh->local_address and local_memh->local_len from import params before TL mem_map calls. Correctly limits to params->segments[0] matching the single-segment export convention. Straightforward and low-risk.
  • test/gtest/core/test_mem_map.cc: Adds three new test cases: reverse unmap order, remote-address preservation on import, and repeated export/import cycle stress. Tests are well-structured with appropriate GTEST_SKIP for unsupported configurations.

Comments Outside Diff (2)

  1. src/components/tl/nccl/tl_nccl_context.c, lines 507-546

    IMPORT pack-buffer parsing silently skips the NCCL entry when the packed size mismatches

    When the strcmp(name, "nccl") == 0 branch is entered but *psz != sizeof(void *) + sizeof(size_t), the if condition is false and the loop falls through to the offset += ... update, silently skipping the NCCL entry. After the loop, m_data->address is still NULL, causing the function to return UCC_ERR_INVALID_PARAM with no indication of why.

    This can happen if the exporter and importer were built against different NCCL TL pack formats. The parse should treat a size mismatch on the NCCL entry as a distinct failure case with a clear error message:

    if (strcmp(name, "nccl") == 0) {
        if (*psz != sizeof(void *) + sizeof(size_t)) {
            tl_error(ctx->super.super.lib,
                     "NCCL UBR: unexpected packed size %zu (expected %zu)",
                     *psz, sizeof(void *) + sizeof(size_t));
            ucc_spinlock_destroy(&m_data->lock);
            ucc_free(m_data);
            return UCC_ERR_INVALID_PARAM;
        }
        /* unpack address and length */
        ...
        break;
    }
  2. src/components/tl/nccl/tl_nccl_context.c, lines 479-483

    IMPORT mode zero-length check uses memh->len (remote VA length) instead of memh->local_len (local VA length)

    The early rejection at line 486–490 (if (memh->len == 0)) guards against zero-length exports, which is correct for EXPORT mode. For IMPORT mode however, the field that determines whether the local CUDA buffer is usable is memh->local_len (set by the core from params->segments[0].len just before calling into this function). A caller could legitimately import a non-zero-length remote buffer while providing a zero-length local segment (params->segments[0].len == 0), which would pass this check and only fail much later at the if (!m_data->address || m_data->length == 0) check — with a confusing UCC_ERR_INVALID_PARAM return and no error message.

    Consider also checking memh->local_len in IMPORT mode:

    if (memh->len == 0 ||
        (mode == UCC_MEM_MAP_MODE_IMPORT && memh->local_len == 0)) {
        tl_debug(ctx->super.super.lib,
                 "NCCL UBR: zero-length buffer, skipping mem_map");
        return UCC_ERR_NOT_SUPPORTED;
    }

Last reviewed commit: "TL/NCCL: add user bu..."

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

Comment on lines +229 to +244:

    if (!new_comms) {
        tl_error(
            UCC_TL_TEAM_LIB(team),
            "failed to allocate memory for registered comms array");
        /* Buffer is registered but we can't track it - this is a problem */
        return UCC_ERR_NO_MEMORY;
    }
    m_data->registered_comms = new_comms;

    new_handles = (void **)ucc_realloc(
        m_data->nccl_handles, new_max * sizeof(void *), "nccl_handles");
    if (!new_handles) {
        tl_error(
            UCC_TL_TEAM_LIB(team),
            "failed to allocate memory for NCCL handles array");
        return UCC_ERR_NO_MEMORY;

resource leak: if ucc_realloc for new_comms succeeds but the second ucc_realloc for new_handles fails (lines 240-244), the newly allocated new_comms is assigned to m_data->registered_comms (line 236) but the function returns UCC_ERR_NO_MEMORY without deregistering the NCCL buffer. This leaks the NCCL registration created at line 208.

Suggested change (deregister before returning):

    if (!new_handles) {
        /* Failed to grow handles array - must deregister to avoid leak */
        ncclCommDeregister(team->nccl_comm, nccl_handle);
        tl_error(
            UCC_TL_TEAM_LIB(team),
            "failed to allocate memory for NCCL handles array");
        return UCC_ERR_NO_MEMORY;
    }

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 3 comments

Comment on lines +344 to +347:

    status = ucc_tl_nccl_lazy_register_memh(
        coll_args->args.src.info.buffer,
        coll_args->args.src.info.count *
            ucc_dt_size(coll_args->args.src.info.datatype),

buffer size calculation is incorrect for variable-size collectives (ALLTOALLV, ALLGATHERV, GATHERV, SCATTERV). For these collectives, coll_args->args.src uses the info_v union member with a counts array, not the info.count scalar. Accessing .info.count reads the counts pointer as an integer, resulting in an incorrect buffer size. Need to check coll_args->args.coll_type and use the appropriate buffer size calculation for each collective type.

Suggested change (flagging the gap):

    // TODO: Need to calculate correct buffer size based on coll_type.
    // For variable-size collectives (ALLTOALLV, ALLGATHERV, etc),
    // must sum counts array instead of using single count value.
    status = ucc_tl_nccl_lazy_register_memh(
        coll_args->args.src.info.buffer,
        coll_args->args.src.info.count *
            ucc_dt_size(coll_args->args.src.info.datatype),

Comment on lines +368 to +371:

    status = ucc_tl_nccl_lazy_register_memh(
        coll_args->args.dst.info.buffer,
        coll_args->args.dst.info.count *
            ucc_dt_size(coll_args->args.dst.info.datatype),

same buffer size calculation bug for the dst buffer - incorrect for variable-size collectives that use the info_v.counts array instead of the info.count scalar

Comment on lines +167 to +168:

    buf_start = (uintptr_t)buffer;
    buf_end = buf_start + length;

check for potential integer overflow when buf_start + length exceeds UINTPTR_MAX. Should validate that length <= UINTPTR_MAX - buf_start before the addition to prevent wraparound

Suggested change:

    buf_start = (uintptr_t)buffer;
    if (length > UINTPTR_MAX - buf_start) {
        tl_error(UCC_TL_TEAM_LIB(team), "NCCL UBR: buffer size causes overflow");
        return UCC_ERR_INVALID_PARAM;
    }
    buf_end = buf_start + length;

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 3 comments

Comment on lines +433 to +438:

    /* If UBR is not available/disabled or no TL data, return empty pack */
    if (!ctx->ubr_available || !tl_h || !tl_h->tl_data) {
        tl_h->packed_size = 0;
        *pack_buffer = NULL;
        return UCC_OK;

Null deref on pack

In ucc_tl_nccl_memh_pack, the early-return condition includes !tl_h, but the block still does tl_h->packed_size = 0. If tl_h == NULL this will crash. This can trigger if the iface is ever invoked with a NULL tl_h (the function explicitly allows it in the condition), so the guard should avoid dereferencing tl_h in that path.

Comment on lines +290 to +339:

    ucc_status_t ucc_tl_nccl_mem_map(
        const ucc_base_context_t *context, ucc_mem_map_mode_t mode,
        ucc_mem_map_memh_t *memh, ucc_mem_map_tl_t *tl_h)
    {
        ucc_tl_nccl_context_t *ctx = ucc_derived_of(context, ucc_tl_nccl_context_t);
        ucc_tl_nccl_memh_data_t *m_data;

        /* Check if UBR is available and enabled */
        if (!ctx->ubr_available) {
            tl_debug(
                ctx->super.super.lib, "NCCL UBR not available, skipping mem_map");
            return UCC_ERR_NOT_SUPPORTED;
        }

        /* Support both EXPORT and IMPORT modes for global memh */
        if (mode != UCC_MEM_MAP_MODE_EXPORT && mode != UCC_MEM_MAP_MODE_IMPORT) {
            tl_debug(ctx->super.super.lib,
                     "NCCL UBR: unsupported mode %d", mode);
            return UCC_ERR_NOT_SUPPORTED;
        }

        /* Reject zero-length buffers */
        if (memh->len == 0) {
            tl_debug(ctx->super.super.lib,
                     "NCCL UBR: zero-length buffer, skipping mem_map");
            return UCC_ERR_NOT_SUPPORTED;
        }

        /* Allocate TL-specific memory handle data */
        m_data = (ucc_tl_nccl_memh_data_t *)ucc_calloc(
            1, sizeof(ucc_tl_nccl_memh_data_t), "tl_nccl_memh_data");
        if (!m_data) {
            tl_error(
                ctx->super.super.lib, "failed to allocate TL memory handle data");
            return UCC_ERR_NO_MEMORY;
        }

        /* Store buffer information - registration will happen lazily on first use */
        m_data->address = memh->address;
        m_data->length = memh->len;
        m_data->registered_comms = NULL;
        m_data->nccl_handles = NULL;
        m_data->num_comms = 0;
        m_data->max_comms = 0;

        /* Set TL handle data */
        tl_h->tl_data = m_data;
        strncpy(tl_h->tl_name, "nccl", UCC_MEM_MAP_TL_NAME_LEN - 1);
        tl_h->tl_name[UCC_MEM_MAP_TL_NAME_LEN - 1] = '\0';

IMPORT memh uses garbage

ucc_tl_nccl_mem_map claims to support UCC_MEM_MAP_MODE_IMPORT, but it ignores the packed payload in memh->pack_buffer and instead stores memh->address/memh->len into m_data. In the core import path, those fields are explicitly treated as "likely garbage" for imported handles (src/core/ucc_context.c:1212-1214), so the imported NCCL TL handle will contain invalid address/length and later lazy registration may attempt ncclCommRegister on bogus pointers.

Comment on lines +166 to +175:

    /* Verify that the entire buffer is within the registered memory region */
    buf_start = (uintptr_t)buffer;
    buf_end = buf_start + length;
    region_start = (uintptr_t)m_data->address;
    region_end = region_start + m_data->length;

    if (length > (UINTPTR_MAX - buf_start)) {
        tl_error(UCC_TL_TEAM_LIB(team), "NCCL UBR: buffer size causes overflow");
        return UCC_ERR_INVALID_PARAM;
    }

Overflow check too late

In ucc_tl_nccl_lazy_register_memh, buf_end is computed as buf_start + length before checking length <= UINTPTR_MAX - buf_start. If the addition overflows, buf_end is already wrapped/UB, and the subsequent bounds check can misbehave. The overflow guard needs to run before computing buf_end.

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 1 comment

Comment on lines +327 to +334:

    /* For NCCL UBR, we only store metadata (address/length) for lazy registration.
     * When ncclCommRegister is called later, it stores this metadata locally.
     * The NCCL communicator handles IPC handle exchange internally during collective
     * operations (via point-to-point proxy calls), so we don't need special IMPORT
     * handling. We can use memh->address/memh->len directly in both EXPORT and IMPORT
     * modes - the address should be valid in the current process context. */
    m_data->address = memh->address;
    m_data->length = memh->len;

For IMPORT mode, verify whether using memh->address/memh->len directly is correct. The core code treats these as "likely garbage" for imported handles (src/core/ucc_context.c:1212-1214), suggesting TLs should extract data from pack_buffer. If NCCL's IPC model truly makes local addresses valid for IMPORT mode (different from typical TL behavior), add documentation explaining this design choice. Otherwise, implement extraction from pack_buffer similar to the core's approach in ucc_context.c:1224-1238.

Comment on lines +357 to +364:

    if (ucc_unlikely(status != UCC_OK)) {
        tl_error(
            team->context->lib,
            "NCCL UBR: lazy_register failed with status %d",
            status);
        ucc_mpool_put(task);
        return status;
    }

CUDA event leak on UBR registration failure

Both UBR error paths (src at line 357 and dst at line 395) call ucc_mpool_put(task) directly after a failed ucc_tl_nccl_lazy_register_memh. By this point in the function, task->completed may already hold a valid CUDA event (created at line 309 when sync_type == UCC_TL_NCCL_COMPLETION_SYNC_TYPE_EVENT). Putting the task back to the pool without destroying the event leaks it.

The fix is to destroy the event before returning, mirroring the cleanup in ucc_tl_nccl_free_task:

    if (ucc_unlikely(status != UCC_OK)) {
        tl_error(
            team->context->lib,
            "NCCL UBR: lazy_register failed with status %d",
            status);
        if (task->completed) {
            ucc_ec_destroy_event(task->completed, UCC_EE_CUDA_STREAM);
        }
        ucc_mpool_put(task);
        return status;
    }

The same fix is required for the dst-buffer error path at lines 394-397.

Comment on lines +174 to +175:

    region_start = (uintptr_t)m_data->address;
    region_end = region_start + m_data->length;

Missing overflow check for region_end

An overflow guard was added for buf_start + length (line 166), but the symmetric computation region_start + m_data->length has no such check. If m_data->address + m_data->length wraps around UINTPTR_MAX, region_end becomes a small value, and the subsequent bounds test buf_end > region_end misbehaves for legitimately-contained buffers. The same pattern used for the buffer should be applied here:

    if (m_data->length > (UINTPTR_MAX - (uintptr_t)m_data->address)) {
        tl_error(UCC_TL_TEAM_LIB(team),
                 "NCCL UBR: registered region size causes overflow");
        return UCC_ERR_INVALID_PARAM;
    }
    region_start = (uintptr_t)m_data->address;
    region_end   = region_start + m_data->length;

Comment on lines +81 to +95:

    ucc_spin_lock(&ctx->memh_lock);
    ucc_list_for_each(m_data, &ctx->memh_list, list_elem) {
        ucc_spin_lock(&m_data->lock);
        for (i = 0; i < m_data->num_comms; i++) {
            if (m_data->registered_comms[i] == team->nccl_comm) {
                ncclCommDeregister(team->nccl_comm, m_data->nccl_handles[i]);
                /* Null the slot so mem_unmap skips it */
                m_data->registered_comms[i] = NULL;
                m_data->nccl_handles[i] = NULL;
                break;
            }
        }
        ucc_spin_unlock(&m_data->lock);
    }
    ucc_spin_unlock(&ctx->memh_lock);

P1: Spinlock held during blocking NCCL deregistration calls

ctx->memh_lock is a spinlock, yet ncclCommDeregister (a potentially slow synchronous NCCL/CUDA call) is invoked on line 86 while both ctx->memh_lock and m_data->lock are still held. Any concurrent call to mem_map, mem_unmap, or lazy_register_memh that needs either of these locks will busy-spin for the entire duration of all deregistrations. In a multi-GPU setup with many registered buffers this can waste significant CPU cycles and may even deadlock CUDA if the same CPU thread is required for NCCL progress.

The fix is to collect the (comm, handle) pairs while holding the locks, then release both locks before calling ncclCommDeregister:

    /* pending[] is an array of {ncclComm_t comm; void *handle;} pairs
     * sized to the number of registered memhs; npending starts at 0 */
    ucc_spin_lock(&ctx->memh_lock);
    ucc_list_for_each(m_data, &ctx->memh_list, list_elem) {
        ucc_spin_lock(&m_data->lock);
        for (i = 0; i < m_data->num_comms; i++) {
            if (m_data->registered_comms[i] == team->nccl_comm) {
                /* save for deregistration outside locks */
                pending[npending].comm   = m_data->registered_comms[i];
                pending[npending].handle = m_data->nccl_handles[i];
                m_data->registered_comms[i] = NULL;
                m_data->nccl_handles[i]     = NULL;
                npending++;
                break;
            }
        }
        ucc_spin_unlock(&m_data->lock);
    }
    ucc_spin_unlock(&ctx->memh_lock);

    /* deregister outside of any spinlock */
    for (i = 0; i < npending; i++) {
        ncclCommDeregister(pending[i].comm, pending[i].handle);
    }

Comment on lines +253 to +266:

    m_data->registered_comms = new_comms;

    new_handles = (void **)ucc_realloc(
        m_data->nccl_handles, new_max * sizeof(void *), "nccl_handles");
    if (!new_handles) {
        ucc_spin_unlock(&m_data->lock);
        tl_error(
            UCC_TL_TEAM_LIB(team),
            "failed to allocate memory for NCCL handles array");
        ncclCommDeregister(team->nccl_comm, nccl_handle);
        return UCC_ERR_NO_MEMORY;
    }
    m_data->nccl_handles = new_handles;
    m_data->max_comms = new_max;

P1: m_data->max_comms not updated when nccl_handles realloc fails, leaving arrays inconsistent

When the new_comms realloc succeeds (lines 241-252), m_data->registered_comms is immediately updated to the larger buffer (line 253). If the subsequent new_handles realloc fails (lines 255-264), the function returns UCC_ERR_NO_MEMORY without updating m_data->max_comms to new_max.

This leaves the struct in an inconsistent state:

  • m_data->registered_comms has capacity new_max
  • m_data->nccl_handles still has capacity old_max
  • m_data->max_comms still reports old_max

On the next successful registration, both arrays are grown from old_max to 2 * old_max again (which equals new_max), so correctness is self-recovering. However, the window of inconsistency, between the failed call and the next successful call, means any code that assumes registered_comms and nccl_handles have the same allocated capacity as max_comms will be wrong, making the logic harder to reason about.

The safest fix is to update m_data->max_comms = new_max immediately after the successful new_comms realloc, so the state always reflects the actual registered_comms capacity, even if nccl_handles is still at the old size.


greptile-apps bot commented Mar 20, 2026

Greptile encountered an error while reviewing this PR. Please reach out to support@greptile.com for assistance.
