TL/NCCL: add user buffer registration via memmap #1260
wfaderhold21 wants to merge 2 commits into openucx:master
Conversation
| Filename | Overview |
|---|---|
| src/components/tl/nccl/tl_nccl_coll.c | Adds ucc_tl_nccl_lazy_register_memh and UBR-aware init_task logic. Correctly addresses prior TOCTOU, realloc-failure atomicity, and v-type buffer size bugs. The lock-release/reacquire pattern around the slow NCCL call is sound; the double-check re-scan is correct. No new critical issues found in this iteration. |
| src/components/tl/nccl/tl_nccl_context.c | Implements mem_map, mem_unmap, and memh_pack for NCCL UBR. The IMPORT path now reads from pack_buffer instead of treating memh->address/len as local addresses, and correctly uses memh->local_address/local_len. However, a silent skip when the packed NCCL entry size mismatches is error-prone, and the zero-length guard in IMPORT mode uses the wrong field (memh->len instead of memh->local_len). |
| src/components/tl/nccl/tl_nccl_team.c | Adds ucc_tl_nccl_team_deregister_memhs that correctly avoids holding spinlocks during slow NCCL calls by collecting pairs first. Critical issue: NULL slots are written but num_comms is never decremented, causing unbounded array growth with each team create/destroy cycle against a long-lived memh. |
| src/components/tl/nccl/tl_nccl.h | Adds ucc_tl_nccl_memh_data_t struct with spinlock, list link, and comm/handle bookkeeping arrays; adds ubr_available flag and memh_list/memh_lock to the context. Clean, well-structured additions. |
| src/components/tl/nccl/tl_nccl.c | Adds ENABLE_UBR ternary config option. Straightforward config table addition with correct UCC_CONFIG_TYPE_TERNARY type. |
| src/core/ucc_context.h | Adds local_address/local_len fields to ucc_mem_map_memh_t to carry the importer's local VA separately from the remote VA preserved in address/len. This is an internal struct change, so ABI compatibility is not a concern. The export path allocates with ucc_calloc, so the new fields are correctly zero-initialized on export. |
| src/core/ucc_context.c | Sets local_memh->local_address and local_memh->local_len from import params before TL mem_map calls. Correctly limits to params->segments[0] matching the single-segment export convention. Straightforward and low-risk. |
| test/gtest/core/test_mem_map.cc | Adds three new test cases: reverse unmap order, remote-address-preservation on import, and repeated export/import cycle stress. Tests are well-structured with appropriate GTEST_SKIP for unsupported configurations. |
Comments Outside Diff (2)
- src/components/tl/nccl/tl_nccl_context.c, lines 507-546: IMPORT pack-buffer parsing silently skips the NCCL entry when the packed size mismatches

  When the `strcmp(name, "nccl") == 0` branch is entered but `*psz != sizeof(void *) + sizeof(size_t)`, the `if` condition is false and the loop falls through to the `offset += ...` update, silently skipping the NCCL entry. After the loop, `m_data->address` is still `NULL`, causing the function to return `UCC_ERR_INVALID_PARAM` with no indication of why. This can happen if the exporter and importer were built against different NCCL TL pack formats. The parse should treat a size mismatch on the NCCL entry as a distinct failure case with a clear error message:

  ```c
  if (strcmp(name, "nccl") == 0) {
      if (*psz != sizeof(void *) + sizeof(size_t)) {
          tl_error(ctx->super.super.lib,
                   "NCCL UBR: unexpected packed size %zu (expected %zu)",
                   *psz, sizeof(void *) + sizeof(size_t));
          ucc_spinlock_destroy(&m_data->lock);
          ucc_free(m_data);
          return UCC_ERR_INVALID_PARAM;
      }
      /* unpack address and length */
      ...
      break;
  }
  ```
- src/components/tl/nccl/tl_nccl_context.c, lines 479-483: IMPORT mode zero-length check uses `memh->len` (remote VA length) instead of `memh->local_len` (local VA length)

  The early rejection at lines 486-490 (`if (memh->len == 0)`) guards against zero-length exports, which is correct for EXPORT mode. For IMPORT mode, however, the field that determines whether the local CUDA buffer is usable is `memh->local_len` (set by the core from `params->segments[0].len` just before calling into this function). A caller could legitimately import a non-zero-length remote buffer while providing a zero-length local segment (`params->segments[0].len == 0`), which would pass this check and only fail much later at the `if (!m_data->address || m_data->length == 0)` check, with a confusing `UCC_ERR_INVALID_PARAM` return and no error message. Consider also checking `memh->local_len` in IMPORT mode:

  ```c
  if (memh->len == 0 ||
      (mode == UCC_MEM_MAP_MODE_IMPORT && memh->local_len == 0)) {
      tl_debug(ctx->super.super.lib,
               "NCCL UBR: zero-length buffer, skipping mem_map");
      return UCC_ERR_NOT_SUPPORTED;
  }
  ```
Last reviewed commit: "TL/NCCL: add user bu..."
```c
    if (!new_comms) {
        tl_error(
            UCC_TL_TEAM_LIB(team),
            "failed to allocate memory for registered comms array");
        /* Buffer is registered but we can't track it - this is a problem */
        return UCC_ERR_NO_MEMORY;
    }
    m_data->registered_comms = new_comms;

    new_handles = (void **)ucc_realloc(
        m_data->nccl_handles, new_max * sizeof(void *), "nccl_handles");
    if (!new_handles) {
        tl_error(
            UCC_TL_TEAM_LIB(team),
            "failed to allocate memory for NCCL handles array");
        return UCC_ERR_NO_MEMORY;
```
resource leak: if ucc_realloc for new_comms succeeds but the second ucc_realloc for new_handles fails (line 240-244), the newly allocated new_comms is assigned to m_data->registered_comms (line 236) but the function returns UCC_ERR_NO_MEMORY without deregistering the NCCL buffer. This leaks the NCCL registration created at line 208.
Suggested change (the unchanged leading lines are omitted):

```c
    if (!new_handles) {
        /* Failed to grow handles array - must deregister to avoid leak */
        ncclCommDeregister(team->nccl_comm, nccl_handle);
        tl_error(
            UCC_TL_TEAM_LIB(team),
            "failed to allocate memory for NCCL handles array");
        return UCC_ERR_NO_MEMORY;
    }
```
```c
    status = ucc_tl_nccl_lazy_register_memh(
        coll_args->args.src.info.buffer,
        coll_args->args.src.info.count *
            ucc_dt_size(coll_args->args.src.info.datatype),
```
buffer size calculation is incorrect for variable-size collectives (ALLTOALLV, ALLGATHERV, GATHERV, SCATTERV). For these collectives, coll_args->args.src uses the info_v union member with counts* array, not info.count scalar. Accessing .info.count reads the counts pointer as an integer, resulting in incorrect buffer size. Need to check coll_args->args.coll_type and use appropriate buffer size calculation for each collective type.
Suggested change:

```c
    // TODO: Need to calculate correct buffer size based on coll_type
    // For variable-size collectives (ALLTOALLV, ALLGATHERV, etc),
    // must sum counts array instead of using single count value
    status = ucc_tl_nccl_lazy_register_memh(
        coll_args->args.src.info.buffer,
        coll_args->args.src.info.count *
            ucc_dt_size(coll_args->args.src.info.datatype),
```
```c
    status = ucc_tl_nccl_lazy_register_memh(
        coll_args->args.dst.info.buffer,
        coll_args->args.dst.info.count *
            ucc_dt_size(coll_args->args.dst.info.datatype),
```
same buffer size calculation bug for dst buffer - incorrect for variable-size collectives that use info_v.counts* array instead of info.count scalar
```c
    buf_start = (uintptr_t)buffer;
    buf_end   = buf_start + length;
```
check for potential integer overflow when buf_start + length exceeds UINTPTR_MAX. Should validate that length <= UINTPTR_MAX - buf_start before addition to prevent wraparound
Suggested change:

```c
    buf_start = (uintptr_t)buffer;
    if (length > UINTPTR_MAX - buf_start) {
        tl_error(UCC_TL_TEAM_LIB(team), "NCCL UBR: buffer size causes overflow");
        return UCC_ERR_INVALID_PARAM;
    }
    buf_end = buf_start + length;
```
```c
    /* If UBR is not available/disabled or no TL data, return empty pack */
    if (!ctx->ubr_available || !tl_h || !tl_h->tl_data) {
        tl_h->packed_size = 0;
        *pack_buffer = NULL;
        return UCC_OK;
```
Null deref on pack
In ucc_tl_nccl_memh_pack, the early-return condition includes !tl_h, but the block still does tl_h->packed_size = 0. If tl_h == NULL this will crash. This can trigger if the iface is ever invoked with a NULL tl_h (the function explicitly allows it in the condition), so the guard should avoid dereferencing tl_h in that path.
```diff
 ucc_status_t ucc_tl_nccl_mem_map(
     const ucc_base_context_t *context, ucc_mem_map_mode_t mode,
     ucc_mem_map_memh_t *memh, ucc_mem_map_tl_t *tl_h)
 {
-    return UCC_ERR_NOT_SUPPORTED;
+    ucc_tl_nccl_context_t *ctx = ucc_derived_of(context, ucc_tl_nccl_context_t);
+    ucc_tl_nccl_memh_data_t *m_data;
+
+    /* Check if UBR is available and enabled */
+    if (!ctx->ubr_available) {
+        tl_debug(
+            ctx->super.super.lib, "NCCL UBR not available, skipping mem_map");
+        return UCC_ERR_NOT_SUPPORTED;
+    }
+
+    /* Support both EXPORT and IMPORT modes for global memh */
+    if (mode != UCC_MEM_MAP_MODE_EXPORT && mode != UCC_MEM_MAP_MODE_IMPORT) {
+        tl_debug(ctx->super.super.lib,
+                 "NCCL UBR: unsupported mode %d", mode);
+        return UCC_ERR_NOT_SUPPORTED;
+    }
+
+    /* Reject zero-length buffers */
+    if (memh->len == 0) {
+        tl_debug(ctx->super.super.lib,
+                 "NCCL UBR: zero-length buffer, skipping mem_map");
+        return UCC_ERR_NOT_SUPPORTED;
+    }
+
+    /* Allocate TL-specific memory handle data */
+    m_data = (ucc_tl_nccl_memh_data_t *)ucc_calloc(
+        1, sizeof(ucc_tl_nccl_memh_data_t), "tl_nccl_memh_data");
+    if (!m_data) {
+        tl_error(
+            ctx->super.super.lib, "failed to allocate TL memory handle data");
+        return UCC_ERR_NO_MEMORY;
+    }
+
+    /* Store buffer information - registration will happen lazily on first use */
+    m_data->address = memh->address;
+    m_data->length = memh->len;
+    m_data->registered_comms = NULL;
+    m_data->nccl_handles = NULL;
+    m_data->num_comms = 0;
+    m_data->max_comms = 0;
+
+    /* Set TL handle data */
+    tl_h->tl_data = m_data;
+    strncpy(tl_h->tl_name, "nccl", UCC_MEM_MAP_TL_NAME_LEN - 1);
+    tl_h->tl_name[UCC_MEM_MAP_TL_NAME_LEN - 1] = '\0';
```
IMPORT memh uses garbage
ucc_tl_nccl_mem_map claims to support UCC_MEM_MAP_MODE_IMPORT, but it ignores the packed payload in memh->pack_buffer and instead stores memh->address/memh->len into m_data. In the core import path, those fields are explicitly treated as “likely garbage” for imported handles (src/core/ucc_context.c:1212-1214), so the imported NCCL TL handle will contain invalid address/length and later lazy registration may attempt ncclCommRegister on bogus pointers.
```c
    /* Verify that the entire buffer is within the registered memory region */
    buf_start = (uintptr_t)buffer;
    buf_end = buf_start + length;
    region_start = (uintptr_t)m_data->address;
    region_end = region_start + m_data->length;

    if (length > (UINTPTR_MAX - buf_start)) {
        tl_error(UCC_TL_TEAM_LIB(team), "NCCL UBR: buffer size causes overflow");
        return UCC_ERR_INVALID_PARAM;
    }
```
Overflow check too late
In ucc_tl_nccl_lazy_register_memh, buf_end is computed as buf_start + length before checking length <= UINTPTR_MAX - buf_start. If the addition overflows, buf_end is already wrapped/UB, and the subsequent bounds check can misbehave. The overflow guard needs to run before computing buf_end.
```c
    /* For NCCL UBR, we only store metadata (address/length) for lazy registration.
     * When ncclCommRegister is called later, it stores this metadata locally.
     * The NCCL communicator handles IPC handle exchange internally during collective
     * operations (via point-to-point proxy calls), so we don't need special IMPORT
     * handling. We can use memh->address/memh->len directly in both EXPORT and IMPORT
     * modes - the address should be valid in the current process context. */
    m_data->address = memh->address;
    m_data->length = memh->len;
```
For IMPORT mode, verify whether using memh->address/memh->len directly is correct. The core code treats these as "likely garbage" for imported handles (src/core/ucc_context.c:1212-1214), suggesting TLs should extract data from pack_buffer. If NCCL's IPC model truly makes local addresses valid for IMPORT mode (different from typical TL behavior), add documentation explaining this design choice. Otherwise, implement extraction from pack_buffer similar to the core's approach in ucc_context.c:1224-1238.
```c
    if (ucc_unlikely(status != UCC_OK)) {
        tl_error(
            team->context->lib,
            "NCCL UBR: lazy_register failed with status %d",
            status);
        ucc_mpool_put(task);
        return status;
    }
```
CUDA event leak on UBR registration failure
Both UBR error paths (src at line 357 and dst at line 395) call ucc_mpool_put(task) directly after a failed ucc_tl_nccl_lazy_register_memh. By this point in the function, task->completed may already hold a valid CUDA event (created at line 309 when sync_type == UCC_TL_NCCL_COMPLETION_SYNC_TYPE_EVENT). Putting the task back to the pool without destroying the event leaks it.
The fix is to destroy the event before returning, mirroring the cleanup in ucc_tl_nccl_free_task:
Suggested change:

```c
    if (ucc_unlikely(status != UCC_OK)) {
        tl_error(
            team->context->lib,
            "NCCL UBR: lazy_register failed with status %d",
            status);
        if (task->completed) {
            ucc_ec_destroy_event(task->completed, UCC_EE_CUDA_STREAM);
        }
        ucc_mpool_put(task);
        return status;
    }
```
The same fix is required for the dst-buffer error path at line 394–397.
```c
    region_start = (uintptr_t)m_data->address;
    region_end   = region_start + m_data->length;
```
Missing overflow check for region_end
An overflow guard was added for buf_start + length (line 166), but the symmetric computation region_start + m_data->length has no such check. If m_data->address + m_data->length wraps around UINTPTR_MAX, region_end becomes a small value. The subsequent bounds test buf_end > region_end would then be spuriously false for any legitimately-contained buffer, meaning a future out-of-bounds access would be missed. The same pattern used for the buffer should be applied here:
```c
    if (m_data->length > (UINTPTR_MAX - (uintptr_t)m_data->address)) {
        tl_error(UCC_TL_TEAM_LIB(team),
                 "NCCL UBR: registered region size causes overflow");
        return UCC_ERR_INVALID_PARAM;
    }
    region_start = (uintptr_t)m_data->address;
    region_end = region_start + m_data->length;
```

```c
    ucc_spin_lock(&ctx->memh_lock);
    ucc_list_for_each(m_data, &ctx->memh_list, list_elem) {
        ucc_spin_lock(&m_data->lock);
        for (i = 0; i < m_data->num_comms; i++) {
            if (m_data->registered_comms[i] == team->nccl_comm) {
                ncclCommDeregister(team->nccl_comm, m_data->nccl_handles[i]);
                /* Null the slot so mem_unmap skips it */
                m_data->registered_comms[i] = NULL;
                m_data->nccl_handles[i] = NULL;
                break;
            }
        }
        ucc_spin_unlock(&m_data->lock);
    }
    ucc_spin_unlock(&ctx->memh_lock);
```
Spinlock held during blocking NCCL deregistration calls
ctx->memh_lock is a spinlock, yet ncclCommDeregister (a potentially slow synchronous NCCL/CUDA call) is invoked on line 86 while both ctx->memh_lock and m_data->lock are still held. Any concurrent call to mem_map, mem_unmap, or lazy_register_memh that needs either of these locks will busy-spin for the entire duration of all deregistrations. In a multi-GPU setup with many registered buffers this can waste significant CPU cycles and may even deadlock CUDA if the same CPU thread is required for NCCL progress.
The fix is to collect the (comm, handle) pairs while holding the locks, then release both locks before calling ncclCommDeregister:
```c
/* collect pairs under lock */
ucc_spin_lock(&ctx->memh_lock);
ucc_list_for_each(m_data, &ctx->memh_list, list_elem) {
    ucc_spin_lock(&m_data->lock);
    for (i = 0; i < m_data->num_comms; i++) {
        if (m_data->registered_comms[i] == team->nccl_comm) {
            /* save for deregistration outside locks */
            pending[npending].comm = m_data->registered_comms[i];
            pending[npending].handle = m_data->nccl_handles[i];
            m_data->registered_comms[i] = NULL;
            m_data->nccl_handles[i] = NULL;
            npending++;
            break;
        }
    }
    ucc_spin_unlock(&m_data->lock);
}
ucc_spin_unlock(&ctx->memh_lock);

/* deregister outside of any spinlock */
for (i = 0; i < npending; i++) {
    ncclCommDeregister(pending[i].comm, pending[i].handle);
}
```

```c
    m_data->registered_comms = new_comms;

    new_handles = (void **)ucc_realloc(
        m_data->nccl_handles, new_max * sizeof(void *), "nccl_handles");
    if (!new_handles) {
        ucc_spin_unlock(&m_data->lock);
        tl_error(
            UCC_TL_TEAM_LIB(team),
            "failed to allocate memory for NCCL handles array");
        ncclCommDeregister(team->nccl_comm, nccl_handle);
        return UCC_ERR_NO_MEMORY;
    }
    m_data->nccl_handles = new_handles;
    m_data->max_comms = new_max;
```
m_data->max_comms not updated when nccl_handles realloc fails, leaving arrays inconsistent
When new_comms realloc succeeds (line 241-252), m_data->registered_comms is immediately updated to the larger buffer (line 253). If the subsequent new_handles realloc fails (line 255-264), the function returns UCC_ERR_NO_MEMORY without updating m_data->max_comms to new_max.
This leaves the struct in an inconsistent state:
- `m_data->registered_comms` has capacity `new_max`
- `m_data->nccl_handles` still has capacity `old_max`
- `m_data->max_comms` still reports `old_max`
On the next successful registration, both arrays are grown from old_max to 2 * old_max again (which equals new_max again), so correctness is self-recovering. However, the window of inconsistency – between the failed call and the next successful call – means any code that assumes registered_comms and nccl_handles have the same allocated capacity as max_comms will be wrong, making the logic harder to reason about.
The safest fix is to update m_data->max_comms = new_max immediately after the successful new_comms realloc, so the state always reflects the actual registered_comms capacity, even if nccl_handles is still at the old size.
b17ed9b to 8efb81b
Greptile encountered an error while reviewing this PR. Please reach out to support@greptile.com for assistance.

What
This PR implements NCCL User Buffer Registration (UBR) in TL/NCCL. It replaces the three stub functions (mem_map, mem_unmap, memh_pack) that previously returned UCC_ERR_NOT_SUPPORTED with an implementation that registers GPU memory buffers with NCCL communicators.
Why
NCCL UBR (available since NCCL 2.19.0) lets applications pre-register GPU buffers so NCCL can bypass its internal memory registration on each collective call. Without this, NCCL must re-pin and register the buffer every time it's used in a collective, which adds latency. This is particularly impactful for AI/ML workloads where the same buffers (e.g., model gradients) are used in AllReduce repeatedly across training steps. This PR wires UCC's existing mem_map API into NCCL's ncclCommRegister/ncclCommDeregister, enabling that optimization.
How
- mem_map only stores buffer metadata; no NCCL registration happens yet. The actual ncclCommRegister call is deferred to the first collective operation that uses the buffer (ucc_tl_nccl_lazy_register_memh in tl_nccl_coll.c). This avoids registering with comms that never use the buffer.
- Each memh carries bookkeeping arrays holding one entry per NCCL communicator that has registered the buffer, growing as needed.
- NCCL calls (ncclCommRegister/ncclCommDeregister) are made outside spinlocks to avoid holding them during potentially slow CUDA operations.
- Team destroy deregisters the handles belonging to the dying comm before calling ncclCommDestroy, preventing use-after-free in a later mem_unmap. The reverse order (mem_unmap before team_destroy) is also handled cleanly.
- memh_pack serializes the buffer address and length; mem_map in import mode unpacks the local address from the blob.
- The feature is guarded by version gating against NCCL 2.19.0 at compile time.