Skip to content

CORE: add ucc_modules config#1275

Open
Sergei-Lebedev wants to merge 1 commit intoopenucx:masterfrom
Sergei-Lebedev:topic/ucc_module_filter
Open

CORE: add ucc_modules config#1275
Sergei-Lebedev wants to merge 1 commit intoopenucx:masterfrom
Sergei-Lebedev:topic/ucc_module_filter

Conversation

@Sergei-Lebedev
Copy link
Copy Markdown
Contributor

What

Add configuration variable UCC_MODULES to disable/enable loading of specific components before they initialized.
Examples:
UCC_MODULES=^cuda disables all cuda components including mc_cuda, ec_cuda, tl_cuda, sysinfo_cuda
UCC_MODULES=^tl_cuda,ec_cuda disables tl_cuda and ec_cuda, but keeps mc_cuda and sysinfo_cuda

Copy link
Copy Markdown
Collaborator

@wfaderhold21 wfaderhold21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@janjust janjust force-pushed the topic/ucc_module_filter branch from fd52ba9 to 608ecba Compare March 20, 2026 04:20
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 20, 2026

Greptile Summary

This PR adds a new UCC_MODULES configuration variable that allows users to enable or disable specific component modules (e.g., ^cuda to exclude all CUDA components) before they are loaded. The implementation adds a ucc_config_allow_list_t modules field to ucc_global_config_t, registers it via ucc_global_config_table, and gates ucc_component_load_one with a new ucc_component_is_enabled helper.

Key observations:

  • The negation form (^cuda, ^tl_cuda,ec_cuda) works correctly — only matched components are excluded, the rest load normally.
  • The ALLOW form (ucp,nccl) has a correctness problem: the filter is applied globally across all frameworks. CL components (basic, hier, doca_urom) and MC components do not match these names, so ucc_components_load returns UCC_ERR_NOT_FOUND for those required frameworks, causing UCC initialization to fail with a misleading "no CL/MC components found" error — even though those libraries are present on disk.
  • The static initializer {{NULL, 0}, 0} uses 0 for modules.mode, implicitly relying on UCC_CONFIG_ALLOW_LIST_ALLOW_ALL == 0.

Confidence Score: 3/5

Not safe to merge as-is — the documented ALLOW list example breaks UCC initialization in typical installations

A P1 logic bug exists: the ALLOW mode of UCC_MODULES causes UCC initialization to fail because required frameworks (CL, MC) have their components filtered out when only TL names are listed. The PR's own documented example (UCC_MODULES=ucp,nccl) would trigger this failure. This is a primary-path reliability issue that will affect users who follow the documentation.

src/utils/ucc_component.c — the cross-framework ALLOW list filtering logic needs to be revisited

Important Files Changed

Filename Overview
src/core/ucc_global_opts.c Adds MODULES config entry with help text and static initializer {{NULL, 0}, 0}; initializer relies on UCC_CONFIG_ALLOW_LIST_ALLOW_ALL == 0
src/core/ucc_global_opts.h Adds ucc_config_allow_list_t modules field to the global config struct; straightforward and correct
src/utils/ucc_component.c Introduces ucc_component_is_enabled and ucc_modules_list_search; ALLOW list mode causes fatal init failure for CL/MC frameworks when only TL names are specified

Reviews (3): Last reviewed commit: "CORE: add ucc_modules config" | Re-trigger Greptile

.log_print_enable = 0,
.log_level_trigger = UCC_LOG_LEVEL_FATAL};
.log_level_trigger = UCC_LOG_LEVEL_FATAL,
.modules = {{NULL, 0}, 0}};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Magic numeric literal for modules.mode

The static initializer uses 0 as the value for modules.mode, relying on the assumption that UCC_CONFIG_ALLOW_LIST_ALLOW_ALL == 0. This is an implicit contract with the UCX library's enum ordering. If UCX ever changes this enum (e.g., adds a value before ALLOW_ALL), or if ucc_component_is_enabled is ever called before ucc_config_read processes the default "all" value, all components would be silently disabled because the mode == UCC_CONFIG_ALLOW_LIST_ALLOW_ALL guard would not fire, the empty array would match nothing, and neither the ALLOW nor NEGATE branches would return 1.

Use the named constant to make the intent explicit and safe:

Suggested change
.modules = {{NULL, 0}, 0}};
.log_level_trigger = UCC_LOG_LEVEL_FATAL,
.modules = {{NULL, 0}, UCC_CONFIG_ALLOW_LIST_ALLOW_ALL}};

Comment on lines +40 to +48

for (i = 0; i < names->count; i++) {
if ((strcmp(names->names[i], qualified) == 0) ||
(strcmp(names->names[i], component_name) == 0)) {
return i;
}
}
return -1;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 No diagnostic for unrecognised module names in the allow list

If a user makes a typo in UCC_MODULES (e.g., UCC_MODULES=^tl_cudaa), the entry silently fails to match any component — no warning is emitted and the filtering just has no effect. This can be confusing to debug.

Consider emitting a ucc_warn when a ALLOW or NEGATE list is active and none of the entries match any component that was actually found on disk. At minimum, a debug-level note when an entry in the list is never matched during the whole component-loading scan would help users detect typos in their configuration.

Comment on lines +94 to +100
if (!ucc_component_is_enabled(framework_name,
iface_struct + strlen(framework_pattern))) {
ucc_debug("component '%s_%s' is disabled by UCC_MODULES configuration",
framework_name, iface_struct + strlen(framework_pattern));
*c_iface = NULL;
return UCC_ERR_NO_MESSAGE;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Disabled component returns UCC_ERR_NO_MESSAGE, indistinguishable from load failure

Returning UCC_ERR_NO_MESSAGE for a component that is intentionally disabled by the user mixes two semantically different outcomes under the same error code. The same code is already returned when the .so basename doesn't contain the framework pattern (line 83), and when dlopen/dlsym fails (lines 107/128). In all three cases ucc_components_load simply continues.

This isn't a bug today, but it means callers can never distinguish "this component was deliberately filtered out by the user" from "this component failed to load." If in the future a caller wants to emit a summarized diagnostic (e.g., "3 components were disabled, 1 failed to load"), the distinction will be impossible to recover. Using a separate status (e.g., UCC_ERR_NOT_SUPPORTED, or even UCC_OK with *c_iface = NULL) for the intentional-disable case would preserve that information.

@janjust janjust force-pushed the topic/ucc_module_filter branch from 608ecba to 75e25c2 Compare March 31, 2026 13:44
@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Mar 31, 2026

/build

@janjust janjust force-pushed the topic/ucc_module_filter branch from 75e25c2 to e22007d Compare March 31, 2026 16:17
Comment on lines +94 to +99
if (!ucc_component_is_enabled(framework_name,
iface_struct + strlen(framework_pattern))) {
ucc_debug("component '%s_%s' is disabled by UCC_MODULES configuration",
framework_name, iface_struct + strlen(framework_pattern));
*c_iface = NULL;
return UCC_ERR_NO_MESSAGE;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ALLOW list silently breaks required frameworks

When UCC_MODULES is set to an ALLOW list (no ^ prefix), the filter is applied per-component across all frameworks. The PR's own documented example UCC_MODULES=ucp,nccl will cause UCC initialization to fail because:

  • The cl framework has components basic, hier, doca_urom — none of which match ucp or nccl.
  • All CL components are filtered out → ucc_components_load("cl", ...) returns UCC_ERR_NOT_FOUND.
  • ucc_constructor.c line 138 treats that as fatal: "no CL components were found in the ucc modules dir".
  • The same problem applies to the mc framework, which is also treated as fatal (lines 159–163).

So the documented ucp,nccl example would crash initialization for any installation that has CL components. The negation examples (^cuda, ^tl_cuda,ec_cuda) work correctly because they only exclude matching components. Consider either:

  1. Filtering only the framework(s) whose component names appear in the allow list (cross-framework awareness), or
  2. Documenting that the ALLOW form requires the user to also include all required CL/MC component names (e.g., UCC_MODULES=ucp,nccl,basic,hier,cpu), and removing the ucp,nccl example from the help string until that is clarified.

@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Mar 31, 2026

/build

@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Mar 31, 2026

@Sergei-Lebedev looks legit, is it?

[2026-03-31T16:29:12.314Z]   CC       components/mc/base/libucc_la-ucc_mc_base.lo

[2026-03-31T16:29:12.314Z]   CC       components/ec/libucc_la-ucc_ec.lo

[2026-03-31T16:29:12.314Z]   CC       components/topo/libucc_la-ucc_sbgp.lo

[2026-03-31T16:29:12.314Z]   CC       components/topo/libucc_la-ucc_topo.lo

[2026-03-31T16:29:12.314Z]   CC       components/topo/libucc_la-ucc_sysinfo.lo

[2026-03-31T16:29:12.569Z] "core/ucc_global_opts.c", line 33: error #188: enumerated type mixed with another type [mixed_enum_type]

[2026-03-31T16:29:12.569Z]       .modules           = {{NULL, 0}, 0}};

[2026-03-31T16:29:12.569Z]                                        ^

[2026-03-31T16:29:12.569Z] 

[2026-03-31T16:29:12.569Z] Remark #3391-D: individual warnings can be suppressed with "--diag_suppress <warning-name>"

[2026-03-31T16:29:12.569Z] 

[2026-03-31T16:29:12.569Z] 1 error detected in the compilation of "core/ucc_global_opts.c".

[2026-03-31T16:29:12.569Z] make[2]: *** [Makefile:1130: core/libucc_la-ucc_global_opts.lo] Error 1

[2026-03-31T16:29:12.569Z] make[2]: *** Waiting for unfinished jobs....

[2026-03-31T16:29:15.081Z] make[2]: Leaving directory '/home/jenkins/agent/workspace/UCC/ucc-build-hpcsdk/src'

[2026-03-31T16:29:15.081Z] make[1]: *** [Makefile:1477: install-recursive] Error 1

[2026-03-31T16:29:15.081Z] make[1]: Leaving directory '/home/jenkins/agent/workspace/UCC/ucc-build-hpcsdk/src'

[2026-03-31T16:29:15.081Z] make: *** [Makefile:638: install-recursive] Error 1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants