Conversation
fd52ba9 to
608ecba
Compare
|
| Filename | Overview |
|---|---|
| src/core/ucc_global_opts.c | Adds MODULES config entry with help text and static initializer {{NULL, 0}, 0}; initializer relies on UCC_CONFIG_ALLOW_LIST_ALLOW_ALL == 0 |
| src/core/ucc_global_opts.h | Adds ucc_config_allow_list_t modules field to the global config struct; straightforward and correct |
| src/utils/ucc_component.c | Introduces ucc_component_is_enabled and ucc_modules_list_search; ALLOW list mode causes fatal init failure for CL/MC frameworks when only TL names are specified |
Reviews (3): Last reviewed commit: "CORE: add ucc_modules config" | Re-trigger Greptile
| .log_print_enable = 0, | ||
| .log_level_trigger = UCC_LOG_LEVEL_FATAL}; | ||
| .log_level_trigger = UCC_LOG_LEVEL_FATAL, | ||
| .modules = {{NULL, 0}, 0}}; |
There was a problem hiding this comment.
Magic numeric literal for
modules.mode
The static initializer uses 0 as the value for modules.mode, relying on the assumption that UCC_CONFIG_ALLOW_LIST_ALLOW_ALL == 0. This is an implicit contract with the UCX library's enum ordering. If UCX ever changes this enum (e.g., adds a value before ALLOW_ALL), or if ucc_component_is_enabled is ever called before ucc_config_read processes the default "all" value, all components would be silently disabled because the mode == UCC_CONFIG_ALLOW_LIST_ALLOW_ALL guard would not fire, the empty array would match nothing, and neither the ALLOW nor NEGATE branches would return 1.
Use the named constant to make the intent explicit and safe:
| .modules = {{NULL, 0}, 0}}; | |
| .log_level_trigger = UCC_LOG_LEVEL_FATAL, | |
| .modules = {{NULL, 0}, UCC_CONFIG_ALLOW_LIST_ALLOW_ALL}}; |
|
|
||
| for (i = 0; i < names->count; i++) { | ||
| if ((strcmp(names->names[i], qualified) == 0) || | ||
| (strcmp(names->names[i], component_name) == 0)) { | ||
| return i; | ||
| } | ||
| } | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
No diagnostic for unrecognised module names in the allow list
If a user makes a typo in UCC_MODULES (e.g., UCC_MODULES=^tl_cudaa), the entry silently fails to match any component — no warning is emitted and the filtering just has no effect. This can be confusing to debug.
Consider emitting a ucc_warn when a ALLOW or NEGATE list is active and none of the entries match any component that was actually found on disk. At minimum, a debug-level note when an entry in the list is never matched during the whole component-loading scan would help users detect typos in their configuration.
| if (!ucc_component_is_enabled(framework_name, | ||
| iface_struct + strlen(framework_pattern))) { | ||
| ucc_debug("component '%s_%s' is disabled by UCC_MODULES configuration", | ||
| framework_name, iface_struct + strlen(framework_pattern)); | ||
| *c_iface = NULL; | ||
| return UCC_ERR_NO_MESSAGE; | ||
| } |
There was a problem hiding this comment.
Disabled component returns
UCC_ERR_NO_MESSAGE, indistinguishable from load failure
Returning UCC_ERR_NO_MESSAGE for a component that is intentionally disabled by the user mixes two semantically different outcomes under the same error code. The same code is already returned when the .so basename doesn't contain the framework pattern (line 83), and when dlopen/dlsym fails (lines 107/128). In all three cases ucc_components_load simply continues.
This isn't a bug today, but it means callers can never distinguish "this component was deliberately filtered out by the user" from "this component failed to load." If in the future a caller wants to emit a summarized diagnostic (e.g., "3 components were disabled, 1 failed to load"), the distinction will be impossible to recover. Using a separate status (e.g., UCC_ERR_NOT_SUPPORTED, or even UCC_OK with *c_iface = NULL) for the intentional-disable case would preserve that information.
608ecba to
75e25c2
Compare
|
/build |
75e25c2 to
e22007d
Compare
| if (!ucc_component_is_enabled(framework_name, | ||
| iface_struct + strlen(framework_pattern))) { | ||
| ucc_debug("component '%s_%s' is disabled by UCC_MODULES configuration", | ||
| framework_name, iface_struct + strlen(framework_pattern)); | ||
| *c_iface = NULL; | ||
| return UCC_ERR_NO_MESSAGE; |
There was a problem hiding this comment.
ALLOW list silently breaks required frameworks
When UCC_MODULES is set to an ALLOW list (no ^ prefix), the filter is applied per-component across all frameworks. The PR's own documented example UCC_MODULES=ucp,nccl will cause UCC initialization to fail because:
- The
clframework has componentsbasic,hier,doca_urom— none of which matchucpornccl. - All CL components are filtered out →
ucc_components_load("cl", ...)returnsUCC_ERR_NOT_FOUND. ucc_constructor.cline 138 treats that as fatal: "no CL components were found in the ucc modules dir".- The same problem applies to the
mcframework, which is also treated as fatal (lines 159–163).
So the documented ucp,nccl example would crash initialization for any installation that has CL components. The negation examples (^cuda, ^tl_cuda,ec_cuda) work correctly because they only exclude matching components. Consider either:
- Filtering only the framework(s) whose component names appear in the allow list (cross-framework awareness), or
- Documenting that the ALLOW form requires the user to also include all required CL/MC component names (e.g.,
UCC_MODULES=ucp,nccl,basic,hier,cpu), and removing theucp,ncclexample from the help string until that is clarified.
|
/build |
|
@Sergei-Lebedev looks legit, is it? |
What
Add configuration variable UCC_MODULES to disable/enable loading of specific components before they initialized.
Examples:
UCC_MODULES=^cuda disables all cuda components including mc_cuda, ec_cuda, tl_cuda, sysinfo_cuda
UCC_MODULES=^tl_cuda,ec_cuda disables tl_cuda and ec_cuda, but keeps mc_cuda and sysinfo_cuda