Open
Description
An application using UCC may attempt to perform a collective operation with a UCC team after the UCC context on which that team was created has been destroyed. This usage is invalid, as documented in the API, but it can result in two potential failures:
- `ucc_collective_init` succeeds but `ucc_collective_post` produces a segmentation fault. This can occur if the context has been destroyed but the library has not been finalized.
- `ucc_collective_init` produces errors similar to:
[1751278711.720649] [eos0260:399988:0] ucc_mc.c:143 UCC ERROR no components supported memory type host available
This occurs when the UCC context has been destroyed and the UCC library has been finalized.
These errors can be difficult for users to track down unless they are familiar with UCC. Currently, we only check for the reuse of a destroyed UCC team. It may be beneficial to also check for these additional failure cases in `ucc_collective_init`, so that the call returns an error instead of crashing and applications can continue executing toward a graceful shutdown.
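To make the proposal concrete, here is a minimal sketch of the kind of check being suggested. It does not use UCC's real types or internals: every `demo_*` name and status code below is hypothetical, and the sketch assumes that destroyed objects are only marked as dead rather than freed. A real implementation could not safely inspect freed memory and would likely need some other mechanism (for example a handle registry) to detect a stale context.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; these are NOT UCC types,
 * functions, or status codes. */
typedef enum {
    DEMO_OK = 0,
    DEMO_ERR_TEAM_DESTROYED = -1,
    DEMO_ERR_LIB_FINALIZED = -2,
    DEMO_ERR_CONTEXT_DESTROYED = -3
} demo_status_t;

typedef struct { bool initialized; } demo_lib_t;
typedef struct { demo_lib_t *lib; bool alive; } demo_context_t;
typedef struct { demo_context_t *ctx; bool alive; } demo_team_t;

/* Assumption: destroy/finalize only flip a flag, so later checks can still
 * read the object safely. */
static void demo_context_destroy(demo_context_t *ctx) { ctx->alive = false; }
static void demo_finalize(demo_lib_t *lib) { lib->initialized = false; }

/* Validate the whole ownership chain before starting a collective. */
static demo_status_t demo_collective_init(const demo_team_t *team)
{
    /* The kind of check that exists today: reuse of a destroyed team. */
    if (team == NULL || !team->alive) {
        return DEMO_ERR_TEAM_DESTROYED;
    }
    /* Proposed: the team's context pointer must be present. */
    if (team->ctx == NULL) {
        return DEMO_ERR_CONTEXT_DESTROYED;
    }
    /* Proposed: the library behind the context must not be finalized. */
    if (team->ctx->lib == NULL || !team->ctx->lib->initialized) {
        return DEMO_ERR_LIB_FINALIZED;
    }
    /* Proposed: the context itself must not have been destroyed. */
    if (!team->ctx->alive) {
        return DEMO_ERR_CONTEXT_DESTROYED;
    }
    return DEMO_OK;
}
```

With checks like these, both failure cases described above would surface as an error return from the init call, which the application can handle, rather than as a segmentation fault in the post call or an unrelated-looking memory component error.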