Graceful Failure with Incorrect API Usage #1174

@wfaderhold21

Description

An application using UCC may attempt a collective operation on a UCC team after the UCC context that the team was created on has been destroyed. This usage is invalid and is documented as such in the API, but it can fail in two ways:

  1. ucc_collective_init succeeds but ucc_collective_post produces a segmentation fault. This can occur if the context has been destroyed but the library has not been finalized.
  2. ucc_collective_init produces errors similar to:
[1751278711.720649] [eos0260:399988:0]          ucc_mc.c:143  UCC  ERROR no components supported memory type host available

This occurs when the UCC context has been destroyed and the UCC library has also been finalized.

These errors can be difficult for users to track down unless they are familiar with UCC. Currently, we only check for the reuse of a destroyed UCC team. It may be beneficial to also check for these two cases in ucc_collective_init, so that the call returns an error instead of crashing and the application can continue executing long enough to shut down gracefully.
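
To make the proposal concrete, here is a minimal, self-contained sketch of the kind of validation this could mean. It is not UCC code: every type and function name (`sketch_*`) is a hypothetical stand-in, and it assumes that destroyed or finalized handles stay readable and carry a state flag. A real change would need some mechanism to make that true, such as reference counting or tombstones.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical stand-ins, not UCC types or functions. */

typedef enum {
    SKETCH_OK                = 0,
    SKETCH_ERR_INVALID_PARAM = -1
} sketch_status_t;

typedef enum {
    SKETCH_STATE_ACTIVE,
    SKETCH_STATE_DESTROYED
} sketch_state_t;

typedef struct {
    sketch_state_t state;
} sketch_lib_t;

typedef struct {
    sketch_lib_t  *lib;   /* library the context was created on */
    sketch_state_t state;
} sketch_context_t;

typedef struct {
    sketch_context_t *ctx; /* context the team was created on */
    sketch_state_t    state;
} sketch_team_t;

/* Assumption: finalize/destroy only mark the handle; memory stays readable. */
static void sketch_lib_finalize(sketch_lib_t *lib)
{
    assert(lib != NULL);
    lib->state = SKETCH_STATE_DESTROYED;
}

static void sketch_context_destroy(sketch_context_t *ctx)
{
    assert(ctx != NULL);
    ctx->state = SKETCH_STATE_DESTROYED;
}

/* Validate the team, then its context, then the library, before any work. */
static sketch_status_t sketch_collective_init(const sketch_team_t *team)
{
    /* Analogous to the destroyed-team check that exists today. */
    if (team == NULL || team->state != SKETCH_STATE_ACTIVE) {
        return SKETCH_ERR_INVALID_PARAM;
    }
    /* Proposed: context destroyed while the team is still in use (case 1). */
    if (team->ctx == NULL || team->ctx->state != SKETCH_STATE_ACTIVE) {
        return SKETCH_ERR_INVALID_PARAM;
    }
    /* Proposed: library finalized as well (case 2). */
    if (team->ctx->lib == NULL ||
        team->ctx->lib->state != SKETCH_STATE_ACTIVE) {
        return SKETCH_ERR_INVALID_PARAM;
    }
    return SKETCH_OK;
}
```

With checks along these lines, both failure cases above would surface as an error return from the init call, which the application can handle, rather than as a segmentation fault in the post call or an unrelated memory-component error.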
