[core] Check if a ray task has errored without calling ray.get on it #45229

Open
@justinvyu

Description

Goal: Given a list of Ray remote task futures, I want to check whether each one has errored without calling ray.get on each element individually.

This feature is offered by similar async execution APIs:
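As one illustration (picked here as an example, not taken from this issue), Python's standard-library `concurrent.futures` lets a caller inspect a finished future's error with `Future.exception()`, which returns the exception object instead of re-raising it. The sketch below is not Ray code; `flaky`, `split_by_outcome`, and the pool size are hypothetical names and values chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def flaky(i):
    # Hypothetical task: odd inputs fail, even inputs succeed.
    if i % 2:
        raise ValueError(f"task {i} failed")
    return i * 10


def split_by_outcome(futures):
    """Return (results, errors) for futures that have already finished."""
    results, errors = [], []
    for fut in futures:
        # exception() returns the raised exception, or None on success;
        # it does not re-raise, so no try/except is needed here.
        exc = fut.exception()
        if exc is None:
            results.append(fut.result())
        else:
            errors.append(exc)
    return results, errors


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(flaky, i) for i in range(4)]
    wait(futures)  # block until every future is done
    results, errors = split_by_outcome(futures)
```

The equivalent for Ray would be a way to ask an object ref (or the output of ray.wait) whether the task errored, without fetching the result.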

Current workaround

We have a "check for failure" function in Ray Train, which may incur some unnecessary overhead to fetch objects:

for object_ref in finished:
    # Everything in finished has either failed or completed
    # successfully.
    try:
        ray.get(object_ref)
    except RayActorError as exc:
        failed_actor_rank = remote_values.index(object_ref)
        logger.info(f"Worker {failed_actor_rank} has failed.")
        return False, exc
    except Exception as exc:

Use case

I am implementing a control loop where I want to check on the status of some actor tasks every N seconds. I want to know if these actor tasks have failed as soon as possible so I can trigger some error handling. This involves me running an "error check" in a loop with a small amount of sleep time:

while True:
    ready, remaining = ray.wait(tasks, num_returns=len(tasks), timeout=0.01)

    # I want to be able to collect errored tasks without calling ray.get.
    # I want to distinguish successful tasks vs. errored tasks from the output from ray.wait.
    errors = []
    for task in ready:
        try:
            ray.get(task)
        except Exception as e:
            errors.append(e)

cc: @jjyao @rkooo567

Labels: P0 (issues that should be fixed in short order), core (issues that should be addressed in Ray Core), enhancement (request for new feature and/or capability)
