[core] Check if a ray task has errored without calling ray.get on it #45229

Open
@justinvyu

Description

Goal: From a list of ray remote task futures, I want to be able to check if each of these has errored without needing to call ray.get individually on each element.

This feature is offered by similar async execution APIs:
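As one illustration of the kind of API I mean (my own pick, using only the Python standard library), `concurrent.futures.Future` exposes `exception()`, which returns the raised exception, or `None` on success, without re-raising it. The helper name `find_errors` and the two toy functions below are hypothetical; this is only a sketch of the shape of the feature, not a proposal for how Ray should implement it.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def ok():
    return 42


def boom():
    raise ValueError("worker failed")


def find_errors(futures):
    """Return {index: exception} for finished futures that raised."""
    errors = {}
    for i, fut in enumerate(futures):
        # Only inspect futures that have finished and were not cancelled;
        # exception() would raise CancelledError on a cancelled future.
        if fut.done() and not fut.cancelled():
            exc = fut.exception(timeout=0)
            if exc is not None:
                errors[i] = exc
    return errors


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(ok), pool.submit(boom)]
    wait(futures)  # block until both futures have finished
    errors = find_errors(futures)
```

Here the errors are collected without a `try`/`except` around a result fetch, which is the behavior I would like for Ray object refs.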

Current workaround

We have a "check for failure" function in Ray Train, which may incur some unnecessary overhead to fetch objects:

for object_ref in finished:
    # Everything in finished has either failed or completed
    # successfully.
    try:
        ray.get(object_ref)
    except RayActorError as exc:
        failed_actor_rank = remote_values.index(object_ref)
        logger.info(f"Worker {failed_actor_rank} has failed.")
        return False, exc
    except Exception as exc:

Use case

I am implementing a control loop where I want to check on the status of some actor tasks every N seconds. I want to know as soon as possible if any of these actor tasks have failed so that I can trigger some error handling. To do this, I run an "error check" in a loop with a short sleep between iterations:

while True:
    ready, remaining = ray.wait(tasks, num_returns=len(tasks), timeout=0.01)

    # I want to be able to collect errored tasks without calling ray.get.
    # I want to distinguish successful tasks vs. errored tasks from the output from ray.wait.
    errors = []
    for task in ready:
        try:
            ray.get(task)
        except Exception as e:
            errors.append(e)
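To make the desired output concrete, here is a minimal sketch of a helper that splits the ready refs into succeeded and failed ones. The name `split_ready` and its `get` parameter are hypothetical (not part of Ray); `get` is injected so the sketch runs without a cluster, and in practice I would pass `ray.get`. Note that this still calls `get` on every ref, so it still pays the fetch overhead that this issue is asking to avoid.

```python
from typing import Any, Callable, List, Sequence, Tuple


def split_ready(
    ready: Sequence[Any],
    get: Callable[[Any], Any],
) -> Tuple[List[Any], List[Tuple[Any, Exception]]]:
    """Partition finished refs into (succeeded, failed).

    `failed` holds (ref, exception) pairs so the caller can map an
    error back to the task that produced it.
    """
    succeeded: List[Any] = []
    failed: List[Tuple[Any, Exception]] = []
    for ref in ready:
        try:
            get(ref)  # with Ray this would be ray.get(ref)
        except Exception as exc:
            failed.append((ref, exc))
        else:
            succeeded.append(ref)
    return succeeded, failed
```

Inside the loop above, usage would look like `succeeded, failed = split_ready(ready, ray.get)`. Ideally `ray.wait` (or a method on the object ref) would give me this partition directly.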

cc: @jjyao @rkooo567

Labels

P1 (issue that should be fixed within a few weeks), core (issues that should be addressed in Ray Core), enhancement (request for new feature and/or capability)
