[core] Check if a ray task has errored without calling ray.get on it #45229
Open
Description
Goal: Given a list of Ray remote task futures, I want to check whether each one has errored without calling ray.get on every element individually.
This feature is offered by similar async execution APIs:
- https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Future.exception
- multiprocessing.Process.exitcode
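For reference, here is a minimal sketch of what this looks like with concurrent.futures from the standard library, which is roughly the ergonomics I'd like for Ray tasks. The helper name `collect_errors` and the two toy functions are made up for illustration; they are not part of any Ray API.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def succeed():
    return 42


def fail():
    raise ValueError("boom")


def collect_errors(futures, timeout=None):
    """Return (errors, not_done) for the given futures.

    Only futures that finished within `timeout` are inspected, so
    Future.exception() never blocks and no result is fetched.
    """
    done, not_done = wait(futures, timeout=timeout)
    errors = []
    for fut in done:
        if fut.cancelled():
            # exception() would raise CancelledError on a cancelled future.
            continue
        exc = fut.exception()  # the raised exception, or None on success
        if exc is not None:
            errors.append(exc)
    return errors, not_done


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(succeed), pool.submit(fail)]
    errors, not_done = collect_errors(futures, timeout=5)
```

The point is that the error check and the result fetch are separate calls, so a control loop can look at failures without pulling any return values.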
Current workaround
We have a "check for failure" function in Ray Train, which may incur some unnecessary overhead to fetch objects:
ray/python/ray/train/_internal/utils.py, lines 49 to 58 in fa61109
Use case
I am implementing a control loop that checks the status of some actor tasks every N seconds. I want to know as soon as possible if any of these actor tasks have failed so I can trigger error handling. This means running an "error check" in a loop with a small amount of sleep time:
while True:
    ready, remaining = ray.wait(tasks, num_returns=len(tasks), timeout=0.01)

    # I want to be able to collect errored tasks without calling ray.get.
    # I want to distinguish successful tasks vs. errored tasks from the output from ray.wait.
    errors = []
    for task in ready:
        try:
            ray.get(task)
        except Exception as e:
            errors.append(e)