Closed
Description
Currently, the Launcher process just raises a RuntimeError if any agents fail. I think it could be more useful to raise the actual exception from the agents. Then the user can have more conditional control at the launcher level (e.g. what to do next if there is an OutOfMemoryError vs. something else).
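To illustrate the kind of launcher-level control this would allow, here is a minimal sketch. The names `launch`, `train`, and `launch_with_backoff` are hypothetical and are not torchrunx's actual API; the built-in `MemoryError` stands in for an OutOfMemoryError so the snippet runs without a GPU.

```python
def launch(func, *args):
    """Hypothetical stand-in for the launcher: re-raises whatever the agent raised."""
    return func(*args)


def train(batch_size):
    # Pretend that large batches exhaust memory on the agent.
    if batch_size > 32:
        raise MemoryError(f"batch_size={batch_size} does not fit")
    return batch_size


def launch_with_backoff(batch_size):
    """Halve the batch size on memory errors; let every other error propagate."""
    while batch_size >= 1:
        try:
            return launch(train, batch_size)
        except MemoryError:
            batch_size //= 2
    raise RuntimeError("ran out of batch sizes to try")
```

With a generic RuntimeError, the `except MemoryError` branch could not be written; the caller would have to parse the error message instead.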
The only problem might be if multiple agents fail: then which exception do we raise?
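One possible answer, sketched below under assumptions: re-raise the original exception when exactly one agent failed, and wrap several failures in a single error that still subclasses RuntimeError, so existing `except RuntimeError` handlers keep working. `AgentFailuresError` and `raise_agent_failures` are hypothetical names, and the `failures` mapping of rank to exception is an assumed shape, not what the launcher collects today.

```python
class AgentFailuresError(RuntimeError):
    """Hypothetical wrapper raised when more than one agent fails."""

    def __init__(self, failures):
        self.failures = dict(failures)  # rank -> exception raised on that agent
        ranks = ", ".join(str(r) for r in sorted(self.failures))
        super().__init__(f"agents failed on ranks: {ranks}")


def raise_agent_failures(failures):
    """Re-raise a lone agent exception as-is; wrap several in AgentFailuresError."""
    if not failures:
        return
    if len(failures) == 1:
        raise next(iter(failures.values()))
    raise AgentFailuresError(failures)
```

On Python 3.11 and later, the built-in ExceptionGroup with `except*` would be an alternative to a custom wrapper class.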
torchrunx/src/torchrunx/launcher.py
Lines 241 to 253 in f081a00