Propagate exceptions

Currently, the launcher process just raises a `RuntimeError` if any agent fails. I think it would be more useful to re-raise the actual exception from the agent. The user would then have more conditional control at the launcher level (e.g. deciding what to do next on an `OutOfMemoryError` versus some other error).
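
To illustrate the kind of launcher-level control I mean, here is a rough sketch. Everything in it is a stand-in: `launch` is a hypothetical placeholder for the launcher (assumed here to re-raise the agent's exception unchanged), and `OutOfMemoryError` is a local class standing in for something like `torch.cuda.OutOfMemoryError` so the example runs without torch.

```python
class OutOfMemoryError(RuntimeError):
    """Local stand-in for e.g. torch.cuda.OutOfMemoryError (assumption)."""


def launch(func, batch_size):
    # Hypothetical placeholder for the launcher. It is assumed to re-raise
    # whatever exception the agent hit, instead of a generic RuntimeError.
    return func(batch_size)


def train(batch_size):
    # Toy workload: pretend anything above 8 does not fit in memory.
    if batch_size > 8:
        raise OutOfMemoryError(f"batch_size={batch_size} does not fit")
    return f"trained with batch_size={batch_size}"


def launch_with_backoff(batch_size):
    # Halve the batch size on OOM; any other exception propagates untouched.
    while batch_size >= 1:
        try:
            return launch(train, batch_size)
        except OutOfMemoryError:
            batch_size //= 2
    raise RuntimeError("no batch size fit in memory")


result = launch_with_backoff(32)
```

With a generic `RuntimeError` the caller would have to parse the message string to tell an out-of-memory failure apart from anything else.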

The only problem might be if multiple agents fail: then which exception do we raise?

https://github.com/apoorvkh/torchrunx/blob/f081a00543bebe469ddae8a942a0930a45d2fe1a/src/torchrunx/launcher.py#L241-L253

	if any(s.is_failed() for s in agent_statuses):
	    # TODO: cleaner way to print these?
	    e = ""
	    for i, s in enumerate(agent_statuses):
	        if s is not None and s.is_failed():
	            for k, v in s.failures.items():
	                e += f"Node {i}, local worker {k} exited with error: "
	                if isinstance(v.message, str):
	                    e += f"{v.message}\n"
	                else:
	                    e += f"{v.message['message']}\n"
	                    e += f"{v.message['extraInfo']['py_callstack']}\n\n"
	    raise RuntimeError(e)
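
One possible answer to the multiple-failure question, as a sketch only: re-raise the original exception when exactly one worker failed, and otherwise raise an aggregate that keeps all of them and chains from the first. This assumes each failure carries the actual exception object, which the snippet above does not (there, `v.message` is a string or a dict). `WorkerFailures` and `raise_worker_failures` are hypothetical names, not part of torchrunx.

```python
class WorkerFailures(RuntimeError):
    """Hypothetical aggregate raised when more than one worker fails."""

    def __init__(self, failures):
        # failures: list of (node index, local worker index, exception)
        self.failures = failures
        lines = [
            f"Node {node}, local worker {worker}: {type(exc).__name__}: {exc}"
            for node, worker, exc in failures
        ]
        super().__init__("\n".join(lines))


def raise_worker_failures(failures):
    """Raise the single original exception, or an aggregate of several."""
    if not failures:
        return
    if len(failures) == 1:
        # Only one failure: the caller can catch its real type directly.
        raise failures[0][2]
    # Several failures: keep them all and chain from the first one.
    raise WorkerFailures(failures) from failures[0][2]
```

On Python 3.11+, the built-in `ExceptionGroup` with `except*` could play the role of the aggregate instead of a custom class.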

Propagate exceptions #59
