[Core] Make multi-node job fail fast when one fails, and output segmentation fault #3081
Conversation
Thanks @Michaelvll. Could we also add a basic smoke test to cover the multi-node job failure case?
Thanks @Michaelvll, LGTM.
@concretevitamin In order to support the case of having different setups on different nodes, I added a …
Mostly LGTM with some nits, thanks @Michaelvll.
* - ``SKYPILOT_SETUP_NODE_IPS``
  - A string of IP addresses of the nodes in the cluster with the same order as the node ranks, where each line contains one IP address.
  - 1.2.3.4
Add something like: Since `setup` commands always run on all nodes of a cluster, SkyPilot ensures both of these environment variables (the ranks and the IP list) never change across multiple setups on the same cluster.
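For illustration, a minimal sketch of how a setup command might consume these variables. `SKYPILOT_SETUP_NODE_IPS` is from the diff above; `SKYPILOT_SETUP_NODE_RANK` is my assumed name for the companion rank variable discussed here:

```python
import os

# SKYPILOT_SETUP_NODE_IPS: one IP per line, ordered by node rank.
# SKYPILOT_SETUP_NODE_RANK: presumed name of the rank variable (assumption).
ips = os.environ['SKYPILOT_SETUP_NODE_IPS'].splitlines()
rank = int(os.environ['SKYPILOT_SETUP_NODE_RANK'])

if rank == 0:
    print(f'Head node setup; all node IPs: {ips}')
else:
    print(f'Worker setup, rank {rank} of {len(ips)} nodes')
```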
What about after restarting a cluster?
Good point! Added the sentence.

For restarted clusters, we only guarantee that `NODE_RANK==0` is always the head node; the order of the other instances depends on the cloud's implementation for assigning external IPs to the restarted nodes. We sort the nodes by external IP, so if the external IPs change, the node ranks of the workers may be reordered:
skypilot/sky/backends/cloud_vm_ray_backend.py, lines 2429 to 2430 in 30b39a6:

```python
stable_internal_external_ips = [internal_external_ips[0]] + sorted(
    internal_external_ips[1:], key=lambda x: x[1])
```
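To make the ordering concrete, a toy example with hypothetical IPs, using the same logic as the snippet above:

```python
# Toy (internal_ip, external_ip) pairs; index 0 is the head node.
internal_external_ips = [
    ('10.0.0.3', '54.1.2.9'),  # head
    ('10.0.0.5', '54.1.2.3'),  # worker
    ('10.0.0.4', '54.1.2.7'),  # worker
]
# Head stays first; workers are sorted by external IP (string sort).
stable = [internal_external_ips[0]] + sorted(
    internal_external_ips[1:], key=lambda x: x[1])
print(stable)
# [('10.0.0.3', '54.1.2.9'), ('10.0.0.5', '54.1.2.3'), ('10.0.0.4', '54.1.2.7')]
```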
I guess the order of the worker nodes should be fine as long as we keep the head node first, since most heterogeneity after restarting the cluster comes from the head node vs. the workers, e.g., we only store the job state and SkyPilot logs on the head node.
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
…ypilot into print-segmentation-fault
Fixes #3080
The Ray driver somehow fails to print the output `segmentation fault` when the job fails due to that error. This PR improves the UX for this case by checking whether the return code is 139 and printing out the possible reason.

This PR also fixes #3116: in the multi-node case, when one of the workers fails, we should fail the entire job quickly instead of waiting for the other workers.
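As a rough illustration of the return-code check (a minimal sketch, not SkyPilot's actual implementation; the command and messages are made up):

```python
import signal
import subprocess

# 139 = 128 + signal.SIGSEGV (11): the shell convention for a process
# killed by a segmentation fault.
SIGSEGV_RETURNCODE = 128 + signal.SIGSEGV

proc = subprocess.run(['bash', '-c', 'some_user_command'])  # hypothetical job
if proc.returncode == SIGSEGV_RETURNCODE:
    print('Job exited with return code 139: the program likely crashed '
          'with a segmentation fault (SIGSEGV).')
```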
This PR also makes job submission a bit more efficient by releasing the resources earlier.
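For the multi-node fail-fast behavior above, the idea is roughly the following. This is a hypothetical sketch using `concurrent.futures` and ssh, not the actual Ray-based implementation:

```python
import concurrent.futures
import subprocess

def run_on_node(ip: str) -> int:
    # Placeholder: launch the job's run command on one node, e.g. via ssh.
    return subprocess.run(['ssh', ip, 'run_job']).returncode

node_ips = ['1.2.3.4', '1.2.3.5', '1.2.3.6']  # hypothetical
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {pool.submit(run_on_node, ip): ip for ip in node_ips}
    for fut in concurrent.futures.as_completed(futures):
        if fut.result() != 0:
            # One node failed: cancel any not-yet-started launches and
            # fail the whole job immediately instead of waiting.
            for other in futures:
                other.cancel()
            raise RuntimeError(f'Job failed on node {futures[fut]}')
```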
Tested (run the relevant ones):
- `bash format.sh`
- `sky launch -c test --num-nodes 100 --cpus 2 --cloud aws echo hi`
- `sky launch -c test --num-nodes 100 --cpus 2 --cloud aws tests/test_yaml/failed_worker_run.yaml`
- `sky launch -c test --num-nodes 100 --cpus 2 --cloud aws --detach-setup tests/test_yaml/failed_worker_setup.yaml`
- `pytest tests/test_smoke.py`
- `pytest tests/test_smoke.py::test_fill_in_the_name`
- `pytest tests/test_smoke.py::test_multi_node_failure`
- `bash tests/backward_compatibility_tests.sh`