[Core] Make multi-node job fail fast when one fails, and output segment fault (#3081)

* [minor] Make job scheduling more efficient and output segment fault
* remove additional space
* fail early for multiple tasks
* fix
* Fix
* address comments
* add smoke tests
* Add setup IP and ranks
* Fix returncodes order
* Add todo
* Add comment
* Add comment back
* fix returncodes
* remove print
* Add todo in smoke test
* Fix failed run yaml
* Address comments
* use run_timestamp
* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* address comments
* mypy
* format
* address comments

---------

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
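The fail-fast behavior named in the commit title can be illustrated with a minimal local sketch. This is not the code from `sky/backends/cloud_vm_ray_backend.py`; the function `run_fail_fast` and its polling loop are hypothetical, and plain local subprocesses stand in for per-node tasks. The idea is only that once any node's command exits non-zero, the remaining commands are stopped instead of being waited on, and return codes come back in node order.

```python
import subprocess
import sys
import time
from typing import List, Sequence


def run_fail_fast(commands: Sequence[Sequence[str]],
                  poll_interval: float = 0.05) -> List[int]:
    """Run one command per node; kill the rest once any exits non-zero.

    Hypothetical helper. Return codes are reported in the same order as
    `commands`, so index i always corresponds to node rank i.
    """
    procs = [subprocess.Popen(list(cmd)) for cmd in commands]
    try:
        while True:
            codes = [p.poll() for p in procs]
            if any(code not in (None, 0) for code in codes):
                break  # A node failed: stop waiting on the others.
            if all(code is not None for code in codes):
                break  # Every node finished successfully.
            time.sleep(poll_interval)
    finally:
        # Kill whatever is still running (no-op if everything finished).
        for p in procs:
            if p.poll() is None:
                p.kill()
    return [p.wait() for p in procs]


if __name__ == '__main__':
    sleeper = [sys.executable, '-c', 'import time; time.sleep(60)']
    failer = [sys.executable, '-c', 'import sys; sys.exit(1)']
    started = time.monotonic()
    print(run_fail_fast([sleeper, failer, sleeper]))
    print(f'finished after {time.monotonic() - started:.1f}s')
```

With the three commands in the usage section, the middle one fails immediately, so the call should return well before the 60-second sleeps would have completed. The exact codes reported for the killed sleepers depend on the platform.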
1 parent d2b2118, commit b042741

Showing 6 changed files with 198 additions and 84 deletions.
@@ -0,0 +1,11 @@
+resources:
+  cpus: 2+
+
+num_nodes: 3
+
+run: |
+  if [ "$SKYPILOT_NODE_RANK" == "1" ]; then
+    exit 1
+  fi
+  echo My hostname: $(hostname)
+  sleep 10000
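In the task above, the node with rank 1 exits with code 1 while the other two nodes would otherwise sleep for 10000 seconds. Under fail-fast, the sleeping nodes get terminated, so a failure report has to distinguish the node that actually failed from the ones that were stopped as a consequence. The sketch below shows one way such return codes might be turned into messages, including the segmentation-fault case mentioned in the commit title. The helper names and message wording are hypothetical, and the signal numbers assume a POSIX system.

```python
import signal
from typing import List


def describe_returncode(code: int) -> str:
    """Turn a process return code into a short reason (hypothetical wording)."""
    if code == 0:
        return 'succeeded'
    # A shell reports death by signal as 128 + signum, while Python's
    # subprocess module reports it as -signum. Both forms are accepted here.
    if code in (128 + signal.SIGSEGV, -signal.SIGSEGV):
        return 'failed with a segmentation fault'
    if code in (128 + signal.SIGKILL, -signal.SIGKILL):
        return 'was killed'
    return f'failed with return code {code}'


def summarize(returncodes: List[int]) -> List[str]:
    """One line per node; list index is assumed to equal the node rank."""
    return [
        f'node rank {rank} {describe_returncode(code)}'
        for rank, code in enumerate(returncodes)
    ]


if __name__ == '__main__':
    # Illustrative codes only: rank 1 exited with 1, ranks 0 and 2 were killed.
    for line in summarize([-9, 1, -9]):
        print(line)
```

Keeping the list index equal to the node rank is what makes the per-node messages meaningful, which is presumably why the commit message calls out fixing the return-code order.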
@@ -0,0 +1,16 @@
+resources:
+  cpus: 2+
+
+num_nodes: 3
+
+setup: |
+  echo "Setting up nodes"
+  echo "$SKYPILOT_SETUP_NODE_RANK"
+  if [ "$SKYPILOT_SETUP_NODE_RANK" == "1" ]; then
+    echo FAILING $SKYPILOT_SETUP_NODE_RANK
+    exit 1
+  fi
+  sleep 10000
+run: |
+  echo Should not get here
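The second task exercises the setup phase: rank 1 fails during `setup`, and `run` is expected never to start. A rough local stand-in for that gating is sketched below. It is an illustration under assumptions, not SkyPilot's implementation: `setup_then_run`, `SETUP_SCRIPT`, and `RUN_SCRIPT` are hypothetical names, the environment variables are set by hand rather than by SkyPilot, and the long `sleep` is left out so the example finishes immediately.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Tuple

# Hypothetical stand-in for the `setup` script: rank 1 fails, others succeed.
SETUP_SCRIPT = (
    'import os, sys; '
    'rank = os.environ["SKYPILOT_SETUP_NODE_RANK"]; '
    'sys.exit(1 if rank == "1" else 0)'
)
# Hypothetical stand-in for the `run` script.
RUN_SCRIPT = 'print("Should not get here")'


def _run_on_rank(script: str, rank: int, env_name: str) -> int:
    """Run `script` in a child Python process with the rank exported."""
    env = dict(os.environ, **{env_name: str(rank)})
    return subprocess.run([sys.executable, '-c', script], env=env).returncode


def setup_then_run(num_nodes: int) -> Tuple[List[int], Optional[List[int]]]:
    """Run setup on every rank; start `run` only if all setups succeeded."""
    ranks = range(num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        setup_codes = list(pool.map(
            lambda r: _run_on_rank(SETUP_SCRIPT, r,
                                   'SKYPILOT_SETUP_NODE_RANK'), ranks))
        if any(setup_codes):
            return setup_codes, None  # Setup failed somewhere: skip run.
        run_codes = list(pool.map(
            lambda r: _run_on_rank(RUN_SCRIPT, r, 'SKYPILOT_NODE_RANK'),
            ranks))
    return setup_codes, run_codes


if __name__ == '__main__':
    print(setup_then_run(3))
```

Returning `None` for the run codes makes "run was never started" distinguishable from "run started and failed", which mirrors the intent of the `echo Should not get here` line.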