-
Notifications
You must be signed in to change notification settings - Fork 510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Fluidstack] Improve fluidstack provisioner for corner cases #3254
Conversation
sky/provision/fluidstack/instance.py
Outdated
head_instance_id = inst_id | ||
rename(inst_id, f'{cluster_name_on_cloud}-head') | ||
logger.info(f'Rename head node {head_instance_id}.') | ||
if inst_id != head_instance_id and inst['hostname'].endswith('-head'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for the case when launching on an existing cluster where some node has been set to head node? in this case, should we keep the original head node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for the case when there are multiple instances with -head
. We should have reuse the same head instance if only one head exist in L102
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would there be multiple -head
? A bug on the CSP side? Worth a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added a comment about this. This should not happen in most of the case. It is just for a safeguard against the case when somehow multiple head instances exist, e.g., a user changed the name manually.
sky/provision/fluidstack/instance.py
Outdated
head_instance_id = inst_id | ||
rename(inst_id, f'{cluster_name_on_cloud}-head') | ||
logger.info(f'Rename head node {head_instance_id}.') | ||
if inst_id != head_instance_id and inst['hostname'].endswith('-head'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would there be multiple -head
? A bug on the CSP side? Worth a comment?
…o improve-fluidstack
Just updated the instance creation logic, so that we don't wait infinitely for a instance directly being converted to terminated state from a pending state. PTAL @concretevitamin @cblmemo |
Merging this now to unblock our user : ) |
…o improve-fluidstack
Fixes #3259.
It also fixes the issue when there are multiple existing instance with the name
*-head
, we now reuse those instances by renaming them.Tested (run the relevant ones):
bash format.sh
sky launch --cloud fluidstack --gpus A100-80GB nvidia-smi -c test-fluid
sky launch -c test-a100 --cloud fluidstack --gpus A100-80GB nvidia-smi --num-nodes 2
sky launch -c test-a100 --cloud fluidstack --gpus A100-80GB:8 nvidia-smi
sky launch -c test-h100 --cloud fluidstack --gpus H100 nvidia-smi
sky launch -c test-h100 --cloud fluidstack --gpus H100:2 nvidia-smi
sky launch -c test-h100 --cloud fluidstack --gpus H100:8 nvidia-smi
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh