Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fluidstack] Improve fluidstack provisioner for corner cases #3254

Merged
merged 14 commits into from
Mar 1, 2024

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Feb 29, 2024

Fixes #3259.

It also fixes the issue when there are multiple existing instance with the name *-head, we now reuse those instances by renaming them.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch --cloud fluidstack --gpus A100-80GB nvidia-smi -c test-fluid
    • sky launch -c test-a100 --cloud fluidstack --gpus A100-80GB nvidia-smi --num-nodes 2
    • sky launch -c test-a100 --cloud fluidstack --gpus A100-80GB:8 nvidia-smi
    • sky launch -c test-h100 --cloud fluidstack --gpus H100 nvidia-smi
    • sky launch -c test-h100 --cloud fluidstack --gpus H100:2 nvidia-smi
    • sky launch -c test-h100 --cloud fluidstack --gpus H100:8 nvidia-smi
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

sky/provision/fluidstack/instance.py Outdated Show resolved Hide resolved
head_instance_id = inst_id
rename(inst_id, f'{cluster_name_on_cloud}-head')
logger.info(f'Rename head node {head_instance_id}.')
if inst_id != head_instance_id and inst['hostname'].endswith('-head'):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for the case when launching on an existing cluster where some node has been set to head node? in this case, should we keep the original head node?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the case when there are multiple instances with -head. We should have reuse the same head instance if only one head exist in L102

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would there be multiple -head? A bug on the CSP side? Worth a comment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added a comment about this. This should not happen in most of the case. It is just for a safeguard against the case when somehow multiple head instances exist, e.g., a user changed the name manually.

sky/provision/fluidstack/fluidstack_utils.py Outdated Show resolved Hide resolved
sky/provision/fluidstack/instance.py Outdated Show resolved Hide resolved
sky/provision/fluidstack/instance.py Outdated Show resolved Hide resolved
sky/provision/fluidstack/instance.py Show resolved Hide resolved
sky/provision/fluidstack/instance.py Outdated Show resolved Hide resolved
head_instance_id = inst_id
rename(inst_id, f'{cluster_name_on_cloud}-head')
logger.info(f'Rename head node {head_instance_id}.')
if inst_id != head_instance_id and inst['hostname'].endswith('-head'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would there be multiple -head? A bug on the CSP side? Worth a comment?

@Michaelvll
Copy link
Collaborator Author

Just updated the instance creation logic, so that we don't wait infinitely for a instance directly being converted to terminated state from a pending state. PTAL @concretevitamin @cblmemo

@Michaelvll
Copy link
Collaborator Author

Merging this now to unblock our user : )

@Michaelvll Michaelvll merged commit 095acab into master Mar 1, 2024
19 checks passed
@Michaelvll Michaelvll deleted the improve-fluidstack branch March 1, 2024 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Fluidstack] Launching A100 machine fails due to the reboot after nvidia driver installation takes too long
3 participants