Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SkyServe] Support mixture of spot and on-demand #3194

Merged
merged 224 commits into from
Feb 28, 2024
Merged

Conversation

MaoZiming
Copy link
Collaborator

@MaoZiming MaoZiming commented Feb 20, 2024

#2793
without spot placer.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@MaoZiming
Copy link
Collaborator Author

@cblmemo Thanks! resolved all the comments.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just finished a pass : )

sky/serve/autoscalers.py Outdated Show resolved Hide resolved
sky/serve/autoscalers.py Outdated Show resolved Hide resolved
sky/serve/autoscalers.py Show resolved Hide resolved
sky/serve/autoscalers.py Outdated Show resolved Hide resolved
sky/serve/autoscalers.py Outdated Show resolved Hide resolved
sky/serve/replica_managers.py Outdated Show resolved Hide resolved
sky/serve/replica_managers.py Outdated Show resolved Hide resolved
# blocking probe to other replicas.
is_preempted = self._handle_preemption(info)
if is_preempted:
continue
Copy link
Collaborator

@Michaelvll Michaelvll Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole for loop of the future in probe_futures is quite weird, why can't we just have the whole "probe and process probe results" in a same function and have that function to be parallelized, instead of only parallelizing info.probe. cc'ing @cblmemo
Otherwise the processing and termination of the unhealthy replica could block the whole process of the probing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Actually this is taking a lot of time and once caused some bugs in our SpotHedge e2e experiment; we should parallelize the preemption handling as well

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump for this one

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we spawn another thread to handle preemption (for modularity)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm feeling like we could spawn a thread to do all of the following:

  1. info.probe
  2. Update all attributes in info, e.g. info.status_property
  3. Handle preemption
  4. Update database

Does that sounds good to you?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we should mix all these tasks under one thread. This probably requires more discussion and I think can be a different PR.

sky/serve/service_spec.py Outdated Show resolved Hide resolved
sky/serve/service_spec.py Outdated Show resolved Hide resolved
sky/serve/autoscalers.py Outdated Show resolved Hide resolved
sky/serve/autoscalers.py Outdated Show resolved Hide resolved
sky/serve/autoscalers.py Outdated Show resolved Hide resolved
sky/serve/autoscalers.py Outdated Show resolved Hide resolved
sky/serve/autoscalers.py Outdated Show resolved Hide resolved
sky/serve/service_spec.py Show resolved Hide resolved
tests/test_smoke.py Show resolved Hide resolved
sky/serve/replica_managers.py Outdated Show resolved Hide resolved
sky/serve/replica_managers.py Show resolved Hide resolved
sky/serve/replica_managers.py Show resolved Hide resolved
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the smoke test! Left some comments about that 🫡

tests/test_smoke.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
Comment on lines 3049 to 3052
def _check_both_spot_on_demand_in_status(name: str) -> str:
return (f'output=$(sky serve status {name});'
'echo "$output" | grep "GCP(\[Spot\]vCPU=2)" && '
'echo "$output" | grep "GCP(vCPU=2)";')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually that makes me think, should we print the is_spot column in our sky serve status table?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The row will have \[Spot\] so I think we don't need a is_spot column

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this PR @MaoZiming! The code looks mostly good to me.

sky/serve/autoscalers.py Outdated Show resolved Hide resolved
sky/serve/autoscalers.py Show resolved Hide resolved
sky/serve/core.py Outdated Show resolved Hide resolved
sky/serve/replica_managers.py Outdated Show resolved Hide resolved
@MaoZiming MaoZiming merged commit a2f4a68 into master Feb 28, 2024
19 checks passed
@MaoZiming MaoZiming deleted the serve-spot-no-placer branch February 28, 2024 03:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants