[SkyServe] Support mixture of spot and on-demand #3194

Merged
merged 224 commits into from
Feb 28, 2024
Changes from 205 commits
Commits
ede09ac
rebase
MaoZiming Nov 16, 2023
fe4ad21
shift string
MaoZiming Nov 16, 2023
af54fa4
Merge remote-tracking branch 'origin/master' into serve-spot
cblmemo Nov 21, 2023
b194a06
rebase & add autoscaler
cblmemo Nov 21, 2023
0fe5947
add exmaple and format
cblmemo Nov 21, 2023
254ea19
fix bug
cblmemo Nov 21, 2023
4930f3a
reimplement
cblmemo Nov 21, 2023
c74623c
fix
cblmemo Nov 21, 2023
20f0302
clear cnt when decision is made
cblmemo Nov 22, 2023
2dee686
scale down status order
cblmemo Nov 22, 2023
03f2a7b
fix bug & policy
cblmemo Nov 22, 2023
bcb4812
dont count overprovision
cblmemo Nov 22, 2023
1036a40
log
cblmemo Nov 22, 2023
98e8085
fix
cblmemo Nov 22, 2023
2c50a1c
fix
cblmemo Nov 23, 2023
df6326f
fix including not ready
cblmemo Nov 23, 2023
1eb1ec7
fix bootstrap
cblmemo Nov 23, 2023
3a55418
fix status
cblmemo Nov 24, 2023
8dccb8b
rewrite autoscaler logic
cblmemo Nov 24, 2023
31d8e76
cap num ondemand to scale up
cblmemo Nov 25, 2023
1cad549
move to active after successful launch
cblmemo Nov 25, 2023
5bda62d
move to preemption list if the sky launch failed
cblmemo Nov 26, 2023
5baa7f0
added on-demand policy;
MaoZiming Nov 26, 2023
25661b4
shrink downscale factor
cblmemo Nov 27, 2023
63fb2f0
Merge remote-tracking branch 'origin/master' into serve-spot
cblmemo Nov 27, 2023
a9f461b
e2e experiment info dump
cblmemo Nov 28, 2023
6ce8988
Merge branch 'serve-spot-on-demand' into serve-spot
MaoZiming Nov 29, 2023
def9e1c
fix on-demand check
MaoZiming Nov 29, 2023
868bd25
format
MaoZiming Nov 29, 2023
6f5c84e
fix policy
cblmemo Nov 29, 2023
0d107cd
Merge branch 'serve-spot' of github.com:skypilot-org/skypilot into se…
cblmemo Nov 29, 2023
2301824
update
MaoZiming Nov 30, 2023
c80ce99
dynamicfailoverspot
MaoZiming Nov 30, 2023
7ac62f7
comment out logg
MaoZiming Dec 1, 2023
282612a
checkpt
MaoZiming Dec 2, 2023
428b862
overprovision=2
MaoZiming Dec 2, 2023
faf2d0c
zone awareness
cblmemo Dec 4, 2023
fac6e0e
remove insufficient capacity
cblmemo Dec 4, 2023
157d3b2
Merge branch 'serve-spot' of github.com:skypilot-org/skypilot into se…
cblmemo Dec 4, 2023
b66a969
filter out od fallback
cblmemo Dec 4, 2023
d059ccc
fix
cblmemo Dec 7, 2023
9e4bd87
add num_extra as a yaml parameter
MaoZiming Dec 10, 2023
c36820b
fix pytest
MaoZiming Dec 10, 2023
a17a02c
setstate
MaoZiming Dec 10, 2023
d8cf3f6
format.sh
MaoZiming Dec 10, 2023
8aa6357
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
MaoZiming Dec 13, 2023
5877001
add safety net
MaoZiming Dec 13, 2023
268ade2
get info.zone
MaoZiming Dec 13, 2023
7813f35
uncomment add to preemption list
MaoZiming Dec 13, 2023
c258ea5
fix bug
MaoZiming Dec 13, 2023
44f722e
bug fix
MaoZiming Dec 14, 2023
a07dc04
num_init_replicas
MaoZiming Dec 14, 2023
fd8697f
delete todos: (tian): Change spot_mixer to boolean
MaoZiming Dec 14, 2023
6d3b973
.
MaoZiming Dec 14, 2023
d434f75
deprecate original RequestRateAutoscaler
MaoZiming Dec 14, 2023
66e3400
spot zones
MaoZiming Dec 14, 2023
8ed6b9e
bug fix
MaoZiming Dec 14, 2023
3b3eab6
bug fix
MaoZiming Dec 14, 2023
e629484
fix
MaoZiming Dec 14, 2023
db6f398
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
MaoZiming Dec 14, 2023
cb6f998
remove cooldown
MaoZiming Dec 15, 2023
9fb6305
drain the replica at scale down or up
MaoZiming Dec 15, 2023
39a0fcc
tmp
MaoZiming Dec 16, 2023
9c254b1
update AutoscalerDecision, use target_qps, migrate _get_desired_num_r…
MaoZiming Dec 16, 2023
5d19a9e
bug fix
MaoZiming Dec 16, 2023
1d7683a
fix bugs
MaoZiming Dec 16, 2023
a544a8a
merge master
MaoZiming Dec 16, 2023
940cb06
update templates
MaoZiming Dec 16, 2023
0c06507
remove test.py
MaoZiming Dec 16, 2023
39bcc28
move parameters to user config
MaoZiming Dec 16, 2023
ee35379
merge into master
MaoZiming Dec 27, 2023
3772b25
clean up code after merging master
MaoZiming Dec 27, 2023
96b050e
refactor code, autoscaler and controller
MaoZiming Dec 27, 2023
d78a221
update yaml and change spot_mixer to autoscaler
MaoZiming Dec 27, 2023
7205f51
added yaml examples and fix bugs
MaoZiming Dec 27, 2023
2a1342c
added no spot zones examples
MaoZiming Dec 28, 2023
3aabee4
min_on_demand_replicas
MaoZiming Dec 28, 2023
69affd2
add min_on_demand_replicas and a todo for preemption warning
MaoZiming Dec 28, 2023
3c89daf
fix bug and update spot_placer
MaoZiming Dec 28, 2023
0f0f9a5
spot_placer rename
MaoZiming Dec 28, 2023
4b35162
code review
MaoZiming Dec 29, 2023
e32e0c8
fix comments
MaoZiming Dec 29, 2023
e510f5e
remove evenspread
MaoZiming Dec 29, 2023
69223ef
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
MaoZiming Dec 30, 2023
038af16
address code reviews
MaoZiming Dec 30, 2023
4c53795
fix pr
MaoZiming Dec 30, 2023
120ab31
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
MaoZiming Jan 1, 2024
dd56475
deprecate spot_zones, infer spot_zones from resource field
MaoZiming Jan 1, 2024
f3f0956
update yamls
MaoZiming Jan 1, 2024
3f963ec
update yaml
MaoZiming Jan 1, 2024
ad12fac
remove ordered
MaoZiming Jan 1, 2024
e419e9d
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
MaoZiming Jan 4, 2024
113ba9b
num_extra_on_demand
MaoZiming Jan 4, 2024
05ab639
address pr reviews
MaoZiming Jan 4, 2024
d30b239
update resource handling
MaoZiming Jan 4, 2024
5d5cba2
fix print and dictionary issue
MaoZiming Jan 4, 2024
77234db
address some comments
MaoZiming Jan 5, 2024
031e896
use filter instead of list comprehension
MaoZiming Jan 5, 2024
1cf4c95
refactor autoscaler
MaoZiming Jan 5, 2024
a4b8fed
get_feasible_launchable_resources
MaoZiming Jan 5, 2024
71997bb
fix pr
MaoZiming Jan 11, 2024
efd71d3
merge master
MaoZiming Jan 11, 2024
ceb30a2
update evenly spreading zones among active zones
MaoZiming Jan 11, 2024
3da66d5
fix any_of bug
MaoZiming Jan 12, 2024
03bb164
merge master
MaoZiming Jan 24, 2024
e8ea1d9
merge autoscalers
MaoZiming Jan 24, 2024
3d3e5ff
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
MaoZiming Jan 24, 2024
79fe625
_fill_in_launchable_resources
MaoZiming Jan 24, 2024
096e980
add yaml description
MaoZiming Jan 24, 2024
2268101
update yamls
MaoZiming Jan 24, 2024
a35cf88
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
MaoZiming Jan 27, 2024
476ffcf
update existing zone
MaoZiming Jan 28, 2024
5973da1
Update sky/task.py
MaoZiming Jan 28, 2024
7741026
Update sky/serve/spot_policy.py
MaoZiming Jan 28, 2024
f5edd11
Update sky/cli.py
MaoZiming Jan 28, 2024
24e2a3a
Update sky/serve/autoscalers.py
MaoZiming Jan 28, 2024
16eaf32
code review
MaoZiming Jan 28, 2024
97f269d
Merge branch 'serve-spot' of https://github.com/skypilot-org/skypilot…
MaoZiming Jan 28, 2024
d612c5f
code review
MaoZiming Jan 28, 2024
a07ce18
update other spotautoscaler variables
MaoZiming Jan 28, 2024
ca135cd
fix
MaoZiming Jan 28, 2024
a7c750d
remove anyof
MaoZiming Jan 29, 2024
56e7550
merge master
MaoZiming Jan 29, 2024
48e0e90
use defaultdict
MaoZiming Jan 29, 2024
ef308e5
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
MaoZiming Jan 30, 2024
c21f0b3
update name for init_subclass
MaoZiming Jan 30, 2024
442888b
running format
MaoZiming Jan 30, 2024
ed516ca
fixing pr
MaoZiming Jan 31, 2024
4d5c52c
fix PR
MaoZiming Jan 31, 2024
1db2be2
update any_of
MaoZiming Jan 31, 2024
d4e76a9
ordered resources
MaoZiming Jan 31, 2024
4f1c3e8
ordered resources
MaoZiming Jan 31, 2024
1268dc2
add with ux_utils.print_exception_no_traceback():
MaoZiming Jan 31, 2024
2e2b00d
print final target_num_replicas
MaoZiming Jan 31, 2024
628a53f
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
MaoZiming Feb 1, 2024
f9aa7f6
update target based on qps target
MaoZiming Feb 1, 2024
242fe02
added check for both up and update
MaoZiming Feb 1, 2024
bd9bfce
add skyserve tests
MaoZiming Feb 1, 2024
38390a3
remove spot_placer, use spot_policy instead
MaoZiming Feb 2, 2024
84413cd
spot_placer
MaoZiming Feb 2, 2024
74a07b3
format
MaoZiming Feb 2, 2024
1792712
update use_spot or non_use_spot
MaoZiming Feb 2, 2024
e4baf0f
fix pr and add a newline to yaml
MaoZiming Feb 2, 2024
3fcd3ba
dataclass
MaoZiming Feb 2, 2024
66c03f5
update autoscaler
MaoZiming Feb 2, 2024
82565d9
bump autoscaler version
MaoZiming Feb 2, 2024
2bbfef9
request timestamps update
MaoZiming Feb 2, 2024
c186954
merge master
MaoZiming Feb 3, 2024
b2813be
update tests and error handling
MaoZiming Feb 3, 2024
abe57f0
fix Location hash
MaoZiming Feb 3, 2024
a010e6b
update comment
MaoZiming Feb 3, 2024
52ac31c
name change
MaoZiming Feb 4, 2024
df9f604
_serve_check_service
MaoZiming Feb 4, 2024
4109263
add newline to yamls
MaoZiming Feb 4, 2024
201fbed
spot_policies
MaoZiming Feb 4, 2024
1c509f5
Update sky/serve/replica_managers.py
MaoZiming Feb 4, 2024
d6b84b9
code review
MaoZiming Feb 4, 2024
8bdc8df
Merge branch 'serve-spot' of https://github.com/skypilot-org/skypilot…
MaoZiming Feb 4, 2024
1339364
format
MaoZiming Feb 4, 2024
b895b74
fix import
MaoZiming Feb 4, 2024
f90852d
add skyserve spot policy
MaoZiming Feb 4, 2024
1a20d57
format
MaoZiming Feb 4, 2024
8e729c6
assert len(task.resources) >= 1
MaoZiming Feb 4, 2024
d7887b5
fix bug and added SpotOnDemandMix
MaoZiming Feb 5, 2024
ff9092c
bug fix and edit wording on yaml
MaoZiming Feb 5, 2024
ae26bed
spot_policy_str
MaoZiming Feb 5, 2024
c905e02
update examples/serve/policy/spot_on_demand_mix.yaml yaml
MaoZiming Feb 5, 2024
cf5ab91
yaml doc
MaoZiming Feb 5, 2024
cd6846d
not expose the autoscaler option to the user
MaoZiming Feb 6, 2024
d39e431
update interface
MaoZiming Feb 11, 2024
d4c7b8d
update initialization
MaoZiming Feb 11, 2024
8da4024
require use_spot explicitly
MaoZiming Feb 11, 2024
ae97cc9
went through spec
MaoZiming Feb 11, 2024
402a048
remove NAME
MaoZiming Feb 11, 2024
6d5e18c
update yaml
MaoZiming Feb 11, 2024
4eec556
revert is True
MaoZiming Feb 11, 2024
0275dab
update yamls
MaoZiming Feb 11, 2024
e4ef246
interface fix
MaoZiming Feb 12, 2024
ea50201
added multi accelerator support
MaoZiming Feb 12, 2024
848dead
fix yaml
MaoZiming Feb 12, 2024
65e92a5
fix UI issues
MaoZiming Feb 12, 2024
0d74aa7
fix pr reviews
MaoZiming Feb 12, 2024
f98ac30
replica_ids_to_scale_down
MaoZiming Feb 12, 2024
ee16f3c
update autoscaler names
MaoZiming Feb 12, 2024
5af48cb
remove initialization and add back checking active
MaoZiming Feb 12, 2024
28a248d
remove target_qps_per_replicas
MaoZiming Feb 14, 2024
7fe60c1
max_replicas required where target_qps_per_replica is set
MaoZiming Feb 14, 2024
274ccf9
pr
MaoZiming Feb 14, 2024
a9aad13
fix nits
MaoZiming Feb 14, 2024
a25d140
code review
MaoZiming Feb 19, 2024
28a3854
code review
MaoZiming Feb 19, 2024
f16bd20
remove Autoscaler.from_spec
MaoZiming Feb 19, 2024
4be8677
num_ready_spot
MaoZiming Feb 19, 2024
fc1c774
removed spot placer
MaoZiming Feb 20, 2024
1d640e6
update yaml
MaoZiming Feb 20, 2024
ce279c8
delete spot_only yaml
MaoZiming Feb 20, 2024
5d6917e
format
MaoZiming Feb 20, 2024
16119cb
# use_spot is needed for ondemand fallback
MaoZiming Feb 20, 2024
834df43
error msg
MaoZiming Feb 20, 2024
0a63aa1
update print
MaoZiming Feb 20, 2024
d6a3f99
Update sky/serve/autoscalers.py
MaoZiming Feb 20, 2024
de959e7
code review
MaoZiming Feb 20, 2024
3846235
Merge branch 'serve-spot-no-placer' of https://github.com/skypilot-or…
MaoZiming Feb 20, 2024
81688f3
added a todo
MaoZiming Feb 21, 2024
5c560f4
added todo
MaoZiming Feb 21, 2024
016e4a8
smoke test for base on demand fallback
MaoZiming Feb 21, 2024
26387b3
pr review
MaoZiming Feb 21, 2024
4f77875
update ports
MaoZiming Feb 21, 2024
d80817e
replace up validation to _validate_service_task
MaoZiming Feb 21, 2024
1e6ff03
updated status order
MaoZiming Feb 21, 2024
daeeee8
get_dynamic_states and load dynamic_states
MaoZiming Feb 21, 2024
ef065bf
move function locations
MaoZiming Feb 21, 2024
73dd9c3
add other statuses
MaoZiming Feb 21, 2024
547fa41
merge
MaoZiming Feb 25, 2024
9dbe9ab
code review
MaoZiming Feb 25, 2024
33df3d6
fix pr
MaoZiming Feb 26, 2024
68be000
move interrupted position
MaoZiming Feb 26, 2024
3ea5a8a
added smoke test test_skyserve_dynamic_ondemand_fallback
MaoZiming Feb 26, 2024
5c99e8b
_terminate_gcp_replica
MaoZiming Feb 26, 2024
c8a16b7
updated smoke tests
MaoZiming Feb 27, 2024
aacbcaf
update first_ready_time
MaoZiming Feb 27, 2024
4fe7fb7
code review
MaoZiming Feb 28, 2024
6f63b65
Merge branch 'master' of https://github.com/skypilot-org/skypilot int…
MaoZiming Feb 28, 2024
b87764a
update smoke test sleep
MaoZiming Feb 28, 2024
21 changes: 21 additions & 0 deletions examples/serve/spot_policy/base_on_demand_fallback_replicas.yaml
@@ -0,0 +1,21 @@
# SkyServe YAML to launch a service with mixed spot and on-demand instances.
# The policy will maintain `base_ondemand_fallback_replicas` number of on-demand instances, in addition to spot instances.
# On-demand instances are counted in autoscaling decisions (i.e., between `min_replicas` and `max_replicas`).

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 2
    max_replicas: 3
    target_qps_per_replica: 1
    base_ondemand_fallback_replicas: 1

resources:
  ports: 8081
  cpus: 2+
  # use_spot is needed for ondemand fallback
  use_spot: true

workdir: examples/serve/http_server

run: python3 server.py
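
The comments in this file suggest that the on-demand base is carved out of the autoscaler's target rather than added on top of it. A minimal Python sketch of that arithmetic follows. The function names `target_replicas` and `split_replicas` and their arguments are hypothetical, and the real logic in `sky/serve/autoscalers.py` may differ (for example, it may smooth scaling decisions over time).

```python
import math
from typing import Tuple


def target_replicas(qps: float, target_qps_per_replica: float,
                    min_replicas: int, max_replicas: int) -> int:
    """Clamp the QPS-derived replica count to [min_replicas, max_replicas]."""
    wanted = math.ceil(qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


def split_replicas(target: int, base_ondemand: int) -> Tuple[int, int]:
    """Return (num_ondemand, num_spot); the two always sum to `target`."""
    # Assumption: the on-demand base counts toward the target, so spot
    # replicas fill whatever remains.
    num_ondemand = min(base_ondemand, target)
    return num_ondemand, target - num_ondemand
```

With the values in this YAML (`min_replicas: 2`, `max_replicas: 3`, `base_ondemand_fallback_replicas: 1`), the sketch would keep one on-demand replica and one or two spot replicas, depending on load.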
23 changes: 23 additions & 0 deletions examples/serve/spot_policy/dynamic_on_demand_fallback.yaml
@@ -0,0 +1,23 @@
# SkyServe YAML to launch a service with mixed spot and on-demand instances.
# The policy will dynamically fallback to on-demand instances when spot instances are not available.

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 2
    max_replicas: 3
    target_qps_per_replica: 1
    dynamic_ondemand_fallback: true

resources:
  any_of:
    - zone: us-central1-a
    - region: us-east1
  ports: 8081
  cpus: 2+
  # use_spot is needed for ondemand fallback
  use_spot: true

workdir: examples/serve/http_server

run: python3 server.py
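
One plausible reading of "dynamically fallback" is that on-demand replicas temporarily cover any shortfall between the desired and the ready spot replicas. The sketch below illustrates that reading only. The name `ondemand_to_provision` and its parameters are hypothetical and are not taken from the SkyServe code.

```python
def ondemand_to_provision(num_spot_target: int, num_ready_spot: int,
                          base_ondemand: int = 0) -> int:
    """On-demand replicas to keep while some spot replicas are not ready."""
    # Assumption: each missing spot replica is covered by one on-demand
    # replica until the spot replica becomes ready.
    spot_shortfall = max(0, num_spot_target - num_ready_spot)
    return base_ondemand + spot_shortfall
```

Under this reading, the on-demand count drops back to the base once all spot replicas are ready, so the extra cost is paid only during preemptions or capacity shortages.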
23 changes: 23 additions & 0 deletions examples/serve/spot_policy/multi_accelerators.yaml
@@ -0,0 +1,23 @@
# SkyServe YAML to launch a service with mixed spot and on-demand instances and an ordered preference for accelerators.
# The policy will maintain `base_ondemand_fallback_replicas` number of on-demand instances, in addition to spot instances.

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 2
    max_replicas: 3
    target_qps_per_replica: 1
    base_ondemand_fallback_replicas: 1

resources:
  ordered:
    - accelerators: V100
    - accelerators: T4
  ports: 8081
  cpus: 2+
  # use_spot is needed for ondemand fallback
  use_spot: true

workdir: examples/serve/http_server

run: python3 server.py
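
The `ordered` list expresses a preference: try V100 first, then T4. A small Python sketch of "first available wins" selection is shown below. `pick_accelerator` is a hypothetical helper for illustration, not a SkyPilot function, and the real optimizer may also weigh price and location.

```python
from typing import Iterable, Optional, Sequence


def pick_accelerator(ordered: Sequence[str],
                     available: Iterable[str]) -> Optional[str]:
    """Return the first accelerator in `ordered` that is available."""
    available_set = set(available)
    for name in ordered:
        if name in available_set:
            return name
    # Nothing in the preference list can be provisioned right now.
    return None
```

For example, `pick_accelerator(["V100", "T4"], {"T4"})` would select T4 because V100 is not in the available set.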
5 changes: 4 additions & 1 deletion sky/execution.py
@@ -274,8 +274,11 @@ def _execute(
             task)
 
     if not cluster_exists:
+        # If spot is launched by skyserve controller or managed spot controller,
+        # We don't need to print out the logger info.
         if (Stage.PROVISION in stages and task.use_spot and
-                not _is_launched_by_spot_controller):
+                not _is_launched_by_spot_controller and
+                not _is_launched_by_sky_serve_controller):
             yellow = colorama.Fore.YELLOW
             bold = colorama.Style.BRIGHT
             reset = colorama.Style.RESET_ALL
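
The changed condition can be read as a plain boolean predicate: the hint is printed only for user-initiated spot launches. The following standalone sketch restates it. `should_show_spot_hint` and its parameter names are hypothetical stand-ins for the variables in `_execute`, not part of `sky/execution.py`.

```python
def should_show_spot_hint(provisioning: bool, use_spot: bool,
                          by_spot_controller: bool,
                          by_serve_controller: bool) -> bool:
    """True only when a user directly provisions a spot cluster."""
    # Launches driven by either controller suppress the hint, mirroring
    # the two `not _is_launched_by_...` clauses in the diff.
    return (provisioning and use_spot and not by_spot_controller and
            not by_serve_controller)
```

Before this change, a spot replica launched by the SkyServe controller would have satisfied the condition. The added clause excludes that case.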