[Core][BugFix] Fix docker runtime env on Azure #3450

Merged
cblmemo merged 2 commits into master from fix-az-docker-runtime-env
Apr 19, 2024

@cblmemo cblmemo commented Apr 19, 2024

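This is a bug introduced by #3362, where the code assumes image_id is always of the format <publisher>:<offer>:<sku>:<version> and ignores the case where a docker image is used as the runtime environment. For an image_id such as docker:ubuntu:20.04, splitting on ':' yields three parts instead of four, so the unpacking in sky/clouds/azure.py raises a ValueError.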

To reproduce:

$ sky launch --cloud azure --image-id docker:ubuntu:20.04
I 04-18 18:55:57 optimizer.py:693] == Optimizer ==
I 04-18 18:55:57 optimizer.py:704] Target: minimizing cost
I 04-18 18:55:57 optimizer.py:716] Estimated cost: $0.4 / hour
I 04-18 18:55:57 optimizer.py:716] 
I 04-18 18:55:57 optimizer.py:839] Considered resources (1 node):
I 04-18 18:55:57 optimizer.py:909] ----------------------------------------------------------------------------------------------
I 04-18 18:55:57 optimizer.py:909]  CLOUD   INSTANCE          vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 04-18 18:55:57 optimizer.py:909] ----------------------------------------------------------------------------------------------
I 04-18 18:55:57 optimizer.py:909]  Azure   Standard_D8s_v5   8       32        -              eastus        0.38          ✔     
I 04-18 18:55:57 optimizer.py:909] ----------------------------------------------------------------------------------------------
I 04-18 18:55:57 optimizer.py:909] 
Launching a new cluster 'sky-32ec-txia'. Proceed? [Y/n]: 
I 04-18 18:55:59 cloud_vm_ray_backend.py:4226] Creating a new cluster: 'sky-32ec-txia' [1x Azure(Standard_D8s_v5, image_id={'eastus': 'docker:ubuntu:20.04'})].
I 04-18 18:55:59 cloud_vm_ray_backend.py:4226] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 04-18 18:56:00 cloud_vm_ray_backend.py:1363] To view detailed progress: tail -n100 -f /home/txia/sky_logs/sky-2024-04-18-18-55-56-923553/provision.log
Clusters
NAME                          LAUNCHED     RESOURCES                                                                  STATUS   AUTOSTOP  COMMAND                       
az-test-2                     10 hrs ago   1x Azure(Standard_NC4as_T4_v3, {'T4': 1}, image_id={'eastus': 'docker:...  STOPPED  -         sky exec az-test-2 nvidia...  
doc-debug-new                 3 weeks ago  1x GCP(g2-standard-4, {'L4': 1})                                           STOPPED  -         sky start doc-debug-new       
sky-spot-controller-4a0782e9  4 weeks ago  1x AWS(m6i.2xlarge, disk_size=50)                                          STOPPED  10m       sky spot launch -n t-spot...  

Traceback (most recent call last):
  File "/home/txia/miniconda3/envs/skyserve/bin/sky", line 8, in <module>
    sys.exit(cli())
  File "/home/txia/miniconda3/envs/skyserve/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/txia/miniconda3/envs/skyserve/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/txia/skypilot/sky/utils/common_utils.py", line 354, in _record
    return f(*args, **kwargs)
  File "/home/txia/skypilot/sky/cli.py", line 805, in invoke
    return super().invoke(ctx)
  File "/home/txia/miniconda3/envs/skyserve/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/txia/miniconda3/envs/skyserve/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/txia/miniconda3/envs/skyserve/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/txia/skypilot/sky/utils/common_utils.py", line 375, in _record
    return f(*args, **kwargs)
  File "/home/txia/skypilot/sky/cli.py", line 1069, in launch
    _launch_with_confirm(task,
  File "/home/txia/skypilot/sky/cli.py", line 597, in _launch_with_confirm
    sky.launch(
  File "/home/txia/skypilot/sky/utils/common_utils.py", line 375, in _record
    return f(*args, **kwargs)
  File "/home/txia/skypilot/sky/utils/common_utils.py", line 375, in _record
    return f(*args, **kwargs)
  File "/home/txia/skypilot/sky/execution.py", line 452, in launch
    return _execute(
  File "/home/txia/skypilot/sky/execution.py", line 267, in _execute
    handle = backend.provision(task,
  File "/home/txia/skypilot/sky/utils/common_utils.py", line 375, in _record
    return f(*args, **kwargs)
  File "/home/txia/skypilot/sky/utils/common_utils.py", line 354, in _record
    return f(*args, **kwargs)
  File "/home/txia/skypilot/sky/backends/backend.py", line 57, in provision
    return self._provision(task, to_provision, dryrun, stream_logs,
  File "/home/txia/skypilot/sky/backends/cloud_vm_ray_backend.py", line 2668, in _provision
    config_dict = retry_provisioner.provision_with_retries(
  File "/home/txia/skypilot/sky/utils/common_utils.py", line 375, in _record
    return f(*args, **kwargs)
  File "/home/txia/skypilot/sky/backends/cloud_vm_ray_backend.py", line 1990, in provision_with_retries
    config_dict = self._retry_zones(
  File "/home/txia/skypilot/sky/backends/cloud_vm_ray_backend.py", line 1437, in _retry_zones
    config_dict = backend_utils.write_cluster_config(
  File "/home/txia/skypilot/sky/utils/common_utils.py", line 375, in _record
    return f(*args, **kwargs)
  File "/home/txia/skypilot/sky/backends/backend_utils.py", line 789, in write_cluster_config
    resources_vars = to_provision.make_deploy_variables(cluster_name_on_cloud,
  File "/home/txia/skypilot/sky/resources.py", line 951, in make_deploy_variables
    cloud_specific_variables = self.cloud.make_deploy_resources_variables(
  File "/home/txia/skypilot/sky/clouds/azure.py", line 313, in make_deploy_resources_variables
    publisher, offer, sku, version = image_id.split(':')
ValueError: not enough values to unpack (expected 4, got 3)
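
The last frame of the traceback can be illustrated in isolation. The following is a minimal sketch of one way to guard the split, not the change merged in this PR: the helper name `split_azure_image_id` and the constant `DOCKER_IMAGE_PREFIX` are hypothetical, and the `docker:` prefix is assumed from the repro command above.

```python
from typing import Optional, Tuple

# Assumption: docker runtime images are marked with this prefix, as in the
# repro command (`--image-id docker:ubuntu:20.04`).
DOCKER_IMAGE_PREFIX = 'docker:'


def split_azure_image_id(
        image_id: str) -> Optional[Tuple[str, str, str, str]]:
    """Hypothetical helper: split a marketplace image ID into its 4 fields.

    Returns None for a docker runtime image, which has no
    <publisher>:<offer>:<sku>:<version> structure to unpack.
    """
    if image_id.startswith(DOCKER_IMAGE_PREFIX):
        return None
    parts = image_id.split(':')
    if len(parts) != 4:
        # Fail with a clearer message than the bare unpacking error.
        raise ValueError(
            'Expected <publisher>:<offer>:<sku>:<version>, '
            f'got {image_id!r}')
    publisher, offer, sku, version = parts
    return publisher, offer, sku, version
```

With this guard, `split_azure_image_id('docker:ubuntu:20.04')` returns `None` instead of raising, and the caller can choose a default VM image for the docker case; how the merged fix handles that case is not shown in this conversation.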

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • The reproducible code above
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@cblmemo cblmemo merged commit cade827 into master Apr 19, 2024
20 checks passed
@cblmemo cblmemo deleted the fix-az-docker-runtime-env branch April 19, 2024 03:17