[Spot] Spot job fails when storage object is added #1431

Closed
romilbhardwaj opened this issue Nov 18, 2022 · 3 comments

@romilbhardwaj
Collaborator

romilbhardwaj commented Nov 18, 2022

YAML:

file_mounts:
  /covid:
    source: s3://fah-public-data-covid19-cryptic-pockets # or name: mybucket123
    mode: MOUNT

resources:
  cloud: aws

setup: |
  echo "running setup"

run: |
  conda env list
  sleep 600

`sky spot launch` fails with FAILED_CONTROLLER:

 romilb@romilbx1yoga:/mnt/d/Romil/Berkeley/Research/sky-experiments/examples$ sky spot launch minimal.yaml 
Task from YAML spec: minimal.yaml
Launching a new spot task 'sky-3bb7-romilb'. Proceed? [Y/n]: Y
I 11-18 13:04:52 execution.py:672] Uploading sources to cloud storage. See: sky storage ls
Launching managed spot job sky-3bb7-romilb from spot controller...
Launching spot controller...
I 11-18 13:04:54 optimizer.py:606] == Optimizer ==
I 11-18 13:04:54 optimizer.py:618] Target: minimizing cost
I 11-18 13:04:54 optimizer.py:629] Estimated cost: $0.4 / hour
I 11-18 13:04:54 optimizer.py:629]
I 11-18 13:04:54 optimizer.py:686] Considered resources (1 node):
I 11-18 13:04:54 optimizer.py:714] ---------------------------------------------------------------------
I 11-18 13:04:54 optimizer.py:714]  CLOUD   INSTANCE         vCPUs   ACCELERATORS   COST ($)   CHOSEN
I 11-18 13:04:54 optimizer.py:714] ---------------------------------------------------------------------
I 11-18 13:04:54 optimizer.py:714]  AWS     m6i.2xlarge      8       -              0.38          ✔
I 11-18 13:04:54 optimizer.py:714]  Azure   Standard_D8_v4   8       -              0.38
I 11-18 13:04:54 optimizer.py:714]  GCP     n1-highmem-8     8       -              0.47
I 11-18 13:04:54 optimizer.py:714] ---------------------------------------------------------------------
I 11-18 13:04:54 optimizer.py:714]
I 11-18 13:04:54 cloud_vm_ray_backend.py:2852] Creating a new cluster: "sky-spot-controller-a70f1c17" [1x AWS(m6i.2xlarge, disk_size=50)].
I 11-18 13:04:54 cloud_vm_ray_backend.py:2852] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 11-18 13:04:54 cloud_vm_ray_backend.py:988] To view detailed progress: tail -n100 -f /home/romilb/sky_logs/sky-2022-11-18-13-04-52-647826/provision.log
I 11-18 13:04:55 cloud_vm_ray_backend.py:1249] Launching on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f)
I 11-18 13:06:24 log_utils.py:45] Head node is up.
I 11-18 13:07:27 cloud_vm_ray_backend.py:1083] Successfully provisioned or found existing VM.
I 11-18 13:07:29 cloud_vm_ray_backend.py:2896] Processing file mounts.
I 11-18 13:07:29 cloud_vm_ray_backend.py:2926] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2022-11-18-13-04-52-647826/file_mounts.log
I 11-18 13:07:29 backend_utils.py:1080] Syncing (to 1 node): /tmp/spot-task-sky-3bb7-romilb-k11jn2lm -> ~/.sky/spot_tasks/sky-3bb7-romilb.yaml
I 11-18 13:07:31 cloud_vm_ray_backend.py:2122] Running setup on 1 node.
Warning: Permanently added '54.198.26.94' (ECDSA) to the list of known hosts.
I 11-18 13:07:45 cloud_vm_ray_backend.py:2131] Setup completed.
I 11-18 13:07:50 cloud_vm_ray_backend.py:2196] Job submitted with Job ID: 1
I 11-18 21:07:51 spot_utils.py:206] Waiting for the spot controller process to be RUNNING (status: PENDING).
Job 1 is already in terminal state FAILED_CONTROLLER. Logs will not be shown.
For detailed error message, please check: sky logs sky-spot-controller-98083c18 1
Shared connection to 54.198.26.94 closed.
I 11-18 13:07:56 cloud_vm_ray_backend.py:2210] Spot Job ID: 1
I 11-18 13:07:56 cloud_vm_ray_backend.py:2210] To cancel the job:               sky spot cancel 1
I 11-18 13:07:56 cloud_vm_ray_backend.py:2210] To stream the logs:              sky spot logs 1
I 11-18 13:07:56 cloud_vm_ray_backend.py:2210] To stream controller logs:       sky logs sky-spot-controller-a70f1c17 1
I 11-18 13:07:56 cloud_vm_ray_backend.py:2210] To view all spot jobs:           sky spot queue

Looking at controller logs:

(base) romilb@romilbx1yoga:/mnt/d/Softwares/cmder$ sky logs sky-spot-controller-a70f1c17 1
Tailing logs of job 1 on cluster 'sky-spot-controller-a70f1c17'...
I 11-18 21:09:45 log_lib.py:388] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['172.31.35.107']
(sky-3bb7-romilb pid=23603) 2022-11-18 21:07:52,447     INFO worker.py:1337 -- Connecting to existing Ray cluster at address: 172.31.35.107:6379...
(sky-3bb7-romilb pid=23603) 2022-11-18 21:07:52,451     INFO worker.py:1513 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
(sky-3bb7-romilb pid=23603) I 11-18 21:07:52 storage.py:592] Storage type StoreType.S3 already exists.
(sky-3bb7-romilb pid=23603) I 11-18 21:07:52 controller.py:54] Submitted spot job; SKYPILOT_JOB_ID: sky-2022-11-18-21-07-52-665303_spot_id-1
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95] Traceback (most recent call last):
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "python/ray/_raylet.pyx", line 419, in ray._raylet.prepare_args_internal
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/_private/serialization.py", line 433, in serialize
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]     return self._serialize_to_msgpack(value)
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/_private/serialization.py", line 411, in _serialize_to_msgpack
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]     pickle5_serialized_object = self._serialize_to_pickle5(
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/_private/serialization.py", line 373, in _serialize_to_pickle5
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]     raise e
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/_private/serialization.py", line 368, in _serialize_to_pickle5
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]     inband = pickle.dumps(
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]     cp.dump(obj)
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 620, in dump
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]     return Pickler.dump(self, obj)
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95] TypeError: cannot pickle '_thread.lock' object
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95] The above exception was the direct cause of the following exception:
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95] Traceback (most recent call last):
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "/home/ubuntu/.local/lib/python3.9/site-packages/sky/spot/controller.py", line 64, in start
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]     controller_task = _controller_run.remote(self)
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/remote_function.py", line 121, in _remote_proxy
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]     return self._remote(args=args, kwargs=kwargs, **self._default_options)
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 307, in _invocation_remote_span
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]     return method(self, args, kwargs, *_args, **_kwargs)
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/remote_function.py", line 393, in _remote
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]     return invocation(args, kwargs)
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/remote_function.py", line 369, in invocation
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]     object_refs = worker.core_worker.submit_task(
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "python/ray/_raylet.pyx", line 1536, in ray._raylet.CoreWorker.submit_task
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "python/ray/_raylet.pyx", line 1540, in ray._raylet.CoreWorker.submit_task
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "python/ray/_raylet.pyx", line 385, in ray._raylet.prepare_args_and_increment_put_refs
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "python/ray/_raylet.pyx", line 376, in ray._raylet.prepare_args_and_increment_put_refs
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]   File "python/ray/_raylet.pyx", line 427, in ray._raylet.prepare_args_internal
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95] TypeError: Could not serialize the argument <__main__.SpotController object at 0x7f1762b3d550> for a task or actor controller._controller_run. Check https://docs.ray.io/en/master/ray-core/objects/serialization.html#troubleshooting for more information.
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:95]
(sky-3bb7-romilb pid=23603) E 11-18 21:07:52 controller.py:96] Unexpected error occurred: TypeError: Could not serialize the argument <__main__.SpotController object at 0x7f1762b3d550> for a task or actor controller._controller_run. Check https://docs.ray.io/en/master/ray-core/objects/serialization.html#troubleshooting for more information.
(sky-3bb7-romilb pid=23603) I 11-18 21:07:53 spot_state.py:198] Job failed due to unexpected controller failure.
Shared connection to 54.198.26.94 closed.
@romilbhardwaj
Collaborator Author

For reference, sky launch minimal.yaml works and /covid is mounted correctly.

@romilbhardwaj romilbhardwaj changed the title [Spot] Public bucket in mount mode fails when run as spot job [Spot] Spot job fails when storage object is added Nov 18, 2022
@romilbhardwaj
Collaborator Author

romilbhardwaj commented Nov 18, 2022

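This appears to work fine on 9aecc7b, so it is quite likely a regression introduced by #1414. Perhaps it is related to using Ray to run the controller loop: the traceback shows Ray failing to serialize the SpotController argument passed to `_controller_run.remote(self)`, with `TypeError: cannot pickle '_thread.lock' object` as the direct cause. cc @Michaelvll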
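
For anyone hitting a similar error: the `TypeError: cannot pickle '_thread.lock' object` in the controller logs is the generic failure you get when an object that holds a `threading.Lock` (directly or through one of its attributes) is pickled. Below is a minimal sketch that reproduces that class of failure with the standard `pickle` module only, plus one common workaround (dropping the lock in `__getstate__`). The class names `ControllerLike` and `PicklableControllerLike` and the helper `try_pickle` are hypothetical and purely illustrative; this is not the SkyPilot code and not necessarily what #1432 does.

```python
import pickle
import threading


class ControllerLike:
    """Hypothetical stand-in for an object that holds a lock somewhere."""

    def __init__(self, job_id):
        self.job_id = job_id
        # threading.Lock objects cannot be pickled.
        self._lock = threading.Lock()


class PicklableControllerLike(ControllerLike):
    """Same object, but the lock is excluded from the pickled state."""

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_lock", None)  # drop the unpicklable attribute
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._lock = threading.Lock()  # recreate a fresh lock on load


def try_pickle(obj):
    """Return None if obj pickles cleanly, else the TypeError message."""
    try:
        pickle.dumps(obj)
        return None
    except TypeError as e:
        return str(e)


if __name__ == "__main__":
    print("plain object:", try_pickle(ControllerLike(1)))
    print("with __getstate__:", try_pickle(PicklableControllerLike(1)))
```

Passing only plain, picklable data (for example a job ID or a YAML path) to the remote function and constructing the heavier object inside it is another option that avoids custom pickling hooks altogether.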

@Michaelvll
Collaborator

Fixed by #1432
