[Storage] Refactor storage and fix data transfer service #1239

Merged
Michaelvll merged 4 commits into master from data-transfer
Oct 16, 2022

Conversation

Michaelvll
Collaborator

@Michaelvll Michaelvll commented Oct 12, 2022

Adopt changes for storage from #1152.

Tested:

  • tests/run_smoke_tests.sh TestStorageWithCredentials
  • The following command
gsutil mb gs://sky-imagenet-bucket-gcp
python - <<EOF
from sky.data import data_transfer
data_transfer.s3_to_gcs('sky-imagenet-bucket', 'sky-imagenet-bucket-gcp') 
EOF

Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks for fixing this @Michaelvll!

sky/data/storage.py (outdated)
sky/data/storage.py (outdated)
sky/data/data_transfer.py
Comment on lines +83 to +88
    response = storagetransfer.transferJobs().create(
        body=transfer_job).execute()
    operation = storagetransfer.transferJobs().run(jobName=response['name'],
                                                   body={
                                                       'projectId': project_id
                                                   }).execute()
Collaborator

Curious - are there any benefits other than readability to calling transferJobs().create() followed by transferJobs().run() instead of setting the schedule field to current time and only calling transferJobs().create() like we had before?

(Just to be clear, I prefer the former as we have now, but curious if there's some other reason to do it this way)

Collaborator Author

The main reason for starting the job explicitly is that run().execute() returns the name of the operation (this is different from the name of the submitted TransferJob, which is in the create() response). With that name, we don't have to list all the running transfer jobs and pick out the operation that the cloud scheduled, in L98 below. Also, since we only run the TransferJob once and in a blocking manner, I feel like having a schedule field in the specification can be a bit misleading.
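The create-then-run flow described in this thread can be sketched roughly as below. This is a minimal illustration, not the code merged in this PR: the helper name `run_transfer_job_blocking` and the `poll_interval_s` parameter are hypothetical, and `storagetransfer` is assumed to be a Storage Transfer API client such as the one returned by `googleapiclient.discovery.build('storagetransfer', 'v1')`. The point it shows is that the operation name returned by `run()` can be polled directly.

```python
import time


def run_transfer_job_blocking(storagetransfer, transfer_job, project_id,
                              poll_interval_s=10.0):
    """Create a TransferJob, start it once, and block until it finishes.

    Hypothetical helper for illustration. `storagetransfer` is assumed to be
    a Storage Transfer API client object (see the lead-in above).
    """
    # create() returns the TransferJob resource; its 'name' identifies the job.
    job = storagetransfer.transferJobs().create(body=transfer_job).execute()

    # run() returns an Operation; its 'name' identifies this particular run,
    # so there is no need to list operations and filter by job name.
    operation = storagetransfer.transferJobs().run(
        jobName=job['name'], body={'projectId': project_id}).execute()

    # Poll the operation by name until the service marks it as done.
    while not operation.get('done', False):
        time.sleep(poll_interval_s)
        operation = storagetransfer.transferOperations().get(
            name=operation['name']).execute()

    # A finished operation carries 'error' on failure.
    if 'error' in operation:
        raise RuntimeError(f"Transfer failed: {operation['error']}")
    return operation
```

Because the job is started explicitly, the job specification in this sketch needs no schedule field, which matches the reasoning given above for a job that runs once and is waited on.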

sky/data/data_transfer.py Outdated Show resolved Hide resolved
Collaborator

@romilbhardwaj romilbhardwaj left a comment

lgtm! Thanks for fixing this!

@Michaelvll Michaelvll merged commit 2d4fee3 into master Oct 16, 2022
@Michaelvll Michaelvll deleted the data-transfer branch October 16, 2022 23:40
ewzeng pushed a commit to ewzeng/skypilot that referenced this pull request Oct 24, 2022
…g#1239)

* Refactor storage and fix data transfer service

* fix UX for the data transfer

* UX fixes

* Address comments
2 participants