
Integration of IBM's Cloud Object Storage: COS #1966

Merged
142 commits merged into skypilot-org:master on Aug 1, 2023

Conversation

@Cohen-J-Omer (Contributor) commented May 18, 2023

Support for IBM's COS.

Notes:

  • As requested here, I've updated the documentation (i.e., getting-started/installation.rst) to provide instructions for setting up IBM credentials.
  • As this issue remains unresolved, sky check still returns IBM: enabled, even if the fields necessary for storage are missing from ~/.ibm/credentials.yaml.
  • COS resources are partitioned by storage instances, thus a user will have to provide one ahead of time.
  • COS URIs must include a region, so some adjustments to the repo's general code were necessary.
  • I've chosen rclone as a syncing/mounting tool. For each bucket, an Rclone profile is created under the bucket's name; if a profile with that name already exists, it is deleted first. Once a bucket is deleted, its associated profile is removed as well (see the sketch below).
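
For illustration, a minimal sketch of what creating such a per-bucket profile could look like. This is not the PR's actual Rclone helper: the profile naming, config path, and IBM COS endpoint format are assumptions here (the field names follow rclone's s3 backend).

import configparser
import os

# rclone's default config location; the PR may resolve this differently.
RCLONE_CONFIG = os.path.expanduser('~/.config/rclone/rclone.conf')

def create_cos_profile(bucket: str, access_key: str, secret_key: str,
                       region: str) -> None:
    config = configparser.ConfigParser()
    config.read(RCLONE_CONFIG)  # no-op if the file doesn't exist yet
    if config.has_section(bucket):
        # An existing profile with the same name is deleted first.
        config.remove_section(bucket)
    config[bucket] = {
        'type': 's3',
        'provider': 'IBM',
        'access_key_id': access_key,
        'secret_access_key': secret_key,
        # Assumed IBM COS regional public endpoint format.
        'endpoint': f's3.{region}.cloud-object-storage.appdomain.cloud',
    }
    os.makedirs(os.path.dirname(RCLONE_CONFIG), exist_ok=True)
    with open(RCLONE_CONFIG, 'w') as f:
        config.write(f)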

Issue

I wasn't able to make persistent: False delete the bucket in use, either after the task concluded or after the cluster was terminated. Would appreciate some guidance.
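
For context, a minimal sketch of the kind of config involved; the names are placeholders, and the store identifier shown for COS is assumed:

file_mounts:
  /mnt/data:
    name: sky-demo-bucket   # placeholder bucket name
    store: ibm              # assumed identifier for the COS store this PR adds
    persistent: false       # the bucket should be deleted once no longer needed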

Tested

  • test_ibm_storage_mounts
  • test_new_bucket_creation_and_deletion
  • test_bucket_bulk_deletion
  • test_bucket_external_deletion
  • test_upload_to_existing_bucket
  • test_list_source
  • test_private_bucket
  • test_nonexistent_bucket
  • test_invalid_names

Notes on the rest of the tests

  • test_copy_mount_existing_storage - I believe it tests S3 exclusively.
  • test_public_bucket - can't provide a reliable public COS bucket.

@landscapepainter (Collaborator) left a comment

Hey @Cohen-J-Omer, I left some comments. It should be ready after this. I'll let @romilbhardwaj sign off on this.


Follow the next steps to create/extract the above fields:

1. Create/Select a COS instance from the `web console <https://cloud.ibm.com/objectstorage/>`__.

@Cohen-J-Omer It would be sufficient to add an example in the yaml section starting with

# *** Copying files from IBM COS ***

just as it's written for S3 and GCS.
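
For illustration, such a section might look like the following; the region and bucket name are placeholders, and the exact wording would mirror the existing S3/GCS entries:

# *** Copying files from IBM COS ***
file_mounts:
  /datasets-cos: cos://us-east/my-bucket   # cos://<region>/<bucket_name>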

Comment on lines 3 to 15
from multiprocessing import pool
import concurrent.futures
import os
import subprocess
from typing import Any, Callable, Dict, List, Optional, Tuple
import urllib.parse
import re
import textwrap
import threading
from filelock import FileLock
from enum import Enum

from sky import exceptions

Suggested change
from multiprocessing import pool
import concurrent.futures
import os
import subprocess
from typing import Any, Callable, Dict, List, Optional, Tuple
import urllib.parse
import re
import textwrap
import threading
from filelock import FileLock
from enum import Enum
from sky import exceptions
import concurrent.futures
from enum import Enum
from multiprocessing import pool
import os
import subprocess
import textwrap
import threading
from typing import Any, Callable, Dict, List, Optional, Tuple
import urllib.parse
from filelock import FileLock
from sky import exceptions

@Cohen-J-Omer (Contributor, Author) commented:

Hey @landscapepainter, @romilbhardwaj,
I've merged and added support for parallel storage deletion (subprocess_utils.run_in_parallel() in cli.storage_delete()).
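
For illustration, a minimal sketch of the wiring; it assumes run_in_parallel(func, args) maps func over args concurrently and that sky.storage_delete(name) deletes a single storage object, and is not the PR's exact code:

import sky
from sky.utils import subprocess_utils

def delete_storages(names):
    # Delete all named storage objects concurrently instead of one by one.
    subprocess_utils.run_in_parallel(sky.storage_delete, names)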

@landscapepainter (Collaborator) left a comment

@Cohen-J-Omer Thanks for adding support for parallel storage deletion! Would you mind merging master into this branch and confirming that it passes test_smoke.py? There was a recent update to test_multiple_buckets_creation_and_deletion.

README.md Outdated
@@ -39,7 +39,7 @@ SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offer
SkyPilot **abstracts away cloud infra burdens**:
- Launch jobs & clusters on any cloud
- Easy scale-out: queue and run many jobs, automatically managed
- Easy access to object stores (S3, GCS, R2)
- Easy access to object stores (S3, GCS, R2, IBM COS)
@romilbhardwaj (Collaborator) commented:

In the interest of keeping the README concise, can we exclude this change for now? We can add it when users start using it :)

@Cohen-J-Omer (Contributor, Author) replied:

You're breaking my heart here @romilbhardwaj...
Will remove.
If not all supported object stores are listed, I suggest implying there are more, e.g.:

Easy access to object stores, such as: S3, GCS, R2

from sky.utils import ux_utils

Client = Any


Unintended?

@Cohen-J-Omer (Contributor, Author) commented:

Hey @landscapepainter, @romilbhardwaj,

  • I've committed the requested changes.
  • Storage tests are executed successfully, including test_multiple_buckets_creation_and_deletion.

@romilbhardwaj (Collaborator) commented Jul 23, 2023

Thanks @Cohen-J-Omer. I ran into an issue when running final tests:

I ran

(base) ➜  sky-experiments git:(storage-ibm) ✗ pytest tests/test_smoke.py::test_ibm_storage_mounts --ibm
bringing up nodes...
[ibm_storage_mounts] Test started. Log: less /var/folders/98/hhq8wrtx6y13196q61xphjsm0000gn/T/ibm_storage_mounts-5ifbcq3b.log
/Users/romilb/tools/anaconda3/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[ibm_storage_mounts] Passed.
[ibm_storage_mounts] Log: less /var/folders/98/hhq8wrtx6y13196q61xphjsm0000gn/T/ibm_storage_mounts-5ifbcq3b.log
[ibm_storage_mounts] 
.
1 passed, 1152 warnings in 283.60s (0:04:43)

The test passed, but note the leaked semaphore warning. Now whenever I run any sky command, including sky status, I get the following warning:

(base) ➜  ~ sky status
Clusters
No existing clusters.

Managed spot jobs
No in progress jobs. (See: sky spot -h)
/Users/romilb/tools/anaconda3/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

If I switch branch to master, the warning goes away. Switching back to storage-ibm brings back the warning.

This is likely stemming from the global locks created in adaptors/ibm.py. Is there some way to move those locks out of global scope or define them lazily instead of eagerly?
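
For reference, a sketch of the lazy pattern being asked about; the names are illustrative, not the actual adaptors/ibm.py code:

import multiprocessing
import threading

_process_lock = None
_init_guard = threading.Lock()

def get_process_lock():
    # Create the multiprocessing lock on first use instead of at import
    # time, so merely importing the module allocates no semaphores.
    global _process_lock
    if _process_lock is None:
        with _init_guard:
            if _process_lock is None:
                _process_lock = multiprocessing.Lock()
    return _process_lock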

For reference, I'm on an M2 MacBook with Python 3.9.13.

Also, can you merge with master?

@Cohen-J-Omer (Contributor, Author) commented:

Hey @romilbhardwaj,

  • I can't replicate the issue. Since search results suggest it might be a RAM issue, I ran the test on the lowest-spec VM available to me, in a Python 3.9.17 venv (what's available on the deadsnakes PPA), and on my personal Fedora 37 machine. Could it be an M2-specific or a general MacBook issue?
  • I've merged with master and don't see Github conflicts alert.
  • Since the processes are spawned from SkyPilot code, I can't create the locks there and pass a shared reference along, so the only other way for the processes to share a lock is to create the locks at module level. Would love to get your input on the matter.

@romilbhardwaj (Collaborator) commented:

  • I can't replicate the issue. Since search results suggest it might be a RAM issue, I ran the test on the lowest-spec VM available to me, in a Python 3.9.17 venv (what's available on the deadsnakes PPA), and on my personal Fedora 37 machine. Could it be an M2-specific or a general MacBook issue?

Hmm, I restarted my machine and have 17G of memory free. The warning still shows up only when I check out the storage-ibm branch and run any SkyPilot command (i.e., when adaptors/ibm.py is imported). It works fine on master.

If I remove L24 (threading_lock = threading.Lock()) from adaptors/ibm.py and run sky status, the error message goes away. Curiously, running an empty script with:

# test.py
import threading
threading_lock = threading.Lock()

# in shell
python test.py # No warning

does not trigger the warning.

BTW, do we need both multiprocessing_lock and threading_lock? Can we make do with just multiprocessing_lock?

@Cohen-J-Omer (Contributor, Author) commented Jul 26, 2023

Hey @romilbhardwaj,

  • Couldn't replicate the issue in any Linux/Python environment, so I couldn't debug the error you raised. From web searches and my inability to replicate it, I assume it's somehow connected to Mac machines, or specifically to M1/M2 processors.
  • I removed the threading_lock after replacing the threads in get_ibm_cos_bucket_region with processes, although it did serve a purpose (see the sketch after this list).
    Technical explanation:
    get_ibm_cos_bucket_region used concurrency to speed up storage location lookups, since each lookup needs to reinitialize boto3.client, which creates a default boto3 session that is not thread-safe.
    For I/O-bound use cases, multiprocessing is not the preferred solution (in contrast to threads) and will result in slower storage lookups on machines with fewer cores.
  • The process lock is now lazily defined and initialized, as requested.
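
For context, a sketch of the thread-safety issue the removed lock guarded against; this is illustrative rather than the PR's code, and uses plain boto3 for brevity:

import threading
import boto3

# boto3.client() lazily builds a shared default session, so calling it from
# multiple threads at once is unsafe. One fix is to serialize creation:
_client_lock = threading.Lock()

def make_client(region):
    with _client_lock:
        return boto3.client('s3', region_name=region)

# Another is to give each thread its own Session, which avoids the shared
# default session entirely:
_local = threading.local()

def make_client_per_thread(region):
    if not hasattr(_local, 'session'):
        _local.session = boto3.session.Session()
    return _local.session.client('s3', region_name=region)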

I hope we can now merge this PR.

@landscapepainter (Collaborator) left a comment

Hi @Cohen-J-Omer, I noticed that your recent merge with the master branch does not reflect the update to the test I mentioned. The test is still outdated in this branch.

There was a recent update to test_multiple_buckets_creation_and_deletion.

Please merge with the recent master branch and check if the updated test passes.

With regard to the error @romilbhardwaj encountered, I ran pytest tests/test_smoke.py::test_ibm_storage_mounts --ibm on my machine, Debian 4.19.260-1 (2022-09-29) x86_64 GNU/Linux, with Python 3.8, and did not come across the same error. It seems like an issue specific to M1/M2 machines.

@Cohen-J-Omer (Contributor, Author) commented Jul 27, 2023

Hey @landscapepainter, @romilbhardwaj:
Merged with master. All IBM-related tests passed.
Thanks for looking into the error, @landscapepainter.

@romilbhardwaj (Collaborator) left a comment

Thanks @Cohen-J-Omer - all looks good! I tested by installing from scratch in a Docker container (both pip install -e .[ibm] and pip install -e .[aws]). This should be ready to go. Thanks again for the hard work!

Comment on lines 52 to 53
from sky.adaptors import ibm
from sky.data.data_utils import Rclone

nit: alphabetical ordering for imports

@Cohen-J-Omer (Contributor, Author) replied:

Done. Hope I understood you correctly, since I had to reorder two other import lines in the process.

@romilbhardwaj merged commit 657fd84 into skypilot-org:master on Aug 1, 2023
14 of 16 checks passed