Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IBM cloud provider integration #1598

Merged
merged 77 commits into from
Apr 26, 2023
Merged

Conversation

Cohen-J-Omer
Copy link
Contributor

@Cohen-J-Omer Cohen-J-Omer commented Jan 16, 2023

PR Topic - integration of IBM as a new cloud provider.
Successfully ran the following tests:

  • test_ibm_n_node_job_queue
  • test_cancel_ibm
  • test_ibm_autodown
  • test_ibm_autostop
  • test_ibm_job_queue
  • test_ibm_file_mounts
  • test_ibm_zone
  • test_ibm_region
  • test_ibm_minimal

Notes:

  • Had to introduce new tests (that slightly differ from their original counterparts), since current tests are tailored for specific cloud providers.
  • Couldn't run tests as a batch, since some pytest config files are requiring aws credentials to run.
  • Couldn't run backwards compatibility test for reason mentioned above.

I plan to:

  • Create a script users could run to generate a credential file located in ~/.ibm, similarly to aws configure.
  • Add support for IBM's cloud object storage.

Setup:
Until the above mentioned script will be created, authenticate your IBM account by adding the following data in ~/.ibm/credentials.yaml :

iam_api_key: <String>
resource_group_id: <String>
  • IBM cloud API key can be created on the IBM Cloud console: Manage > Access (IAM) > Create an IBM Cloud API key.
    For the complete guide visit this link.
  • Resource group id can be found in the IBM Cloud console: Manage > Account > Account resources > Resource groups. For the complete guide visit this link.

@ewzeng
Copy link
Collaborator

ewzeng commented Jan 24, 2023

Thanks for your hard work @Cohen-J-Omer!

Are there any instructions for us to try out this PR (e.g. create an account on IBM cloud, save IBM credentials to a certain file, etc.)?

@Cohen-J-Omer
Copy link
Contributor Author

Thanks for your hard work @Cohen-J-Omer!

Are there any instructions for us to try out this PR (e.g. create an account on IBM cloud, save IBM credentials to a certain file, etc.)?

Hey @ewzeng,
You're right. I've added a setup segment to the PR's description.
Once the PR will get approved I'll:

  • Create a script to automate the process, including configuration options, such as base image selection.
  • Add an IBM setup segment to Readme.md as a separate PR (or include in this PR if you'd prefer).

@gilv
Copy link

gilv commented Jan 25, 2023

@ewzeng great you assigned to this PR... Do you plan to test it with IBM Cloud account? Do you have one? The account need to have permissions to setup VPc..

@ewzeng
Copy link
Collaborator

ewzeng commented Jan 25, 2023

@Cohen-J-Omer Thanks! I agree -- we can update the docs in a separate PR.

@gilv Yes I was planning to make an IBM cloud account to test it out (I don't have one yet). Is there anything special I should do to get the necessary account permissions?

Edit: it looks like the Sky Lab already has an IBM cloud account. I'll use that one once I get access; if I run into any account permission issues I'll report back.

@gilv
Copy link

gilv commented Jan 25, 2023

@ewzeng just regular account will do the trick... but if you use all kind of academic accounts or free trials, then I think VPC is blocked there

@ewzeng
Copy link
Collaborator

ewzeng commented Jan 25, 2023

@gilv Okay, got it. This is good to know. Thanks!

@ewzeng
Copy link
Collaborator

ewzeng commented Feb 5, 2023

@Cohen-J-Omer Here are some comments from trying out this pr! (I will go through the code soon)

  1. If IBM credentials are not set up, sky check returns

    /home/zeng/.ibm/credentials.yaml does not exist. Run the following command:
    a configuration script that asks for
    api key and stores it in ~/.ibm/credentials.yaml
    

    Let's replace this with actual instructions for now (and add a TODO statement in the code to make a configuration script)

  2. In sky launch, it seems that we ssh into the node (and run set up commands) as the root user. We should probably use the ubuntu user instead.

  3. After sky launch --cloud ibm -i 1 --down, it seems that the vpc and security group don't get deleted.

  4. This might be a bug in IBM's UI, but when I sky stop a two-node cluster, the UI says that the worker node is Stopped while the head node is Stopping (no matter how long I wait or refresh the page).

  5. sky launch --cloud ibm examples/huggingface_glue_imdb_app.yaml results in a sqlite error on the launched node. (Perhaps this has something to do with 2)

  6. It seems that nvidia-smi is not automatically installed on IBM nodes; since some of the smoke tests use nvidia-smi, we should consider installing it.

Overall, awesome work! I tested sky start/stop/launch/down/status/exec in combination with autodown, autostop, and multi-node, and the experience felt very smooth.

@Cohen-J-Omer
Copy link
Contributor Author

Cohen-J-Omer commented Feb 13, 2023

Hey @ewzeng,
I appreciate the thorough review.

  1. IBM's sky check have been altered to meet expectations.
  2. Changed user to ubuntu as requested.
  3. commands such as sky autostop --down / sky launch --down will now also delete the cluster's VPC.
    As discussed on slack, since a VSI can't issue a command to delete its own VPC, on such occasions when ray down is executed remotely, the cluster's head node will now:
    • Reuse/create a predefined IBM CloudFunctions namespace on a predefined region.
    • Delete and create an action with a predefined name.
    • Invoke the action with user's relevant data.
  4. Indeed a UI issue. Haven't encountered it myself.
  5. Not related to clause 2. I'm not sure it's related to my implementation. Might require further inquiry.
  6. It's an issue we are well aware of. Currently I advise users to use this tool. Since its default variables are tuned to support Skypilot it can be executed as easily as: vpc-img-inst -r <region_name>. Then the script (to be built) will update the necessary fields: image_id: <region>: <image-id> on .ibm/credentials.yaml. The above will be added to documentation and will be prompted to user on the configuration script I'll build. However, since IBM's images are region locked, users are advised to run the tool in parallel (on different shell sessions) on various regions. I do not recommend installing CUDA as part of a Ray node's setup_commands, due to extended execution time (~12 minutes).

@ewzeng
Copy link
Collaborator

ewzeng commented Feb 14, 2023

Thanks for the updates @Cohen-J-Omer! I will test this out.

@ewzeng
Copy link
Collaborator

ewzeng commented Feb 14, 2023

@Cohen-J-Omer I tried running

sky launch --cloud ibm   # launches cluster sky-f6c7-zeng
sky autostop sky-f6c7-zeng -i 2 --down

and waited for about 10 minutes, but the cluster was still up. Are you able to replicate this?

@Cohen-J-Omer
Copy link
Contributor Author

@ewzeng thanks for noticing!
Ray's cli_logger caused an unclear exception about the logging messages i fed it. Replaced it and made the implementation a bit more robust. Please re-pull.

@ewzeng
Copy link
Collaborator

ewzeng commented Feb 16, 2023

I re-pulled! But now when I run

sky launch --cloud ibm --down

the instance still stays up (waited for ~10 minutes).

Regarding your previous comments:

  1. It might be a good idea to tell users where to get iam_api_key and resource_group_id since new IBM users might not know. (This could be as simple as linking some IBM documentation). What do you think? Otherwise I think sky check looks great!
  2. Great!
  3. See above.
  4. Got it.
  5. Okay, I will investigate further.
  6. Hmmm, yeah 12 minute set up time sounds horrible; IBM-VPC sounds like the way to go. Is the configuration script that you are referring to the setup script you plan to write (to set up ibm credentials, etc.)

@Cohen-J-Omer
Copy link
Contributor Author

(1) I could add it until a script is made, since it will locate the resource group id and will suggest applying optional fields.
(3) It works on my end. I'd like to get more info on slack.
(6) yes, the same one. it will be an interactive tool that will ask for an IAM-API-key and according to user's choices will generate a credential file.

Copy link
Collaborator

@ewzeng ewzeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some comments! I'll take a break, and then I will review node_provider and vpc_provider

sky/adaptors/ibm.py Outdated Show resolved Hide resolved
sky/adaptors/ibm.py Outdated Show resolved Hide resolved
sky/adaptors/ibm.py Outdated Show resolved Hide resolved
sky/authentication.py Outdated Show resolved Hide resolved
sky/authentication.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/ibm_catalog.py Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
examples/job_queue/job_ibm.yaml Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
@ewzeng ewzeng self-requested a review February 22, 2023 04:12
Copy link
Collaborator

@ewzeng ewzeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some more comments on node_provider and vpc_provider

sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Outdated Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Outdated Show resolved Hide resolved
sky/skylet/providers/ibm/vpc_provider.py Outdated Show resolved Hide resolved
sky/skylet/providers/ibm/vpc_provider.py Outdated Show resolved Hide resolved
sky/skylet/providers/ibm/vpc_provider.py Outdated Show resolved Hide resolved
@ewzeng ewzeng self-requested a review February 25, 2023 15:17
Copy link
Collaborator

@ewzeng ewzeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment on hybrid ips

sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
@ewzeng ewzeng self-requested a review February 28, 2023 20:06
Copy link
Collaborator

@ewzeng ewzeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did another pass through node_provider and added a few more comments!

sky/skylet/providers/ibm/node_provider.py Outdated Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Outdated Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Outdated Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Outdated Show resolved Hide resolved
sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
@ewzeng ewzeng self-requested a review March 1, 2023 15:57
sky/skylet/providers/ibm/vpc_provider.py Show resolved Hide resolved
sky/skylet/providers/ibm/vpc_provider.py Outdated Show resolved Hide resolved
sky/skylet/providers/ibm/vpc_provider.py Outdated Show resolved Hide resolved
sky/skylet/providers/ibm/vpc_provider.py Show resolved Hide resolved
@Cohen-J-Omer
Copy link
Contributor Author

Hey @ewzeng,
I've rebased, please review the changes.

Copy link
Collaborator

@ewzeng ewzeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your changes look good. I added one comment on race conditions.

sky/skylet/providers/ibm/node_provider.py Show resolved Hide resolved
@Michaelvll Michaelvll self-requested a review April 8, 2023 08:01
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Cohen-J-Omer for the incredible work, and @ewzeng for reviewing this PR! It looks pretty nice to me and should be almost ready to go in. Left several comments, most of which are nits. Once the master branch is merged and the tests are passed, the PR should be ready to go in.
Thanks again for the awesome work! Looking forward to seeing IBM cloud supported by SkyPilot!

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
examples/using_file_mounts_ibm.yaml Outdated Show resolved Hide resolved
_regions: List[clouds.Region] = []

@classmethod
def regions(cls):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer needed, as all the requests will go to the catalog, instead of this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

o.k, good to know. Then i believe you should get rid of this method , right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think the right way is to be opposite. We should get rid of the method this comment quoted. Instead, using the method you linked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it eventually tries to call a regions method on the CLOUD_PROIVDER_catalog.py module, due to this function?

Copy link
Collaborator

@Michaelvll Michaelvll Apr 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function does not? It only generates the regions you hardcoded here. I am saying this def regions(cls) is not called anywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood the part where the regions() class method on clouds/ibm.py have become obsolete and should be removed, along with the class method _get_default_region() i've created.
Unrelated, I think this method should also be removed, since _map_clouds_catalog(clouds, 'regions') tries to invoke a function that i haven't seen implemented on other providers' catalogs, e.g. there's no regions() method on aws_catalog.py.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, it is used in the following place, but we can definitely make it more general in a different PR.
https://github.com/concretevitamin/sky-experiments/blob/508be85e4a2a0df1fa98f7fac63980ff6e24da3e/sky/backends/cloud_vm_ray_backend.py#L840

sky/clouds/ibm.py Outdated Show resolved Hide resolved
sky/clouds/ibm.py Outdated Show resolved Hide resolved

@classmethod
def _get_default_region(cls) -> clouds.Region:
region_name = get_cred_file_field('region', 'us_south')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function should be able to be removed, as it will never be reached, due to the comment for if region is None above.

sky/templates/ibm-ray.yml.j2 Show resolved Hide resolved
@Cohen-J-Omer
Copy link
Contributor Author

Hey @Michaelvll,
I've updated the modules as requested.
The following tests were executed successfully:

  • test_ibm_region

  • test_ibm_zone

  • test_ibm_cancel_task

  • test_ibm_job_queue

  • test_ibm_job_queue_multinode

  • test_file_mounts

  • test_minimal

  • test_autodown

  • The stop functionality remains to be problematic on skypilot, e.g. test_autostop sporadically succeeds. A worker node is indefinitely stuck on uninitialized status.

  • Would add an image_id test, as the functionality is implemented, but I cannot publish a VPC image to the IBM public VPC image catalog, which I could then rely on its constant ID (for testing purposes).

If the PR is in a satisfactory condition I will rebase it (or merge if the process will prove too difficult due to numerous conflicts).

@Cohen-J-Omer
Copy link
Contributor Author

Hey @Michaelvll,
I've solved conflicts via a rebase. Please review.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the excellent effort to support IBM cloud in SkyPilot @Cohen-J-Omer! The PR now looks good to me. Once we fix the CI errors, we are good to go!

Tested:

  • pytest tests/test_smoke.py --ibm

TODOs for future PRs (please file issues for these):

  • Add instructions for setting up the IBM credentials in the doc.
  • Add catalog fetcher for IBM

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit: when terminating with sky down or checking the status with sky status -r, there will be a bunch of logger info shows up corrupting the progress bar, can it be changed to DEBUG or somehow work well with the progress bar?

Another issue:
When I am trying to sky down the VM, it shows the following error, and it went away when I tried again. Can we add a retry for this error?

  File "/home/gcpuser/sky-ibm/sky/backends/cloud_vm_ray_backend.py", line 3110, in teardown_no_lock
    vpc_provider.delete_vpc(vpc_id, region)
  File "/home/gcpuser/sky-ibm/sky/skylet/providers/ibm/vpc_provider.py", line 283, in delete_vpc
    self.delete_subnets(tmp_vpc_client, vpc_data, region)
  File "/home/gcpuser/sky-ibm/sky/skylet/providers/ibm/vpc_provider.py", line 352, in delete_subnets
    vpc_client.delete_subnet(subnet_id).get_result()
  File "/opt/conda/envs/sky-ibm/lib/python3.10/site-packages/ibm_vpc/vpc_v1.py", line 2081, in delete_subnet
    response = self.send(request, **kwargs)
  File "/opt/conda/envs/sky-ibm/lib/python3.10/site-packages/ibm_cloud_sdk_core/base_service.py", line 313, in send
    response = self.http_client.request(**request, cookies=self.jar, **kwargs)
  File "/opt/conda/envs/sky-ibm/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/conda/envs/sky-ibm/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/opt/conda/envs/sky-ibm/lib/python3.10/site-packages/requests/adapters.py", line 578, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='us-east.iaas.cloud.ibm.com', port=443): Read timed out. (read timeout=60)

@Cohen-J-Omer
Copy link
Contributor Author

Hey @Michaelvll,
Thanks for the thorough review.

  • Messages will no longer be displayed when deleting an IBM cluster.
  • Regarding the timeout error: I cannot add a retry for it, as:
    • It rarely happens (yet to see it personally).
    • it's an error that implies that either IBMs servers were down, or the user experienced a connectivity issue.
    • it's caused by an internal REST API request shared between all methods, invoked by any IBM client used in this project, thus i'll have to wrap quite a lot.
  • With regard to the github issues:
    • No problem, i'll open the issues.
    • sky check will describe to users how and where they should create the credentials file, if any issue arises (e.g. credentials file missing, or some of the mandatory fields are missing). On which documentation file would you like me to add said instructions?

@Michaelvll
Copy link
Collaborator

Michaelvll commented Apr 26, 2023 via email

@Michaelvll Michaelvll merged commit 9c0f174 into skypilot-org:master Apr 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants