Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Resources] Add cpus in resource specification #1622

Merged
merged 54 commits into from
Jan 28, 2023
Merged

[Resources] Add cpus in resource specification #1622

merged 54 commits into from
Jan 28, 2023

Conversation

WoosukKwon
Copy link
Collaborator

@WoosukKwon WoosukKwon commented Jan 24, 2023

Closes #387

This PR introduces the ability for users to specify resource requirements in terms of the number of vCPUs in the Resources class and the resources section in the SkyPilot yaml.

Semantics

  1. If the instance type and accelerators are not specified, SkyPilot will use the cpus parameter to select an appropriate VM size from the cloud's default VM family (i.e., m6i for AWS, Dv5 for Azure, and n2-standard for GCP).

For example,

  • sky launch --cpus 8 or Resources(cpus=8): SkyPilot will find the VMs that have exactly 8 vCPUs. Namely,

    • AWS: m6i.2xlarge
    • Azure: Standard_D8_v5
    • GCP: n2-standard-8
  • sky launch --cpus 12+ or Resources(cpus='12+'): SkyPilot will consider the VMs that have 12 or more vCPUs. Namely,

    • AWS: m6i.4xlarge
    • Azure: Standard_D16_v5
    • GCP: n2-standard-16

(Note: The selected three instance types have 16 vCPUs instead of 12 vCPUs, because the default instance families of the clouds do not include any VM with 12-15 vCPUs.)

  1. If the accelerators are specified, the instance family can be inferred for each cloud, and SkyPilot will use the cpus parameter to choose the VM of the right size in the instance family. For example,
  • sky launch --gpus T4 --cpus 8+ or Resources(accelerators='T4', cpus='8+'): SkyPilot will choose VMs that have 1 T4 GPU along with 8 or more vCPUs. Namely,
    • AWS: g4dn.2xlarge'
    • Azure: Standard_NC8as_T4_v3
    • GCP: n1-highmem-8 (+ T4)

The same semantics applies to cpunode, gpunode, tpunode, and spot launch.

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@WoosukKwon WoosukKwon changed the title [Resources] Add cpu in resources [Resources] Add cpu in resource specification Jan 24, 2023
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome @WoosukKwon! Had a quick scan and the approach makes sense to me.

Quick question: are we deferring adding CPU as a scheduling constraint?

E.g., what happens if we launch a node with 8 CPUs, then keep sky launch -c existing --cpu 4 for many times? Or, for --cpu 16 do we correctly throw an error?

(I think it's okay to defer, just checking.)

sky/clouds/service_catalog/aws_catalog.py Outdated Show resolved Hide resolved
@WoosukKwon
Copy link
Collaborator Author

@concretevitamin Thanks for the quick feedback. Yes, I'd like to defer it.

@dongreenberg
Copy link
Contributor

dongreenberg commented Jan 24, 2023

Love it!! Two questions:

  1. I'm assuming this will work with mixing and matching inside the Resources() python API? e.g. I could do
    Resources(accelerators='A100:1', cpus='8+')?
  2. Do all of the CPU-only boxes launch with the same OS and architecture?

@WoosukKwon
Copy link
Collaborator Author

Hey @dongreenberg, thanks for your questions!

  1. I'm assuming this will work with mixing and matching inside the Resources() python API? e.g. I could do
    Resources(accelerators='A100:1', cpus='8+')?

Yes. It will work on the Python API in the same way.

  1. Do all of the CPU-only boxes launch with the same OS and architecture?

They all have recent Intel x86-64 CPUs (i.e., either Ice Lake or Cascade Lake CPUs), and they all have 4 GB RAM per 1 vCPU. More specifically,

  • AWS m6i series have Intel Ice Lake 8375C CPUs
  • Azure Standard Dv5 series have Ice Lake 8370C CPUs
  • GCP N2 series have Intel Ice Lake 8373C or Cascade Lake 6268CL CPUs

The OS might be different across the clouds and the VM images you select.

@dongreenberg
Copy link
Contributor

Amazing, that's perfect. You guys are the best!

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice @WoosukKwon. Did a pass and left some comments.

  1. Tests: consider spelling out the manual tests. Also, should we add some dryrun unit tests to make sure sky.optimize() works for things like sky.Resources(..., cpu=...)?

  2. Should we warn + ignore --cpu in sky launch -c existing --cpu x?

sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/optimizer.py Show resolved Hide resolved
sky/resources.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/azure_catalog.py Outdated Show resolved Hide resolved
sky/execution.py Outdated Show resolved Hide resolved
sky/execution.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/gcp_catalog.py Show resolved Hide resolved
sky/clouds/service_catalog/gcp_catalog.py Outdated Show resolved Hide resolved
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @WoosukKwon. Some final comments & reminder to add some tests to maybe test_optimizer_dryruns.py.

# 2. Change the __setstate__ method to handle the new fields.
# 3. Modify the to_config method to handle the new fields.
# If any fields changed, increment the version. For backward compatibility,
# modify the __setstate__ method to handle the old version.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Was thinking this comment is a guide for "all the places we need to change when we add a new field". For just back compat, it seems __setstate__ is the only required change. cc @Michaelvll to double check

return (_make([AWS.get_default_instance_type()]),
fuzzy_candidate_list)
# Return a default instance type with the given number of vCPUs.
default_instance_type = AWS.get_default_instance_type(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind clarifying what you meant?

Currently, the Cloud interface includes a shallow method, which does not show the default type:

    @classmethod
    def get_default_instance_type(cls,
                                  cpu: Optional[str] = None) -> Optional[str]:
        return service_catalog.get_default_instance_type(cpu=cpu, clouds='aws')

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the awesome contribution @WoosukKwon. A major update to our Resources in a long time. LGTM.

PS: consider some basic smoke tests (e.g., minimal) before merging?

tests/test_optimizer_dryruns.py Outdated Show resolved Hide resolved
@WoosukKwon
Copy link
Collaborator Author

@concretevitamin This PR has passed all the smoke tests.

@concretevitamin
Copy link
Member

Missed this before: shall we add the new field to https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html?

@WoosukKwon
Copy link
Collaborator Author

Good catch. Added.

@WoosukKwon WoosukKwon merged commit 76eed62 into master Jan 28, 2023
@WoosukKwon WoosukKwon deleted the woosuk-cpu branch January 28, 2023 23:54
@dongreenberg
Copy link
Contributor

Woo!!! Congrats and thanks gang!! Can't wait to test it out!

sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Feb 22, 2023
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Mar 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Resource Specification for # of CPUs
3 participants