Skip to content

Add Runpod Data Fetcher for Community and Secure Pods #5929

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

velaraptor-runpod
Copy link

Add data fetcher for runpod for Community and Secure Pods

Tested (run the relevant ones):

  • [ X ] Code formatting: install pre-commit (auto-check on commit) or bash format.sh

if gpu.get('secureCloud'):
cloud_type = 'SECURE'
price = gpu['securePrice']
elif gpu.get('communityCloud'):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand entries can have secureCloud and communityCloud as both true if the GPU is available in both. This appears to skip those entries for communityCloud, unless I misunderstand?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct and good find, fixing that!

@adocherty
Copy link

Also, there seems to be another PR that adds the same feature:
#5930

@velaraptor-runpod
Copy link
Author

Also, there seems to be another PR that adds the same feature: #5930

Just saw this PR, only difference with this is that it adds the Community Pods.

@adocherty
Copy link

Hi, yes I understand that this PR implements community cloud, which is why I'm on board! The question was more around if there's two PRs both targeting the same feature how is it decided which one goes in?

To help this one get in, I've tested it and have some comments and a comparison to the other issue:

Also, I'm happy to help implement any features and propose some fixes.

Testing

  1. Generate new catalog

    python sky/catalog/data_fetchers/fetch_runpod.py -o ~/.sky/catalogs/v7/runpod/vms.csv
  2. Give a spec with community cloud

    name: minimal
    
    resources:
      cloud: runpod
      accelerators: RTX3090:1
      instance_type: 1x_RTX3090_COMMUNITY
    
    setup: |
      echo "running setup"
    
    run: |
      conda env list
  3. Run test on community cloud

    Considered resources (1 node):
    ----------------------------------------------------------------------------------------
     INFRA         INSTANCE               vCPUs   Mem(GB)   GPUS        COST ($)   CHOSEN   
    ----------------------------------------------------------------------------------------
     RunPod (CA)   1x_RTX3090_COMMUNITY   2       24        RTX3090:1   0.22          ✔     
    ----------------------------------------------------------------------------------------
    Launching a new cluster 'test'. Proceed? [Y/n]: 
    AssertionError: (Region(name='CA'), None)
  4. Run test on secure cloud

    YAML to run: minimal.yaml
    Considered resources (1 node):
    -------------------------------------------------------------------------------------
     INFRA         INSTANCE            vCPUs   Mem(GB)   GPUS        COST ($)   CHOSEN   
    -------------------------------------------------------------------------------------
     RunPod (CA)   1x_RTX4090_SECURE   2       24        RTX4090:1   0.69          ✔     
    -------------------------------------------------------------------------------------
    Launching a new cluster 'test'. Proceed? [Y/n]: 
    AssertionError: (Region(name='CA'), None)

Bugs

  1. As seen above, the current catalog fails with an assertion error.

    This seems to be due to using the CSV format without availability zones specified. The assertion is triggered on line 171 in runpod.py (see below).

    The failure is due to the assert that the zone is not None; however, I don’t understand why it doesn’t occur with the previous VMS.csv file, which also didn’t have availability zones — any ideas?.

    With the following code change we can avoid the error, and the tests work.

     diff --git a/sky/clouds/runpod.py b/sky/clouds/runpod.py
     index a2f81ee7..266b6969 100644
     --- a/sky/clouds/runpod.py
     +++ b/sky/clouds/runpod.py
     @@ -168,9 +168,13 @@ class RunPod(clouds.Cloud):
                  num_nodes: int,
                  dryrun: bool = False) -> Dict[str, Optional[Union[str, bool]]]:
              del dryrun, cluster_name  # unused
     -        assert zones is not None, (region, zones)
     +        #assert zones is not None, (region, zones)
      
     -        zone_names = [zone.name for zone in zones]
     +        # If zones is None, return an empty list for zone names.
     +        if zones is None:
     +            zone_names = []
     +        else:
     +            zone_names = [zone.name for zone in zones]
      
              resources = resources.assert_launchable()
              acc_dict = self.get_accelerators_from_instance_type(
  2. When generating the list of GPU numbers a gpu_count of 1 is always available if max_gpus=1 and 1,2,4,8 otherwise. This will fail if max_gpus<8 and not include other counts if max_count>8. PR [Core] Support setting sudo password when deploying remote clusters #5030 has a better implementation of this, see comment below.

  3. RunPod catalog file differences

    • The current RunPod output file columns
InstanceType,AcceleratorName,AcceleratorCount,vCPUs,MemoryGiB,GpuInfo,Region,SpotPrice,Price
  • This format output columns (extra OnDemandPrice and Name) — Unless there’s a good reason to include them I would remove them so the file columns are the same as the original CSV file.
InstanceType,AcceleratorName,AcceleratorCount,vCPUs,MemoryGiB,GpuInfo,Region,SpotPrice,OnDemandPrice,Price,Name

OnDemandPrice not found anywhere else

grep -rni "OnDemandPrice" *
sky/catalog/data_fetchers/fetch_runpod.py:284:            'OnDemandPrice',

Proposed improvements

In the current implementation current code that runs on Secure Cloud will switch to Community Cloud without any config changes. This may be a problem for existing users.

This is due to skypilot choosing the cheapest option of all matching config types, so if we specify resources available on both clouds for example RTX4090 without a specific instance_type then the community cloud will be lower price and will be chosen:

resources:
  cloud: runpod
  accelerators: RTX4090:1

I think it would be useful to implement a new option that specifies the cloud type (secure or community cloud)

cloud=[runpod, runpod:community, runpod:secure]

with cloud=runpod defaulting to secure cloud for backward compatibility

This would improve the usability a low as currently we need to specify the full instance_type in the resources to switch between clouds (including the GPU count).

Differences to #5030

There are a few differences between #5930 and this PR #5929 that I can see as important:

  1. PR 5930 includes a separate PR to run a cron job that updates the catalog in the repo https://github.com/skypilot-org/skypilot-catalog

    feat(runpod): Add RunPod catalog update cron skypilot-catalog#133

    this should be valid for the current file, but it does assume the output file is saved in runpod/vms.csv by default

            python -m sky.catalog.data_fetchers.fetch_runpod
            if [ ! -f "runpod/vms.csv" ] || [ ! -s "runpod/vms.csv" ]; then
              echo "Error: RunPod catalog file is missing or empty"
              exit 1
            fi
    
  2. PR 5930 gets the memory and vCPU lower bounds from the query in lowest price for each GPU count

            lowestPrice(input: {{gpuCount: {gpu_count}}}) {{
              minimumBidPrice
              uninterruptablePrice
              minVcpu
              minMemory
              stockStatus
              compliance
              maxUnreservedGpuCount
              availableGpuCounts
            }}

    and uses this in the spec, defaulting to a hard-coded VCPU and memory (I don’t know which approach is better here)

        # Use minVcpu and minMemory from lowestPrice if available,
        # otherwise use defaults
        vcpus = gpu_type.get('lowestPrice', {}).get('minVcpu')
        memory = gpu_type.get('lowestPrice', {}).get('minMemory')
    
        # Fall back to defaults if values are None
        # scale default value by gpu_count
        vcpus = DEFAULT_VCPUS * gpu_count if vcpus is None else vcpus
        memory = MEMORY_PER_GPU * gpu_count if memory is None else memory
  3. PR 5030 generates a list of GPU numbers that are powers of two and less than max_count, rather than the current PR which hard codes to 1 if max_count=1 and 1,2,4,8 otherwise. This will fail if max_count<8 or max_count>8

    def get_gpu_counts(max_count: int) -> List[int]:
        """Generate list of GPU counts based on powers of 2 and max count.
        RunPod supports any number up to max_count. We only generate a list of
        powers of two & the max count.
        For example, if the max count is 8, create a row for each of [1, 2, 4, 8].
        For max count 7, create a row for each of [1, 2, 4, 7].
        """
        counts = []
        power = 1
        while power <= max_count:
            counts.append(power)
            power *= 2
    
        # Add max count if it's not already included (not a power of 2)
        if max_count not in counts:
            counts.append(max_count)
    
        return sorted(counts)

    Proposal: move to use a similar adaptive list of counts

  4. PR 5030 regions and availability zones (the new Skypilot format). This is problematic for community cloud as RunPod API doesn’t seem to work when specifying community cloud instance and the availability zone, at least in my testing.

    # Mapping of regions to their availability zones
    REGION_ZONES = {
        'CA': ['CA-MTL-1', 'CA-MTL-2', 'CA-MTL-3'],

    Proposal: stick with no zones

@velaraptor-runpod
Copy link
Author

Not going to have time to work on all the improvements in the meantime. I can add the fixes from the difference and improvements to the other branch.

@adocherty
Copy link

I'm motivated and interested in getting RunPod Community Cloud support into SkyPilot so I'm happy to help by putting fixes into this PR or porting features from this PR into #5930.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants