feat(catalog): Add RunPod data fetcher #5930

Open
wants to merge 2 commits into
base: master

@pirtleshell pirtleshell commented Jun 9, 2025

Adds a new data fetcher script to automatically generate RunPod instance catalog data. The script queries the RunPod API to fetch GPU types, pricing, and availability information.

  • Verified all previously existing Accelerator Names are in the newly generated CSV.

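  • Additionally, includes support for previously missing GPUs: L40S (manually added in skypilot-org/skypilot-catalog#131), RTX 2000 Ada (RTX2000-Ada), RTX 5090 (RTX5090), and RTX PRO 6000 (RTXPRO6000)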

  • Includes all available quantities of GPUs (previously did not include all permutations)

The script requires a RunPod API key with read access and generates a CSV file compatible with SkyPilot's catalog system.

For each GPU, a CSV catalog entry is created for every region (from a hardcoded list of regions) and every available quantity. For example, the API returns availableGpuCounts like [1,2,3,4] to indicate that you can request an instance with up to 4 GPUs. Thus, the GPU is listed in the catalog 100 times: GPU quantity options (4) * number of regions (25).
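
As a rough illustration of that expansion (not the script's actual code), here is a minimal sketch. The `REGIONS` placeholder list, the `expand_catalog_rows` helper, and the row keys are assumptions made up for this example; only the `Nx_GPU_SECURE` naming pattern and the 4 * 25 arithmetic come from the description above.

```python
from itertools import product
from typing import Dict, List

# Hypothetical stand-in for the script's hardcoded list of 25 regions.
REGIONS: List[str] = [f'REGION-{i}' for i in range(1, 26)]


def expand_catalog_rows(gpu_name: str, available_counts: List[int],
                        regions: List[str]) -> List[Dict[str, object]]:
    """Return one catalog row per (GPU count, region) combination."""
    return [
        {
            'InstanceType': f'{count}x_{gpu_name}_SECURE',
            'AcceleratorName': gpu_name,
            'AcceleratorCount': count,
            'Region': region,
        }
        for count, region in product(available_counts, regions)
    ]


rows = expand_catalog_rows('L40', [1, 2, 3, 4], REGIONS)
print(len(rows))  # 4 quantity options * 25 regions = 100
```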

Verification

For testing, I locally generated the catalog CSV:

$ RUNPOD_API_KEY=read-only-api-key python sky/catalog/data_fetchers/fetch_runpod.py --output-dir temp
RunPod Service Catalog saved to temp/vms.csv

Then I compared the accelerator names in the generated CSV to the ones in the existing v7 catalog CSV:

$ diff <(awk -F, '{print $2}' ../skypilot-catalog/catalogs/v7/runpod/vms.csv | sort | uniq) <(awk -F, '{print $2}' temp/vms.csv | sort | uniq)
13a14
> RTX2000-Ada
16a18
> RTX5090
21a24
> RTXPRO6000

Thus, all originally available accelerators are available (with an unchanged name) plus three more (RTX2000-Ada, RTX5090, RTXPRO6000).
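
For anyone who prefers to run the same check without awk, a minimal Python sketch of the comparison could look like the following. The helper names `accelerator_names` and `diff_accelerators` are hypothetical, and the inline rows are toy data rather than real catalog contents; the only thing taken from the command above is that the accelerator name is the second column.

```python
import csv
from typing import Iterable, List, Set, Tuple


def accelerator_names(lines: Iterable[str], column_index: int = 1) -> Set[str]:
    """Collect distinct values of one CSV column (the second by default, like awk's $2)."""
    return {
        row[column_index]
        for row in csv.reader(lines)
        if len(row) > column_index
    }


def diff_accelerators(old_lines: Iterable[str],
                      new_lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (names only in the old catalog, names only in the new catalog)."""
    old_names = accelerator_names(old_lines)
    new_names = accelerator_names(new_lines)
    return sorted(old_names - new_names), sorted(new_names - old_names)


# Toy data for illustration only; pass open file objects for real catalogs.
old_catalog = ['InstanceType,AcceleratorName', '1x_L40_SECURE,L40']
new_catalog = old_catalog + ['1x_RTX5090_SECURE,RTX5090']
removed, added = diff_accelerators(old_catalog, new_catalog)
print(removed, added)
```

An empty first list means no previously existing accelerator name was dropped, which is the property verified above.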

For a final gut check, I manually compared lines for some specific quantity-GPU-region tuples.

Besides expected fluctuation of prices, there are some differences in the vCPU and MemoryGiB columns. It's unclear to me why they are different. To the best of my knowledge, the values coming from the API and used by this script are correct.

Example difference:

# 4x NVIDIA L40 in US-TX-3
before: 4x_L40_SECURE,L40,4.0,64.0,192.0,L40,US,2.76,4.56,US-TX-3
 after: 4x_L40_SECURE,L40,4.0,32.0,376.0,L40,US,3.96,2.0,US-TX-3

Previously, L40 was listed as having 64 vCPUs and 192 GiB RAM. I confirmed in the API & UI of RunPod that the vCPUs and memory for an instance with 4x L40s match the newly generated values:

# from runpod's deploy UI
4x L40 (192 GB VRAM)
376 GB RAM • 32 vCPU

Similar differences can be seen in other GPUs, like the A100.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

Adds a new data fetcher script to automatically generate RunPod instance catalog data.
The script queries the RunPod API to fetch GPU types, pricing, and availability information.

- Verified all previously existing Accelerator Names are in the newly generated CSV.

- Added catalog entries for previously missing GPUs:
  * L40S (manually added in skypilot-org/skypilot-catalog#131)
  * RTX 2000 Ada (RTX2000-Ada)
  * RTX 5090 (RTX5090)
  * RTX PRO 6000 (RTXPRO6000)

- Includes all available quantities of GPUs (previously did not include all permutations)

The script requires a RunPod API key with read access and generates a CSV file
compatible with SkyPilot's catalog system.
@pirtleshell
Author

I've uploaded a copy of the full CSV generated by this script here: https://gist.github.com/pirtleshell/41079b4c9752a16a60c3dbc45164e7c6

pirtleshell added a commit to pirtleshell/skypilot-catalog that referenced this pull request Jun 9, 2025
Adds a workflow based on the one for AWS that automates the updating of
RunPod GPU pricing and availability

Depends on skypilot-org/skypilot#5930

# Mapping of regions to their availability zones
REGION_ZONES = {
'CA': ['CA-MTL-1', 'CA-MTL-2', 'CA-MTL-3'],

Is there a reason to use SkyPilot to cycle through all availability zones rather than not specifying it in the RunPod API and letting RunPod assign the zone automatically?

To add community cloud support it seems cycling through the zones doesn't work whereas letting RunPod choose them does:
#3441 (comment)

Also, there seems to be another PR with a different version of the data fetcher that does include community cloud support:
#5929

Author

hello @adocherty 👋

i'm not sure what serendipity of the universe caused these two PRs for the same feature to appear at once! haha, but it definitely shows it's time for automated availability & price updates for RunPod instances 😄

no hard feelings if #5929 is merged instead of this. i'll state my original intention: generate the existing catalog CSV as closely as possible. i believe it's easier to alter or add functionality after the existing functionality is automated. i don't have a complete and deep understanding of the workings of skypilot so i did my best to create the existing CSV with minimal new functionality (no community clouds, no changes to how regions are managed, etc). if we want to take on those changes with this script, i'm happy to assist.

as for region management, it was again just a focused attempt at creating a near-identical CSV to the one that currently exists without diving into how the CSV is used by SkyPilot to discover instances.

Hi @pirtleshell

It is quite serendipitous! I had actually drafted some code to do the same thing, then came here to find not one but two PRs with the same functionality!

It sounds like a sensible strategy to get existing functionality in place first, then add features. I have an interest in getting community cloud pods enabled, but have no horses in the race between this and #5929. They are both a good step forward.

I'm also new to SkyPilot, and my questions were all about understanding more about how things work under the hood. That being said, I'm happy to help you get this PR over the line if you need a hand, although I don't have any magic reviewing powers!

If this gets in I'll be happy to put up a PR to enable the community cloud. And your help (and anyone who knows what they're doing) would be appreciated.

Comment on lines +209 to +211
# only add secure clouds. skipping community cloud instances.
if not detailed_gpu['secureCloud']:
    continue
Author

@pirtleshell pirtleshell Jun 9, 2025

i attempted to limit the changeset in this PR. to the best of my knowledge, this code could be easily updated to accommodate the request in #3441 for community instance types.

it requires checking detailed_gpu['communityCloud'] bool and setting InstanceType below to have a _COMMUNITY suffix where _SECURE is currently hardcoded.
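
A minimal sketch of what that change might look like, assuming the `secureCloud` and `communityCloud` booleans behave as described above. The `instance_types_for` helper is hypothetical and is not part of this PR; it only illustrates choosing the suffix per cloud type.

```python
from typing import Any, Dict, List


def instance_types_for(detailed_gpu: Dict[str, Any], gpu_name: str,
                       gpu_count: int) -> List[str]:
    """Build InstanceType names for each cloud type the GPU is offered on."""
    names = []
    if detailed_gpu.get('secureCloud'):
        names.append(f'{gpu_count}x_{gpu_name}_SECURE')
    # Assumption: communityCloud is a bool alongside secureCloud in the API response.
    if detailed_gpu.get('communityCloud'):
        names.append(f'{gpu_count}x_{gpu_name}_COMMUNITY')
    return names


print(instance_types_for({'secureCloud': True, 'communityCloud': True}, 'L40', 4))
```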

Collaborator

@andylizf andylizf left a comment

Left some nits here.

    return round(price, 2)


def get_gpu_counts(max_count: int) -> List[int]:
Collaborator

is this function really used?

    if 'errors' in result:
        raise ValueError(f'GraphQL errors: {result["errors"]}')

    return result['data']['gpuTypes'][0]
Collaborator

should have some assertions here?
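
One possible shape for such checks, as a sketch only: the `extract_single_gpu_type` helper name is hypothetical, and the assumption that the query should return exactly one GPU type is mine, not something confirmed in this PR.

```python
from typing import Any, Dict


def extract_single_gpu_type(result: Dict[str, Any]) -> Dict[str, Any]:
    """Validate the GraphQL response shape before indexing into it."""
    if 'errors' in result:
        raise ValueError(f'GraphQL errors: {result["errors"]}')

    gpu_types = (result.get('data') or {}).get('gpuTypes')
    # Assumption: a query for one GPU id should yield exactly one entry.
    if not isinstance(gpu_types, list) or len(gpu_types) != 1:
        raise ValueError(f'Expected exactly one GPU type, got: {gpu_types!r}')

    return gpu_types[0]


print(extract_single_gpu_type({'data': {'gpuTypes': [{'id': 'NVIDIA L40'}]}}))
```

Raising `ValueError` instead of using bare `assert` keeps the check active even when Python runs with `-O`.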

    return sorted(counts)


def format_gpu_name(gpu_type: Dict) -> str:
Collaborator

Suggested change
- def format_gpu_name(gpu_type: Dict) -> str:
+ def format_gpu_name(gpu_type: Dict[str, Any]) -> str:


    # Fall back to defaults if values are None
    # scale default value by gpu_count
    vcpus = DEFAULT_VCPUS * gpu_count if vcpus is None else vcpus
Collaborator

let's have more assertions for vcpus here? e.g. that it is a positive integer
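
A sketch of what that validation could look like. The `resolve_vcpus` helper and the `DEFAULT_VCPUS = 8` value are placeholders chosen for this example; the script's real default is not shown in this thread.

```python
from typing import Optional, Union

DEFAULT_VCPUS = 8  # hypothetical per-GPU default, for illustration only


def resolve_vcpus(vcpus: Optional[Union[int, float]], gpu_count: int) -> float:
    """Fall back to a scaled default, then check the result is a positive whole number."""
    if gpu_count <= 0:
        raise ValueError(f'gpu_count must be positive, got {gpu_count}')

    resolved = DEFAULT_VCPUS * gpu_count if vcpus is None else vcpus

    is_number = isinstance(resolved, (int, float)) and not isinstance(resolved, bool)
    if not is_number or resolved <= 0 or not float(resolved).is_integer():
        raise ValueError(f'vcpus must be a positive integer, got {resolved!r}')

    # The catalog CSV stores vCPUs as a float (e.g. 32.0).
    return float(resolved)


print(resolve_vcpus(None, 4))  # falls back to DEFAULT_VCPUS * 4
print(resolve_vcpus(32, 4))
```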

# Generate instances from GPU types
instances = []
for gpu in gpus:
    # initial gpu details. later, request specific quantity details
Collaborator

is it necessary? seems it's quite redundant

@kevinmingtarja
Collaborator

Hi @pirtleshell, thanks for writing this PR! FYI, we're also planning on adding CPU instances to the RunPod catalog: skypilot-org/skypilot-catalog#140

I generated the CPU instances list by modifying the script from this PR. Would you mind adding it to this PR as well? Or we could do it in a follow-up PR too. I can share the diffs I made to your script to support fetching CPU instances.

In any case, we should do that before merging skypilot-org/skypilot-catalog#133, so as not to override the CPU instances we will be adding.

Comment on lines +41 to +42
'Price',
'SpotPrice',
Collaborator

I think the position of Price and SpotPrice is reversed here, looking at https://github.com/skypilot-org/skypilot-catalog/blob/master/catalogs/v7/runpod/vms.csv

Suggested change
- 'Price',
- 'SpotPrice',
+ 'SpotPrice',
+ 'Price',
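
One way to make this kind of ordering mistake harder to introduce is to write rows by column name rather than by position. The sketch below uses `csv.DictWriter` for that. The full `COLUMNS` list is my reading of the existing catalog header and should be treated as an assumption, as should the mapping of 3.96 to Price and 2.0 to SpotPrice in the sample row.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

# Assumed column order of the existing v7 RunPod catalog (SpotPrice before Price).
COLUMNS = [
    'InstanceType', 'AcceleratorName', 'AcceleratorCount', 'vCPUs', 'MemoryGiB',
    'GpuInfo', 'Region', 'SpotPrice', 'Price', 'AvailabilityZone',
]


def write_catalog(rows: Iterable[Dict[str, object]], out: TextIO) -> None:
    """Write rows keyed by column name, so dict key order cannot swap columns."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


buf = io.StringIO()
write_catalog([{
    'InstanceType': '4x_L40_SECURE', 'AcceleratorName': 'L40',
    'AcceleratorCount': 4.0, 'vCPUs': 32.0, 'MemoryGiB': 376.0,
    'GpuInfo': 'L40', 'Region': 'US',
    # Price is deliberately listed before SpotPrice here; output order follows COLUMNS.
    'Price': 3.96, 'SpotPrice': 2.0, 'AvailabilityZone': 'US-TX-3',
}], buf)
print(buf.getvalue())
```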

@adocherty

Hi @pirtleshell & @kevinmingtarja
I'd really like to get this functionality into the codebase and I'm happy to address the reviews on this PR and work with you to add CPU functionality.

If you're happy for me to help, can I get write access to this branch?
