feat(catalog): Add RunPod data fetcher #5930
Adds a new data fetcher script to automatically generate RunPod instance catalog data. The script queries the RunPod API to fetch GPU types, pricing, and availability information.
- Verified all previously existing Accelerator Names are in the newly generated CSV.
- Added catalog entries for previously missing GPUs:
  * L40S (manually added in skypilot-org/skypilot-catalog#131)
  * RTX 2000 Ada (RTX2000-Ada)
  * RTX 5090 (RTX5090)
  * RTX PRO 6000 (RTXPRO6000)
- Includes all available quantities of GPUs (previously did not include all permutations)
The script requires a RunPod API key with read access and generates a CSV file compatible with SkyPilot's catalog system.
I've uploaded a copy of the full CSV generated by this script here: https://gist.github.com/pirtleshell/41079b4c9752a16a60c3dbc45164e7c6
Adds a workflow based on the one for AWS that automates the updating of RunPod GPU pricing and availability. Depends on skypilot-org/skypilot#5930
# Mapping of regions to their availability zones
REGION_ZONES = {
    'CA': ['CA-MTL-1', 'CA-MTL-2', 'CA-MTL-3'],
Is there a reason to have SkyPilot cycle through all availability zones, rather than leaving the zone unspecified in the RunPod API call and letting RunPod assign it automatically?
For community cloud support, it seems that cycling through the zones doesn't work, whereas letting RunPod choose the zone does:
#3441 (comment)
Also, there seems to be another PR with a different version of the data fetcher that does include community cloud support:
#5929
hello @adocherty 👋
i'm not sure what serendipity of the universe caused these two PRs for the same feature to appear at once! haha, but it definitely shows it's time for automated availability & price updates for RunPod instances 😄
no hard feelings if #5929 is merged instead of this. i'll state my original intention: generate the existing catalog CSV as closely as possible. i believe it's easier to alter or add functionality after the existing functionality is automated. i don't have a complete and deep understanding of the workings of skypilot so i did my best to create the existing CSV with minimal new functionality (no community clouds, no changes to how regions are managed, etc). if we want to take on those changes with this script, i'm happy to assist.
as for region management, it was again just a focused attempt at creating a near-identical CSV to the one that currently exists without diving into how the CSV is used by SkyPilot to discover instances.
Hi @pirtleshell
It is quite serendipitous! I had actually drafted some code to do the same thing, then came here to find not one but two PRs with the same functionality!
It sounds like a sensible strategy to get the existing functionality in place first, then add features. I have an interest in getting community cloud pods enabled, but have no horse in the race between this and #5929. Both are a good step forward.
I'm new to SkyPilot as well, and my questions were all about understanding more of how things work under the hood. That being said, I'm happy to help you get this PR over the line if you need a hand, although I don't have any magic reviewing powers!
If this gets in I'll be happy to put up a PR to enable the community cloud. Your help (and that of anyone who knows what they're doing) would be appreciated.
# only add secure clouds. skipping community cloud instances.
if not detailed_gpu['secureCloud']:
    continue
i attempted to limit the changeset in this PR. to the best of my knowledge, this code could be easily updated to accommodate the request in #3441 for community instance types.
it requires checking the detailed_gpu['communityCloud'] bool and setting InstanceType below to have a _COMMUNITY suffix where _SECURE is currently hardcoded.
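A minimal sketch of the change described above, not code from this PR. It assumes the instance type string has the shape `<count>x_<gpu name>_<cloud type>`; the helper name `instance_types_for` and that naming shape are illustrative assumptions. Only the `secureCloud` / `communityCloud` flags and the `_SECURE` / `_COMMUNITY` suffixes come from the discussion.

```python
from typing import Any, Dict, List


def instance_types_for(detailed_gpu: Dict[str, Any], gpu_name: str,
                       gpu_count: int) -> List[str]:
    """Return one instance type per cloud flavor the GPU is offered in.

    Hypothetical helper: the '<count>x_<name>_<suffix>' format is an
    assumption made for illustration.
    """
    instance_types = []
    # The two boolean flags are the ones named in the review comment.
    if detailed_gpu.get('secureCloud'):
        instance_types.append(f'{gpu_count}x_{gpu_name}_SECURE')
    if detailed_gpu.get('communityCloud'):
        instance_types.append(f'{gpu_count}x_{gpu_name}_COMMUNITY')
    return instance_types
```

With this shape, a GPU offered in both clouds would simply yield two catalog entries instead of being skipped by the `continue`.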
Left some nits here.
    return round(price, 2)


def get_gpu_counts(max_count: int) -> List[int]:
is this function really used?
    if 'errors' in result:
        raise ValueError(f'GraphQL errors: {result["errors"]}')

    return result['data']['gpuTypes'][0]
should have some assertions here?
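One way the suggested checks could look, as a sketch only. The function name `extract_single_gpu_type` is hypothetical, and the response shape (`data` -> `gpuTypes` -> list) is taken from the snippet above rather than from the RunPod API documentation.

```python
from typing import Any, Dict


def extract_single_gpu_type(result: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a GraphQL response before indexing into it."""
    if 'errors' in result:
        raise ValueError(f'GraphQL errors: {result["errors"]}')
    # `data` may be missing or null when the query fails upstream.
    gpu_types = (result.get('data') or {}).get('gpuTypes')
    if not isinstance(gpu_types, list) or not gpu_types:
        raise ValueError(
            f'Expected a non-empty gpuTypes list, got: {gpu_types!r}')
    return gpu_types[0]
```

Raising `ValueError` instead of using a bare `assert` keeps the check active when Python runs with `-O`.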
    return sorted(counts)


def format_gpu_name(gpu_type: Dict) -> str:
Suggested change:
- def format_gpu_name(gpu_type: Dict) -> str:
+ def format_gpu_name(gpu_type: Dict[str, Any]) -> str:
    # Fall back to defaults if values are None
    # scale default value by gpu_count
    vcpus = DEFAULT_VCPUS * gpu_count if vcpus is None else vcpus
let's have more assertions for vcpus here? e.g. that it is a positive integer
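A sketch of the fallback with the requested validation. The value of `DEFAULT_VCPUS` below is a placeholder, not the constant used in the PR, and `resolve_vcpus` is a hypothetical helper name. The check accepts any positive number; it could be tightened to integers if the API is known to return only integers.

```python
from typing import Optional, Union

# Placeholder value for illustration; the PR defines its own constant.
DEFAULT_VCPUS = 8

Number = Union[int, float]


def resolve_vcpus(vcpus: Optional[Number], gpu_count: int) -> Number:
    """Fall back to a scaled default, otherwise validate the API value."""
    if not isinstance(gpu_count, int) or gpu_count <= 0:
        raise ValueError(f'gpu_count must be a positive int: {gpu_count!r}')
    if vcpus is None:
        # Scale the default by the number of GPUs, as in the snippet above.
        return DEFAULT_VCPUS * gpu_count
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(vcpus, bool) or not isinstance(vcpus, (int, float)):
        raise ValueError(f'vcpus must be a number: {vcpus!r}')
    if vcpus <= 0:
        raise ValueError(f'vcpus must be positive: {vcpus!r}')
    return vcpus
```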
    # Generate instances from GPU types
    instances = []
    for gpu in gpus:
        # initial gpu details. later, request specific quantity details
is it necessary? seems it's quite redundant
Hi @pirtleshell, thanks for writing this PR! FYI, we're also planning on adding CPU instances to the RunPod catalog: skypilot-org/skypilot-catalog#140. I generated the CPU instances list by modifying the script from this PR. Would you mind adding it to this PR as well? Or we could do it in a follow-up PR too. I can share the diffs I made to your script to support fetching CPU instances. In any case, we should do that before merging skypilot-org/skypilot-catalog#133, so as not to overwrite the CPU instances we will be adding.
'Price',
'SpotPrice',
I think the position of Price and SpotPrice is reversed here, looking at https://github.com/skypilot-org/skypilot-catalog/blob/master/catalogs/v7/runpod/vms.csv
Suggested change:
- 'Price',
- 'SpotPrice',
+ 'SpotPrice',
+ 'Price',
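A column-order slip like this one could be caught automatically by comparing the generated header against the existing catalog header. The sketch below is an assumption about how such a check might be written, not part of the PR; the helper names are hypothetical and the three-column headers in the usage lines are a trimmed-down illustration, not the full catalog header.

```python
import csv
import io
from typing import List, Tuple


def read_header(csv_text: str) -> List[str]:
    """Return the first row of a CSV document as a list of column names."""
    return next(csv.reader(io.StringIO(csv_text)))


def header_mismatches(expected: List[str],
                      actual: List[str]) -> List[Tuple[int, str, str]]:
    """List (index, expected, actual) for every position that differs."""
    mismatches = []
    for i in range(max(len(expected), len(actual))):
        exp = expected[i] if i < len(expected) else ''
        act = actual[i] if i < len(actual) else ''
        if exp != act:
            mismatches.append((i, exp, act))
    return mismatches


# Illustrative headers only: the swapped pair shows up at indexes 1 and 2.
existing = read_header('Region,SpotPrice,Price\n')
generated = read_header('Region,Price,SpotPrice\n')
print(header_mismatches(existing, generated))
```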
Hi @pirtleshell & @kevinmingtarja, if you're happy for me to help, can I get write access to this branch?
Adds a new data fetcher script to automatically generate RunPod instance catalog data. The script queries the RunPod API to fetch GPU types, pricing, and availability information.
Verified all previously existing Accelerator Names are in the newly generated CSV.
Additionally, includes support for previously missing GPUs:
Includes all available quantities of GPUs (previously did not include all permutations)
The script requires a RunPod API key with read access and generates a CSV file compatible with SkyPilot's catalog system.
For each GPU, a CSV catalog entry is created for every region (hardcoded list of regions) for every available quantity.
For example, the API returns availableGpuCounts like [1,2,3,4] to indicate you can request an instance with up to 4 GPUs. Thus, the GPU is listed in the catalog 100 times: gpu quantity options (4) * number of regions (25).
Verification
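As a quick sanity check of the row count described above (4 quantities * 25 regions = 100 entries per GPU), the expansion can be sketched as a Cartesian product. The function name and the placeholder region names are hypothetical; the real script uses its own hardcoded region list.

```python
import itertools
from typing import Any, Dict, List


def expand_catalog_rows(gpu_name: str, available_gpu_counts: List[int],
                        regions: List[str]) -> List[Dict[str, Any]]:
    """Create one row per (GPU quantity, region) combination."""
    return [{
        'AcceleratorName': gpu_name,
        'AcceleratorCount': count,
        'Region': region,
    } for count, region in itertools.product(available_gpu_counts, regions)]


# Placeholder region names standing in for the 25 hardcoded regions.
regions = [f'REGION-{i}' for i in range(25)]
rows = expand_catalog_rows('L40', [1, 2, 3, 4], regions)
print(len(rows))
```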
For testing, I locally generated the catalog CSV:
Then I compared all of the generated CSV's accelerator names to the ones in the existing v7 catalog CSV:
Thus, all originally available accelerators are available (with an unchanged name) plus three more (RTX2000-Ada, RTX5090, RTXPRO6000).
For a final gut check, I manually compared lines for some specific quantity-GPU-region tuples.
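The accelerator-name comparison described above could be reproduced roughly as follows. This is not the command the author ran (that output is not preserved here); it is a sketch that assumes both CSVs have an `AcceleratorName` column, and the two inline CSV strings are made-up samples.

```python
import csv
import io
from typing import Set


def accelerator_names(csv_text: str) -> Set[str]:
    """Collect the distinct, non-empty AcceleratorName values in a CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row['AcceleratorName'] for row in reader if row.get('AcceleratorName')
    }


# Made-up sample data standing in for the existing and generated catalogs.
old_csv = 'InstanceType,AcceleratorName\n1x_L40_SECURE,L40\n'
new_csv = ('InstanceType,AcceleratorName\n'
           '1x_L40_SECURE,L40\n'
           '1x_RTX5090_SECURE,RTX5090\n')

missing = accelerator_names(old_csv) - accelerator_names(new_csv)
added = accelerator_names(new_csv) - accelerator_names(old_csv)
print('missing:', sorted(missing), 'added:', sorted(added))
```

An empty `missing` set corresponds to the claim that every previously listed accelerator is still present under the same name.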
Besides expected fluctuation of prices, there are some differences in the vCPU and MemoryGiB columns. It's unclear to me why they are different. To the best of my knowledge, the values coming from the API and used by this script are correct.
Example difference:
Previously, L40 was listed as having 64 vCPUs and 192 GiB RAM. I confirmed in the API & UI of RunPod that the vCPUs and memory for an instance with 4x L40s match the newly generated values:
Similar differences can be seen in other GPUs, like the A100.
Tested (run the relevant ones):
- bash format.sh
- /smoke-test (CI) or pytest tests/test_smoke.py (local)
- /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
- /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)