chore(gpu): rework ci to adapt to the shortage of h100 #1742

Merged
merged 1 commit into from
Nov 4, 2024

agnesLeroy
Contributor

closes: please link all relevant issues

PR content/description

I'm removing the classic PBS tests from the H100 workflows and moving them to dedicated workflows on RTX so that we can be sure they still get tested, since there is currently a shortage of H100s on Hyperstack.

Check-list:

  • Tests for the changes have been added (for bug fixes / features)
  • Docs have been added / updated (for bug fixes / features)
  • Relevant issues are marked as resolved/closed, related issues are linked in the description
  • Check for breaking changes (including serialization changes) and add them to commit message following the conventional commit specification

Contributor

@soonum soonum left a comment

Good workaround 👍
I'm curious, why are we experiencing an H100 shortage? Is this chip already too old for Hyperstack?

@agnesLeroy
Contributor Author

@soonum no, I think it's just that the platform has more and more clients. They will increase their capacity, but I have no timeline for this atm. Blackwells won't be available for the on-demand offer for at least a year, because Nvidia had delays on them.

@agnesLeroy agnesLeroy merged commit bd255cd into main Nov 4, 2024
139 of 141 checks passed
@agnesLeroy agnesLeroy deleted the al/adapt_ci_h100 branch November 4, 2024 14:23
3 participants