GPU runners are stalled #2365

Closed
jhamman opened this issue Oct 14, 2024 · 3 comments

Comments

@jhamman
Member

jhamman commented Oct 14, 2024

All active PRs are currently failing to get GPU CI runners:

GPU Test V3 / py=3.11, np=2.0, deps=minimal (pull_request) Queued — Waiting to run this check...

The actions are all reporting something like this:

Requested labels: gpu-runner
Job defined at: zarr-developers/zarr-python/.github/workflows/gpu_test.yml@refs/heads/v3
Waiting for a runner to pick up this job...
Job is waiting for a runner from 'gpu-runner' to come online.

For PRs that are obviously unrelated, I've been merging without waiting for this action to run.
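
For context, these jobs are routed to the GPU pool solely through the job's `runs-on` label. The sketch below is illustrative only (it is not the actual contents of gpu_test.yml; the job name, matrix, and install steps are assumptions): any job declaring `runs-on: gpu-runner` sits in the queue until a runner registered with that label comes online, which matches the "Waiting for a runner to pick up this job..." message above.

# Illustrative sketch, not the real zarr-developers gpu_test.yml.
# The key line is `runs-on: gpu-runner`: without an online runner
# carrying that label, the job never leaves the queue.
name: GPU Test V3

on:
  pull_request:

jobs:
  gpu-test:
    runs-on: gpu-runner          # label the queued jobs are waiting on
    strategy:
      matrix:
        python-version: ["3.11"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      # Install and test steps are hypothetical placeholders; the real
      # workflow installs the project's GPU test dependencies.
      - run: pip install .[test]
      - run: pytest -m gpu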

@jhamman
Member Author

jhamman commented Oct 15, 2024

With @rabernat's help, I've determined this is not related to the switch to main. I have filed a support ticket with GitHub about this.

@jhamman
Member Author

jhamman commented Oct 15, 2024

GitHub (GitHub Support) Oct 15, 2024, 2:19 PM UTC

Hello,

Thank you for reaching out to GitHub Support and bringing this issue to our attention.

We are currently experiencing an incident affecting our GPU runners.

After thorough investigation, we have identified the root cause to be an issue with the NVIDIA GPU-Optimized VMI image provided by NVIDIA, which we utilize for our GPU Runners.

Unfortunately, the latest image available on the Azure Marketplace appears to be faulty, and previous versions are currently unavailable.

This issue has also been publicly reported on NVIDIA's page.

Please rest assured that our Engineering team is actively working on resolving this issue in collaboration with NVIDIA.

We understand the importance of this service to your work and are committed to restoring full functionality as soon as possible.

At this time, we do not have a timeline for when this will be resolved, but we will notify you immediately once it is.

Thank you for your patience and understanding.

Best regards,

David
GitHub Support

@jhamman
Member Author

jhamman commented Oct 15, 2024

Seems to be fixed. Will reopen if this happens again.
