2/3 Trainings on NVIDIA H100 80GB HBM3 get stuck with no chance to resume #964

Open
@WhoIsElMasri

Description

Search before asking

  • I have searched the HUB issues and found no similar bug report.

HUB Component

Training

Bug

Impacted Trainings:

  • 6emf7AeSmgKZdjIDqE78
  • CTnXcC06MFRvU9BItUkK

Example: one training shows "31% Disconnected. Checkpoint saved for epoch 167."
Attempting to resume fails with: "Something went wrong. Please try again later."

Environment

Independent of browser and local environment.

Minimal Reproducible Example

No response

Additional

No response

Labels

  • HUB: Ultralytics HUB issues
  • bug: Something isn't working
  • web: Related to web interface or web functionality
