Fix and improve loading of distributed checkpoints #314

Draft
wants to merge 88 commits into
base: main

jlamypoirier (Collaborator) commented Jun 19, 2025

✨ Description

Fix #293 (not yet)

Lots of improvements to the loading of distributed checkpoints in different formats.

  • Load files only if they are actually needed for conversion. This should speed things up a lot for large world sizes.
  • Implement a new, much faster loading method for the common case which just copies contiguous slices.
  • Keep track of the per-parameter loaded count so SafeLoad can verify it with _check_parameters.
  • (TODO) Implement a separate method for the more complex case of tensors which undergo a change in TP size.
  • (TODO?) Add more tests involving changes in TP size.
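
The first three points above can be illustrated with a minimal sketch. It is not the code in this PR: all names (`overlap`, `load_local_shard`, `check_counts`, `read_file`) are hypothetical, plain Python lists stand in for tensors, and each saved file is assumed to hold one contiguous range of a flat parameter buffer. A file is read only if its range overlaps the local shard, the overlap is copied as one contiguous slice, and a per-parameter count of loaded elements is kept for a final check.

```python
def overlap(a, b):
    """Intersection of two half-open ranges (begin, end), or None if empty."""
    begin, end = max(a[0], b[0]), min(a[1], b[1])
    return (begin, end) if begin < end else None


def load_local_shard(local_range, saved_ranges, parameters, read_file):
    """Fill the local shard from saved files, reading only the files needed.

    local_range: range of the flat buffer owned by this rank (hypothetical layout).
    saved_ranges: file name -> range of the flat buffer stored in that file.
    parameters: parameter name -> range of the flat buffer it occupies.
    read_file: callable returning the contents of a saved file as a list.
    """
    local_begin, local_end = local_range
    buffer = [None] * (local_end - local_begin)
    counts = {name: 0 for name in parameters}
    files_read = []
    for file_name, saved_range in saved_ranges.items():
        common = overlap(local_range, saved_range)
        if common is None:
            continue  # No overlap: this file is never opened.
        data = read_file(file_name)
        files_read.append(file_name)
        begin, end = common
        # One contiguous slice copy per overlapping file.
        buffer[begin - local_begin:end - local_begin] = (
            data[begin - saved_range[0]:end - saved_range[0]]
        )
        # Track how many elements of each parameter were loaded.
        for name, param_range in parameters.items():
            loaded = overlap(common, param_range)
            if loaded is not None:
                counts[name] += loaded[1] - loaded[0]
    return buffer, counts, files_read


def check_counts(parameters, counts_per_rank):
    """Raise if any parameter was not loaded exactly once across all ranks."""
    for name, (begin, end) in parameters.items():
        total = sum(counts[name] for counts in counts_per_rank)
        if total != end - begin:
            raise ValueError(f"{name}: loaded {total} of {end - begin} elements")


# Checkpoint saved with 4 ranks (3 elements each), loaded with 2 ranks.
saved_ranges = {f"rank_{i}": (3 * i, 3 * i + 3) for i in range(4)}
files = {name: list(range(b, e)) for name, (b, e) in saved_ranges.items()}
parameters = {"weight": (0, 8), "bias": (8, 12)}
results = [
    load_local_shard(local, saved_ranges, parameters, files.__getitem__)
    for local in [(0, 6), (6, 12)]
]
check_counts(parameters, [counts for _, counts, _ in results])
```

In this toy layout each new rank reads only two of the four saved files. Tensors whose layout changes with TP size would not reduce to a single contiguous copy and are outside the scope of this sketch.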

For the case of an unchanged distributed config (e.g. starting a new experiment from a distributed checkpoint), loading should now be almost as fast as the unsafe version.

This will also help a lot with elastic training (#241) by cutting most of the resuming time.

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

Development

Successfully merging this pull request may close these issues.

[bug] Conversion of distributed checkpoints to huggingface