Improve error warning for dist_cp loading without optimizer state #3752

Open: wants to merge 7 commits into main

@j316chuck (Contributor) commented Jan 24, 2025

What does this PR do?

Improve error logging for checkpoints saved without optimizer state (weights only) and then loaded with load_weights_only=False under the sharded checkpointing code path.

What issue(s) does this change relate to?

https://databricks.atlassian.net/browse/GRT-2801

Tests

Before: 1-node-mpt-13b-monolithic-crusoe-EVyT86 - optimizer key error 🔴

[rank4]:   File "/usr/lib/python3/dist-packages/torch/distributed/checkpoint/default_planner.py", line 354, in create_default_local_load_plan
[rank4]:     raise RuntimeError(f"Missing key in checkpoint state_dict: {fqn}.")
[rank4]: RuntimeError: Missing key in checkpoint state_dict: state.model.model.transformer.blocks.5.norm_1.weight.

After: 1-node-mpt-13b-monolithic-crusoe-AF3tW4 - a proper warning is logged ✅, and then the error about the optimizer state is still raised

2025-01-24 01:12:52,470: rank0[462][MainThread]: INFO: composer.utils.checkpoint: Optimizer states are not in the state_dict and won't be loaded.
2025-01-24 01:12:52,470: rank0[462][MainThread]: INFO: Consider setting load_weights_only=True or ensure that the optimizer state is saved in the checkpoint.
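
For readers skimming the logs, the check behind a message like this could look roughly like the sketch below. It is not the code in this PR: the function name `should_load_optimizer_state`, the logger name, and the `state.optimizers` key prefix are all assumptions made for illustration. Only the two log messages are taken from the output above.

```python
import logging

log = logging.getLogger("checkpoint_sketch")  # hypothetical logger name

# Assumption: optimizer entries in the checkpoint share this key prefix.
OPTIMIZER_PREFIX = "state.optimizers"


def should_load_optimizer_state(checkpoint_keys, load_weights_only=False):
    """Decide whether optimizer state can be loaded from a checkpoint.

    checkpoint_keys: iterable of fully qualified key names found in the
    checkpoint metadata. Returns True only when optimizer state was
    requested and is actually present.
    """
    if load_weights_only:
        # The caller asked for weights only, so there is nothing to check.
        return False
    if any(key.startswith(OPTIMIZER_PREFIX) for key in checkpoint_keys):
        return True
    # Optimizer state was requested but the checkpoint does not contain it.
    log.info("Optimizer states are not in the state_dict and won't be loaded.")
    log.info(
        "Consider setting load_weights_only=True or ensure that the "
        "optimizer state is saved in the checkpoint."
    )
    return False
```

With a weights-only checkpoint, `should_load_optimizer_state(["state.model.model.transformer.blocks.5.norm_1.weight"])` logs both messages and returns `False`, so a caller could drop the optimizer entries from the load plan instead of failing later on a missing key.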

@j316chuck marked this pull request as draft January 24, 2025 01:13
@j316chuck requested a review from dakinggg January 24, 2025 01:16
@j316chuck marked this pull request as ready for review January 24, 2025 05:36
@j316chuck changed the title from "Improve error logging for dist_cp loading without optimizer state" to "Improve error warning for dist_cp loading without optimizer state" Jan 24, 2025
@j316chuck requested a review from a team as a code owner January 24, 2025 21:28