FSDP training not loading/saving the best checkpoint #472

Open
BSharmi opened this issue Jan 22, 2024 · 0 comments


BSharmi commented Jan 22, 2024

Hi there!

I followed training a T5 model with FSDP on Sagemaker from the example https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/scripts/run_clm.py

I noticed that checkpointing is disabled with save_strategy="no". Is that intentional (see https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/scripts/run_clm.py#L93)? In my training I changed it to save_strategy="steps" and noticed two issues:

  1. The best checkpoint (minimum validation loss) is not saved. For example, if I set the limit to 2, only the last two checkpoints are kept.
  2. I was not able to load the trained model from a checkpoint, and got the error mentioned elsewhere in the issues: RuntimeError: Trying to resize storage that is not resizable. This does not happen when I load the final model, but it makes training hard, since I would need to know in advance when to stop training so that the final saved model is the one with the minimum loss. I tried different versions:
PyTorch 1.13
Transformers 4.26

and

PyTorch 2.0.0
Transformers 4.28.1

and saw the same issue when loading a model from a checkpoint.
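For context on the first issue, here is a minimal sketch of the TrainingArguments I would expect to keep the best checkpoint through rotation (the output_dir and step values are placeholders; whether load_best_model_at_end / metric_for_best_model behave this way under FSDP is exactly what I am asking about):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",              # placeholder path
    evaluation_strategy="steps",        # must match save_strategy for best-model tracking
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,                 # older checkpoints are rotated out
    load_best_model_at_end=True,        # should protect the best checkpoint from rotation
    metric_for_best_model="eval_loss",
    greater_is_better=False,            # lower validation loss is better
)
```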

I would appreciate any pointers.

Thank you!
