Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update stateful_callbacks state before saving checkpoint #32115

Conversation

pedrobrs
Copy link
Contributor

Fixes

This PR addresses an issue where stateful callbacks, such as EarlyStoppingCallback, were not being updated before saving checkpoints. As a result, resuming training would not have access to the latest state of these callbacks.

Description

Updated the state of stateful callbacks before saving checkpoints to ensure that their state is preserved and correctly restored when resuming training.

Suggested Reviewers

@muellerz @amyeroberts @SunMarc

@huggingface huggingface deleted a comment from github-actions bot Aug 20, 2024
@amyeroberts
Copy link
Collaborator

Gentle ping @muellerzr

Copy link
Contributor

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix :)

@muellerzr
Copy link
Contributor

@pedrobrs can you rebase from main? This should fix the failing tests

@pedrobrs pedrobrs force-pushed the exportable_state_callbacks_update_state_on_save branch from d57a56f to f68b8b9 Compare August 21, 2024 18:20
@muellerzr
Copy link
Contributor

looks like this logic broke some of the tests in test_trainer.py:

 tests/trainer/test_trainer_callback.py::TrainerCallbackTest::test_stateful_duplicate_callbacks

Looks like this is because this change doesn't actually wind up restoring the state?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@pedrobrs
Copy link
Contributor Author

I saw that error when pushing the first time, I'll investigate the issue to identify and fix the problem.
There is at least one other test failed t(marked as required) that I don't fully understand. The error occurs in step_and_quality / check_repository_consistency. The error message is:

RuntimeError: Failed to import transformers.models.audio_spectrogram_transformer.feature_extraction_audio_spectrogram_transformer because of the following error (look up to see its traceback):
libtorch_cuda.so: cannot open shared object file: No such file or directory

I may not be able to resolve this issue on my own. What should I do about it?

@ArthurZucker
Copy link
Collaborator

This one is unrelated to you and was already fixed on main!

@pedrobrs pedrobrs force-pushed the exportable_state_callbacks_update_state_on_save branch from f68b8b9 to 63552fa Compare August 22, 2024 22:51
Copy link
Member

@SunMarc SunMarc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice !

Copy link
Contributor

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job with grabbing the multiples :)

@muellerzr
Copy link
Contributor

cc @ArthurZucker for final 🚀

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks all!

@ArthurZucker ArthurZucker merged commit d1f39c4 into huggingface:main Aug 27, 2024
21 checks passed
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024
…#32115)

* update ExportableState callbacks state before saving trainer_state on save_checkpoint

* run make fixup and fix format

* manage multiple stateful callbacks of same class
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024
…#32115)

* update ExportableState callbacks state before saving trainer_state on save_checkpoint

* run make fixup and fix format

* manage multiple stateful callbacks of same class
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
…#32115)

* update ExportableState callbacks state before saving trainer_state on save_checkpoint

* run make fixup and fix format

* manage multiple stateful callbacks of same class
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants