Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tune][Fix]Remove the clear_checkpoint function during Trial restoration error handling. #48532

Merged

Conversation

hongpeng-guo
Copy link
Contributor

@hongpeng-guo hongpeng-guo commented Nov 4, 2024

Why are these changes needed?

This PR is trying to fix a issue happened during a Trial Restoration. This is a bug that doesn’t happen often, but it will happen if the actor dies during Trainable.restore such as the preemption scenarios.

If Trainable.restore doesn’t run successfully, the TuneController will handle it as a special “restore error,” which clears the checkpoint.

Tune treats restore errors differently because it doesn’t necessarily increment num_failures, so trials can try “restoring” multiple times without the run erroring out even if max_failures=0.
Then, on the next restoration, we have an invalid state where latest_checkpoint_result = TrainingResult(checkpoint=None, metrics=…) since it was modified by clear_checkpoint, which will result in an error in Trainable.restore.

This PR removes the clear_checkpoint function. So that the new behavior is:

  1. Every TUNE_RESTORE_RETRY_NUM restoration failures contributes to one num_failures.
  2. We don't do extra handing of existing checkpoints, i.e., clear_checkpoints is not applied.

After this fix, we can still provide special handling of restoration errors. If restoration is interrupted by preemption, we provide extra chances to do restoration, without directly increasing total num_failures. If the restoration failure is due to some deterministic problem, i.e., corrupted latest checkpoint. The job will fail after total number of TUNE_RESTORE_RETRY_NUM*num_failures` retries.

We reduced our internal logic to make the overall process clearer.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if self.temporary_state.num_restore_failures >= int(
os.environ.get("TUNE_RESTORE_RETRY_NUM", 0)
):
# Restore was unsuccessful, try again without checkpoint.
self.clear_checkpoint()
self.run_metadata.num_failures += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are diff between self.run_metadata.num_failures and self.temporary_state.num_restore_failures?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_restore_failures is the number of failed restoration.
num_failures is the number of failures that caused by user/ application code.

because restoration is not a user defined behavior, but some feature we provided. We don't treat restoration failure same as normal application failure. The behavior is, when the program failed due to application, we increment the num_failures and trying to restore the application. If the restoration is successful, the program just goes on. If the restoration fail, we will keep on trying to restore but increments the number of num_restore_failures by 1. When the TUNE_RESTORE_RETRY_NUM restore reaches, we stop restoration, and increment the num_failures by another 1.

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
@hongpeng-guo
Copy link
Contributor Author

hongpeng-guo commented Nov 5, 2024

This makes sense to me. Btw, how does this test work...

https://github.com/anyscale/rayturbo/blob/7b749834329bc4cd4c87c2adf7a9c5eb084a2166/python/ray/tune/tests/test_tuner_restore.py#L539

The unit test is not super clear and not testing the issue mentioned above. I slightly modified the unit test and adding some documentation.
The unit test now captures the issue, and passes after the fix.

Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the test!

Can we also update the entry in this doc? https://docs.ray.io/en/latest/tune/api/env.html

Btw, I found the original source of this clear_checkpoint behavior. I think it was a catch-all fix for checkpoints that were held in-memory in the object store (which is no longer a thing). These in-memory checkpoints could be lost on node failure, which would cause the restoration to fail. Then, in this situation, it kind of made sense to restart from scratch since the in-memory checkpoint cannot be found.

TL;DR: We are safe to remove the clear checkpoint functionality, since it was a patch for SUPER-LEGACY constraints.

@hongpeng-guo
Copy link
Contributor Author

Can we also update the entry in this doc? https://docs.ray.io/en/latest/tune/api/env.html

Sure, but I don't think this change modifies the original definition of the env var TUNE_RESTORE_RETRY_NUM that is specified in the doc. We removed the clear_checkpoint function, which is not specified by its definition in doc, either.

BTW, there is a readthedocs fail on this PR. Could it be related to anything that I am missing? cc @justinvyu

@justinvyu
Copy link
Contributor

Not sure what happened with that build, looks like it succeeded on this one.

@justinvyu justinvyu enabled auto-merge (squash) November 6, 2024 18:57
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 6, 2024
@justinvyu justinvyu merged commit a1bb4a4 into ray-project:master Nov 6, 2024
6 of 7 checks passed
@hongpeng-guo hongpeng-guo deleted the hpguo/v2/tune_restore_fix branch November 6, 2024 22:21
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
…ation error handling. (ray-project#48532)

This PR removes the `clear_checkpoint` function,
so that Tune doesn't try to "restart trials from scratch.
`clear_checkpoint` solved for a legacy use case that doesn't
apply anymore, and "restoration failures" are also now an
edge case for function Trainables and Ray Train usage.

---------

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
…ation error handling. (ray-project#48532)

This PR removes the `clear_checkpoint` function,
so that Tune doesn't try to "restart trials from scratch.
`clear_checkpoint` solved for a legacy use case that doesn't
apply anymore, and "restoration failures" are also now an
edge case for function Trainables and Ray Train usage.

---------

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants