Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[air] Add Checkpoint.as_directory() for efficient checkpoint fs processing #23908

Merged
merged 3 commits into from
Apr 14, 2022

Conversation

krfricke
Copy link
Contributor

Why are these changes needed?

This PR adds a Checkpoint_as_directory() context manager that either returns the local path (if checkpoint is already a directory) or a temporary directory path containing the checkpoint data, which is cleaned up after use. The path should be considered as a read-only source for loading data from the checkpoint.

A common use case for processing checkpoint data is to convert it into a directory with Checkpoint.to_directory() and then do some read-only processing (e.g. restoring a ML model).

This process has two flaws: First, to_directory() creates a temporary directory that has to be explicitly cleaned up by the user after use. Secondly, if the checkpoint is already a directory checkpoint, it is copied over, which is inefficient for large checkpoints (e.g. huggingface models) and then even more prone to unwanted side effects if not cleaned up properly.

With this context manager that effectively returns a directory that is to be used as a read-only data source, we can avoid manual cleaning up and unnecessary data copies (or avoid internal inspection as e.g. in https://github.com/ray-project/ray/pull/23876/files#diff-47db2f054ca359879f77306e7b054dd8b780aab994961e3b4911330ae15eeae3R57-R60)

See also discussion in https://github.com/ray-project/ray/pull/23850/files#r850036905

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

python/ray/ml/checkpoint.py Outdated Show resolved Hide resolved
@amogkam
Copy link
Contributor

amogkam commented Apr 14, 2022

Should we change the current usages of to_directory with a tmpfile to use this context manager instead?

@ericl ericl merged commit d96ac25 into ray-project:master Apr 14, 2022
@krfricke krfricke deleted the air/checkpoint-ro-dir branch April 14, 2022 19:15
krfricke added a commit that referenced this pull request Apr 23, 2022
…24113)

Follow-up from #23908

Instead of manually deleting checkpoint paths after calling `to_directory()`, we should utilize `Checkpoint.as_directory()` when possible.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants