[train] Storage refactor: Support PBT and BOHB #38736

Merged: 51 commits, Aug 25, 2023
fa992fc
Adjust save_checkpoint API
Aug 17, 2023
f0f0f41
more
Aug 17, 2023
77be920
fix test
Aug 17, 2023
c2c6073
Merge remote-tracking branch 'upstream/master' into tune/storage-pbt
Aug 17, 2023
711eefa
Update typehints
Aug 17, 2023
8bcc82e
Merge remote-tracking branch 'upstream/master' into tune/storage-pbt
Aug 17, 2023
80ec41e
Merge branch 'master' into tune/storage-pbt
Aug 21, 2023
964247e
undo pause logic
Aug 21, 2023
33e896f
Merge branch 'master' into tune/pbt-bohb-pause
Aug 22, 2023
863ec03
resolve future
Aug 22, 2023
a2eb589
Pausing
Aug 22, 2023
9af362e
skip memory test
Aug 22, 2023
0a98a16
typo
Aug 22, 2023
789752b
Overwrite trial restore path
Aug 22, 2023
965f3db
Merge branch 'master' into tune/pbt-bohb-pause
Aug 22, 2023
fa89632
default 0
Aug 22, 2023
190df4f
[train/tune] Remove save_to_object/restore_from_object
Aug 22, 2023
138f92d
Fixes
Aug 22, 2023
b674dd2
avoid variable name conflict
Aug 22, 2023
b0a1e57
Merge remote-tracking branch 'upstream/master' into tune/remove-save-…
Aug 23, 2023
e6ac302
fix last test
Aug 23, 2023
55f1b84
Merge remote-tracking branch 'upstream/master' into tune/pbt-bohb-pause
Aug 23, 2023
4b624c5
Merge branch 'tune/remove-save-restore-obj' into tune/pbt-bohb-pause
Aug 23, 2023
8c87077
fix last test
Aug 23, 2023
11966c0
Merge remote-tracking branch 'upstream/master' into tune/pbt-bohb-pause
Aug 23, 2023
40819b0
bohb unpause
Aug 23, 2023
031ea23
pbt tests for storage
Aug 23, 2023
209ff6a
fix checkpoint test
Aug 23, 2023
d6839b6
more fixes
Aug 23, 2023
8484c5a
Merge remote-tracking branch 'upstream/master' into tune/pbt-bohb-pause
Aug 23, 2023
101d053
Fix hashing
Aug 23, 2023
0625416
exclude pbt_transformers
Aug 23, 2023
14f2d42
default 0
Aug 23, 2023
04f7c66
fix examples
Aug 23, 2023
75cdcbd
fix some tests
Aug 23, 2023
0b2ee3f
Merge remote-tracking branch 'upstream/master' into tune/pbt-bohb-pause
Aug 23, 2023
359aaad
review
Aug 23, 2023
2515831
Remove changes to old codepath
Aug 23, 2023
7769bae
Merge remote-tracking branch 'upstream/master' into tune/pbt-bohb-pause
Aug 24, 2023
9fc4cc6
remove empty pipeline
Aug 24, 2023
6cbece0
Cache decision in pause
Aug 24, 2023
9ae6dfd
Exploit
Aug 24, 2023
9572d03
Fix trial.checkpoint
Aug 24, 2023
77b4ae9
fix tests
Aug 24, 2023
c3bf12b
review
Aug 24, 2023
80e8a6d
Revert
Aug 24, 2023
d490db5
Merge remote-tracking branch 'upstream/master' into tune/pbt-bohb-pause
Aug 24, 2023
25927f7
Merge branch 'master' into tune/pbt-bohb-pause
krfricke Aug 24, 2023
f697157
Merge remote-tracking branch 'upstream/master' into tune/pbt-bohb-pause
Aug 25, 2023
af8e44c
Update build files, resolve merge logic conflict
Aug 25, 2023
3ba628e
Merge remote-tracking branch 'origin/tune/pbt-bohb-pause' into tune/p…
Aug 25, 2023
Overwrite trial restore path
Signed-off-by: Kai Fricke <kai@anyscale.com>
Kai Fricke committed Aug 22, 2023
commit 789752bd2bb17e0ab21d03f44f8aa9513e01058f
2 changes: 1 addition & 1 deletion python/ray/tune/execution/tune_controller.py
@@ -2079,7 +2079,7 @@ def _schedule_trial_restore(self, trial: Trial) -> bool:
 )
 return True

-checkpoint = trial.checkpoint
+checkpoint = trial.temporary_state.next_restore or trial.checkpoint
Review comment (Contributor):
I don't think the new codepath needs this next_restore; it can be removed.

The new checkpoint manager no longer has IDs for training results; it just holds onto the latest checkpoint as a property.


 if checkpoint.dir_or_data is None:
     logger.debug(f"Not restoring trial {trial}: No checkpoint found.")
4 changes: 4 additions & 0 deletions python/ray/tune/experiment/trial.py
@@ -212,6 +212,7 @@ def __init__(self):

 self.saving_to = None
 self.restoring_from = None
+self.next_restore = None

 self.num_restore_failures = 0

@@ -1091,6 +1092,9 @@ def on_checkpoint(self, checkpoint: Union[_TrackedCheckpoint, _TrainingResult]):
     # This index will get restored when the trial is restored and will
     # be passed to the Trainable as the starting checkpoint index.
     self.storage.current_checkpoint_index += 1
+    # Remove any next restore overrides - instead we should now restore
+    # from trial.checkpoint
+    self.temporary_state.next_restore = None
 else:
     self.run_metadata.checkpoint_manager.on_checkpoint(checkpoint)
     self.run_metadata.invalidate_cache()
3 changes: 2 additions & 1 deletion python/ray/tune/schedulers/pbt.py
@@ -943,7 +943,8 @@ def _exploit(
 else:
     exploit = checkpoint_to_exploit

-trial.on_checkpoint(exploit)
+# Next trial restore should use this checkpoint
+trial.temporary_state.next_restore = exploit

 self._num_perturbations += 1
 # Transfer over the last perturbation time as well
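The three hunks above fit together as a restore override: the scheduler sets next_restore during an exploit, the controller prefers it over trial.checkpoint when restoring, and the next checkpoint the trial reports clears it. The following is a minimal, self-contained sketch of that lifecycle, not Ray's implementation. ToyTrial, TemporaryState, checkpoint_to_restore and exploit are hypothetical stand-ins, and checkpoints are plain strings here rather than the checkpoint objects Ray uses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TemporaryState:
    # Hypothetical stand-in for the trial's temporary state in the diff.
    next_restore: Optional[str] = None


@dataclass
class ToyTrial:
    # Latest checkpoint reported by the trial itself (a plain string here).
    checkpoint: Optional[str] = None
    temporary_state: TemporaryState = field(default_factory=TemporaryState)

    def on_checkpoint(self, checkpoint: str) -> None:
        self.checkpoint = checkpoint
        # A newly reported checkpoint supersedes any pending override.
        self.temporary_state.next_restore = None

    def checkpoint_to_restore(self) -> Optional[str]:
        # Same expression as the controller hunk:
        # `next_restore or trial.checkpoint`.
        return self.temporary_state.next_restore or self.checkpoint


def exploit(trial: ToyTrial, donor_checkpoint: str) -> None:
    # Like the PBT hunk: record an override instead of calling on_checkpoint.
    trial.temporary_state.next_restore = donor_checkpoint


trial = ToyTrial()
trial.on_checkpoint("ckpt_own_1")

# After an exploit, the next restore uses the donor checkpoint,
# while the trial's own latest checkpoint is left untouched.
exploit(trial, "ckpt_donor_7")
restore_after_exploit = trial.checkpoint_to_restore()

# Once the trial reports a new checkpoint, the override is dropped.
trial.on_checkpoint("ckpt_own_2")
restore_after_new_checkpoint = trial.checkpoint_to_restore()
```

In this sketch the override leaves the trial's own checkpoint record alone, which is the visible difference from the removed trial.on_checkpoint(exploit) call. Whether the new codepath needs the override at all is the open question raised in the review comment above.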
2 changes: 1 addition & 1 deletion python/ray/tune/tests/test_trial_scheduler_pbt.py
@@ -177,7 +177,7 @@ def get_virt_mem(cls):
 )

 checkpoint_config = CheckpointConfig(
-    num_to_keep=2,
+    num_to_keep=3,
     checkpoint_frequency=2,
 )