Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
- Ray installed from (source or binary): pip installed
- Ray version: 0.6.1
- Python version: 3.6.5
- Exact command to reproduce: N/A
Describe the problem
I was wondering if it'd be possible to get more information about the design decision of saving every checkpoint during training. When running large grid searches where each trial consists of many steps, disk space usage can blow up very quickly (I went beyond 100 GB very fast), which may lead to out-of-disk-space errors, for example when using AWS instances.
Instead, why not overwrite the checkpoint every time? If the goal is persistence, then overwriting the previous checkpoint would be reasonable. If the checkpoints are there to keep track of the best model (which I don't believe they are), then another, more efficient strategy would be to always keep two checkpoints: the best checkpoint and the last checkpoint (assuming "best" is defined).
This is actually what I do myself. I keep track of the best model directly in the Trainable, and always checkpoint it alongside the last model. So in theory, I could consistently overwrite the checkpoint and make disk space usage a function of the number of trials only, rather than also a function of the number of steps.
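To illustrate, here is a minimal sketch of the "best + last" strategy described above, written in plain Python and independent of Ray Tune's API (the `CheckpointManager` class and its method names are hypothetical, not part of Ray):

```python
import os
import tempfile

class CheckpointManager:
    """Keeps only two checkpoints per trial: the latest and the best.

    Because each save overwrites the previous file, disk usage stays
    constant per trial regardless of how many training steps run.
    (Hypothetical helper for illustration, not part of Ray Tune.)
    """

    def __init__(self, trial_dir):
        self.trial_dir = trial_dir
        self.best_score = float("-inf")
        os.makedirs(trial_dir, exist_ok=True)

    def save(self, state_bytes, score):
        # Always overwrite the "last" checkpoint.
        with open(os.path.join(self.trial_dir, "last.ckpt"), "wb") as f:
            f.write(state_bytes)
        # Overwrite the "best" checkpoint only when the score improves.
        if score > self.best_score:
            self.best_score = score
            with open(os.path.join(self.trial_dir, "best.ckpt"), "wb") as f:
                f.write(state_bytes)

# Usage: three training steps, but only two files ever exist on disk.
with tempfile.TemporaryDirectory() as d:
    mgr = CheckpointManager(d)
    for step, score in enumerate([0.1, 0.5, 0.3]):
        mgr.save(f"weights@step{step}".encode(), score)
    print(sorted(os.listdir(d)))  # → ['best.ckpt', 'last.ckpt']
```

With this scheme, total disk usage is bounded by two checkpoints per trial, no matter how long each trial runs.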
Please let me know if I'm missing something! And thank you in advance for the info.