Description
Is your feature request related to a problem? Please describe.
Regularly offloading checkpoints from local storage to object storage (e.g. FSx to S3) is painful, time-consuming, and error-prone. It would be great if we could automate this.
Describe the solution you'd like
fsspec might be a good way of doing this, since it exposes local and remote (e.g. S3) filesystems through one interface.
Describe alternatives you've considered
Open to alternatives for implementing this! Most importantly, we want a solution that is robust to interruption: it should report the failure and abort the run if a checkpoint fails to save, and it should not delete local checkpoints until they are confirmed to be backed up.
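To make the requirement concrete, here is a rough sketch of the "upload, verify, then delete" order. It is not taken from the linked PR. The names `offload_checkpoint` and `CheckpointBackupError` are made up for illustration, and `fs` is assumed to be an fsspec-style filesystem object exposing `put_file(lpath, rpath)` and `size(path)`. Comparing sizes is only a cheap sanity check; a checksum comparison would be stronger.

```python
import os


class CheckpointBackupError(RuntimeError):
    """Hypothetical error: the checkpoint could not be confirmed remotely."""


def offload_checkpoint(fs, local_path, remote_path, delete_local=True):
    """Upload one checkpoint, verify it, and only then delete the local copy.

    `fs` is assumed to be an fsspec-style filesystem (put_file / size).
    Returns the verified remote size in bytes.
    """
    local_size = os.path.getsize(local_path)
    try:
        fs.put_file(local_path, remote_path)
        remote_size = fs.size(remote_path)
    except Exception as exc:
        # Report the failure so the caller (e.g. the training loop) can abort.
        raise CheckpointBackupError(
            f"upload of {local_path} to {remote_path} failed: {exc}"
        ) from exc

    if remote_size != local_size:
        # Keep the local file: the remote copy looks incomplete.
        raise CheckpointBackupError(
            f"size mismatch for {remote_path}: "
            f"local {local_size} bytes, remote {remote_size} bytes"
        )

    if delete_local:
        # Reached only after the remote copy has been checked.
        os.remove(local_path)
    return remote_size
```

With fsspec installed, `fs` could be something like `fsspec.filesystem("s3")` (which needs the s3fs package), and the training loop would let `CheckpointBackupError` propagate so the run stops instead of silently continuing without a backup.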
Additional context
An example PR for OpenCLIP containing what we might want to implement is here: https://github.com/mlfoundations/open_clip/pull/319/files
I may work on this soon, TBD based on how much else I need to do.