Allow automatic saving / backing up checkpoints to object storage like S3 #781

Closed
@haileyschoelkopf

Is your feature request related to a problem? Please describe.
Regularly offloading checkpoints from local storage to object storage (e.g. from FSx to S3) is painful, time-consuming, and error-prone. It would be great if we could automate this.

Describe the solution you'd like
fsspec might be a good way of doing this.

Describe alternatives you've considered
Open to alternatives for implementing this! Most importantly, we want a solution that is robust to interruption, i.e. one that reports the failure and aborts the run if a checkpoint fails to save, and that does not delete checkpoints until they are confirmed to be backed up.
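
To make the robustness requirement concrete, here is a rough sketch of what the upload-verify-then-delete flow could look like. Everything named here (`backup_checkpoint`, `backup_to_url`, `CheckpointBackupError`) is hypothetical and not part of the codebase. It assumes an fsspec-style filesystem object exposing `put_file` and `size`, and it uses a byte-size comparison as a cheap stand-in for a real integrity check such as a checksum.

```python
import os
import shutil


class CheckpointBackupError(RuntimeError):
    """Raised when a checkpoint file could not be uploaded or verified."""


def backup_checkpoint(fs, local_dir, remote_dir, delete_local=False):
    """Upload every file under local_dir to remote_dir, verifying each one.

    fs is assumed to be an fsspec-style filesystem (put_file and size).
    The local copy is removed only after all files have been verified.
    """
    remote_dir = remote_dir.rstrip("/")
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            lpath = os.path.join(root, name)
            # Mirror the local directory layout under the remote prefix.
            rel = os.path.relpath(lpath, local_dir).replace(os.sep, "/")
            rpath = f"{remote_dir}/{rel}"
            try:
                fs.put_file(lpath, rpath)
                remote_size = fs.size(rpath)
            except Exception as exc:
                # Surface the failure so the caller can abort the run.
                raise CheckpointBackupError(
                    f"upload failed: {lpath} -> {rpath}"
                ) from exc
            if remote_size != os.path.getsize(lpath):
                raise CheckpointBackupError(f"size mismatch after upload: {rpath}")
    # Reached only if every file was uploaded and verified.
    if delete_local:
        shutil.rmtree(local_dir)


def backup_to_url(local_dir, remote_url, delete_local=False):
    """Convenience wrapper: resolve a URL such as s3://bucket/prefix via fsspec."""
    from fsspec.core import url_to_fs  # third-party; S3 also needs s3fs installed

    fs, remote_dir = url_to_fs(remote_url)
    backup_checkpoint(fs, local_dir, remote_dir, delete_local=delete_local)
```

The training loop would call something like `backup_to_url(ckpt_dir, "s3://my-bucket/run-name/global_step1000")` (bucket and path are placeholders) right after saving, and let `CheckpointBackupError` propagate to stop the run. Because deletion happens only after the loop finishes without raising, an interrupted or failed upload leaves the local checkpoint in place.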

Additional context
An example PR for OpenCLIP containing what we might want to implement is here: https://github.com/mlfoundations/open_clip/pull/319/files

I may work on this soon, TBD based on how much else I need to do.
