
[Tune] Revisiting checkpointing policy #4287

@jeremyasapp

Description

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
  • Ray installed from (source or binary): pip installed
  • Ray version: 0.6.1
  • Python version: 3.6.5
  • Exact command to reproduce: N/A

Describe the problem

I was wondering if it would be possible to get more information about the design decision to save every checkpoint during training. When running large grid searches where each trial consists of many steps, disk usage can blow up very quickly (I went beyond 100 GB very fast), which may lead to out-of-disk-space errors, for example when using AWS instances.

Instead, why not overwrite the checkpoint every time? If the goal is persistence, then overwriting the previous checkpoint would be reasonable. If the checkpoints are there to keep track of the best model (which I don't believe they are), then another, more efficient strategy would be to always keep two checkpoints: the best checkpoint and the last checkpoint (assuming "best" is defined).

This is actually what I do myself. I keep track of the best model directly in the Trainable, and always checkpoint it alongside the last model. So in theory, I could consistently overwrite the checkpoint and make disk space usage a function of the number of trials only, as opposed to also making it a function of the number of steps.
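For concreteness, here is a minimal sketch of the strategy I'm describing. This is not Tune's API — `CheckpointManager` and its methods are hypothetical names, and I'm using `pickle` as a stand-in for whatever serialization the Trainable actually does:

```python
import os
import pickle


class CheckpointManager:
    """Hypothetical sketch: keep only two checkpoints per trial on disk,
    the most recent one ("last") and the best one seen so far ("best"),
    overwriting each in place so disk usage is constant in the number
    of steps."""

    def __init__(self, ckpt_dir, mode="max"):
        self.ckpt_dir = ckpt_dir
        self.mode = mode  # "max": higher metric is better; "min": lower
        self.best_metric = None

    def save(self, state, metric):
        # Always overwrite the "last" checkpoint in place.
        with open(os.path.join(self.ckpt_dir, "last.ckpt"), "wb") as f:
            pickle.dump(state, f)

        # Overwrite the "best" checkpoint only when the metric improves.
        improved = (
            self.best_metric is None
            or (self.mode == "max" and metric > self.best_metric)
            or (self.mode == "min" and metric < self.best_metric)
        )
        if improved:
            self.best_metric = metric
            with open(os.path.join(self.ckpt_dir, "best.ckpt"), "wb") as f:
                pickle.dump(state, f)
```

With this, a trial only ever holds two checkpoint files no matter how many steps it runs, so total disk usage scales with the number of trials rather than trials × steps.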

Please let me know if I'm missing something! And thank you in advance for the info.
