
Add trainer flag max_time_per_run #10226

Open
ericharper opened this issue Oct 28, 2021 · 1 comment
Labels
design (Includes a design discussion), feature (Is an improvement or enhancement), trainer: argument
Milestone
future
Comments

@ericharper
Contributor

ericharper commented Oct 28, 2021

🚀 Feature

Add a max_time_per_run flag to the Trainer. There is currently a max_time flag: https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#max-time. However, it limits the global training time, which is not helpful in this case.
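For reference, a minimal sketch of how the existing max_time flag is configured today (the values are illustrative); it caps the total training time rather than the wall time of the current job:

```python
from datetime import timedelta

from pytorch_lightning import Trainer

# Existing behavior: max_time bounds the *total* training time.
# It also accepts the string form "DD:HH:MM:SS", e.g. "00:04:00:00".
trainer = Trainer(max_time=timedelta(hours=4))
```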

Motivation

When training on large GPU clusters with job time limits, it's important to be able to stop training gracefully after a specified amount of wall time. For example, assume the cluster enforces a 4-hour time limit per job. If we are training a large model, the job may be killed while a checkpoint is being written to disk, leaving a corrupted checkpoint.

Pitch

If we can configure max_time_per_run, we can help ensure that our job terminates gracefully, preventing problems such as corrupted checkpoints during training.
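To make the pitch concrete, here is a rough sketch of how a per-run limit can be approximated with a user-defined Callback today. This is not the proposed implementation; MaxTimePerRunCallback is a hypothetical name, and the exact hook signature may differ slightly between Lightning versions.

```python
import time

import pytorch_lightning as pl


class MaxTimePerRunCallback(pl.Callback):
    """Request a graceful stop once the current run has used its wall-time budget."""

    def __init__(self, max_seconds: float):
        self.max_seconds = max_seconds
        self._start = None

    def on_train_start(self, trainer, pl_module):
        # The clock starts when *this* job begins training, so time spent in
        # previous jobs (restored from a checkpoint) does not count against it.
        self._start = time.monotonic()

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if time.monotonic() - self._start > self.max_seconds:
            # Ask the Trainer to stop at the next safe point, so the final
            # checkpoint is written cleanly before the cluster kills the job.
            trainer.should_stop = True


# Usage: stop ~3.5 hours into a 4-hour allocation, leaving headroom for
# checkpointing and teardown.
trainer = pl.Trainer(callbacks=[MaxTimePerRunCallback(max_seconds=3.5 * 60 * 60)])
```

A built-in max_time_per_run flag would provide this behavior out of the box, without every user having to re-implement such a callback.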

Alternatives

We've implemented our own solution in this PR: NVIDIA/NeMo#3056

But this seems like a useful feature that anyone using PTL on a cluster with time limits would benefit from.

Additional context



cc @Borda @tchaton @justusschock @awaelchli @kaushikb11 @rohitgr7

@ericharper ericharper added the feature label Oct 28, 2021
@stale

stale bot commented Dec 1, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!

@stale stale bot added the won't fix label Dec 1, 2021
@Borda Borda added this to the 1.7 milestone Dec 1, 2021
@stale stale bot removed the won't fix label Dec 1, 2021
@carmocca carmocca added the trainer: argument and design labels Jul 19, 2022
@carmocca carmocca modified the milestones: pl:1.7, future Jul 19, 2022