🚀 Feature
Add a max_time_per_run flag to the Trainer. There is currently a max_time flag (https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#max-time), but that limits global training time, which is not helpful in this case.
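To illustrate the request, here is a hypothetical usage sketch; max_time_per_run does not exist today, and the duration string below simply mirrors the DD:HH:MM:SS format accepted by the existing max_time argument:

```python
from pytorch_lightning import Trainer

# Existing behaviour: max_time caps the *total* training time.
trainer = Trainer(max_time="00:03:30:00")  # DD:HH:MM:SS

# Proposed (hypothetical): cap the wall-clock time of a single run/job,
# so training stops cleanly before the cluster's time limit kills it.
trainer = Trainer(max_time_per_run="00:03:30:00")
```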
Motivation
When training on large GPU clusters with time limits, it's important to be able to stop training after a specified time. For example, assume the cluster has a 4-hour time limit for jobs. If we are training a large model, it's possible that the job will be killed while writing a checkpoint to disk, resulting in a corrupted checkpoint.
Pitch
If we can configure max_time_per_run, we can help ensure that our job terminates more gracefully, preventing things like corrupted checkpoints during training.
Alternatives
We've implemented our own solution in this PR: NVIDIA/NeMo#3056
But this seems like a useful feature that anyone using PTL on a cluster with time limits would benefit from.
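For readers who want something similar today, here is a minimal, hypothetical callback sketch (not the NeMo implementation; hook signatures may vary slightly across PTL versions):

```python
import time

from pytorch_lightning.callbacks import Callback


class MaxTimePerRunCallback(Callback):
    """Request a graceful stop once this run's wall-clock budget is used up.

    The budget is measured from the start of Trainer.fit, so it resets on
    every restart from a checkpoint (unlike max_time, which is global).
    """

    def __init__(self, max_seconds: float):
        self.max_seconds = max_seconds
        self.start_time = None

    def on_train_start(self, trainer, pl_module):
        self.start_time = time.monotonic()

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        # *args/**kwargs absorb hook-signature differences across PTL versions.
        if time.monotonic() - self.start_time > self.max_seconds:
            # Ask the Trainer to stop at the next safe point instead of
            # letting the scheduler kill the job mid-checkpoint.
            trainer.should_stop = True


# e.g. stop after 3.5 hours on a cluster with a 4-hour job limit:
# trainer = Trainer(callbacks=[MaxTimePerRunCallback(max_seconds=3.5 * 3600)])
```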
Additional context
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning
Bolts: Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch
Lightning Transformers: Flexible interface for high performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.
cc @Borda @tchaton @justusschock @awaelchli @kaushikb11 @rohitgr7