🚀 Feature
Add a max_time_per_run flag to the Trainer. There is currently a max_time flag (https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#max-time), but that limits global training time, which is not helpful in this case.
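To illustrate the request, here is a hypothetical usage sketch; max_time_per_run does not exist today, and the duration string below simply mirrors the DD:HH:MM:SS format accepted by the existing max_time argument:

```python
from pytorch_lightning import Trainer

# Existing behaviour: max_time caps the *total* training time.
trainer = Trainer(max_time="00:03:30:00")  # DD:HH:MM:SS

# Proposed (hypothetical): cap the wall-clock time of a single run/job,
# so training stops cleanly before the cluster's time limit kills it.
trainer = Trainer(max_time_per_run="00:03:30:00")
```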
Motivation
When training on large GPU clusters with time limits, it's important to be able to stop training after a specified time. For example, assume the cluster has a 4-hour time limit for jobs. If we are training a large model, it's possible that the job will be killed while writing a checkpoint to disk, resulting in a corrupted checkpoint.
Pitch
If we can configure max_time_per_run, we can help ensure that our job terminates more gracefully, preventing things like corrupted checkpoints during training.
Alternatives
We've implemented our own solution in this PR: NVIDIA/NeMo#3056
But this seems like a useful feature that anyone using PTL on a cluster with time limits would benefit from.
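For readers who want something similar today, here is a minimal, hypothetical callback sketch (not the NeMo implementation; hook signatures may vary slightly across PTL versions):

```python
import time

from pytorch_lightning.callbacks import Callback


class MaxTimePerRunCallback(Callback):
    """Request a graceful stop once this run's wall-clock budget is used up.

    The budget is measured from the start of Trainer.fit, so it resets on
    every restart from a checkpoint (unlike max_time, which is global).
    """

    def __init__(self, max_seconds: float):
        self.max_seconds = max_seconds
        self.start_time = None

    def on_train_start(self, trainer, pl_module):
        self.start_time = time.monotonic()

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        # *args/**kwargs absorb hook-signature differences across PTL versions.
        if time.monotonic() - self.start_time > self.max_seconds:
            # Ask the Trainer to stop at the next safe point instead of
            # letting the scheduler kill the job mid-checkpoint.
            trainer.should_stop = True


# e.g. stop after 3.5 hours on a cluster with a 4-hour job limit:
# trainer = Trainer(callbacks=[MaxTimePerRunCallback(max_seconds=3.5 * 3600)])
```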
Additional context
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning
Bolts: Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch
Lightning Transformers: Flexible interface for high performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.
cc @Borda @tchaton @justusschock @awaelchli @kaushikb11 @rohitgr7