
Conversation

lukebaumann

@lukebaumann lukebaumann commented Aug 22, 2025

  • Added the changes to the jobset for elastic training to enable elasticity.
  • Added changes to launch_trainer so that the pause_resume decorator is used.
  • Set logging.raiseExceptions=True so that DATA_LOSS errors that occur in debug/info/other log calls raise exceptions immediately.

As written, this will use Pause-Resume elasticity if Pathways is enabled.
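A rough sketch of what the launch_trainer side of this might look like; the decorator arguments and the wrapped function below are assumptions for illustration, not the exact code in this PR:

import logging

from pathwaysutils.elastic import manager

# Per the description above: let DATA_LOSS errors raised inside
# debug/info/other log calls surface immediately instead of being
# suppressed by the logging machinery.
logging.raiseExceptions = True

elastic_manager = manager.Manager()


# Assumed decorator usage: pause the workload when a slice goes down, wait
# for the slices to come back, and resume training instead of crashing.
# Whether pause_resume takes these arguments directly is an assumption.
@elastic_manager.pause_resume(max_retries=5, timeout=10 * 60)
def train_loop(config):
  ...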

This should be merged after #1335

@lukebaumann lukebaumann requested review from a team as code owners August 22, 2025 23:40
# For elasticity, we want the slices to be able to restart many times.
# There is no way to set this to be unlimited so we set the backoffLimit
# very high.
backoffLimit *= 1000
Contributor

Should this only be done when elasticity is enabled?

Author

It is not strictly necessary to restrict it, though I do think restricting it makes sense.

_PATHWAYS_BACK_OFF_LIMIT = 32, so once either a slice has been restarted 32 times or the workload has been restarted 32 times, GKE will fail the job due to the backoff limit.

Why it is not necessary:
For non-elastic workloads: a worker fails and is restarted, the RM kills all of the slices, the workload eventually tries to access data, hits a DATA_LOSS exception that is not caught, and exits, and the JobSet restarts it.

What happens today:
Exactly the same as above, except GKE will only fail due to the backoff limit after the workload has been restarted 32 times (a slice may restart more than 32 times).
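For concreteness, the arithmetic being discussed (the constant is from this comment, the multiplier from the quoted diff; the variable name for the result is illustrative):

_PATHWAYS_BACK_OFF_LIMIT = 32                       # existing backoff limit
worker_backoff_limit = _PATHWAYS_BACK_OFF_LIMIT * 1000  # 32000 restarts for the slices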

Contributor

The only concern is that if there is a user code error, it will take a very long time for the job to eventually report failure, which may cause user confusion.

Author

This backoff limit increase applies to the Pathways worker containers only. If there is a user code error, the JAX container will still fail with the existing backoff limit. This allows worker containers to fail more times than the JAX container. If a JAX container is connected to a worker container that fails, it will also fail unless elasticity is turned on.
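A hypothetical sketch of that distinction; the helper and job names are made up for illustration and are not the actual jobset builder:

# Hypothetical helper: only the Pathways worker job gets the inflated backoff
# limit so slices can restart many times; the JAX (user code) job keeps the
# original limit so genuine user errors still fail quickly.
def backoff_limit_for(job: str, base_limit: int) -> int:
  if job == "pathways-worker":
    return base_limit * 1000
  return base_limit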


from pathwaysutils.elastic import manager
elastic_manager = manager.Manager()
max_retries = 5
timeout = 10 * 60 # ten minutes
Contributor

What does this timeout impact? How does it affect the multi-host inference case?

Author

This timeout argument is passed to wait_for_slices and is how long the JAX workload will wait for all slices to be ready again, after one (or more) of them fails, before raising a TimeoutError. See wait_for_slices for more details.

The multi-host inference case should not enable elasticity or use elastic_manager.pause_resume. Instead, it should rely on LeaderWorkerSet for its resiliency mechanisms.
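A minimal sketch of the retry behavior described above, assuming wait_for_slices is exposed on the manager and that pause_resume drives a loop roughly like this; the real control flow lives in pathwaysutils and may differ:

for attempt in range(max_retries):
  try:
    run_training()  # placeholder for the wrapped training workload
    break
  except Exception:  # e.g. a DATA_LOSS error after a slice goes down
    # Block until every slice is ready again, or raise TimeoutError if they
    # are not all back within `timeout` seconds (ten minutes here).
    elastic_manager.wait_for_slices(timeout=timeout)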
