[Train] Add TPU multi-slice support to JaxTrainer #58629
Conversation
Code Review
This pull request introduces multi-slice TPU support for JaxTrainer by refactoring the accelerator configuration and leveraging ray.util.tpu.SlicePlacementGroup. The changes include a new AcceleratorConfig API, which provides a cleaner way to specify GPU and TPU resources, and deprecates older fields like use_gpu and use_tpu. The logic for reserving TPU slices and determining worker configurations is now encapsulated within SlicePlacementGroup, which correctly handles multi-slice reservations and auto-detects num_workers and resources_per_worker. The test suite has been significantly improved to cover various single-host, multi-host, and multi-slice scenarios. Overall, this is a well-structured and comprehensive update that greatly enhances TPU support in Ray Train. I have a couple of suggestions to address a potential runtime error and a syntax issue.
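For context on the auto-detection behavior the review describes, here is a minimal sketch of the arithmetic involved. The concrete chip and host counts below are illustrative assumptions, not values taken from this PR; the real values come from the accelerator type and topology that the TPU config validates.

```python
import math

# Illustrative assumptions: a slice topology with 16 chips, 4 chips per
# host, reserved twice (num_slices=2).
chips_per_slice = 16
chips_per_host = 4
num_slices = 2

# One Ray Train worker per TPU host within each slice.
workers_per_slice = math.ceil(chips_per_slice / chips_per_host)  # 4
num_workers = num_slices * workers_per_slice                     # 8 workers across both slices
resources_per_worker = {"TPU": chips_per_host}                   # 4 TPU chips per worker

print(num_workers, resources_per_worker)  # 8 {'TPU': 4}
```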
liulehui
left a comment
thank you!!
Force-pushed from 9ba6928 to 6ad813f.
liulehui
left a comment
🫶
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Force-pushed from 1e28858 to 5e03ad6.
cc: @matthewdeng @liulehui I've resolved the outstanding comments and this PR is ready for another review. I'm fixing the comments on #59136 now so that it can be merged first, and then this PR will just contain the Ray Train changes.
matthewdeng
left a comment
Thanks! A few minor remaining comments.
dayshah
left a comment
Have a nit, but it's non-blocking; can address in a follow-up if it makes sense @ryanaoleary
        return max(1, math.ceil(num_workers / workers_per_slice))
    except Exception:
        # Fallback to 1 if calculation fails.
        return 1
Why 1 on a failed calculation or invalid inputs? I feel like it makes more sense to raise, but maybe I'm missing something here.
@dayshah I had it default to 1 because in the Ray Train code we call validate_tpu_config and get_tpu_worker_resources on the TPU inputs, which should prevent this from ever raising, since we validate that num_workers and the topology / accelerator type are compatible.
My thought was that if for some reason this call did fail in the controller (which is the only place it's called), it would be better to return a default of 1 slice rather than crash the controller. Since I don't expect this case to happen anyway due to the validation, I can update this util to raise like you suggest; that does make more sense.
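A minimal sketch of what the raise-based version of this helper could look like, per the discussion above. The function name and argument names here are taken from the quoted snippet or are hypothetical; the exact signature in the PR may differ.

```python
import math


def _num_slices_for_workers(num_workers: int, workers_per_slice: int) -> int:
    """Hypothetical reworked helper: fail loudly on invalid inputs instead of
    silently falling back to a single slice."""
    if num_workers <= 0 or workers_per_slice <= 0:
        raise ValueError(
            "Expected positive num_workers and workers_per_slice, got "
            f"num_workers={num_workers}, workers_per_slice={workers_per_slice}."
        )
    return math.ceil(num_workers / workers_per_slice)
```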
Description
This PR adds support in the `JaxTrainer` to schedule across multiple TPU slices using the `ray.util.tpu` public utilities.

To support this, this PR adds new `AcceleratorConfig`s to the V2 scaling config, which consolidate the accelerator-related fields for TPU and GPU. When `TPUAcceleratorConfig` is specified, the `JaxTrainer` utilizes a `SlicePlacementGroup` to atomically reserve `num_slices` TPU slices of the desired topology, auto-detecting the required values for `num_workers` and `resources_per_worker` when unspecified.

TODO: I'll add some manual testing and usage examples in the comments.
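Pending the manual-testing examples mentioned in the TODO, here is a rough sketch of what usage might look like. The import paths, the `accelerator_config` field name, and the `TPUAcceleratorConfig` constructor arguments (`topology`, `num_slices`) are assumptions based on the names used in this description, not a confirmed API.

```python
# Rough usage sketch; import paths and constructor arguments are assumptions
# based on the names in this PR description, not a confirmed API.
from ray.train import ScalingConfig                        # assumed import path
from ray.train.v2.api.config import TPUAcceleratorConfig   # assumed import path
from ray.train.v2.jax import JaxTrainer                    # assumed import path


def train_loop_per_worker(config):
    import jax
    # Each worker initializes JAX and sees the TPU devices on its host.
    print(jax.devices())


trainer = JaxTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(
        # Replaces the deprecated `use_gpu` / `use_tpu` fields.
        accelerator_config=TPUAcceleratorConfig(
            topology="4x4",   # assumed field: desired TPU slice topology
            num_slices=2,     # reserve two slices atomically
        ),
        # `num_workers` and `resources_per_worker` are auto-detected from the
        # topology and number of slices when left unspecified.
    ),
)
result = trainer.fit()
```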
Related issues
#55162
Additional information