
Conversation

@ryanaoleary
Contributor

Description

This PR updates the TPU util library to make it compatible with the multi-slice support added to Ray Train in #58629. For more context on why these changes are necessary, see the linked PR for how the updated utils are used.

The main changes are:

  • Added a helper get_tpu_version_from_type to get the TPU version (e.g. "v4", "v6e") from the accelerator type
  • Changed num_workers to num_hosts, since the latter is what the property previously returned and the number of workers is not guaranteed to always match the number of hosts (e.g. you may choose to run one worker on each TPU chip in a host). Additionally, added a num_bundles field that exposes the number of bundles created in the placement group.
  • Added an optional field for users to specify the desired resources per bundle in the slice placement group. When unspecified, the SlicePlacementGroup defaults to placing one bundle on each TPU host. If specified, the SlicePlacementGroup attempts to fit the desired bundle shape on the given topology, calculating and returning the number of bundles/workers required to do so. As part of this change, added a helper function get_tpu_worker_resources that determines the number of workers and the unit resources to run on a SlicePlacementGroup (see the usage sketch after this list).
  • Added properties such as head_placement_groups so that the caller can better manage all the placement groups created with slice_placement_group.
  • Added cleanup functions to avoid dangling placement groups
  • Added more unit test coverage and better error handling logic
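
To make the bullets above concrete, here is a rough usage sketch of the updated utilities. It is hypothetical: the import path, constructor argument names, and the cleanup method are assumptions for illustration; only the names get_tpu_version_from_type, SlicePlacementGroup, num_hosts, num_bundles, and head_placement_groups come from this PR's description, and the real signatures may differ.

```python
# Hypothetical sketch only -- module path, argument names, and the cleanup
# method are assumptions; see the diff for the actual API.
from ray.util.tpu import SlicePlacementGroup, get_tpu_version_from_type

version = get_tpu_version_from_type("TPU-V6E-16")  # -> "v6e" (input format assumed)

# Default behavior: one bundle per TPU host in the slice.
slice_pg = SlicePlacementGroup(accelerator_type="TPU-V6E-16", topology="4x4")
print(slice_pg.num_hosts, slice_pg.num_bundles)
print(slice_pg.head_placement_groups)  # inspect every placement group that was created

# Optional: request a specific bundle shape; the utility computes how many
# bundles/workers of that shape fit on the topology.
per_chip_pg = SlicePlacementGroup(
    accelerator_type="TPU-V6E-16",
    topology="4x4",
    resources_per_bundle={"TPU": 1},  # e.g. one worker per chip
)

# Clean up when finished to avoid dangling placement groups.
per_chip_pg.remove()  # hypothetical name for the cleanup helper
```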

Related issues

#55162

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary requested a review from a team as a code owner December 3, 2025 14:07
@ryanaoleary
Contributor Author

cc: @edoakes this separates the TPU util changes from #58629

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances multi-slice TPU support in Ray by introducing more flexible resource management, robust error handling with cleanup mechanisms, and better compatibility with multi-slice configurations. The changes are well-structured and include valuable additions like configurable resources per bundle and improved test coverage. I've identified a correctness issue related to case-insensitive parsing of accelerator types and suggested a refactoring to fix it. Additionally, I've pointed out an opportunity to improve maintainability by extracting duplicated logic into a helper function. Overall, this is a solid contribution to Ray's TPU capabilities.

ryanaoleary and others added 3 commits December 3, 2025 06:17
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ray-gardener ray-gardener bot added train Ray Train Related Issue core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Dec 3, 2025
}.union(_get_larger_3d_topologies(16, 16, 24)),
"v5litepod": {"2x8", "4x4", "4x8", "8x8", "8x16", "16x16"},
"v6e": {"2x8", "4x4", "4x8", "8x8", "8x16", "16x16"},
"v5litepod": {"1x1", "2x2", "2x4", "2x8", "4x4", "4x8", "8x8", "8x16", "16x16"},
Contributor

@Sparks0219 Sparks0219 Dec 9, 2025


just for my understanding, what determines which topologies are supported? Is it the TPU version? If so, why were new topologies added for an existing TPU version?

Contributor Author


Yeah, there are different supported topologies for each TPU version, determined by the actual architecture in GKE. I added some new ones here because a few of the supported topologies were omitted when this was initially added: https://docs.cloud.google.com/tpu/docs/v5e#tpu-v5e-config. I'm just trying to get it to match the documentation now; the list of topologies is static and I don't expect it to change for existing TPU versions.

Ex: "2x2x2" -> 8
"""
total_chips = 1
for dim in topology.strip().lower().split("x"):
Contributor


nit: not related to your change, but it would probably be better to just .strip().lower() once in get_current_node_tpu_topology rather than doing it whenever we need to read the topology

Contributor Author


Fixed this nit in c43bd17 since it sounds reasonable to me too.
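
For context, the change being discussed looks roughly like the following sketch. The environment-variable name and function body are assumptions (the real implementation may read the topology from elsewhere); the point is just that normalization happens once at the source.

```python
import os

def get_current_node_tpu_topology() -> str:
    # Assumed source of the value; the actual Ray code may differ.
    raw = os.environ.get("TPU_TOPOLOGY", "")
    # Normalize once here so callers can split and multiply out dimensions
    # without repeating .strip().lower() at every read site.
    return raw.strip().lower()
```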

return f"{generation}-{num_chips}"
num_cores = num_chips * get_tpu_cores_per_chip(generation)

return f"{generation}-{num_cores}"
Contributor


just wondering, does it matter whether we take into account chips or cores, since we also use the generation?

Contributor Author


I don't think it really matters in the sense that both would result in a unique name for the TPU configuration, but I made the change because the standard TPU Pod naming convention (e.g. v4-32) represents the total core count, while the topology (e.g. 2x2x4) represents the chip arrangement. We were previously using chips here, which goes against the standard convention and what's used in all the TPU documentation on GKE.
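
A quick worked example of that convention (v4 and v5p chips each have 2 TensorCores, so the pod name doubles the chip count):

```python
chips = 2 * 2 * 4          # topology "2x2x4" -> 16 chips
cores = chips * 2          # v4: 2 TensorCores per chip
pod_name = f"v4-{cores}"   # -> "v4-32", the name used in the GKE/Cloud TPU docs
```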

If accelerator_version is v5e or v6e AND topology product <= 8, the chips per host will just be the product, i.e. 1, 4, or 8
If accelerator_version is v5e or v6e AND topology product > 8, the chips per host will be 4
If accelerator_version is v5e or v6e:
If topology total chips < 8, return total chips (partial host).
Contributor


Is partial host only applicable to these 2 accelerator versions? What does partial host mean?

Contributor Author


Yeah, those are the only 2 TPU generations that support partial (or "sub") hosts. It just means a topology configuration that doesn't use all the chips on a TPU VM: https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm#single-host-versus-multi-host.
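
In other words, following the docstring quoted above (illustrative numbers only):

```python
total_chips = 2 * 2                 # v5e "2x2" topology -> 4 chips
is_partial_host = total_chips < 8   # fewer than the 8 chips on a full v5e host
chips_per_host = total_chips        # partial host: use the topology's chip count (4)
```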

("v4-2048", "8x8x16", True),
("v4-4", "16x16x16", False),
("v5p-64", "4x4x4", True),
("v5p-128", "4x4x4", True),
Contributor


what was the motivation behind adjusting the accelerator_type/expected_result?

Contributor Author


Just to get it to match the GKE documentation so there's no confusion: https://docs.cloud.google.com/tpu/docs/v4#large-topologies.
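
For reference, the updated expectations line up with the core-count naming discussed earlier (assuming 2 TensorCores per chip for v4/v5p, per the GKE docs):

```python
assert 4 * 4 * 4 * 2 == 128      # "4x4x4" on v5p -> 64 chips -> "v5p-128"
assert 8 * 8 * 16 * 2 == 2048    # "8x8x16" on v4 -> 1024 chips -> "v4-2048"
```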

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary requested a review from a team as a code owner December 19, 2025 05:45
@ryanaoleary
Contributor Author

had to make some small changes to the Ray Train code that calls these utilities so that its tests pass

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Dec 22, 2025
@edoakes edoakes enabled auto-merge (squash) December 22, 2025 22:30
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary requested a review from a team as a code owner December 24, 2025 00:29
@github-actions github-actions bot disabled auto-merge December 24, 2025 00:29
Member

@bveeramani bveeramani left a comment


Stamp for docs

@dayshah dayshah enabled auto-merge (squash) December 29, 2025 21:25
@dayshah dayshah merged commit 9576361 into ray-project:master Dec 29, 2025
7 checks passed
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>