
Conversation

@ryanaoleary
Contributor

Description

This PR updates the TPU util library to make it compatible with the multi-slice support added to Ray Train in #58629. For more context on why these changes are necessary, see the linked PR for how the updated utils are used.

The main changes are:

  • Added a helper get_tpu_version_from_type to get the TPU version (e.g. "v4", "v6e") from the accelerator type
  • Changed num_workers to num_hosts, since the latter is what the property previously returned and the number of workers is not guaranteed to always match the number of hosts (e.g. you may choose to run one worker on each TPU chip in a host). Additionally, added a num_bundles field that exposes the number of bundles created in the placement group.
  • Added an optional field for users to specify the desired resources per bundle in the slice placement group. When unspecified, the SlicePlacementGroup defaults to placing one bundle on each TPU host. If specified, the SlicePlacementGroup attempts to fit the desired bundle shape on the given topology, calculating and returning the number of bundles/workers required to do so. As part of this change, added a helper function get_tpu_worker_resources that determines the number of workers and the unit resources to run on a SlicePlacementGroup (see the usage sketch after this list).
  • Added properties such as head_placement_groups so that the caller can better manage all the placement groups created with slice_placement_group.
  • Added cleanup functions to avoid dangling placement groups
  • Added more unit test coverage and better error handling logic
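
To make the bullets above concrete, here is a rough usage sketch of the updated utilities. It is hypothetical: the import path, constructor argument names, and the cleanup method are assumptions for illustration; only the names get_tpu_version_from_type, SlicePlacementGroup, num_hosts, num_bundles, and head_placement_groups come from this PR's description, and the real signatures may differ.

```python
# Hypothetical sketch only -- module path, argument names, and the cleanup
# method are assumptions; see the diff for the actual API.
from ray.util.tpu import SlicePlacementGroup, get_tpu_version_from_type

version = get_tpu_version_from_type("TPU-V6E-16")  # -> "v6e" (input format assumed)

# Default behavior: one bundle per TPU host in the slice.
slice_pg = SlicePlacementGroup(accelerator_type="TPU-V6E-16", topology="4x4")
print(slice_pg.num_hosts, slice_pg.num_bundles)
print(slice_pg.head_placement_groups)  # inspect every placement group that was created

# Optional: request a specific bundle shape; the utility computes how many
# bundles/workers of that shape fit on the topology.
per_chip_pg = SlicePlacementGroup(
    accelerator_type="TPU-V6E-16",
    topology="4x4",
    resources_per_bundle={"TPU": 1},  # e.g. one worker per chip
)

# Clean up when finished to avoid dangling placement groups.
per_chip_pg.remove()  # hypothetical name for the cleanup helper
```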

Related issues

#55162

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary requested a review from a team as a code owner December 3, 2025 14:07
@ryanaoleary
Contributor Author

cc: @edoakes this separates the TPU util changes from #58629

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances multi-slice TPU support in Ray by introducing more flexible resource management, robust error handling with cleanup mechanisms, and better compatibility with multi-slice configurations. The changes are well-structured and include valuable additions like configurable resources per bundle and improved test coverage. I've identified a correctness issue related to case-insensitive parsing of accelerator types and suggested a refactoring to fix it. Additionally, I've pointed out an opportunity to improve maintainability by extracting duplicated logic into a helper function. Overall, this is a solid contribution to Ray's TPU capabilities.

ryanaoleary and others added 3 commits December 3, 2025 06:17
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ray-gardener ray-gardener bot added train Ray Train Related Issue core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Dec 3, 2025
}.union(_get_larger_3d_topologies(16, 16, 24)),
"v5litepod": {"2x8", "4x4", "4x8", "8x8", "8x16", "16x16"},
"v6e": {"2x8", "4x4", "4x8", "8x8", "8x16", "16x16"},
"v5litepod": {"1x1", "2x2", "2x4", "2x8", "4x4", "4x8", "8x8", "8x16", "16x16"},
Contributor

@Sparks0219 Sparks0219 Dec 9, 2025


just for my understanding, what determines which topologies are supported? Is it the TPU version? If so, why were new topologies added for an existing TPU version?

Contributor Author


Yeah, there are different supported topologies for each TPU version, determined by the actual architecture in GKE. I added some new ones here because a few of the supported topologies were omitted when this was initially added: https://docs.cloud.google.com/tpu/docs/v5e#tpu-v5e-config. I'm just trying to get it to match the documentation now; the list of topologies is static and I don't expect it to change for existing TPU versions.

Ex: "2x2x2" -> 8
"""
total_chips = 1
for dim in topology.strip().lower().split("x"):
Contributor


nit: not related to your change, but it would probably be better to just .strip().lower() once in get_current_node_tpu_topology rather than doing it whenever we need to read the topology

Contributor Author


Fixed this nit in c43bd17 since it sounds reasonable to me too.
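
For context, the change being discussed looks roughly like the following sketch. The environment-variable name and function body are assumptions (the real implementation may read the topology from elsewhere); the point is just that normalization happens once at the source.

```python
import os

def get_current_node_tpu_topology() -> str:
    # Assumed source of the value; the actual Ray code may differ.
    raw = os.environ.get("TPU_TOPOLOGY", "")
    # Normalize once here so callers can split and multiply out dimensions
    # without repeating .strip().lower() at every read site.
    return raw.strip().lower()
```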

return f"{generation}-{num_chips}"
num_cores = num_chips * get_tpu_cores_per_chip(generation)

return f"{generation}-{num_cores}"
Contributor


just wondering, does it matter whether we take into account chips or cores, since we also use the generation?

Contributor Author


I don't think it really matters in the sense that both would result in a unique name for the TPU configuration, but I made the change because the standard TPU Pod naming convention (e.g. v4-32) represents the total core count, while the topology (e.g. 2x2x4) represents the chip arrangement. We were previously using chips here, which goes against the standard convention and what's used in all the TPU documentation on GKE.
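
A quick worked example of that convention (v4 and v5p chips each have 2 TensorCores, so the pod name doubles the chip count):

```python
chips = 2 * 2 * 4          # topology "2x2x4" -> 16 chips
cores = chips * 2          # v4: 2 TensorCores per chip
pod_name = f"v4-{cores}"   # -> "v4-32", the name used in the GKE/Cloud TPU docs
```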

If accelerator_version is v5e or v6e AND topology product <= 8, the chips per host will just be the product, i.e. 1, 4, or 8
If accelerator_version is v5e or v6e AND topology product > 8, the chips per host will be 4
If accelerator_version is v5e or v6e:
If topology total chips < 8, return total chips (partial host).
Contributor


Is partial host only applicable to these 2 accelerator versions? What does partial host mean?

Contributor Author


Yeah, those are the only 2 TPU generations that support partial (or "sub") hosts. It just means a topology configuration that doesn't use all the chips on a TPU VM: https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm#single-host-versus-multi-host.
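
In other words, following the docstring quoted above (illustrative numbers only):

```python
total_chips = 2 * 2                 # v5e "2x2" topology -> 4 chips
is_partial_host = total_chips < 8   # fewer than the 8 chips on a full v5e host
chips_per_host = total_chips        # partial host: use the topology's chip count (4)
```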

("v4-2048", "8x8x16", True),
("v4-4", "16x16x16", False),
("v5p-64", "4x4x4", True),
("v5p-128", "4x4x4", True),
Contributor


what was the motivation behind adjusting the accelerator_type/expected_result?

Contributor Author


Just to get it to match the GKE documentation so there's no confusion: https://docs.cloud.google.com/tpu/docs/v4#large-topologies.
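
For reference, the updated expectations line up with the core-count naming discussed earlier (assuming 2 TensorCores per chip for v4/v5p, per the GKE docs):

```python
assert 4 * 4 * 4 * 2 == 128      # "4x4x4" on v5p -> 64 chips -> "v5p-128"
assert 8 * 8 * 16 * 2 == 2048    # "8x8x16" on v4 -> 1024 chips -> "v4-2048"
```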

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary requested a review from a team as a code owner December 19, 2025 05:45
@ryanaoleary
Contributor Author

had to make some small changes to the Ray Train code that calls these utilities so that its tests pass

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Dec 22, 2025
@edoakes edoakes enabled auto-merge (squash) December 22, 2025 22:30
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary requested a review from a team as a code owner December 24, 2025 00:29
@github-actions github-actions bot disabled auto-merge December 24, 2025 00:29
Member

@bveeramani bveeramani left a comment


Stamp for docs

@dayshah dayshah enabled auto-merge (squash) December 29, 2025 21:25
@dayshah dayshah merged commit 9576361 into ray-project:master Dec 29, 2025
7 checks passed
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>