[Core] Update TPU utils for multi-slice compatibility #59136
Conversation
Code Review
This pull request significantly enhances multi-slice TPU support in Ray by introducing more flexible resource management, robust error handling with cleanup mechanisms, and better compatibility with multi-slice configurations. The changes are well-structured and include valuable additions like configurable resources per bundle and improved test coverage. I've identified a correctness issue related to case-insensitive parsing of accelerator types and suggested a refactoring to fix it. Additionally, I've pointed out an opportunity to improve maintainability by extracting duplicated logic into a helper function. Overall, this is a solid contribution to Ray's TPU capabilities.
     }.union(_get_larger_3d_topologies(16, 16, 24)),
-    "v5litepod": {"2x8", "4x4", "4x8", "8x8", "8x16", "16x16"},
     "v6e": {"2x8", "4x4", "4x8", "8x8", "8x16", "16x16"},
+    "v5litepod": {"1x1", "2x2", "2x4", "2x8", "4x4", "4x8", "8x8", "8x16", "16x16"},
just for my understanding, what determines the topologies supported, is it the TPU version? If so, why were new topologies added for an existing TPU version?
Yeah, there are different supported topologies for each TPU version, and it's determined by the actual architecture in GKE. I added some new ones here because a few of the supported topologies were omitted when this was initially added: https://docs.cloud.google.com/tpu/docs/v5e#tpu-v5e-config. I'm just trying to get it to match the documentation now; the list of topologies is static and I don't expect it to update for existing TPU versions.
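For readers skimming the thread, a tiny hedged sketch of how a per-generation topology table like the one in the hunk above might be consulted; the names here are illustrative assumptions rather than Ray's actual API, and the sets are copied from the diff.

```python
# Illustrative only: a static per-generation lookup of supported topologies,
# mirroring the hunk above. Names are assumptions, not Ray's internal API.
VALID_TOPOLOGIES = {
    "v5litepod": {"1x1", "2x2", "2x4", "2x8", "4x4", "4x8", "8x8", "8x16", "16x16"},
    "v6e": {"2x8", "4x4", "4x8", "8x8", "8x16", "16x16"},
}

def is_supported_topology(generation: str, topology: str) -> bool:
    # Normalize the topology string once, then check set membership.
    return topology.strip().lower() in VALID_TOPOLOGIES.get(generation.lower(), set())

assert is_supported_topology("v5litepod", "2x4")
```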
| Ex: "2x2x2" -> 8 | ||
| """ | ||
| total_chips = 1 | ||
| for dim in topology.strip().lower().split("x"): |
nit: not related to your change, but it would probably be better to just call .strip().lower() once in get_current_node_tpu_topology rather than doing it whenever we need to read the topology
Fixed this nit in c43bd17 since it sounds reasonable to me too.
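As context for the nit above, a minimal sketch of the chip-count calculation under discussion; the function name is an assumption for illustration, not necessarily the PR's exact helper.

```python
def get_num_chips_from_topology(topology: str) -> int:
    """Multiply the topology dimensions to get the total chip count, e.g. "2x2x2" -> 8."""
    total_chips = 1
    # Normalizing here per read; the nit suggests normalizing once where the topology is fetched.
    for dim in topology.strip().lower().split("x"):
        total_chips *= int(dim)
    return total_chips

assert get_num_chips_from_topology("2x2x2") == 8
```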
| return f"{generation}-{num_chips}" | ||
| num_cores = num_chips * get_tpu_cores_per_chip(generation) | ||
|
|
||
| return f"{generation}-{num_cores}" |
just wondering, does it matter whether we take into account chips or cores, since we also use the generation?
I don't think it really matters in the sense that either would result in a unique name for the TPU configuration, but I made the change because the standard TPU Pod naming convention (e.g. v4-32) represents the total core count, while the topology (e.g. 2x2x4) represents the chip arrangement. We were previously using chips here, which goes against the standard convention and what's used in all the TPU documentation on GKE.
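To make the convention concrete, a hedged sketch of deriving a core-based pod name from a topology; the cores-per-chip table and helper name below are assumptions for illustration, not the PR's exact code.

```python
# Assumed cores-per-chip values for illustration (v4/v5p chips have 2 TensorCores,
# v5e/v6e chips have 1); the diff above uses get_tpu_cores_per_chip() for this.
CORES_PER_CHIP = {"v2": 2, "v3": 2, "v4": 2, "v5p": 2, "v5litepod": 1, "v6e": 1}

def pod_name_from_topology(generation: str, topology: str) -> str:
    num_chips = 1
    for dim in topology.strip().lower().split("x"):
        num_chips *= int(dim)
    num_cores = num_chips * CORES_PER_CHIP[generation]
    # e.g. ("v4", "2x2x4") -> 16 chips -> 32 cores -> "v4-32"
    return f"{generation}-{num_cores}"

assert pod_name_from_topology("v4", "2x2x4") == "v4-32"
```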
-    If accelerator_version is v5e or v6e AND topology product <= 8, the chips per host will just be the product, i.e. 1, 4, or 8
-    If accelerator_version is v5e or v6e AND topology product > 8, the chips per host will be 4
+    If accelerator_version is v5e or v6e:
+        If topology total chips < 8, return total chips (partial host).
Is partial host only applicable to these 2 accelerator versions? What does partial host mean?
Yeah those are the only 2 TPU generations that support partial or "sub" hosts. It just means that it's a topology configuration that doesn't use all the chips on a TPU VM: https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm#single-host-versus-multi-host.
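For concreteness, a rough sketch of the partial-host rule described in this thread; the generation labels, helper name, and the multi-host default of 4 chips per VM are assumptions drawn from the docstring being edited, not the PR's actual implementation.

```python
PARTIAL_HOST_GENERATIONS = {"v5litepod", "v6e"}  # assumed labels for v5e/v6e

def chips_per_host(generation: str, topology: str) -> int:
    total_chips = 1
    for dim in topology.strip().lower().split("x"):
        total_chips *= int(dim)
    if generation.lower() in PARTIAL_HOST_GENERATIONS and total_chips < 8:
        # Partial ("sub") host: the topology doesn't use all chips on the TPU VM,
        # so each host exposes only the chips in the topology itself.
        return total_chips
    # Assumed full/multi-host default of 4 chips per VM; real values vary by
    # generation and machine type.
    return 4
```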
| ("v4-2048", "8x8x16", True), | ||
| ("v4-4", "16x16x16", False), | ||
| ("v5p-64", "4x4x4", True), | ||
| ("v5p-128", "4x4x4", True), |
what was the motivation behind adjusting the accelerator_type/expected_result?
Just to get it to match the GKE documentation so there's no confusion: https://docs.cloud.google.com/tpu/docs/v4#large-topologies.
Had to make some small changes to the Ray Train code that calls these utilities so that its tests pass:
python/ray/train/v2/_internal/callbacks/tpu_reservation_callback.py
bveeramani left a comment
Stamp for docs
Description
This PR updates the TPU util library to be more compatible with the multi-slice support implemented in Ray Train in #58629. For more context on why these changes are necessary, see the linked PR for how the updated utils are used.
The main changes are:
- Added get_tpu_version_from_type to get the TPU version (i.e. "v4", "v6e") from the accelerator type (a small illustrative sketch appears at the end of this description).
- Renamed num_workers to num_hosts, since the latter is what the property previously returned and it's not guaranteed that the number of workers will always match the number of hosts (i.e. you may choose to run 1 worker on each TPU chip in a host). Additionally, added a num_bundles field which exposes the number of bundles created in the placement group.
- Made the resources per bundle configurable: SlicePlacementGroup defaults to placing one bundle on each TPU host. If a bundle shape is specified, the SlicePlacementGroup will attempt to fit the desired bundle shape on the given topology, calculating and returning the required number of bundles/workers to do so. As part of this change, added a helper function get_tpu_worker_resources which determines the number of workers and unit resources to run on a SlicePlacementGroup.
- Exposed head_placement_groups so that the caller can better manage all the placement groups created with slice_placement_group.

Related issues
#55162
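As a concrete illustration of the first bullet above, a hedged sketch of what get_tpu_version_from_type is described as doing; this is a simplified stand-in for illustration, not the PR's actual implementation.

```python
def get_tpu_version_from_type(accelerator_type: str) -> str:
    """Extract the TPU generation (e.g. "v4", "v6e") from an accelerator type like "v4-32"."""
    return accelerator_type.strip().lower().split("-")[0]

assert get_tpu_version_from_type("v6e-8") == "v6e"
assert get_tpu_version_from_type("v4-32") == "v4"
```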