Add V2 sharding support and improve partition spec handling for multichip training #2
Conversation
Force-pushed from b94f982 to f7ece6c
Hey Sungjoon, could you add more details to the PR about why this is needed and an overview of the high-level changes? Also, are you sure this is needed "for GSPMD support"? The purpose of the V2 shardings is to enable Shardy support within torch-xla. Torch-xla should already natively support GSPMD without needing any changes.
Hey Het, thanks for the feedback — and sorry for the lack of detail earlier. I'll update the commit message to include more context. These changes are required to support multi-chip training for real models from the torch-xla side. Specifically:

- We use MpLoader for parallel input loading, which internally calls xla_spec and _XLAC.XlaShardingSpec. That code currently creates an OpSharding using the V1 format, so I added V2 support alongside it.
- I also updated how we handle os.environ.get('CONVERT_SHLO_TO_SHARDY', False), since the previous logic treated values like "0" or "false" as truthy, which was causing unexpected behavior.

And you're right — GSPMD itself is already supported in torch-xla. What I meant was GSPMD-based training, and I'll clarify that in the updated message.
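For context, a minimal sketch of the kind of env-var parsing fix described above (not the actual PR code); the helper name `_env_flag` and the exact set of accepted values are assumptions:

```python
import os

def _env_flag(name: str, default: bool = False) -> bool:
    # Treat an unset variable as the default, and treat "0", "false", "no",
    # "" etc. as False instead of relying on the truthiness of the raw string.
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")

convert_to_shardy = _env_flag("CONVERT_SHLO_TO_SHARDY")
```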
Force-pushed from c394d60 to b695da2
initial_dims.append(1)

# 2. Start with the initial_dims.
dims = list(initial_dims)
I think you can just rename `initial_dims` to `dims` and remove L179, since you're not using `initial_dims` after this line.
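A hypothetical before/after sketch of the suggested simplification; the mesh_shape and partition_spec values are just examples, and how `initial_dims` is built is assumed here, so only the rename/removal pattern is the point:

```python
mesh_shape = (2, 4)
partition_spec = (0, None)

# Before: build initial_dims, then copy it into dims one line later.
initial_dims = []
for axis in partition_spec:
    initial_dims.append(mesh_shape[axis] if axis is not None else 1)
dims = list(initial_dims)

# After: build dims directly and drop the unused intermediate list.
dims = [mesh_shape[axis] if axis is not None else 1 for axis in partition_spec]
```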
…-chip training

These changes are required to support multi-chip training for real models on the torch-xla side.

- Added V2 OpSharding support in XlaShardingSpec, which is used internally by MpLoader for parallel input loading. The original implementation only supported V1 shardings.
- Fixed environment variable parsing for CONVERT_SHLO_TO_SHARDY - previous logic treated values like "0" or "false" as truthy.
- Added logic to compute dims, reshape_dims, and transpose_perm for V2 sharding based on mesh_shape and partition_spec.

The new logic now correctly handles cases that were previously unsupported:

case 1: mesh_shape=(2,1,1,1), partition_spec=(0,None,None,None)
  -> dims=[2,1,1,1], reshape_dims=[2,1,1,1], transpose_perm=[0,1,2,3]
case 2: mesh_shape=(2,1,1,1), partition_spec=(0,)
  -> dims=[2], reshape_dims=[2,1,1,1], transpose_perm=[0,1,2,3]
case 3: mesh_shape=(2,4), partition_spec=(0,None)
  -> dims=[2,1,4], reshape_dims=[2,4], transpose_perm=[0,1]
Force-pushed from b695da2 to d5b1b3b
…chip training (#2)

* Add V2 sharding support and improve partition spec handling for multi-chip training

These changes are required to support multi-chip training for real models on the torch-xla side.

- Added V2 OpSharding support in XlaShardingSpec, which is used internally by MpLoader for parallel input loading. The original implementation only supported V1 shardings.
- Fixed environment variable parsing for CONVERT_SHLO_TO_SHARDY - previous logic treated values like "0" or "false" as truthy.
- Added logic to compute dims, reshape_dims, and transpose_perm for V2 sharding based on mesh_shape and partition_spec.

The new logic now correctly handles cases that were previously unsupported:

case 1: mesh_shape=(2,1,1,1), partition_spec=(0,None,None,None)
  -> dims=[2,1,1,1], reshape_dims=[2,1,1,1], transpose_perm=[0,1,2,3]
case 2: mesh_shape=(2,1,1,1), partition_spec=(0,)
  -> dims=[2], reshape_dims=[2,1,1,1], transpose_perm=[0,1,2,3]
case 3: mesh_shape=(2,4), partition_spec=(0,None)
  -> dims=[2,1,4], reshape_dims=[2,4], transpose_perm=[0,1]

* Fix formatting according to Torch-XLA style guide

---------

Co-authored-by: Het Shah <hshah@tenstorrent.com>
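For readers of this commit message, a hedged sketch of how dims, reshape_dims, and transpose_perm could be derived from mesh_shape and partition_spec so that the three cases above fall out; the helper name is hypothetical and this is not the PR's actual implementation:

```python
def _v2_tile_assignment_parts(mesh_shape, partition_spec):
    # One tile count per tensor dimension: the size of the mesh axis that
    # dimension is sharded over, or 1 if it is replicated (None).
    dims = [mesh_shape[axis] if axis is not None else 1
            for axis in partition_spec]

    # Mesh axes not named in partition_spec stay replicated; if they span
    # more than one device, append a trailing replication dimension.
    used = [axis for axis in partition_spec if axis is not None]
    unused = [i for i in range(len(mesh_shape)) if i not in used]
    replicated = 1
    for i in unused:
        replicated *= mesh_shape[i]
    if replicated > 1:
        dims.append(replicated)

    # reshape_dims is the mesh shape itself; transpose_perm lists the used
    # mesh axes first, then the unused ones (identity in all three cases).
    reshape_dims = list(mesh_shape)
    transpose_perm = used + unused
    return dims, reshape_dims, transpose_perm

# Case 3 from the commit message:
print(_v2_tile_assignment_parts((2, 4), (0, None)))
# -> ([2, 1, 4], [2, 4], [0, 1])
```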