
Conversation

@fmassa (Contributor) commented on Sep 26, 2025

All2all performs a memory copy in the cases described below, depending on the in_shard / out_shard dims.

For now, I assume that the input is not contiguous; I should improve this in the future.

Also refactors the read-write cost into a helper function to centralize things.

The all2all implementation performs additional input/output copies depending on the in_shard / out_shard dims; see https://github.com/pytorch/pytorch/blob/afdd4247a2251b3f4c2f4b402cb625f46d6784ba/torch/csrc/distributed/c10d/Functional.cpp#L597-L617 for more details.
We still need to figure out a way of deciding whether the input is contiguous or not.
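To make the copy penalty concrete, here is a minimal sketch of how those extra copies could be folded into an all2all cost estimate. All names and constants are hypothetical illustrations, not the PR's actual helpers; it assumes a read/write cost that charges bytes moved at memory bandwidth and is floored at the kernel launch overhead, and it pessimistically charges one full read+write pass per extra copy.

MEM_BANDWIDTH_GBPS = 1600.0      # hypothetical HBM bandwidth, GB/s
KERNEL_LAUNCH_OVERHEAD = 5.0     # hypothetical launch overhead, us

def read_write_cost(num_bytes):
    # Time (us) to read and write num_bytes once each, floored at launch overhead.
    time_us = 2 * num_bytes / (MEM_BANDWIDTH_GBPS * 1e9) * 1e6
    return max(time_us, KERNEL_LAUNCH_OVERHEAD)

def all2all_cost(num_bytes, comm_bandwidth_gbps, in_shard_dim, out_shard_dim):
    # Network time (us) to exchange num_bytes across ranks.
    comm_time = num_bytes / (comm_bandwidth_gbps * 1e9) * 1e6
    extra = 0.0
    # Following the PR description: extra local copies depend on the shard dims;
    # here we assume a copy of the whole tensor whenever the shard dim is not dim 0.
    if in_shard_dim != 0:
        extra += read_write_cost(num_bytes)
    if out_shard_dim != 0:
        extra += read_write_cost(num_bytes)
    return comm_time + extra

For example, all2all_cost(1 << 30, 400.0, 1, 0) would charge one extra input copy on top of the transfer time, while shard dims of 0 on both sides add nothing.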
The meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Sep 26, 2025.
# suppose 70% efficiency for the operator
compute_efficiency = 0.70
compute_time = flops / gpu_flops * 1e6 # us
compute_time = max(compute_time / compute_efficiency, kernel_launch_overhead)
@fmassa (Contributor, Author) replied:


I removed this, but this is functionally the same as before because we already perform a max(..., kernel_launch_overhead) for the compute_read_write_time.
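For illustration, here is a minimal sketch of why dropping that floor is harmless. It assumes, hypothetically (not the repo's exact structure), that the per-op estimate takes the max of the compute time and the already-floored read/write time:

MEM_BANDWIDTH_GBPS = 1600.0      # hypothetical HBM bandwidth, GB/s
kernel_launch_overhead = 5.0     # us, same name as in the snippet above

def compute_read_write_time(num_bytes):
    # The launch-overhead floor lives here, as noted in the comment above.
    return max(2 * num_bytes / (MEM_BANDWIDTH_GBPS * 1e9) * 1e6,
               kernel_launch_overhead)

def op_time(flops, gpu_flops, num_bytes):
    compute_efficiency = 0.70
    compute_time = flops / gpu_flops * 1e6 / compute_efficiency  # us
    # compute_read_write_time is already >= kernel_launch_overhead, so the
    # combined estimate keeps the floor even without flooring compute_time itself.
    return max(compute_time, compute_read_write_time(num_bytes))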

fmassa merged commit 716c19b into main on Sep 28, 2025; 6 checks passed.
fmassa deleted the fmassa/improve_a2a_cost branch on September 28, 2025 at 18:10.