
Conversation

@fmassa (Contributor) commented on Sep 26, 2025

All2all performs a memory copy in the cases described below, depending on the in_shard / out_shard dims.

For now, I assume that the input is not contiguous; I should improve this in the future.

Also refactors the read-write cost into a helper function to centralize things.

The all2all implementation performs additional input/output copies depending on the in_shard / out_shard dims; see https://github.com/pytorch/pytorch/blob/afdd4247a2251b3f4c2f4b402cb625f46d6784ba/torch/csrc/distributed/c10d/Functional.cpp#L597-L617 for more details.
We still need to figure out a way of deciding whether the input is contiguous or not.
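To make the copy penalty concrete, here is a minimal sketch of how those extra copies could be folded into an all2all cost estimate. All names and constants are hypothetical illustrations, not the PR's actual helpers; it assumes a read/write cost that charges bytes moved at memory bandwidth and is floored at the kernel launch overhead, and it pessimistically charges one full read+write pass per extra copy.

MEM_BANDWIDTH_GBPS = 1600.0      # hypothetical HBM bandwidth, GB/s
KERNEL_LAUNCH_OVERHEAD = 5.0     # hypothetical launch overhead, us

def read_write_cost(num_bytes):
    # Time (us) to read and write num_bytes once each, floored at launch overhead.
    time_us = 2 * num_bytes / (MEM_BANDWIDTH_GBPS * 1e9) * 1e6
    return max(time_us, KERNEL_LAUNCH_OVERHEAD)

def all2all_cost(num_bytes, comm_bandwidth_gbps, in_shard_dim, out_shard_dim):
    # Network time (us) to exchange num_bytes across ranks.
    comm_time = num_bytes / (comm_bandwidth_gbps * 1e9) * 1e6
    extra = 0.0
    # Following the PR description: extra local copies depend on the shard dims;
    # here we assume a copy of the whole tensor whenever the shard dim is not dim 0.
    if in_shard_dim != 0:
        extra += read_write_cost(num_bytes)
    if out_shard_dim != 0:
        extra += read_write_cost(num_bytes)
    return comm_time + extra

For example, all2all_cost(1 << 30, 400.0, 1, 0) would charge one extra input copy on top of the transfer time, while shard dims of 0 on both sides add nothing.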
The meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Sep 26, 2025.
# suppose 70% efficiency for the operator
compute_efficiency = 0.70
compute_time = flops / gpu_flops * 1e6 # us
compute_time = max(compute_time / compute_efficiency, kernel_launch_overhead)
@fmassa (Contributor, Author) replied:


I removed this, but this is functionally the same as before because we already perform a max(..., kernel_launch_overhead) for the compute_read_write_time.
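For illustration, here is a minimal sketch of why dropping that floor is harmless. It assumes, hypothetically (not the repo's exact structure), that the per-op estimate takes the max of the compute time and the already-floored read/write time:

MEM_BANDWIDTH_GBPS = 1600.0      # hypothetical HBM bandwidth, GB/s
kernel_launch_overhead = 5.0     # us, same name as in the snippet above

def compute_read_write_time(num_bytes):
    # The launch-overhead floor lives here, as noted in the comment above.
    return max(2 * num_bytes / (MEM_BANDWIDTH_GBPS * 1e9) * 1e6,
               kernel_launch_overhead)

def op_time(flops, gpu_flops, num_bytes):
    compute_efficiency = 0.70
    compute_time = flops / gpu_flops * 1e6 / compute_efficiency  # us
    # compute_read_write_time is already >= kernel_launch_overhead, so the
    # combined estimate keeps the floor even without flooring compute_time itself.
    return max(compute_time, compute_read_write_time(num_bytes))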

fmassa merged commit 716c19b into main on Sep 28, 2025; 6 checks passed.
fmassa deleted the fmassa/improve_a2a_cost branch on September 28, 2025 at 18:10.