
DeepCompile ZeRO-3: robust allgather for uneven shards; fix profiling… #7489


Open
juyterman1000 wants to merge 4 commits into master from fix/dc-zero3-allgather-uneven-shards

Conversation

juyterman1000

… meta key (max_mem)

@juyterman1000 juyterman1000 force-pushed the fix/dc-zero3-allgather-uneven-shards branch from eac514f to 1f39153 Compare August 15, 2025 00:32
@sfc-gh-truwase
Collaborator

@juyterman1000 can you please address the formatting issue using https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

std::vector<int64_t> host_counts(world_size);
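// Each iteration does a synchronous device-to-host copy to find the largest shard count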
for (int i = 0; i < world_size; ++i) {
host_counts[i] = all_counts[i].to(torch::kCPU).item<int64_t>();
if (host_counts[i] > max_count) { max_count = host_counts[i]; }
Contributor

Could you elaborate more on when ds_tensor.numel() of the same parameter can differ across ranks? I think padding is already taken into account when the parameter is partitioned among the ranks (ref: https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L1664)

In case partition sizes do vary across ranks, can we fix that in partition_parameters.py to avoid synchronous communication here? launchAllGather() is on the critical path, so synchronous allgather can hurt performance.

Author

Thanks for the sharp catch. I’ve removed the synchronous size-allgather from the hot path in launchAllGather() and now use a fixed-count NCCL allgather, trimming any end padding down to the true param size. To keep the safety check without paying a runtime cost, I added a one-time assertion at registration that shard sizes match across ranks; if there’s ever a mismatch, we catch it at the source rather than synchronizing on the critical path. Changes are in the updated PR.
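For reference, a minimal sketch of what the one-time registration check could look like (the function name, NCCL handles, and dtype choices here are illustrative, not the actual PR code):

#include <torch/torch.h>
#include <nccl.h>
#include <cuda_runtime.h>

// Sketch: run once when a param shard is registered, off the hot path.
void assertUniformShardSize(const at::Tensor& ds_tensor,
                            int world_size,
                            ncclComm_t comm,
                            cudaStream_t stream)
{
    auto opts = torch::TensorOptions().dtype(torch::kInt64).device(ds_tensor.device());
    at::Tensor local = torch::full({1}, static_cast<int64_t>(ds_tensor.numel()), opts);
    at::Tensor all_counts = torch::empty({world_size}, opts);
    // One collective at registration time instead of one per launchAllGather().
    ncclAllGather(local.data_ptr(), all_counts.data_ptr(), 1, ncclInt64, comm, stream);
    cudaStreamSynchronize(stream);
    at::Tensor host_counts = all_counts.to(torch::kCPU);
    for (int i = 0; i < world_size; ++i) {
        TORCH_CHECK(host_counts[i].item<int64_t>() == ds_tensor.numel(),
                    "ZeRO-3 shard size mismatch across ranks for this param");
    }
}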

Contributor

I think we can further optimize the code by having allgatherParam() allocate a padded buffer in the first place (today it allocates a buffer of ds_shape, which is the true size of the gathered parameter). With that, we don't need any additional memcpy or GPU memory allocation/deallocation; instead we can slice the gathered output_buf before returning it. My understanding is that torch correctly tracks the refcount of the underlying storage even when live tensors use only part of it, but correct me if I'm wrong.

Collaborator

@eternalNight thanks for the suggestion. @juyterman1000 if you agree with this, do you want to address it in a follow-up PR? A benefit of a follow-up PR is that it could document the perf benefit of the optimization separately from the functionality fix.

Contributor

Hi @juyterman1000,

Thank you for the PR! Some of the changes are unclear to me; can you explain a bit more?
You added an assertion to ensure even sharding, which totally makes sense to me. Do we still need the changes in launchAllGather()? The additional memory allocation and copy might cause significant overhead in some cases.

Author

@eternalNight Yes, we can allocate a buffer sized to world_size * shard_elems up front and slice it to the true size on return. PyTorch views hold a reference to the underlying storage, so returning a sliced view does not break refcounting. We can also cache the padded buffer per param to avoid repeat allocations.

@sfc-gh-truwase Agreed on the follow-up. I’ll include micro-benchmarks showing the removal of one alloc + one memcpy per all-gather and any other gains.

@tohtana With the even-sharding assertion in place, we don’t need the extra copy logic in launchAllGather(). We can issue a direct AllGather with a uniform per-shard element count into the padded buffer and return a view of the first true_numel elements, reshaped to the original param shape. The symmetric-memory path will stay as-is. This avoids the additional copy overhead.
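Roughly, the shape of what I have in mind (a sketch only, assuming fp32 params and uniform shards; names are illustrative, and per-param caching of the padded buffer is omitted for brevity):

#include <torch/torch.h>
#include <nccl.h>
#include <cuda_runtime.h>

// Sketch: gather uniform shards into a padded buffer, return a trimmed view.
at::Tensor allGatherPaddedView(const at::Tensor& ds_tensor,   // local shard
                               at::IntArrayRef ds_shape,      // true (unpartitioned) shape
                               int64_t true_numel,            // product of ds_shape
                               int world_size,
                               ncclComm_t comm,
                               cudaStream_t stream)
{
    const int64_t shard_elems = ds_tensor.numel();
    // Padded output: world_size * shard_elems can exceed true_numel by the end padding.
    at::Tensor output_buf = torch::empty({world_size * shard_elems}, ds_tensor.options());
    at::Tensor send = ds_tensor.contiguous();  // NCCL needs contiguous storage
    ncclAllGather(send.data_ptr(), output_buf.data_ptr(),
                  shard_elems, ncclFloat, comm, stream);
    // Return a view of the first true_numel elements, reshaped to the param shape;
    // the view keeps the padded storage alive through normal refcounting.
    return output_buf.narrow(0, 0, true_numel).view(ds_shape);
}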

Contributor

@juyterman1000 The path using symmetric memory is experimental and not well optimized, so we need to keep the non-symmetric-memory path as the choice for best performance.
If the allocation and copy are there to handle uneven partitioning, and the assertion blocks such uneven partitioning, why can't we remove them?

@juyterman1000
Author

@juyterman1000 can you please address the formatting issue using https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

Thanks for the check. I followed the formatting prerequisites in the contributing guide and ran the full pre-commit suite. I’ve pushed the updates.

@sfc-gh-truwase sfc-gh-truwase requested a review from tohtana August 18, 2025 16:13
const int64_t shard_elems = ds_tensor.numel();

// Perform all-gather directly into the pre-allocated padded output buffer
ncclResult_t result = ncclAllGather(ds_tensor.flatten().data_ptr(),
Contributor

Why replace .contiguous() with .flatten()? .contiguous() makes sure that the underlying storage is contiguous, which NCCL requires; .flatten() is a view change and does not guarantee that.

Note: I believe the sharded tensors are already contiguous, since they are defragmented by DeepSpeedZeroOptimizer_Stage3.defragment(), but adding .contiguous() does not hurt and may help later if the layout of sharded tensors changes.

}

at::Tensor output_buf;
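// Check the registry for an already-gathered buffer for this param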
if (param_registry_->hasGatheredParam(ds_id)) {
Contributor

I'm not sure when isValid(ds_id) is false while hasGatheredParam(ds_id) is true. They are both set at the end of launchAllGather(), and releasing a gathered param will unset the valid flag in unregisterGatheredParam().

… meta key (max_mem)

Signed-off-by: Abhishek <dalakotiashu150@gmail.com>
…s at registration (max_mem)

Signed-off-by: Abhishek <dalakotiashu150@gmail.com>
…iew; launchAllGather issues direct NCCL AllGather for uniform shards; add registration-time uniform-shard validation

Signed-off-by: Abhishek <dalakotiashu150@gmail.com>
Signed-off-by: Abhishek <dalakotiashu150@gmail.com>
@juyterman1000 juyterman1000 force-pushed the fix/dc-zero3-allgather-uneven-shards branch from 34df823 to ffa2aba Compare August 22, 2025 03:12