
fix memcpy issue on backward for zero-infinity #6670

Merged
merged 7 commits into microsoft:master on Oct 31, 2024

Conversation

xylian86
Contributor

This PR is similar to PR #5301; it optimizes the D2H transfer time by using pinned memory.

Previously, the D2H memcpy was the bottleneck during the final backward pass of each iteration for ZeRO-Infinity (offload), as shown in Trace-1. The new version eliminates that bottleneck, as shown in Trace-2.

Trace-1 (profiler screenshot)

Trace-2 (profiler screenshot)
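
The sketch below is a minimal, self-contained illustration of the pinned-buffer staging pattern this PR describes, not DeepSpeed's actual code; the function and tensor names are invented for the example, and all tensors are assumed to be flat (1-D).

    import torch

    def staged_d2h_copy(grad_buffer: torch.Tensor,
                        pinned_buffer: torch.Tensor,
                        fp32_grad_tensor: torch.Tensor) -> None:
        # Assumes 1-D tensors; pinned_buffer and fp32_grad_tensor must hold
        # at least grad_buffer.numel() elements.
        numel = grad_buffer.numel()
        # Stage 1: device -> pinned host buffer. Page-locked memory lets the
        # driver DMA directly instead of staging through pageable RAM.
        pinned_buffer[:numel].copy_(grad_buffer.to(dtype=torch.float32))
        # Stage 2: pinned host buffer -> fp32 optimizer gradient tensor (H2H).
        fp32_grad_tensor[:numel].copy_(pinned_buffer[:numel])

    # Hypothetical usage: allocate the staging buffer once, reuse per bucket.
    # pinned = torch.empty(max_numel, dtype=torch.float32, pin_memory=True)
    # staged_d2h_copy(grad_bucket, pinned, cpu_fp32_grads)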

cc @tjruwase

@xylian86 xylian86 requested a review from tjruwase as a code owner October 26, 2024 05:58
@loadams loadams requested a review from GuanhuaWang October 28, 2024 16:02
@loadams loadams enabled auto-merge October 30, 2024 23:44
@loadams loadams disabled auto-merge October 31, 2024 17:55
@loadams loadams merged commit ff1c543 into microsoft:master Oct 31, 2024
13 of 14 checks passed
@deepcharm
Contributor

deepcharm commented Dec 11, 2024

Hi @xylian86, we have encountered a significant performance degradation when applying this patch on the Gaudi HPU accelerator.
With the patch, partition_grads() takes approximately 2x as long.

The issues we see are:

  1. A device synchronize after the first copy
  2. The second H2H copy takes a very long time

Below is the partition_grads profile before the patch:

(profile screenshot: partition_grads before the patch)

And with the patch:

(profile screenshot: partition_grads with PR #6670)

The above traces were taken for a small variant of BERT. For larger models, the performance degradation is more severe.

Could you please share some details of the performance improvement you observed (i.e., accelerator/workload/some stats)? Thanks.

@tjruwase
Contributor

@deepcharm, thanks for reporting this degradation. It is strange, since this PR provided good speedups on CUDA, so I am guessing this is some device-specific effect.

While waiting for @xylian86 to respond, I wonder if you could test the following modification to L1508-1512, which removes non_blocking=True and get_accelerator().synchronize():

                        self.pinned_grad_buffer[:buffer_numel].copy_(grad_buffer.to(dtype=torch.float32))
                        fp32_grad_tensor.copy_(self.pinned_grad_buffer[:buffer_numel])
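
For context, a rough, unverified reconstruction of the pre-modification code being discussed (inferred from the comments above, not a verbatim quote of stage3.py) is sketched below; the non-blocking D2H copy and the explicit synchronize are what the suggestion removes:

    # Rough reconstruction, inferred from the discussion -- not verbatim source.
    self.pinned_grad_buffer[:buffer_numel].copy_(
        grad_buffer.to(dtype=torch.float32), non_blocking=True)  # async D2H into the pinned staging buffer
    get_accelerator().synchronize()  # wait for the staging copy (deepcharm's issue 1)
    fp32_grad_tensor.copy_(self.pinned_grad_buffer[:buffer_numel])  # host-to-host copy into fp32 grads (deepcharm's issue 2)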

@xylian86
Contributor Author

@deepcharm Thank you for reporting it. My setup: 4 GH200 GPUs, GBS=128, MBS=16, ZeRO-3. With this pull request, we observed a 4x reduction in backward-pass time during the accumulation step, as demonstrated in the figures attached to the PR.

And as Tunji mentioned, it might be a device-specific effect. What is the bandwidth between CPU and GPU in your testing environment? This PR is particularly effective on newer GPU clusters with high-bandwidth interconnects (e.g., PCIe 5.0).
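
One quick way to answer the bandwidth question is a crude PyTorch microbenchmark like the sketch below. This is generic code, not part of DeepSpeed; the buffer size is arbitrary, and it is written for CUDA (on HPU the device string and synchronize call differ).

    import time
    import torch

    def d2h_bandwidth_gbps(numel: int = 1 << 28, pinned: bool = True) -> float:
        # Time a single device->host copy of a float32 tensor and return GB/s.
        src = torch.empty(numel, dtype=torch.float32, device="cuda")
        dst = torch.empty(numel, dtype=torch.float32, pin_memory=pinned)
        torch.cuda.synchronize()
        start = time.perf_counter()
        dst.copy_(src)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        return numel * 4 / elapsed / 1e9

    # Compare pinned vs. pageable destination buffers:
    # print(d2h_bandwidth_gbps(pinned=True), d2h_bandwidth_gbps(pinned=False))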

@nelyahu
Contributor

nelyahu commented Dec 12, 2024

@xylian86 which model did you use?

@xylian86
Contributor Author

@nelyahu LLaMA 7B

@deepcharm
Contributor

@xylian86 Thanks for providing the details! Indeed, it seems that it's a device-specific effect.

@tjruwase We've tried making the first copy non-blocking while also removing the device synchronize.
Unfortunately, no change in performance was observed.

Can we add a ZeRO config option (default=True) for this new behavior, so that HPU customers can disable it to improve performance?
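
For illustration only, such a switch might be exposed roughly as in the hypothetical ds_config fragment below; the key name use_pinned_grad_buffer is invented here and is not an existing DeepSpeed option.

    # Hypothetical ds_config fragment -- "use_pinned_grad_buffer" is invented
    # for illustration and is not a real DeepSpeed option.
    ds_config = {
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu"},
            "use_pinned_grad_buffer": True,  # default True; HPU users could set False
        },
    }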

@deepcharm
Contributor

Hi @tjruwase, could you please approve/comment on adding a config option to enable/disable this feature? Many thanks.

@tjruwase
Contributor

tjruwase commented Jan 9, 2025

@deepcharm, apologies for the delayed response. Yes, we will add a ds_config option to control this feature.

@tjruwase
Contributor

tjruwase commented Jan 9, 2025

@deepcharm, do you know if ZeRO-1/2 on HPU experienced similar perf degradation with the earlier PR #5301?
