fix memcpy issue on backward for zero-infinity #6670
Conversation
Hi @xylian86, we have encountered significant performance degradation when applying this patch on the Gaudi HPU accelerator. The issues that we see are illustrated by the partition_grads profiles below.
[partition_grads profile before the patch]
[partition_grads profile with the patch]
The above traces are taken for a small variation of BERT; for larger models, the performance degradation is more severe. Could you please share some details of the performance improvement that you observed (i.e. accelerator/workload/some stats)? Thanks
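For context, traces like these can be collected with torch.profiler; below is a minimal sketch (the `model_engine` and `batch` names are placeholders rather than code from this thread, and the exact activities depend on the accelerator):

```python
# Minimal profiling sketch (placeholder names); sorts operators by CPU time
# to surface hotspots such as partition_grads / D2H memcpy.
import torch
from torch.profiler import profile, ProfilerActivity

def profile_step(model_engine, batch):
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities, record_shapes=True) as prof:
        loss = model_engine(batch)      # hypothetical forward that returns a loss
        model_engine.backward(loss)     # DeepSpeed engine backward
        model_engine.step()
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```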
@deepcharm, thanks for reporting this degradation. It is strange, since this PR provided good speedups on CUDA, so I am guessing this is some device-specific effect. While waiting for @xylian86 to respond, I wonder if you could test the following modification to L1508-1512, which removes:
    self.pinned_grad_buffer[:buffer_numel].copy_(grad_buffer.to(dtype=torch.float32))
    fp32_grad_tensor.copy_(self.pinned_grad_buffer[:buffer_numel])
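If it helps, here is one reading of that suggestion as code, assuming the same variable names as the quoted snippet (a sketch, not the exact patch that was tested):

```python
# Sketch: drop the pinned staging buffer and copy the reduced gradients
# straight into the fp32 CPU partition; Tensor.copy_ performs the
# device-to-host transfer and the cast to float32 in one call.
fp32_grad_tensor.copy_(grad_buffer)
```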
@deepcharm Thank you for reporting it. The setup on my side: 4 GH200 GPUs, GBS=128, MBS=16, ZeRO-3. With this pull request, we observed a 4x reduction in backward pass time during the accumulation step, as demonstrated in the figure attached to the PR. As Tunji mentioned, it might be some device-specific effect; what is the bandwidth between CPU and GPU in your testing environment? This PR is particularly effective for newer GPU clusters with high-bandwidth interconnects (PCIe 5.0).
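One rough way to check that bandwidth is a simple timed copy in PyTorch; the sketch below assumes CUDA and does a single un-warmed timed copy (on HPU the device and synchronize calls would differ):

```python
# Rough D2H bandwidth check (CUDA assumed; buffer size and single-shot
# timing are simplifications: no warm-up or averaging).
import time
import torch

def measure_d2h_bandwidth(num_bytes=1 << 28, pinned=True):
    src = torch.empty(num_bytes, dtype=torch.uint8, device="cuda")
    dst = torch.empty(num_bytes, dtype=torch.uint8, pin_memory=pinned)
    torch.cuda.synchronize()
    start = time.perf_counter()
    dst.copy_(src, non_blocking=True)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"D2H {'pinned' if pinned else 'pageable'}: {num_bytes / 1e9 / elapsed:.1f} GB/s")

measure_d2h_bandwidth(pinned=True)
measure_d2h_bandwidth(pinned=False)
```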
@xylian86 which model did you use?
@nelyahu LLaMA 7B
@xylian86 Thanks for providing the details! Indeed, it seems that it's a device-specific effect. @tjruwase We've tried making the first copy non-blocking while also removing the device synchronize. Can we add a ZeRO config option to enable/disable this feature?
Hi @tjruwase, could you please approve/comment on adding a config option to enable/disable this feature? Many thanks
@deepcharm, apologies for the delayed response. Yes, we will add a ds_config option to control this feature.
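As an illustration of how such a switch might be exposed, here is a hypothetical ds_config sketch; the `use_pinned_grad_buffer` key name is invented for this example and is not the actual option added to DeepSpeed:

```python
# Hypothetical config sketch; only "use_pinned_grad_buffer" is invented,
# the surrounding keys are standard ZeRO-3 offload settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,
            # hypothetical switch to fall back to the pre-#6670 copy path
            "use_pinned_grad_buffer": False,
        },
    },
}
```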
@deepcharm, do you know if ZeRO-1/2 on HPU experienced similar perf degradation with the earlier PR #5301? |
This PR is similar to PR #5301, which optimizes the D2H time using pinned memory.
Previously, the D2H memcpy was the bottleneck during the final backward pass of each iteration for ZeRO-Infinity (offload), as shown in Trace-1. The new version eliminates this bottleneck, as shown in Trace-2.
Trace-1
Trace-2
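The pinned-memory staging that the description refers to follows this general pattern; below is a simplified sketch of the idea (not the actual stage3.py code, and the buffer size is arbitrary):

```python
# Simplified illustration of pinned-memory D2H staging: route the
# device-to-host gradient copy through a page-locked buffer so it can use
# DMA instead of a slow pageable-memory memcpy.
import torch

grad_buffer = torch.randn(1 << 20, dtype=torch.bfloat16, device="cuda")
fp32_grad_tensor = torch.empty(grad_buffer.numel(), dtype=torch.float32)  # pageable CPU tensor

pinned_buffer = torch.empty(grad_buffer.numel(), dtype=torch.float32, pin_memory=True)
pinned_buffer.copy_(grad_buffer.to(dtype=torch.float32), non_blocking=True)
torch.cuda.synchronize()  # ensure the async D2H copy has completed
fp32_grad_tensor.copy_(pinned_buffer)
```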
cc @tjruwase