Fix excessive CPU memory usage with FSDP and cpu_ram_efficient_loading #33154
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Ping me when this is ready for review!
@ArthurZucker Ready!
thanks @matthewdouglas ! |
Sorry, as the changes are exactly the same as what we had in #32276, could you explain what was resolved on main that no longer fails?
@ArthurZucker I've added more background to the description.
Thanks a lot all for clarifying!
@ArthurZucker @matthewdouglas I tried this fix but I'm having NCCL issues similar to what you had. Unfortunately, your suggestion to upgrade to the latest version is not working. I understand you have had some internal debugging discussions on this topic. Could you share the NCCL env settings and other package versions? That might shed light on the root cause.

Update: found the root cause, and it was not an NCCL issue. I have submitted a fix to TRL for it.
What does this PR do?
This PR fixes an issue with FSDP + CPU_RAM_EFFICIENT_LOADING where a copy of the parameters is loaded into CPU memory for each rank. With this change, only rank 0 offloads the parameters to CPU, while the other ranks keep them on the meta device. On a typical 8-GPU node this dramatically decreases the system RAM required to load a large model.
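For context, the loading pattern this relies on looks roughly like the sketch below. This is plain PyTorch FSDP rather than the exact `from_pretrained` code path, and `build_model` / `load_checkpoint_to_cpu` are hypothetical placeholders: rank 0 holds the real weights in CPU RAM, every other rank builds the model on the meta device, and `sync_module_states=True` broadcasts rank 0's weights when the model is wrapped.

```python
# Minimal sketch of rank-0-only CPU loading with FSDP (launch with torchrun --nproc_per_node=<ngpus>).
# `build_model` and `load_checkpoint_to_cpu` are placeholders, not transformers APIs.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def build_model() -> torch.nn.Module:
    return torch.nn.Linear(4096, 4096)


def load_checkpoint_to_cpu() -> dict:
    return {"weight": torch.randn(4096, 4096), "bias": torch.randn(4096)}


dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

if rank == 0:
    # Only rank 0 materializes real weights in CPU RAM.
    model = build_model()
    model.load_state_dict(load_checkpoint_to_cpu())
else:
    # Every other rank builds the model on the meta device, so no CPU RAM is allocated.
    with torch.device("meta"):
        model = build_model()

# sync_module_states=True broadcasts rank 0's weights to the other ranks during wrapping;
# param_init_fn first gives the meta-device parameters real (empty) GPU storage.
model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
    sync_module_states=True,
    param_init_fn=None if rank == 0 else (
        lambda module: module.to_empty(device=torch.device("cuda"), recurse=False)
    ),
)
```

This is the same idea the PR description outlines for `from_pretrained` under FSDP + CPU_RAM_EFFICIENT_LOADING: only one copy of the full weights ever lives in system RAM, regardless of the number of ranks per node.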
This is split from a previously reverted PR #32276 originally contributed by @winglian. The revert was due to issues we had with validating the change that have since been resolved.
The issue we encountered was specific to our cluster environment on AWS. With the AWS EFA plugin for NCCL, we saw consistent hangs. Upgrading NCCL from the version bundled with PyTorch (2.20.5) to NCCL 2.22.3 via `pip install nvidia-nccl-cu12==2.22.3` resolves the issue. (Internal discussion)
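For anyone reproducing this, a small sketch of one way to check which NCCL version PyTorch reports at runtime (whether the pip-installed `nvidia-nccl-cu12` wheel takes precedence over the bundled library depends on your environment):

```python
import torch

# Prints a tuple such as (2, 20, 5) for the NCCL bundled with PyTorch,
# or (2, 22, 3) after the upgrade if the pip-installed wheel is the one loaded.
print(torch.cuda.nccl.version())
```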
Fixes #31721, #31577
Before submitting
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@ArthurZucker @LysandreJik