Clean up prefetched parameters #6557
…eepSpeed into tohtana/offload_zero_buffers
Please check if this PR fixes #5828.
@tjruwase Using this PR branch, the repro in #5828 shows the message below but exits without throwing an error. I think this is expected, since the model has a conditional branch and the execution order of modules changes.
@tjruwase I added the cleaning of the inflight parameter registry in |
Parameters prefetched by ZeRO3 are sometimes not used. This occurs when the actual sub-module execution differs from previous tracing. As a result, the state of the allgather handle for such a parameter remains `INFLIGHT`, causing functions like `empty_partition_cache` to detect it and throw an error.

This PR resolves the issue by ensuring that communication finishes and the parameters are freed.
As this issue was mentioned in #6011, this PR includes the changes from that branch. We need to merge #6011 first.
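
To illustrate the failure mode and the cleanup described above, here is a minimal, self-contained sketch. It does not use DeepSpeed; `Status`, `FakeParam`, `FakeHandle`, `prefetch`, `check_no_inflight`, and `clean_inflight` are all hypothetical names invented for this example, and the real implementation may be structured quite differently. The idea is only that a prefetched parameter that is never consumed stays in flight until something waits on its handle and frees it.

```python
from enum import Enum, auto


class Status(Enum):
    NOT_AVAILABLE = auto()  # only the local partition is held
    INFLIGHT = auto()       # all-gather launched, not yet waited on
    AVAILABLE = auto()      # full parameter is materialized


class FakeParam:
    """Hypothetical stand-in for a partitioned parameter."""

    def __init__(self, name):
        self.name = name
        self.status = Status.NOT_AVAILABLE

    def free(self):
        # Release the gathered data and return to the partitioned state.
        self.status = Status.NOT_AVAILABLE


class FakeHandle:
    """Hypothetical stand-in for an asynchronous all-gather handle."""

    def __init__(self, param):
        self.param = param

    def wait(self):
        # Completing the communication makes the full parameter available.
        self.param.status = Status.AVAILABLE


def prefetch(param, inflight):
    """Launch a (simulated) all-gather and record its handle."""
    param.status = Status.INFLIGHT
    inflight[param.name] = FakeHandle(param)


def check_no_inflight(params):
    """Mimics the kind of check that fails if a parameter is still in flight."""
    stuck = [p.name for p in params if p.status is Status.INFLIGHT]
    if stuck:
        raise RuntimeError(f"parameters still in flight: {stuck}")


def clean_inflight(inflight):
    """Wait on every outstanding handle, then free its parameter."""
    for name in list(inflight):
        handle = inflight.pop(name)
        handle.wait()        # make sure communication finishes
        handle.param.free()  # then release the unused parameter


# Tracing predicted both branches, but only branch_a actually runs.
params = [FakeParam("branch_a.weight"), FakeParam("branch_b.weight")]
inflight = {}
for p in params:
    prefetch(p, inflight)

# branch_a is executed normally: wait, use, free.
used = inflight.pop("branch_a.weight")
used.wait()
used.param.free()

# branch_b was prefetched but never used; clean it up before checking.
clean_inflight(inflight)
check_no_inflight(params)  # passes only because of the cleanup above
```

Without the `clean_inflight` call, `check_no_inflight` would raise in this sketch because `branch_b.weight` would still be in flight, which mirrors the error described above.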