Clean up prefetched parameters #6557

Merged on Oct 9, 2024 (37 commits)

tohtana (Contributor) commented Sep 21, 2024

Parameters prefetched by ZeRO3 are sometimes never used. This happens when the actual sub-module execution order differs from the previously recorded trace. The allgather handle for such a parameter then stays in the INFLIGHT state, and functions like empty_partition_cache detect it and throw an error.
This PR resolves the issue by ensuring that the communication finishes and the unused parameters are freed.
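
The failure mode and the fix described above can be sketched with a toy model. This is not DeepSpeed's implementation: `ToyParam`, `ToyHandle`, `prefetch`, `clean_inflight`, and `empty_partition_cache_check` are hypothetical names used only for illustration, and the "allgather" is simulated with a status flag.

```python
from enum import Enum, auto


class ParamStatus(Enum):
    NOT_AVAILABLE = auto()  # only the local partition is held
    INFLIGHT = auto()       # gather launched, not yet waited on
    AVAILABLE = auto()      # full parameter is materialized


class ToyParam:
    """Hypothetical stand-in for a partitioned parameter."""

    def __init__(self, name):
        self.name = name
        self.status = ParamStatus.NOT_AVAILABLE

    def release(self):
        # Drop the gathered data and keep only the local partition.
        self.status = ParamStatus.NOT_AVAILABLE


class ToyHandle:
    """Hypothetical stand-in for an asynchronous allgather handle."""

    def __init__(self, param):
        self.param = param

    def wait(self):
        # Completing the communication makes the full parameter available.
        self.param.status = ParamStatus.AVAILABLE


def prefetch(param, registry):
    """Launch a simulated gather and remember its handle."""
    param.status = ParamStatus.INFLIGHT
    registry[param] = ToyHandle(param)


def clean_inflight(registry):
    """Wait on every pending handle, then free the gathered parameter."""
    for param, handle in list(registry.items()):
        handle.wait()
        param.release()
    registry.clear()


def empty_partition_cache_check(params):
    """Mimic a check that refuses to run while a parameter is in flight."""
    inflight = [p.name for p in params if p.status is ParamStatus.INFLIGHT]
    if inflight:
        raise RuntimeError(f"parameters still in flight: {inflight}")


params = [ToyParam("a"), ToyParam("b")]
registry = {}
prefetch(params[1], registry)        # prefetched, but its module never runs
clean_inflight(registry)             # finish communication, then free
empty_partition_cache_check(params)  # no longer raises
```

Without the `clean_inflight` call, the final check raises because `b` is still marked INFLIGHT.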

Because this issue was raised in #6011, this PR includes the changes from that branch. #6011 needs to be merged first.

tjruwase (Contributor) commented

Please check if this PR fixes #5828.

tohtana (Contributor, Author) commented Sep 27, 2024

> Please check if this PR fixes #5828.

@tjruwase Using this PR branch, the repro in #5828 shows the message below but exits without throwing an error. I think this is expected as the model has a conditional branch and the execution order of modules changes.

Invalidate trace cache @ step 3: expected module 2, but got module 4
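
For readers unfamiliar with how a message like this arises, here is a minimal sketch of trace checking. It assumes the trace is simply the list of module ids seen in a previous iteration; `ToyTrace` and its methods are hypothetical and do not mirror DeepSpeed's internals. Only the message format is taken from the log line above.

```python
class ToyTrace:
    """Records module order in one iteration, then checks later iterations."""

    def __init__(self):
        self.order = []        # module ids in recorded execution order
        self.recording = True  # True while no valid trace exists
        self.step = 0          # position within the current iteration

    def visit(self, module_id):
        """Return an invalidation message if this visit deviates, else None."""
        message = None
        if self.recording:
            self.order.append(module_id)
        else:
            expected = self.order[self.step] if self.step < len(self.order) else None
            if expected != module_id:
                message = (f"Invalidate trace cache @ step {self.step}: "
                           f"expected module {expected}, but got module {module_id}")
                # Discard the stale tail of the trace and record again.
                self.order = self.order[:self.step] + [module_id]
                self.recording = True
        self.step += 1
        return message

    def end_iteration(self):
        """Freeze the recorded order and rewind for the next iteration."""
        self.recording = False
        self.step = 0


trace = ToyTrace()
for module_id in [0, 1, 5, 2, 3]:   # first iteration takes one branch
    trace.visit(module_id)
trace.end_iteration()

messages = [trace.visit(m) for m in [0, 1, 5, 4]]  # second iteration branches
print(messages[-1])
```

In this toy, a model with a conditional branch legitimately changes its module order, so the deviation is reported and the trace is re-recorded rather than treated as an error.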

tohtana added a commit that referenced this pull request Oct 4, 2024
tohtana enabled auto-merge October 8, 2024 15:42
tohtana (Contributor, Author) commented Oct 8, 2024

@tjruwase I added the cleanup of the inflight parameter registry to _invalidate_trace as you suggested. This lets us free the gathered (but unused) parameters earlier. However, I also kept the cleanup in reset_step.
This is because we don't detect deviations from the trace when some modules at the end of the trace remain unvisited. Without that cleanup, the original assertion in reset_step would still be triggered in that case.
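
The reason for cleaning up in two places can be illustrated with a small sketch. It assumes a coordinator that prefetches the next module in the trace. `ToyCoordinator`, `run_module`, and `_clean_inflight` are hypothetical names; only `_invalidate_trace` and `reset_step` are borrowed from the comment above, and their bodies here are simplified guesses rather than the PR's code.

```python
class ToyCoordinator:
    """Toy prefetcher that follows a previously recorded module order."""

    def __init__(self, trace):
        self.trace = list(trace)  # recorded module order; None once invalidated
        self.step = 0
        self.inflight = set()     # prefetched module ids not yet consumed
        self.freed = []           # ids released without ever being used

    def _clean_inflight(self):
        # Real code would wait for communication to finish before freeing.
        self.freed.extend(sorted(self.inflight))
        self.inflight.clear()

    def _invalidate_trace(self):
        self.trace = None
        self._clean_inflight()    # free early, as soon as a deviation is seen

    def run_module(self, module_id):
        if self.trace is not None:
            if self.step >= len(self.trace) or self.trace[self.step] != module_id:
                self._invalidate_trace()
            else:
                self.inflight.discard(module_id)
                nxt = self.step + 1
                if nxt < len(self.trace):
                    self.inflight.add(self.trace[nxt])  # prefetch next module
        self.step += 1

    def reset_step(self):
        # Unvisited trailing modules never cause a mismatch, so clean here too.
        self._clean_inflight()
        self.step = 0


# Case 1: the iteration stops early, so no mismatch is ever observed.
early_stop = ToyCoordinator([0, 1, 2, 3])
early_stop.run_module(0)
early_stop.run_module(1)
early_stop.reset_step()   # module 2 was prefetched but never ran

# Case 2: a different module runs, so the mismatch is caught immediately.
deviated = ToyCoordinator([0, 1, 2, 3])
deviated.run_module(0)
deviated.run_module(1)
deviated.run_module(3)    # trace expected module 2
```

In the first case only `reset_step` gets a chance to release the prefetched module, while in the second `_invalidate_trace` releases it as soon as the deviation is seen.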

tohtana added this pull request to the merge queue Oct 9, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 9, 2024
loadams added this pull request to the merge queue Oct 9, 2024
Merged via the queue into master with commit 7d751ee Oct 9, 2024
13 checks passed