
Fix potential memory issues when using DeepSpeed ZeRO-3 #6726

Merged
9 commits merged into microsoft:master on Nov 21, 2024

Conversation

wenbinc-Bin
Contributor

@wenbinc-Bin wenbinc-Bin commented Nov 8, 2024

I ran into an OOM problem when doing DPO training with ZeRO-3. DPO needs to call the module twice in one training step, and the second call is under no_grad().
The problem is caused by two bugs:

  1. "__n_available_params", which helps limit the number of fetched parameters, becomes negative after release_and_reset_all() is called.
  2. module.ds_grads_remaining becomes negative in backward() if the module is called more than once in one training step.

I created two patches to fix these issues.
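
For context, here is a rough sketch of the call pattern that hits both bugs; the engine/batch names and the stand-in loss are illustrative, not code from this PR:

```python
import torch

# Illustrative only: two forward passes through the same ZeRO-3 engine in
# one training step, the second under no_grad(), as in DPO-style training.
def training_step(engine, batch):
    policy_logits = engine(batch)            # first forward, grads enabled
    with torch.no_grad():
        ref_logits = engine(batch)           # second forward, no grads
    # stand-in for the real DPO loss, which compares the two sets of logits
    loss = (policy_logits - ref_logits).pow(2).mean()
    engine.backward(loss)                    # bug 2: ds_grads_remaining goes negative
    engine.step()                            # bug 1: __n_available_params goes negative
```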

@wenbinc-Bin wenbinc-Bin requested a review from tjruwase as a code owner November 8, 2024 05:07
@wenbinc-Bin
Contributor Author

@microsoft-github-policy-service agree company=Intel

@tjruwase tjruwase requested a review from tohtana November 12, 2024 16:32
@tjruwase
Contributor

@wenbinc-Bin, thanks for the PR. Are you able to provide unit tests for this?

@wenbinc-Bin
Contributor Author

> @wenbinc-Bin, thanks for the PR. Are you able to provide unit tests for this?

OK, I will try to add a unit test.

"ds_grads_remaining" is used to triger post_backward_function(). If the
module is called more than once in one training step, this variable will
be initialized every time. If backward() is also called multiple times,
"ds_grads_remaining" will be reduced to a negative number.
post_backward_function() will not be called as expected. This leads to
extra fetch operation or extra memory usage.

Set "ds_grads_remaining" to 0 only when it is not initialized to fix
this issue.`

Signed-off-by: Wenbin Chen <wenbin.chen@intel.com>
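
A minimal sketch of the guarded initialization this commit describes; the hook names are illustrative, not the exact DeepSpeed hook code:

```python
# Sketch: create the per-module counter only if it does not exist yet,
# instead of resetting it to 0 on every forward call, so a second forward
# in the same training step does not break the count.
def pre_forward_hook(module, *args):
    if not hasattr(module, "ds_grads_remaining"):
        module.ds_grads_remaining = 0    # before the fix: reset unconditionally
    module.ds_grads_remaining += 1       # one pending backward per forward call

def post_backward_hook(module, *args):
    module.ds_grads_remaining -= 1
    if module.ds_grads_remaining == 0:
        # only now run the module's post-backward bookkeeping
        # (parameter release / prefetch accounting)
        pass
```
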
Fix a bug where, after the first training step, the number of allocated
parameters may exceed "__max_n_available_params".

"__n_available_params" is set to 0 in reset_step(), which is called in
backward(). All parameters are released in release_and_reset_all(), which
is called in step(). "__n_available_params" is reduced when a parameter is
released. This means that if step() is called after backward(),
"__n_available_params" is reduced to a negative number.
"__n_available_params" is used to restrict the number of fetched parameters,
so a negative value lets the number of fetched parameters grow past the
upper bound ("__max_n_available_params").

Move "__n_available_params = 0" to release_and_reset_all() to fix this
issue.

Signed-off-by: Wenbin Chen <wenbin.chen@intel.com>
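
A simplified stand-in for the counter logic described above (illustrative only, not the actual ZeRO-3 parameter coordinator code); it models the "__n_available_params" bookkeeping and where the reset moves to:

```python
class ParamCoordinatorSketch:
    """Toy model of the ZeRO-3 fetch counter; names mirror the commit message."""

    def __init__(self, max_n_available_params):
        self.max_n_available_params = max_n_available_params
        self.n_available_params = 0      # stands in for "__n_available_params"

    def fetch(self, numel):
        # fetching is throttled against the configured upper bound
        assert self.n_available_params + numel <= self.max_n_available_params
        self.n_available_params += numel

    def release(self, numel):
        self.n_available_params -= numel

    def reset_step(self):
        # called at the end of backward(); before the fix the counter was
        # zeroed here, so the releases in release_and_reset_all() drove it negative
        pass

    def release_and_reset_all(self, fetched_numels):
        # called from step(): release everything, then reset the counter here
        for numel in fetched_numels:
            self.release(numel)
        self.n_available_params = 0
```
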
If the model is run more than once in one training step, issues like
these can occur. Add a unit test to catch this kind of problem.

Signed-off-by: Wenbin Chen <wenbin.chen@intel.com>
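
A hypothetical sketch of the kind of test described here (not the actual test added to the DeepSpeed suite): run two forwards per step under ZeRO stage 3, the second one with no_grad(), and check that several steps complete:

```python
import torch
import deepspeed

def test_two_forwards_per_step():
    # tiny model and config just to exercise the ZeRO-3 fetch/release path
    model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
    config = {
        "train_batch_size": 1,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
        "zero_optimization": {"stage": 3},
    }
    engine, _, _, _ = deepspeed.initialize(model=model,
                                           model_parameters=model.parameters(),
                                           config=config)
    for _ in range(3):
        x = torch.randn(1, 8, device=engine.device)
        loss = engine(x).sum()
        with torch.no_grad():
            engine(x)        # second forward in the same step, no grads
        engine.backward(loss)
        engine.step()
```
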
@wenbinc-Bin
Contributor Author

I saw that the unit-test workflow failed. I fixed the issue and updated the PR.

@loadams loadams enabled auto-merge November 20, 2024 00:04
Signed-off-by: Wenbin Chen <wenbin.chen@intel.com>
auto-merge was automatically disabled November 20, 2024 02:30

Head branch was pushed to by a user without write access

@loadams loadams added this pull request to the merge queue Nov 21, 2024
Merged via the queue into microsoft:master with commit cd20a3b Nov 21, 2024
12 checks passed