
Fix potential memory issues when using DeepSpeed ZeRO-3 #6726

Open

wenbinc-Bin wants to merge 3 commits into master

Conversation

@wenbinc-Bin (Author) commented Nov 8, 2024

I had an OOM problem when doing DPO training using ZeRO-3. DPO needs to call the module twice in one training step, and the second call runs under no_grad().
The problem is caused by two bugs:

  1. "__n_available_params", which limits the amount of fetched parameters, becomes negative after release_and_reset_all() is called.
  2. module.ds_grads_remaining becomes negative in backward() if the module is called more than once in one training step.

I created two patches to fix these issues.
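
For context, a minimal sketch of the call pattern that triggers the problem (hypothetical names: engine stands for the DeepSpeed engine wrapping the module, and dpo_loss and batch are placeholders):

```python
import torch

def training_step(engine, batch):
    # First call builds the autograd graph; ZeRO-3 hooks set up
    # per-module bookkeeping such as ds_grads_remaining.
    policy_logits = engine(batch)
    # Second call (the DPO reference pass) runs under no_grad(),
    # but the module hooks still re-initialize the bookkeeping.
    with torch.no_grad():
        ref_logits = engine(batch)
    loss = dpo_loss(policy_logits, ref_logits)  # placeholder loss
    engine.backward(loss)  # post-backward hooks decrement ds_grads_remaining
    engine.step()          # release_and_reset_all() runs here
```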

@wenbinc-Bin (Author) commented:

@microsoft-github-policy-service agree company=Intel


@@ -430,6 +430,7 @@ def release_and_reset_all(self, module: Module) -> None:
             # there's a hook execution issue
             param.ds_active_sub_modules.clear()
             self.__release_param(param)
+            self.__n_available_params = 0
@tjruwase (Contributor) commented Nov 12, 2024

This looks like a hack to me. I think it would be better to understand why the following decrement is insufficient or incorrect:

self.__n_available_params -= param.ds_numel

@wenbinc-Bin (Author) replied:

I explained this in detail in my commit log. I copy it here:
"__n_available_params" is set to 0 in reset_step(), which is called in
backward(). All parameters are released in release_and_reset_all(), which
is called in step(). "__n_available_params" is decremented each time a
parameter is released. This means that if step() is called after
backward(), "__n_available_params" is reduced to a negative number.
"__n_available_params" is used to restrict the number of fetched
parameters, so a negative value means the fetched parameters can exceed
the upper bound ("__max_n_available_params").
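
To make the sequence concrete, an illustrative trace (the numbers are invented):

```python
# Illustrative trace of the accounting bug (values invented):
n_available = 5_000_000   # parameters fetched during forward/backward
# backward() calls reset_step():
n_available = 0           # zeroed, although the params are still allocated
# step() calls release_and_reset_all(), which decrements per release:
n_available -= 5_000_000  # now -5,000,000
# On the next forward, prefetching compares n_available against
# __max_n_available_params, so the negative starting point lets far
# more parameters be fetched than the configured bound.
```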

@wenbinc-Bin (Author) commented Nov 13, 2024

Generally speaking, "__n_available_params" is set to 0 in reset_step(), and this leaves a wrong value for the following release operation.

@tjruwase (Contributor) replied:

Thanks for your explanation. Yes, setting __n_available_params to 0 in reset_step() is a bug, since it becomes inconsistent with the fetched parameter count. Strictly speaking, __n_available_params should be modified only by gather (increment) and release (decrement) operations. Since this is a pre-existing bug, it is not a requirement that you fix it, but it would be greatly appreciated if you could do so. Thanks!
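
For illustration, a simplified sketch of this accounting discipline (not DeepSpeed's actual class; the real coordinator tracks sizes via param.ds_numel):

```python
class ParamBudget:
    """Toy model of the fetched-parameter accounting."""

    def __init__(self, max_available: int) -> None:
        self.max_available = max_available
        self.n_available = 0

    def gather(self, numel: int) -> None:
        self.n_available += numel   # increment only on gather

    def release(self, numel: int) -> None:
        self.n_available -= numel   # decrement only on release

    def can_prefetch(self, numel: int) -> bool:
        # No force-reset anywhere: the counter stays consistent with
        # what is actually allocated, so this bound holds.
        return self.n_available + numel <= self.max_available
```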

@wenbinc-Bin (Author) replied:

OK, I will update the PR later, together with a unit test.

@wenbinc-Bin (Author) added:

I found there is a reason to set __n_available_params = 0: this variable somehow becomes negative after the second step. I need to investigate.

@tjruwase (Contributor) commented:

@wenbinc-Bin, thanks for the PR. Are you able to provide unit tests for this?

@wenbinc-Bin (Author) replied:

> @wenbinc-Bin, thanks for the PR. Are you able to provide unit tests for this?

OK, I will try to add a unit test.

"ds_grads_remaining" is used to triger post_backward_function(). If the
module is called more than once in one training step, this variable will
be initialized every time. If backward() is also called multiple times,
"ds_grads_remaining" will be reduced to a negative number.
post_backward_function() will not be called as expected. This leads to
extra fetch operation or extra memory usage.

Set "ds_grads_remaining" to 0 only when it is not initialized to fix
this issue.`

Signed-off-by: Wenbin Chen <wenbin.chen@intel.com>
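
A minimal sketch of the guarded initialization this commit describes (the hook context is hypothetical; only the hasattr guard reflects the described fix):

```python
# Hypothetical module-hook context; the guard is the essence of the fix.
def _module_hook(module, *args):
    if not hasattr(module, "ds_grads_remaining"):
        module.ds_grads_remaining = 0  # initialize once, never re-zero
```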
Fix a bug where, after the first training step, the allocated parameters
may be larger than "__max_n_available_params".

"__n_available_params" is set to 0 in reset_step() which is called in
backward(). All parameter are released in release_and_reset_all() which
is called in step(). "__n_available_params" is reduced when parameter is
released. These mean if step() is called after backward(),
"__n_available_params" will be reduced to a negative number.
"__n_available_params" is used to restrict fetched parameters, so
negative value leads to a problem that fetched parameter will be larger
than upper bound ("__max_n_available_params").

Do not set "__n_available_params" in reset_step(). It should be modified
only by gather and release operations.

Signed-off-by: Wenbin Chen <wenbin.chen@intel.com>
If the model is run more than once in one training step, there may be
issues. Add a unit test to catch these kinds of problems.

Signed-off-by: Wenbin Chen <wenbin.chen@intel.com>
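
A rough sketch of such a test (pytest-style; the engine and batch fixtures and all model setup are elided and hypothetical):

```python
import torch

def test_two_forwards_per_step(engine, batch):
    # Mimic DPO: two forward passes per step, the second without grad.
    # Run two steps so release_and_reset_all() from step 1 feeds into
    # the prefetch accounting of step 2.
    for _ in range(2):
        out = engine(batch)
        with torch.no_grad():
            engine(batch)
        loss = out.float().pow(2).mean()
        engine.backward(loss)
        engine.step()
```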