Fix potential memory issues when using DeepSpeed ZeRO-3 (Z3) #6726
base: master
Conversation
@@ -430,6 +430,7 @@ def release_and_reset_all(self, module: Module) -> None:
                # there's a hook execution issue
                param.ds_active_sub_modules.clear()
                self.__release_param(param)
+        self.__n_available_params = 0
This looks like a hack to me. I think it would be better to understand why the following decrement is insufficient or incorrect.
self.__n_available_params -= param.ds_numel
I explained this in detail in my commit log; I'll copy it here:
"__n_available_params" is set to 0 in reset_step(), which is called in backward(). All parameters are released in release_and_reset_all(), which is called in step(). "__n_available_params" is reduced whenever a parameter is released. This means that if step() is called after backward(), "__n_available_params" is reduced to a negative number. "__n_available_params" is used to restrict fetched parameters, so a negative value means the number of fetched parameters can exceed the upper bound ("__max_n_available_params").
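To make the sequence concrete, here is a minimal sketch of how the counter is driven negative (made-up numbers, not the actual DeepSpeed code):

```python
# Hypothetical walk-through of the buggy sequence described above.
n_available_params = 0
param_numels = [1000, 2000]        # parameters gathered during this step

# forward/backward: each gather increments the counter
for numel in param_numels:
    n_available_params += numel    # 3000 after the gathers

# backward() calls reset_step(), which force-resets the counter
n_available_params = 0

# step() calls release_and_reset_all(), which releases every parameter
for numel in param_numels:
    n_available_params -= numel    # now -3000

# the next step's fetch-budget check compares against this negative value,
# so far more than __max_n_available_params can be fetched
print(n_available_params)          # -3000
```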
In short, "__n_available_params" is set to 0 in reset_step(), which leaves a wrong value for the release operations that follow.
self.__n_available_params = 0
Thanks for your explanation. Yes, setting __n_available_params to 0 in reset_step() is a bug since it is inconsistent with the fetched parameter count. Strictly speaking, __n_available_params should be modified only by gather (increment) and release (decrement) operations. Since this is a pre-existing bug, fixing it is not a requirement for this PR, but it would be greatly appreciated if you could do so. Thanks!
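As a sketch of that invariant (illustrative names only, not the DeepSpeed implementation), the counter would be touched only in the gather and release paths:

```python
# Hedged sketch: the counter changes only at gather/release sites and is
# never force-reset, so it always matches the fetched-parameter count.
class ParamCounter:
    def __init__(self, max_n_available_params: int):
        self.max_n_available_params = max_n_available_params
        self.n_available_params = 0

    def on_gather(self, numel: int) -> None:
        # the fetch budget is enforced against this counter, so it must
        # reflect what is actually resident
        self.n_available_params += numel

    def on_release(self, numel: int) -> None:
        self.n_available_params -= numel
        assert self.n_available_params >= 0, "counter went negative"
```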
OK, I will update the PR later together with a unit test.
I found there is a reason to set __n_available_params = 0: the variable somehow becomes negative after the second step. I need to investigate.
@wenbinc-Bin, thanks for the PR. Are you able to provide unit tests for this?
OK, I will try to add a unit test.
"ds_grads_remaining" is used to triger post_backward_function(). If the module is called more than once in one training step, this variable will be initialized every time. If backward() is also called multiple times, "ds_grads_remaining" will be reduced to a negative number. post_backward_function() will not be called as expected. This leads to extra fetch operation or extra memory usage. Set "ds_grads_remaining" to 0 only when it is not initialized to fix this issue.` Signed-off-by: Wenbin Chen <wenbin.chen@intel.com>
Fix a bug where, after the first training step, the number of allocated parameters may be bigger than "__max_n_available_params". "__n_available_params" is set to 0 in reset_step(), which is called in backward(). All parameters are released in release_and_reset_all(), which is called in step(). "__n_available_params" is reduced whenever a parameter is released. This means that if step() is called after backward(), "__n_available_params" is reduced to a negative number. "__n_available_params" is used to restrict fetched parameters, so a negative value means the number of fetched parameters can exceed the upper bound ("__max_n_available_params"). Do not set "__n_available_params" in reset_step(); it should be modified only by gather and release operations. Signed-off-by: Wenbin Chen <wenbin.chen@intel.com>
If the model is run more than once in one training step, there may be issues. Add a unit test to catch these kinds of problems. Signed-off-by: Wenbin Chen <wenbin.chen@intel.com>
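A minimal sketch of such a test (illustrative only; the fixtures tmp_model and ds_config are assumptions, not the actual test added in this PR):

```python
# Hypothetical ZeRO-3 test: run the module twice per step, the second time
# under no_grad(), to mimic a DPO-style loop; repeat for several steps so
# counter corruption from the first step would surface in later ones.
import torch
import deepspeed


def test_forward_twice_per_step(tmp_model, ds_config):
    engine, _, _, _ = deepspeed.initialize(model=tmp_model, config=ds_config)
    for _ in range(3):
        batch = torch.randn(4, 16, device=engine.device)
        loss = engine(batch).sum()        # first call, with grad
        with torch.no_grad():
            engine(batch)                 # second call, no grad
        engine.backward(loss)
        engine.step()
```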
I hit an OOM problem when doing DPO training with ZeRO-3. DPO needs to call the module twice in one training step, and the second call is under no_grad().
The problem is caused by two bugs:
1. "ds_grads_remaining" is re-initialized on every forward call, so when the module is called (and backward is run) more than once per step it can go negative and post_backward_function() is not triggered as expected, leading to extra fetches and extra memory usage.
2. "__n_available_params" is set to 0 in reset_step() and then decremented again in release_and_reset_all(), so it goes negative and "__max_n_available_params" no longer bounds the number of fetched parameters.
I created two patches to fix these issues.