fix: fix qadam NAN problem #654
Conversation
(force-pushed from 0572355 to 35e6ddc)
This PR is ready, please take a look.
(force-pushed from 343516e to 088d0ca)
I think maybe it's better to keep the `weight_decay` and set it to 0 by default, for consistency with the original Adam optimizer.
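For illustration, here is a minimal sketch (not the actual Bagua code; the class name, `warmup_steps`, and other parameters besides `weight_decay` are assumptions) of keeping `weight_decay` in the constructor with a default of 0, mirroring the `torch.optim.Adam` signature:

```python
import torch


class QAdamLikeOptimizer(torch.optim.Optimizer):
    """Sketch only: keeps `weight_decay` in the signature, defaulting to 0."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0, warmup_steps=100):
        # `weight_decay=0` matches torch.optim.Adam, so the option exists
        # but has no effect unless the user opts in.
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay, warmup_steps=warmup_steps)
        super().__init__(params, defaults)
```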
Thanks @ganshaoduo, I'm wondering how we use `weight_decay` in the current implementation?
Well spotted! Actually, according to the current implementation, we only apply the `weight_decay` in the warm-up phase but not in the compression phase. This change of […]

About the necessity of `weight_decay`, I would say it does not have a direct impact on convergence. All it does is add a small value to the gradient, which won't hurt the theoretical analysis of QAdam in my view. Therefore, the question here is whether we add the `weight_decay` operation here so that we use the same weight decay during both the warm-up and compression phases, OR we delete the `weight_decay` operation so that we do not consider weight decay at all. I would prefer the former.
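As a rough sketch of the former option (hypothetical helper and callable names, not the actual QAdam implementation), the Adam-style decay term would be folded into the gradient on every step, so the warm-up and compression phases see the same regularization:

```python
import torch


def apply_weight_decay(grad: torch.Tensor, param: torch.Tensor,
                       weight_decay: float) -> torch.Tensor:
    # Adam-style L2 decay: grad <- grad + weight_decay * param.
    if weight_decay != 0:
        grad = grad.add(param, alpha=weight_decay)
    return grad


def step_sketch(param, weight_decay, in_warmup, warmup_update, compressed_update):
    # Decay is applied before either branch, so both phases use the same
    # regularized gradient (`warmup_update` / `compressed_update` are
    # placeholder callables for the two phases).
    grad = apply_weight_decay(param.grad, param.data, weight_decay)
    if in_warmup:
        warmup_update(param, grad)       # full-precision Adam update
    else:
        compressed_update(param, grad)   # quantized/compressed update
```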
Thanks for your detailed explanations, Shaoduo. Yes, I added the `weight_decay` operation to the compression phase as well.
The current implementation of […]

Any opinions from other reviewers?
After discussing offline, we decided to roll back the modifications to `weight_decay`.
fix #485