Change `customize_loss_grad` to `use_default_grad_scale`. #10223
Conversation
Cool
Also update the transformer model?
@@ -46,6 +46,10 @@ def __init__(self,
            improve performance in some cases, defalut False.
        share_vars_from(ParallelExecutor, default None): If provied,
            it will share variables from the specified ParallelExecutor.
        use_default_grad_scale(bool, default True): If set True, a default
            scale value equal to `1./device_count` would be multiplied to
            the gradients. Otherwise, a customized scale value should be
to gradients of each device? and then aggregated?
Thanks, followed the comment.
        use_default_grad_scale(bool, default True): If set True, a default
            scale value equal to `1./device_count` would be multiplied to
            the gradients. Otherwise, a customized scale value should be
            feeded to the network.
feeded->fed?
Done.
Thanks for your comments; I will update the transformer model after this PR is merged.
Resolves #10219
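For context, here is a rough usage sketch of the new flag. It assumes the Fluid API of this era; the network, feed names, and the `feed_dict` keyword of `ParallelExecutor.run` are illustrative and not taken from this PR.

```python
import numpy
import paddle.fluid as fluid

# Tiny network; everything except use_default_grad_scale is illustrative.
image = fluid.layers.data(name='image', shape=[784], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
prediction = fluid.layers.fc(input=image, size=10, act='softmax')
loss = fluid.layers.mean(
    fluid.layers.cross_entropy(input=prediction, label=label))
fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

fluid.Executor(fluid.CUDAPlace(0)).run(fluid.default_startup_program())

# use_default_grad_scale=True (the default): each device's gradients are
# multiplied by 1./device_count before being aggregated, so nothing extra
# has to be fed. With use_default_grad_scale=False, a customized scale
# value must be fed to the network instead (the exact feed key for the
# loss gradient is version-dependent and not shown here).
pe = fluid.ParallelExecutor(use_cuda=True,
                            loss_name=loss.name,
                            use_default_grad_scale=True)

feed = {
    'image': numpy.random.rand(32, 784).astype('float32'),
    'label': numpy.random.randint(0, 10, size=(32, 1)).astype('int64'),
}
loss_value, = pe.run(fetch_list=[loss.name], feed_dict=feed)
print(loss_value)
```

Note that the sense of the flag is inverted relative to the old `customize_loss_grad` argument: the new default (`True`) keeps the built-in `1./device_count` scaling.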