Add LARS to SGD and Momentum Optimizers (#6811) #7788
Conversation
af04a8f to b7e819d (Compare)
paddle/operators/momentum_op.cc
Outdated
@@ -93,6 +100,9 @@
velocity = mu * velocity + gradient \\
if (use\_nesterov): \\
  param = param - gradient * learning\_rate + mu * velocity * learning\_rate \\
else: \\
  if (use\_lcoal\_lr): \\
Can we put "if (use_lcoal_lr):" on the same line as the "else"?
"param = param - learning_rate * velocity. " is called under "else."
if “if (use_lcoal_lr)” put under the same line of else , "param = param - learning_rate * velocity. " needs to write twice.
original logic:
else: \
if (use_lcoal_lr): \
learning_rate *= local_gw_ratio * sqrt(sumsq(param))
/ (sqrt(sumsq(gradient))+ weight_decay * sqrt(sumsq(param))) \
param = param - learning_rate * velocity. \
Sorry, I meant: use "else if {...}".
Thanks. Changed the comment from
$$
velocity = mu * velocity + gradient \
if (use_nesterov): \
param = param - gradient * learning_rate + mu * velocity * learning_rate \
else: \
param = param - learning_rate * velocity. \
$$
to
$$
velocity = mu * velocity + gradient \
if (use_nesterov): \
param = param - gradient * learning_rate + mu * velocity * learning_rate \
else if (use_lcoal_lr): \
learning_rate *= local_gw_ratio * sqrt(sumsq(param))
/ (sqrt(sumsq(gradient))+ weight_decay * sqrt(sumsq(param))) \
param = param - learning_rate * velocity. \
else: \
param = param - learning_rate * velocity. \
$$
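For readers following the thread, here is a minimal, standalone sketch of the update the comment above describes (plain illustrative C++, not the operator code; the p_norm > 0 / g_norm > 0 guard is taken from the kernel discussion further below, and all names are illustrative):

#include <cmath>
#include <vector>

// Momentum step (non-Nesterov) with the optional LARS local learning rate.
void MomentumLarsStep(std::vector<float>* param, std::vector<float>* velocity,
                      const std::vector<float>& grad, float learning_rate,
                      float mu, bool use_local_lr, float local_gw_ratio,
                      float weight_decay) {
  float p_norm = 0.f, g_norm = 0.f;
  for (size_t i = 0; i < param->size(); ++i) {
    p_norm += (*param)[i] * (*param)[i];
    g_norm += grad[i] * grad[i];
  }
  p_norm = std::sqrt(p_norm);
  g_norm = std::sqrt(g_norm);
  // LARS: rescale the learning rate by the layer-wise ratio of norms.
  if (use_local_lr && p_norm > 0.f && g_norm > 0.f) {
    learning_rate *= local_gw_ratio * p_norm /
                     (g_norm + weight_decay * p_norm);
  }
  for (size_t i = 0; i < param->size(); ++i) {
    (*velocity)[i] = mu * (*velocity)[i] + grad[i];
    (*param)[i] -= learning_rate * (*velocity)[i];
  }
}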
paddle/operators/momentum_op.cc
Outdated
    .SetDefault(false);
AddAttr<float>("local_gw_ratio", "(float) LARS coefficient")
    .SetDefault(0.001);
AddAttr<float>("weight_decay", "(float) weight decay").SetDefault(0.0005);
AddComment(R"DOC(
Please add a reference for LARS to the operator's doc, along with a bit more detail.
paddle/operators/momentum_op.cc
Outdated
    .SetDefault(false);
AddAttr<float>("local_gw_ratio", "(float) LARS coefficient")
    .SetDefault(0.001);
AddAttr<float>("weight_decay", "(float) weight decay").SetDefault(0.0005);
(float) weight decay
=> (float) LARS weight decay
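Applied to the line under review, the suggested description would read (sketch of just that line):

AddAttr<float>("weight_decay", "(float) LARS weight decay").SetDefault(0.0005);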
paddle/operators/sgd_op.cc
Outdated
$$param\_out = param - learning\_rate * grad$$

This optimizer has a flag for LARS (Layer-Wise Adaptive Rate Scaling) for dense parameter updates.
Same as above.
paddle/operators/momentum_op.h
Outdated
if (use_local_lr) {
  Eigen::TensorFixedSize<T, Eigen::Sizes<>, Eigen::RowMajor,
                         Eigen::DenseIndex>
      p_norm = p.square().sum().sqrt();
Assigning to a tensor may cause Eigen to evaluate eagerly here, which may cause a performance issue.
Thanks for the careful review! If "auto" is used instead of "Tensor", we cannot perform the p_norm > 0 and g_norm > 0 checks because of lazy evaluation; there is a compilation error.
You can add an epsilon to avoid division by zero.
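For example, something along these lines (a sketch only; "epsilon" is an assumed small constant, not part of the code under review):

const T epsilon = static_cast<T>(1e-7);  // assumed small value, not in the PR
local_lr = lr[0] * local_gw_ratio * p_norm(0) /
           (g_norm(0) + weight_decay * p_norm(0) + epsilon);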
It's weird: if I use "auto" to calculate "param_out" as below, some elements of the C++ output become "-nan" and do not match the Python test case output, yet the g_norm/p_norm/local_lr outputs are the same in both Python and C++.
if (use_local_lr) {
  auto p_norm = p.square().sum().sqrt();
  auto g_norm = g.square().sum().sqrt();
  auto local_lr = lr[0] * local_gw_ratio * (p_norm + epsilon) /
                  (g_norm + weight_decay * p_norm + epsilon);
  o = p - local_lr * g;
  std::cout << o;
} else {
  o = p - lr[0] * g;
}
Here is the Python output for param_out:
[0.69548035 0.28497267 0.22662182 0.5497418 0.718321 0.4220689 0.9801117 0.68425083]
Here is the C++ output for param_out:
[ 0.69548 0.286139 -2.00666e+21 0.551315 -4.41094e+34 0.423106 0.980764 0.68483]
If I make the weight and grad matrices very small, such as (4, 5), the test_sgd_op.py unit test passes.
I suspect the expression is a little long and leads to this error. If I use lr[0] in place of local_lr, the param_out discrepancy does not occur.
Any suggestions or comments? Thanks.
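One possible explanation (an assumption, not something confirmed in this thread): with "auto", p_norm, g_norm, and local_lr remain unevaluated 0-dim tensor expressions, and combining them element-wise with the 1-D tensors p and g mixes ranks that Eigen's Tensor module does not broadcast implicitly, so the composed expression can read garbage and produce values like -nan. Materializing the reduction and reading it back as a plain scalar sidesteps the rank mismatch, which is essentially what the fallback in the next comment does:

// Sketch: force evaluation of the 0-dim reduction, then work with a plain T.
Eigen::TensorFixedSize<T, Eigen::Sizes<>, Eigen::RowMajor, Eigen::DenseIndex>
    p_norm_t = p.square().sum().sqrt();
T p_norm_scalar = p_norm_t(0);  // the norm as an ordinary scalar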
If I fall back to using a tensor to calculate local_lr, as below, param_out matches between C++ and Python and the test case passes.
T local_lr = lr[0];
if (use_local_lr) {
  Eigen::TensorFixedSize<T, Eigen::Sizes<>, Eigen::RowMajor,
                         Eigen::DenseIndex> p_norm = p.square().sum().sqrt();
  Eigen::TensorFixedSize<T, Eigen::Sizes<>, Eigen::RowMajor,
                         Eigen::DenseIndex> g_norm = g.square().sum().sqrt();
  local_lr = lr[0] * local_gw_ratio * (p_norm(0) + epsilon) /
             (g_norm(0) + weight_decay * p_norm(0) + epsilon);
}
o = p - local_lr * g;
std::cout << o;
If I use "auto" to calculate local_lr and param_out as below, local_lr matches between the C++ print and the Python unit test case.
But for "o = p - local_lr * g", some elements of "o" are -nan in the C++ print. If I add eval() for local_lr, as in "o = p - local_lr.eval() * g", some elements of "o" are the very strange negative values mentioned in the comments above.
Whether or not I add an epsilon, the Python test cases "TestSGDLARSOp" and "TestMomentumOp3" fail.
So I fell back to using an Eigen tensor to calculate local_lr and removed the epsilon; the Python unit tests pass. Any comments or suggestions? Thanks!
if (use_local_lr) {
  auto p_norm = p.square().sum().sqrt();
  auto g_norm = g.square().sum().sqrt();
  auto local_lr = lr[0] * local_gw_ratio * (p_norm + epsilon) /
                  (g_norm + weight_decay * p_norm + epsilon);
  o = p - local_lr * g;
  std::cout << o;
} else {
  o = p - lr[0] * g;
}
paddle/operators/momentum_op.h
Outdated
@@ -44,11 +47,28 @@ class MomentumOpKernel : public framework::OpKernel<T> {
    auto g = framework::EigenVector<T>::Flatten(*grad);
    auto* lr = learning_rate->data<T>();

    T local_lr = lr[0];
    if (use_local_lr) {
I didn't see a corresponding CUDA implementation for this feature. It would be more proper to implement this as a CPU-only kernel, or to add CUDA support in both the sgd and momentum ops.
Will implement it as a CPU-only kernel first. Thanks.
CI fails because the CUDA implementation is missing.
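A CPU-only registration could look roughly like this (a sketch, assuming the operator registration macros used elsewhere in paddle/operators at that time; the exact template arguments depend on how the kernel is declared):

// Register only the CPU kernel; no GPU kernel is registered until a CUDA
// implementation is added, so the op stays CPU-only.
REGISTER_OP_CPU_KERNEL(momentum, ops::MomentumOpKernel<float>);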
Add LARS to the SGD and Momentum (no Nesterov) optimizers for dense parameter updates.
Closed due to #10374.
@typhoonzero
LARSforFluid_ResNet50.pptx
related #6811