
Add LARS to SGD and Momentum Optimizers (#6811) #7788


Closed
wants to merge 1 commit

Conversation


@moting9 moting9 commented Jan 23, 2018

Add LARS to the SGD and Momentum (no Nesterov) optimizers for dense parameter updates.
@typhoonzero
LARSforFluid_ResNet50.pptx
Related: #6811
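
For reference, the local learning rate that LARS (Layer-wise Adaptive Rate Scaling, You et al., "Large Batch Training of Convolutional Networks", arXiv:1708.03888) applies per layer, restated here from the pseudocode discussed below; the update then uses this local_lr in place of learning_rate for that parameter:

$$
local\_lr = learning\_rate \cdot local\_gw\_ratio \cdot \frac{\lVert param \rVert_2}{\lVert grad \rVert_2 + weight\_decay \cdot \lVert param \rVert_2}
$$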

@typhoonzero typhoonzero self-requested a review January 23, 2018 09:12
@moting9 moting9 force-pushed the develop branch 3 times, most recently from af04a8f to b7e819d Compare January 24, 2018 07:17
@@ -93,6 +100,9 @@ velocity = mu * velocity + gradient \\
if (use\_nesterov): \\
param = param - gradient * learning\_rate + mu * velocity * learning\_rate \\
else: \\
if (use\_local\_lr): \\
Contributor

You could put if (use\_local\_lr): on the same line as the else.

Author

"param = param - learning_rate * velocity. " is called under "else."
if “if (use_lcoal_lr)” put under the same line of else , "param = param - learning_rate * velocity. " needs to write twice.

original logic:
else: \
    if (use_local_lr): \
        learning_rate *= local_gw_ratio * sqrt(sumsq(param))
                         / (sqrt(sumsq(gradient)) + weight_decay * sqrt(sumsq(param))) \
    param = param - learning_rate * velocity. \

Contributor

Sorry, I meant: use else if { ... }.

Author

Thanks. Changed the comment from
$$
velocity = mu * velocity + gradient \
if (use_nesterov): \
param = param - gradient * learning_rate + mu * velocity * learning_rate \
else: \
param = param - learning_rate * velocity. \
$$
to
$$
velocity = mu * velocity + gradient \
if (use_nesterov): \
param = param - gradient * learning_rate + mu * velocity * learning_rate \
else if (use_local_lr): \
learning_rate *= local_gw_ratio * sqrt(sumsq(param))
/ (sqrt(sumsq(gradient))+ weight_decay * sqrt(sumsq(param))) \
param = param - learning_rate * velocity. \
else: \
param = param - learning_rate * velocity. \
$$

.SetDefault(false);
AddAttr<float>("local_gw_ratio", "(float) LARS coefficient")
.SetDefault(0.001);
AddAttr<float>("weight_decay", "(float) weight decay").SetDefault(0.0005);
AddComment(R"DOC(
Contributor

Please add a reference for LARS in the operator's doc, plus a bit more detail.

.SetDefault(false);
AddAttr<float>("local_gw_ratio", "(float) LARS coefficient")
.SetDefault(0.001);
AddAttr<float>("weight_decay", "(float) weight decay").SetDefault(0.0005);
Contributor

(float) weight decay => (float) LARS weight decay


$$param\_out = param - learning\_rate * grad$$

This optimizer has a flag for LARS (Layer-wise Adaptive Rate Scaling) for dense parameter updates.
Contributor

Same as above.

    if (use_local_lr) {
      Eigen::TensorFixedSize<T, Eigen::Sizes<>, Eigen::RowMajor,
                             Eigen::DenseIndex>
          p_norm = p.square().sum().sqrt();
Author

Thanks for the careful review! If "auto" is used instead of "Tensor", we cannot perform the p_norm > 0 and g_norm > 0 checks: due to lazy evaluation there is a compilation error.
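
A minimal sketch of the point above, assuming Eigen's unsupported Tensor module; the function name and the float instantiation are illustrative, not taken from the PR. Materializing the reduction into a rank-0 TensorFixedSize yields an ordinary scalar via p_norm(0) that can be compared, while the lazy auto expression cannot be used directly in an if:

  #include <unsupported/Eigen/CXX11/Tensor>

  void norm_check_sketch(const Eigen::Tensor<float, 1, Eigen::RowMajor>& p) {
    // auto p_norm = p.square().sum().sqrt();  // lazy expression: `if (p_norm > 0)`
    //                                         // would not compile (no bool conversion)
    Eigen::TensorFixedSize<float, Eigen::Sizes<>, Eigen::RowMajor,
                           Eigen::DenseIndex>
        p_norm = p.square().sum().sqrt();      // forces evaluation into a rank-0 tensor
    if (p_norm(0) > 0.0f) {                    // plain float comparison now works
      // ... apply the LARS local learning rate ...
    }
  }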

Contributor

You could add an epsilon to avoid division by zero.

Author

It's weird: if I use "auto" to calculate param_out as below, some elements of the C++ output are -nan and don't match the Python test-case output, yet the g_norm / p_norm / local_lr values are the same in both Python and C++.

  if (use_local_lr) {
    auto p_norm = p.square().sum().sqrt();
    auto g_norm = g.square().sum().sqrt();
    auto local_lr = lr[0] * local_gw_ratio * (p_norm + epsilon) /
                    (g_norm + weight_decay * p_norm + epsilon);
    o = p - local_lr * g;
    std::cout << o;
  } else {
    o = p - lr[0] * g;
  }

Here is the Python output for param_out:
[0.69548035 0.28497267 0.22662182 0.5497418 0.718321 0.4220689 0.9801117 0.68425083]

Here is the C++ output for param_out:
[ 0.69548 0.286139 -2.00666e+21 0.551315 -4.41094e+34 0.423106 0.980764 0.68483]

If I set the matrix size of weight and grad to something very small, such as (4, 5), the test_sgd_op.py unit test passes.

I'm afraid the expression is a little long and that leads to this error; if I use lr[0] in place of local_lr, the param_out discrepancy does not occur.

Any suggestions or comments? Thanks.

Author

If I fall back to using a Tensor to calculate local_lr, as below, param_out matches between C++ and Python and the test case passes.
  T local_lr = lr[0];
  if (use_local_lr) {
    Eigen::TensorFixedSize<T, Eigen::Sizes<>, Eigen::RowMajor,
                           Eigen::DenseIndex>
        p_norm = p.square().sum().sqrt();
    Eigen::TensorFixedSize<T, Eigen::Sizes<>, Eigen::RowMajor,
                           Eigen::DenseIndex>
        g_norm = g.square().sum().sqrt();
    local_lr = lr[0] * local_gw_ratio * (p_norm(0) + epsilon) /
               (g_norm(0) + weight_decay * p_norm(0) + epsilon);
  }
  o = p - local_lr * g;
  std::cout << o;

Author

If I use "auto" to calculate local_lr an param_out as below, local_lr can match in c++ print and python unit test case.
But for "o = p - local_lr * g", some elements in "o" are -nan in c++ print. I add eval for local_lr such as "o = p - local_lr.eval() * g", some element in "o" are very strange negative values as mentioned in comments above.
whether or not I add epsilon, the python test case " TestSGDLARSOp" and "TestMomentumOp3" failed.
So I fall back to use Eigen tensor to calculate local_lr and removed epsilon. python unit test can pass. Any comments or suggestion? Thanks!

  if (use_local_lr) {
    auto p_norm = p.square().sum().sqrt();
    auto g_norm = g.square().sum().sqrt();
    auto local_lr = lr[0] * local_gw_ratio * (p_norm + epsilon) /
                    (g_norm + weight_decay * p_norm + epsilon);
    o = p - local_lr * g;
    std::cout << o;
  } else {
    o = p - lr[0] * g;
  }
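
A possible middle ground, sketched here rather than taken from the PR (the helper name and the default epsilon value are made up): keep the materialized TensorFixedSize norms that make the tests pass, but reinstate the epsilon suggested above so the denominator can never be exactly zero.

  #include <unsupported/Eigen/CXX11/Tensor>

  // Hypothetical helper: LARS local learning rate computed from materialized norms.
  template <typename T>
  T lars_local_lr(const Eigen::Tensor<T, 1, Eigen::RowMajor>& p,
                  const Eigen::Tensor<T, 1, Eigen::RowMajor>& g,
                  T lr, T local_gw_ratio, T weight_decay,
                  T epsilon = static_cast<T>(1e-7)) {
    Eigen::TensorFixedSize<T, Eigen::Sizes<>, Eigen::RowMajor,
                           Eigen::DenseIndex>
        p_norm = p.square().sum().sqrt();  // ||param||
    Eigen::TensorFixedSize<T, Eigen::Sizes<>, Eigen::RowMajor,
                           Eigen::DenseIndex>
        g_norm = g.square().sum().sqrt();  // ||grad||
    // Plain scalar arithmetic from here on, so no lazy Eigen expression is
    // broadcast into the parameter update.
    return lr * local_gw_ratio * p_norm(0) /
           (g_norm(0) + weight_decay * p_norm(0) + epsilon);
  }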

@@ -44,11 +47,28 @@ class MomentumOpKernel : public framework::OpKernel<T> {
    auto g = framework::EigenVector<T>::Flatten(*grad);
    auto* lr = learning_rate->data<T>();

    T local_lr = lr[0];
    if (use_local_lr) {
Contributor

I didn't see a corresponding CUDA implementation for this feature, so it would be more appropriate to either implement this as a CPU-only kernel ("CPU_ONLY_KERNEL") or add CUDA support to both the sgd and momentum ops.

Author

Will implement it as a "CPU_ONLY_KERNEL" first. Thanks.
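
For context, a rough sketch of what a CPU-only registration might look like; the file path, macro usage, and kernel type list are assumptions based on the operator layout of that era, not taken from this PR:

  // Hypothetical momentum_op.cc: only CPU kernels are registered and no .cu
  // file is added, so there is no CUDA kernel for the framework to dispatch to.
  #include "paddle/operators/momentum_op.h"

  namespace ops = paddle::operators;
  REGISTER_OP_CPU_KERNEL(momentum, ops::MomentumOpKernel<float>,
                         ops::MomentumOpKernel<double>);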

@typhoonzero
Contributor

CI fails because the CUDA implementation is missing.

Add LARS to the SGD and Momentum (no Nesterov) optimizers for dense parameter updates
@luotao1
Contributor

luotao1 commented Jun 19, 2018

Closing due to #10374.

@luotao1 luotao1 closed this Jun 19, 2018