Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

fix optimizer bug in CPP-Package #9809

Merged
merged 7 commits into from
Feb 27, 2018
Merged

fix optimizer bug in CPP-Package #9809

merged 7 commits into from
Feb 27, 2018

Conversation

chsin
Copy link
Contributor

@chsin chsin commented Feb 16, 2018

Description

Regards issue #9800. The bug was that CPP-Package won't allow more than one optimizer to be created. See comments for more details.

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented:
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Comments

I narrowed my example code down to this snip of code:

#include "mxnet-cpp/MxNetCpp.h"
int main(int argc, char** argv) {
  Optimizer* opt = OptimizerRegistry::Find("sgd");
  opt = OptimizerRegistry::Find("adam");
  return 0;
}

To produce this error:

terminate called after throwing an instance of 'dmlc::Error'
  what():  [18:02:48] /scratch/cahsin/repo/incubator-mxnet/cpp-package/include/mxnet-cpp/optimizer.hpp:141: Check failed: cmap().count(name) == 0 (1 vs. 0) sgd already registered

Stack trace returned 7 entries:
[bt] (0) ./mlp_cpu(_ZN4dmlc10StackTraceEv+0x42) [0x411b41]
[bt] (1) ./mlp_cpu(_ZN4dmlc15LogMessageFatalD1Ev+0x1b) [0x411db1]
[bt] (2) ./mlp_cpu() [0x41538f]
[bt] (3) ./mlp_cpu() [0x414de8]
[bt] (4) ./mlp_cpu() [0x411231]
[bt] (5) /lib64/libc.so.6(__libc_start_main+0xfd) [0x3449c1ed1d]
[bt] (6) ./mlp_cpu() [0x410ad9]

Aborted (core dumped)

I am using adam there as an example, but any two calls to OptimizerRegistry::Find will result in the error. The problem is here:
https://github.com/apache/incubator-mxnet/blob/master/cpp-package/include/mxnet-cpp/optimizer.hpp#L127

    inline Optimizer* OptimizerRegistry::Find(const std::string& name) {
      MXNETCPP_REGISTER_OPTIMIZER(sgd, SGDOptimizer);
      MXNETCPP_REGISTER_OPTIMIZER(ccsgd, SGDOptimizer);  // For backward compatibility
      MXNETCPP_REGISTER_OPTIMIZER(rmsprop, RMSPropOptimizer);
      MXNETCPP_REGISTER_OPTIMIZER(adam, AdamOptimizer);
      MXNETCPP_REGISTER_OPTIMIZER(adagrad, AdaGradOptimizer);
      MXNETCPP_REGISTER_OPTIMIZER(adadelta, AdaDeltaOptimizer);
      MXNETCPP_REGISTER_OPTIMIZER(signum, SignumOptimizer);
      auto it = cmap().find(name);
      if (it == cmap().end())
        return nullptr;
      return it->second();
    }

https://github.com/apache/incubator-mxnet/blob/master/cpp-package/include/mxnet-cpp/optimizer.h#L133

    #define MXNETCPP_REGISTER_OPTIMIZER(Name, OptimizerType)\
      OptimizerRegistry::__REGISTER__(#Name, [](){return new OptimizerType();})

https://github.com/apache/incubator-mxnet/blob/master/cpp-package/include/mxnet-cpp/optimizer.hpp#L141

    inline int OptimizerRegistry::__REGISTER__(const std::string& name, OptimizerCreator creator) {
      CHECK_EQ(cmap().count(name), 0) << name << " already registered";
      cmap().emplace(name, std::move(creator));
      return 0;
    }

This means users can only call OptimizerRegistry::Find once because every time they want to create an optimizer object, OptimizerRegistry::Find re-registers everything. This is abnormal. If you look into what the Python API does optimizer.py, you observe that def register(klass) isn't called by def create_optimizer(name, **kwargs). In fact, when the CPP-Package was first added in Integrate cpp package (#5251), the registering of optimizers was not placed into the creation:

std::map<std::string, OptimizerCreator> OptimizerRegistry::cmap_;

MXNETCPP_REGISTER_OPTIMIZER(sgd, SGDOptimizer);
MXNETCPP_REGISTER_OPTIMIZER(ccsgd, SGDOptimizer);  // For backward compatibility

Optimizer* OptimizerRegistry::Find(const std::string& name) {
  auto it = cmap_.find(name);
  if (it == cmap_.end())
    return nullptr;
  return it->second();
}

The embedding of register in Find happened with the merge of [cpp-package] Fix multiple definition issue. Since there is a justification for the strange structure, I just put in a small fix that would register the optimizer just once.

@chsin chsin changed the title register optimizers only once register optimizers only once in CPP-Package Feb 16, 2018
@marcoabreu
Copy link
Contributor

Could you please add a test for this case?

@chsin
Copy link
Contributor Author

chsin commented Feb 16, 2018

@marcoabreu, OK, where are the instructions for adding a unit test to the cpp package? I see that in https://github.com/apache/incubator-mxnet/blob/master/cpp-package/tests/ci_test.sh
there's just one test and it's in /cpp-package/example? Shouldn't the tests be in /cpp-package/tests? Does CI just build everything in /cpp-package/example so anything there can be run by /cpp-package/tests/ci_test.sh after the appropriate edit to https://github.com/apache/incubator-mxnet/blob/master/cpp-package/example/CMakeLists.txt?

@marcoabreu
Copy link
Contributor

Oh you're right, we don't have a structure to test cpp package properly. My bad...

But yes, everything in cpp-package should be compiled. Feel free to play around with it and give me heads up if you need assistance. https://github.com/apache/incubator-mxnet/blob/master/cpp-package/tests/ci_test.sh is the right place to start.

@chsin chsin requested a review from marcoabreu as a code owner February 16, 2018 23:57
@marcoabreu
Copy link
Contributor

marcoabreu commented Feb 17, 2018

@larroy could you review?

@chsin
Copy link
Contributor Author

chsin commented Feb 17, 2018

@marcoabreu, I don't understand how the Python test would fail when I never made an edit to the Python package? Could you explain?

@chsin
Copy link
Contributor Author

chsin commented Feb 17, 2018

@marcoabreu, I never touched the Python and Gluon package, why do I get this:

======================================================================
FAIL: test_gluon_model_zoo_gpu.test_training
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/workspace/tests/python/gpu/test_gluon_model_zoo_gpu.py", line 150, in test_training
    assert_almost_equal(cpu_out.asnumpy(), gpu_out.asnumpy(), rtol=1e-2, atol=1e-2)
  File "/workspace/python/mxnet/test_utils.py", line 495, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 76.641708 exceeds tolerance rtol=0.010000, atol=0.010000.  Location of maximum error:(9, 304), a=0.754152, b=-0.052508
 a: array([[ 0.45648926, -0.19809955,  0.35798055, ...,  1.0708873 ,
        -0.3380539 , -0.22070615],
       [ 0.42138988,  0.02357741,  0.38420224, ...,  1.0584493 ,...
 b: array([[ 0.28696495, -0.17991166,  0.37088192, ...,  1.0100358 ,
        -0.3728746 , -0.14392784],
       [ 0.27283525,  0.03558442,  0.45698166, ...,  0.9900025 ,...
-------------------- >> begin captured logging << --------------------
urllib3.connectionpool: DEBUG: Starting new HTTP connection (1): data.mxnet.io
urllib3.connectionpool: DEBUG: http://data.mxnet.io:80 "GET /data/val-5k-256.rec HTTP/1.1" 200 150874780
root: INFO: downloaded http://data.mxnet.io/data/val-5k-256.rec into data/val-5k-256.rec successfully
--------------------- >> end captured logging << ---------------------

@marcoabreu
Copy link
Contributor

Since we are using randomized data in our unit tests, they tend to be flaky by nature. I have retriggered the build and create an issue for this test #9820

@chsin
Copy link
Contributor Author

chsin commented Feb 19, 2018

@marcoabreu, test was made, checks passes, now what?

Copy link
Contributor

@marcoabreu marcoabreu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small change in the code, otherwise LGTM. Would like to have this by a person familiar with the interface reviewed


MXNotifyShutdown();
return ret;
return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double return statement

MXNETCPP_REGISTER_OPTIMIZER(adagrad, AdaGradOptimizer);
MXNETCPP_REGISTER_OPTIMIZER(adadelta, AdaDeltaOptimizer);
MXNETCPP_REGISTER_OPTIMIZER(signum, SignumOptimizer);
if (cmap().empty()) {
Copy link
Contributor

@lebeg lebeg Feb 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are checking all optimizers at once, but wouldn't it be better to check every optimizer separately? AFAICT Registry::__REGISTER__ is used, but Registry::__REGISTER_OR_GET__ would make more sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optimizers are all registered at once in the original CPP-Package code and in the Python code using decorators. In the original code #5251, they were all registered at once in the file, so when the program was initialized, all optimizers would already be registered (which is the case in Python), not in the first call to the function Find, but that caused multiple definition errors, so in #5545, the were all registered at once in the first call to Find, which caused the inability to have more than one optimizer bc they would be re-registered with each call, so this change makes sure they are registered once. Why check every optimizer separately when they should all be registered at the same time and only once?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's strange for a look-up function to register optimizers. Would it make more sense to have a macro to handle the multiple-definition problem for built-in optimizers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not exactly sure what you mean, but it sounds like you want to revert to the original code, but that would break #5545. @sifmelcara why in #5545 did you
"3. In optimizer.hpp, I delayed the registration of sgd optimizer to the first call of OptimizerRegistry::Find."?

@chsin
Copy link
Contributor Author

chsin commented Feb 20, 2018

@ marcoabreu, made the change, now what?

@marcoabreu
Copy link
Contributor

marcoabreu commented Feb 20, 2018 via email

@chsin
Copy link
Contributor Author

chsin commented Feb 21, 2018

@marcoabreu, I don't know who you're waiting for, so I'll ask the CPP-Package creator: @lx75249, Xin Li, you wrote the original CPP package, are you OK with this change?

Copy link
Contributor

@marcoabreu marcoabreu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved in terms of CI

@marcoabreu
Copy link
Contributor

I'm not familiar with the CPP package so I can't really review this part.

@chsin chsin changed the title register optimizers only once in CPP-Package fix optimizer bug in CPP-Package Feb 21, 2018
@sifmelcara
Copy link
Contributor

sifmelcara commented Feb 22, 2018

Hello! Just got a ping from @chsin for this PR. I'm the author of the ODR fix.
After some git blame, I found this bug was accidentally introduced by 1c22bc7 which tries to fix a compiler warning of return value not used... But we need static int __make_ ## OptimizerType ## _ ## Name ## __ = for a reason: it is used to make sure optimizers are only registered once.
So either add back static int __make_ ## OptimizerType ## _ ## Name ## __ = or merge this PR should be able to fix #9800 .

However, as @szha mentioned, I wonder if there is a better way to register the optimizers? I guess we can use global inline variable but that would require C++17

@sifmelcara
Copy link
Contributor

Hmm, in fact I think adding back static int __make_ ## OptimizerType ## _ ## Name ## __ = is a little bit better because local static variable's initialization is guaranteed to be thread safe.

@marcoabreu
Copy link
Contributor

@larroy

@chsin
Copy link
Contributor Author

chsin commented Feb 23, 2018

But we'll still have the weirdness of initializing with Find that @szha is unhappy about..

@conopt
Copy link
Contributor

conopt commented Feb 26, 2018

Unfortunately we don't have static constructors in c++, and that's why the initialization becomes so weird. Emulating a static constructor will make the code more confusing, but registering optimizers in Find is as acceptable as creating an instance in singleton getters.

@piiswrong
Copy link
Contributor

This is at least better than the current solution. So merged

@piiswrong piiswrong merged commit 7a0509d into apache:master Feb 27, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* opt register only once

* lint pass

* opt register only once

lint pass

* add test_optimizer.cpp

* fix warning

* typo
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
* opt register only once

* lint pass

* opt register only once

lint pass

* add test_optimizer.cpp

* fix warning

* typo
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants