Fix a bug in getting MKLDNN memory #10731

zheng-da · 2018-04-27T22:42:21Z

Description

This PR fixes the bug reported in #10580.
The bug happens when an NDArray already has an MKLDNN memory with a special MKLDNN format. In some cases (e.g., group convolution), the MKLDNN memory uses 5 dimensions while the NDArray still uses 4. When GetMKLDNNData is called in this situation, SetMKLMem discards the original MKLDNN memory because the dimensions of the MKLDNN memory and the NDArray are incompatible. This corrupts the data in the NDArray.
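
To make the dimension mismatch concrete, here is a minimal Python sketch. It is not MXNet or MKLDNN code. It assumes the common grouped-convolution layout in which a 4-D weight of shape (O, I/G, H, W) is viewed as a 5-D tensor (G, O/G, I/G, H, W); the helper names `grouped_weight_shape` and `same_data_different_view` are hypothetical and exist only for illustration.

```python
from math import prod


def grouped_weight_shape(shape4d, num_group):
    """Return the 5-D (G, O/G, I/G, H, W) view of a 4-D (O, I/G, H, W) weight shape."""
    out_channels, in_per_group, height, width = shape4d
    if out_channels % num_group != 0:
        raise ValueError("output channels must be divisible by num_group")
    return (num_group, out_channels // num_group, in_per_group, height, width)


def same_data_different_view(shape_a, shape_b):
    """True when two shapes differ in rank but hold the same number of elements."""
    return len(shape_a) != len(shape_b) and prod(shape_a) == prod(shape_b)


# Assumed example: 32 output channels, 4 groups, 3x3 kernels.
ndarray_shape = (32, 4, 3, 3)
mkldnn_shape = grouped_weight_shape(ndarray_shape, num_group=4)
print(mkldnn_shape)
print(same_data_different_view(ndarray_shape, mkldnn_shape))
```

A check that compares only the number of dimensions would treat these two shapes as incompatible even though they describe the same buffer, which is the situation described above.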

The test code in this PR reproduces the bug in #10580.

This PR also adds C++ unit tests intended to cover all combinations of getting MKLDNN memory from an NDArray.

Thanks to @TaoLv and @pengzhao-intel for finding the root cause of this bug.

marcoabreu · 2018-04-27T23:12:16Z

tests/python/gpu/test_gluon_model_zoo_gpu.py

-        gpu_max_val = np.max(np.abs(gpu_out.asnumpy()))
-        eprint(model_name + ": CPU " + str(max_val) + ", GPU " + str(gpu_max_val))
-        assert_almost_equal(out / max_val, gpu_out.asnumpy() / max_val, rtol=1e-3, atol=1e-3)
+        for i in range(5):


Could you elaborate on this change?

This is one way of reproducing the bug in #10580.
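
The idea of reproducing the bug by running inference several times can be sketched as follows. This is not the test code from the PR (the diff above is truncated after the loop header); `check_repeatable`, `max_abs_diff`, and `FlakyModel` are hypothetical names, and the outputs are plain Python lists rather than NDArrays.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


def check_repeatable(run_inference, repeats=5, tol=1e-3):
    """Run the same inference several times and fail if any run drifts from the first."""
    reference = run_inference()
    for i in range(1, repeats):
        diff = max_abs_diff(reference, run_inference())
        if diff > tol:
            raise AssertionError(f"run {i} differs from run 0 by {diff}")
    return True


class FlakyModel:
    """Stand-in for a model whose cached state is corrupted after the first call."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        return [0.1, 0.2, 0.3] if self.calls == 1 else [0.1, 0.9, 0.3]


check_repeatable(lambda: [0.1, 0.2, 0.3])  # a stable model passes
try:
    check_repeatable(FlakyModel())
except AssertionError as err:
    print("detected unstable inference:", err)
```

A single forward pass cannot catch this class of bug, because the first call still sees intact data; only later calls read the corrupted state.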

zheng-da · 2018-04-28T23:42:20Z

@marcoabreu can you review this PR?

marcoabreu

.

marcoabreu · 2018-04-30T06:55:29Z

ci/docker/runtime_functions.sh

    cmake \
        -DUSE_CUDA=1               \
        -DUSE_CUDNN=1              \
        -DUSE_MKLML_MKL=1          \
        -DUSE_MKLDNN=1             \
        -DCMAKE_BUILD_TYPE=Release \
+        -DARCH_OPT_FLAGS="-mtune=generic" \


This was never a problem before and is counterintuitive to the user. It should be possible to compile on one instance type and execute on another without having to specify options like this one. I'd appreciate it if you could track this down.

This problem is caused by how your CI system is set up. The code is compiled on C5, which supports AVX512, and runs on g3, which doesn't support AVX512. By default, MKLDNN optimizes specifically for the architecture on which it is compiled. There is a similar setup for the Makefile build.
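
A quick way to see whether a build host and a run host differ in this respect is to compare their CPU feature flags. The sketch below parses `/proc/cpuinfo`-style text (Linux, x86) and reports features present on the build host but missing on the run host. The helper names and the two sample strings are made up for illustration; on a real machine the text would come from reading `/proc/cpuinfo`.

```python
def cpu_flags(cpuinfo_text):
    """Parse the first 'flags' line of /proc/cpuinfo-style text into a set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(build_flags, run_flags, features=("avx512f",)):
    """List features the build host has but the run host lacks."""
    return [f for f in features if f in build_flags and f not in run_flags]


# Made-up excerpts standing in for the two hosts.
build_host = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 avx512f\n"
run_host = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2\n"

print(missing_features(cpu_flags(build_host), cpu_flags(run_host)))
```

A non-empty result means a binary tuned for the build host may use instructions the run host cannot execute.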

I see, so MKLDNN assumes that the runtime architecture matches the build-time architecture. I think this will cause quite a few problems for our users in production environments, considering that most will build their binaries in a build fleet and deploy them to a production fleet with a different configuration and hardware.

Could you please elaborate on how users are made aware of this problem and how we instruct them to solve it? Also, where is ARCH_OPT_FLAGS being consumed?

I just edited CMakeLists.txt to turn on "-mtune=generic" for MKLDNN by default.

It would be good if @piiswrong @szha @cjolivier01 could look into the possible impact of that change.

The same flag is used in prepare_mkldnn.sh, which is used by the Makefile.

https://github.com/apache/incubator-mxnet/blob/master/prepare_mkldnn.sh#L96

The difference here is that the linked script is only for the compilation of MKLDNN - maybe @pengzhao-intel can elaborate. Here, you're changing the compilation of the entire MXNet library.

If this is what you worry about, I have changed CMakeLists.txt to only compile MKLDNN with the flag.
https://github.com/apache/incubator-mxnet/pull/10731/files#diff-af3b638bc2a3e6c650974192a53c7291R162

marcoabreu · 2018-04-30T06:56:58Z

ci/docker/runtime_functions.sh

        -G Ninja                   \
        /work/mxnet

    ninja -v
+    # libmkldnn.so.0 is a link file. We need an actual binary file named libmkldnn.so.0.
+    cp 3rdparty/mkldnn/src/libmkldnn.so.0 3rdparty/mkldnn/src/libmkldnn.so.0.tmp
+    mv 3rdparty/mkldnn/src/libmkldnn.so.0.tmp 3rdparty/mkldnn/src/libmkldnn.so.0


This is a no-op. Please elaborate.

This isn't a no-op. I think my comment explains this. libmkldnn.so.0 is a link file. We need an actual binary file named libmkldnn.so.0. Jenkins can't pack a link file.

I see, so you're making use of the fact that cp automatically resolves the symlink? I think a cleaner way would be to resolve the symlink explicitly and then overwrite the symlink file with the original file rather than relying on implicit actions.

If you know a cleaner way, I'm happy to do so.

https://stackoverflow.com/questions/7665/how-to-resolve-symbolic-links-in-a-shell-script?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

What we need here is to turn a link file into a regular file, not to find the path to the regular file. That's a different thing. Jenkins can't pack a link file, and mxnet wants a library file named libmkldnn.so.0 rather than libmkldnn.so.0.13 (which is what libmkldnn.so.0 points to). If what you found can turn libmkldnn.so.0 (a link file) into a regular file with the same name in a single command, can you please provide the command?

https://stackoverflow.com/a/13396140/3062895

So the code will be something like this:

cp --remove-destination 3rdparty/mkldnn/src/`readlink 3rdparty/mkldnn/src/libmkldnn.so.0` 3rdparty/mkldnn/src/libmkldnn.so.0

Do you think this is a preferred or less confusing way?

To be honest, I actually prefer the existing solution over the one-liner.

Alright, no strong feelings from my side
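
For readers following this thread, the operation under discussion (turning a symlink into a regular file with the same name) can be sketched in Python. This is only an illustration of the technique, not part of the PR; `materialize_symlink` is a hypothetical helper, and the file names in the demo are placeholders in a temporary directory.

```python
import os
import shutil
import tempfile


def materialize_symlink(link_path):
    """Replace a symlink with a regular-file copy of its target, keeping the name."""
    if not os.path.islink(link_path):
        return False  # not a symlink: nothing to do
    target = os.path.realpath(link_path)  # resolve the link explicitly
    tmp_path = link_path + ".tmp"
    shutil.copy2(target, tmp_path)        # copy the real file's bytes and metadata
    os.replace(tmp_path, link_path)       # rename over the link itself
    return True


# Demo with placeholder names in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    real_file = os.path.join(workdir, "libdemo.so.0.13")
    link_file = os.path.join(workdir, "libdemo.so.0")
    with open(real_file, "wb") as f:
        f.write(b"binary payload")
    os.symlink("libdemo.so.0.13", link_file)

    materialize_symlink(link_file)
    print(os.path.islink(link_file), os.path.isfile(link_file))
```

This mirrors the cp-then-mv approach in the diff: the copy follows the link, and the rename replaces the link entry rather than its target.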

marcoabreu · 2018-04-30T06:58:04Z

src/ndarray/ndarray.cc

+    for (int i = 0; i < desc2.data.ndims; i++)
+      required_shape[i] = desc2.data.dims[i];
+    NDArray reshaped = MKLDNNDataReshape(required_shape);
+    const mkldnn::memory *ret = reshaped.GetMKLDNNData();


Where is ret being destroyed?

The memory is managed by MKLDNNStream. You can take a look at the implementation of GetMKLDNNData.

marcoabreu · 2018-04-30T06:58:41Z

src/operator/nn/mkldnn/mkldnn_base-inl.h

@@ -287,10 +286,16 @@ class MKLDNNStream {
    return !net.empty();
  }

-  void Submit() {
-    if (!net.empty())
+  void Submit(bool cleanup = true) {


Please describe this argument

zheng-da · 2018-05-01T03:03:31Z

@marcoabreu do you have other comments?

szha · 2018-05-02T03:45:08Z

ci/docker/runtime_functions.sh

@@ -313,6 +313,8 @@ build_ubuntu_amalgamation_min() {
 build_ubuntu_gpu_cmake_mkldnn() {
    set -ex
    cd /work/build
+    # We need to use generic archtecture. Otherwise, MKLDNN compiled in one
+    # CPU architecture (e.g., C5) can't run on another architecture (e.g., g3).


What do these comments apply to here?

I need to remove this. Originally, I set ARCH_OPT_FLAGS here, but I moved it to CMakeLists.txt.
https://github.com/apache/incubator-mxnet/pull/10731/files/01b37cecf0a57ea9d943aae0c079305d4de8452d#diff-af3b638bc2a3e6c650974192a53c7291R162
I need to move the comments as well.

zheng-da · 2018-05-02T20:22:04Z

@marcoabreu do you have more comments?

marcoabreu

LGTM. I'm only concerned about this optimization part and its impact. It would be good if you could make some kind of benchmark that ensures we're not losing any performance with that method.

piiswrong · 2018-05-03T17:27:49Z

Intel people said generic is OK. I'm going to assume they know what they are talking about.

* test inference multiple times. * Fix a bug in GetMKLDNNData(). * Update comments. * Handle all cases for GetMKLDNNDataReorder * avoid unnecessary message. * Add C++ unit test for NDArray. * Fix a minor bug. * Unit tests on GetMKLDNNDataReorder. * Fix lint error. * Add more test cases. * add comments for the test code. * Reorganize test code. * Fix cpp tests. * test. * Add a new Jenkins compile task. * Update jenkins. * update jenkins. * Fix a Jenkins. * Fix jenkins. * Fix jenkins. * Fix CMake for MKLDNN. * Fix jenkins. * update jenkins. * update CMake. * Fix cmake. * update CI. * add comment. * add comments. * cmake builds mkldnn with -mtune=generic by default. * adjust comments. remove unnecessary tests.

zheng-da added 14 commits April 27, 2018 22:02

test inference multiple times.

081de9e

Fix a bug in GetMKLDNNData().

42d64b4

Update comments.

3ff7475

Handle all cases for GetMKLDNNDataReorder

b746362

avoid unnecessary message.

60e9c6d

Add C++ unit test for NDArray.

c1c68a1

Fix a minor bug.

244149b

Unit tests on GetMKLDNNDataReorder.

8de717e

Fix lint error.

735f1d3

Add more test cases.

03a742f

add comments for the test code.

a21af4e

Reorganize test code.

cd1a074

Fix cpp tests.

1987430

test.

ff8bd94

zheng-da requested a review from marcoabreu as a code owner April 27, 2018 22:42

marcoabreu reviewed Apr 27, 2018

Add a new Jenkins compile task.

8835d77

zheng-da force-pushed the debug_mkldnn2 branch from 3cdee71 to 8835d77 Compare April 28, 2018 07:22

zheng-da added 6 commits April 28, 2018 00:39

Update jenkins.

150d38a

update jenkins.

afdae0a

Fix a Jenkins.

cfc38e6

Fix jenkins.

beb7c27

Fix jenkins.

859005c

Fix CMake for MKLDNN.

65917bc

zheng-da requested a review from szha as a code owner April 28, 2018 09:52

zheng-da added 5 commits April 28, 2018 10:05

Fix jenkins.

45683e2

update jenkins.

69b06d3

update CMake.

b9b360a

Fix cmake.

a737d4d

update CI.

e8b604a

add comment.

faafd08

zheng-da mentioned this pull request Apr 29, 2018

inference results unstable in mxnet_mkl-1.2.0b20180416 #10580

Closed

marcoabreu suggested changes Apr 30, 2018

zheng-da added 2 commits April 30, 2018 08:21

add comments.

4d441e6

cmake builds mkldnn with -mtune=generic by default.

01b37ce

szha reviewed May 2, 2018

adjust comments.

1c5e15e

marcoabreu approved these changes May 3, 2018

piiswrong merged commit 4ba436b into apache:master May 3, 2018

marcoabreu added a commit that referenced this pull request May 3, 2018

Revert "Fix a bug in getting MKLDNN memory (#10731)"

f19872d

This reverts commit 4ba436b.

anirudh2290 mentioned this pull request May 3, 2018

CMake ignores USE_MKLDNN flag for 1.2.0.RC[0-2] releases #10801

Closed

zheng-da deleted the debug_mkldnn2 branch September 29, 2018 21:31
