Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix a bug in getting MKLDNN memory #10731

Merged
merged 30 commits into from
May 3, 2018
Merged

Conversation

zheng-da
Copy link
Contributor

@zheng-da zheng-da commented Apr 27, 2018

Description

This is to fix a bug in #10580
The bug happens when an NDArray already has a MKLDNN memory with a special MKLDNN format. In some cases (group convolution), the MKLDNN memory uses 5 dimensions, while the NDArray still uses 4 dimensions. In this case, when GetMKLDNNData is called, SetMKLMem will discard the original MKLDNN memory due to incompatible dimensions between MKLDNN memory and NDArray. This messes up the data in the NDArray.

The test code in the PR can actually reproduce the bug in #10580

This PR also tries to add C++ unit tests to cover all possible combinations with getting MKLDNN memory from an NDArray.

Thank @TaoLv @pengzhao-intel for finding the root cause of this bug.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@zheng-da zheng-da requested a review from marcoabreu as a code owner April 27, 2018 22:42
gpu_max_val = np.max(np.abs(gpu_out.asnumpy()))
eprint(model_name + ": CPU " + str(max_val) + ", GPU " + str(gpu_max_val))
assert_almost_equal(out / max_val, gpu_out.asnumpy() / max_val, rtol=1e-3, atol=1e-3)
for i in range(5):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is one way of reproducing the bug in #10580.

@zheng-da zheng-da requested a review from szha as a code owner April 28, 2018 09:52
@zheng-da
Copy link
Contributor Author

@marcoabreu can you review this PR?

Copy link
Contributor

@marcoabreu marcoabreu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

cmake \
-DUSE_CUDA=1 \
-DUSE_CUDNN=1 \
-DUSE_MKLML_MKL=1 \
-DUSE_MKLDNN=1 \
-DCMAKE_BUILD_TYPE=Release \
-DARCH_OPT_FLAGS="-mtune=generic" \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was never a problem before and is counter intuitive to the user. It should be possible to compile on one instance type and execute on another instance type without having to specify options like this one. I'd appreciate it if you could track this down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this problem is caused by how your CI system is set up. It's compiled on C5, which supports AVX512, and runs on g3, which doesn't support AVX512. MKLDNN by default tries to optimize specifically for an architecture where it is compiled. There is similar setup for Makefile.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so MKLDNN auto-assumes the runtime architecture based on the runtime-architecture. I think this will cause quite some problems for our users in production environments, considering everybody will build their binaries in a build fleet and deploy it in a production fleet with different configuration and hardware.

Could you please elaborate the flow how users are being made aware of this problem and how we instruct them to solve this problem? Also, where is ARCH_OPT_FLAGS being consumed?

Copy link
Contributor Author

@zheng-da zheng-da Apr 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just edit CMakeLists.txt to turn on "-mtune=generic" for MKLDNN by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good if @piiswrong @szha @cjolivier01 could look into the possible impact of that change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same thing has been used in prepare_mkldnn.sh, which is used by Makefile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference here is that the linked script is only for the compilation of MKLDNN - maybe @pengzhao-intel can elaborate. Here, you're changing the compilation of the entire MXNet library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is what you worry about, I have changed CMakeLists.txt to only compile MKLDNN with the flag.
https://github.com/apache/incubator-mxnet/pull/10731/files#diff-af3b638bc2a3e6c650974192a53c7291R162

-G Ninja \
/work/mxnet

ninja -v
# libmkldnn.so.0 is a link file. We need an actual binary file named libmkldnn.so.0.
cp 3rdparty/mkldnn/src/libmkldnn.so.0 3rdparty/mkldnn/src/libmkldnn.so.0.tmp
mv 3rdparty/mkldnn/src/libmkldnn.so.0.tmp 3rdparty/mkldnn/src/libmkldnn.so.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a no-op. Please elaborate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't no-op. I think my comment explain this. libmkldnn.so.0 is a link file. We need an actual binary file named libmkldnn.so.0. Jenkins can't pack a link file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so you're making use of the fact that cp automatically resolves the symlink? I think a cleaner way would be to resolve the symlink explicitely and then override the symlink file with the original file rather than relying on implicit actions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you know a cleaner way, I'm happy to do so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@zheng-da zheng-da May 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we need here is to turn a link file to a regular file, instead of finding the path to the regular file. It's a different thing. Jenkins can't wrapper a link file and mxnet wants a library file named libmkldnn.so.0, instead of libmkldnn.so.0.13 (which is what libmkldnn.so.0 points to). If what you find can turn libmkldnn.so.0 (a link file) to a regular file with the same name with a single command, can you please provide the command?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@zheng-da zheng-da May 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the code will be something like this.

cp --remove-destination 3rdparty/mkldnn/src/`readlink 3rdparty/mkldnn/src/libmkldnn.so.0` 3rdparty/mkldnn/src/libmkldnn.so.0

do you think this is a preferred way or less confusing way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I actually prefer the existing solution over the one-liner.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, no strong feelings from my side

for (int i = 0; i < desc2.data.ndims; i++)
required_shape[i] = desc2.data.dims[i];
NDArray reshaped = MKLDNNDataReshape(required_shape);
const mkldnn::memory *ret = reshaped.GetMKLDNNData();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is ret being destroyed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory is managed by MKLDNNStream. You can take a look at the implementation of GetMKLDNNData

@@ -287,10 +286,16 @@ class MKLDNNStream {
return !net.empty();
}

void Submit() {
if (!net.empty())
void Submit(bool cleanup = true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please describe this argument

@zheng-da
Copy link
Contributor Author

zheng-da commented May 1, 2018

@marcoabreu do you have other comments?

@@ -313,6 +313,8 @@ build_ubuntu_amalgamation_min() {
build_ubuntu_gpu_cmake_mkldnn() {
set -ex
cd /work/build
# We need to use generic archtecture. Otherwise, MKLDNN compiled in one
# CPU architecture (e.g., C5) can't run on another architecture (e.g., g3).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do these comments apply to here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to remove this. Originally, I set ARCH_OPT_FLAGS here. but I moved it to CMakeLists.txt.
https://github.com/apache/incubator-mxnet/pull/10731/files/01b37cecf0a57ea9d943aae0c079305d4de8452d#diff-af3b638bc2a3e6c650974192a53c7291R162
I need to move the comments as well.

@zheng-da
Copy link
Contributor Author

zheng-da commented May 2, 2018

@marcoabreu do you have more comments?

Copy link
Contributor

@marcoabreu marcoabreu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm only concerned about this optimization part and its impact. It would be good if you could make some kind of benchmark that ensures we're not losing any performance with that method.

@piiswrong piiswrong merged commit 4ba436b into apache:master May 3, 2018
@piiswrong
Copy link
Contributor

intel people saids generic is ok. I'm going to assume they know what they are talking about

marcoabreu added a commit that referenced this pull request May 3, 2018
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 7, 2018
* test inference multiple times.

* Fix a bug in GetMKLDNNData().

* Update comments.

* Handle all cases for GetMKLDNNDataReorder

* avoid unnecessary message.

* Add C++ unit test for NDArray.

* Fix a minor bug.

* Unit tests on GetMKLDNNDataReorder.

* Fix lint error.

* Add more test cases.

* add comments for the test code.

* Reorganize test code.

* Fix cpp tests.

* test.

* Add a new Jenkins compile task.

* Update jenkins.

* update jenkins.

* Fix a Jenkins.

* Fix jenkins.

* Fix jenkins.

* Fix CMake for MKLDNN.

* Fix jenkins.

* update jenkins.

* update CMake.

* Fix cmake.

* update CI.

* add comment.

* add comments.

* cmake builds mkldnn with -mtune=generic by default.

* adjust comments.
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request May 29, 2018
* test inference multiple times.

* Fix a bug in GetMKLDNNData().

* Update comments.

* Handle all cases for GetMKLDNNDataReorder

* avoid unnecessary message.

* Add C++ unit test for NDArray.

* Fix a minor bug.

* Unit tests on GetMKLDNNDataReorder.

* Fix lint error.

* Add more test cases.

* add comments for the test code.

* Reorganize test code.

* Fix cpp tests.

* test.

* Add a new Jenkins compile task.

* Update jenkins.

* update jenkins.

* Fix a Jenkins.

* Fix jenkins.

* Fix jenkins.

* Fix CMake for MKLDNN.

* Fix jenkins.

* update jenkins.

* update CMake.

* Fix cmake.

* update CI.

* add comment.

* add comments.

* cmake builds mkldnn with -mtune=generic by default.

* adjust comments.
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* test inference multiple times.

* Fix a bug in GetMKLDNNData().

* Update comments.

* Handle all cases for GetMKLDNNDataReorder

* avoid unnecessary message.

* Add C++ unit test for NDArray.

* Fix a minor bug.

* Unit tests on GetMKLDNNDataReorder.

* Fix lint error.

* Add more test cases.

* add comments for the test code.

* Reorganize test code.

* Fix cpp tests.

* test.

* Add a new Jenkins compile task.

* Update jenkins.

* update jenkins.

* Fix a Jenkins.

* Fix jenkins.

* Fix jenkins.

* Fix CMake for MKLDNN.

* Fix jenkins.

* update jenkins.

* update CMake.

* Fix cmake.

* update CI.

* add comment.

* add comments.

* cmake builds mkldnn with -mtune=generic by default.

* adjust comments.
zheng-da added a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 8, 2018
* test inference multiple times.

* Fix a bug in GetMKLDNNData().

* Update comments.

* Handle all cases for GetMKLDNNDataReorder

* avoid unnecessary message.

* Add C++ unit test for NDArray.

* Fix a minor bug.

* Unit tests on GetMKLDNNDataReorder.

* Fix lint error.

* Add more test cases.

* add comments for the test code.

* Reorganize test code.

* Fix cpp tests.

* test.

* Add a new Jenkins compile task.

* Update jenkins.

* update jenkins.

* Fix a Jenkins.

* Fix jenkins.

* Fix jenkins.

* Fix CMake for MKLDNN.

* Fix jenkins.

* update jenkins.

* update CMake.

* Fix cmake.

* update CI.

* add comment.

* add comments.

* cmake builds mkldnn with -mtune=generic by default.

* adjust comments.
zheng-da added a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 11, 2018
* test inference multiple times.

* Fix a bug in GetMKLDNNData().

* Update comments.

* Handle all cases for GetMKLDNNDataReorder

* avoid unnecessary message.

* Add C++ unit test for NDArray.

* Fix a minor bug.

* Unit tests on GetMKLDNNDataReorder.

* Fix lint error.

* Add more test cases.

* add comments for the test code.

* Reorganize test code.

* Fix cpp tests.

* test.

* Add a new Jenkins compile task.

* Update jenkins.

* update jenkins.

* Fix a Jenkins.

* Fix jenkins.

* Fix jenkins.

* Fix CMake for MKLDNN.

* Fix jenkins.

* update jenkins.

* update CMake.

* Fix cmake.

* update CI.

* add comment.

* add comments.

* cmake builds mkldnn with -mtune=generic by default.

* adjust comments.
zheng-da added a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 13, 2018
* test inference multiple times.

* Fix a bug in GetMKLDNNData().

* Update comments.

* Handle all cases for GetMKLDNNDataReorder

* avoid unnecessary message.

* Add C++ unit test for NDArray.

* Fix a minor bug.

* Unit tests on GetMKLDNNDataReorder.

* Fix lint error.

* Add more test cases.

* add comments for the test code.

* Reorganize test code.

* Fix cpp tests.

* test.

* Add a new Jenkins compile task.

* Update jenkins.

* update jenkins.

* Fix a Jenkins.

* Fix jenkins.

* Fix jenkins.

* Fix CMake for MKLDNN.

* Fix jenkins.

* update jenkins.

* update CMake.

* Fix cmake.

* update CI.

* add comment.

* add comments.

* cmake builds mkldnn with -mtune=generic by default.

* adjust comments.

remove unnecessary tests.
anirudh2290 pushed a commit that referenced this pull request Jun 13, 2018
* test inference multiple times.

* Fix a bug in GetMKLDNNData().

* Update comments.

* Handle all cases for GetMKLDNNDataReorder

* avoid unnecessary message.

* Add C++ unit test for NDArray.

* Fix a minor bug.

* Unit tests on GetMKLDNNDataReorder.

* Fix lint error.

* Add more test cases.

* add comments for the test code.

* Reorganize test code.

* Fix cpp tests.

* test.

* Add a new Jenkins compile task.

* Update jenkins.

* update jenkins.

* Fix a Jenkins.

* Fix jenkins.

* Fix jenkins.

* Fix CMake for MKLDNN.

* Fix jenkins.

* update jenkins.

* update CMake.

* Fix cmake.

* update CI.

* add comment.

* add comments.

* cmake builds mkldnn with -mtune=generic by default.

* adjust comments.

remove unnecessary tests.
zheng-da added a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
* test inference multiple times.

* Fix a bug in GetMKLDNNData().

* Update comments.

* Handle all cases for GetMKLDNNDataReorder

* avoid unnecessary message.

* Add C++ unit test for NDArray.

* Fix a minor bug.

* Unit tests on GetMKLDNNDataReorder.

* Fix lint error.

* Add more test cases.

* add comments for the test code.

* Reorganize test code.

* Fix cpp tests.

* test.

* Add a new Jenkins compile task.

* Update jenkins.

* update jenkins.

* Fix a Jenkins.

* Fix jenkins.

* Fix jenkins.

* Fix CMake for MKLDNN.

* Fix jenkins.

* update jenkins.

* update CMake.

* Fix cmake.

* update CI.

* add comment.

* add comments.

* cmake builds mkldnn with -mtune=generic by default.

* adjust comments.
@zheng-da zheng-da deleted the debug_mkldnn2 branch September 29, 2018 21:31
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants