mxnet-mkl (v0.12.0) crash when using (conda-installed) numpy with MKL #8532
@ykim362 what's your recommendation on this? MKL 2018 doesn't distribute .a files, which means I can't hide any of its symbols when loading.
@b0noI - how do we resolve the issue on DLAMI?
@szha I am discussing with the MKL team to figure out how to handle this situation. I will get back to you soon.
@fhieber Could you provide your numpy (conda) version?
@szha, Can we link libiomp5.a statically and then strip the symbols from the final libmxnet.so? I am theorizing (I think correctly) that two completely separate versions of OpenMP can coexist when they apply only to two different modules.
@cjolivier01 how can this work when the data structures passed to the functions of the library change? This would only work if both versions of OpenMP guaranteed binary compatibility. Is that the case?
@larroy Not sure what you're trying to say here. If the symbols are stripped from libmxnet.so, then numpy's OpenMP library and libmxnet.so (with its internal OpenMP pool) won't see each other at all, much less call into each other. What you'll get is numpy using a separate thread pool from libmxnet.so, which probably doesn't matter much because they generally won't be running in parallel anyway.
@cjolivier01 When I discussed this issue with the MKL team, they found it is not very feasible to make a static MKLML, given that it has lots of binary dependencies. Another aspect is that two separate OpenMP runtimes (one from numpy and one from MXNet with MKLML) could degrade performance. We are working on several experiments; I would like to share the results next week.
@cjolivier01 I see. I thought the problem was linking with the same symbols. I found this page related to the issue: https://software.intel.com/en-us/inference-engine-devguide-known-issues
@cjolivier01 From the experiments, the behavior varies case by case: some combinations of versions work fine, some do not. So the workaround @larroy pointed out (the LD_PRELOAD environment variable) might be a good option. Also, it is recommended to use MKL-enabled MXNet with a non-MKL Python stack to get the best performance for neural-network workloads.
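For illustration, a minimal sketch of that LD_PRELOAD workaround driven from Python; the libiomp5.so path and the train.py script name are assumptions, not taken from this thread:

```python
# Sketch: force a single libiomp5.so to be loaded first by re-launching the
# training script with LD_PRELOAD set (the path below is hypothetical).
import os
import subprocess
import sys

env = dict(os.environ)
env["LD_PRELOAD"] = "/opt/conda/lib/python3.6/site-packages/mxnet/libiomp5.so"  # hypothetical path

# LD_PRELOAD only takes effect at process startup, so re-exec the real script.
subprocess.check_call([sys.executable, "train.py"], env=env)
```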
@ykim362 our goal is to provide a good out-of-the-box experience for MKL-enabled MXNet. Though the workaround might be fine for experienced users, we cannot control the environment in which MXNet runs for pip users. Thus, I'm still leaning towards the approach @cjolivier01 proposed above, if possible. Do you mind sharing more on the dependencies of MKLML to help figure out the feasibility of static linking?
@szha I totally agree with you regarding the OOB experience! Actually, our concern is that static linking doesn't solve the problem. How about trying the rpath compile option? @fhieber Could you provide some more detailed information to reproduce this situation? The behavior varies, so I would like to test with exactly the same environment.
@ykim362 could you elaborate? I'm already using '${ORIGIN}' in rpath for MKL builds, and shipping the MKL shared objects along with libmxnet.so. I don't think this prevents another OpenMP from being loaded first, so it doesn't solve our problem. Are you referring to a different solution that's based on rpath?
MKL dev here. Can somebody please verify that the pip packages are not linked to OpenMP (Intel's libiomp5) statically? If any one of them is, then this would be a problem. If every single package links to OpenMP dynamically, this should probably work (disclaimer: this needs to be verified; the libmkl_rt.so on Linux loads libiomp5.so via dlopen, and I'm not 100% sure what would happen if the library is loaded multiple times, for example using RTLD_LOCAL...)
@rsdubtso Hi. I can verify that the mxnet pip package is linked dynamically to libiomp5.
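For reference, here is a quick sketch for checking this from Python (Linux only; it shells out to ldd and assumes the pip package ships libmxnet.so inside the mxnet package directory):

```python
# Sketch: confirm libmxnet.so links libiomp5/MKL dynamically (Linux only).
import os
import subprocess
import mxnet

# Assumption: the pip package places libmxnet.so next to the mxnet module.
libmxnet = os.path.join(os.path.dirname(mxnet.__file__), "libmxnet.so")
out = subprocess.check_output(["ldd", libmxnet]).decode()
for line in out.splitlines():
    if "iomp" in line or "mkl" in line:
        print(line.strip())  # dynamic entries here mean no static OpenMP link
```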
Hm. Seems like the numpy package also uses dynamic linking to libiomp5. Do you have a small repro that I could try? (Please include all the necessary package installation steps -- I'm pretty new to all this stuff...) PS. I also tried installing mxnet for Python 2.7 via conda and it seems to be missing libmxnet.so completely... should I be using conda for Python 3?
@fhieber Could you provide the conda version you used, or some more details about the environment where you hit this problem? mxnet-mkl==0.12.0 and 0.12.1 work fine with conda in my tests.
@ykim362 I just tried this again in a Docker image with
I still observe the same crash:
@fhieber @rsdubtso is correct. It is caused by multiple OpenMP libraries being loaded. The workaround is to set KMP_DUPLICATE_LIB_OK=TRUE. I am considering the final solution now and will come back soon :) Error log:
After setting the environment variable, the problem is gone.
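A minimal sketch of that workaround; the key point is that the variable must be set before either OpenMP runtime initializes, i.e. before the imports:

```python
# Sketch: allow duplicate OpenMP runtimes to coexist (Intel warns this is
# unsafe -- it may cause crashes or silently produce incorrect results).
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

import numpy as np   # conda numpy pulls in MKL's libiomp5
import mxnet as mx   # mxnet-mkl bundles its own libiomp5

print(mx.nd.array(np.ones((2, 2))) * 2)  # smoke test touching both libraries
```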
@pengzhao-intel any update?
Unless Intel provides a static libiomp5 built with the -fPIC flag, or drops the dependency on it, I don't see anything we can do from the MXNet side. Suggestions are welcome.
@pengzhao-intel thanks for the update. @ashokei let us know how things progress and how this change impacts performance.
@zheng-da is your OpenMP linking issue related to this in any way? Does your solution/workaround fix this?
No, I don't have a workaround so far.
Any updates on this?
@tdomhan After several investigations, I realized it's not easy to resolve this issue under the current setup. We have to change some of the build logic.
The warning "may cause crashes or silently produce incorrect results" in the message Felix quoted above makes me reluctant to try this out; I'd rather use the non-MKL version of MXNet.
Understood. For what it's worth, we haven't encountered crashes or incorrect results with the workaround.
Thanks a lot for looking into this! :)
The numpy from conda includes the mkl package, which conflicts with MXNet. Updating numpy in conda to 1.14 will resolve this issue. @tdomhan @fhieber please try again in your environment.
|
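To check whether your environment matches, a small sketch for verifying the numpy version and its MKL linkage:

```python
# Sketch: confirm the conda numpy version and whether it was built with MKL.
import numpy as np

print(np.__version__)  # per the comment above, >= 1.14 should resolve the issue
np.show_config()       # MKL-backed builds list 'mkl' under the BLAS/LAPACK info
```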
I finally got back to this after a while. I no longer observe the libomp-related error mentioned in the original issue, but I am observing process deadlocks with the following numpy/mxnet configuration in Sockeye:
If an MKL-optimized numpy is installed via Anaconda (as shown above) and mxnet-mkl==1.3.0.post0 is used on a Mac laptop, the Sockeye subprocess spawned at a checkpoint (to decode the validation data set) fails to start, and the main process deterministically hangs.
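For triage, an illustrative sketch (not Sockeye's actual code) of the pattern suspected to hang here: fork()ing a worker after both OpenMP runtimes have started threads:

```python
# Illustrative repro sketch, assuming the hang comes from fork()ing after
# the OpenMP runtimes are initialized (on macOS, multiprocessing defaulted
# to fork before Python 3.8).
import multiprocessing as mp

import numpy as np   # conda numpy loads MKL's libiomp5
import mxnet as mx   # mxnet-mkl loads its own libiomp5

def checkpoint_decoder():
    # Any BLAS call in the forked child may deadlock on inherited OpenMP state.
    print(np.dot(np.ones((100, 100)), np.ones((100, 100))).sum())

if __name__ == "__main__":
    x = mx.nd.ones((1000, 1000))
    (x * x).wait_to_read()                     # spin up MXNet's thread pools
    p = mp.Process(target=checkpoint_decoder)  # forks the whole process state
    p.start()
    p.join()                                   # may hang in this configuration
```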
@fhieber thanks for the feedback.
Thanks @pengzhao-intel, here is a minimal example to reproduce the issue.
(This will train a tiny model on the setup.py file, but it will hang once it reaches 100 updates and spawns a CheckpointDecoder subprocess to decode 2 sentences of the validation data.)
If you run the same training with a numpy build that does not use MKL, no hanging occurs. Likewise, if you replace mxnet-mkl with mxnet and run the same training, no hanging will occur.
@fhieber thanks for the detailed information. Since this is a new issue, do you mind closing this one and opening a new one?
@pengzhao-intel this is not a Sockeye issue. I posted a minimal reproducible example in #12710
Possibly related: #12160
…-cu90 without MKL due to apache/mxnet#8532 cr https://cr.amazon.com/r/7923806/
The problem has been fixed with the latest numpy in conda. Closing; feel free to re-open if there is any other issue.
We have observed crashes with any MKL-enabled pip package of mxnet-0.12.0 in combination with numpy when numpy is installed through conda (which by default also uses MKL).
In this case, MXNet training crashes with the following error message:
numpy from conda links against the libmkl_rt.so distributed through conda:
whereas MXNet links to its own .so:
This prevents people from using numpy with MKL in combination with mxnet-mkl==0.12.0.
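One way to see both copies mapped into a single process (a sketch, Linux only, reading /proc/self/maps):

```python
# Sketch (Linux): list the MKL/OpenMP shared objects actually loaded once
# both numpy and mxnet have been imported.
import numpy   # noqa: F401  -- loads conda's libmkl_rt.so / libiomp5.so
import mxnet   # noqa: F401  -- loads the bundled MKLML libraries

with open("/proc/self/maps") as maps:
    libs = {line.split()[-1] for line in maps if "mkl" in line or "iomp" in line}
for lib in sorted(libs):
    print(lib)  # expect two distinct MKL/OpenMP stacks in the broken setup
```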