Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

mxnet-mkl (v0.12.0) crash when using (conda-installed) numpy with MKL #8532

Closed
fhieber opened this issue Nov 3, 2017 · 40 comments
Closed

mxnet-mkl (v0.12.0) crash when using (conda-installed) numpy with MKL #8532

fhieber opened this issue Nov 3, 2017 · 40 comments

Comments

@fhieber
Copy link
Contributor

fhieber commented Nov 3, 2017

We have observed crashes with any mkl-enabled pip package of mxnet-0.12.0 in combination with numpy if installed through conda (which by default also uses MKL).

In this case, mxnet trainings crash with the following error message:

OMP: Error #15: Initializing libiomp5.so, but found libiomp5.so already initialized.
OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you
OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.

Numpy from conda links against the libmkl_rt.so, distributed through conda:

libmkl_rt.so => /opt/conda/lib/python3.6/site-packages/numpy/core/../../../../libmkl_rt.so (0x00007f05256e0000)
	libmkl_rt.so => /opt/conda/lib/python3.6/site-packages/numpy/linalg/../../../../libmkl_rt.so (0x00007f367e1d5000)
	libmkl_rt.so => /opt/conda/lib/python3.6/site-packages/numpy/linalg/../../../../libmkl_rt.so (0x00007fd39e751000)

whereas MXNet links to its own .so:

ldd /opt/conda/lib/python3.6/site-packages/mxnet/libmxnet.so
[...]
	libmklml_intel.so => /opt/conda/lib/python3.6/site-packages/mxnet/libmklml_intel.so (0x00007f8c85b94000)
	libiomp5.so => /opt/conda/lib/python3.6/site-packages/mxnet/libiomp5.so (0x00007f8c857f1000)
[...]

This prevents people from using numpy w/ MKL in combination with mxnet-mkl==0.12.0.

@zhreshold
Copy link
Member

@szha

@szha
Copy link
Member

szha commented Nov 3, 2017

@ykim362 what's your recommendation on this? MKL2018 doesn't distribute .a, which means that I can't hide any of its symbols when loading.

@srochel
Copy link
Contributor

srochel commented Nov 4, 2017

@b0noI - how do we resolve the issue on DLAMI?

@ykim362
Copy link
Contributor

ykim362 commented Nov 4, 2017

@szha I am discussing with MKL team to figure out how to handle this situation. I will get back to you, soon.

@ykim362
Copy link
Contributor

ykim362 commented Nov 6, 2017

@fhieber Could you provide your numpy (conda) version?

@cjolivier01
Copy link
Member

@szha, Can we link libiomp5.a statically and then strip the symbols from the final libmxnet.so? I am theorizing (I think correctly) that two completely separate versions of openmp can coexist when they apply only to two different modules.

@larroy
Copy link
Contributor

larroy commented Jan 6, 2018

@cjolivier01 how can this work, when the data structures passed to the functions of the library change? This would only work if both versions of openmp would guarantee binary compatibility. Is that the case?

@cjolivier01
Copy link
Member

cjolivier01 commented Jan 6, 2018

@larroy Not sure what you're trying to say here. If the symbols are stripped from libmxnet.so, then the libnumpy.so (or whatever it's called)'s omp.so and libmxnet.so (with its internal omp pool) won't see each other at all, much less call into each other.

What you'll get is numpy using a separate thread pool than libmxnet.so, which probably doesn't matter much because they generally won't be running in parallel anyway.

@ykim362
Copy link
Contributor

ykim362 commented Jan 6, 2018

@cjolivier01 When I discussed this issue with MKL team, it is not much feasible to make static MKLML given that it has lots of binary dependencies. Another aspect is that there is a possibility that two separate OpenMP calls (from numpy and from MXNet with MKLML) could degrade the performance. We are working on several experimentations. I would like to update the result next week.

@larroy
Copy link
Contributor

larroy commented Jan 6, 2018

@cjolivier01 I see. I thought the problem was linking with the same symbols. I found this related to the issue: https://software.intel.com/en-us/inference-engine-devguide-known-issues

@ykim362
Copy link
Contributor

ykim362 commented Jan 10, 2018

@cjolivier01 From the experimentations, the behavior varies case by case. Some combinations of versions work fine. Some are not. So, the workaround @larroy pointed (LD_PRELOAD env variable) might be a good option. Also, it is recommended to use MKL-enabled MXNet with non-MKL python to get the best performance for neural network workloads.

@szha
Copy link
Member

szha commented Jan 10, 2018

@ykim362 our goal is to be able to provide good out-of-the-box experience for MKL-enabled mxnet. Though the workaround might be good for experienced users, we cannot control the environment in which mxnet runs for pip users. Thus, I'm still leaning towards the way @cjolivier01 proposed above if possible.

Do you mind sharing more on the dependencies of mklml to help figure out the feasibility of static-linking?

@ykim362
Copy link
Contributor

ykim362 commented Jan 10, 2018

@szha I totally agree with you regarding the OOB experience! Actually, our concern is that static linking doesn't solve the problem. How about trying rpath compile option?

@fhieber Could you provide some more detailed information to re-generate this situation? The behavior varies, so I would like to test with the exactly same environment.

@szha
Copy link
Member

szha commented Jan 10, 2018

@ykim362 could you elaborate? Currently I'm using '${ORIGIN}' in rpath for mkl builds already, and shipping mkl shared objects along with libmxnet.so. I don't think this prevents other omp from being loaded first and hence doesn't solve our problem. Are you referring to a different solution that's based on rpath?

@rsdubtso
Copy link

MKL dev here. Can somebody please verify that the pip packages are not linked to OpenMP (Intel's libiomp5) statically? If any one of them is, then this would be a problem. If every single package links to OpenMP dynamically, this should probably work (disclaimer: this needs to be verified; the libmkl_rt.so on Linux loads libiomp5.so via dlopen and I'm not 100% sure what would happen if the library is loaded multiple times for example using RTLD_LOCAL...)

@szha
Copy link
Member

szha commented Jan 11, 2018

@rsdubtso Hi. I can verify that mxnet pip package is linked dynamically to libiomp5.so. We include libiomp5.so in the pip package using the rpath trick described above.

@rsdubtso
Copy link

Hm. Seems like the numpy package also uses dynamic linking to libiomp5. Do you have a small repro that I could try? (Please do include all the necessary package installation steps -- I'm pretty new to all this stuff...)

PS. I also tried installing mxnet for python2.7 conda and it seems to be missing libmxnet.so completely... should I be using conda for python3?

@ykim362
Copy link
Contributor

ykim362 commented Jan 22, 2018

@fhieber Could you provide the conda version you used or some more detailed environment when you got this problem? mxnet-mkl==0.12.0 and 0.12.1 work fine with conda in my tests.

@fhieber
Copy link
Contributor Author

fhieber commented Jan 23, 2018

@ykim362 I just tried this again in a Docker image with mxnet-cu90mkl==1.0.0.post4 and the following conda info output:

               platform : linux-64
          conda version : 4.3.30
       conda is private : False
      conda-env version : 4.3.30
    conda-build version : not installed
         python version : 3.6.3.final.0
       requests version : 2.18.4
       root environment : /opt/conda  (writable)
    default environment : /opt/conda
       envs directories : /opt/conda/envs
                          /root/.conda/envs
          package cache : /opt/conda/pkgs
                          /root/.conda/pkgs
           channel URLs : https://repo.continuum.io/pkgs/main/linux-64
                          https://repo.continuum.io/pkgs/main/noarch
                          https://repo.continuum.io/pkgs/free/linux-64
                          https://repo.continuum.io/pkgs/free/noarch
                          https://repo.continuum.io/pkgs/r/linux-64
                          https://repo.continuum.io/pkgs/r/noarch
                          https://repo.continuum.io/pkgs/pro/linux-64
                          https://repo.continuum.io/pkgs/pro/noarch
            config file : None
             netrc file : None
           offline mode : False
             user-agent : conda/4.3.30 requests/2.18.4 CPython/3.6.3 Linux/4.9.58-18.51.amzn1.x86_64 / glibc/2.17
                UID:GID : 0:0

I still observe the same crash:

> OMP: Error #15: Initializing libiomp5.so, but found libiomp5.so already initialized.
> OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.

@pengzhao-intel
Copy link
Contributor

pengzhao-intel commented Mar 14, 2018

@fhieber @rsdubtso is correct. It caused by multiple omp lib. The workaround is to set KMP_DUPLICATE_LIB_OK=TRUE .

I am considering the final solution now. Will come back soon :)

Error log:

[INFO:sockeye.training] Using bucketing. Default max_seq_len=(60, 60)
[INFO:main] Optimizer: adam
[INFO:main] Optimizer Parameters: {'wd': 0.0, 'learning_rate': 0.0003, 'lr_scheduler': LearningRateSchedulerPlateauReduce(reduce_factor=0.50, reduce_num_not_improved=0), 'clip_gradient': 1.0, 'rescale_grad': 1.0}
[INFO:main] kvstore: device
[INFO:main] Gradient Compression: None
[INFO:main] Decode and Evaluate Device(s): cpu(0)
[INFO:sockeye.model] Saved config to "/home/chenxiny/newstest/wmt_model/config"
OMP: Error #15: Initializing libiomp5.so, but found libiomp5.so already initialized.
OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
Aborted

After setting the environment, the problem is gone.
New log:

[INFO:sockeye.utils] Sockeye version 1.16.2 commit 762ce78e4e49b9ba5d14eb0a48d97f19c8807707
[INFO:sockeye.utils] MXNet version 1.1.0
[INFO:sockeye.utils] Command: /home/chenxiny/anaconda3/lib/python3.6/site-packages/sockeye/train.py -s corpus.tc.BPE.de -t corpus.tc.BPE.en -vs newstest2016.tc.BPE.de -vt newstest2016.tc.BPE.en --num-embed 256 --rnn-num-hidden 512 --rnn-attention-type dot --max-seq-len 60 --decode-and-evaluate 500 -o wmt_model
[INFO:sockeye.utils] Arguments: Namespace(allow_missing_params=False, batch_size=64, batch_type='sentence', bucket_width=10, checkpoint_frequency=1000, cnn_activation_type='glu', cnn_hidden_dropout=0.0, cnn_kernel_width=(3, 5), cnn_num_hidden=512, cnn_positional_embedding_type='learned', cnn_project_qkv=False, conv_embed_add_positional_encodings=False, conv_embed_dropout=0.0, conv_embed_max_filter_width=8, conv_embed_num_filters=(200, 200, 250, 250, 300, 300, 300, 300), conv_embed_num_highway_layers=4, conv_embed_output_dim=None, conv_embed_pool_stride=5, decode_and_evaluate=500, decode_and_evaluate_device_id=None, decode_and_evaluate_use_cpu=False, decoder='rnn', device_ids=[-1], disable_device_locking=False, embed_dropout=(0.0, 0.0), embed_weight_init='default', encoder='rnn', fill_up='replicate', gradient_clipping_threshold=1.0, gradient_clipping_type='abs', gradient_compression_threshold=0.5, gradient_compression_type=None, initial_learning_rate=0.0003, keep_last_params=-1, kvstore='device', label_smoothing=0.0, layer_normalization=False, learning_rate_decay_optimizer_states_reset='off', learning_rate_decay_param_reset=False, learning_rate_half_life=10, learning_rate_reduce_factor=0.5, learning_rate_reduce_num_not_improved=3, learning_rate_schedule=None, learning_rate_scheduler_type='plateau-reduce', learning_rate_warmup=0, lock_dir='/tmp', loss='cross-entropy', loss_normalization_type='valid', max_num_checkpoint_not_improved=8, max_num_epochs=None, max_seq_len=(60, 60), max_updates=None, metrics=['perplexity'], min_num_epochs=None, momentum=None, monitor_pattern=None, monitor_stat_func='mx_default', no_bucketing=False, num_embed=(256, 256), num_layers=(1, 1), num_words=(50000, 50000), optimized_metric='perplexity', optimizer='adam', optimizer_params=None, output='wmt_model', overwrite_output=False, params=None, prepared_data=None, quiet=False, rnn_attention_coverage_num_hidden=1, rnn_attention_coverage_type='count', rnn_attention_in_upper_layers=False, rnn_attention_mhdot_heads=None, rnn_attention_num_hidden=None, rnn_attention_type='dot', rnn_attention_use_prev_word=False, rnn_cell_type='lstm', rnn_context_gating=False, rnn_decoder_hidden_dropout=0.0, rnn_decoder_state_init='last', rnn_dropout_inputs=(0.0, 0.0), rnn_dropout_recurrent=(0.0, 0.0), rnn_dropout_states=(0.0, 0.0), rnn_encoder_reverse_input=False, rnn_first_residual_layer=2, rnn_forget_bias=0.0, rnn_h2h_init='orthogonal', rnn_num_hidden=512, rnn_residual_connections=False, seed=13, shared_vocab=False, source='corpus.tc.BPE.de', source_vocab=None, target='corpus.tc.BPE.en', target_vocab=None, transformer_activation_type='relu', transformer_attention_heads=8, transformer_dropout_act=0.0, transformer_dropout_attention=0.0, transformer_dropout_prepost=0.0, transformer_feed_forward_num_hidden=2048, transformer_model_size=512, transformer_positional_embedding_type='fixed', transformer_postprocess=('drn', 'drn'), transformer_preprocess=('', ''), use_cpu=False, use_tensorboard=False, validation_source='newstest2016.tc.BPE.de', validation_target='newstest2016.tc.BPE.en', weight_decay=0.0, weight_init='xavier', weight_init_scale=2.34, weight_init_xavier_factor_type='in', weight_init_xavier_rand_type='uniform', weight_normalization=False, weight_tying=False, weight_tying_type='trg_softmax', word_min_count=(1, 1))
[INFO:sockeye.utils] Attempting to acquire 1 GPUs of 1 GPUs. The requested devices are: [-1]
[INFO:sockeye.utils] Acquired GPU 0.
[INFO:main] Training Device(s): GPU [0]
[INFO:sockeye.vocab] Building vocabulary from dataset(s): ['corpus.tc.BPE.de']
[INFO:sockeye.vocab] Vocabulary: types: 22309/22309/22309/22313 (initial/min_pruned/max_pruned/+special) [min_frequency=1, max_num_types=50000]
[INFO:sockeye.vocab] Building vocabulary from dataset(s): ['corpus.tc.BPE.en']
[INFO:sockeye.vocab] Vocabulary: types: 16757/16757/16757/16761 (initial/min_pruned/max_pruned/+special) [min_frequency=1, max_num_types=50000]
[INFO:sockeye.data_io] ===============================
[INFO:sockeye.data_io] Creating training data iterator
[INFO:sockeye.data_io] ===============================
[INFO:sockeye.data_io] 193369 sequences of maximum length (60, 60) in '/home/chenxiny/newstest/corpus.tc.BPE.de' and '/home/chenxiny/newstest/corpus.tc.BPE.en'.
[INFO:sockeye.data_io] Mean training target/source length ratio: 0.99 (+-0.45)
[INFO:sockeye.data_io] Tokens: source 4563838 target 4466824
[INFO:sockeye.data_io] Vocabulary coverage: source 100% target 100%
[INFO:sockeye.data_io] 193369 sequences across 6 buckets
[INFO:sockeye.data_io] 6631 sequences did not fit into buckets and were discarded
[INFO:sockeye.data_io] Bucket (10, 10): 32318 samples in 505 batches of 64, ~362.2 tokens/batch.
[INFO:sockeye.data_io] Bucket (20, 20): 46870 samples in 733 batches of 64, ~949.5 tokens/batch.
[INFO:sockeye.data_io] Bucket (30, 30): 49430 samples in 773 batches of 64, ~1511.9 tokens/batch.
[INFO:sockeye.data_io] Bucket (40, 40): 35885 samples in 561 batches of 64, ~2070.4 tokens/batch.
[INFO:sockeye.data_io] Bucket (50, 50): 19636 samples in 307 batches of 64, ~2618.3 tokens/batch.

@szha
Copy link
Member

szha commented Apr 6, 2018

@pengzhao-intel any update?

@szha szha added Installation and removed Bug labels Apr 6, 2018
@szha
Copy link
Member

szha commented Apr 6, 2018

Unless intel provides a static library for libiomp5 built with -fPIC flag, or drop the dependency on it, I don't see anything we can do from the mxnet side. Suggestions are welcome.

@pengzhao-intel
Copy link
Contributor

pengzhao-intel commented Apr 6, 2018

@szha, @ashokei is talking with MKL team for using libgomp in mxnet so it can avoid the dependency of libiomp.

@ashokei any update and please correct me if my understanding is not correct.

@szha
Copy link
Member

szha commented Apr 6, 2018

@pengzhao-intel thanks for the update. @ashokei let us know how things progress and how this change impacts performance.

@ashokei
Copy link
Contributor

ashokei commented Apr 10, 2018

@zheng-da is your openmp linking issue related to this in any way; Does your solution/workaround fix this ?

@zheng-da
Copy link
Contributor

no, i don't have a workaround so far.

@tdomhan
Copy link
Contributor

tdomhan commented Jun 26, 2018

any updates on this?

@pengzhao-intel
Copy link
Contributor

@tdomhan after several investigations, I realized it's not easy to resolve this issue under the current setting. We have to change some building logic.
Do you still encounter this error? Does the workaround KMP_DUPLICATE_LIB_OK=TRUE make sense?
I will raise the internal discussion with MKL team and back soon.

@tdomhan
Copy link
Contributor

tdomhan commented Jun 27, 2018

"may cause crashes or silently produce incorrect results" in the message stated by Felix above made me not so eager to try this out, rather than using the non-MKL version of MXNet.

@pengzhao-intel
Copy link
Contributor

Understand, even we don't encounter the crashes and incorrect results by the workaround.
I am working on this item now and hope we can resolve this issue clearly :)

@tdomhan
Copy link
Contributor

tdomhan commented Jun 27, 2018

thanks a lot for looking into this! :)

@pengzhao-intel
Copy link
Contributor

The numpy from conda includes the mkl package which conflicts with MXNET.
The good news is the latest mkl package resolved this issue from mkl-2018.0.3.

Thus, updating the numpy in conda to 1.14 will resolve this issue.

@tdomhan @fhieber please try again in your environment.
Feel free to let me know if you encountered any other issues.

The following packages will be downloaded:
package | build
---------------------------|-----------------
mkl-2018.0.3 | 1 198.7 MB
The following NEW packages will be INSTALLED:
blas: 1.0-mkl
mkl_fft: 1.0.1-py36h3010b51_0
mkl_random: 1.0.1-py36h629b387_0
numpy-base: 1.14.5-py36hdbf6ddf_0
The following packages will be UPDATED:
ca-certificates: 2017.08.26-h1d4fec5_0 --> 2018.03.07-0
certifi: 2017.7.27.1-py36h8b7b77e_0 --> 2018.4.16-py36_0
conda: 4.4.11-py36_0 --> 4.5.4-py36_0
llvmlite: 0.20.0-py36_0 --> 0.23.2-py36hdbcaa40_0
mkl: 2018.0.0-hb491cac_4 --> 2018.0.3-1
numba: 0.35.0-np113py36_10 --> 0.38.1-py36h04863e7_0
numpy: 1.13.3-py36ha12f23b_0 --> 1.14.5-py36hcd700cb_0
openssl: 1.0.2l-h077ae2c_5 --> 1.0.2o-h20670df_0

@pengzhao-intel
Copy link
Contributor

@fhieber @tdomhan could you help confirm the fix?
If the problem is solved, we can close this bug finally ;)

@fhieber
Copy link
Contributor Author

fhieber commented Sep 28, 2018

I finally got back to this after a while. I no longer observe the libomp-related error mentioned in the original issue, but I am observing process deadlocks with the following numpy/mxnet configuration in Sockeye:

conda list | grep numpy
numpy                     1.15.1           py36h6a91979_0
numpy-base                1.15.1           py36h8a80b8c_0
conda list | grep mkl
blas                      1.0                         mkl
mkl                       2019.0                      118
mkl_fft                   1.0.4            py36h5d10147_1
mkl_random                1.0.1            py36h5d10147_1
mxnet-mkl                 1.3.0.post0               <pip>

If mkl-optimized numpy is installed via anaconda (as shown above) and using mxnet-mkl==1.3.0.post0 on a Mac laptop, the Sockeye subprocess spawned at a checkpoint (to decode the validation data set), is unable to spawn and the main process deterministically hangs. When debugging, it seems that it fails to spawn the subprocess.
However, when using either mxnet==1.3.0.post0 (no mkl) or pip-installed numpy (no mkl), everything works just fine.

@pengzhao-intel
Copy link
Contributor

@fhieber thanks for the feedback.
Could you give me a test case to reproduce this issue?

@fhieber
Copy link
Contributor Author

fhieber commented Sep 28, 2018

Thanks @pengzhao-intel, here is a minimal example to reproduce the issue.
You can run this:

conda install mkl
conda install numpy
pip install mxnet-mkl
git clone https://github.com/awslabs/sockeye.git
cd sockeye
python -m sockeye.train --num-layers 1 -s setup.py -t setup.py -vs setup.py -vt setup.py -o test_model --batch-size 5 --batch-type sentence --num-embed 8 --transformer-model-size 8 --overwrite-output --use-cpu --transformer-attention-heads 1 --checkpoint-frequency 100 --decode-and-evaluate 2

(this will train a tiny model on the setup.py file, but will hang once reached 100 updates and spawns a CheckpointDecoder subprocess to decode 2 sentences of the validation data.
This will hang with the last log line being:

[INFO:sockeye.training] Starting process: Decoder-1

If you set --decode-and-evaluate 0, no decoder subprocess will be started at each checkpoint, and training runs fine.

If you run

conda uninstall mkl
conda uninstall numpy
pip install numpy

and run the same training with --decode-and-evaluate > 0, no hanging will occur.

Likewise, if you replace mxnet-mkl with mxnet:

pip uninstall mxnet-mkl
pip install mxnet

and run the same training, no hanging will occur.

@pengzhao-intel
Copy link
Contributor

@fhieber thanks for the details information.

Because this is a new issue, do you mind close this one and open a new one?
BTW, the issue is observed from sockeye first, I suggest we can create an issue in sockeye repo first and go back MXNet repo when we figure out the specific bug.

@fhieber
Copy link
Contributor Author

fhieber commented Oct 1, 2018

@pengzhao-intel this is not a Sockeye issue. I posted a minimal reproducible example in #12710

@lebeg
Copy link
Contributor

lebeg commented Nov 22, 2018

Possibly related #12160

artemsok pushed a commit to artemsok/sockeye that referenced this issue Jan 21, 2019
@pengzhao-intel
Copy link
Contributor

The problem has been fixed with the latest numpy in conda.

Closing, feel free to re-open if there is any other issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests