Subprocess Deadlock with mxnet-mkl #12710
Comments
@mxnet-label-bot [MKL] |
Actually, there is no need for the subprocess to be a Python process. replacing The hang does not occur if one comments out the following lines in the code:
mod.forward(batch)
mod.backward()
mod.update()
So it seems that mxnet-mkl is somehow preventing any subprocess forking. |
Thanks @fhieber for raising this issue. I will take a look after the China holiday from the 1st to the 7th of October. |
thanks for looking into the issue! |
Just as a note: In the installation steps above one needs to add a
|
We have started to look at the issue and will be back soon :) |
Thank you! We are currently experimenting with a workaround that can be found here: The bottom line is to create a forkserver with a clean Python interpreter process before MXNet is imported. If we use that forkserver for forking our decoder process, we do not observe the behavior. That said, it is still concerning that one can no longer fork after MXNet with MKL has been imported. |
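The forkserver idea described in this comment can be sketched with the standard library alone. This is a minimal illustration, not the workaround linked in the comment: the helper names `make_forkserver_context` and `run_command` are hypothetical, the MXNet import is left as a comment, and the forkserver start method is only available on POSIX systems.

```python
import multiprocessing as mp
import subprocess
import sys


def make_forkserver_context():
    """Return a multiprocessing context that uses the 'forkserver' start method."""
    return mp.get_context("forkserver")


def run_command(ctx, cmd):
    """Run cmd in a child created by the fork server and return the child's exit code.

    The target is subprocess.check_call from the standard library, so the
    child does not need anything defined in the main script.
    """
    worker = ctx.Process(target=subprocess.check_call, args=(cmd,))
    worker.start()
    worker.join()  # wait until the child has finished
    return worker.exitcode


if __name__ == "__main__":
    ctx = make_forkserver_context()
    # Warm-up: starting the first child launches the fork server process
    # while this interpreter is still "clean".
    run_command(ctx, [sys.executable, "-c", "pass"])

    # Assumption: the heavy import would happen only at this point, e.g.
    # import mxnet as mx

    # Later children are forked from the server, not from this process.
    code = run_command(ctx, [sys.executable, "-c", "print('decoder finished')"])
    print("exit code:", code)
```

The `if __name__ == "__main__":` guard matters here, because children created through the fork server re-import the main script, and any import of MXNet at module level would then run in them as well.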
@tdomhan we can reproduce the issue on macOS; there is no problem on Linux. As you mentioned, the issue occurs when conda's numpy-mkl is combined with fork, while the normal version of numpy is OK. If we use os.system() to execute the command, it is also fine. We are still debugging, but it looks like a cross-platform, cross-software issue. I will contact the MKL team for feedback. |
Hi folks, Best regards, |
Hi folks, @tdomhan, can you please run the application with MKL_VERBOSE=1? This will help us determine the MKL version and threading layer. Please let me know what you find out! Best regards, |
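One way to follow this request from Python is to launch the script with MKL_VERBOSE=1 set in the child's environment and capture its output. The sketch below is an illustration only: the helper name `run_with_mkl_verbose` and the default timeout are assumptions, and the timeout is there so that a hanging run raises `subprocess.TimeoutExpired` instead of blocking forever.

```python
import os
import subprocess
import sys


def run_with_mkl_verbose(cmd, timeout=120):
    """Run cmd with MKL_VERBOSE=1 and return the completed process.

    stdout and stderr are merged and captured as text so the MKL_VERBOSE
    lines can be inspected afterwards.
    """
    env = dict(os.environ, MKL_VERBOSE="1")  # copy of the environment plus the flag
    return subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=timeout,
    )


if __name__ == "__main__":
    # Stand-in command; for this issue it would be [sys.executable, "code.py"].
    result = run_with_mkl_verbose(
        [sys.executable, "-c", "import os; print(os.environ['MKL_VERBOSE'])"]
    )
    print(result.stdout)
```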
Hi @mzhukova, thanks for the workaround with
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180710 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, OSX 2.30GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x7f9cc9432f80,1,0x7f9cc9432f80,1) 8.17ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:2
MKL_VERBOSE Intel(R) MKL 2018.0 Update 3 Product build 20180406 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, OSX 2.30GHz lp64 intel_thread
MKL_VERBOSE SGEMM(T,N,5,32,16,0x700005cf0738,0x7f9ccb85ddc0,16,0x7f9cca4e4c00,16,0x700005cf0740,0x7f9ccb9d8a40,5) 175.07us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:2
MKL_VERBOSE SAXPY(5,0x700005cf0738,0x7f9ccb85df00,1,0x7f9ccb9d8a40,1) 10.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:2
[...followed by an endless stream of lines similar to the last one above...] |
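The header lines in the output above carry the MKL version and threading layer, and they show two different MKL builds (2019.0 and 2018.0 Update 3) in the same run. A small helper can pull these fields out of captured output. The function name `mkl_headers` and the regular expression are illustrative assumptions based only on the header format visible in this log, not a documented format.

```python
import re

# Matches header lines such as:
# MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180710 for ... lp64 intel_thread
HEADER = re.compile(
    r"^MKL_VERBOSE Intel\(R\) MKL (?P<version>.+?) "
    r"Product build (?P<build>\d+) .* (?P<threading>\S+)$"
)


def mkl_headers(log_text):
    """Return (version, build, threading_layer) for every MKL header line."""
    found = []
    for line in log_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            found.append((match["version"], match["build"], match["threading"]))
    return found


if __name__ == "__main__":
    sample = (
        "MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180710 for Intel(R) 64 "
        "architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) "
        "enabled processors, OSX 2.30GHz lp64 intel_thread\n"
        "MKL_VERBOSE SDOT(2,0x7f9cc9432f80,1,0x7f9cc9432f80,1) 8.17ms "
        "CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:2\n"
    )
    for version, build, threading in mkl_headers(sample):
        print(version, build, threading)
```

More than one distinct version in the result suggests that two copies of MKL are loaded into the same process.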
Hi @fhieber , Best regards, |
Closing this issue for now. Please feel free to reopen it if you run into more problems with it. |
Description
mxnet-mkl hangs indefinitely when trying to spawn subprocesses. This is a recent issue we are observing with Sockeye and may be related to #8532, but it can be reproduced without Sockeye (see below).
Environment info (Required)
conda install mkl ; conda install numpy
Minimum reproducible example
The following code reliably reproduces the deadlock/indefinite hang in the main process.
It creates a minimal module and 'trains' for 500 iterations, spawning a subprocess every 100 iterations. The main process is supposed to wait until the subprocess finishes before starting the next one.
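The control flow described above can be sketched without MXNet. This is not the original code.py: `train_step` is a hypothetical placeholder for the `mod.forward(batch)`, `mod.backward()` and `mod.update()` calls, the child command is a trivial stand-in, and the timeout is an added assumption so that a hang surfaces as an exception.

```python
import subprocess
import sys


def train_step(iteration):
    """Placeholder for one training iteration (no MXNet involved)."""
    return iteration * 0.5  # dummy work


def run(num_iterations=500, spawn_every=100):
    """Train for num_iterations and spawn a child every spawn_every iterations.

    Returns the exit codes of the spawned children in order.
    """
    exit_codes = []
    for i in range(1, num_iterations + 1):
        train_step(i)
        if i % spawn_every == 0:
            # Blocking call: the main process waits for the child to exit
            # before it continues with the next iteration.
            proc = subprocess.run([sys.executable, "-c", "pass"], timeout=60)
            exit_codes.append(proc.returncode)
    return exit_codes


if __name__ == "__main__":
    print(run())
```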
code.py:
Steps to reproduce
conda install mkl
conda install numpy
pip install mxnet-mkl --no-deps
python3 code.py
What have you tried to solve it?
Replacing mxnet-mkl with mxnet, or conda's numpy with pip-installed numpy (conda uninstall numpy; conda uninstall mkl; pip install numpy), resolves the issue and the output is as expected: