This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Subprocess Deadlock with mxnet-mkl #12710

Closed
fhieber opened this issue Oct 1, 2018 · 14 comments

@fhieber
Contributor

fhieber commented Oct 1, 2018

Description

mxnet-mkl hangs indefinitely when trying to spawn subprocesses. This is a recent issue we are observing with Sockeye and may be related to #8532, but it can be reproduced without Sockeye (see below).

Environment info (Required)

  • Python 3.6.6
  • macOS
  • mxnet-mkl==1.3.0.post0
  • Anaconda Numpy (with MKL optimization): conda install mkl ; conda install numpy

Minimum reproducible example

The following code reliably reproduces the deadlock/indefinite hang in the main process.
It creates a minimal module and 'trains' for 500 iterations, spawning a subprocess every 100 iterations. The main process is supposed to wait until the subprocess finishes before starting the next one.

code.py:

import subprocess
import sys

import mxnet as mx

if __name__ == '__main__':

    if len(sys.argv) > 1:
        print("TESTING")
        test = True
        iterations = 50
    else:
        print("TRAINING")
        test = False
        iterations = 500

    x = mx.sym.Variable('x')
    y = mx.sym.Variable('y')

    sym = mx.sym.FullyConnected(x, num_hidden=5)
    sym = mx.sym.SoftmaxOutput(sym, y)

    x_data = mx.nd.uniform(0, 1, (32, 16))
    y_data = mx.nd.zeros((32, 5))
    batch = mx.io.DataBatch(data=[x_data], label=[y_data])

    mod = mx.mod.Module(sym, data_names=['x'], label_names=['y'])
    mod.bind(data_shapes=[mx.io.DataDesc('x', shape=x_data.shape)],
             label_shapes=[mx.io.DataDesc('y', shape=y_data.shape)],
             for_training=True, grad_req='write' if not test else 'null')
    mod.init_params()
    mod.init_optimizer()
    process = None
    for i in range(iterations):
        mod.forward(batch)
        if not test:
            mod.backward()
            mod.update()
        if i % 100 == 0 and i > 0:
            print(i)
            if not test:
                if process:
                    print("Waiting for process")
                    process.wait()
                cmd = [sys.executable, sys.argv[0], 'test']
                print("Starting process: '%s'" % " ".join(cmd))
                process = subprocess.Popen(cmd)
    if process:
        process.wait()

Steps to reproduce

  1. conda install mkl
  2. conda install numpy
  3. pip install mxnet-mkl --no-deps
  4. python3 code.py

What have you tried to solve it?

Replacing mxnet-mkl with mxnet or conda's numpy with pip-installed numpy (conda uninstall numpy; conda uninstall mkl; pip install numpy) resolves the issue and the output is as expected:

TRAINING
100
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
200
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
300
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
400
Waiting for process
TESTING
Starting process: '/Users/fhieber/miniconda3/bin/python3 sockeye/process_test.py test'
TESTING
@vandanavk
Contributor

@mxnet-label-bot [MKL]

@marcoabreu marcoabreu added the MKL label Oct 1, 2018
@fhieber
Contributor Author

fhieber commented Oct 1, 2018

Actually, there is no need for the subprocess to be a Python process. Replacing cmd = [sys.executable, sys.argv[0], 'test'] with cmd = ['ls'] produces the same hang.

The hang does not occur if one comments out the following lines in the code:

mod.forward(batch)
mod.backward()
mod.update()

So it seems that mxnet-mkl is somehow preventing any subprocess forking.
If the above code example is run in a debugger, the hang occurs in the call to self._execute_child(...) in subprocess.py, line 1268.
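
Since the parent blocks inside the Popen call itself, a timeout on process.wait() in the script would never be reached. One way to turn the hang into a pass/fail check is to run the whole reproduction script under a watchdog from a separate process that does not import MXNet. The following is a minimal sketch under that assumption; run_with_timeout is a hypothetical helper name, and the sleeping child is only a stand-in for [sys.executable, 'code.py'].

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it did not finish in time."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return None


# Stand-in for the reproduction script; replace with [sys.executable, "code.py"].
hanging_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]
result = run_with_timeout(hanging_cmd, timeout_s=1)
print("hang detected" if result is None else "exit code %d" % result)
```

The timeout value has to be chosen well above the normal runtime of the script, otherwise a slow run would be misreported as a hang.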

@pengzhao-intel
Contributor

Thanks @fhieber for raising this issue. I will take a look after the China holiday, which runs from Oct 1 to Oct 7.

@fhieber fhieber changed the title Process Deadlock with mxnet-mkl and mkl-optimized numpy Subprocess Deadlock with mxnet-mkl Oct 2, 2018
@tdomhan
Contributor

tdomhan commented Oct 2, 2018

thanks for looking into the issue!

@tdomhan
Contributor

tdomhan commented Oct 2, 2018

Just as a note: In the installation steps above one needs to add a --no-deps to the mxnet installation to make sure the conda numpy version, which uses mkl, will not be overwritten by the pip version:

  1. conda install mkl
  2. conda install numpy
  3. pip install mxnet-mkl --no-deps
  4. python3 code.py

@pengzhao-intel
Contributor

We have started looking at the issue and will be back soon :)

@tdomhan
Contributor

tdomhan commented Oct 12, 2018

Thank you! We are currently experimenting with a workaround that can be found here:
https://github.com/awslabs/sockeye/tree/forkserver

The bottom line is to create a forkserver with a clean python interpreter process before MXNet is imported. If we use that forkserver for forking our decoder process we do not observe the behavior. That said, it is still concerning that one can no longer fork after MXNet with MKL was imported.
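
The idea of delegating process creation to a clean interpreter can be sketched as follows. This is not the code from the Sockeye branch linked above, and it does not use multiprocessing's forkserver start method; it is a simplified stand-in that starts a plain helper interpreter through subprocess before the heavy import and sends it commands over a pipe. CleanLauncher and _HELPER_SOURCE are hypothetical names introduced only for this illustration.

```python
import json
import subprocess
import sys

# Source of the helper: a clean interpreter that never imports MXNet.
# It reads one JSON-encoded command per line and replies with the exit code.
# Child output is discarded so it cannot corrupt the reply channel.
_HELPER_SOURCE = r"""
import json, subprocess, sys
for line in sys.stdin:
    cmd = json.loads(line)
    code = subprocess.call(cmd, stdout=subprocess.DEVNULL)
    sys.stdout.write(json.dumps(code) + "\n")
    sys.stdout.flush()
"""


class CleanLauncher:
    """Hypothetical helper that delegates process creation to a clean interpreter."""

    def __init__(self):
        self._helper = subprocess.Popen(
            [sys.executable, "-c", _HELPER_SOURCE],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            universal_newlines=True)

    def run(self, cmd):
        """Ask the helper to run cmd and block until it reports the exit code."""
        self._helper.stdin.write(json.dumps(cmd) + "\n")
        self._helper.stdin.flush()
        return json.loads(self._helper.stdout.readline())

    def close(self):
        """Close the command pipe so the helper's read loop ends, then reap it."""
        self._helper.stdin.close()
        self._helper.wait()


# Create the launcher first; the heavy import (e.g. `import mxnet as mx`)
# would only happen after this line in a real script.
launcher = CleanLauncher()
exit_code = launcher.run([sys.executable, "-c", "print('TESTING')"])
launcher.close()
```

The main process never creates a child after the launcher exists; it only writes to a pipe, so the state of the OpenMP runtime in the main process should not matter for spawning.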

@pengzhao-intel
Contributor

@tdomhan We can reproduce the issue on macOS; there is no problem on Linux.

As you mentioned, the issue occurs when conda's MKL-backed numpy is combined with fork, while the regular version of numpy is OK. If we use os.system() to execute the command, it is also fine.

We are still debugging, but it looks like a cross-platform, cross-software issue. I will contact the MKL team for feedback.

@akalinki

akalinki commented Nov 7, 2018

Hi folks,
If you're using mxnet with MKL, can you please set the environment variable MKL_VERBOSE=1 and share the output?

Best regards,
Alexander

@mzhukova

Hi folks,

@tdomhan, can you please run the application with MKL_VERBOSE=1? This will help us determine the MKL version and threading layer.
I suspect your issue is related to Intel OpenMP + fork(); see numpy/numpy#10060.
You can also try the workaround: set KMP_INIT_AT_FORK to false.

Please let me know what you find out!

Best regards,
Maria

@fhieber
Contributor Author

fhieber commented Nov 15, 2018

Hi @mzhukova, thanks for the workaround with KMP_INIT_AT_FORK=false! This seems to fix the hanging issue for me.
Here's the version information when running with MKL_VERBOSE=1:

MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180710 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, OSX 2.30GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x7f9cc9432f80,1,0x7f9cc9432f80,1) 8.17ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:2
MKL_VERBOSE Intel(R) MKL 2018.0 Update 3 Product build 20180406 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, OSX 2.30GHz lp64 intel_thread
MKL_VERBOSE SGEMM(T,N,5,32,16,0x700005cf0738,0x7f9ccb85ddc0,16,0x7f9cca4e4c00,16,0x700005cf0740,0x7f9ccb9d8a40,5) 175.07us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:2
MKL_VERBOSE SAXPY(5,0x700005cf0738,0x7f9ccb85df00,1,0x7f9ccb9d8a40,1) 10.67us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:2
[...followed by an unbounded number of lines similar to the last one above...]
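
Because MKL_VERBOSE output grows without bound, it can help to pull out only the header lines that carry the version and threading layer. Below is a minimal sketch; the header layout in the regular expression is inferred solely from the two header lines quoted above and is an assumption, not a documented format, and parse_mkl_headers is a hypothetical helper name.

```python
import re

# Header layout assumed from the two version lines quoted in this comment:
# "MKL_VERBOSE Intel(R) MKL <version> Product build <date> for ... <threading>"
_HEADER = re.compile(
    r"^MKL_VERBOSE Intel\(R\) MKL (?P<version>.+?) Product build (?P<build>\d+)"
    r" for .* (?P<threading>\S+)$")


def parse_mkl_headers(log_text):
    """Return (version, build, threading layer) for every MKL header line."""
    found = []
    for line in log_text.splitlines():
        match = _HEADER.match(line.strip())
        if match:
            found.append((match.group("version"),
                          match.group("build"),
                          match.group("threading")))
    return found


sample = """\
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180710 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, OSX 2.30GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x7f9cc9432f80,1,0x7f9cc9432f80,1) 8.17ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:2
MKL_VERBOSE Intel(R) MKL 2018.0 Update 3 Product build 20180406 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, OSX 2.30GHz lp64 intel_thread
"""
for version, build, threading in parse_mkl_headers(sample):
    print(version, build, threading)
```

Applied to the log above, this makes it easy to see that two different MKL builds report themselves in the same process, both with the intel_thread layer.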

@mzhukova

Hi @fhieber ,
So MKL does indeed use OpenMP threading in this case, which is the root cause of the hang that you observe.
numpy can be forced to use sequential or TBB threading via the corresponding MKL_THREADING_LAYER setting. However, since mxnet uses libmklml, which supports only intel_thread, the best option here is the KMP_INIT_AT_FORK=false workaround.
You can also check your intel-openmp version and perhaps try updating it, as this issue may already be fixed in one of the latest releases.

Best regards,
Maria
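
For scripts that cannot rely on the shell environment, the KMP_INIT_AT_FORK workaround discussed above can also be applied from Python. The sketch below assumes the OpenMP runtime reads the variable when it is loaded, so the assignment is placed before any import of mxnet or numpy; that ordering requirement is an assumption and is not stated in this thread. The child command is only a stand-in used to show that subprocesses inherit the setting.

```python
import os
import subprocess
import sys

# Assumption: the variable must be in place before the OpenMP runtime loads,
# so set it at the very top of the script, ahead of `import mxnet` / `import numpy`.
os.environ["KMP_INIT_AT_FORK"] = "false"

# import mxnet as mx   # the heavy imports would follow here in a real script

# Child processes started later inherit the parent's environment by default.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['KMP_INIT_AT_FORK'])"],
    stdout=subprocess.PIPE, universal_newlines=True, check=True)
inherited = child.stdout.strip()
print("child sees KMP_INIT_AT_FORK =", inherited)
```

Setting the variable in the shell instead (KMP_INIT_AT_FORK=false python3 code.py) avoids any dependence on import order.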

@pengzhao-intel
Contributor

Really appreciate the help, @mzhukova @akalinki

@lanking520
Member

Closing this issue for now. Please feel free to reopen it if you run into more problems with it.
