When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL #17980
Description
Problem
I'm not sure how much we care about MKL support, but to the extent it still appears in the build system, operator support should be consistent.
When compiled with MKL present (MKL is found in /opt/intel), MXNet calls MKL for dot and batch_dot and DNNL for fully_connected. These are all GEMM operators; why is it inconsistent? This is making Sockeye decoding 22% slower (see below).
This inconsistency did not matter much in MXNet 1.5.0 because MKLDNN would delegate to MKL. However, aa1074d upgraded to MKLDNN 1.0, which hid the ability of MKLDNN to delegate to MKL (uxlfoundation/oneDNN@3049150). (MKLDNN has since been renamed DNNL.)
Since MKLDNN only hid support for delegating to MKL, it's possible to restore delegation (see workaround).
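To make the "same GEMM" point concrete: FullyConnected with no_bias=True computes x · Wᵀ, which is exactly what dot computes with transpose_b=True. A minimal sketch with arbitrary shapes:

```python
import mxnet as mx

x = mx.nd.random.uniform(shape=(4, 8))    # batch of 4, input dim 8
w = mx.nd.random.uniform(shape=(16, 8))   # num_hidden=16, input dim 8

fc = mx.nd.FullyConnected(x, w, num_hidden=16, no_bias=True)  # x @ w.T, routed to DNNL
gemm = mx.nd.dot(x, w, transpose_b=True)                      # x @ w.T, routed to MKL

print(mx.nd.max(mx.nd.abs(fc - gemm)))  # ~0: the same GEMM either way
```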
Testing
Tested with MXNet cfb474b, compiled with mostly-default cmake settings:

```
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
```

Then when I run:

```
export MKL_VERBOSE=1
export MKLDNN_VERBOSE=1
python3
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 3.00GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x1a0fdc0,1,0x1a0fdc0,1) 1.47ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:24
>>> a = mx.nd.ones(shape=(2,2))
>>> mx.nd.FullyConnected(a,a,num_hidden=2,no_bias=True)
dnnl_verbose,info,DNNL v1.1.2 (commit cb2cc7ac17ff4e2ef50805c7048d33256d82be4d)
dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb2ic2oc2,74.9971
[[2. 2.]
[2. 2.]]
<NDArray 2x2 @cpu(0)>
>>> a = mx.nd.ones(shape=(2,2,2))
>>> mx.nd.batch_dot(a,a)
MKL_VERBOSE SGEMM_BATCH(N,N,0x7fc3238b809c,0x7fc3238b80a0,0x7fc3238b80a4,0x7fc3238b80b4,0x7fc228010b90,0x7fc3238b80a8,0x7fc22800f770,0x7fc3238b80ac,0x7fc3238b80b8,0x7fc2280190e0,0x7fc3238b80b0,0x7fc3238b7fc8,0x7 363.79us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:24
[[[2. 2.]
[2. 2.]]
[[2. 2.]
[2. 2.]]]
<NDArray 2x2x2 @cpu(0)>
>>> mx.nd.dot(a,a)
MKL_VERBOSE SGEMM(N,N,4,4,2,0x7fc3238b8198,0x7fc2280043c0,4,0x7fc2280043c0,2,0x7fc3238b81a0,0x7fc228004580,4) 8.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:24
[[[[2. 2.]
[2. 2.]]
[[2. 2.]
[2. 2.]]]
[[[2. 2.]
[2. 2.]]
[[2. 2.]
[2. 2.]]]]
<NDArray 2x2x2x2 @cpu(0)>
```
You can see DNNL is called for FullyConnected while MKL is called for dot and batch_dot.
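For convenience, here is the same check as a standalone script. This is a sketch; it assumes that setting the verbose flags from Python before mxnet is imported is early enough for the MKL/DNNL libraries to pick them up:

```python
import os

# The verbose flags must be in the environment before MXNet (and the
# MKL/DNNL libraries it links) are loaded.
os.environ["MKL_VERBOSE"] = "1"
os.environ["MKLDNN_VERBOSE"] = "1"

import mxnet as mx

a = mx.nd.ones(shape=(2, 2))
mx.nd.FullyConnected(a, a, num_hidden=2, no_bias=True).wait_to_read()  # logs dnnl_verbose

b = mx.nd.ones(shape=(2, 2, 2))
mx.nd.batch_dot(b, b).wait_to_read()  # logs MKL_VERBOSE SGEMM_BATCH
mx.nd.dot(b, b).wait_to_read()        # logs MKL_VERBOSE SGEMM
```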
Performance impact
I timed Sockeye decoding. Commit aa1074d made decoding 22% slower (416.878s up from 342.037s for b5d07e3) even with MKL installed in /opt/intel/.
| Commit | Compilation | Time (s) |
|---|---|---|
| b5d07e3 (before MKLDNN 1.0 change) | Default | 342.037 |
| aa1074d (MKLDNN 1.0 change) | Default | 416.878 |
| aa1074d (MKLDNN 1.0 change) | Workaround | 343.706 |
| cfb474b (recent) | Default | 385.587 |
| cfb474b (recent) | Workaround | 312.509 |
(Default compilation is `cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..`; the workaround compilation is shown below.)
Tested on a Skylake Xeon (c5.9xlarge, Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz) with OMP_NUM_THREADS=4.
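To see the gap without running a full Sockeye decode, a microbenchmark along these lines can compare the two code paths on a single GEMM. This is a probe with arbitrary shapes and iteration counts, not the Sockeye workload:

```python
import time
import mxnet as mx

# Arbitrary GEMM shapes; not the Sockeye benchmark, just a probe.
x = mx.nd.random.uniform(shape=(64, 1024))
w = mx.nd.random.uniform(shape=(1024, 1024))

def bench(fn, warmup=10, iters=200):
    for _ in range(warmup):
        fn()
    mx.nd.waitall()
    start = time.time()
    for _ in range(iters):
        fn()
    mx.nd.waitall()
    return (time.time() - start) / iters * 1e3  # ms per call

fc_ms = bench(lambda: mx.nd.FullyConnected(x, w, num_hidden=1024, no_bias=True))
dot_ms = bench(lambda: mx.nd.dot(x, w, transpose_b=True))
print("FullyConnected (DNNL): %.3f ms" % fc_ms)
print("dot (MKL):             %.3f ms" % dot_ms)
```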
Workaround
Since DNNL only hid its support for delegating to MKL rather than removing it, it's still possible to turn delegation back on:
```
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
```

This compiles, but then triggers a link error at runtime:

```
OSError: /home/ubuntu/mxnet/build/3rdparty/mkldnn/src/libmkldnn.so.1: undefined symbol: cblas_gemm_s8u8s32_pack
```
So I kludged it with `export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so` and was then able to use MXNet at runtime. There's probably a cleaner way of fixing the linkage.
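An untested alternative to the LD_PRELOAD kludge (an assumption on my part, not something I've verified) would be to load libmkl_rt with global symbol visibility from Python before importing MXNet, so that libmkldnn.so.1 can resolve cblas_gemm_s8u8s32_pack when it loads:

```python
import ctypes

# Assumption: loading MKL's single dynamic library with RTLD_GLOBAL makes its
# symbols visible to libmkldnn.so.1, which is loaded later by the import.
ctypes.CDLL("/opt/intel/mkl/lib/intel64/libmkl_rt.so", mode=ctypes.RTLD_GLOBAL)

import mxnet as mx  # should now load without the undefined-symbol error
```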
Recommended fix
When compiled with MKL, MXNet should call MKL directly from FullyConnected, just as it already does for dot and batch_dot.
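Until such a fix lands (the real change belongs in the C++ operator), a hypothetical Python-level stopgap is to express the operator through dot so the GEMM takes the MKL path. fully_connected_via_dot below is an illustrative helper, not MXNet API:

```python
import mxnet as mx

def fully_connected_via_dot(data, weight, bias=None):
    """Illustrative stand-in for FullyConnected that routes its GEMM
    through dot (and hence MKL) instead of DNNL's inner_product."""
    out = mx.nd.dot(data, weight, transpose_b=True)  # data @ weight.T
    if bias is not None:
        out = mx.nd.broadcast_add(out, bias)
    return out
```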