When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL #17980
Description
Problem
I'm not sure how much we care about MKL support, but to the extent it still appears in the build system, operator support should be consistent.
When compiled with MKL present (MKL is found in /opt/intel), MXNet calls MKL for dot and batch_dot and DNNL for fully_connected. These are all GEMM operators; why is it inconsistent? This is making Sockeye decoding 22% slower (see below).
This inconsistency did not matter much in MXNet 1.5.0 because MKLDNN would delegate to MKL. However, aa1074d upgraded to MKLDNN 1.0, which hid the ability of MKLDNN to delegate to MKL (uxlfoundation/oneDNN@3049150). (MKLDNN has since been renamed DNNL.)
Since MKLDNN only hid support for delegating to MKL, it's possible to restore delegation (see workaround).
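To make the "same GEMM" point concrete: FullyConnected with no_bias=True computes x · Wᵀ, which is exactly what dot computes with transpose_b=True. A minimal sketch with arbitrary shapes:

```python
import mxnet as mx

x = mx.nd.random.uniform(shape=(4, 8))    # batch of 4, input dim 8
w = mx.nd.random.uniform(shape=(16, 8))   # num_hidden=16, input dim 8

fc = mx.nd.FullyConnected(x, w, num_hidden=16, no_bias=True)  # x @ w.T, routed to DNNL
gemm = mx.nd.dot(x, w, transpose_b=True)                      # x @ w.T, routed to MKL

print(mx.nd.max(mx.nd.abs(fc - gemm)))  # ~0: the same GEMM either way
```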
Testing
Tested with MXNet cfb474b, compiled with mostly-default cmake settings:

```
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
```

Then when I run:

```
export MKL_VERBOSE=1
export MKLDNN_VERBOSE=1
python3
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 3.00GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x1a0fdc0,1,0x1a0fdc0,1) 1.47ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:24
>>> a = mx.nd.ones(shape=(2,2))
>>> mx.nd.FullyConnected(a,a,num_hidden=2,no_bias=True)
dnnl_verbose,info,DNNL v1.1.2 (commit cb2cc7ac17ff4e2ef50805c7048d33256d82be4d)
dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb2ic2oc2,74.9971
[[2. 2.]
[2. 2.]]
<NDArray 2x2 @cpu(0)>
>>> a = mx.nd.ones(shape=(2,2,2))
>>> mx.nd.batch_dot(a,a)
MKL_VERBOSE SGEMM_BATCH(N,N,0x7fc3238b809c,0x7fc3238b80a0,0x7fc3238b80a4,0x7fc3238b80b4,0x7fc228010b90,0x7fc3238b80a8,0x7fc22800f770,0x7fc3238b80ac,0x7fc3238b80b8,0x7fc2280190e0,0x7fc3238b80b0,0x7fc3238b7fc8,0x7 363.79us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:24
[[[2. 2.]
[2. 2.]]
[[2. 2.]
[2. 2.]]]
<NDArray 2x2x2 @cpu(0)>
>>> mx.nd.dot(a,a)
MKL_VERBOSE SGEMM(N,N,4,4,2,0x7fc3238b8198,0x7fc2280043c0,4,0x7fc2280043c0,2,0x7fc3238b81a0,0x7fc228004580,4) 8.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:24
[[[[2. 2.]
[2. 2.]]
[[2. 2.]
[2. 2.]]]
[[[2. 2.]
[2. 2.]]
[[2. 2.]
[2. 2.]]]]
<NDArray 2x2x2x2 @cpu(0)>
```
You can see DNNL is called for FullyConnected while MKL is called for dot and batch_dot.
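For convenience, here is the same check as a standalone script. This is a sketch; it assumes that setting the verbose flags from Python before mxnet is imported is early enough for the MKL/DNNL libraries to pick them up:

```python
import os

# The verbose flags must be in the environment before MXNet (and the
# MKL/DNNL libraries it links) are loaded.
os.environ["MKL_VERBOSE"] = "1"
os.environ["MKLDNN_VERBOSE"] = "1"

import mxnet as mx

a = mx.nd.ones(shape=(2, 2))
mx.nd.FullyConnected(a, a, num_hidden=2, no_bias=True).wait_to_read()  # logs dnnl_verbose

b = mx.nd.ones(shape=(2, 2, 2))
mx.nd.batch_dot(b, b).wait_to_read()  # logs MKL_VERBOSE SGEMM_BATCH
mx.nd.dot(b, b).wait_to_read()        # logs MKL_VERBOSE SGEMM
```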
Performance impact
I timed Sockeye decoding. Commit aa1074d made decoding 22% slower (416.878s up from 342.037s for b5d07e3) even with MKL installed in /opt/intel/.
| Commit | Compilation | Time (s) |
|---|---|---|
| b5d07e3 (before MKLDNN 1.0 change) | Default | 342.037 |
| aa1074d (MKLDNN 1.0 change) | Default | 416.878 |
| aa1074d (MKLDNN 1.0 change) | Workaround | 343.706 |
| cfb474b (recent) | Default | 385.587 |
| cfb474b (recent) | Workaround | 312.509 |
(Default compilation is `cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..`; the workaround compilation is shown below.)
Tested on a Skylake Xeon (c5.9xlarge, Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz) with OMP_NUM_THREADS=4.
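To see the gap without running a full Sockeye decode, a microbenchmark along these lines can compare the two code paths on a single GEMM. This is a probe with arbitrary shapes and iteration counts, not the Sockeye workload:

```python
import time
import mxnet as mx

# Arbitrary GEMM shapes; not the Sockeye benchmark, just a probe.
x = mx.nd.random.uniform(shape=(64, 1024))
w = mx.nd.random.uniform(shape=(1024, 1024))

def bench(fn, warmup=10, iters=200):
    for _ in range(warmup):
        fn()
    mx.nd.waitall()
    start = time.time()
    for _ in range(iters):
        fn()
    mx.nd.waitall()
    return (time.time() - start) / iters * 1e3  # ms per call

fc_ms = bench(lambda: mx.nd.FullyConnected(x, w, num_hidden=1024, no_bias=True))
dot_ms = bench(lambda: mx.nd.dot(x, w, transpose_b=True))
print("FullyConnected (DNNL): %.3f ms" % fc_ms)
print("dot (MKL):             %.3f ms" % dot_ms)
```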
Workaround
Since DNNL only hid its support for delegating to MKL rather than removing it, it's still possible to turn delegation back on:
```
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
```

This compiles, but then triggers a link error at runtime:

```
OSError: /home/ubuntu/mxnet/build/3rdparty/mkldnn/src/libmkldnn.so.1: undefined symbol: cblas_gemm_s8u8s32_pack
```
So I kludged it with `export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so` and was then able to use MXNet at runtime. There's probably a cleaner way of fixing the linkage.
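An untested alternative to the LD_PRELOAD kludge (an assumption on my part, not something I've verified) would be to load libmkl_rt with global symbol visibility from Python before importing MXNet, so that libmkldnn.so.1 can resolve cblas_gemm_s8u8s32_pack when it loads:

```python
import ctypes

# Assumption: loading MKL's single dynamic library with RTLD_GLOBAL makes its
# symbols visible to libmkldnn.so.1, which is loaded later by the import.
ctypes.CDLL("/opt/intel/mkl/lib/intel64/libmkl_rt.so", mode=ctypes.RTLD_GLOBAL)

import mxnet as mx  # should now load without the undefined-symbol error
```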
Recommended fix
When compiled with MKL, MXNet should call MKL directly from FullyConnected, just as it already does for dot and batch_dot.
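Until such a fix lands (the real change belongs in the C++ operator), a hypothetical Python-level stopgap is to express the operator through dot so the GEMM takes the MKL path. fully_connected_via_dot below is an illustrative helper, not MXNet API:

```python
import mxnet as mx

def fully_connected_via_dot(data, weight, bias=None):
    """Illustrative stand-in for FullyConnected that routes its GEMM
    through dot (and hence MKL) instead of DNNL's inner_product."""
    out = mx.nd.dot(data, weight, transpose_b=True)  # data @ weight.T
    if bias is not None:
        out = mx.nd.broadcast_add(out, bias)
    return out
```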