MKL2017 implemented layers integration to improve IA perf. on mxnet #3581
Conversation
Since mxnet updated to the new inception-bn model, the padding patch for using the old model is no longer needed. Signed-off-by: Lingyan <lingyan.guo@intel.com>
res_convolutionBwdBias[dnnResourceDiffDst] =
    bwdb_top_diff->get_converted_prv(grad.dptr_, false);
if (bwdb_bias_diff->conversion_needed()) {
  MKL_DLOG(INFO) << "MKLCONV: bwd bias diff needs converted ";
don't show these unless debugging
Done.
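As a side note on hiding these messages, here is a minimal, self-contained sketch of gating the conversion logs behind a compile-time debug flag. The MKLDNN_DEBUG flag name and this particular MKL_DLOG definition are assumptions for illustration only, not the macro actually used in this PR.

// Sketch: verbose conversion logs are discarded unless built with -DMKLDNN_DEBUG.
// The null-streambuf ostream silently swallows everything written to it.
#include <iostream>
#include <ostream>

#if defined(MKLDNN_DEBUG)
static std::ostream& mkl_dlog_stream() { return std::cerr; }
#else
static std::ostream& mkl_dlog_stream() {
  static std::ostream null_stream(nullptr);  // badbit set, output dropped
  return null_stream;
}
#endif

#define MKL_DLOG(severity) mkl_dlog_stream() << #severity << ": "

int main() {
  MKL_DLOG(INFO) << "MKLCONV: bwd bias diff needs converted\n";
  return 0;
}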
code looks good to me.
please make sure the tests pass when USE_MKL2017=1:
Currently it fails for me with:
Eric, thanks for the quick response. The debug messages and lint are clean now. But I cannot reproduce your issue on my side when running nosetests, though there are two other issues from the unit tests to follow up on, listed below. I am using a clean CentOS install and only enable USE_MKL2017=1 while keeping USE_BLAS=atlas. Could you pls let me know your environment and more details so I can debug further? BTW, I will get back to you soon about the two issues below. Actually, if the tolerance for deconv_grad is changed from 1e-6 to 1e-4, the test passes.
FAIL: test_executor.test_dot
Traceback (most recent call last):
FAIL: test_operator.test_deconvolution
Traceback (most recent call last):
Ran 90 tests in 21.202s
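On the tolerance point, a minimal, self-contained sketch of the relative-difference comparison involved: a gradient that is off from the reference by roughly 1e-5 fails a 1e-6 threshold but passes 1e-4. The numbers and the reldiff helper below are illustrative only, not taken from test_operator.py.

// Illustrative only: why relaxing the deconv_grad tolerance from 1e-6 to 1e-4
// makes the comparison pass when the two results differ by about 1e-5.
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

static double reldiff(const std::vector<double>& a, const std::vector<double>& b) {
  double diff = 0.0, norm = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    diff += std::fabs(a[i] - b[i]);
    norm += std::fabs(a[i]) + std::fabs(b[i]);
  }
  return norm == 0.0 ? 0.0 : diff / norm;
}

int main() {
  std::vector<double> reference = {0.31, -1.20, 0.07};
  std::vector<double> computed  = {0.310004, -1.200007, 0.069996};  // ~1e-5 off
  const double d = reldiff(reference, computed);
  assert(d < 1e-4);  // passes with the relaxed threshold
  assert(d > 1e-6);  // would have failed the original 1e-6 threshold
  return 0;
}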
@@ -15,6 +20,19 @@ Operator* CreateOp<cpu>(ConvolutionParam param, int dtype,
                         std::vector<TShape> *out_shape,
                         Context ctx) {
  Operator *op = NULL;
#if MXNET_USE_MKL2017 == 1
  if ((param.dilate[0] == 1 && param.dilate[1] == 1)
      && param.kernel.ndim() == 2) {
does mkl convolution support grouping? if not should also check for it
Yes, MKL supports grouping. We use the grouped convolution API in mkl_convolution-inl.h, for example:
dnnGroupsConvolutionCreateForward
dnnGroupsConvolutionCreateBackwardData
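For readers unfamiliar with that interface, a minimal sketch of creating a grouped forward convolution primitive with the MKL 2017 DNN API is below. The shapes are illustrative and the argument order is written from memory of mkl_dnn.h, so treat the details as assumptions rather than a quote from mkl_convolution-inl.h.

// Sketch: the group count is a first-class argument of the MKL 2017 DNN API,
// so no extra fallback is needed for grouped convolution in the dispatch code.
#include <mkl_dnn.h>
#include <cstddef>

dnnPrimitive_t CreateGroupedConvFwd(size_t groups) {
  const size_t dim = 4;                                   // NCHW tensor
  size_t src_size[4]    = {28, 28, 32, 16};               // W, H, C, N
  size_t dst_size[4]    = {28, 28, 64, 16};
  // Grouped filters carry a trailing group dimension: {kW, kH, C/g, K/g, g}.
  size_t filter_size[5] = {3, 3, 32 / groups, 64 / groups, groups};
  size_t strides[2]     = {1, 1};
  int    offsets[2]     = {-1, -1};                       // pad = 1 on each side

  dnnPrimitive_t fwd = nullptr;
  dnnError_t status = dnnGroupsConvolutionCreateForward_F32(
      &fwd, nullptr, dnnAlgorithmConvolutionDirect, groups, dim,
      src_size, dst_size, filter_size, strides, offsets, dnnBorderZeros);
  return status == E_SUCCESS ? fwd : nullptr;
}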
I'm using USE_CUDA=1, USE_CUDNN=1, USE_MKL2017=1
Can you try changing nothing but USE_MKL2017=1 in make/config.mk to see if you still have the issue? BTW, what hardware are you running?
tried only using MKL; python tests/python/unittest/test_operator.py gives the same error:
[13:03:37] /scratch/intel-mxnet/dmlc-core/include/dmlc/logging.h:235: [13:03:37] src/engine/./threaded_engine.h:306: [13:03:37] src/operator/mkl/mkl_memory.cc:47: Check failed: (status) == (E_SUCCESS) Failed creation convert_to_int with status -1 for buffer: fwd_filter_data @ MKLConvolutionOp
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
I know your issue. You have an old MKL lib pre-installed that does not support the MKL DNN API yet. By default, if you installed anaconda2, there will be an old MKL lib in it. You need to remove it (e.g. conda remove mkl mkl-service) and point LD_LIBRARY_PATH at the downloaded MKLML package.
Pls let me know if it works.
didn't work.
could you show me your output from
Did you try "conda remove mkl mkl-service"? The group feature is supported in MKL20160801 or later. I know MKL env. setup is annoying; for a quick test, you can remove anaconda2 and then try again. If you set LD_LIBRARY_PATH to the downloaded MKLML package, the version should be MKL20160801. My log is below:
[root@epp-hpc-h4-18 mxnet]# nosetests --verbose tests/python/unittest/test_operator.py
FAIL: test_operator.test_deconvolution
Traceback (most recent call last):
Ran 41 tests in 7.571s
FAILED (failures=1)
Ok it works.
Thanks for the suggestion. I did some debugging. This happens because, if anaconda is installed with MKL, numpy and the related Python packages load the MKL and iomp5 libs bundled in anaconda, which are found first. Because of this, the old MKL version in anaconda is initialized and loaded, so calls to the new API fail:
[18:49:27] /home/zhenlin/Mxnet/upstream/intel-mxnet/intel-mxnet/dmlc-core/include/dmlc/logging.h:235: [18:49:27] src/engine/./threaded_engine.h:306: [18:49:27] src/operator/mkl/mkl_memory.cc:51: Check failed: (status) == (E_SUCCESS) Failed creation convert_to_int with status -1 for buffer: fwd_filter_data @ MKL
It is not an issue of MKL DNN compatibility with anaconda; it is the usual annoyance of having two different MKL versions installed on one system. The quick fix is to remove MKL from anaconda until anaconda integrates the latest MKL2017. Pls refer to https://www.continuum.io/blog/developer-blog/anaconda-25-release-now-mkl-optimizations. Since I believe anaconda will integrate the new MKL2017 soon, this issue will disappear automatically. I will put this into the README as a known issue, and I will keep looking for a better workaround. But I think this should not block the PR merge, right? Pls let me know.
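One way to confirm which MKL actually got loaded into the process (anaconda's older copy vs. the newer MKLML package) is to print the runtime version string. mkl_get_version_string() is part of the MKL service API; the snippet below is a diagnostic sketch I am suggesting, not something shipped with this PR.

// Prints the version of the MKL library the process actually loaded, which is
// the quickest way to spot an old anaconda MKL shadowing the MKLML package.
#include <cstdio>
#include <mkl.h>  // declares mkl_get_version_string()

int main() {
  char version[256];
  mkl_get_version_string(version, static_cast<int>(sizeof(version)));
  std::printf("Loaded MKL: %s\n", version);
  return 0;
}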
hi all:
FAIL: test_executor.test_dot
Traceback (most recent call last):
For FAIL: test_executor.test_dot issue
ok. Please fix the deconvolution test and we can merge.
Hi @piiswrong,
Pls review the latest MKL2017 implemented-layer patches covering conv, relu, lrn, batch-norm, concat, elementwise-sum, pooling and fc (8 layers in total). The code related to PR#3458 has been merged into this. This version keeps the MKL buffer inside each layer and converts the input and output at the layer boundaries. Once your review is done, I will submit another PR with a much higher-performance patch that passes the MKL buffer between layers.
Pls kindly review ASAP.
Thanks
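To make the "MKL buffer inside the layer, conversion at the boundaries" design concrete, below is a minimal sketch built on the MKL 2017 DNN conversion primitives. The PR's get_converted_prv / conversion_needed helpers wrap this same pattern, but the helper shown here is my own illustration, not code from the patch.

// Sketch of the per-layer conversion step: keep data in the primitive's
// private layout only while the layer runs, converting from the user layout
// on the way in (and symmetrically back on the way out).
#include <mkl_dnn.h>

// Returns a buffer laid out as prv_layout. If the user layout already matches,
// the original pointer is returned and no copy happens; otherwise a new MKL
// buffer is allocated (the caller releases it with dnnReleaseBuffer_F32).
void* ToPrivate(void* usr, dnnLayout_t usr_layout, dnnLayout_t prv_layout) {
  if (dnnLayoutCompare_F32(usr_layout, prv_layout)) {
    return usr;  // layouts match: conversion not needed
  }
  void* prv = nullptr;
  dnnAllocateBuffer_F32(&prv, prv_layout);

  dnnPrimitive_t cvt = nullptr;
  dnnConversionCreate_F32(&cvt, usr_layout, prv_layout);
  dnnConversionExecute_F32(cvt, usr, prv);
  dnnDelete_F32(cvt);
  return prv;
}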