ping @zheng-da
@ashokei could you also help review this PR?
@@ -129,6 +129,8 @@ void LRNComputeExCPU(const nnvm::NodeAttrs &attrs,
  MKLDNN_OPCHECK_INIT(false, 1, inputs, outputs);
  MKLDNNLRNForward(ctx, param, inputs[0], req[0], outputs[0]);
  MKLDNN_OPCHECK_RUN(LRNCompute<cpu>, attrs, ctx, inputs, req, outputs);
  // Copy outputs[1] from opcheck reference as backward check needs it.
  MKLDNN_OPCHECK_COPY_RESULT(outputs, std::vector<size_t>{1});
do we have to do it for all operators?
LRN is a special operator: MKLDNN only generates outputs[0], but the default CPU backward computation requires outputs[1] as well, which eventually makes opcheck fail. After copying outputs[1], this problem is fixed.
Maybe you're wondering whether it's necessary for all operators. Firstly, we only found this issue on LRN. Secondly, it would make results differ before and after enabling opcheck because of accumulated accuracy loss. Thirdly, it would hurt opcheck performance a lot. So I prefer not to apply this to all operators.
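To make the intent concrete, here is a minimal sketch of what "copy the reference result back into the outputs" means. It is illustrative only and does not reproduce the actual MKLDNN_OPCHECK_COPY_RESULT macro; the Blob type and CopyOpcheckResult name are hypothetical stand-ins.

#include <cstddef>
#include <vector>

struct Blob { std::vector<float> data; };  // stand-in for NDArray

// Overwrite selected MKLDNN outputs with the reference (default CPU) results
// so a later backward check can find the extra output it depends on.
void CopyOpcheckResult(const std::vector<Blob> &reference_outputs,
                       std::vector<Blob> *outputs,
                       const std::vector<size_t> &indices) {
  for (size_t i : indices) {
    (*outputs)[i].data = reference_outputs[i].data;
  }
}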
thanks for explaining. i just didn't understand why we needed this.
src/operator/nn/mkldnn/mkldnn_act.cc (outdated)
fw_pdesc);
return bw_pdesc;
});
LOG(INFO) << "Unsupported data type for MKLDNN activation";
I think you should use LOG(FATAL) here.
Agreed.
it seems you changed the one above, but not this one.
mkldnn::eltwise_forward::primitive_desc fw_pdesc(fw_desc, cpu_engine);
mkldnn::eltwise_backward::desc bw_desc(alg, diff_md, data_md, 0.0);
mkldnn::eltwise_backward::primitive_desc bw_pdesc(bw_desc, cpu_engine,
                                                  fw_pdesc);
the code shouldn't run here, right? can we return an empty descriptor?
Yes, it should be unreachable code. As it's hard to create an empty descriptor (MKLDNN doesn't provide a function to do that), I think a fatal error is enough.
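For illustration, a minimal sketch of the pattern being agreed on here, assuming dmlc-style logging as used in MXNet; the function and enum names are hypothetical, not the actual mkldnn_act.cc code. Every supported type returns a descriptor, so the fall-through line should abort via LOG(FATAL) rather than merely print with LOG(INFO).

#include <dmlc/logging.h>

enum class ActDType { kFloat32, kFloat16 };

// Stand-in for building an MKLDNN backward primitive descriptor.
int GetBwdDescKind(ActDType dtype) {
  switch (dtype) {
    case ActDType::kFloat32: return 32;
    case ActDType::kFloat16: return 16;
  }
  // Unreachable for supported types; abort loudly on anything else.
  LOG(FATAL) << "Unsupported data type for MKLDNN activation";
  return -1;  // never reached; satisfies the compiler's return-path check
}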
}

void SetDataNewMem(const mkldnn::memory &out_grad, const mkldnn::memory &weight,
const mkldnn::memory &in_grad) {
indent
void SetWeightNewMem(const mkldnn::memory &data,
const mkldnn::memory &out_grad,
const mkldnn::memory &in_grad_weight) {
indent
@@ -133,13 +134,23 @@ void MKLDNNLRNFwd::_Init(const LRNParam &param,

void MKLDNNLRNFwd::SetDataHandle(const NDArray &in_data,
                                 const NDArray &out_data) {
why is in_data not used?
It's used in line 140:
this->SetDataHandle(in_data, out_data_mem);
@@ -82,6 +82,101 @@ inline static mkldnn::inner_product_backward_weights::primitive_desc GetIPBwdWei
}
}

class MKLDNNFullyConnectForward {
could you take the caching of FullyConnected forward out and submit another PR for it?
my concern is that this PR is for caching backward; mixing it with all the other backward operators might cause confusion later. thanks
OK. I will move FullyConnected forward out of this PR.
For the FullyConnected forward PR, see #11611.
@ZhennanQin sorry for the late reply. I don't see big problems in the PR.
Changes look OK to me, great performance improvement! Since these are primarily for perf optimization, I assume the existing unit tests should be OK for functionality checking.
@marcoabreu Tests are covered by 2 parts:
This PR is essentially refactoring work, which doesn't add any new feature or functionality, so the above tests are enough.
Excellent, thanks a lot!
@zheng-da @marcoabreu Code is refactored based on the comments:
If there are no more comments, could you please approve and merge this ASAP? Thanks.
Ping @zheng-da @anirudh2290
Thanks for the contribution @ZhennanQin!
This is reviewed. We agreed to merge this into 1.4 instead of 1.3.
@zheng-da The 1.3 branch is cut. I think it's time to merge the PR into master now. BTW, we will rebase the code soon :)
@pengzhao-intel yes. After you rebase successfully, I'll browse through the code again.
@zheng-da Code is rebased, please have a look. Thanks for the engagement.
src/operator/nn/mkldnn/mkldnn_act.cc (outdated)
*diff_dst_memory,
*diff_src_memory.second));
});
TmpMemMgr::Get()->Init(ctx.requested[activation::kTempSpace]);
why did you move it here?
Can't remember the details; maybe it was changed by mistake. I've changed it back.
everything else looks fine to me.
@azai91 @eric-haibin-lin could you also help review the PR?
@azai91 @eric-haibin-lin @anirudh2290 please review. This has been out for review for a long time...
* Enable primitive allocation cache for _backward_Activation. Change-Id: I545628ff68a54cb01b7fef323dc3de9bd47b1a19
* Enable primitive allocation cache for _backward_Deconvolution. Change-Id: I1e9bf1b9b44bae52068a9c564dff037851e896e5
* Enable primitive allocation cache for _backward_Pooling. Change-Id: Idbe94e21f1e2ddf711523767194b95beda19b120
* Enable primitive allocation cache for _backward_LRN. Change-Id: Iefe9f720de719ec2e2f5d24a006602425136711b
* Enable primitive allocation cache for _backward_BatchNorm. Change-Id: I9e52651bd830b8cb5d2f193076ef51606c9056f9
* Enable primitive allocation cache for _backward_Convolution. Change-Id: I0496fa2394ee036d05c58f3abc1d74af544c7bca
* Enable primitive allocation cache for _backward_Fully_Connected. Change-Id: I8347527ec1271b1518921a74e3581d7d84187429
* remove fc forward and fix indent problem
* remove fc forward and fix convolution indent problem
* Change log level to FATAL for unreachable code in mkldnn_act.cc
* remove fc forward and fix convolution indent problem
* remove useless hint in fc
* Merge branch 'master' into backward_op_cache
* Empty commit to retrigger the CI.
* Change LOG(INFO) to LOG(FATAL) for unreachable code in mkldnn_act.cc
* Fix build issue after code merge.
* Fix lint after merge
* Fix mkldnn act.
Hi all,
For MKLDNN, creating an operator primitive and its corresponding memory takes a lot of time; in many scenarios, an already-created primitive can be reused if it meets the requirements of the next computation. In this PR, we implemented the caching mechanism for most backward operators, just like the optimization already done for forward operators. Correctness tests and accuracy tests all PASS. Performance tests show that most models get a training speedup. Please review this, thanks.
@zheng-da, @marcoabreu, @azai91, @TaoLv, @pengzhao-intel
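As a rough illustration of the caching idea described above (a hedged sketch, not the actual MXNet code; BwdPrimitive, GetCachedBwdPrimitive, and the string key are hypothetical), the pattern is a thread-local map from an operator signature to an already-created primitive, so the expensive creation happens only on the first call for a given configuration. In the real code, the key would typically encode the operator parameters and tensor shapes/layouts.

#include <memory>
#include <string>
#include <unordered_map>

// Stand-in for an MKLDNN backward primitive plus its memory objects.
struct BwdPrimitive {
  explicit BwdPrimitive(const std::string &key) : key(key) {}
  std::string key;
};

// Return a cached primitive for this signature, creating it on first use.
BwdPrimitive &GetCachedBwdPrimitive(const std::string &signature) {
  // One cache per thread: no locking needed, primitives stay thread-private.
  static thread_local std::unordered_map<std::string,
                                          std::unique_ptr<BwdPrimitive>> cache;
  auto it = cache.find(signature);
  if (it == cache.end()) {
    // Cache miss: pay the creation cost once, then reuse on later calls.
    it = cache.emplace(signature,
                       std::unique_ptr<BwdPrimitive>(
                           new BwdPrimitive(signature))).first;
  }
  return *it->second;
}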
This PR covers the commits below.
Enable primitive allocation cache for _backward_LRN.
Enable primitive allocation cache for _backward_Pooling.
Enable primitive allocation cache for _backward_Activation.
Enable primitive allocation cache for _backward_Deconvolution.
Enable primitive allocation cache for _backward_BatchNorm.
Enable primitive allocation cache for _backward_Fully_Connected.
Enable primitive allocation cache for _backward_Convolution.
Correctness test result:
unit_test - PASS
Train model with dummy data when MXNET_MKLDNN_DEBUG=1
Accuracy test result:
CIFAR-10 + ResNet50 (convergence): validation accuracy after 99 epochs (top1: 67.88%, top5: 96.26%)
CIFAR-10 + VGG16 (convergence): validation accuracy after 74 epochs (top1: 82.56%, top5: 98.52%)
Performance test result:
Skylake 8180, 2 sockets, training, BS=32 (img/s)
Total time consumption breakdown (ms)