Dnnl training #6045

Merged: 13 commits, Jan 30, 2021
Changes from 1 commit
Fixed misc issues when testing training code with dnnl provider
* fix conv_grad DNNL tests with dilation so they run on the DNNL execution provider

* update mnist training sample to accept convolution type models

  convolution models require the input shape to be {1, 28, 28}
  instead of the flat {784} image that is used for the gemm models

  models that require the different shape are enabled by adding
  `--model_type conv` to the command line when running the mnist sample,
  as in the sketch after this bullet.
  (while testing a workaround was used, see #4762)
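A minimal sketch of the shape selection described above (hypothetical helper name; the sample's actual argument handling may differ):

```
// Hypothetical sketch, not the sample's actual code: maps the
// --model_type flag to the input shape the model expects.
#include <cstdint>
#include <string>
#include <vector>

std::vector<int64_t> MnistInputShape(const std::string& model_type) {
  if (model_type == "conv") {
    return {1, 28, 28};  // convolution models take image-shaped input
  }
  return {784};          // gemm models take the flattened 28x28 image
}
```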

* Disable weight caching in dnnl conv operator when using training

  When training we cannot use cached weights because the weights
  are updated each run (see the sketch after this bullet). This
  re-enables the DNNL Conv and ConvGrad ops. The weight caching was
  the source of the error from Conv when training.
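A simplified sketch of the guard described above; `Memory` and the member names are stand-ins, and the real change is in the dnnl_conv.h diff further down:

```
// Simplified sketch of the ENABLE_TRAINING guard around weight caching.
// Memory stands in for dnnl::memory; the cache member stands in for the
// provider-level SetWeightsMemoryBuffer() call.
#include <memory>

struct Memory {};  // stand-in for dnnl::memory

class ConvKernelSketch {
 public:
  void StoreReorderedWeights(std::shared_ptr<Memory> reordered) {
#ifndef ENABLE_TRAINING
    // inference: cache the reordered weights so the reorder runs only once
    cached_weights_ = reordered;
#else
    // training: keep the weights local to this run; their values change
    // every iteration, so a cached copy would go stale
    filter_dst_mem_ = reordered;
#endif
  }

 private:
  std::shared_ptr<Memory> cached_weights_;  // used by the inference path
  std::shared_ptr<Memory> filter_dst_mem_;  // used by the training path
};
```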

* Fix issues found when building grad ops on Linux
  * The dnnl_convgrad code was over-qualifying constructor calls with the
    scope operator, causing a compilation problem on GCC (illustrated in
    the sketch after this list).
  * The dnnl_maxpoolgrad code had a logic error: it was comparing with
    the source description when it should have been comparing with the
    destination description.
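A stand-alone illustration of the over-qualification issue (not the actual ONNX Runtime code): MSVC accepts calling a constructor through the scope operator, but GCC/Clang reject it, which is why the build only broke on Linux.

```
// Illustration only: GCC/Clang reject naming a constructor with the scope
// operator, while MSVC historically accepts it.
struct desc {
  explicit desc(int alg) : alg_(alg) {}
  int alg_;
};

int main() {
  // desc bad = desc::desc(1);  // fails on GCC/Clang: constructor cannot be called directly
  desc good = desc(1);          // portable form: use the class name, not Class::Class
  return good.alg_;
}
```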

* Update BUILD.md so it shows DNNL for training
  * Updated the table of contents. Since the same providers are listed
    twice, once for Inferencing and again for Training, an HTML anchor
    was added to distinguish the second header from the first for the TOC.

* Fix build failure when not using the --enable_training build option

* reorganize the gradient operators so they are grouped together

* Fix issues found when running onnx_backend_test_series.py

* Pooling code supports 2 outputs only when built with --enable_training

* Address code review feedback
  * class member variables end in underscore_
  * use dst instead of dist to match the pattern used elsewhere in the DNNL code.

* Remove the workaround that was introduced to handle problems running
  convolution-based training models. See issue #4762

Signed-off-by: George Nash <george.nash@intel.com>
georgen117 committed Jan 5, 2021
commit bd7b6728f23fd9d3df53a2c8ba6f091296d2983d
55 changes: 50 additions & 5 deletions BUILD.md
@@ -37,6 +37,11 @@
* [iOS](#iOS)

**[Training](#Training)**
* [Baseline CPU](#baseline-cpu)
* Training Enabled Execution Providers
* [NVIDIA CUDA](#cuda-training)
* [ROCM](#ROCM)
* [Intel DNNL/MKL-ML](#dnnl-training)

***
# Inferencing
@@ -133,6 +138,7 @@ GCC 4.x and below are not supported.
|**Use OpenMP**|--use_openmp|OpenMP will parallelize some of the code for potential performance improvements. This is not recommended for running on single threads.|
|**Build using parallel processing**|--parallel|This is strongly recommended to speed up the build.|
|**Build Shared Library**|--build_shared_lib||
|**Enable Training support**|--enable_training||

### APIs and Language Bindings
|API|Command|Additional details|
@@ -1178,8 +1184,32 @@ Dockerfile instructions are available [here](./dockerfiles#migraphx)
***

# Training
## CUDA
### Prerequisites

## Baseline CPU

### Build Instructions
To build ORT with training support, add the `--enable_training` build option.

All other build options are the same for inferencing as they are for training.

#### Windows
```
.\build.bat --config RelWithDebInfo --build_shared_lib --parallel --enable_training
```

The default Windows CMake Generator is Visual Studio 2017, but you can also use the newer Visual Studio 2019 by passing
`--cmake_generator "Visual Studio 16 2019"` to `.\build.bat`


#### Linux/macOS
```
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --enable_training
```

## Training Enabled Execution Providers

### <a id="cuda-training">CUDA</a>
#### Prerequisites

The default NVIDIA GPU build requires CUDA runtime libraries installed on the system:

@@ -1191,7 +1221,7 @@ The default NVIDIA GPU build requires CUDA runtime libraries installed on the sy

These dependency versions should reflect what is in [Dockerfile.training](./dockerfiles/Dockerfile.training).

### Build instructions
#### Build instructions

1. Checkout this code repo with `git clone https://github.com/microsoft/onnxruntime`

@@ -1213,8 +1243,8 @@ These dependency versions should reflect what is in [Dockerfile.training](./dock

This produces the .whl file in `./build/Linux/RelWithDebInfo/dist` for ONNX Runtime Training.

## ROCM
### Prerequisites
### ROCM
#### Prerequisites

The default AMD GPU build requires ROCM software toolkit installed on the system:

@@ -1234,3 +1264,18 @@ These dependency versions should reflect what is in [Dockerfile.training](./dock
* Run `./build.sh --config RelWithDebInfo --enable_training --build_wheel --use_rocm --rocm_home /opt/rocm --nccl_home /opt/rocm --mpi_home <location for openmpi>`

This produces the .whl file in `./build/Linux/RelWithDebInfo/dist` for ONNX Runtime Training.

### <a id="dnnl-training"> DNNL and MKLML </a>

#### Build Instructions
##### Linux

`./build.sh --enable_training --use_dnnl`

##### Windows

`.\build.bat --enable_training --use_dnnl`

Add `--build_wheel` to build the ONNX Runtime wheel.

This will produce a .whl file in `build/Linux/RelWithDebInfo/dist` for ONNX Runtime Training.
14 changes: 9 additions & 5 deletions onnxruntime/core/providers/dnnl/dnnl_execution_provider.h
@@ -92,12 +92,12 @@ class DNNLExecutionProvider : public IExecutionProvider {
// Note if a DnnlKernel already exists this will replace the existing kernel with the
// new kernel. This was done so the latest kernel is always placed in the map.
void SetForwardKernel(onnxruntime::NodeIndex key, std::shared_ptr<ort_dnnl::DnnlKernel> kernel) {
fwd_kernal_map[key] = kernel;
fwd_kernal_map_[key] = kernel;
}

// Fetch the kernel using the NodeIndex
std::shared_ptr<ort_dnnl::DnnlKernel> GetForwardKernal(onnxruntime::NodeIndex key) {
return fwd_kernal_map.at(key);
return fwd_kernal_map_.at(key);
}

std::stack<std::shared_ptr<ort_dnnl::DnnlKernel>> fwd_conv_stack;
@@ -119,7 +119,7 @@ class DNNLExecutionProvider : public IExecutionProvider {
// running in training mode. The backward kernels need access to the forward kernels, typically
// to obtain the forward primitive description, but it may be needed for other items like
// accessing workspace memory.
std::map<onnxruntime::NodeIndex, std::shared_ptr<ort_dnnl::DnnlKernel>> fwd_kernal_map;
std::map<onnxruntime::NodeIndex, std::shared_ptr<ort_dnnl::DnnlKernel>> fwd_kernal_map_;
#endif
// SUBGRAPH
private:
@@ -164,9 +164,13 @@ class DNNLExecutionProvider : public IExecutionProvider {
if (node_inputs[0]->Shape() != nullptr && node_inputs[0]->Shape()->dim_size() < 3) {
supported = false;
}

#ifdef ENABLE_TRAINING
if (node->OutputDefs().size() > 2)
supported = false;
#else
if (node->OutputDefs().size() > 1)
supported = false;
#endif

}
return supported;
@@ -194,7 +198,7 @@ class DNNLExecutionProvider : public IExecutionProvider {

private:
// supported Dnnl Operators
std::set<std::string> dnnl_ops_ = {/*"Conv", "ConvGrad",*/ "BatchNormalization", "Relu", "ReluGrad", "Sum",
std::set<std::string> dnnl_ops_ = {"Conv", "ConvGrad", "BatchNormalization", "Relu", "ReluGrad", "Sum",
"AveragePool", "GlobalMaxPool", "GlobalAveragePool", "MaxPool", "MaxPoolGrad", "LRN"};

mutable std::unordered_map<std::string, std::shared_ptr<ort_dnnl::Subgraph>> mkl_subgraphs_;
26 changes: 23 additions & 3 deletions onnxruntime/core/providers/dnnl/subgraph/dnnl_conv.h
@@ -461,7 +461,6 @@ class DnnlConv : public DnnlKernel {
auto xdim = tensor_shape.size();

TensorShape W(xshape, xdim);
const T* filter_data = const_cast<T*>(ort.GetTensorData<T>(input_tensor));

const int group_mkl = static_cast<int>(group_);

@@ -474,6 +473,7 @@ class DnnlConv : public DnnlKernel {
filter_dims_mkl.insert(filter_dims_mkl.end(), W.GetDims().begin() + 1, W.GetDims().end());
}

const T* filter_data = const_cast<T*>(ort.GetTensorData<T>(input_tensor));
{
// lock to make sure reordering is done only once
std::lock_guard<OrtMutex> lock(provider_->GetMutex());
@@ -499,7 +499,12 @@ class DnnlConv : public DnnlKernel {
.execute(dnnl_engine_gpu_, src, *filter_dst_mem);
}

// Do not use cached weights if running training since the weights change each iteration
#ifndef ENABLE_TRAINING
provider_->SetWeightsMemoryBuffer(mklnode_ptr_->weight_name, filter_dst_mem);
#else
filter_dst_mem_ = filter_dst_mem;
#endif // !ENABLE_TRAINING
}
}
}
@@ -518,6 +523,8 @@ class DnnlConv : public DnnlKernel {
const OrtValue* binput_tensor = ort.KernelContext_GetInput(context, input_index + 2);
bias_data = const_cast<T*>(ort.GetTensorData<T>(binput_tensor));
}
// Do not use cached weights if running training
#ifndef ENABLE_TRAINING
std::shared_ptr<dnnl::memory> filter_dst_mem = provider_->GetWeightsMemoryBuffer(mklnode_ptr_->weight_name);
if (filter_dst_mem == nullptr) {
ReorderWeights(api, context, dnnl_engine_cpu_);
@@ -530,8 +537,19 @@ class DnnlConv : public DnnlKernel {
#ifdef USE_DNNL_GPU_OCL
std::lock_guard<OrtMutex> lock(provider_->GetMutex());
filter_mem_gpu_->set_ocl_mem_object(filter_dst_mem->get_ocl_mem_object());
#endif
#endif // USE_DNNL_GPU_OCL
}
#else // ENABLE_TRAINING
if (!gpu_available_) {
filter_data = static_cast<T*>(filter_dst_mem_->get_data_handle());
filter_mem_->set_data_handle(static_cast<void*>(const_cast<T*>(filter_data)));
} else if (gpu_available_) {
#ifdef USE_DNNL_GPU_OCL
std::lock_guard<OrtMutex> lock(provider_->GetMutex());
filter_mem_gpu_->set_ocl_mem_object(filter_dst_mem_->get_ocl_mem_object());
#endif // USE_DNNL_GPU_OCL
}
#endif // ENABLE_TRAINING

if (bias_data != nullptr) {
bias_mem_->set_data_handle(static_cast<void*>(const_cast<T*>(bias_data)));
@@ -645,7 +663,9 @@ class DnnlConv : public DnnlKernel {
private:
dnnl::memory::desc filter_desc_;
dnnl::memory::format_tag filter_format_;

#ifdef ENABLE_TRAINING
std::shared_ptr<dnnl::memory> filter_dst_mem_;
#endif
std::shared_ptr<dnnl::memory> src_mem_from_;
std::unique_ptr<dnnl::memory> src_mem_to_;

4 changes: 2 additions & 2 deletions onnxruntime/core/providers/dnnl/subgraph/dnnl_convgrad.h
@@ -262,7 +262,7 @@ class DnnlConvGrad : public DnnlKernel {
}

conv_bwd_data_desc_ = onnxruntime::make_unique<dnnl::convolution_backward_data::desc>(
dnnl::convolution_backward_data::desc::desc(
dnnl::convolution_backward_data::desc(
dnnl::algorithm::convolution_direct,
*primitive_dst_md_,
*weights_md_,
@@ -277,7 +277,7 @@ class DnnlConvGrad : public DnnlKernel {
*conv_bwd_data_desc_, engine_to_use, *(conv_fwd_->GetPrimitiveDesc())));

conv_bwd_weights_desc_ = onnxruntime::make_unique<dnnl::convolution_backward_weights::desc>(
dnnl::convolution_backward_weights::desc::desc(
dnnl::convolution_backward_weights::desc(
dnnl::algorithm::convolution_direct,
*src_md_,
*diff_weights_md_,
118 changes: 60 additions & 58 deletions onnxruntime/core/providers/dnnl/subgraph/dnnl_func_kernel.cc
@@ -9,16 +9,17 @@
#include "core/session/onnxruntime_cxx_api.h"
#include "core/providers/dnnl/dnnl_common.h"
#include "core/providers/dnnl/subgraph/dnnl_conv.h"
#include "core/providers/dnnl/subgraph/dnnl_convgrad.h"
#include "core/providers/dnnl/subgraph/dnnl_batchnorm.h"
#include "core/providers/dnnl/subgraph/dnnl_conv_batchnorm.h"
#include "core/providers/dnnl/subgraph/dnnl_activations.h"
#include "core/providers/dnnl/subgraph/dnnl_relugrad.h"
#include "core/providers/dnnl/subgraph/dnnl_pool.h"
#include "core/providers/dnnl/subgraph/dnnl_sum.h"
#include "core/providers/dnnl/subgraph/dnnl_lrn.h"
#ifdef ENABLE_TRAINING
#include "core/providers/dnnl/subgraph/dnnl_convgrad.h"
#include "core/providers/dnnl/subgraph/dnnl_relugrad.h"
#include "core/providers/dnnl/subgraph/dnnl_maxpoolgrad.h"

#endif // ENABLE_TRAINING

namespace onnxruntime {
namespace ort_dnnl {
@@ -78,20 +79,6 @@ class SubgraphPrimitive : public PrimitiveBase {
#ifdef ENABLE_TRAINING
params.provider->fwd_conv_stack.emplace(kernel);
#endif
for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
context_.kernels.push_back(kernel);
} else if (dnnl_node.name == "ConvGrad") {
std::ostringstream os;
os << "ConvGrad-" << dnnl_node.node_index << "-";
std::shared_ptr<DnnlConvGrad<T>> kernel;
kernel = std::make_shared<DnnlConvGrad<T>>(dnnl_node, params.provider, *params.attributes, os.str());

auto fwd_kernel = params.provider->fwd_conv_stack.top();
kernel->AddForwardDnnlKernel(std::dynamic_pointer_cast<DnnlConv<T>>(fwd_kernel));
params.provider->fwd_conv_stack.pop();

for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
@@ -117,25 +104,6 @@ class SubgraphPrimitive : public PrimitiveBase {
// onnxruntime\core\framework\run_options.h
params.provider->SetForwardKernel(dnnl_node.onnx_index, kernel);
#endif
for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
context_.kernels.push_back(kernel);
} else if (dnnl_node.name == "ReluGrad") {
std::ostringstream os;
os << "ReluGrad-" << dnnl_node.node_index << "-";
std::shared_ptr<DnnlReluGrad<T>> kernel;
kernel = std::make_shared<DnnlReluGrad<T>>(dnnl_node, params.provider, *params.attributes, os.str());

// walk the input_nodes for this ReluGrad dnnl_node to find the node index of the Relu input_node
// use that index to obtain the Relu kernel pointer from the fwd_kernal_map.
for (auto iter = dnnl_node.input_nodes.begin(); iter != dnnl_node.input_nodes.end(); ++iter) {
if (iter->op_type == "Relu") {
auto fwd_kernel = params.provider->GetForwardKernal(iter->index);
kernel->AddForwardDnnlKernel(std::dynamic_pointer_cast<DnnlRelu<T>>(fwd_kernel));
}
}

for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
@@ -183,25 +151,9 @@ class SubgraphPrimitive : public PrimitiveBase {
os << "MaxPool-" << dnnl_node.node_index << "-";
std::shared_ptr<DnnlPool<T>> kernel;
kernel = std::make_shared<DnnlPool<T>>(dnnl_node, params.provider, *params.attributes, os.str());
#ifdef ENABLE_TRAINING
params.provider->SetForwardKernel(dnnl_node.onnx_index, kernel);
for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
params.provider->SetForwardKernel(dnnl_node.onnx_index, kernel);
context_.kernels.push_back(kernel);
} else if (dnnl_node.name == "MaxPoolGrad") {
std::ostringstream os;
os << "MaxPoolGrad-" << dnnl_node.node_index << "-";
std::shared_ptr<DnnlMaxPoolGrad<T>> kernel;
kernel = std::make_shared<DnnlMaxPoolGrad<T>>(dnnl_node, params.provider, *params.attributes, os.str());

for (auto iter = dnnl_node.input_nodes.begin(); iter != dnnl_node.input_nodes.end(); ++iter) {
if (iter->op_type == "MaxPool") {
auto fwd_kernel = params.provider->GetForwardKernal(iter->index);
kernel->AddForwardDnnlKernel(std::dynamic_pointer_cast<DnnlPool<T>>(fwd_kernel));
}
}

#endif
for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
@@ -252,6 +204,59 @@ class SubgraphPrimitive : public PrimitiveBase {
}
context_.kernels.push_back(kernel);
}
#ifdef ENABLE_TRAINING
else if (dnnl_node.name == "ConvGrad") {
std::ostringstream os;
os << "ConvGrad-" << dnnl_node.node_index << "-";
std::shared_ptr<DnnlConvGrad<T>> kernel;
kernel = std::make_shared<DnnlConvGrad<T>>(dnnl_node, params.provider, *params.attributes, os.str());

auto fwd_kernel = params.provider->fwd_conv_stack.top();
kernel->AddForwardDnnlKernel(std::dynamic_pointer_cast<DnnlConv<T>>(fwd_kernel));
params.provider->fwd_conv_stack.pop();

for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
context_.kernels.push_back(kernel);
} else if (dnnl_node.name == "ReluGrad") {
std::ostringstream os;
os << "ReluGrad-" << dnnl_node.node_index << "-";
std::shared_ptr<DnnlReluGrad<T>> kernel;
kernel = std::make_shared<DnnlReluGrad<T>>(dnnl_node, params.provider, *params.attributes, os.str());

// walk the input_nodes for this ReluGrad dnnl_node to find the node index of the Relu input_node
// use that index to obtain the Relu kernel pointer from the fwd_kernal_map.
for (auto iter = dnnl_node.input_nodes.begin(); iter != dnnl_node.input_nodes.end(); ++iter) {
if (iter->op_type == "Relu") {
auto fwd_kernel = params.provider->GetForwardKernal(iter->index);
kernel->AddForwardDnnlKernel(std::dynamic_pointer_cast<DnnlRelu<T>>(fwd_kernel));
}
}

for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
context_.kernels.push_back(kernel);
} else if (dnnl_node.name == "MaxPoolGrad") {
std::ostringstream os;
os << "MaxPoolGrad-" << dnnl_node.node_index << "-";
std::shared_ptr<DnnlMaxPoolGrad<T>> kernel;
kernel = std::make_shared<DnnlMaxPoolGrad<T>>(dnnl_node, params.provider, *params.attributes, os.str());

for (auto iter = dnnl_node.input_nodes.begin(); iter != dnnl_node.input_nodes.end(); ++iter) {
if (iter->op_type == "MaxPool") {
auto fwd_kernel = params.provider->GetForwardKernal(iter->index);
kernel->AddForwardDnnlKernel(std::dynamic_pointer_cast<DnnlPool<T>>(fwd_kernel));
}
}

for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
context_.kernels.push_back(kernel);
}
#endif //ENABLE_TRAINING
}
}

@@ -292,10 +297,7 @@ class SubgraphPrimitivePool : public PrimitivePool<T> {
for (auto i = 0; i < params.subgraph->dnnl_nodes[0].num_inputs; i++) {
const OrtValue* input_tensor = ort.KernelContext_GetInput(context, i);

if (i>0)
continue;

auto tensor_info = ort.GetTensorTypeAndShape(input_tensor);
auto tensor_info = ort.GetTensorTypeAndShape(input_tensor);
auto tensor_shape = ort.GetTensorShape(tensor_info);
ort.ReleaseTensorTypeAndShapeInfo(tensor_info);
