Dnnl training #6045

Merged: 13 commits, Jan 30, 2021
Changes from 1 commit
Fixed misc issues when testing training code with dnnl provider
* fix conv_grad DNNL tests with dilation so they run on the DNNL execution provider

* update mnist training sample to accept convolution type models

  convolution models require the input shape to be {1, 28, 28}
  instead of the flat {784} image that is used for the gemm models

  models that require the different shape are enabled by adding
  `--model_type conv` to the command line when running the mnist sample,
  as in the sketch after this bullet.
  (while testing a workaround was used, see #4762)
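A minimal sketch of the shape selection described above (hypothetical helper name; the sample's actual argument handling may differ):

```
// Hypothetical sketch, not the sample's actual code: maps the
// --model_type flag to the input shape the model expects.
#include <cstdint>
#include <string>
#include <vector>

std::vector<int64_t> MnistInputShape(const std::string& model_type) {
  if (model_type == "conv") {
    return {1, 28, 28};  // convolution models take image-shaped input
  }
  return {784};          // gemm models take the flattened 28x28 image
}
```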

* Disable weight caching in dnnl conv operator when using training

  When training we cannot use cached weights because the weights
  are updated each run (see the sketch after this bullet). This
  re-enables the DNNL Conv and ConvGrad ops. The weight caching was
  the source of the error from Conv when training.
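A simplified sketch of the guard described above; `Memory` and the member names are stand-ins, and the real change is in the dnnl_conv.h diff further down:

```
// Simplified sketch of the ENABLE_TRAINING guard around weight caching.
// Memory stands in for dnnl::memory; the cache member stands in for the
// provider-level SetWeightsMemoryBuffer() call.
#include <memory>

struct Memory {};  // stand-in for dnnl::memory

class ConvKernelSketch {
 public:
  void StoreReorderedWeights(std::shared_ptr<Memory> reordered) {
#ifndef ENABLE_TRAINING
    // inference: cache the reordered weights so the reorder runs only once
    cached_weights_ = reordered;
#else
    // training: keep the weights local to this run; their values change
    // every iteration, so a cached copy would go stale
    filter_dst_mem_ = reordered;
#endif
  }

 private:
  std::shared_ptr<Memory> cached_weights_;  // used by the inference path
  std::shared_ptr<Memory> filter_dst_mem_;  // used by the training path
};
```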

* Fix issues found when building grad ops on Linux
  * The dnnl_convgrad code was over-qualifying constructor calls with the
    scope operator, causing a compilation problem on GCC (illustrated in
    the sketch after this list).
  * The dnnl_maxpoolgrad code had a logic error: it was comparing with
    the source description when it should have been comparing with the
    destination description.
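A stand-alone illustration of the over-qualification issue (not the actual ONNX Runtime code): MSVC accepts calling a constructor through the scope operator, but GCC/Clang reject it, which is why the build only broke on Linux.

```
// Illustration only: GCC/Clang reject naming a constructor with the scope
// operator, while MSVC historically accepts it.
struct desc {
  explicit desc(int alg) : alg_(alg) {}
  int alg_;
};

int main() {
  // desc bad = desc::desc(1);  // fails on GCC/Clang: constructor cannot be called directly
  desc good = desc(1);          // portable form: use the class name, not Class::Class
  return good.alg_;
}
```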

* Update BUILD.md so it shows DNNL for training
  * Updated the table of contents. Since the same providers are listed
    twice, once for Inferencing and again for Training, an HTML anchor
    was added to distinguish the second header from the first for the TOC.

* Fix build failure when not using the --enable_training build option

* reorganize the gradient operators so they are grouped together

* Fix issues found when running onnx_backend_test_series.py

* Pooling code supports 2 outputs only when built with --enable_training

* Address code review feedback
  * class member variables end in underscore_
  * use dst instead of dist to match the pattern used elsewhere in the DNNL code.

* Remove the workaround that was introduced to handle problems running
  convolution-based training models. See issue #4762

Signed-off-by: George Nash <george.nash@intel.com>
georgen117 committed Jan 5, 2021
commit bd7b6728f23fd9d3df53a2c8ba6f091296d2983d
55 changes: 50 additions & 5 deletions BUILD.md
@@ -37,6 +37,11 @@
* [iOS](#iOS)

**[Training](#Training)**
* [Baseline CPU](#baseline-cpu)
* Training Enabled Execution Providers
* [NVIDIA CUDA](#cuda-training)
* [ROCM](#ROCM)
* [Intel DNNL/MKL-ML](#dnnl-training)

***
# Inferencing
@@ -133,6 +138,7 @@ GCC 4.x and below are not supported.
|**Use OpenMP**|--use_openmp|OpenMP will parallelize some of the code for potential performance improvements. This is not recommended for running on single threads.|
|**Build using parallel processing**|--parallel|This is strongly recommended to speed up the build.|
|**Build Shared Library**|--build_shared_lib||
|**Enable Training support**|--enable_training||

### APIs and Language Bindings
|API|Command|Additional details|
@@ -1178,8 +1184,32 @@ Dockerfile instructions are available [here](./dockerfiles#migraphx)
***

# Training
## CUDA
### Prerequisites

## Baseline CPU

### Build Instructions
To build ORT with training support, add the `--enable_training` build option.

All other build options are the same for inferencing as they are for training.

#### Windows
```
.\build.bat --config RelWithDebInfo --build_shared_lib --parallel --enable_training
```

The default Windows CMake Generator is Visual Studio 2017, but you can also use the newer Visual Studio 2019 by passing
`--cmake_generator "Visual Studio 16 2019"` to `.\build.bat`


#### Linux/macOS
```
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --enable_training
```

## Training Enabled Execution Providers

### <a id="cuda-training">CUDA</a>
#### Prerequisites

The default NVIDIA GPU build requires CUDA runtime libraries installed on the system:

@@ -1191,7 +1221,7 @@ The default NVIDIA GPU build requires CUDA runtime libraries installed on the sy

These dependency versions should reflect what is in [Dockerfile.training](./dockerfiles/Dockerfile.training).

### Build instructions
#### Build instructions

1. Checkout this code repo with `git clone https://github.com/microsoft/onnxruntime`

@@ -1213,8 +1243,8 @@ These dependency versions should reflect what is in [Dockerfile.training](./dock

This produces the .whl file in `./build/Linux/RelWithDebInfo/dist` for ONNX Runtime Training.

## ROCM
### Prerequisites
### ROCM
#### Prerequisites

The default AMD GPU build requires ROCM software toolkit installed on the system:

@@ -1234,3 +1264,18 @@ These dependency versions should reflect what is in [Dockerfile.training](./dock
* Run `./build.sh --config RelWithDebInfo --enable_training --build_wheel --use_rocm --rocm_home /opt/rocm --nccl_home /opt/rocm --mpi_home <location for openmpi>`

This produces the .whl file in `./build/Linux/RelWithDebInfo/dist` for ONNX Runtime Training.

### <a id="dnnl-training"> DNNL and MKLML </a>

#### Build Instructions
##### Linux

`./build.sh --enable_training --use_dnnl`

##### Windows

`.\build.bat --enable_training --use_dnnl`

Add `--build_wheel` to build the ONNX Runtime wheel.

This will produce a .whl file in `build/Linux/RelWithDebInfo/dist` for ONNX Runtime Training.
14 changes: 9 additions & 5 deletions onnxruntime/core/providers/dnnl/dnnl_execution_provider.h
@@ -92,12 +92,12 @@ class DNNLExecutionProvider : public IExecutionProvider {
// Note if a DnnlKernel already exists this will replace the existing kernel with the
// new kernel. This was done so the latest kernel is always placed in the map.
void SetForwardKernel(onnxruntime::NodeIndex key, std::shared_ptr<ort_dnnl::DnnlKernel> kernel) {
fwd_kernal_map[key] = kernel;
fwd_kernal_map_[key] = kernel;
}

// Fetch the kernel using the NodeIndex
std::shared_ptr<ort_dnnl::DnnlKernel> GetForwardKernal(onnxruntime::NodeIndex key) {
return fwd_kernal_map.at(key);
return fwd_kernal_map_.at(key);
}

std::stack<std::shared_ptr<ort_dnnl::DnnlKernel>> fwd_conv_stack;
@@ -119,7 +119,7 @@ class DNNLExecutionProvider : public IExecutionProvider {
// running in training mode. The backward kernels need access to the forward kernels, typically
// to obtain the forward primitive description, but it may be needed for other items like
// accessing workspace memory.
std::map<onnxruntime::NodeIndex, std::shared_ptr<ort_dnnl::DnnlKernel>> fwd_kernal_map;
std::map<onnxruntime::NodeIndex, std::shared_ptr<ort_dnnl::DnnlKernel>> fwd_kernal_map_;
#endif
// SUBGRAPH
private:
@@ -164,9 +164,13 @@ class DNNLExecutionProvider : public IExecutionProvider {
if (node_inputs[0]->Shape() != nullptr && node_inputs[0]->Shape()->dim_size() < 3) {
supported = false;
}

#ifdef ENABLE_TRAINING
if (node->OutputDefs().size() > 2)
supported = false;
#else
if (node->OutputDefs().size() > 1)
supported = false;
#endif

}
return supported;
@@ -194,7 +198,7 @@ class DNNLExecutionProvider : public IExecutionProvider {

private:
// supported Dnnl Operators
std::set<std::string> dnnl_ops_ = {/*"Conv", "ConvGrad",*/ "BatchNormalization", "Relu", "ReluGrad", "Sum",
std::set<std::string> dnnl_ops_ = {"Conv", "ConvGrad", "BatchNormalization", "Relu", "ReluGrad", "Sum",
"AveragePool", "GlobalMaxPool", "GlobalAveragePool", "MaxPool", "MaxPoolGrad", "LRN"};

mutable std::unordered_map<std::string, std::shared_ptr<ort_dnnl::Subgraph>> mkl_subgraphs_;
26 changes: 23 additions & 3 deletions onnxruntime/core/providers/dnnl/subgraph/dnnl_conv.h
@@ -461,7 +461,6 @@ class DnnlConv : public DnnlKernel {
auto xdim = tensor_shape.size();

TensorShape W(xshape, xdim);
const T* filter_data = const_cast<T*>(ort.GetTensorData<T>(input_tensor));

const int group_mkl = static_cast<int>(group_);

@@ -474,6 +473,7 @@ class DnnlConv : public DnnlKernel {
filter_dims_mkl.insert(filter_dims_mkl.end(), W.GetDims().begin() + 1, W.GetDims().end());
}

const T* filter_data = const_cast<T*>(ort.GetTensorData<T>(input_tensor));
{
// lock to make sure reordering is done only once
std::lock_guard<OrtMutex> lock(provider_->GetMutex());
@@ -499,7 +499,12 @@ class DnnlConv : public DnnlKernel {
.execute(dnnl_engine_gpu_, src, *filter_dst_mem);
}

// Do not use cached weights if running training since the weights change each iteration
#ifndef ENABLE_TRAINING
provider_->SetWeightsMemoryBuffer(mklnode_ptr_->weight_name, filter_dst_mem);
#else
filter_dst_mem_ = filter_dst_mem;
#endif // !ENABLE_TRAINING
}
}
}
@@ -518,6 +523,8 @@ class DnnlConv : public DnnlKernel {
const OrtValue* binput_tensor = ort.KernelContext_GetInput(context, input_index + 2);
bias_data = const_cast<T*>(ort.GetTensorData<T>(binput_tensor));
}
// Do not use cached weights if running training
#ifndef ENABLE_TRAINING
std::shared_ptr<dnnl::memory> filter_dst_mem = provider_->GetWeightsMemoryBuffer(mklnode_ptr_->weight_name);
if (filter_dst_mem == nullptr) {
ReorderWeights(api, context, dnnl_engine_cpu_);
@@ -530,8 +537,19 @@ class DnnlConv : public DnnlKernel {
#ifdef USE_DNNL_GPU_OCL
std::lock_guard<OrtMutex> lock(provider_->GetMutex());
filter_mem_gpu_->set_ocl_mem_object(filter_dst_mem->get_ocl_mem_object());
#endif
#endif // USE_DNNL_GPU_OCL
}
#else // ENABLE_TRAINING
if (!gpu_available_) {
filter_data = static_cast<T*>(filter_dst_mem_->get_data_handle());
filter_mem_->set_data_handle(static_cast<void*>(const_cast<T*>(filter_data)));
} else if (gpu_available_) {
#ifdef USE_DNNL_GPU_OCL
std::lock_guard<OrtMutex> lock(provider_->GetMutex());
filter_mem_gpu_->set_ocl_mem_object(filter_dst_mem_->get_ocl_mem_object());
#endif // USE_DNNL_GPU_OCL
}
#endif // ENABLE_TRAINING

if (bias_data != nullptr) {
bias_mem_->set_data_handle(static_cast<void*>(const_cast<T*>(bias_data)));
@@ -645,7 +663,9 @@ class DnnlConv : public DnnlKernel {
private:
dnnl::memory::desc filter_desc_;
dnnl::memory::format_tag filter_format_;

#ifdef ENABLE_TRAINING
std::shared_ptr<dnnl::memory> filter_dst_mem_;
#endif
std::shared_ptr<dnnl::memory> src_mem_from_;
std::unique_ptr<dnnl::memory> src_mem_to_;

4 changes: 2 additions & 2 deletions onnxruntime/core/providers/dnnl/subgraph/dnnl_convgrad.h
@@ -262,7 +262,7 @@ class DnnlConvGrad : public DnnlKernel {
}

conv_bwd_data_desc_ = onnxruntime::make_unique<dnnl::convolution_backward_data::desc>(
dnnl::convolution_backward_data::desc::desc(
dnnl::convolution_backward_data::desc(
dnnl::algorithm::convolution_direct,
*primitive_dst_md_,
*weights_md_,
@@ -277,7 +277,7 @@ class DnnlConvGrad : public DnnlKernel {
*conv_bwd_data_desc_, engine_to_use, *(conv_fwd_->GetPrimitiveDesc())));

conv_bwd_weights_desc_ = onnxruntime::make_unique<dnnl::convolution_backward_weights::desc>(
dnnl::convolution_backward_weights::desc::desc(
dnnl::convolution_backward_weights::desc(
dnnl::algorithm::convolution_direct,
*src_md_,
*diff_weights_md_,
118 changes: 60 additions & 58 deletions onnxruntime/core/providers/dnnl/subgraph/dnnl_func_kernel.cc
@@ -9,16 +9,17 @@
#include "core/session/onnxruntime_cxx_api.h"
#include "core/providers/dnnl/dnnl_common.h"
#include "core/providers/dnnl/subgraph/dnnl_conv.h"
#include "core/providers/dnnl/subgraph/dnnl_convgrad.h"
#include "core/providers/dnnl/subgraph/dnnl_batchnorm.h"
#include "core/providers/dnnl/subgraph/dnnl_conv_batchnorm.h"
#include "core/providers/dnnl/subgraph/dnnl_activations.h"
#include "core/providers/dnnl/subgraph/dnnl_relugrad.h"
#include "core/providers/dnnl/subgraph/dnnl_pool.h"
#include "core/providers/dnnl/subgraph/dnnl_sum.h"
#include "core/providers/dnnl/subgraph/dnnl_lrn.h"
#ifdef ENABLE_TRAINING
#include "core/providers/dnnl/subgraph/dnnl_convgrad.h"
#include "core/providers/dnnl/subgraph/dnnl_relugrad.h"
#include "core/providers/dnnl/subgraph/dnnl_maxpoolgrad.h"

#endif // ENABLE_TRAINING

namespace onnxruntime {
namespace ort_dnnl {
@@ -78,20 +79,6 @@ class SubgraphPrimitive : public PrimitiveBase {
#ifdef ENABLE_TRAINING
params.provider->fwd_conv_stack.emplace(kernel);
#endif
for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
context_.kernels.push_back(kernel);
} else if (dnnl_node.name == "ConvGrad") {
std::ostringstream os;
os << "ConvGrad-" << dnnl_node.node_index << "-";
std::shared_ptr<DnnlConvGrad<T>> kernel;
kernel = std::make_shared<DnnlConvGrad<T>>(dnnl_node, params.provider, *params.attributes, os.str());

auto fwd_kernel = params.provider->fwd_conv_stack.top();
kernel->AddForwardDnnlKernel(std::dynamic_pointer_cast<DnnlConv<T>>(fwd_kernel));
params.provider->fwd_conv_stack.pop();

for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
@@ -117,25 +104,6 @@ class SubgraphPrimitive : public PrimitiveBase {
// onnxruntime\core\framework\run_options.h
params.provider->SetForwardKernel(dnnl_node.onnx_index, kernel);
#endif
for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
context_.kernels.push_back(kernel);
} else if (dnnl_node.name == "ReluGrad") {
std::ostringstream os;
os << "ReluGrad-" << dnnl_node.node_index << "-";
std::shared_ptr<DnnlReluGrad<T>> kernel;
kernel = std::make_shared<DnnlReluGrad<T>>(dnnl_node, params.provider, *params.attributes, os.str());

// walk the input_nodes for this ReluGrad dnnl_node to find the node index of the Relu input_node
// use that index to obtain the Relu kernel pointer from the fwd_kernal_map.
for (auto iter = dnnl_node.input_nodes.begin(); iter != dnnl_node.input_nodes.end(); ++iter) {
if (iter->op_type == "Relu") {
auto fwd_kernel = params.provider->GetForwardKernal(iter->index);
kernel->AddForwardDnnlKernel(std::dynamic_pointer_cast<DnnlRelu<T>>(fwd_kernel));
}
}

for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
@@ -183,25 +151,9 @@ class SubgraphPrimitive : public PrimitiveBase {
os << "MaxPool-" << dnnl_node.node_index << "-";
std::shared_ptr<DnnlPool<T>> kernel;
kernel = std::make_shared<DnnlPool<T>>(dnnl_node, params.provider, *params.attributes, os.str());
#ifdef ENABLE_TRAINING
params.provider->SetForwardKernel(dnnl_node.onnx_index, kernel);
for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
params.provider->SetForwardKernel(dnnl_node.onnx_index, kernel);
context_.kernels.push_back(kernel);
} else if (dnnl_node.name == "MaxPoolGrad") {
std::ostringstream os;
os << "MaxPoolGrad-" << dnnl_node.node_index << "-";
std::shared_ptr<DnnlMaxPoolGrad<T>> kernel;
kernel = std::make_shared<DnnlMaxPoolGrad<T>>(dnnl_node, params.provider, *params.attributes, os.str());

for (auto iter = dnnl_node.input_nodes.begin(); iter != dnnl_node.input_nodes.end(); ++iter) {
if (iter->op_type == "MaxPool") {
auto fwd_kernel = params.provider->GetForwardKernal(iter->index);
kernel->AddForwardDnnlKernel(std::dynamic_pointer_cast<DnnlPool<T>>(fwd_kernel));
}
}

#endif
for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
@@ -252,6 +204,59 @@ class SubgraphPrimitive : public PrimitiveBase {
}
context_.kernels.push_back(kernel);
}
#ifdef ENABLE_TRAINING
else if (dnnl_node.name == "ConvGrad") {
std::ostringstream os;
os << "ConvGrad-" << dnnl_node.node_index << "-";
std::shared_ptr<DnnlConvGrad<T>> kernel;
kernel = std::make_shared<DnnlConvGrad<T>>(dnnl_node, params.provider, *params.attributes, os.str());

auto fwd_kernel = params.provider->fwd_conv_stack.top();
kernel->AddForwardDnnlKernel(std::dynamic_pointer_cast<DnnlConv<T>>(fwd_kernel));
params.provider->fwd_conv_stack.pop();

for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
context_.kernels.push_back(kernel);
} else if (dnnl_node.name == "ReluGrad") {
std::ostringstream os;
os << "ReluGrad-" << dnnl_node.node_index << "-";
std::shared_ptr<DnnlReluGrad<T>> kernel;
kernel = std::make_shared<DnnlReluGrad<T>>(dnnl_node, params.provider, *params.attributes, os.str());

// walk the input_nodes for this ReluGrad dnnl_node to find the node index of the Relu input_node
// use that index to obtain the Relu kernel pointer from the fwd_kernal_map.
for (auto iter = dnnl_node.input_nodes.begin(); iter != dnnl_node.input_nodes.end(); ++iter) {
if (iter->op_type == "Relu") {
auto fwd_kernel = params.provider->GetForwardKernal(iter->index);
kernel->AddForwardDnnlKernel(std::dynamic_pointer_cast<DnnlRelu<T>>(fwd_kernel));
}
}

for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
context_.kernels.push_back(kernel);
} else if (dnnl_node.name == "MaxPoolGrad") {
std::ostringstream os;
os << "MaxPoolGrad-" << dnnl_node.node_index << "-";
std::shared_ptr<DnnlMaxPoolGrad<T>> kernel;
kernel = std::make_shared<DnnlMaxPoolGrad<T>>(dnnl_node, params.provider, *params.attributes, os.str());

for (auto iter = dnnl_node.input_nodes.begin(); iter != dnnl_node.input_nodes.end(); ++iter) {
if (iter->op_type == "MaxPool") {
auto fwd_kernel = params.provider->GetForwardKernal(iter->index);
kernel->AddForwardDnnlKernel(std::dynamic_pointer_cast<DnnlPool<T>>(fwd_kernel));
}
}

for (auto index : dnnl_node.parent_nodes) {
kernel->parents_.push_back(context_.kernels[index]);
}
context_.kernels.push_back(kernel);
}
#endif //ENABLE_TRAINING
}
}

@@ -292,10 +297,7 @@ class SubgraphPrimitivePool : public PrimitivePool<T> {
for (auto i = 0; i < params.subgraph->dnnl_nodes[0].num_inputs; i++) {
const OrtValue* input_tensor = ort.KernelContext_GetInput(context, i);

if (i>0)
continue;

auto tensor_info = ort.GetTensorTypeAndShape(input_tensor);
auto tensor_info = ort.GetTensorTypeAndShape(input_tensor);
auto tensor_shape = ort.GetTensorShape(tensor_info);
ort.ReleaseTensorTypeAndShapeInfo(tensor_info);
