
Commit 7aed0ca by Ivy Zhang: RFC-BYOC-DNNL-Integration (#73)
* RFC-BYOC-DNNL-Integration
* update BYOC-DNNL RFC
1 parent ac15f2a

File tree: 4 files changed, +115 -0 lines changed

rfcs/0069-byoc-onednn-integration.md

Lines changed: 115 additions & 0 deletions
- Feature Name: oneDNN Integration via BYOC
- Start Date: 2021-11-29
- RFC PR: [apache/tvm-rfcs#0069](https://github.com/apache/tvm-rfcs/pull/0069)
- GitHub PR: [PR#9671](https://github.com/apache/tvm/pull/9671/commits), [PR#9797](https://github.com/apache/tvm/pull/9797/commits), [PR#9995](https://github.com/apache/tvm/pull/9995/commits), [PR#9996](https://github.com/apache/tvm/pull/9996/commits), [PR#10112](https://github.com/apache/tvm/pull/10112/commits), [PR#10266](https://github.com/apache/tvm/pull/10266/commits), [PR#10421](https://github.com/apache/tvm/pull/10421/commits), [PR#10835](https://github.com/apache/tvm/pull/10835/commits), [PR#10836](https://github.com/apache/tvm/pull/10837/commits)

# Summary
[summary]: #summary

This RFC proposes integrating oneDNN (formerly DNNL) into TVM via the BYOC framework. The drawbacks of the current "Bring DNNL to TVM via DNNL JSON codegen/runtime" flow are analysed and addressed. Performance benefits are observed in comparison with both MXNet-oneDNN and the TVM auto-scheduler on several popular workloads.

# Motivation
[motivation]: #motivation

TVM has demonstrated good performance on many CV models. One of its major advantages is high throughput, which benefits from low runtime overhead. However, tuning is needed for each new shape, and it usually takes a long time.
oneDNN is an open-source, cross-platform performance library of basic building blocks for deep learning applications. The library is optimized for Intel(R) Architecture Processors, Intel(R) Processor Graphics and Xe Architecture graphics. Given a new shape and the environment configuration, oneDNN can infer the optimal data format immediately. To retain TVM's low overhead while achieving the best CPU performance in a short time, we propose integrating oneDNN into TVM via the BYOC framework.
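As a rough illustration of "inferring the optimal data format": oneDNN selects a channel-blocked layout from the tensor shape and the available ISA. The function below is a hypothetical pure-Python sketch of that idea, not oneDNN's real heuristic; the name `query_optimal_conv_layout` and the blocking rule are assumptions for illustration only.

```python
def query_optimal_conv_layout(out_channels: int, has_avx512: bool = True) -> str:
    """Hypothetical stand-in for oneDNN's layout query: channel-blocked
    layouts (e.g. NCHW16c) suit wide SIMD units; fall back to plain NCHW
    when the channel count does not divide evenly into blocks."""
    block = 16 if has_avx512 else 8  # SIMD-width-sized channel block
    if out_channels % block == 0:
        return f"NCHW{block}c"       # blocked layout for vectorized kernels
    return "NCHW"                    # plain layout otherwise

layout = query_optimal_conv_layout(out_channels=64, has_avx512=True)
```

Because the query is a cheap shape-driven lookup rather than a search, no per-shape tuning is required.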
Currently, the BYOC homepage provides a simple example of integrating DNNL (now named oneDNN) into TVM, but its performance falls far short of both the TVM auto-scheduler and MXNet-oneDNN, for the following main reasons:
- Non-optimal layouts were used in DNNL ops.
- Insufficient subgraph partitioning.
- Unnecessary overhead due to memory copies from tensors to DNNL memory buffers, and vice versa.
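The cost in the last point can be sketched in plain numpy (a hypothetical illustration only; the actual fix lives in `dnnl_json_runtime.cc` and operates on raw C++ pointers): aliasing the same underlying buffer instead of copying is what "pointer assignment" achieves.

```python
import numpy as np

# Stand-in for a tensor owned by the runtime (hypothetical sketch).
tensor = np.arange(8, dtype=np.float32)

# Old behaviour: copy the tensor into a separate DNNL memory buffer,
# paying an extra allocation plus a memcpy on every invocation.
dnnl_buf_copied = tensor.copy()

# New behaviour: the DNNL memory handle simply aliases the tensor's
# existing buffer, so no data moves at all.
dnnl_buf_aliased = tensor

tensor[0] = 42.0  # a write is visible through the alias, not the copy
```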

# Guide-level explanation

We have solved the above issues and observed performance benefits compared with both MXNet-oneDNN and the TVM auto-scheduler on several popular workloads (ResNet50_v1b, InceptionV3, VGG11_bn) in several scenarios: latency (Figure 1, single instance with 28 cores and bs=1), throughput (Figure 2, single instance with 28 cores and bs=32) and real-time (Figure 3, 7 instances with 4 cores each and bs=1).

## Note
[note]: #note

Hardware config
- Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz

Compilation config
- g++ 7
- 'llvm -mcpu=cascadelake -model=platinum-8280'
- TVM commit ID: 19b23b9
- MXNet version: v1.8.0
- oneDNN version: v1.7 / v2.4

Runtime config
- 20 warm-up iterations and 100 measured batches

![Figure 1 Latency scenario](assets/latest/latency.png)

![Figure 2 Throughput scenario](assets/latest/throughput.png)

![Figure 3 Real-time scenario](assets/latest/real-time.png)

# Reference-level explanation
This proposal provides a new approach to integrating oneDNN into TVM via the DNNL JSON codegen/runtime, applying the following adjustments to tackle the aforementioned issues:
- Register a new "alter_op_layout" function for DNNL to obtain the optimal layouts for DNNL ops, using a new layout auto-query function in Relay.
- Add a `SimplifyConsecutiveAdd` pattern to the `SimplifyExpr` pass, so that the `FoldConstant` pass is able to fuse the pattern `conv-add-add-relu` (which comes from `conv-bias_add-bn-relu`) into `conv-add-relu`.
- Add a new pattern, "Conv-Add-Sum-ReLU", for fusion.
- Remove the unnecessary memory copies in `dnnl_json_runtime.cc`, using pointer assignment only.

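The effect of the consecutive-add simplification can be illustrated on a toy expression tree. This is a pure-Python sketch, not TVM's actual pass; the node classes and the `simplify_consecutive_add` helper are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass

# Toy expression nodes standing in for Relay exprs (hypothetical sketch).
@dataclass
class Var:
    name: str

@dataclass
class Const:
    value: float

@dataclass
class Add:
    lhs: object
    rhs: object

def simplify_consecutive_add(expr):
    """Rewrite add(add(x, c1), c2) -> add(x, c1 + c2), mimicking how the
    simplification exposes both constants so they fold into one add."""
    if (isinstance(expr, Add) and isinstance(expr.rhs, Const)
            and isinstance(expr.lhs, Add) and isinstance(expr.lhs.rhs, Const)):
        folded = Const(expr.lhs.rhs.value + expr.rhs.value)  # constant fold
        return Add(expr.lhs.lhs, folded)
    return expr

# conv output, +bias (from bias_add), +shift (from a folded BN) -> one add
conv_out = Var("conv2d_out")
expr = Add(Add(conv_out, Const(1.5)), Const(2.5))
simplified = simplify_consecutive_add(expr)
```

After this rewrite the remaining `conv-add-relu` chain matches the fused pattern directly.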
We have enhanced and updated the support. The following ops, post-op fusions and datatypes have been enhanced or added, and several CV models have been verified with the new oneDNN backend. We are going to cover more ops, datatypes and models (denoted with *) in the next step.

## Ops
- nn.conv2d
- depthwise conv
- nn.conv3d
- nn.dense
- nn.relu
- nn.max_pool2d
- nn.avg_pool2d
- matrix multiplication *
- nn.conv1d *

## Post-Op Fusions
- conv2d_bias_sum_relu
- conv2d_bias_relu
- conv2d_bias
- dense_bias_relu
- dense_bias
- Eltwise post-op
- Depthwise *
- Binary *
- PReLU *

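A minimal sketch of how such post-op fusion patterns can be matched over a linear op sequence (hypothetical and far simpler than TVM's DFPattern machinery; the table and `match_fusion` helper are illustrative names only):

```python
# Fused-pattern table: composite name -> op sequence it covers.
FUSION_PATTERNS = [
    ("conv2d_bias_sum_relu", ["conv2d", "add", "add", "relu"]),
    ("conv2d_bias_relu",     ["conv2d", "add", "relu"]),
    ("conv2d_bias",          ["conv2d", "add"]),
    ("dense_bias_relu",      ["dense", "add", "relu"]),
    ("dense_bias",           ["dense", "add"]),
]

def match_fusion(op_seq):
    """Return the longest registered pattern that prefixes op_seq, or None.
    Longest-first ordering makes conv2d_bias_relu win over conv2d_bias."""
    for name, pattern in sorted(FUSION_PATTERNS, key=lambda p: -len(p[1])):
        if op_seq[:len(pattern)] == pattern:
            return name
    return None

fused = match_fusion(["conv2d", "add", "relu", "max_pool2d"])
```

Matching the longest pattern first is what lets the partitioner hand oneDNN the largest fused kernel it supports.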
## Datatype
- Float32
- BF16
- INT8 *

## Verified Models (from GluonCV)
- ResNet 18, 34, 50, 101, 152
- VGG 11, 13, 16, 19; VGG_BN 11, 13, 16, 19
- InceptionV3
- MobileNet *
- BERT *

# Drawbacks
[drawbacks]: #drawbacks

Currently, this integration has only been tested on Intel CPUs.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

There are two ways to integrate oneDNN into TVM: "JSON Codegen" and "C Source Codegen". This RFC is developed with "JSON Codegen".

# Prior art
[prior-art]: #prior-art

This RFC aims to enhance the existing ["Bring DNNL to TVM via DNNL JSON codegen/runtime"](https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm) to take advantage of both TVM and oneDNN.

The issues behind the poor performance of the existing BYOC DNNL integration are listed in [Motivation]. They have been solved in this RFC.

# Unresolved questions

# Future possibilities
[future-possibilities]: #future-possibilities

More ops, post-op fusions, datatypes and workloads are to be supported in the next step.

rfcs/assets/latest/latency.png (158 KB)

rfcs/assets/latest/real-time.png (162 KB)

rfcs/assets/latest/throughput.png (165 KB)
