oneDNN v3.6 Release Notes
=========================

# Performance Optimizations

## Intel Architecture Processors

* Improved performance for 4th generation Intel Xeon Scalable processors
  (formerly Sapphire Rapids).
* Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
* Improved performance of group normalization primitive.
* Improved bf16 matmul performance with int4 compressed weights on processors
  with Intel AMX instruction set support.
* Improved performance of `fp8` matmul, pooling, and eltwise primitives on
  processors with Intel AMX instruction set support.
* Improved `fp32` RNN primitive performance on processors with Intel AVX2
  instruction set support.
* Improved performance of the following subgraphs with Graph API:
  - `convolution` and `binary` operation fusions with better layout selection
    in Graph API.
  - `fp8` `convolution` and `unary` or `binary` on processors with Intel AMX
    instruction set support.
  - Scaled Dot Product Attention (SDPA) without scale,
    Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns.
  - `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output
    and zero-points.

## Intel Graphics Products

* Improved performance for the Intel Data Center GPU Max Series (formerly
  Ponte Vecchio).
* Introduced broad production-quality optimizations for Intel Arc Graphics for
  Intel Core Ultra Processors (Series 2) (formerly Lunar Lake).
* Introduced broad production-quality optimizations for future discrete GPUs
  based on Xe2 architecture (code name Battlemage).
* Introduced support for Intel Arc Graphics for future Intel Core Ultra
  Processors (code name Arrow Lake-H).
* Improved performance of `fp8_e5m2` primitives on Intel Data Center GPU Max
  Series (formerly Ponte Vecchio).
* Improved matmul and inner product primitives performance for shapes relevant
  to large language models (LLMs) on GPUs with Intel XMX support.
* Improved `int8` convolution performance with weight zero points.
* Reduced primitive creation time for softmax, layer normalization, and concat
  primitives via kernel reuse.
* Improved performance of the following subgraphs with Graph API:
  - SDPA without scale, MQA, and GQA patterns. `f16` variants of these
    patterns significantly benefit from Intel(R) Xe Matrix Extensions (Intel(R)
    XMX) support.
  - `fp8` `convolution` and `unary` or `binary` on Intel Data Center GPU Max
    Series.
  - `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and
    zero-points.

## AArch64-based Processors

* Improved `fp32` convolution backpropagation performance on processors with
  SVE support.
* Improved reorder performance for blocked format on processors with
  SVE support.
* Improved `bf16` softmax performance on processors with SVE support.
* Improved batch normalization performance on processors with SVE support.
* Improved matmul performance on processors with SVE support.
* Improved `fp16` convolution performance with Arm Compute Library (ACL).
* Improved matmul performance with ACL.
* Switched matmul and convolution implementations with ACL to the stateless
  API, significantly improving primitive creation time and increasing caching
  efficiency and performance for these operators.

# Functionality

* Introduced [generic GPU] support. This implementation relies on portable
  SYCL kernels and can be used as a starting point to enable new devices in
  oneDNN.
* Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL-based
  implementations.
* Enabled support for `int8` activations with grouped scales and `int8`
  or `int4` compressed weights in the matmul primitive. This functionality
  is implemented on Intel GPUs; see the attribute setup sketch after this list.
* Introduced support for stochastic rounding for the `fp8` data type.
* **[experimental]** Extended [microkernel API]:
  - Introduced `int8` quantization support.
  - Extended transform microkernel with transposition support and support for
    arbitrary strides.
  - Introduced verbose diagnostics support.
* **[experimental]** Extended [sparse API]:
  - Introduced support for sparse memory with coordinate (COO) storage format.
  - Extended matmul primitive to work with sparse memory in COO format. This
    functionality is implemented on CPUs and Intel GPUs; a usage sketch
    follows this list.
* Introduced `int8` support in the eltwise primitive with the `clip` algorithm.
  This functionality is implemented on CPUs; see the sketch after this list.
* Graph API:
  - Introduced `GroupNorm` operation and fusions in Graph API.
  - Introduced support for standalone `StaticReshape` and `StaticTranspose`
    operations.

[generic GPU]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/src/gpu/generic/sycl/README.md
[microkernel API]: https://oneapi-src.github.io/oneDNN/v3.6/ukernels.html
[sparse API]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-sparse
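
The grouped-scales matmul item maps onto primitive attributes. The following is
a minimal sketch of that setup, assuming illustrative shapes, group sizes,
scale masks, and scale/zero-point data types; the combinations actually
supported are listed in the matmul primitive documentation, and primitive
creation fails if a configuration is not implemented.

```cpp
// Sketch: int8 activations with grouped scales, int4 compressed weights.
// Shapes, group size G, masks, and data types below are assumptions.
#include "dnnl.hpp"

using namespace dnnl;

int main() {
    engine eng(engine::kind::gpu, 0); // feature is implemented on Intel GPUs
    const memory::dim M = 32, K = 4096, N = 4096, G = 128; // G = group size

    memory::desc src_md({M, K}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s4, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f16, memory::format_tag::ab);

    primitive_attr attr;
    // Grouped scales for the int8 activations: groups of size G along K.
    attr.set_scales(DNNL_ARG_SRC, /*mask=*/(1 << 0) + (1 << 1),
            /*groups=*/{1, G}, memory::data_type::f16);
    // Grouped scales and zero-points for the int4 compressed weights.
    attr.set_scales(DNNL_ARG_WEIGHTS, /*mask=*/(1 << 0) + (1 << 1),
            /*groups=*/{G, 1}, memory::data_type::f16);
    attr.set_zero_points(DNNL_ARG_WEIGHTS, /*mask=*/(1 << 0) + (1 << 1),
            /*groups=*/{G, 1}, memory::data_type::s8);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    matmul prim(pd);
    // Execution with dnnl::stream and memory objects is omitted for brevity.
    return 0;
}
```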
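
For the sparse COO item, the sketch below assumes a `memory::desc::coo(...)`
factory analogous to the existing CSR helper and `s32` indices; the exact
signature, supported data types, and which matmul operand may be sparse are
described in the experimental [sparse API] documentation. The feature requires
a build with `ONEDNN_EXPERIMENTAL_SPARSE` enabled.

```cpp
// Sketch: matmul with COO-encoded sparse weights (experimental sparse API).
#include "dnnl.hpp"

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    const memory::dim M = 1024, K = 1024, N = 512, nnz = 4096; // assumed sizes

    // Dense activations and destination.
    memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // Sparse weights encoded in COO format: values plus coordinate indices.
    // The coo() factory here mirrors the existing csr() helper and is an
    // assumption; consult the sparse API docs for the exact signature.
    memory::desc wei_md = memory::desc::coo(
            {K, N}, memory::data_type::f32, nnz, memory::data_type::s32);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md);
    matmul prim(pd);
    // Sparse memory objects carry several buffers (values and indices); their
    // creation and the execute() call are omitted here.
    return 0;
}
```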
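
The `int8` clip item uses the standard eltwise API, where `alpha` and `beta`
act as the lower and upper clip bounds. A minimal CPU sketch, with an arbitrary
tensor shape and clip range chosen for illustration:

```cpp
// Sketch: int8 eltwise with the clip algorithm (CPU).
#include "dnnl.hpp"

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0); // int8 clip support is implemented on CPUs
    stream strm(eng);

    // A small int8 tensor clipped to the [-8, 8] range (shape is illustrative).
    memory::desc md({1, 64}, memory::data_type::s8, memory::format_tag::ab);
    memory src_mem(md, eng), dst_mem(md, eng);

    // alpha and beta are the lower and upper clip bounds.
    eltwise_forward::primitive_desc pd(eng, prop_kind::forward_inference,
            algorithm::eltwise_clip, md, md, /*alpha=*/-8.f, /*beta=*/8.f);
    eltwise_forward(pd).execute(
            strm, {{DNNL_ARG_SRC, src_mem}, {DNNL_ARG_DST, dst_mem}});
    strm.wait();
    return 0;
}
```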

# Usability

* Added [examples][Graph API examples] for implementing SDPA, MQA, and GQA
  patterns with Graph API.
* Added [an example][deconvolution example] for the deconvolution primitive.
* Added examples for [Vanilla RNN][Vanilla RNN example] and
  [LBR GRU][LBR GRU example] RNN cells.
* Introduced support for Intel DPC++/C++ Compiler 2025.0.
* Introduced interoperability with [SYCL Graph] record/replay mode.
* Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
* **[experimental]** Introduced a [logging mechanism][spdlog] based on the
  spdlog library.
* Introduced support for the `ONEDNN_ENABLE_WORKLOAD` build knob for Graph API.
* Improved performance of the `get_partitions()` function in Graph API.

[Graph API examples]: https://github.com/oneapi-src/oneDNN/tree/rls-v3.6/examples/graph
[deconvolution example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/deconvolution.cpp
[Vanilla RNN example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/vanilla_rnn.cpp
[LBR GRU example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/lbr_gru.cpp
[SYCL Graph]: https://codeplay.com/portal/blogs/2024/01/22/sycl-graphs
[spdlog]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-logging

# Validation

* Introduced protection from out-of-memory scenarios in the benchdnn Graph API
  driver.

# Breaking Changes

* The experimental [microkernel API] in this release is not compatible with
  [the version available][microkernel API v3.5] in oneDNN v3.5.
* Updated the minimum supported ACL version to 24.08.1 (was 24.04).

[microkernel API v3.5]: https://oneapi-src.github.io/oneDNN/v3.5/ukernels.html

# Thanks to these Contributors

This release contains contributions from the [project core team] as well as
Abdel @quickwritereader, Adam Jackson @nwnk, Aleksandr Voron @alvoron,
Alexey Makarevich @amakarev, Annop Wongwathanarat @annop-w, Daniel Kuts
@apach301, @deepeshfujitsu, Fadi Arafeh @fadara01, Fritz Heckel @fwph,
Gorokhov Dmitriy @dmitry-gorokhov, Deeksha Kasture @kasturedeeksha,
Kentaro Kawakami @kawakami-k, Marek Michalowski @michalowski-arm,
@matthias-bonne, @Menooker, Michael Froelich @MichaelFroelich,
Nicolas Miller @npmiller, Nikhil Sharma @nikhilfujitsu, @nishith-fujitsu,
Permanence AI Coder @Permanence-AI-Coder, Radu Salavat @Radu2k, Renato Barros
Arantes @renato-arantes, Robert Cohn @rscohn2, Robert Hardwick @robert-hardwick,
Ryo Suzuki @Ryo-not-rio, Shreyas-fuj @Shreyas-fuj, Shu Chen @shu1chen,
Siddhartha Menon @Sqvid, Song Jiaming @Litchilitchy, Vladimir Paramuzov
@vladimir-paramuzov, Yifei Zhang @yifeizh2. We would also like to thank everyone
who asked questions and reported issues.

[project core team]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/MAINTAINERS.md