doc: added release notes for oneDNN v3.6
vpirogov committed Sep 23, 2024
oneDNN v3.6 Release Notes
=========================

# Performance Optimizations

## Intel Architecture Processors

* Improved performance for 4th generation Intel Xeon Scalable processors
(formerly Sapphire Rapids).
* Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
* Improved performance of group normalization primitive.
* Improved bf16 matmul performance with int4 compressed weights on processors
with Intel AMX instruction set support.
* Improved performance of `fp8` matmul, pooling, and eltwise primitives on
processors with Intel AMX instruction set support.
* Improved `fp32` RNN primitive performance on processors with Intel AVX2
instruction set support.
* Improved performance of the following subgraphs with Graph API:
- `convolution` and `binary` operation fusions with better layout selection.
- `fp8` `convolution` and `unary` or `binary` on processors with Intel AMX
instruction set support.
- Scaled Dot Product Attention (SDPA) without scale,
Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns.
- `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output
and zero-points.
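
The int4 compressed-weights path above amounts to storing quantized weights plus one scale per group of rows, then dequantizing inside the matmul. A minimal NumPy sketch of that grouped quantize/dequantize round trip (illustrative only; function names and the symmetric scheme are assumptions, not oneDNN's kernel):

```python
import numpy as np

def quantize_int4_grouped(w, group_size=32):
    # One scale per group of `group_size` rows, per output column.
    K, N = w.shape
    w_g = w.reshape(K // group_size, group_size, N)
    scales = np.abs(w_g).max(axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(w_g / scales), -8, 7).astype(np.int8)
    return q.reshape(K, N), scales

def dequantize_int4_grouped(q, scales, group_size=32):
    K, N = q.shape
    w_g = q.reshape(K // group_size, group_size, N).astype(np.float32) * scales
    return w_g.reshape(K, N)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 16)).astype(np.float32)
q, s = quantize_int4_grouped(w)
w_hat = dequantize_int4_grouped(q, s)
```

Smaller groups give tighter scales (lower error) at the cost of more scale metadata to fetch per tile.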

## Intel Graphics Products

* Improved performance for the Intel Data Center GPU Max Series (formerly
Ponte Vecchio).
* Introduced broad production quality optimizations for Intel Arc Graphics for
Intel Core Ultra Processors (Series 2) (formerly Lunar Lake).
* Introduced broad production quality optimizations for future discrete GPUs
based on the Xe2 architecture (code name Battlemage).
* Introduced support for Intel Arc Graphics for future Intel Core Ultra
Processor (code name Arrow Lake-H).
* Improved performance of `fp8_e5m2` primitives on Intel Data Center GPU Max
Series (formerly Ponte Vecchio).
* Improved matmul and inner product primitives performance for shapes relevant
to large language models (LLMs) on GPUs with Intel XMX support.
* Improved `int8` convolution performance with weight zero points.
* Reduced primitive creation time for softmax, layer normalization, and concat
primitives via kernel reuse.
* Improved performance of the following subgraphs with Graph API:
- SDPA without scale, MQA, and GQA patterns. `f16` variants of these
patterns significantly benefit from Intel(R) Xe Matrix Extensions (Intel(R)
XMX) support.
- `fp8` `convolution` and `unary` or `binary` on Intel Data Center GPU Max
Series.
- `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and
zero-points.
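
The SDPA, MQA, and GQA patterns mentioned above are the same computation with different numbers of key/value heads: GQA shares one K/V head across a group of query heads, and MQA is the single-K/V-head case. A NumPy sketch of the unfused math (illustrative only; the Graph API fuses this into optimized kernels, and the 1/sqrt(d) scale shown here is the common convention, not part of the "without scale" variant):

```python
import numpy as np

def gqa(q, k, v):
    # q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).
    # Each group of n_heads // n_kv_heads query heads shares one K/V head.
    n_heads, seq, d = q.shape
    group = n_heads // k.shape[0]
    out = np.empty_like(q)
    for h in range(n_heads):
        kh, vh = k[h // group], v[h // group]
        scores = q[h] @ kh.T / np.sqrt(d)             # (seq, seq)
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        p = np.exp(scores)
        p /= p.sum(axis=-1, keepdims=True)
        out[h] = p @ vh
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16)).astype(np.float32)
k = rng.standard_normal((2, 4, 16)).astype(np.float32)  # 2 K/V heads -> GQA
v = rng.standard_normal((2, 4, 16)).astype(np.float32)
o = gqa(q, k, v)  # MQA is the same call with k, v of shape (1, seq, d)
```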

## AArch64-based Processors

* Improved `fp32` convolution backpropagation performance on processors with
SVE support.
* Improved reorder performance for blocked format on processors with
SVE support.
* Improved `bf16` softmax performance on processors with SVE support.
* Improved batch normalization performance on processors with SVE support.
* Improved matmul performance on processors with SVE support.
* Improved `fp16` convolution with Arm Compute Library (ACL).
* Improved matmul performance with ACL.
* Switched the matmul and convolution implementations with ACL to the stateless
API, significantly improving primitive creation time and increasing caching
efficiency and performance for these operators.

# Functionality

* Introduced [generic GPU] support. This implementation relies on portable
SYCL kernels and can be used as a starting point to enable new devices in
oneDNN.
* Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL based
implementations.
* Enabled support for `int8` activations with grouped scales and `int8`
or `int4` compressed weights in matmul primitive. This functionality
is implemented on Intel GPUs.
* Introduced support for stochastic rounding for `fp8` data types.
* **[experimental]** Extended [microkernel API]:
- Introduced `int8` quantization support.
- Extended transform microkernel with transposition support and support for
arbitrary strides.
- Introduced verbose diagnostics support.
* **[experimental]** Extended [sparse API]:
- Introduced support for sparse memory with coordinate (COO) storage format.
- Extended matmul primitive to work with sparse memory in COO format. This
functionality is implemented on CPUs and Intel GPUs.
* Introduced `int8` support in the eltwise primitive with the `clip` algorithm.
This functionality is implemented on CPUs.
* Graph API:
- Introduced the `GroupNorm` operation and its fusions.
- Introduced support for standalone `StaticReshape` and `StaticTranspose`
operations.
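
Stochastic rounding, mentioned above for `fp8`, rounds a value up or down with probability proportional to its distance from each neighbor on the target grid, so rounding error averages out to zero across many values. A hedged NumPy sketch on a generic quantization step (not oneDNN's implementation, which operates on `fp8` grids in hardware-friendly form):

```python
import numpy as np

def stochastic_round(x, step, rng):
    # Round each value to a multiple of `step`, picking the upper
    # neighbor with probability equal to the fractional remainder.
    scaled = x / step
    lower = np.floor(scaled)
    frac = scaled - lower
    round_up = rng.random(x.shape) < frac
    return (lower + round_up) * step

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
r = stochastic_round(x, step=1.0, rng=rng)
# r.mean() is close to 0.3: unbiased in expectation, whereas
# round-to-nearest would map every element to 0.0.
```

This unbiasedness is why stochastic rounding matters for low-precision training: accumulated gradient updates smaller than one `fp8` step are not systematically lost.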

[generic GPU]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/src/gpu/generic/sycl/README.md
[microkernel API]: https://oneapi-src.github.io/oneDNN/v3.6/ukernels.html
[sparse API]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-sparse
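
For reference, the coordinate (COO) storage format added to the sparse API represents each nonzero as a (row, column, value) triplet. A minimal dense-times-COO matmul sketch in NumPy (illustrative only; names are assumptions and this is independent of oneDNN's memory descriptors):

```python
import numpy as np

def coo_from_dense(a):
    # Collect (row, col, value) triplets for every nonzero of `a`.
    rows, cols = np.nonzero(a)
    return rows, cols, a[rows, cols]

def matmul_dense_coo(x, rows, cols, vals, n_cols):
    # y[i, c] += x[i, r] * v for every stored triplet (r, c, v).
    y = np.zeros((x.shape[0], n_cols), dtype=x.dtype)
    for r, c, v in zip(rows, cols, vals):
        y[:, c] += x[:, r] * v
    return y

a = np.array([[0.0, 2.0],
              [3.0, 0.0],
              [0.0, 0.0]])          # sparse operand, 2 nonzeros
x = np.arange(6, dtype=np.float64).reshape(2, 3)
rows, cols, vals = coo_from_dense(a)
y = matmul_dense_coo(x, rows, cols, vals, a.shape[1])
assert np.allclose(y, x @ a)       # matches the dense product
```

COO is the simplest sparse layout to construct; formats like CSR trade that simplicity for faster row-wise traversal.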

# Usability

* Added [examples][Graph API examples] for SDPA, MQA, and GQA patterns
implementation with Graph API.
* Added [an example][deconvolution example] for deconvolution primitive.
* Added examples for [Vanilla RNN][Vanilla RNN example] and
[LBR GRU][LBR GRU example] RNN cells.
* Introduced support for Intel DPC++/C++ Compiler 2025.0.
* Introduced interoperability with [SYCL Graph] record/replay mode.
* Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
* **[experimental]** Introduced [logging mechanism][spdlog] based on spdlog
library.
* Introduced support for `ONEDNN_ENABLE_WORKLOAD` build knob for Graph API.
* Improved performance of `get_partitions()` function in Graph API.

[Graph API examples]: https://github.com/oneapi-src/oneDNN/tree/rls-v3.6/examples/graph
[deconvolution example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/deconvolution.cpp
[Vanilla RNN example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/vanilla_rnn.cpp
[LBR GRU example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/lbr_gru.cpp
[SYCL Graph]: https://codeplay.com/portal/blogs/2024/01/22/sycl-graphs
[spdlog]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-logging

# Validation

* Introduced protection from out-of-memory scenarios in the benchdnn Graph API
driver.

# Breaking Changes

* Experimental [microkernel API] in this release is not compatible with
[the version available][microkernel API v3.5] in oneDNN v3.5.
* Updated minimal supported ACL version to 24.08.1 (was 24.04).

[microkernel API v3.5]: https://oneapi-src.github.io/oneDNN/v3.5/ukernels.html

# Thanks to these Contributors

This release contains contributions from the [project core team] as well as
Abdel @quickwritereader, Adam Jackson @nwnk, Aleksandr Voron @alvoron,
Alexey Makarevich @amakarev, Annop Wongwathanarat @annop-w, Daniel Kuts
@apach301, @deepeshfujitsu, Fadi Arafeh @fadara01, Fritz Heckel @fwph,
Gorokhov Dmitriy @dmitry-gorokhov, Deeksha Kasture @kasturedeeksha,
Kentaro Kawakami @kawakami-k, Marek Michalowski @michalowski-arm,
@matthias-bonne, @Menooker, Michael Froelich @MichaelFroelich,
Nicolas Miller @npmiller, Nikhil Sharma @nikhilfujitsu, @nishith-fujitsu,
Permanence AI Coder @Permanence-AI-Coder, Radu Salavat @Radu2k, Renato Barros
Arantes @renato-arantes, Robert Cohn @rscohn2, Robert Hardwick @robert-hardwick,
Ryo Suzuki @Ryo-not-rio, Shreyas-fuj @Shreyas-fuj, Shu Chen @shu1chen,
Siddhartha Menon @Sqvid, Song Jiaming @Litchilitchy, Vladimir Paramuzov
@vladimir-paramuzov, Yifei Zhang @yifeizh2. We would also like to thank everyone
who asked questions and reported issues.

[project core team]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/MAINTAINERS.md
