doc: added release notes for oneDNN v3.6
vpirogov committed Sep 23, 2024
oneDNN v3.6 Release Notes
=========================

# Performance Optimizations

## Intel Architecture Processors

* Improved performance for 4th generation Intel Xeon Scalable processors
(formerly Sapphire Rapids).
* Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
* Improved performance of group normalization primitive.
* Improved bf16 matmul performance with int4 compressed weights on processors
with Intel AMX instruction set support.
* Improved performance of `fp8` matmul, pooling, and eltwise primitives on
processors with Intel AMX instruction set support.
* Improved `fp32` RNN primitive performance on processors with Intel AVX2
instruction set support.
* Improved performance of the following subgraphs with Graph API:
- `convolution` and `binary` operation fusions with better layout selection.
- `fp8` `convolution` and `unary` or `binary` on processors with Intel AMX
instruction set support.
- Scaled Dot Product Attention (SDPA) without scale,
Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns.
- `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output
and zero-points.
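
The int4 compressed-weights path above amounts to storing quantized weights plus one scale per group of rows, then dequantizing inside the matmul. A minimal NumPy sketch of that grouped quantize/dequantize round trip (illustrative only; function names and the symmetric scheme are assumptions, not oneDNN's kernel):

```python
import numpy as np

def quantize_int4_grouped(w, group_size=32):
    # One scale per group of `group_size` rows, per output column.
    K, N = w.shape
    w_g = w.reshape(K // group_size, group_size, N)
    scales = np.abs(w_g).max(axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(w_g / scales), -8, 7).astype(np.int8)
    return q.reshape(K, N), scales

def dequantize_int4_grouped(q, scales, group_size=32):
    K, N = q.shape
    w_g = q.reshape(K // group_size, group_size, N).astype(np.float32) * scales
    return w_g.reshape(K, N)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 16)).astype(np.float32)
q, s = quantize_int4_grouped(w)
w_hat = dequantize_int4_grouped(q, s)
```

Smaller groups give tighter scales (lower error) at the cost of more scale metadata to fetch per tile.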

## Intel Graphics Products

* Improved performance for the Intel Data Center GPU Max Series (formerly
Ponte Vecchio).
* Introduced broad production quality optimizations for Intel Arc Graphics for
Intel Core Ultra Processors (Series 2) (formerly Lunar Lake).
* Introduced broad production quality optimizations for future discrete GPUs
based on the Xe2 architecture (code name Battlemage).
* Introduced support for Intel Arc Graphics for future Intel Core Ultra
Processor (code name Arrow Lake-H).
* Improved performance of `fp8_e5m2` primitives on Intel Data Center GPU Max
Series (formerly Ponte Vecchio).
* Improved matmul and inner product primitives performance for shapes relevant
to large language models (LLMs) on GPUs with Intel XMX support.
* Improved `int8` convolution performance with weight zero points.
* Reduced primitive creation time for softmax, layer normalization, and concat
primitives via kernel reuse.
* Improved performance of the following subgraphs with Graph API:
- SDPA without scale, MQA, and GQA patterns. `f16` variants of these
patterns significantly benefit from Intel(R) Xe Matrix Extensions (Intel(R)
XMX) support.
- `fp8` `convolution` and `unary` or `binary` on Intel Data Center GPU Max
Series.
- `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and
zero-points.
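
The SDPA, MQA, and GQA patterns mentioned above are the same computation with different numbers of key/value heads: GQA shares one K/V head across a group of query heads, and MQA is the single-K/V-head case. A NumPy sketch of the unfused math (illustrative only; the Graph API fuses this into optimized kernels, and the 1/sqrt(d) scale shown here is the common convention, not part of the "without scale" variant):

```python
import numpy as np

def gqa(q, k, v):
    # q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).
    # Each group of n_heads // n_kv_heads query heads shares one K/V head.
    n_heads, seq, d = q.shape
    group = n_heads // k.shape[0]
    out = np.empty_like(q)
    for h in range(n_heads):
        kh, vh = k[h // group], v[h // group]
        scores = q[h] @ kh.T / np.sqrt(d)             # (seq, seq)
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        p = np.exp(scores)
        p /= p.sum(axis=-1, keepdims=True)
        out[h] = p @ vh
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16)).astype(np.float32)
k = rng.standard_normal((2, 4, 16)).astype(np.float32)  # 2 K/V heads -> GQA
v = rng.standard_normal((2, 4, 16)).astype(np.float32)
o = gqa(q, k, v)  # MQA is the same call with k, v of shape (1, seq, d)
```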

## AArch64-based Processors

* Improved `fp32` convolution backpropagation performance on processors with
SVE support.
* Improved reorder performance for blocked format on processors with
SVE support.
* Improved `bf16` softmax performance on processors with SVE support.
* Improved batch normalization performance on processors with SVE support.
* Improved matmul performance on processors with SVE support.
* Improved `fp16` convolution with Arm Compute Library (ACL).
* Improved matmul performance with ACL.
* Switched the matmul and convolution implementations with ACL to the stateless
API, significantly improving primitive creation time and increasing caching
efficiency and performance for these operators.

# Functionality

* Introduced [generic GPU] support. This implementation relies on portable
SYCL kernels and can be used as a starting point to enable new devices in
oneDNN.
* Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL based
implementations.
* Enabled support for `int8` activations with grouped scales and `int8`
or `int4` compressed weights in matmul primitive. This functionality
is implemented on Intel GPUs.
* Introduced support for stochastic rounding for `fp8` data types.
* **[experimental]** Extended [microkernel API]:
- Introduced `int8` quantization support.
- Extended transform microkernel with transposition support and support for
arbitrary strides.
- Introduced verbose diagnostics support.
* **[experimental]** Extended [sparse API]:
- Introduced support for sparse memory with coordinate (COO) storage format.
- Extended matmul primitive to work with sparse memory in COO format. This
functionality is implemented on CPUs and Intel GPUs.
* Introduced `int8` support in the eltwise primitive with the `clip` algorithm.
This functionality is implemented on CPUs.
* Graph API:
- Introduced the `GroupNorm` operation and its fusions.
- Introduced support for standalone `StaticReshape` and `StaticTranspose`
operations.
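
Stochastic rounding, mentioned above for `fp8`, rounds a value up or down with probability proportional to its distance from each neighbor on the target grid, so rounding error averages out to zero across many values. A hedged NumPy sketch on a generic quantization step (not oneDNN's implementation, which operates on `fp8` grids in hardware-friendly form):

```python
import numpy as np

def stochastic_round(x, step, rng):
    # Round each value to a multiple of `step`, picking the upper
    # neighbor with probability equal to the fractional remainder.
    scaled = x / step
    lower = np.floor(scaled)
    frac = scaled - lower
    round_up = rng.random(x.shape) < frac
    return (lower + round_up) * step

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
r = stochastic_round(x, step=1.0, rng=rng)
# r.mean() is close to 0.3: unbiased in expectation, whereas
# round-to-nearest would map every element to 0.0.
```

This unbiasedness is why stochastic rounding matters for low-precision training: accumulated gradient updates smaller than one `fp8` step are not systematically lost.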

[generic GPU]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/src/gpu/generic/sycl/README.md
[microkernel API]: https://oneapi-src.github.io/oneDNN/v3.6/ukernels.html
[sparse API]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-sparse
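
For reference, the coordinate (COO) storage format added to the sparse API represents each nonzero as a (row, column, value) triplet. A minimal dense-times-COO matmul sketch in NumPy (illustrative only; names are assumptions and this is independent of oneDNN's memory descriptors):

```python
import numpy as np

def coo_from_dense(a):
    # Collect (row, col, value) triplets for every nonzero of `a`.
    rows, cols = np.nonzero(a)
    return rows, cols, a[rows, cols]

def matmul_dense_coo(x, rows, cols, vals, n_cols):
    # y[i, c] += x[i, r] * v for every stored triplet (r, c, v).
    y = np.zeros((x.shape[0], n_cols), dtype=x.dtype)
    for r, c, v in zip(rows, cols, vals):
        y[:, c] += x[:, r] * v
    return y

a = np.array([[0.0, 2.0],
              [3.0, 0.0],
              [0.0, 0.0]])          # sparse operand, 2 nonzeros
x = np.arange(6, dtype=np.float64).reshape(2, 3)
rows, cols, vals = coo_from_dense(a)
y = matmul_dense_coo(x, rows, cols, vals, a.shape[1])
assert np.allclose(y, x @ a)       # matches the dense product
```

COO is the simplest sparse layout to construct; formats like CSR trade that simplicity for faster row-wise traversal.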

# Usability

* Added [examples][Graph API examples] for SDPA, MQA, and GQA patterns
implementation with Graph API.
* Added [an example][deconvolution example] for deconvolution primitive.
* Added examples for [Vanilla RNN][Vanilla RNN example] and
[LBR GRU][LBR GRU example] RNN cells.
* Introduced support for Intel DPC++/C++ Compiler 2025.0.
* Introduced interoperability with [SYCL Graph] record/replay mode.
* Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
* **[experimental]** Introduced [logging mechanism][spdlog] based on spdlog
library.
* Introduced support for `ONEDNN_ENABLE_WORKLOAD` build knob for Graph API.
* Improved performance of `get_partitions()` function in Graph API.

[Graph API examples]: https://github.com/oneapi-src/oneDNN/tree/rls-v3.6/examples/graph
[deconvolution example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/deconvolution.cpp
[Vanilla RNN example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/vanilla_rnn.cpp
[LBR GRU example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/lbr_gru.cpp
[SYCL Graph]: https://codeplay.com/portal/blogs/2024/01/22/sycl-graphs
[spdlog]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-logging

# Validation

* Introduced protection from out-of-memory scenarios in the benchdnn Graph API
driver.

# Breaking Changes

* Experimental [microkernel API] in this release is not compatible with
[the version available][microkernel API v3.5] in oneDNN v3.5.
* Updated minimal supported ACL version to 24.08.1 (was 24.04).

[microkernel API v3.5]: https://oneapi-src.github.io/oneDNN/v3.5/ukernels.html

# Thanks to these Contributors

This release contains contributions from the [project core team] as well as
Abdel @quickwritereader, Adam Jackson @nwnk, Aleksandr Voron @alvoron,
Alexey Makarevich @amakarev, Annop Wongwathanarat @annop-w, Daniel Kuts
@apach301, @deepeshfujitsu, Fadi Arafeh @fadara01, Fritz Heckel @fwph,
Gorokhov Dmitriy @dmitry-gorokhov, Deeksha Kasture @kasturedeeksha,
Kentaro Kawakami @kawakami-k, Marek Michalowski @michalowski-arm,
@matthias-bonne, @Menooker, Michael Froelich @MichaelFroelich,
Nicolas Miller @npmiller, Nikhil Sharma @nikhilfujitsu, @nishith-fujitsu,
Permanence AI Coder @Permanence-AI-Coder, Radu Salavat @Radu2k, Renato Barros
Arantes @renato-arantes, Robert Cohn @rscohn2, Robert Hardwick @robert-hardwick,
Ryo Suzuki @Ryo-not-rio, Shreyas-fuj @Shreyas-fuj, Shu Chen @shu1chen,
Siddhartha Menon @Sqvid, Song Jiaming @Litchilitchy, Vladimir Paramuzov
@vladimir-paramuzov, Yifei Zhang @yifeizh2. We would also like to thank everyone
who asked questions and reported issues.

[project core team]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/MAINTAINERS.md
