v3.6
Performance Optimizations
Intel Architecture Processors
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
- Improved performance of group normalization primitive.
- Improved `bf16` matmul performance with `int4` compressed weights on processors with Intel AMX instruction set support.
- Improved performance of `fp8` matmul, pooling, and eltwise primitives on processors with Intel AMX instruction set support.
- Improved `fp32` RNN primitive performance on processors with Intel AVX2 instruction set support.
- Improved performance of the following subgraphs with Graph API:
  - `convolution` and `binary` operation fusions with better layout selection in Graph API.
  - `fp8` `convolution` and `unary` or `binary` on processors with Intel AMX instruction set support.
  - Scaled Dot Product Attention (SDPA) without scale, Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns (a minimal sketch of the SDPA pattern follows this list).
  - `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and zero-points.
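
For reference, below is a minimal sketch (not part of the release notes themselves) of how the SDPA-without-scale pattern can be expressed with the Graph API so the library can fuse it into a partition. The tensor shapes, ids, and data types are arbitrary placeholders.

```cpp
// Minimal sketch: building an SDPA-without-scale subgraph with the oneDNN
// Graph API and querying the fused partitions. Execution is omitted.
#include <oneapi/dnnl/dnnl_graph.hpp>

#include <cstdint>
#include <vector>

using namespace dnnl::graph;

int main() {
    using dt = logical_tensor::data_type;
    using lt = logical_tensor::layout_type;

    // batch x heads x sequence x head_size; placeholder shapes.
    std::vector<int64_t> qkv_shape = {1, 16, 384, 64};
    std::vector<int64_t> score_shape = {1, 16, 384, 384};

    logical_tensor query {0, dt::f32, qkv_shape, lt::strided};
    logical_tensor key   {1, dt::f32, qkv_shape, lt::strided};
    logical_tensor value {2, dt::f32, qkv_shape, lt::strided};
    logical_tensor score {3, dt::f32, score_shape, lt::strided};
    logical_tensor probs {4, dt::f32, score_shape, lt::strided};
    logical_tensor out   {5, dt::f32, qkv_shape, lt::strided};

    // Q x K^T -> raw attention scores; the "without scale" variant has no
    // explicit multiplication by 1/sqrt(head_size).
    op qk {0, op::kind::MatMul, {query, key}, {score}, "qk"};
    qk.set_attr<bool>(op::attr::transpose_b, true);

    // Softmax over the last dimension.
    op softmax {1, op::kind::SoftMax, {score}, {probs}, "softmax"};
    softmax.set_attr<int64_t>(op::attr::axis, -1);

    // probs x V -> attention output.
    op pv {2, op::kind::MatMul, {probs, value}, {out}, "pv"};

    graph g(dnnl::engine::kind::cpu);
    g.add_op(qk);
    g.add_op(softmax);
    g.add_op(pv);
    g.finalize();

    // A single partition indicates the whole pattern was matched for fusion.
    auto partitions = g.get_partitions();
    return partitions.size() == 1 ? 0 : 1;
}
```

MQA and GQA follow the same construction with different head counts for the key and value tensors; whether the three ops land in a single partition depends on the backend and build configuration.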
Intel Graphics Products
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Introduced broad production quality optimizations for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
- Introduced broad production quality optimizations for future discrete GPU based on Xe2 architecture (code name Battlemage).
- Introduced support for Intel Arc Graphics for future Intel Core Ultra processor (code name Arrow Lake-H).
- Improved performance of `fp8_e5m2` primitives on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved matmul and inner product primitives performance for shapes relevant to large language models (LLMs) on GPUs with Intel XMX support.
- Improved `int8` convolution performance with weight zero-points (see the sketch after this list).
- Reduced primitive creation time for softmax, layer normalization, and concat primitives via kernel reuse.
- Improved performance of the following subgraphs with Graph API:
  - SDPA without scale, MQA, and GQA patterns. `f16` variants of these patterns significantly benefit from Intel(R) Xe Matrix Extensions (Intel(R) XMX) support.
  - `fp8` `convolution` and `unary` or `binary` on the Intel Data Center GPU Max Series.
  - `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and zero-points.
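
As a rough illustration of the weight zero-points mentioned above, the sketch below attaches scales and a weight zero-point to an int8 convolution through primitive attributes. The shapes, masks, and the assumption that weight zero-points are passed via the standard zero-points attribute on `DNNL_ARG_WEIGHTS` are illustrative, not an excerpt from the library's examples.

```cpp
// Hedged sketch: int8 convolution with a weight zero-point supplied via
// primitive attributes. Assumes an Intel GPU device is available.
#include <oneapi/dnnl/dnnl.hpp>

using namespace dnnl;

int main() {
    engine eng(engine::kind::gpu, 0);

    // NCHW activations, OIHW weights; "any" lets the library pick layouts.
    memory::desc src_md({1, 64, 56, 56}, memory::data_type::u8, memory::format_tag::any);
    memory::desc wei_md({128, 64, 3, 3}, memory::data_type::s8, memory::format_tag::any);
    memory::desc dst_md({1, 128, 56, 56}, memory::data_type::u8, memory::format_tag::any);

    primitive_attr attr;
    attr.set_scales_mask(DNNL_ARG_SRC, 0);          // common activation scale
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 1 << 0); // per-output-channel weight scales
    attr.set_zero_points_mask(DNNL_ARG_WEIGHTS, 0); // common weight zero-point

    convolution_forward::primitive_desc pd(eng, prop_kind::forward_inference,
            algorithm::convolution_direct, src_md, wei_md, dst_md,
            /*strides=*/{1, 1}, /*padding_l=*/{1, 1}, /*padding_r=*/{1, 1}, attr);

    auto conv = convolution_forward(pd);
    // Actual scale and zero-point values are passed at execution time via
    // DNNL_ARG_ATTR_SCALES | DNNL_ARG_* and DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_WEIGHTS.
    (void)conv;
    return 0;
}
```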
AArch64-based Processors
- Improved `fp32` convolution backpropagation performance on processors with SVE support.
- Improved reorder performance for blocked format on processors with SVE support.
- Improved `bf16` softmax performance on processors with SVE support.
- Improved batch normalization performance on processors with SVE support.
- Improved matmul performance on processors with SVE support.
- Improved `fp16` convolution performance with Arm Compute Library (ACL).
- Improved matmul performance with ACL.
- Switched matmul and convolution implementations with ACL to the stateless API, significantly improving primitive creation time and increasing caching efficiency and performance for these operators.
Functionality
- Introduced generic GPU support. This implementation relies on portable SYCL kernels and can be used as a starting point to enable new devices in oneDNN.
- Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL-based implementations.
- Enabled support for `int8` activations with grouped scales and `int8` or `int4` compressed weights in matmul primitive. This functionality is implemented on Intel GPUs (see the first sketch after this list).
- Introduced support for stochastic rounding for `fp8` data type functionality.
- [experimental] Extended microkernel API:
  - Introduced `int8` quantization support.
  - Extended transform microkernel with transposition support and support for arbitrary strides.
  - Introduced verbose diagnostics support.
- [experimental] Extended sparse API:
- Introduced support for sparse memory with coordinate (COO) storage format.
- Extended matmul primitive to work with sparse memory in COO format. This functionality is implemented on CPUs and Intel GPUs.
- Introduced `int8` support in eltwise primitive with 'clip' algorithm. This functionality is implemented on CPUs (see the second sketch after this list).
- Graph API:
  - Introduced `GroupNorm` operation and fusions in Graph API.
  - Introduced support for standalone `StaticReshape` and `StaticTranspose` operations.
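
The first sketch below is a rough illustration, not taken from the release, of how a matmul with int8 activations, grouped scales, and int4 compressed weights might be configured through primitive attributes. The group size, scale masks, and data types are illustrative assumptions; check them against the v3.6 matmul and quantization documentation.

```cpp
// Hedged sketch: matmul with s8 activations and s4 compressed weights, both
// using grouped scales along the K dimension. Assumes an Intel GPU device.
#include <oneapi/dnnl/dnnl.hpp>

using namespace dnnl;

int main() {
    engine eng(engine::kind::gpu, 0);

    const memory::dim M = 32, K = 4096, N = 4096, group = 128;

    memory::desc src_md({M, K}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s4, memory::format_tag::any);
    memory::desc dst_md({M, N}, memory::data_type::f16, memory::format_tag::ab);

    primitive_attr attr;
    // Grouped scales: the mask selects both dimensions, the groups array gives
    // the block size per dimension (here: blocks of `group` along K).
    attr.set_scales(DNNL_ARG_SRC, (1 << 0) | (1 << 1), {1, group},
            memory::data_type::f16);
    attr.set_scales(DNNL_ARG_WEIGHTS, (1 << 0) | (1 << 1), {group, 1},
            memory::data_type::f16);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    auto mm = matmul(pd);
    // Scale tensors are supplied at execution time via
    // DNNL_ARG_ATTR_SCALES | DNNL_ARG_SRC and DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS.
    (void)mm;
    return 0;
}
```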
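
The second sketch shows the int8 eltwise 'clip' usage on CPU; `alpha` and `beta` act as the lower and upper clipping bounds, and the tensor shape is an arbitrary placeholder.

```cpp
// Short sketch: int8 eltwise clip on CPU, clamping values to [-8, 8].
#include <oneapi/dnnl/dnnl.hpp>

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    memory::desc md({1, 64, 28, 28}, memory::data_type::s8, memory::format_tag::nchw);
    memory src(md, eng), dst(md, eng);

    // alpha = lower bound, beta = upper bound of the clip algorithm.
    eltwise_forward::primitive_desc pd(eng, prop_kind::forward_inference,
            algorithm::eltwise_clip, md, md, /*alpha=*/-8.f, /*beta=*/8.f);

    eltwise_forward(pd).execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
    strm.wait();
    return 0;
}
```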
Usability
- Added examples for SDPA, MQA, and GQA patterns implementation with Graph API.
- Added an example for deconvolution primitive.
- Added examples for Vanilla RNN and LBR GRU RNN cells.
- Introduced support for Intel oneAPI DPC++/C++ Compiler 2025.0.
- Introduced interoperability with SYCL Graph record/replay mode.
- Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
- [experimental] Introduced logging mechanism based on spdlog library.
- Introduced support for `ONEDNN_ENABLE_WORKLOAD` build knob for Graph API.
- Improved performance of `get_partitions()` function in Graph API.
Validation
- Introduced protection from out-of-memory scenarios in benchdnn Graph API driver.
Deprecated Functionality
- Experimental Graph Compiler is deprecated and will be removed in future releases.
Breaking Changes
- Experimental microkernel API in this release is not compatible with the version available in oneDNN v3.5.
- Updated minimal supported ACL version to 24.08.1 (was 24.04).
Thanks to these Contributors
This release contains contributions from the project core team as well as Abdel @quickwritereader, Adam Jackson @nwnk, Aleksandr Voron @alvoron, Alexey Makarevich @amakarev, Annop Wongwathanarat @annop-w, Daniel Kuts @apach301, @deepeshfujitsu, Fadi Arafeh @fadara01, Fritz Heckel @fwph, Gorokhov Dmitriy @dmitry-gorokhov, Deeksha Kasture @kasturedeeksha, Kentaro Kawakami @kawakami-k, Marek Michalowski @michalowski-arm, @matthias-bonne, @Menooker, Michael Froelich @MichaelFroelich,
Nicolas Miller @npmiller, Nikhil Sharma @nikhilfujitsu, @nishith-fujitsu, Permanence AI Coder @Permanence-AI-Coder, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Robert Cohn @rscohn2, Robert Hardwick @robert-hardwick, Ryo Suzuki @Ryo-not-rio, Shreyas-fuj @Shreyas-fuj, Shu Chen @shu1chen, Siddhartha Menon @Sqvid, Song Jiaming @Litchilitchy, Vladimir Paramuzov @vladimir-paramuzov, Yifei Zhang @yifeizh2. We would also like to thank everyone who asked questions and reported issues.