v3.5
Performance Optimizations
Intel Architecture Processors
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids).
- Improved performance of group normalization primitive.
- Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
- Improved performance of the following subgraphs with Graph API (an SDPA construction sketch follows this list):
  - Multi-Query Attention (MQA).
  - Scaled Dot Product Attention (SDPA), including the variant with `select` operation.
  - `LayerNorm` + `Multiply` + `Quantize` produced by the SmoothQuant algorithm.
  - `Convolution` + `Sigmoid` + `Multiply` with mixed precisions.
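These fused patterns are picked up by the Graph API pattern matcher when the corresponding subgraph is built and partitioned. The snippet below is a minimal, hypothetical sketch of an SDPA-style subgraph in C++ (f32 data; the shapes, tensor ids, and op names are illustrative, and the scale is applied with a `Divide` op); the compile-and-execute steps that follow partitioning are omitted.

```cpp
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

int main() {
    using dt = logical_tensor::data_type;
    using lt = logical_tensor::layout_type;

    // Illustrative shapes: batch 1, 16 heads, sequence length 384, head size 64.
    logical_tensor::dims qkv_shape {1, 16, 384, 64};
    logical_tensor::dims score_shape {1, 16, 384, 384};

    // Logical tensors are identified by unique ids.
    logical_tensor query {0, dt::f32, qkv_shape, lt::strided};
    logical_tensor key {1, dt::f32, qkv_shape, lt::strided};
    logical_tensor scores {2, dt::f32, score_shape, lt::strided};
    logical_tensor scale {3, dt::f32, logical_tensor::dims {1}, lt::strided};
    logical_tensor scaled {4, dt::f32, score_shape, lt::strided};
    logical_tensor probs {5, dt::f32, score_shape, lt::strided};
    logical_tensor value {6, dt::f32, qkv_shape, lt::strided};
    logical_tensor output {7, dt::f32, qkv_shape, lt::strided};

    // Q x K^T.
    op bmm1(0, op::kind::MatMul, {query, key}, {scores}, "bmm1");
    bmm1.set_attr<bool>(op::attr::transpose_b, true);
    // Scale the attention scores.
    op div(1, op::kind::Divide, {scores, scale}, {scaled}, "scale");
    // SoftMax over the last dimension.
    op softmax(2, op::kind::SoftMax, {scaled}, {probs}, "softmax");
    softmax.set_attr<int64_t>(op::attr::axis, -1);
    // Attention output: probs x V.
    op bmm2(3, op::kind::MatMul, {probs, value}, {output}, "bmm2");

    // Add the ops to a graph and let the pattern matcher form fused partitions.
    graph g(engine::kind::cpu);
    g.add_op(bmm1);
    g.add_op(div);
    g.add_op(softmax);
    g.add_op(bmm2);
    g.finalize();

    auto partitions = g.get_partitions();
    // Each partition is then compiled and executed with the usual Graph API flow.
    return 0;
}
```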
Intel Graphics Products
- Improved performance for Processor Graphics based on Xe2 architecture.
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
- Improved RNN primitive performance for LSTM cell case.
- Improved performance of `f8_e4m3` data type emulation on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
AArch64-based Processors
- Improved convolution forward propagation, matmul, and softmax performance for processors with SVE support.
- Improved `bf16` matmul, convolution, and reorder primitives performance with Arm Compute Library (ACL).
- Improved eltwise primitive performance with `gelu_erf` algorithm with ACL.
Functionality
- Introduced sum and binary post-ops support for layer normalization primitive. This functionality is currently implemented on CPUs only.
- Introduced support for `int4` data type and extended the quantization model with support for grouped scales and zero points.
- Introduced `fp64` matmul support. This functionality is currently implemented on Intel GPUs with hardware acceleration for fp64 math only.
- Extended the floating point math mode API to support weight decompression scenarios (a minimal sketch follows this list). See the matmul weights decompression example to get started. The new floating point mode is supported in the following configurations:
  - `bfloat16` matmul with `int8` weights on Intel CPUs.
  - `float16` and `bfloat16` matmul with `int8` or `int4` weights on Intel GPUs.
- [experimental] Introduced microkernel API for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users.
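The weight decompression path is requested through primitive attributes. Below is a minimal, hypothetical sketch of a `bfloat16` matmul with `int8` weights on CPU, assuming the two-argument `set_fpmath_mode` overload and the grouped-scales `set_scales` overload described above; the sizes, group size, and scales mask are illustrative, and the matmul weights decompression example in the repository remains the reference.

```cpp
#include "oneapi/dnnl/dnnl.hpp"

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // Illustrative GEMM sizes; G is the group size for scales along K.
    const memory::dim M = 128, K = 4096, N = 4096, G = 128;

    // bf16 activations and destination; weights stay compressed as int8.
    memory::desc src_md({M, K}, memory::data_type::bf16, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s8, memory::format_tag::any);
    memory::desc dst_md({M, N}, memory::data_type::bf16, memory::format_tag::ab);

    primitive_attr attr;
    // Allow integer weights to be up-converted to the bf16 math mode
    // (weight decompression) instead of requiring floating-point weights.
    attr.set_fpmath_mode(fpmath_mode::bf16, /*apply_to_int=*/true);
    // Grouped scales: one scale per group of G elements along K, per channel along N.
    attr.set_scales(DNNL_ARG_WEIGHTS, /*mask=*/(1 << 0) + (1 << 1), {G, 1},
            memory::data_type::f32);

    // Creation may throw on hardware without the required bf16/int8 support.
    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    matmul prim(pd);
    // ... create memory objects, pass the scales via
    // DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS, and execute as usual.
    return 0;
}
```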
Usability
- Extended error messages for engine and memory object creation errors.
- Extended verbose mode diagnostics with information on dispatching decisions for all primitives.
- Introduced support for `clang++` host compiler in SYCL builds.
- Introduced API for tensor serialization and deserialization.
- Extended verbose mode diagnostics for Graph API with information on pattern matcher decisions.
- Introduced OpenCL runtime support for Graph API.
- Added support for building oneDNN with an installed Arm Compute Library (ACL).
Validation
- Extended benchdnn with support for tensor tags in RNN primitive validation.
Breaking Changes
- Updated minimal supported ACL version to 24.04 (was 23.11).
Thanks to these Contributors
This release contains contributions from the project core team as well as Abdel @quickwritereader, @AngryLoki, Crefeda Rodrigues @cfRod, Daniel Richard G. @iskunk, David Svantesson @davsva01, @deepeshfujitsu, Dylan Angus @dylan-angus-codeplay, Emanuele Rocca @ema, Fadi Arafeh @fadara01, Hernan Martinez @hmartinez82, John Osorio @kala855, Jonathan Deakin @jondea, @kasturedeeksha, Kentaro Kawakami @kawakami-k, Nikita Shulga @malfet, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Roman Zhukov @rozhukov, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Sunita Nadampalli @snadampal, Tadej Ciglarič @t4c1, Vineel Abhinav @vineelabhinav, @vishwascm. We would also like to thank everyone who asked questions and reported issues.