Description
Now that I have a proper build of OpenBLAS 0.3.26 with GCC 11 for julia (see #4431 for the record), I'm testing it on an Nvidia Grace system, with surprising results:
$ OPENBLAS_NUM_THREADS=72 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8sve
julia> peakflops(20_000)
1.975233629673878e12
julia>
$ OPENBLAS_NUM_THREADS=72 OPENBLAS_CORETYPE=ARMV8 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8
julia> peakflops(20_000)
2.2233581470163237e12
The source code of the peakflops function is at https://github.com/JuliaLang/julia/blob/b058146cafec8405aa07f9647642fd485ea3b5d7/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L636-L654. It runs a double-precision matrix-matrix multiplication a few times (3 by default) using the underlying BLAS library (OpenBLAS by default), measures the elapsed time, and returns the achieved flops. The numbers reported above are reproducible: I always get peakflops on the order of 2.0e12 for armv8sve and 2.2e12 for armv8, so it isn't a one-off result.
It appears that the armv8sve dgemm kernel is slower (i.e. lower peakflops) than the generic armv8 one.
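For reference, here is a minimal sketch of the kind of measurement peakflops performs (not the stdlib code itself; the function name dgemm_flops and the use of rand for the inputs are illustrative):

using LinearAlgebra

# Time an n-by-n double-precision matrix multiplication a few times and
# report the flop rate of the best run, roughly what peakflops does.
function dgemm_flops(n::Integer = 20_000; ntrials::Integer = 3)
    a = rand(Float64, n, n)
    b = rand(Float64, n, n)
    c = similar(a)
    best = Inf
    for _ in 1:ntrials
        t = @elapsed mul!(c, a, b)   # dispatches to the BLAS dgemm kernel
        best = min(best, t)
    end
    return 2 * Float64(n)^3 / best   # a dgemm costs ~2n^3 floating-point operations
end

dgemm_flops(20_000)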
For the record, I get the same results as with julia's dynamic-arch build when using a native build of OpenBLAS 0.3.26 on the system, driven by Spack and configured with
'make' '-j24' '-s' 'CC=/home/cceamgi/repo/spack/lib/spack/env/gcc/gcc' 'FC=/home/cceamgi/repo/spack/lib/spack/env/gcc/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'TARGET=ARMV8SVE' 'USE_LOCKING=1' 'USE_OPENMP=1' 'USE_THREAD=1' 'INTERFACE64=1' 'SYMBOLSUFFIX=64_' 'RANLIB=ranlib' 'NUM_THREADS=512' 'all'
The compiler used on the system is
$ gcc --version
gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
So the problem isn't specific to julia's build of OpenBLAS.