LAPACK test fails with Error code from DDRGES3 = 9 on AMD Genoa #4032

Open
HPC-UniOldenburg opened this issue May 5, 2023 · 12 comments
Labels
LAPACK issue Deficiency in code imported from Reference-LAPACK

Comments

@HPC-UniOldenburg

System:

$ lscpu | grep 'Model name:'
Model name:          AMD EPYC 9554 64-Core Processor
$ uname -a
Linux hpcl002 4.18.0-372.32.1.el8_6.x86_64 #1 SMP Fri Oct 7 12:35:10 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/cm/shared/uniol/sw/zen4/12.2/GCCcore/12.2.0/libexec/gcc/x86_64-pc-linux-gnu/12.2.0/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
Target: x86_64-pc-linux-gnu
Configured with: ../configure --enable-languages=c,c++,fortran --without-cuda-driver --enable-offload-targets=nvptx-none --enable-lto --enable-checking=release --disable-multilib --enable-shared=yes --enable-static=yes --enable-threads=posix --enable-plugins --enable-gold --enable-ld=default --prefix=/cm/shared/uniol/sw/zen4/12.2/GCCcore/12.2.0 --with-local-prefix=/cm/shared/uniol/sw/zen4/12.2/GCCcore/12.2.0 --enable-bootstrap --with-isl=/scratch/easybuild/build/GCCcore/12.2.0/system-system/gcc-12.2.0/stage2_stuff --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.2.0 (GCC)

Build and Test Commands:
Building and testing OpenBLAS-0.3.23 (using EasyBuild) with the following commands:

$ make -j 256 libs netlib shared  BINARY='64'  CC='gcc'  FC='gfortran'  MAKE_NB_JOBS='-1'  USE_OPENMP='1'  USE_THREAD='1'  CFLAGS='-O2 -ftree-vectorize -march=native -fno-math-errno'
$ make tests  BINARY='64'  CC='gcc'  FC='gfortran'  MAKE_NB_JOBS='-1'  USE_OPENMP='1'  USE_THREAD='1'
$ make lapack-test  BINARY='64'  CC='gcc'  FC='gfortran'  MAKE_NB_JOBS='-1'  USE_OPENMP='1'  USE_THREAD='1'

Test results:
make tests completes without error; make lapack-test returns this summary:

                        -->   LAPACK TESTING SUMMARY  <--
SUMMARY                 nb test run     numerical error         other error
================        ===========     =================       ================
REAL                    1328283         0       (0.000%)        0       (0.000%)
DOUBLE PRECISION        1327545         1       (0.000%)        1       (0.000%)
COMPLEX                 779587          171     (0.022%)        0       (0.000%)
COMPLEX16               780654          97      (0.012%)        0       (0.000%)

--> ALL PRECISIONS      4216069         269     (0.006%)        1       (0.000%)

I think the other error is coming from

DGS drivers:      1 out of   1555 tests failed to pass the threshold
 *** Error code from DDRGES3 =    9

All details are in testing_results.txt
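
A minimal sketch for pulling the relevant lines out of the results file. The function name show_lapack_failures is made up for illustration, and the two search patterns are simply the message fragments quoted above, so other kinds of failure messages in the file would be missed.

```shell
# Hypothetical helper: list lines in a LAPACK test log that report a failure.
# The patterns are taken only from the two messages quoted in this report.
show_lapack_failures() {
    # $1 = path to the results file, e.g. testing_results.txt
    grep -n -E 'failed to pass the threshold|Error code from' "$1"
}
```

Calling show_lapack_failures testing_results.txt prints each matching line prefixed with its line number, which makes it easier to see which driver a message belongs to.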

Questions:
How can I get this other error resolved?
And should I worry about the 269 tests with numerical errors?

@martin-frbg
Collaborator

Can you please re-test with the -ftree-vectorize in your CFLAGS replaced with its opposite, -fno-tree-vectorize? (The tree vectorizer is on by default at -O2 in GCC 12.2 and is known to cause this kind of problem.)
And/or try the current develop branch. Unfortunately I do not have such a big Ryzen system available to me at the moment, but there were some recent fixes (added #pragma "no-tree-vectorize") for GCC 11 and 12 over-optimizing some complex BLAS functions.
The "other" error is probably a failed iteration in one of the LAPACK routines called, so ultimately a numerical accuracy problem as well. (Also see Reference-LAPACK/lapack#732 and linked issues. Sadly, parts of the testsuite appear to be too stringent to be useful in the context of optimized implementations or standard optimizations performed by modern compilers.)
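
To make the suggested flag swap concrete, here is a small sketch that rewrites the CFLAGS string from the original report. The variable names are illustrative only, and the make line in the trailing comment is just the reporter's build command with the new flags substituted.

```shell
# CFLAGS as given in the original report.
CFLAGS_ORIG='-O2 -ftree-vectorize -march=native -fno-math-errno'

# Replace the vectorizer flag with its negation; everything else is unchanged.
CFLAGS_NOVEC=$(printf '%s\n' "$CFLAGS_ORIG" | sed 's/-ftree-vectorize/-fno-tree-vectorize/')

echo "$CFLAGS_NOVEC"

# The rebuild would then reuse the reporter's command with the new flags, e.g.:
#   make -j 256 libs netlib shared BINARY='64' CC='gcc' FC='gfortran' \
#        MAKE_NB_JOBS='-1' USE_OPENMP='1' USE_THREAD='1' CFLAGS="$CFLAGS_NOVEC"
```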

@martin-frbg
Collaborator

Using the newly released GCC 13 could also be an option - at least according to my first tests, it appears to have fixed the tree-optimizer bugs that had me put the pragmas in the known affected source files.

@HPC-UniOldenburg
Author

Thanks for the suggestions. I tried -fno-tree-vectorize first; unfortunately, same result. Going to GCC 13 might be a good idea as it will also support Zen4 better, so I will try this next.

@HPC-UniOldenburg
Author

Some progress: building with GCC 13.1.0 reduces the number of numerical errors to 26, all in the COMPLEX (4) or COMPLEX16 (21) tests. However, the error with code 9 in DDRGES3 remains. Will try development branch on Monday then.

@martin-frbg
Collaborator

Also related: Reference-LAPACK/lapack#744 and Reference-LAPACK/lapack#475 (the latter was supposed to be fixed by Reference-LAPACK/lapack#477, but this appears to be rather fragile code with a long history of odd and fleeting convergence problems). Note also how the reported result is always a dramatic 4.5E+15.

@HPC-UniOldenburg
Author

HPC-UniOldenburg commented May 8, 2023

Test summary from building with GCC 13.1 and -ftree-vectorize:

                        -->   LAPACK TESTING SUMMARY  <--
SUMMARY                 nb test run     numerical error         other error
================        ===========     =================       ================
REAL                    1328283         0       (0.000%)        0       (0.000%)
DOUBLE PRECISION        1327545         1       (0.000%)        1       (0.000%)
COMPLEX                 786943          4       (0.001%)        0       (0.000%)
COMPLEX16               786918          21      (0.003%)        0       (0.000%)

--> ALL PRECISIONS      4229689         26      (0.001%)        1       (0.000%)

There is no difference between -ftree-vectorize and -fno-tree-vectorize, and also no difference between version 0.3.23 and the development branch. Error code 9 means that reordering failed in DTGSEN (line 550 of dgges3.f).

I also changed Makefile.x86_64:

$ diff -ru Makefile.x86_64.orig Makefile.x86_64
--- Makefile.x86_64.orig        2023-05-08 13:22:43.147444042 +0200
+++ Makefile.x86_64     2023-05-08 13:23:22.597020079 +0200
@@ -133,9 +133,9 @@
 ifeq ($(CORE), ZEN)
 ifdef HAVE_AVX512VL
 ifndef NO_AVX512
-CCOMMON_OPT += -march=skylake-avx512
+CCOMMON_OPT += -march=znver4
 ifneq ($(F_COMPILER), NAG)
-FCOMMON_OPT += -march=skylake-avx512
+FCOMMON_OPT += -march=znver4
 endif
 ifeq ($(OSNAME), CYGWIN_NT)
 CCOMMON_OPT += -fno-asynchronous-unwind-tables

but this also had no notable effect.

Surprisingly to me, changing the overall optimization from -O2 to -O1 changes the number of numerical errors to 76 in total (14 in REAL, 36 in DOUBLE, 4 in COMPLEX, and 22 in COMPLEX16), and the other error disappears. Not sure if this helps.

@martin-frbg
Collaborator

Yes, the -O1 effect is one of those counter-intuitive things where (probably) using fewer instructions means having fewer instances of rounding error on intermediate results. Using -march=znver4 will affect some instruction cost calculations but would need to be guarded with another gcc version check (and I think the performance gain should be pretty marginal). Lastly, you get much the same picture on a lowly zen3-based laptop, so CPU model and core count do not play much of a role after all. I seem to recall these test failures crept up after algorithm changes in Reference-LAPACK 3.10, but they appear to be more of a nuisance than an actual defect.

@HPC-UniOldenburg
Author

Thanks, I will ignore the errors for now and will keep an eye on the future releases of GCC and OpenBLAS.

@boegel
Contributor

boegel commented Sep 28, 2023

We're seeing this same problem (1 failing test due to a non-numerical issue, "DDRGES: DGGES returned INFO= 9.") in different settings, including when:

Is there an easy way to selectively disable this particular test, to avoid blindly ignoring other failing tests which do signal a problem worth looking into?

@martin-frbg
Collaborator

martin-frbg commented Sep 28, 2023

Probably by editing lapack-netlib/TESTING/dgg.in to either disable all these tests or remove the parameter(s) of the offending one; I have not confirmed this though. Interesting that it would happen with the generic build as well, where there is no FMA optimization beyond what the compiler does, and only some loop unrolling (assuming EasyBuild's "generic" corresponds to TARGET=GENERIC in OpenBLAS).

@martin-frbg
Collaborator

martin-frbg commented Sep 28, 2023

Sorry, dgd.in, not dgg.in. Specifically, remove the "6" from the first list of matrix dimensions in line 3 of that file (6 eigenvalues + error code 3 => INFO=9).
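
The arithmetic in the parenthesis can be sketched as a small decoder. The function name describe_gges_info is invented here. The N+3 case is the one discussed in this thread; the other branches follow the INFO description in the DGGES/DGGES3 header comments as far as I recall them and should be checked against the Fortran source.

```shell
# Hypothetical helper: interpret an INFO value returned by DGGES/DGGES3.
# $1 = matrix dimension N, $2 = INFO
# Only the N+3 branch is confirmed by this thread; the rest is an assumption
# based on the routine's documented INFO values.
describe_gges_info() {
    if [ "$2" -eq 0 ]; then
        echo "successful exit"
    elif [ "$2" -lt 0 ]; then
        echo "argument $(( 0 - $2 )) had an illegal value"
    elif [ "$2" -le "$1" ]; then
        echo "QZ iteration failed"
    elif [ "$2" -eq $(( $1 + 1 )) ]; then
        echo "failure other than QZ iteration in the QZ routine"
    elif [ "$2" -eq $(( $1 + 2 )) ]; then
        echo "roundoff changed eigenvalues after reordering"
    elif [ "$2" -eq $(( $1 + 3 )) ]; then
        echo "reordering failed in DTGSEN"
    else
        echo "unrecognized INFO value"
    fi
}

describe_gges_info 6 9
```

With N=6 and INFO=9 this prints "reordering failed in DTGSEN", matching the reading above.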

@boegel
Contributor

boegel commented Sep 29, 2023

@martin-frbg Seems like that worked like a charm, see easybuilders/easybuild-easyconfigs#18887 + EESSI/software-layer#334 (comment) in which we're retrying the build of OpenBLAS 0.3.23 with the patch included.
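
For readers who want to try the same workaround by hand, a rough sketch of such an edit follows. It is not the patch from the linked pull request. The helper name drop_dimension is made up, and the sketch assumes that line 2 of the input file starts with the number of dimensions listed on line 3, which this thread does not state, so check the real lapack-netlib/TESTING/dgd.in before relying on it.

```shell
# Hypothetical helper: remove one matrix dimension from a LAPACK test input file.
# $1 = input file, $2 = dimension to drop; the edited file goes to stdout.
# ASSUMPTION: line 2 starts with the number of dimensions, line 3 lists them.
drop_dimension() {
    awk -v drop="$2" '
        NR == 2 { held = $0; next }            # keep line 2 back until line 3 is seen
        NR == 3 {
            removed = 0; out = ""
            for (i = 1; i <= NF; i++) {
                # skip the first field equal to the requested dimension
                if (!removed && ($i == drop)) { removed = 1; continue }
                if (out == "") out = $i; else out = out " " $i
            }
            if (removed) {                     # one value gone, so lower the count
                n = split(held, parts, " ")
                held = parts[1] - 1
                for (j = 2; j <= n; j++) held = held " " parts[j]
            }
            print held; print out; next
        }
        { print }
        END { if (NR == 2) print held }        # file ended before line 3
    ' "$1"
}
```

For example, drop_dimension dgd.in 6 > dgd.in.new would write an edited copy next to the original. Whitespace on lines 2 and 3 is collapsed to single spaces when a value is removed.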
