Description
DGELSD fails to converge in some rare cases. I have attached a set of matrices (as binary files) that reproduce the problem. The LHS matrix is 1602x1602, the RHS matrix is 1602x1. I am only able to reliably reproduce the problem with OMP_NUM_THREADS=1.
Here's a minimal Fortran program that reproduces the problem:
program solve_least_squares
implicit none
integer, parameter :: m = 1602, n = 1602, nrhs = 1
double precision, dimension(m, n) :: lhs
double precision, dimension(m, nrhs) :: rhs
double precision, dimension(min(m, n)) :: s
double precision :: rcond
integer :: rank, info, lwork, liwork
double precision, allocatable :: work(:)
integer, allocatable :: iwork(:)
! Declare the LAPACK routine
interface
subroutine DGELSD(m, n, nrhs, a, lda, b, ldb, s, rcond, rank, work, lwork, iwork, info)
integer, intent(in) :: m, n, nrhs, lda, ldb, lwork
double precision, intent(inout) :: a(lda, *), b(ldb, *), s(*), work(*)
double precision, intent(in) :: rcond
integer, intent(out) :: rank, iwork(*), info
end subroutine DGELSD
end interface
! Use machine precision as the default
rcond = -1.0d0
! Open and read lhs.bin
print *, "Reading lhs.bin"
open(unit=10, file="lhs.bin", form="unformatted", access="stream", status="old")
read(10) lhs
close(10)
! Open and read rhs.bin
print *, "Reading rhs.bin"
open(unit=11, file="rhs.bin", form="unformatted", access="stream", status="old")
read(11) rhs
close(11)
! Workspace query
print *, "Querying workspace"
lwork = -1
allocate(work(1))
allocate(iwork(1))
! Query optimal workspace size
print *, "Calling DGELSD to get workspace size"
call DGELSD(m, n, nrhs, lhs, m, rhs, m, s, rcond, rank, work, lwork, iwork, info)
if (info /= 0) then
print *, "Error during workspace query: DGELSD failed with info =", info
stop
end if
! Allocate optimal workspace
lwork = int(work(1))
liwork = iwork(1)
deallocate(work, iwork)
allocate(work(lwork), iwork(liwork))
! Solve the least-squares problem
print *, "Calling DGELSD to solve the least-squares problem"
call DGELSD(m, n, nrhs, lhs, m, rhs, m, s, rcond, rank, work, lwork, iwork, info)
if (info /= 0) then
print *, "Error: DGELSD failed with info =", info
stop
end if
! Write the result to res.bin
print *, "Writing res.bin"
open(unit=12, file="res.bin", form="unformatted", access="stream", status="replace")
write(12) rhs(1:n, 1) ! Write only the solution vector
close(12)
! Deallocate workspace
deallocate(work, iwork)
print *, "Solution written to res.bin"
end program solve_least_squares
When the problem occurs, it produces the following error:
Error: DGELSD failed with info = 1
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
... which indicates that the SVD failed to converge.
I can reproduce the problem with OpenBLAS versions v0.3.18 through v0.3.30. I bisected it down to this commit:
47ba85f314808476c8254779389607f9af60231f is the first bad commit
commit 47ba85f314808476c8254779389607f9af60231f
Author: Martin Kroeker <martin@ruby.chemie.uni-freiburg.de>
Date: Thu Jul 22 17:24:15 2021 +0200
Fix regex to match kernels suffixed with cpuname too
cmake/utils.cmake | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
... which had the effect of enabling FMA instructions for many files.
This problem seems to be highly sensitive to floating point rounding errors. Even tiny changes to the input matrix make the problem go away, as does setting OMP_NUM_THREADS to anything other than 1. I don't know whether the introduction of FMA instructions actually caused this problem, or merely changed the rounding errors slightly such that my example matrix no longer works to trigger the problem.
I have compiled OpenBLAS with make TARGET=HASWELL
on Red Hat Enterprise Linux release 8.10 (Ootpa). The pre-compiled OpenBLAS library included with Scipy shares the same issue. The corresponding Scipy issue can be found here: scipy/scipy#23220