Skip to content

DGELSD fails to converge for some matrices #5333

Closed
@MaartenBaert

Description

@MaartenBaert

DGELSD fails to converge in some rare cases. I have attached a set of matrices (as binary files) that reproduce the problem. The LHS matrix is 1602x1602, the RHS matrix is 1602x1. I am only able to reliably reproduce the problem with OMP_NUM_THREADS=1.

matrices.zip

Here's a minimal Fortran program that reproduces the problem:

program solve_least_squares
    implicit none
    integer, parameter :: m = 1602, n = 1602, nrhs = 1
    double precision, dimension(m, n) :: lhs
    double precision, dimension(m, nrhs) :: rhs
    double precision, dimension(min(m, n)) :: s
    double precision :: rcond
    integer :: rank, info, lwork, liwork
    double precision, allocatable :: work(:)
    integer, allocatable :: iwork(:)

    ! Declare the LAPACK routine
    interface
        subroutine DGELSD(m, n, nrhs, a, lda, b, ldb, s, rcond, rank, work, lwork, iwork, info)
            integer, intent(in) :: m, n, nrhs, lda, ldb, lwork
            double precision, intent(inout) :: a(lda, *), b(ldb, *), s(*), work(*)
            double precision, intent(in) :: rcond
            integer, intent(out) :: rank, iwork(*), info
        end subroutine DGELSD
    end interface

    ! Use machine precision as the default
    rcond = -1.0d0

    ! Open and read lhs.bin
    print *, "Reading lhs.bin"
    open(unit=10, file="lhs.bin", form="unformatted", access="stream", status="old")
    read(10) lhs
    close(10)

    ! Open and read rhs.bin
    print *, "Reading rhs.bin"
    open(unit=11, file="rhs.bin", form="unformatted", access="stream", status="old")
    read(11) rhs
    close(11)

    ! Workspace query
    print *, "Querying workspace"
    lwork = -1
    allocate(work(1))
    allocate(iwork(1))

    ! Query optimal workspace size
    print *, "Calling DGELSD to get workspace size"
    call DGELSD(m, n, nrhs, lhs, m, rhs, m, s, rcond, rank, work, lwork, iwork, info)
    if (info /= 0) then
        print *, "Error during workspace query: DGELSD failed with info =", info
        stop
    end if

    ! Allocate optimal workspace
    lwork = int(work(1))
    liwork = iwork(1)
    deallocate(work, iwork)
    allocate(work(lwork), iwork(liwork))

    ! Solve the least-squares problem
    print *, "Calling DGELSD to solve the least-squares problem"
    call DGELSD(m, n, nrhs, lhs, m, rhs, m, s, rcond, rank, work, lwork, iwork, info)
    if (info /= 0) then
        print *, "Error: DGELSD failed with info =", info
        stop
    end if

    ! Write the result to res.bin
    print *, "Writing res.bin"
    open(unit=12, file="res.bin", form="unformatted", access="stream", status="replace")
    write(12) rhs(1:n, 1)  ! Write only the solution vector
    close(12)

    ! Deallocate workspace
    deallocate(work, iwork)

    print *, "Solution written to res.bin"
end program solve_least_squares

When the problem occurs, it produces the following error:

 Error: DGELSD failed with info =           1
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL

... which indicates that the SVD failed to converge.

I can reproduce the problem with OpenBLAS versions v0.3.18 through v0.3.30. I bisected it down to this commit:

47ba85f314808476c8254779389607f9af60231f is the first bad commit
commit 47ba85f314808476c8254779389607f9af60231f
Author: Martin Kroeker <martin@ruby.chemie.uni-freiburg.de>
Date:   Thu Jul 22 17:24:15 2021 +0200

    Fix regex to match kernels suffixed with cpuname too

 cmake/utils.cmake | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

... which had the effect of enabling FMA instructions for many files.

This problem seems to be highly sensitive to floating point rounding errors. Even tiny changes to the input matrix make the problem go away, as does setting OMP_NUM_THREADS to anything other than 1. I don't know whether the introduction of FMA instructions actually caused this problem, or merely changed the rounding errors slightly such that my example matrix no longer works to trigger the problem.

I have compiled OpenBLAS with make TARGET=HASWELL on Red Hat Enterprise Linux release 8.10 (Ootpa). The pre-compiled OpenBLAS library included with Scipy shares the same issue. The corresponding Scipy issue can be found here: scipy/scipy#23220

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions