OpenBLAS or BLIS instead of ATLAS? #15

Closed
geerlingguy opened this issue Sep 6, 2023 · 6 comments
@geerlingguy
Owner

After doing some more testing with Ampere's recommended HPL setup (with an Ampere-optimized BLIS library), I would like to investigate switching away from ATLAS.

The primary motivation is build speed. I've noticed some machines can compile in an hour or two, but others take 2-3 days (especially slower systems like the Raspberry Pi 4...).

That's not especially fun, but in the past I've stuck with this method on the assumption that compiling ATLAS on each machine tunes it best for that specific processor. Supposedly. (Who understands all this math that well anyway?)

I would like to compare other options like OpenBLAS or BLIS to see:

  1. Whether they can be used easily as a drop-in replacement for ATLAS
  2. Whether the performance results change too drastically (which would make me want to re-run HPL on all my machines, heh)
@geerlingguy
Owner Author

geerlingguy commented Sep 6, 2023

Forum user arif-ali used OpenBLAS and got 13 Gflops on the Pi 4, which beats my result of 11-ish with ATLAS: https://forums.raspberrypi.com/viewtopic.php?p=1674010#p1674010

Scripts available here: https://github.com/arif-ali/raspberrypi-hpl/tree/master/scripts

@geerlingguy
Owner Author

More detail from aa3025: https://www.hydromag.eu/~aa3025/rpi/

@geerlingguy
Owner Author

The ATLAS compile was taking absolutely forever on the Orange Pi 5 with its big.LITTLE RK3588S (I confirmed there was no throttling... not sure why it got hung up so long!), so I decided to start working on this.

With OpenBLAS:

TASK [Output the results.] *********************************************************************************************
ok: [10.0.100.237] => 
  mpirun_output.stdout: |-
    ================================================================================
    HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
    Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
    Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
    Modified by Julien Langou, University of Colorado Denver
    ================================================================================
  
    An explanation of the input/output parameters follows:
    T/V    : Wall time / encoded variant.
    N      : The order of the coefficient matrix A.
    NB     : The partitioning blocking factor.
    P      : The number of process rows.
    Q      : The number of process columns.
    Time   : Time in seconds to solve the linear system.
    Gflops : Rate of execution for solving the linear system.
  
    The following parameter values will be used:
  
    N      :   14745
    NB     :     256
    PMAP   : Row-major process mapping
    P      :       1
    Q      :       4
    PFACT  :   Right
    NBMIN  :       4
    NDIV   :       2
    RFACT  :   Crout
    BCAST  :  1ringM
    DEPTH  :       1
    SWAP   : Mix (threshold = 64)
    L1     : transposed form
    U      : transposed form
    EQUIL  : yes
    ALIGN  : 8 double precision words
  
    --------------------------------------------------------------------------------
  
    - The matrix A is randomly generated for each test.
    - The following scaled residual check will be computed:
          ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
    - The relative machine precision (eps) is taken to be               1.110223e-16
    - Computational tests pass if scaled residuals are less than                16.0
  
    ================================================================================
    T/V                N    NB     P     Q               Time                 Gflops
    --------------------------------------------------------------------------------
    WR11C2R4       14745   256     1     4             211.86             1.0089e+01
    HPL_pdgesv() start time Thu Sep  7 10:54:22 2023
  
    HPL_pdgesv() end time   Thu Sep  7 10:57:54 2023
  
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   4.45732385e-03 ...... PASSED
    ================================================================================
  
    Finished      1 tests with the following results:
                  1 tests completed and passed residual checks,
                  0 tests completed and failed residual checks,
                  0 tests skipped because of illegal input values.
    --------------------------------------------------------------------------------
  
    End of Tests.
    ================================================================================

That seems low, so I'm going to re-run my OpenBLAS setup on a Pi 4 model B to see how it compares to the ATLAS library.
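As a sanity check on the reported figure, here is a minimal sketch that recomputes the rate from `N` and `Time`, using the standard HPL operation count of 2/3·N³ + 3/2·N². The function name `hpl_gflops` is mine, not something from HPL or this repo.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Return the HPL rate in Gflops for an order-n solve that took `seconds`."""
    # Operation count HPL credits for solving one dense N x N system.
    flops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return flops / seconds / 1e9


# N and Time taken from the Orange Pi 5 OpenBLAS run above.
print(f"{hpl_gflops(14745, 211.86):.3f} Gflops")
```

The result lines up with the `1.0089e+01` in the log, so the low number comes from the long run time and not from a reporting glitch.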

Note that OpenBLAS seems to support the A72/A73 but doesn't have any optimizations for the A76. It seems to be using the A55 optimizations instead; since the RK3588S also has LITTLE cores, maybe it's only picking those up...
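To see which core types a board actually exposes, one option is to count the `CPU part` values in `/proc/cpuinfo`. This is only a sketch: the helper name and the sample text are hypothetical, and it says nothing about how OpenBLAS itself picks a kernel. The part IDs in the table are the Arm-assigned ones for these cores.

```python
from collections import Counter

# Arm "CPU part" IDs as they appear in /proc/cpuinfo (implementer 0x41).
ARM_CORE_NAMES = {
    "0xd05": "Cortex-A55",
    "0xd08": "Cortex-A72",
    "0xd09": "Cortex-A73",
    "0xd0b": "Cortex-A76",
}


def count_core_types(cpuinfo_text: str) -> Counter:
    """Count cores per microarchitecture from /proc/cpuinfo-style text."""
    counts = Counter()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "CPU part":
            part = value.strip().lower()
            counts[ARM_CORE_NAMES.get(part, f"unknown ({part})")] += 1
    return counts


# Hypothetical excerpt shaped like a 4x A55 + 4x A76 SoC; not copied from a real board.
SAMPLE = "\n".join(
    f"processor\t: {i}\nCPU part\t: {'0xd05' if i < 4 else '0xd0b'}\n"
    for i in range(8)
)

print(count_core_types(SAMPLE))
```

On a real board you would pass in `open("/proc/cpuinfo").read()` instead of `SAMPLE`.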

@geerlingguy
Owner Author

geerlingguy commented Sep 7, 2023

On the Pi 4, I'm seeing 11.679 Gflops using OpenBLAS (compared to 11.774 Gflops for ATLAS):

ok: [10.0.100.156] => 
  mpirun_output.stdout: |-
    ================================================================================
    HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
    Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
    Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
    Modified by Julien Langou, University of Colorado Denver
    ================================================================================
  
    An explanation of the input/output parameters follows:
    T/V    : Wall time / encoded variant.
    N      : The order of the coefficient matrix A.
    NB     : The partitioning blocking factor.
    P      : The number of process rows.
    Q      : The number of process columns.
    Time   : Time in seconds to solve the linear system.
    Gflops : Rate of execution for solving the linear system.
  
    The following parameter values will be used:
  
    N      :   23314
    NB     :     256
    PMAP   : Row-major process mapping
    P      :       1
    Q      :       4
    PFACT  :   Right
    NBMIN  :       4
    NDIV   :       2
    RFACT  :   Crout
    BCAST  :  1ringM
    DEPTH  :       1
    SWAP   : Mix (threshold = 64)
    L1     : transposed form
    U      : transposed form
    EQUIL  : yes
    ALIGN  : 8 double precision words
  
    --------------------------------------------------------------------------------
  
    - The matrix A is randomly generated for each test.
    - The following scaled residual check will be computed:
          ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
    - The relative machine precision (eps) is taken to be               1.110223e-16
    - Computational tests pass if scaled residuals are less than                16.0
  
    ================================================================================
    T/V                N    NB     P     Q               Time                 Gflops
    --------------------------------------------------------------------------------
    WR11C2R4       23314   256     1     4             723.40             1.1679e+01
    HPL_pdgesv() start time Thu Sep  7 12:47:15 2023
  
    HPL_pdgesv() end time   Thu Sep  7 12:59:19 2023
  
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.68158083e-03 ...... PASSED
    ================================================================================
  
    Finished      1 tests with the following results:
                  1 tests completed and passed residual checks,
                  0 tests completed and failed residual checks,
                  0 tests skipped because of illegal input values.
    --------------------------------------------------------------------------------
  
    End of Tests.
    ================================================================================

(Note that OpenBLAS is tuned for the A72, which is what the Pi 4 uses...)

@geerlingguy
Owner Author

Using BLIS... got 11.889 Gflops, nice!

ok: [10.0.100.156] => 
  mpirun_output.stdout: |-
    ================================================================================
    HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
    Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
    Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
    Modified by Julien Langou, University of Colorado Denver
    ================================================================================
  
    An explanation of the input/output parameters follows:
    T/V    : Wall time / encoded variant.
    N      : The order of the coefficient matrix A.
    NB     : The partitioning blocking factor.
    P      : The number of process rows.
    Q      : The number of process columns.
    Time   : Time in seconds to solve the linear system.
    Gflops : Rate of execution for solving the linear system.
  
    The following parameter values will be used:
  
    N      :   23314
    NB     :     256
    PMAP   : Row-major process mapping
    P      :       1
    Q      :       4
    PFACT  :   Right
    NBMIN  :       4
    NDIV   :       2
    RFACT  :   Crout
    BCAST  :  1ringM
    DEPTH  :       1
    SWAP   : Mix (threshold = 64)
    L1     : transposed form
    U      : transposed form
    EQUIL  : yes
    ALIGN  : 8 double precision words
  
    --------------------------------------------------------------------------------
  
    - The matrix A is randomly generated for each test.
    - The following scaled residual check will be computed:
          ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
    - The relative machine precision (eps) is taken to be               1.110223e-16
    - Computational tests pass if scaled residuals are less than                16.0
  
    ================================================================================
    T/V                N    NB     P     Q               Time                 Gflops
    --------------------------------------------------------------------------------
    WR11C2R4       23314   256     1     4             710.67             1.1889e+01
    HPL_pdgesv() start time Thu Sep  7 14:02:20 2023
  
    HPL_pdgesv() end time   Thu Sep  7 14:14:11 2023
  
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
    ================================================================================
  
    Finished      1 tests with the following results:
                  1 tests completed and passed residual checks,
                  0 tests completed and failed residual checks,
                  0 tests skipped because of illegal input values.
    --------------------------------------------------------------------------------
  
    End of Tests.
    ================================================================================
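For a quick side-by-side of the three Pi 4 numbers quoted in this thread, here is a small sketch that expresses each library's result as a percentage relative to ATLAS. The helper name is hypothetical; the Gflops values are the ones reported above.

```python
# Pi 4 results quoted in this thread, in Gflops.
PI4_GFLOPS = {"ATLAS": 11.774, "OpenBLAS": 11.679, "BLIS": 11.889}


def percent_vs_baseline(results: dict, baseline: str) -> dict:
    """Return each result as a percent difference from the baseline entry."""
    base = results[baseline]
    return {name: (value - base) / base * 100.0 for name, value in results.items()}


for name, delta in percent_vs_baseline(PI4_GFLOPS, "ATLAS").items():
    print(f"{name:9s} {delta:+.2f}%")
```

All three land within about 1% of each other on this board, which is small next to the difference in build time.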

@geerlingguy
Owner Author

On the Orange Pi 5, using BLIS:

  mpirun_output.stdout: |-
    ================================================================================
    HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
    Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
    Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
    Modified by Julien Langou, University of Colorado Denver
    ================================================================================
  
    An explanation of the input/output parameters follows:
    T/V    : Wall time / encoded variant.
    N      : The order of the coefficient matrix A.
    NB     : The partitioning blocking factor.
    P      : The number of process rows.
    Q      : The number of process columns.
    Time   : Time in seconds to solve the linear system.
    Gflops : Rate of execution for solving the linear system.
  
    The following parameter values will be used:
  
    N      :   14745
    NB     :     256
    PMAP   : Row-major process mapping
    P      :       1
    Q      :       4
    PFACT  :   Right
    NBMIN  :       4
    NDIV   :       2
    RFACT  :   Crout
    BCAST  :  1ringM
    DEPTH  :       1
    SWAP   : Mix (threshold = 64)
    L1     : transposed form
    U      : transposed form
    EQUIL  : yes
    ALIGN  : 8 double precision words
  
    --------------------------------------------------------------------------------
  
    - The matrix A is randomly generated for each test.
    - The following scaled residual check will be computed:
          ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
    - The relative machine precision (eps) is taken to be               1.110223e-16
    - Computational tests pass if scaled residuals are less than                16.0
  
    ================================================================================
    T/V                N    NB     P     Q               Time                 Gflops
    --------------------------------------------------------------------------------
    WR11C2R4       14745   256     1     4              39.94             5.3517e+01
    HPL_pdgesv() start time Thu Sep  7 14:23:28 2023
  
    HPL_pdgesv() end time   Thu Sep  7 14:24:08 2023
  
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   4.07780599e-03 ...... PASSED
    ================================================================================
  
    Finished      1 tests with the following results:
                  1 tests completed and passed residual checks,
                  0 tests completed and failed residual checks,
                  0 tests skipped because of illegal input values.
    --------------------------------------------------------------------------------
  
    End of Tests.
    ================================================================================
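To compare runs without eyeballing the logs, here is a rough sketch that pulls the result row out of HPL output with a regular expression and computes the ratio between the two Orange Pi 5 runs. The regex and function name are my own assumptions about the row layout shown above, not part of HPL.

```python
import re

# Matches an HPL result row: T/V, N, NB, P, Q, Time, Gflops.
RESULT_ROW = re.compile(
    r"^\s*(W\w+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+e[+-]\d+)\s*$",
    re.MULTILINE,
)


def parse_hpl_results(output: str) -> list:
    """Return one dict per result row found in HPL stdout."""
    rows = []
    for tv, n, nb, p, q, secs, gflops in RESULT_ROW.findall(output):
        rows.append({
            "tv": tv, "n": int(n), "nb": int(nb), "p": int(p), "q": int(q),
            "time": float(secs), "gflops": float(gflops),
        })
    return rows


# Result rows copied from the two Orange Pi 5 runs in this thread.
OPENBLAS_ROW = "    WR11C2R4       14745   256     1     4             211.86             1.0089e+01"
BLIS_ROW = "    WR11C2R4       14745   256     1     4              39.94             5.3517e+01"

openblas = parse_hpl_results(OPENBLAS_ROW)[0]
blis = parse_hpl_results(BLIS_ROW)[0]
print(f"BLIS / OpenBLAS: {blis['gflops'] / openblas['gflops']:.2f}x")
```

With the same `N`, `NB`, `P`, and `Q`, the BLIS run is a bit over five times faster than the OpenBLAS run on this board.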
