GPU solver doesn't work on cluster with A100 GPU #312

@qianggao-lab

Hi, I am trying to solve a large-scale SDP with SCS, but it converges too slowly on the CPU, so I want to use the GPU version for a speedup. I first tested the GPU version on a small problem on my laptop (Windows 11, RTX 4080 GPU), where it works perfectly:

------------------------------------------------------------------
	       SCS v3.2.5 - Splitting Conic Solver
	(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem:  variables n: 4875, constraints m: 594271
cones: 	  z: primal zero / dual free vars: 31
	  l: linear vars: 30
	  q: soc vars: 0, qsize: 1
	  s: psd vars: 594210, ssize: 182
settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07
	  alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
	  max_iters: 100000, normalize: 1, rho_x: 1.00e-06
	  acceleration_lookback: 10, acceleration_interval: 10
lin-sys:  sparse-indirect GPU
	  nnz(A): 138701, nnz(P): 0
------------------------------------------------------------------
 iter | pri res | dua res |   gap   |   obj   |  scale  | time (s)
------------------------------------------------------------------
     0| 7.11e+00  6.71e+01  4.04e+02 -3.49e+02  1.00e-01  2.06e+00 
   250| 7.03e-04  5.26e-04  1.46e-03  4.16e-02  3.14e-02  1.11e+02 
   475| 5.59e-05  2.10e-04  7.97e-05  4.22e-02  3.14e-02  1.90e+02 
------------------------------------------------------------------
status:  solved
timings: total: 1.90e+02s = setup: 2.54e-01s + solve: 1.90e+02s
	 lin-sys: 1.26e+02s, cones: 5.84e+01s, accel: 2.88e-01s
------------------------------------------------------------------
objective = 0.042202
------------------------------------------------------------------

This is SCS called via YALMIP in MATLAB R2023b. The CUDA version is 12.9. Inside MATLAB, gpuDevice shows:

>> gpuDevice

ans = 

  CUDADevice with properties:

                      Name: 'NVIDIA GeForce RTX 4080 Laptop GPU'
                     Index: 1
         ComputeCapability: '8.9'
            SupportsDouble: 1
     GraphicsDriverVersion: '576.02'
               DriverModel: 'WDDM'
            ToolkitVersion: 11.8000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152 (49.15 KB)
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 12878086144 (12.88 GB)
           AvailableMemory: 11573702656 (11.57 GB)
               CachePolicy: 'balanced'
       MultiprocessorCount: 58
              ClockRateKHz: 1665000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1

However, when I tried the same procedure on a cluster with an A100 GPU, the solver didn't even run (it just showed "-------------"). The only change I made was replacing the path to the CUDA folder in compile_gpu.m (from scs-matlab) with the corresponding one on the cluster.
System information:

$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.9 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.9 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.9"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.9"

On the cluster I am using MATLAB R2022b, and gpuDevice shows:

>> gpuDevice

ans = 

  CUDADevice with properties:

                      Name: 'NVIDIA A100-SXM4-40GB MIG 3g.20gb'
                     Index: 1
         ComputeCapability: '8.0'
            SupportsDouble: 1
             DriverVersion: 12.5000
            ToolkitVersion: 11.2000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152 (49.15 KB)
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 21072183296 (21.07 GB)
           AvailableMemory: 20629553152 (20.63 GB)
       MultiprocessorCount: 42
              ClockRateKHz: 1410000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1

I have the following modules loaded on the cluster:

$ module list

Currently Loaded Modules:
  1) gmp/6.3.0-fasrc01   2) mpfr/4.2.1-fasrc01   3) mpc/1.3.1-fasrc02   4) cuda/12.4.1-fasrc01   5) gcc/14.2.0-fasrc01
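
As a quick sanity check on the cluster's toolkit/driver pairing, here is a minimal sketch that compares the two versions. It assumes that the DriverVersion 12.5 reported by gpuDevice is the highest CUDA version the driver supports, and that a build against a toolkit no newer than that should be loadable. `ver_le` is a hypothetical helper written for this check (it relies on GNU `sort -V`), and the version strings are copied by hand from the outputs above.

```shell
#!/usr/bin/env bash
# ver_le A B: succeed when dotted version A <= B.
# Hypothetical helper; relies on GNU `sort -V` (version sort).
ver_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

toolkit="12.4.1"   # from the loaded module cuda/12.4.1-fasrc01
driver="12.5"      # DriverVersion reported by gpuDevice on the cluster (assumed meaning: max CUDA supported)

if ver_le "$toolkit" "$driver"; then
  echo "toolkit $toolkit <= driver $driver: pairing looks fine"
else
  echo "toolkit $toolkit is newer than driver $driver: possible mismatch"
fi
```

If this reports that the pairing looks fine, a toolkit that is too new for the driver is unlikely to be the cause, though it does not rule out other build or runtime problems.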

I already tried running on multi-core CPUs, which takes far too long to converge. GPU acceleration might be the only way forward for me.
