GPU solver doesn't work on cluster with A100 GPU #312

Hi, I am trying to solve a large-scale SDP with SCS, but it converges too slowly on the CPU, so I want to use the GPU version for a speedup. I first tested the GPU version on a small problem on my laptop (Windows 11, RTX 4080 GPU), where it works perfectly:
```text
------------------------------------------------------------------
	       SCS v3.2.5 - Splitting Conic Solver
	(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem:  variables n: 4875, constraints m: 594271
cones: 	  z: primal zero / dual free vars: 31
	  l: linear vars: 30
	  q: soc vars: 0, qsize: 1
	  s: psd vars: 594210, ssize: 182
settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07
	  alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
	  max_iters: 100000, normalize: 1, rho_x: 1.00e-06
	  acceleration_lookback: 10, acceleration_interval: 10
lin-sys:  sparse-indirect GPU
	  nnz(A): 138701, nnz(P): 0
------------------------------------------------------------------
 iter | pri res | dua res |   gap   |   obj   |  scale  | time (s)
------------------------------------------------------------------
     0| 7.11e+00  6.71e+01  4.04e+02 -3.49e+02  1.00e-01  2.06e+00 
   250| 7.03e-04  5.26e-04  1.46e-03  4.16e-02  3.14e-02  1.11e+02 
   475| 5.59e-05  2.10e-04  7.97e-05  4.22e-02  3.14e-02  1.90e+02 
------------------------------------------------------------------
status:  solved
timings: total: 1.90e+02s = setup: 2.54e-01s + solve: 1.90e+02s
	 lin-sys: 1.26e+02s, cones: 5.84e+01s, accel: 2.88e-01s
------------------------------------------------------------------
objective = 0.042202
------------------------------------------------------------------
```

This run calls SCS via YALMIP in MATLAB R2023b. The CUDA version is 12.9. Inside MATLAB, `gpuDevice` shows:
```text
>> gpuDevice

ans = 

  CUDADevice with properties:

                      Name: 'NVIDIA GeForce RTX 4080 Laptop GPU'
                     Index: 1
         ComputeCapability: '8.9'
            SupportsDouble: 1
     GraphicsDriverVersion: '576.02'
               DriverModel: 'WDDM'
            ToolkitVersion: 11.8000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152 (49.15 KB)
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 12878086144 (12.88 GB)
           AvailableMemory: 11573702656 (11.57 GB)
               CachePolicy: 'balanced'
       MultiprocessorCount: 58
              ClockRateKHz: 1665000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1
```
However, when I tried the same procedure on a cluster with an A100 GPU, the solver did not even run: it printed only "-------------". The only change I made was replacing the path to the CUDA folder in compile_gpu.m (from scs-matlab) with the corresponding path on the cluster.

System information:
```text
$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.9 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.9 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.9"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.9"
```
On the cluster I am using MATLAB R2022b, where `gpuDevice` shows:
```text
>> gpuDevice

ans = 

  CUDADevice with properties:

                      Name: 'NVIDIA A100-SXM4-40GB MIG 3g.20gb'
                     Index: 1
         ComputeCapability: '8.0'
            SupportsDouble: 1
             DriverVersion: 12.5000
            ToolkitVersion: 11.2000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152 (49.15 KB)
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 21072183296 (21.07 GB)
           AvailableMemory: 20629553152 (20.63 GB)
       MultiprocessorCount: 42
              ClockRateKHz: 1410000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1
```
I have the following modules loaded on the cluster:
```text
module list

Currently Loaded Modules:
  1) gmp/6.3.0-fasrc01   2) mpfr/4.2.1-fasrc01   3) mpc/1.3.1-fasrc02   4) cuda/12.4.1-fasrc01   5) gcc/14.2.0-fasrc01
```
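
To put the version numbers quoted above side by side, here is a small Python sketch (my own helper, not part of SCS, YALMIP, or MATLAB). The numbers are copied by hand from the outputs in this post. It assumes that `DriverVersion: 12.5000` on the cluster is the highest CUDA version the driver supports, and it uses the general rule that a driver can run code built with a toolkit no newer than itself. The names `REPORTED`, `parse_version`, `driver_supports`, and `same_major` are hypothetical.

```python
# Versions copied by hand from the gpuDevice / module outputs in this post.
# "system_cuda" is the CUDA install the GPU mex was compiled against.
REPORTED = {
    "laptop": {"matlab_toolkit": "11.8", "system_cuda": "12.9"},
    "cluster": {
        "matlab_toolkit": "11.2",
        "system_cuda": "12.4.1",
        "driver_cuda": "12.5",  # assumption: MATLAB's DriverVersion field
    },
}


def parse_version(text):
    """Turn '12.4.1' into (12, 4, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver_cuda, toolkit):
    """True if the driver's CUDA version (major.minor) is >= the toolkit's."""
    return parse_version(driver_cuda)[:2] >= parse_version(toolkit)[:2]


def same_major(a, b):
    """True if two CUDA versions share the same major version."""
    return parse_version(a)[0] == parse_version(b)[0]


cluster = REPORTED["cluster"]
for name in ("matlab_toolkit", "system_cuda"):
    ok = driver_supports(cluster["driver_cuda"], cluster[name])
    print(f"cluster driver {cluster['driver_cuda']} covers {name} {cluster[name]}: {ok}")

for machine, versions in REPORTED.items():
    match = same_major(versions["matlab_toolkit"], versions["system_cuda"])
    print(f"{machine}: MATLAB toolkit and system CUDA share a major version: {match}")
```

By this check the cluster driver is new enough for both toolkits, and both machines mix a CUDA 11.x MATLAB toolkit with a CUDA 12.x system install. So the 11.x/12.x mismatch by itself does not seem to explain why the laptop works and the cluster does not.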

I have already tried multi-core CPU runs, which take far too long to converge, so GPU acceleration may be my only option.
