Description
Summary
CUDA checkpoint/restore fails for GPU processes running inside gVisor (nvproxy).
Host-level CUDA checkpoint works fine on the same GPU/driver.
Environment
- OS: Ubuntu 24.04 (GCP)
- Kernel: 6.14.0-1021-gcp
- GPU: NVIDIA L4
- Driver: 580.105.08
- gVisor/runsc: release-20260126.0
- Runtime: runsc (gVisor)
- CUDA toolkit: 12.0 (nvcc 12.0.140)
- CUDA checkpoint tool: NVIDIA cuda-checkpoint (version 580.105.08)
- Driver 580.105.08 is listed by runsc nvproxy list-supported-drivers
Expected
CUDA checkpoint tool can initialize and checkpoint the GPU process running in gVisor.
Actual
- In-container wrapper:
[cuda-checkpoint-wrapper] ERROR: cuCheckpointProcessLock for PID failed: CUDA_ERROR_OPERATING_SYSTEM
[cuda-checkpoint-wrapper] ERROR: GPU checkpoint save failed
- Host cuda-checkpoint targeting runsc-sandbox PID:
Error getting process state for process ID : "initialization error"
Minimal logs
Container GPU is healthy:
$ nvidia-smi -L
GPU 0: NVIDIA L4 (UUID: GPU-...redacted...)
$ python3 - <<'PY'
import torch
print("cuda_available", torch.cuda.is_available())
print("device", torch.cuda.get_device_name(0))
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.synchronize()
print("ok", y[0,0].item())
PY
cuda_available True
device NVIDIA L4
ok
Hypervisor log around snapshot:
[cuda-checkpoint-wrapper] Starting GPU save (checkpoint)
[cuda-checkpoint-wrapper] Found 1 GPU process(es)
[cuda-checkpoint-wrapper] Checkpointing 1 process(es)...
[cuda-checkpoint-wrapper] ERROR: cuCheckpointProcessLock for PID 1091866 failed: CUDA_ERROR_OPERATING_SYSTEM (OS call failed or operation not supported on this OS)
[cuda-checkpoint-wrapper] ERROR: Failed to process 1/1 CUDA process(es)
[cuda-checkpoint-wrapper] ERROR: GPU checkpoint save failed
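For context, the wrapper's discovery step ("Found 1 GPU process(es)") can be approximated by scanning /proc for processes holding /dev/nvidia* device nodes open. This is a hypothetical sketch (the wrapper's source is not included in this issue), and the helper name is made up:

```python
import glob
import os


def find_gpu_pids():
    """Return PIDs of processes with an open /dev/nvidia* descriptor.

    Hypothetical approximation of the wrapper's GPU-process discovery;
    requires Linux and returns [] if nothing matches or /proc is unreadable.
    """
    pids = []
    for fd_dir in glob.glob("/proc/[0-9]*/fd"):
        pid = int(fd_dir.split("/")[2])
        try:
            for fd in os.listdir(fd_dir):
                try:
                    target = os.readlink(os.path.join(fd_dir, fd))
                except OSError:
                    continue  # fd closed between listdir and readlink
                if target.startswith("/dev/nvidia"):
                    pids.append(pid)
                    break
        except OSError:
            continue  # process exited or fd dir not readable
    return pids


print(find_gpu_pids())
```

Inside the gVisor sandbox this scan sees sentry-virtualized PIDs, which is one reason a host-side tool targeting the runsc-sandbox PID and an in-container wrapper can disagree about which process to lock.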
Notes
- Host CUDA checkpoint works on native CUDA processes with the same driver/tool.
- Failure happens only for GPU processes inside gVisor/nvproxy (runsc-sandbox).
Question
Is CUDA checkpoint/restore supported for GPU processes running under nvproxy?
If not, is there any planned support or documented limitation?
Steps to reproduce
- Run a CUDA process inside gVisor (e.g. PyTorch CUDA).
- In the container, confirm GPU works:
nvidia-smi -L
python3 - <<'PY'
import torch
print(torch.cuda.is_available())
x = torch.randn(1024,1024,device="cuda")
y = x @ x
torch.cuda.synchronize()
print(y[0,0].item())
PY
- Try GPU checkpoint inside the container:
GVISOR_SAVE_RESTORE_AUTO_EXEC_MODE=save /etc/.test/cuda-snapshot
- Or from host using NVIDIA cuda-checkpoint on the runsc-sandbox PID:
sudo /tmp/nv_cuda_checkpoint --get-state --pid
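The failing call can also be driven directly through the CUDA driver library, which isolates cuCheckpointProcessLock from the wrapper. This is a hedged sketch: it assumes a driver new enough (r570+) to export the checkpoint symbols, and passing NULL for the lock-args struct is an assumption, not documented behavior. All paths degrade to a string so it is safe to run on a GPU-less machine:

```python
import ctypes
import ctypes.util
import os


def checkpoint_lock(pid: int) -> str:
    """Sketch: call cuCheckpointProcessLock on a PID via libcuda (ctypes).

    Assumptions: driver >= r570 exports the checkpoint API; NULL is
    accepted for the CUcheckpointLockArgs pointer (unverified).
    """
    path = ctypes.util.find_library("cuda")
    if path is None:
        return "libcuda not found"
    cuda = ctypes.CDLL(path)
    if not hasattr(cuda, "cuCheckpointProcessLock"):
        return "driver too old: no cuCheckpointProcessLock symbol"
    rc = cuda.cuInit(0)
    if rc != 0:
        return f"cuInit failed: CUresult {rc}"
    rc = cuda.cuCheckpointProcessLock(ctypes.c_int(pid), None)
    # CUresult 304 is CUDA_ERROR_OPERATING_SYSTEM, matching the wrapper log.
    return f"cuCheckpointProcessLock: CUresult {rc}"


print(checkpoint_lock(os.getpid()))
```

Running this inside the sandbox against the in-container PID, and on the host against the runsc-sandbox PID, would show whether the CUDA_ERROR_OPERATING_SYSTEM comes from nvproxy filtering the checkpoint ioctls or from the PID mismatch.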
runsc version
runsc version release-20260126.0
spec: 1.1.0-rc.1

docker version (if using docker)
uname
Linux
kubectl (if using Kubernetes)
repo state (if built from source)
No response
runsc debug logs (if available)
{"msg":"checkpoint failed: checkpointing container \"task-uuid0\": failed to exec save/restore binary: /etc/.test/cuda-snapshot exited with non-zero status 256","level":"error"}
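One small decoding note: the "non-zero status 256" in the runsc log is a raw wait(2) status, not the exit code itself; the exit code lives in the high byte, so it decodes to a plain exit(1) from the wrapper:

```python
import os

# wait(2) packs the exit code into bits 8-15 of the status word,
# so a raw status of 256 corresponds to exit code 1.
status = 256
print(os.WEXITSTATUS(status))  # 1
```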