
CUDA checkpoint fails inside gVisor sandbox (nvproxy) with CUDA_ERROR_OPERATING_SYSTEM / initialization error #12600

@0xAWM

Description


Summary

CUDA checkpoint/restore fails for GPU processes running inside gVisor (nvproxy).
Host-level CUDA checkpoint works fine on the same GPU/driver.

Environment

  • OS: Ubuntu 24.04 (GCP)
  • Kernel: 6.14.0-1021-gcp
  • GPU: NVIDIA L4
  • Driver: 580.105.08
  • gVisor/runsc: release-20260126.0
  • Runtime: runsc (gVisor)
  • CUDA toolkit: 12.0 (nvcc 12.0.140)
  • CUDA checkpoint tool: NVIDIA cuda-checkpoint (version 580.105.08; driver is listed as supported by runsc nvproxy list-supported-drivers)

Expected

The CUDA checkpoint tool can initialize and checkpoint a GPU process running in gVisor.

Actual

  • In-container wrapper:

[cuda-checkpoint-wrapper] ERROR: cuCheckpointProcessLock for PID failed: CUDA_ERROR_OPERATING_SYSTEM
[cuda-checkpoint-wrapper] ERROR: GPU checkpoint save failed

  • Host cuda-checkpoint targeting the runsc-sandbox PID:

Error getting process state for process ID : "initialization error"

Minimal logs

Container GPU is healthy:

$ nvidia-smi -L
GPU 0: NVIDIA L4 (UUID: GPU-...redacted...)

$ python3 - <<'PY'
import torch
print("cuda_available", torch.cuda.is_available())
print("device", torch.cuda.get_device_name(0))
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.synchronize()
print("ok", y[0,0].item())
PY
cuda_available True
device NVIDIA L4
ok

Hypervisor log around snapshot:

[cuda-checkpoint-wrapper] Starting GPU save (checkpoint)
[cuda-checkpoint-wrapper] Found 1 GPU process(es)
[cuda-checkpoint-wrapper] Checkpointing 1 process(es)...
[cuda-checkpoint-wrapper] ERROR: cuCheckpointProcessLock for PID 1091866 failed: CUDA_ERROR_OPERATING_SYSTEM (OS call failed or operation not supported on this OS)
[cuda-checkpoint-wrapper] ERROR: Failed to process 1/1 CUDA process(es)
[cuda-checkpoint-wrapper] ERROR: GPU checkpoint save failed

Notes

  • Host CUDA checkpoint works on native CUDA processes with the same driver/tool.
  • Failure happens only for GPU processes inside gVisor/nvproxy (runsc-sandbox).

Question

Is CUDA checkpoint/restore supported for GPU processes running under nvproxy?
If not, is support planned, or is this a documented limitation?

Steps to reproduce

  1. Run a CUDA process inside gVisor (e.g. PyTorch CUDA).
  2. In the container, confirm GPU works:

nvidia-smi -L
python3 - <<'PY'
import torch
print(torch.cuda.is_available())
x = torch.randn(1024,1024,device="cuda")
y = x @ x
torch.cuda.synchronize()
print(y[0,0].item())
PY

  3. Try GPU checkpoint inside the container:

GVISOR_SAVE_RESTORE_AUTO_EXEC_MODE=save /etc/.test/cuda-snapshot

  4. Or from host using NVIDIA cuda-checkpoint on the runsc-sandbox PID:

sudo /tmp/nv_cuda_checkpoint --get-state --pid
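For reference, the host-side probe in step 4 can be scripted. The sketch below is a minimal wrapper, assuming the flag names of NVIDIA's cuda-checkpoint utility (--get-state --pid) and a hypothetical tool path; it returns None when the binary is absent rather than failing:

```python
import shutil
import subprocess

def get_checkpoint_state(pid, tool="cuda-checkpoint"):
    """Return the reported CUDA checkpoint state for `pid`,
    or None when the checkpoint binary is not on PATH.

    Flags assume NVIDIA's cuda-checkpoint utility
    (--get-state --pid <PID>); adjust tool/path for your install.
    """
    if shutil.which(tool) is None:
        return None  # tool not installed on this host
    result = subprocess.run(
        [tool, "--get-state", "--pid", str(pid)],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # e.g. "initialization error" when the target runs under nvproxy
        raise RuntimeError(result.stderr.strip() or result.stdout.strip())
    return result.stdout.strip()
```

Running this as root against the runsc-sandbox PID should reproduce the "initialization error" shown above.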

runsc version

runsc version release-20260126.0
spec: 1.1.0-rc.1

uname

Linux


runsc debug logs (if available)

{"msg":"checkpoint failed: checkpointing container \"task-uuid0\": failed to exec save/restore binary: /etc/.test/cuda-snapshot exited with non-zero status 256","level":"error"}

Labels: type: bug