Description
Summary
CUDA checkpoint/restore fails for GPU processes running inside gVisor (nvproxy).
Host-level CUDA checkpoint works fine on the same GPU/driver.
Environment
- OS: Ubuntu 24.04 (GCP)
- Kernel: 6.14.0-1021-gcp
- GPU: NVIDIA L4
- Driver: 580.105.08
- gVisor/runsc: release-20260126.0
- Runtime: runsc (gVisor)
- CUDA toolkit: 12.0 (nvcc 12.0.140)
- CUDA checkpoint tool: NVIDIA cuda-checkpoint (version 580.105.08)
- Driver 580.105.08 is listed by runsc nvproxy list-supported-drivers
Expected
CUDA checkpoint tool can initialize and checkpoint the GPU process running in gVisor.
Actual
- In-container wrapper:
[cuda-checkpoint-wrapper] ERROR: cuCheckpointProcessLock for PID failed: CUDA_ERROR_OPERATING_SYSTEM
[cuda-checkpoint-wrapper] ERROR: GPU checkpoint save failed
- Host cuda-checkpoint targeting runsc-sandbox PID:
Error getting process state for process ID : "initialization error"
Minimal logs
Container GPU is healthy:
$ nvidia-smi -L
GPU 0: NVIDIA L4 (UUID: GPU-...redacted...)
$ python3 - <<'PY'
import torch
print("cuda_available", torch.cuda.is_available())
print("device", torch.cuda.get_device_name(0))
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.synchronize()
print("ok", y[0,0].item())
PY
cuda_available True
device NVIDIA L4
ok
Hypervisor log around snapshot:
[cuda-checkpoint-wrapper] Starting GPU save (checkpoint)
[cuda-checkpoint-wrapper] Found 1 GPU process(es)
[cuda-checkpoint-wrapper] Checkpointing 1 process(es)...
[cuda-checkpoint-wrapper] ERROR: cuCheckpointProcessLock for PID 1091866 failed: CUDA_ERROR_OPERATING_SYSTEM (OS call failed or operation not supported on this OS)
[cuda-checkpoint-wrapper] ERROR: Failed to process 1/1 CUDA process(es)
[cuda-checkpoint-wrapper] ERROR: GPU checkpoint save failed
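For context, the wrapper's discovery step ("Found 1 GPU process(es)") can be approximated by scanning /proc for processes holding /dev/nvidia* device nodes open. This is a hypothetical sketch (the wrapper's source is not included in this issue), and the helper name is made up:

```python
import glob
import os


def find_gpu_pids():
    """Return PIDs of processes with an open /dev/nvidia* descriptor.

    Hypothetical approximation of the wrapper's GPU-process discovery;
    requires Linux and returns [] if nothing matches or /proc is unreadable.
    """
    pids = []
    for fd_dir in glob.glob("/proc/[0-9]*/fd"):
        pid = int(fd_dir.split("/")[2])
        try:
            for fd in os.listdir(fd_dir):
                try:
                    target = os.readlink(os.path.join(fd_dir, fd))
                except OSError:
                    continue  # fd closed between listdir and readlink
                if target.startswith("/dev/nvidia"):
                    pids.append(pid)
                    break
        except OSError:
            continue  # process exited or fd dir not readable
    return pids


print(find_gpu_pids())
```

Inside the gVisor sandbox this scan sees sentry-virtualized PIDs, which is one reason a host-side tool targeting the runsc-sandbox PID and an in-container wrapper can disagree about which process to lock.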
Notes
- Host CUDA checkpoint works on native CUDA processes with the same driver/tool.
- Failure happens only for GPU processes inside gVisor/nvproxy (runsc-sandbox).
Question
Is CUDA checkpoint/restore supported for GPU processes running under nvproxy?
If not, is there any planned support or documented limitation?
Steps to reproduce
- Run a CUDA process inside gVisor (e.g. PyTorch CUDA).
- In the container, confirm GPU works:
nvidia-smi -L
python3 - <<'PY'
import torch
print(torch.cuda.is_available())
x = torch.randn(1024,1024,device="cuda")
y = x @ x
torch.cuda.synchronize()
print(y[0,0].item())
PY
- Try GPU checkpoint inside the container:
GVISOR_SAVE_RESTORE_AUTO_EXEC_MODE=save /etc/.test/cuda-snapshot
- Or from host using NVIDIA cuda-checkpoint on the runsc-sandbox PID:
sudo /tmp/nv_cuda_checkpoint --get-state --pid
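The failing call can also be driven directly through the CUDA driver library, which isolates cuCheckpointProcessLock from the wrapper. This is a hedged sketch: it assumes a driver new enough (r570+) to export the checkpoint symbols, and passing NULL for the lock-args struct is an assumption, not documented behavior. All paths degrade to a string so it is safe to run on a GPU-less machine:

```python
import ctypes
import ctypes.util
import os


def checkpoint_lock(pid: int) -> str:
    """Sketch: call cuCheckpointProcessLock on a PID via libcuda (ctypes).

    Assumptions: driver >= r570 exports the checkpoint API; NULL is
    accepted for the CUcheckpointLockArgs pointer (unverified).
    """
    path = ctypes.util.find_library("cuda")
    if path is None:
        return "libcuda not found"
    cuda = ctypes.CDLL(path)
    if not hasattr(cuda, "cuCheckpointProcessLock"):
        return "driver too old: no cuCheckpointProcessLock symbol"
    rc = cuda.cuInit(0)
    if rc != 0:
        return f"cuInit failed: CUresult {rc}"
    rc = cuda.cuCheckpointProcessLock(ctypes.c_int(pid), None)
    # CUresult 304 is CUDA_ERROR_OPERATING_SYSTEM, matching the wrapper log.
    return f"cuCheckpointProcessLock: CUresult {rc}"


print(checkpoint_lock(os.getpid()))
```

Running this inside the sandbox against the in-container PID, and on the host against the runsc-sandbox PID, would show whether the CUDA_ERROR_OPERATING_SYSTEM comes from nvproxy filtering the checkpoint ioctls or from the PID mismatch.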
runsc version
runsc version release-20260126.0
spec: 1.1.0-rc.1

docker version (if using docker)
uname
Linux
kubectl (if using Kubernetes)
repo state (if built from source)
No response
runsc debug logs (if available)
{"msg":"checkpoint failed: checkpointing container \"task-uuid0\": failed to exec save/restore binary: /etc/.test/cuda-snapshot exited with non-zero status 256","level":"error"}
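One small decoding note: the "non-zero status 256" in the runsc log is a raw wait(2) status, not the exit code itself; the exit code lives in the high byte, so it decodes to a plain exit(1) from the wrapper:

```python
import os

# wait(2) packs the exit code into bits 8-15 of the status word,
# so a raw status of 256 corresponds to exit code 1.
status = 256
print(os.WEXITSTATUS(status))  # 1
```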