GPU Service Management (On-Demand)

Author: Mr. Watson 🦄 Date: 2026-02-19

Goal

Monitor and manage GPU-intensive services (Whisper, RAG, Qwen3-TTS) with automatic lazy-loading and manual control to avoid VRAM exhaustion.

Hardware context

  • GPU: NVIDIA RTX 2070 Super 8GB (via eGPU, USB-C/Thunderbolt)
  • Total VRAM: 8192 MiB
  • The three services cannot all run simultaneously: their combined VRAM usage exceeds capacity

VRAM usage per service (approximate):

  • Whisper (transcription + diarization): ~3.5 GB
  • RAG library (embeddings + reranker): ~1.2 GB
  • Qwen3-TTS 1.7B (voice cloning): ~4.3 GB

Combined: ~9 GB → exceeds 8 GB VRAM
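The arithmetic behind that total can be written out as a small check. This is only a sketch: the per-service numbers are the approximate figures listed above, and the helper names `combined_gb` and `fits` are made up for illustration.

```python
# Approximate per-service VRAM figures from the list above, in GB.
VRAM_GB = {"whisper": 3.5, "rag": 1.2, "tts": 4.3}
CAPACITY_GB = 8.0  # 8192 MiB card


def combined_gb(services):
    """Sum the approximate VRAM footprint of the named services."""
    return round(sum(VRAM_GB[name] for name in services), 1)


def fits(services, capacity_gb=CAPACITY_GB):
    """True if the combined approximate footprint stays within capacity."""
    return combined_gb(services) <= capacity_gb


total = combined_gb(VRAM_GB)
print(f"all three: {total} GB, fits={fits(VRAM_GB)}")
```

Because the inputs are estimates, a result that only narrowly passes should be treated with suspicion rather than as a guarantee.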

Services

GPU service behavior:

  • whisper-web.service (port 8060, /whisper endpoint)

    • Auto-loading: Frontend always active, GPU model loads on first job
    • Auto-unloads after 120 seconds of inactivity
    • Auto-starts on boot
  • qwen3-tts.service (port 8070, /tts endpoint)

    • Auto-loading: Frontend always active, GPU model loads on first job
    • Auto-unloads after 120 seconds of inactivity
    • Auto-starts on boot
  • rag-library-ingest.service (SFTP inbox watcher)

    • ⚠️ Manual control: Must be started/stopped manually with gpu-service
    • Does NOT auto-start on boot
    • Runs continuously when active (no auto-unload)

Auto-Loading Behavior (Whisper/TTS)

How it works:

  1. Service always running: FastAPI frontend available 24/7
  2. GPU model lazy-loads: Only loaded when first job arrives in queue
  3. Auto-unload on idle: After 120 seconds with no jobs, model is unloaded and VRAM freed
  4. Failsafe: If the GPU runs out of memory during model load, the job fails with a clear error message
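The load-on-first-job and unload-after-idle pattern described above can be sketched roughly as follows. This is not the services' actual code: `LazyModel`, its `loader` callable, and the injectable `clock` are hypothetical names chosen for the example.

```python
import time


class LazyModel:
    """Load a model on first use and drop it after an idle timeout."""

    def __init__(self, loader, idle_timeout=120.0, clock=time.monotonic):
        self._loader = loader            # hypothetical callable that builds the model
        self._idle_timeout = idle_timeout
        self._clock = clock              # injectable so the timeout can be tested
        self._model = None
        self._last_used = None

    @property
    def loaded(self):
        return self._model is not None

    def get(self):
        """Return the model, loading it if nothing is currently loaded."""
        if self._model is None:
            self._model = self._loader()
        self._last_used = self._clock()
        return self._model

    def unload_if_idle(self):
        """Drop the model once idle_timeout seconds pass without a job."""
        if self._model is None:
            return False
        if self._clock() - self._last_used < self._idle_timeout:
            return False
        # A real service would also release cached GPU memory at this point.
        self._model = None
        return True
```

A worker thread would presumably call `get()` for each queued job and `unload_if_idle()` on a short timer. With PyTorch, actually returning VRAM usually also means dropping every reference to the model and calling `torch.cuda.empty_cache()`.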

Example timeline:

00:00 - User visits https://beachlab.org/whisper/
00:01 - User uploads audio and clicks "Transcribe"
00:02 - Worker thread detects queued job
00:03 - GPU model begins loading (~10-20s first time)
00:22 - Model loaded, transcription starts
00:45 - Job completes, marked as 'done'
02:45 - No new jobs for 120s → model unloads, VRAM freed

Benefits:

  • No 502 errors (frontend always available)
  • No manual service management needed
  • Efficient VRAM usage (only allocated when needed)
  • Multiple users can queue jobs (processed sequentially)

Management tool

/usr/local/bin/gpu-service — CLI tool for monitoring and manual control

Usage

Check status

gpu-service status

Output:

  • Service states (active/inactive)
  • GPU memory usage per process
  • Total VRAM used/available
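For reference, the per-process part of such a report could be derived from `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits`. The sketch below only parses that CSV text; it is not the actual `gpu-service` implementation, and the sample row (PID, process name, size) is invented.

```python
import csv
import io


def parse_compute_apps(text):
    """Parse 'pid, process_name, used_memory' rows into dicts (memory in MiB)."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        fields = [f.strip() for f in fields]
        if len(fields) != 3 or not fields[0].isdigit() or not fields[2].isdigit():
            continue  # skip blank, malformed, or non-numeric rows
        pid, name, used = fields
        rows.append({"pid": int(pid), "name": name, "used_mib": int(used)})
    return rows


def summarize(rows, total_mib=8192):
    """Approximate used/free VRAM from the per-process figures."""
    used = sum(r["used_mib"] for r in rows)
    return {"used_mib": used, "free_mib": total_mib - used}


sample = "1234, python3, 3584\n"  # illustrative row, not real output
print(summarize(parse_compute_apps(sample)))
```

Summing per-process figures only approximates total usage; `nvidia-smi --query-gpu=memory.used,memory.total` reports the card-level numbers directly.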

Start a service

gpu-service start whisper
gpu-service start rag
gpu-service start tts

Important: Only start ONE service at a time.

Stop a service

gpu-service stop whisper
gpu-service stop rag
gpu-service stop tts

Stop all:

gpu-service stop all

Switch services

To switch from one GPU service to another:

gpu-service stop whisper
gpu-service start tts

Wait 2-3 seconds between stop and start for VRAM cleanup.
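If the switch is scripted, the pause can be built in. A minimal sketch, assuming `gpu-service` is on `PATH` and can be run without extra privileges (the function name `switch_service` is hypothetical):

```python
import subprocess
import time


def switch_service(old, new, run=subprocess.run, sleep=time.sleep, pause_s=3.0):
    """Stop one GPU service, pause for VRAM cleanup, then start another."""
    run(["gpu-service", "stop", old], check=True)
    sleep(pause_s)  # the 2-3 second pause recommended above
    run(["gpu-service", "start", new], check=True)
```

Calling `switch_service("whisper", "tts")` would run the two commands shown above with the pause in between. The `run` and `sleep` parameters exist only so the sequence can be exercised without touching real services.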

Operations

Typical workflows

Transcription job (automatic):

  1. Navigate to https://beachlab.org/whisper/
  2. Upload audio and submit job
  3. GPU model loads automatically (first job may take 10-20s)
  4. Wait for job to complete
  5. Download transcript
  6. Model auto-unloads after 2 minutes of inactivity

Voice cloning (automatic):

  1. Navigate to https://beachlab.org/tts/
  2. Upload reference audio + enter text
  3. GPU model loads automatically (first job may take 10-20s)
  4. Wait for generation to complete
  5. Download wav file
  6. Model auto-unloads after 2 minutes of inactivity

eBook indexing (manual):

  1. Check GPU status: gpu-service status
  2. If Whisper/TTS are idle, proceed. If not, wait or use gpu-service stop all
  3. gpu-service start rag
  4. Upload PDFs/EPUBs via SFTP to /home/sftpuser/library_inbox
  5. Monitor logs: journalctl -u rag-library-ingest -f
  6. gpu-service stop rag (when inbox is empty)

VRAM conflict handling

Automatic (Whisper/TTS):

If you submit a job and GPU memory is full:

  • Job will be marked as failed
  • Error message: "GPU memory full. Please stop other GPU services (gpu-service stop all) and try again."
  • Check gpu-service status to see what's using VRAM
  • Stop conflicting service or wait for auto-unload (120s idle)
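The fail-with-a-clear-message behavior could look roughly like this in a worker. It is a sketch under assumptions: `GpuOutOfMemory` is a stand-in exception (with PyTorch the real one would typically be `torch.cuda.OutOfMemoryError`), and `run_job`, `load_model`, and `process` are hypothetical names.

```python
OOM_MESSAGE = (
    "GPU memory full. Please stop other GPU services "
    "(gpu-service stop all) and try again."
)


class GpuOutOfMemory(RuntimeError):
    """Stand-in for the framework's out-of-memory error (assumption)."""


def run_job(job, load_model, process):
    """Run one job, marking it 'done' or 'failed' without crashing the worker."""
    try:
        model = load_model()
        job["result"] = process(model, job["input"])
        job["status"] = "done"
    except GpuOutOfMemory:
        job["status"] = "failed"
        job["error"] = OOM_MESSAGE
    return job
```

Catching the error per job keeps the worker thread alive, so a later job can succeed once the conflicting service has unloaded.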

Manual (RAG):

Before starting RAG, check for conflicts:

gpu-service status

If Whisper or TTS are using GPU:

  • Wait for auto-unload (check logs for "unloading model" message)
  • Or force stop: gpu-service stop all

Then start RAG:

gpu-service start rag
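The manual check before starting RAG can also be expressed as a preflight test on the output of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`. The sketch assumes a single GPU; the 512 MiB headroom is an arbitrary illustrative margin, and the 1229 MiB requirement is just the ~1.2 GB figure listed under Hardware context converted to MiB.

```python
RAG_REQUIRED_MIB = 1229  # ~1.2 GB, rounded up
HEADROOM_MIB = 512       # hypothetical safety margin, not from this document


def parse_gpu_memory(text):
    """Parse a single 'used, total' line (values in MiB)."""
    used, total = (int(part.strip()) for part in text.strip().split(","))
    return used, total


def safe_to_start(text, required_mib=RAG_REQUIRED_MIB, headroom_mib=HEADROOM_MIB):
    """True if free VRAM covers the requirement plus headroom."""
    used, total = parse_gpu_memory(text)
    return total - used >= required_mib + headroom_mib
```

For example, `safe_to_start("7000, 8192")` is false because only 1192 MiB remain, which is the situation where waiting for auto-unload or running `gpu-service stop all` applies.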

Emergency: all services stuck

sudo systemctl stop whisper-web rag-library-ingest qwen3-tts

Or kill GPU processes directly (last resort):

sudo pkill -9 -f "whisper-service|rag-library|qwen3-tts"

Why lazy-loading + manual control

  1. VRAM limit: 8 GB is not enough to run all three services simultaneously
  2. Sporadic use: Whisper, RAG, and TTS are used infrequently, not 24/7
  3. Resource efficiency: GPU idle when not needed
  4. User experience: Frontends always accessible, no manual service management needed

Design decisions:

  • Auto-loading (Whisper/TTS): Frontend always available, GPU loads on demand
    • No CUDA OOM on startup (model loads when first job arrives)
    • Auto-unload after idle timeout (frees VRAM for other services)
    • Failsafe: if GPU memory full, job fails with clear message
  • ⚠️ Manual control (RAG): Continuous processing when active
    • No auto-unload (watcher runs continuously until stopped)
    • Requires explicit gpu-service start rag before use
    • Prevents unexpected VRAM usage when uploading large batches

Alternative approaches considered but rejected:

  • Smaller models: Qwen3-TTS 0.6B has noticeably lower quality
  • Shared VRAM pool: Not supported by PyTorch/CUDA without full model unloading
  • Always-on all services: Exceeds 8 GB VRAM capacity

Thunderbolt Hot-Unplug Caveat

Observed on thebeachlab (NUC11TNKi3 + Razer Core X + RTX 2070 SUPER, June 2026):

  • Normal boot with eGPU attached works
  • boltctl shows Razer Core X as authorized
  • nvidia-smi works
  • GPU services (comfyui, qwen3-tts, whisper-web) can use the card normally

What breaks it reliably:

  • Unplugging the Thunderbolt cable while the eGPU is live
  • Reconnecting the cable in the same runtime session

What Linux reports when it breaks:

thunderbolt 0-3: device disconnected
pcieport 0000:00:07.0: pciehp: Slot(0): Link Down
NVRM: Xid (PCI:0000:04:00): 79, GPU has fallen off the bus.
NVRM: Xid (PCI:0000:04:00): 154, GPU recovery action changed ... GPU Reset Required
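When scanning kernel logs for this failure, the Xid codes can be pulled out with a regular expression. A small sketch, run here against the exact lines quoted above (the function name `extract_xids` is hypothetical):

```python
import re

XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+),")


def extract_xids(log_text):
    """Return (pci_address, xid_code) pairs found in kernel log text."""
    return [(m.group("bus"), int(m.group("code"))) for m in XID_RE.finditer(log_text)]


log = """\
thunderbolt 0-3: device disconnected
pcieport 0000:00:07.0: pciehp: Slot(0): Link Down
NVRM: Xid (PCI:0000:04:00): 79, GPU has fallen off the bus.
NVRM: Xid (PCI:0000:04:00): 154, GPU recovery action changed ... GPU Reset Required
"""
print(extract_xids(log))
```

On a live host the input would come from something like `journalctl -k -b`. Seeing Xid 79 is the signal that matters for the hot-unplug case.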

Typical broken-state symptoms after reconnect:

  • Core X light comes back on
  • boltctl may return to authorized
  • lspci may still list the NVIDIA device, sometimes as rev ff
  • nvidia-smi fails with No devices were found
  • Server fan can ramp hard during the failure window
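These symptoms can be turned into a rough classifier over captured `lspci` and `nvidia-smi` output. The state labels and the function name `diagnose` are invented for this sketch, and the sample `lspci` line in the usage note is illustrative.

```python
def diagnose(lspci_out, nvidia_smi_out, nvidia_smi_rc):
    """Classify eGPU state from captured lspci and nvidia-smi output."""
    nvidia_lines = [
        line for line in lspci_out.splitlines() if "nvidia" in line.lower()
    ]
    if nvidia_smi_rc == 0:
        return "ok"
    if not nvidia_lines:
        return "absent"           # device not enumerated on the PCI bus
    if any("(rev ff)" in line for line in nvidia_lines):
        return "fallen_off_bus"   # listed, but the device no longer responds
    if "No devices were found" in nvidia_smi_out:
        return "driver_lost_device"
    return "unknown"
```

A call such as `diagnose("04:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2070 SUPER] (rev ff)", "No devices were found", 1)` returns `"fallen_off_bus"`. Every state other than `"ok"` leads to the same remedy on this host: reboot.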

Operational rule:

  • Do not hot-unplug or hot-replug the Thunderbolt cable while GPU workloads are active
  • Treat the eGPU cable as effectively non-hot-swappable for production use on this host

Recovery:

  1. Stop touching the Thunderbolt cable
  2. Reboot the host
  3. Re-check:
boltctl list
lspci | grep -i nvidia
nvidia-smi

If the reboot path does not recover cleanly, escalate to full power-off / power-on.

Notes from local testing:

  • Updating BIOS from 0073 to 0078 improved overall stability but did not make hot-unplug safe
  • pcie_port_pm=off is currently kept as part of the stable baseline while investigating
  • The issue matches known Linux/NVIDIA/Thunderbolt reports around Xid 79 and "fallen off the bus"

Recovery after dead PSU/GPU

Historical context: The Razer Core X PSU died on 2026-03-04, so GPU services were disabled for a while to avoid continuous errors and freezes.

Current state (restored on 2026-06-10)

| Service | State |
| --- | --- |
| whisper-web | enabled + active |
| qwen3-tts | enabled + active |
| comfyui | enabled + active |
| nvidia-persistenced | enabled + active |
| egpu-watchdog.timer | enabled + active |
| Telegraf inputs.nvidia_smi | enabled |

Current runtime after restore:

  • eGPU: Razer Core X authorized via Thunderbolt
  • GPU: NVIDIA GeForce RTX 2070 SUPER
  • Driver: 595.71.05
  • nvidia-smi: OK
  • Persistence mode: ON

Restore procedure (if the eGPU disappears again)

1. Prefer cold boot, not hot-plug:

  1. Power off host
  2. Connect/power the Razer Core X
  3. Wait 5-10 seconds
  4. Boot host

2. Verify the GPU is visible:

boltctl
lspci | grep -i nvidia
nvidia-smi

If nvidia-smi fails, try:

sudo modprobe nvidia
nvidia-smi

3. Re-enable GPU services if they were disabled:

sudo systemctl enable --now nvidia-persistenced
sudo systemctl enable --now whisper-web
sudo systemctl enable --now qwen3-tts
sudo systemctl enable --now comfyui
sudo systemctl enable --now egpu-watchdog.timer

4. Re-enable Telegraf monitoring if needed:

Ensure /etc/telegraf/telegraf.d/nuc-timescale.conf contains:

[[inputs.nvidia_smi]]
  bin_path = "/usr/local/bin/nvidia-smi-safe.sh"
  timeout = "5s"

Then:

sudo systemctl restart telegraf
sudo journalctl -u telegraf -n 10 --no-pager | grep -E "Error|nvidia"

5. Verify watchdog + heartbeat instrumentation:

systemctl status egpu-watchdog.timer host-heartbeat-log.timer --no-pager
tail -n 20 /var/log/host-heartbeat.log

Expected behavior:

  • one alert when the eGPU is lost
  • one alert when it recovers
  • no repeating half-hour alerts while it remains missing
  • heartbeat log includes explicit transition lines such as:
event=egpu_lost last_ok=2026-06-15T05:03:29Z detected_at=2026-06-15T05:04:33Z
event=egpu_recovered missing_since=2026-06-15T05:04:33Z detected_at=2026-06-15T05:18:12Z
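The one-alert-per-transition behavior amounts to edge-triggered alerting. The sketch below is not the actual watchdog script; `EgpuWatch` and `observe` are hypothetical names, and only the event line format is taken from the examples above.

```python
class EgpuWatch:
    """Emit one event per state transition instead of one per failed check."""

    def __init__(self):
        self.present = True        # assume the GPU was visible at startup
        self.last_ok = None
        self.missing_since = None

    def observe(self, gpu_ok, now):
        """Record one check; return an event line on a transition, else None."""
        if gpu_ok:
            event = None
            if not self.present:
                event = (
                    f"event=egpu_recovered missing_since={self.missing_since} "
                    f"detected_at={now}"
                )
            self.present, self.last_ok = True, now
            return event
        if self.present:
            self.present, self.missing_since = False, now
            return f"event=egpu_lost last_ok={self.last_ok} detected_at={now}"
        return None  # still missing: stay quiet
```

Since the watchdog runs from a systemd timer, each check is presumably a fresh process, so a real implementation would need to keep this state in a file between runs.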

6. Verify telemetry:

DRY_RUN=true bash /home/pink/.openclaw/workspace/scripts/publish_telemetry.sh | python3 -m json.tool | grep gpu

The gpu field should show real temp/util values instead of null.

7. Quick service test:

curl -I http://localhost:8060/      # whisper-web
curl -I http://localhost:8070/      # qwen3-tts
curl -I http://localhost:8188/      # comfyui
curl -s http://localhost:8060/openapi.json | jq '.info'
curl -s http://localhost:8070/openapi.json | jq '.info'

Note: whisper-web and qwen3-tts do not expose /health; use /, /docs, or /openapi.json instead.

Log checks

journalctl -k -b | grep -iE 'NVRM|Xid|nvidia|thunderbolt|bolt'

On the 2026-06-10 restore there were no Xid errors after boot. Only one benign-looking line appeared during bring-up:

nvidia-gpu 0000:04:00.3: i2c timeout error e0000000