Author: Mr. Watson 🦄 Date: 2026-02-19
- Goal
- Hardware context
- Services
- Management tool
- Usage
- Operations
- Why on-demand
- Thunderbolt Hot-Unplug Caveat
Monitor and manage GPU-intensive services (Whisper, RAG, Qwen3-TTS) with automatic lazy-loading and manual control to avoid VRAM exhaustion.
- GPU: NVIDIA RTX 2070 Super 8GB (via eGPU, USB-C/Thunderbolt)
- Total VRAM: 8192 MiB
- Services can NOT run simultaneously: Combined VRAM usage exceeds capacity
VRAM usage per service (approximate):
- Whisper (transcription + diarization): ~3.5 GB
- RAG library (embeddings + reranker): ~1.2 GB
- Qwen3-TTS 1.7B (voice cloning): ~4.3 GB
Combined: ~9 GB → exceeds 8 GB VRAM
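The arithmetic above can be checked per combination. This is a minimal sketch using the approximate figures from this page; the dictionary keys and the function name are illustrative and are not part of gpu-service.

```python
from itertools import combinations

# Approximate per-service VRAM from the list above, in GB.
VRAM_GB = {"whisper": 3.5, "rag": 1.2, "tts": 4.3}
CAPACITY_GB = 8.0  # 8192 MiB card


def fits(services, capacity_gb=CAPACITY_GB):
    """Return True if the listed services fit in VRAM together, by these estimates."""
    return sum(VRAM_GB[name] for name in services) <= capacity_gb


# Print every pair and the full set with its estimated total.
for size in (2, 3):
    for combo in combinations(VRAM_GB, size):
        total = sum(VRAM_GB[name] for name in combo)
        print(f"{' + '.join(combo)}: {total:.1f} GB, fits={fits(combo)}")
```

By these estimates some pairs fit on paper, but the figures are approximate and leave little headroom, so the one-at-a-time rule used on this page is the conservative choice.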
GPU service behavior:
- whisper-web.service (port 8060, /whisper endpoint)
  - ✨ Auto-loading: Frontend always active, GPU model loads on first job
  - Auto-unloads after 120 seconds of inactivity
  - Auto-starts on boot
- qwen3-tts.service (port 8070, /tts endpoint)
  - ✨ Auto-loading: Frontend always active, GPU model loads on first job
  - Auto-unloads after 120 seconds of inactivity
  - Auto-starts on boot
- rag-library-ingest.service (SFTP inbox watcher)
  - ⚠️ Manual control: Must be started/stopped manually with gpu-service
  - Does NOT auto-start on boot
  - Runs continuously when active (no auto-unload)
How it works:
- Service always running: FastAPI frontend available 24/7
- GPU model lazy-loads: Only loaded when first job arrives in queue
- Auto-unload on idle: After 120 seconds with no jobs, model is unloaded and VRAM freed
- Failsafe: If GPU OOM during load, job fails with clear error message
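The worker behavior described above can be sketched as follows. This is a minimal illustration and not the services' actual code: the class name and the load_model, unload_model and run_job callables are hypothetical placeholders.

```python
import queue
import threading
import time


class LazyModelWorker:
    """Load a model on the first queued job and unload it after an idle timeout."""

    def __init__(self, load_model, unload_model, run_job, idle_timeout=120.0):
        self._load = load_model      # hypothetical: returns a model handle
        self._unload = unload_model  # hypothetical: frees the model's VRAM
        self._run = run_job          # hypothetical: run_job(model, job) -> result
        self._idle_timeout = idle_timeout
        self._jobs = queue.Queue()
        self._model = None
        self.results = []

    def submit(self, job):
        """Queue a job; the frontend can call this while no model is loaded."""
        self._jobs.put(job)

    def serve(self, stop_event):
        """Process jobs sequentially; stop_event is checked between waits."""
        while not stop_event.is_set():
            try:
                # Wait up to the idle timeout for the next job.
                job = self._jobs.get(timeout=self._idle_timeout)
            except queue.Empty:
                # Idle for the whole timeout: release the model if it is loaded.
                if self._model is not None:
                    self._unload(self._model)
                    self._model = None
                continue
            if self._model is None:
                self._model = self._load()  # lazy load on the first job
            self.results.append(self._run(self._model, job))
```

The worker would typically run in a background thread, for example threading.Thread(target=worker.serve, args=(stop_event,), daemon=True), while the frontend only calls submit.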
Example timeline:
00:00 - User visits https://beachlab.org/whisper/
00:01 - User uploads audio and clicks "Transcribe"
00:02 - Worker thread detects queued job
00:03 - GPU model begins loading (~10-20s first time)
00:22 - Model loaded, transcription starts
00:45 - Job completes, marked as 'done'
02:45 - No new jobs for 120s → model unloads, VRAM freed
Benefits:
- No 502 errors (frontend always available)
- No manual service management needed
- Efficient VRAM usage (only allocated when needed)
- Multiple users can queue jobs (processed sequentially)
/usr/local/bin/gpu-service — CLI tool for monitoring and manual control
gpu-service status
Output:
- Service states (active/inactive)
- GPU memory usage per process
- Total VRAM used/available
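How gpu-service collects per-process memory is not documented here. One plausible approach, shown as a sketch, is to parse the CSV output of nvidia-smi's compute-apps query. The function names are illustrative and the sample rows are made up.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, name, MiB' rows into dicts; skip blank or malformed rows."""
    rows = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3 or not parts[0].isdigit() or not parts[2].isdigit():
            continue
        rows.append({"pid": int(parts[0]), "name": parts[1], "used_mib": int(parts[2])})
    return rows


def gpu_processes():
    """Run nvidia-smi and return per-process VRAM usage (needs a working driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)


# Made-up sample rows, in the format the query above produces.
sample = "4242, /opt/whisper/venv/bin/python, 3480\n5150, python3, 1190\n"
for proc in parse_compute_apps(sample):
    print(f"{proc['pid']:>6}  {proc['used_mib']:>5} MiB  {proc['name']}")
```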
gpu-service start whisper
gpu-service start rag
gpu-service start tts
Important: Only start ONE service at a time.
gpu-service stop whisper
gpu-service stop rag
gpu-service stop tts
Stop all:
gpu-service stop all
To switch from one GPU service to another:
gpu-service stop whisper
gpu-service start tts
Wait 2-3 seconds between stop and start for VRAM cleanup.
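The stop, wait, start sequence can be wrapped in a small helper. This is a sketch under the assumption that gpu-service is on PATH and exits non-zero on failure; the function name and the injectable run and sleep parameters are illustrative.

```python
import subprocess
import time


def switch_gpu_service(stop_name, start_name, pause_s=3.0,
                       run=subprocess.run, sleep=time.sleep):
    """Stop one service, pause for VRAM cleanup, then start another."""
    run(["gpu-service", "stop", stop_name], check=True)
    sleep(pause_s)  # this page recommends 2-3 seconds between stop and start
    run(["gpu-service", "start", start_name], check=True)
```

For example, switch_gpu_service("whisper", "tts") mirrors the two commands above with the pause in between.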
Transcription job (automatic):
- Navigate to https://beachlab.org/whisper/
- Upload audio and submit job
- GPU model loads automatically (first job may take 10-20s)
- Wait for job to complete
- Download transcript
- Model auto-unloads after 2 minutes of inactivity
Voice cloning (automatic):
- Navigate to https://beachlab.org/tts/
- Upload reference audio + enter text
- GPU model loads automatically (first job may take 10-20s)
- Wait for generation to complete
- Download wav file
- Model auto-unloads after 2 minutes of inactivity
eBook indexing (manual):
- Check GPU status: gpu-service status
- If Whisper/TTS are idle, proceed. If not, wait or use gpu-service stop all
- gpu-service start rag
- Upload PDFs/EPUBs via SFTP to /home/sftpuser/library_inbox
- Monitor logs: journalctl -u rag-library-ingest -f
- gpu-service stop rag (when inbox is empty)
Automatic (Whisper/TTS):
If you submit a job and GPU memory is full:
- Job will be marked as failed
- Error message: "GPU memory full. Please stop other GPU services (gpu-service stop all) and try again."
- Check gpu-service status to see what's using VRAM
- Stop conflicting service or wait for auto-unload (120s idle)
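The failure path above can be sketched as follows. This is an illustration only: the job dictionary, the function name and the oom_errors parameter are hypothetical. With PyTorch the exception to catch would likely be torch.cuda.OutOfMemoryError; MemoryError stands in here so the sketch runs without a GPU.

```python
OOM_MESSAGE = (
    "GPU memory full. Please stop other GPU services "
    "(gpu-service stop all) and try again."
)


def load_or_fail(job, load_model, oom_errors=(MemoryError,)):
    """Try to load the model; on an out-of-memory error mark the job as failed."""
    try:
        return load_model()
    except oom_errors:
        # Record the failure on the job instead of crashing the worker.
        job["status"] = "failed"
        job["error"] = OOM_MESSAGE
        return None
```

Returning None lets the worker skip the job and keep serving the queue, so a later job can succeed once VRAM has been freed.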
Manual (RAG):
Before starting RAG, check for conflicts:
gpu-service status
If Whisper or TTS are using GPU:
- Wait for auto-unload (check logs for "unloading model" message)
- Or force stop: gpu-service stop all
Then start RAG:
gpu-service start rag
To stop all GPU services directly via systemd:
sudo systemctl stop whisper-web rag-library-ingest qwen3-tts
Or kill GPU processes directly (last resort):
sudo pkill -9 -f "whisper-service|rag-library|qwen3-tts"
Why on-demand:
- VRAM limit: 8GB is not enough to run all three services simultaneously
- Sporadic use: Whisper, RAG, and TTS are used infrequently, not 24/7
- Resource efficiency: GPU idle when not needed
- User experience: Frontends always accessible, no manual service management needed
Design decisions:
- ✅ Auto-loading (Whisper/TTS): Frontend always available, GPU loads on demand
  - No CUDA OOM on startup (model loads when first job arrives)
  - Auto-unload after idle timeout (frees VRAM for other services)
  - Failsafe: if GPU memory full, job fails with clear message
- ⚠️ Manual control (RAG): Continuous processing when active
  - No auto-unload (watcher runs continuously until stopped)
  - Requires explicit gpu-service start rag before use
  - Prevents unexpected VRAM usage when uploading large batches
Alternative approaches considered but rejected:
- ❌ Smaller models: Qwen3-TTS 0.6B has noticeably lower quality
- ❌ Shared VRAM pool: Not supported by PyTorch/CUDA without full model unloading
- ❌ Always-on all services: Exceeds 8GB VRAM capacity
Observed on thebeachlab (NUC11TNKi3 + Razer Core X + RTX 2070 SUPER, June 2026):
- Normal boot with eGPU attached works
- boltctl shows Razer Core X as authorized
- nvidia-smi works
- GPU services (comfyui, qwen3-tts, whisper-web) can use the card normally
What breaks it reliably:
- Unplugging the Thunderbolt cable while the eGPU is live
- Reconnecting the cable in the same runtime session
What Linux reports when it breaks:
thunderbolt 0-3: device disconnected
pcieport 0000:00:07.0: pciehp: Slot(0): Link Down
NVRM: Xid (PCI:0000:04:00): 79, GPU has fallen off the bus.
NVRM: Xid (PCI:0000:04:00): 154, GPU recovery action changed ... GPU Reset Required
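Lines like these can be picked out of the kernel log automatically. The sketch below parses the exact sample shown above; the regular expression and function name are illustrative, and real NVRM lines may carry extra fields that end up in the detail string.

```python
import re

# Matches e.g. "NVRM: Xid (PCI:0000:04:00): 79, GPU has fallen off the bus."
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+),\s*(?P<detail>.*)"
)


def find_xid_events(log_text):
    """Return (xid_code, pci_bus, detail) tuples for every Xid line in the text."""
    events = []
    for line in log_text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append((int(m.group("code")), m.group("bus"), m.group("detail").strip()))
    return events


sample = """\
thunderbolt 0-3: device disconnected
pcieport 0000:00:07.0: pciehp: Slot(0): Link Down
NVRM: Xid (PCI:0000:04:00): 79, GPU has fallen off the bus.
NVRM: Xid (PCI:0000:04:00): 154, GPU recovery action changed ... GPU Reset Required
"""
for code, bus, detail in find_xid_events(sample):
    print(code, bus, detail)
```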
Typical broken-state symptoms after reconnect:
- Core X light comes back on
- boltctl may return to authorized
- lspci may still list the NVIDIA device, sometimes as rev ff
- nvidia-smi fails with No devices were found
- Server fan can ramp hard during the failure window
Operational rule:
- Do not hot-unplug or hot-replug the Thunderbolt cable while GPU workloads are active
- Treat the eGPU cable as effectively non-hot-swappable for production use on this host
Recovery:
- Stop touching the Thunderbolt cable
- Reboot the host
- Re-check:
boltctl list
lspci | grep -i nvidia
nvidia-smi
If the reboot path does not recover cleanly, escalate to full power-off / power-on.
Notes from local testing:
- Updating BIOS from 0073 to 0078 improved overall stability but did not make hot-unplug safe
- pcie_port_pm=off is currently kept as part of the stable baseline while investigating
- The issue matches known Linux/NVIDIA/Thunderbolt reports around Xid 79 and "fallen off the bus"
Historical context: The Razer Core X PSU died on 2026-03-04, so GPU services were disabled for a while to avoid continuous errors and freezes.
| Service | State |
|---|---|
| whisper-web | enabled + active |
| qwen3-tts | enabled + active |
| comfyui | enabled + active |
| nvidia-persistenced | enabled + active |
| egpu-watchdog.timer | enabled + active |
| Telegraf inputs.nvidia_smi | enabled |
Current runtime after restore:
- eGPU: Razer Core X authorized via Thunderbolt
- GPU: NVIDIA GeForce RTX 2070 SUPER
- Driver: 595.71.05
- nvidia-smi: OK
- Persistence mode: ON
1. Prefer cold boot, not hot-plug:
- Power off host
- Connect/power the Razer Core X
- Wait 5-10 seconds
- Boot host
2. Verify the GPU is visible:
boltctl
lspci | grep -i nvidia
nvidia-smi
If nvidia-smi fails, try:
sudo modprobe nvidia
nvidia-smi
3. Re-enable GPU services if they were disabled:
sudo systemctl enable --now nvidia-persistenced
sudo systemctl enable --now whisper-web
sudo systemctl enable --now qwen3-tts
sudo systemctl enable --now comfyui
sudo systemctl enable --now egpu-watchdog.timer
4. Re-enable Telegraf monitoring if needed:
Ensure /etc/telegraf/telegraf.d/nuc-timescale.conf contains:
[[inputs.nvidia_smi]]
bin_path = "/usr/local/bin/nvidia-smi-safe.sh"
timeout = "5s"
Then:
sudo systemctl restart telegraf
sudo journalctl -u telegraf -n 10 --no-pager | grep -E "Error|nvidia"
5. Verify watchdog + heartbeat instrumentation:
systemctl status egpu-watchdog.timer host-heartbeat-log.timer --no-pager
tail -n 20 /var/log/host-heartbeat.log
Expected behavior:
- one alert when the eGPU is lost
- one alert when it recovers
- no repeating half-hour alerts while it remains missing
- heartbeat log includes explicit transition lines such as:
event=egpu_lost last_ok=2026-06-15T05:03:29Z detected_at=2026-06-15T05:04:33Z
event=egpu_recovered missing_since=2026-06-15T05:04:33Z detected_at=2026-06-15T05:18:12Z
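The "one alert per transition" behavior above amounts to tracking state changes between probes. The following is a minimal sketch of that idea and not the actual egpu-watchdog implementation; the class and method names are hypothetical, and timestamps are passed in as preformatted strings.

```python
class EgpuStateTracker:
    """Emit a log line only when the eGPU goes missing or comes back."""

    def __init__(self):
        self.ok = True             # assume the eGPU is present at start
        self.last_ok = None        # timestamp of the last successful probe
        self.missing_since = None  # timestamp at which the loss was detected

    def observe(self, is_ok, now):
        """Record one probe result; return a log line on a state change, else None."""
        line = None
        if self.ok and not is_ok:
            self.missing_since = now
            line = f"event=egpu_lost last_ok={self.last_ok} detected_at={now}"
        elif not self.ok and is_ok:
            line = f"event=egpu_recovered missing_since={self.missing_since} detected_at={now}"
        if is_ok:
            self.last_ok = now
        self.ok = is_ok
        return line
```

Repeated failed probes return None after the first one, which is what prevents recurring alerts while the eGPU stays missing.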
6. Verify telemetry:
DRY_RUN=true bash /home/pink/.openclaw/workspace/scripts/publish_telemetry.sh | python3 -m json.tool | grep gpu
The gpu field should show real temp/util values instead of null.
7. Quick service test:
curl -I http://localhost:8060/ # whisper-web
curl -I http://localhost:8070/ # qwen3-tts
curl -I http://localhost:8188/ # comfyui
curl http://localhost:8060/openapi.json | jq '.info'
curl http://localhost:8070/openapi.json | jq '.info'
Note: whisper-web and qwen3-tts do not expose /health; use /, /docs, or /openapi.json instead.
Check the kernel log for the current boot:
journalctl -k -b | grep -iE 'NVRM|Xid|nvidia|thunderbolt|bolt'
On the 2026-06-10 restore there were no Xid errors after boot. Only one benign-looking line appeared during bring-up:
nvidia-gpu 0000:04:00.3: i2c timeout error e0000000