GPU Service Management (On-Demand)

Author: Mr. Watson 🦄
Date: 2026-02-19

Goal
Hardware context
Services
Management tool
Usage
Operations
Why lazy-loading + manual control
Thunderbolt Hot-Unplug Caveat
Recovery after dead PSU/GPU

Goal

Monitor and manage GPU-intensive services (Whisper, RAG, Qwen3-TTS) with automatic lazy-loading and manual control to avoid VRAM exhaustion.

Hardware context

GPU: NVIDIA RTX 2070 Super 8GB (via eGPU, USB-C/Thunderbolt)
Total VRAM: 8192 MiB
All three services can NOT run simultaneously: their combined VRAM usage exceeds capacity

VRAM usage per service (approximate):

Whisper (transcription + diarization): ~3.5 GB
RAG library (embeddings + reranker): ~1.2 GB
Qwen3-TTS 1.7B (voice cloning): ~4.3 GB

Combined: ~9 GB → exceeds 8 GB VRAM
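The arithmetic can be checked with a short sketch. It uses only the approximate per-service figures listed above; the function name `fits` and the rounding of 8192 MiB to 8.0 GB are illustrative assumptions, not part of the gpu-service tool.

```python
from itertools import combinations

# Approximate VRAM per service in GB, copied from the list above.
VRAM_GB = {"whisper": 3.5, "rag": 1.2, "tts": 4.3}
TOTAL_GB = 8.0  # 8192 MiB, rounded for illustration


def fits(services, total_gb=TOTAL_GB):
    """Return True if the named services fit in VRAM together."""
    return sum(VRAM_GB[name] for name in services) <= total_gb


if __name__ == "__main__":
    # Print every pair and the full trio with its estimated footprint.
    for size in (2, 3):
        for combo in combinations(VRAM_GB, size):
            need = sum(VRAM_GB[name] for name in combo)
            print(f"{' + '.join(combo)}: {need:.1f} GB, fits={fits(combo)}")
```

On these estimates some pairs fit on paper (Whisper + TTS comes to about 7.8 GB), but the figures are approximate and leave little headroom for per-process CUDA overhead, which is consistent with the one-service-at-a-time rule used in this document.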

Services

GPU service behavior:

whisper-web.service (port 8060, /whisper endpoint)
- ✨ Auto-loading: Frontend always active, GPU model loads on first job
- Auto-unloads after 120 seconds of inactivity
- Auto-starts on boot
qwen3-tts.service (port 8070, /tts endpoint)
- ✨ Auto-loading: Frontend always active, GPU model loads on first job
- Auto-unloads after 120 seconds of inactivity
- Auto-starts on boot
rag-library-ingest.service (SFTP inbox watcher)
- ⚠️ Manual control: Must be started/stopped manually with gpu-service
- Does NOT auto-start on boot
- Runs continuously when active (no auto-unload)

Auto-Loading Behavior (Whisper/TTS)

How it works:

Service always running: FastAPI frontend available 24/7
GPU model lazy-loads: Only loaded when first job arrives in queue
Auto-unload on idle: After 120 seconds with no jobs, model is unloaded and VRAM freed
Failsafe: If the GPU runs out of memory (OOM) during load, the job fails with a clear error message
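The load-on-first-job and unload-on-idle pattern described above can be sketched as follows. This is not the code running in whisper-web or qwen3-tts, which is not shown in this document; the class `LazyModel` and the `loader`/`unloader` callables are hypothetical names, and only the 120 second timeout comes from the text.

```python
import threading
import time


class LazyModel:
    """Load a model on first use and unload it after an idle timeout."""

    def __init__(self, loader, unloader, idle_timeout=120.0):
        self._loader = loader        # hypothetical: returns a loaded model
        self._unloader = unloader    # hypothetical: releases the model's VRAM
        self._idle_timeout = idle_timeout
        self._model = None
        self._last_used = 0.0
        self._lock = threading.Lock()

    @property
    def loaded(self):
        return self._model is not None

    def run(self, job):
        """Process one job; the lock makes jobs run sequentially."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()  # lazy load on first job
            try:
                return job(self._model)
            finally:
                self._last_used = time.monotonic()

    def unload_if_idle(self, now=None):
        """Unload if idle for at least the timeout; return True if unloaded."""
        now = time.monotonic() if now is None else now
        with self._lock:
            idle = now - self._last_used
            if self._model is not None and idle >= self._idle_timeout:
                self._unloader(self._model)
                self._model = None
                return True
            return False
```

In a service, a worker thread would call `run` for each queued job and a periodic timer would call `unload_if_idle`. Holding the lock for the whole job is a deliberate simplification that keeps loading, inference, and unloading from overlapping.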

Example timeline:

00:00 - User visits https://beachlab.org/whisper/
00:01 - User uploads audio and clicks "Transcribe"
00:02 - Worker thread detects queued job
00:03 - GPU model begins loading (~10-20s first time)
00:22 - Model loaded, transcription starts
00:45 - Job completes, marked as 'done'
02:45 - No new jobs for 120s → model unloads, VRAM freed

Benefits:

No 502 errors (frontend always available)
No manual service management needed
Efficient VRAM usage (only allocated when needed)
Multiple users can queue jobs (processed sequentially)

Management tool

/usr/local/bin/gpu-service — CLI tool for monitoring and manual control

Usage

Check status

gpu-service status

Output:

Service states (active/inactive)
GPU memory usage per process
Total VRAM used/available
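How gpu-service gathers these numbers is not shown here. As a hedged sketch, total VRAM used and available could be read with `nvidia-smi`'s CSV query mode; the function names `parse_memory` and `gpu_memory` are assumptions made for this example.

```python
import subprocess

# nvidia-smi query mode: one CSV line per GPU, values in MiB, no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_line):
    """Parse one 'used, total' CSV line (MiB) into a dict."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return {"used_mib": used, "total_mib": total, "free_mib": total - used}


def gpu_memory():
    """Query the first GPU; raises CalledProcessError if nvidia-smi fails."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(result.stdout.splitlines()[0])
```

Keeping the parser separate from the subprocess call makes it possible to test the parsing on a machine without a GPU.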

Start a service

gpu-service start whisper
gpu-service start rag
gpu-service start tts

Important: Only start ONE service at a time.

Stop a service

gpu-service stop whisper
gpu-service stop rag
gpu-service stop tts

Stop all:

gpu-service stop all

Switch services

To switch from one GPU service to another:

gpu-service stop whisper
gpu-service start tts

Wait 2-3 seconds between stop and start for VRAM cleanup.
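Instead of a fixed pause, a script could poll until VRAM has actually been released before starting the next service. A minimal sketch, assuming a caller-supplied `read_used_mib` function (hypothetical) that returns used VRAM in MiB, and a 500 MiB threshold and 10 second timeout chosen only for illustration:

```python
import time


def wait_for_vram(read_used_mib, threshold_mib=500, timeout_s=10.0,
                  poll_s=0.5, sleep=time.sleep):
    """Poll until used VRAM drops below threshold_mib.

    Returns True once the reading is below the threshold, or False if
    it never is within roughly timeout_s seconds.
    """
    max_polls = int(timeout_s / poll_s) + 1
    for _ in range(max_polls):
        if read_used_mib() < threshold_mib:
            return True
        sleep(poll_s)
    return False
```

This would run between `gpu-service stop whisper` and `gpu-service start tts`. The `sleep` parameter is injectable so the loop can be exercised without real delays.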

Operations

Typical workflows

Transcription job (automatic):

Navigate to https://beachlab.org/whisper/
Upload audio and submit job
GPU model loads automatically (first job may take 10-20s)
Wait for job to complete
Download transcript
Model auto-unloads after 2 minutes of inactivity

Voice cloning (automatic):

Navigate to https://beachlab.org/tts/
Upload reference audio + enter text
GPU model loads automatically (first job may take 10-20s)
Wait for generation to complete
Download wav file
Model auto-unloads after 2 minutes of inactivity

eBook indexing (manual):

Check GPU status: gpu-service status
If Whisper/TTS are idle, proceed. If not, wait or use gpu-service stop all
gpu-service start rag
Upload PDFs/EPUBs via SFTP to /home/sftpuser/library_inbox
Monitor logs: journalctl -u rag-library-ingest -f
gpu-service stop rag (when inbox is empty)

VRAM conflict handling

Automatic (Whisper/TTS):

If you submit a job and GPU memory is full:

Job will be marked as failed
Error message: "GPU memory full. Please stop other GPU services (gpu-service stop all) and try again."
Check gpu-service status to see what's using VRAM
Stop conflicting service or wait for auto-unload (120s idle)
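The failsafe above can be sketched as a wrapper around model loading. The services' real handler is not shown in this document, so the job dictionary, the `status` values other than "failed", and the names `is_oom` and `load_for_job` are assumptions. The sketch relies on PyTorch raising CUDA out-of-memory errors as a `RuntimeError` subclass whose message contains "out of memory".

```python
GPU_FULL_MESSAGE = (
    "GPU memory full. Please stop other GPU services "
    "(gpu-service stop all) and try again."
)


def is_oom(exc):
    """Heuristic: treat errors mentioning 'out of memory' as GPU OOM."""
    return "out of memory" in str(exc).lower()


def load_for_job(job, load_model):
    """Load the model for a job; on OOM mark the job failed and return None."""
    try:
        model = load_model()
    except RuntimeError as exc:
        if not is_oom(exc):
            raise  # unrelated failure: do not mask it as a VRAM problem
        job["status"] = "failed"
        job["error"] = GPU_FULL_MESSAGE
        return None
    job["status"] = "running"  # hypothetical status name
    return model
```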

Manual (RAG):

Before starting RAG, check for conflicts:

gpu-service status

If Whisper or TTS is using the GPU:

Wait for auto-unload (check logs for "unloading model" message)
Or force stop: gpu-service stop all

Then start RAG:

gpu-service start rag

Emergency: all services stuck

sudo systemctl stop whisper-web rag-library-ingest qwen3-tts

Or kill GPU processes directly (last resort):

sudo pkill -9 -f "whisper-service|rag-library|qwen3-tts"

Why lazy-loading + manual control

VRAM limit: 8GB is not enough to run all three services simultaneously
Sporadic use: Whisper, RAG, and TTS are used infrequently, not 24/7
Resource efficiency: GPU idle when not needed
User experience: Frontends always accessible, no manual service management needed

Design decisions:

✅ Auto-loading (Whisper/TTS): Frontend always available, GPU loads on demand
- No CUDA OOM on startup (model loads when first job arrives)
- Auto-unload after idle timeout (frees VRAM for other services)
- Failsafe: if GPU memory full, job fails with clear message
⚠️ Manual control (RAG): Continuous processing when active
- No auto-unload (watcher runs continuously until stopped)
- Requires explicit gpu-service start rag before use
- Prevents unexpected VRAM usage when uploading large batches

Alternative approaches considered but rejected:

❌ Smaller models: Qwen3-TTS 0.6B has noticeably lower quality
❌ Shared VRAM pool: Not supported by PyTorch/CUDA without full model unloading
❌ Always-on all services: Exceeds 8GB VRAM capacity

Thunderbolt Hot-Unplug Caveat

Observed on thebeachlab (NUC11TNKi3 + Razer Core X + RTX 2070 SUPER, June 2026):

Normal boot with eGPU attached works
boltctl shows Razer Core X as authorized
nvidia-smi works
GPU services (comfyui, qwen3-tts, whisper-web) can use the card normally

What breaks it reliably:

Unplugging the Thunderbolt cable while the eGPU is live
Reconnecting the cable in the same runtime session

What Linux reports when it breaks:

thunderbolt 0-3: device disconnected
pcieport 0000:00:07.0: pciehp: Slot(0): Link Down
NVRM: Xid (PCI:0000:04:00): 79, GPU has fallen off the bus.
NVRM: Xid (PCI:0000:04:00): 154, GPU recovery action changed ... GPU Reset Required
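Kernel lines like these can be picked out of `journalctl -k` output programmatically. A minimal sketch, assuming the line format shown above; the name `parse_xid` and the `REBOOT_CODES` set are illustrative and reflect only what was observed on this host, not a general NVIDIA classification.

```python
import re

# Matches e.g. "NVRM: Xid (PCI:0000:04:00): 79, GPU has fallen off the bus."
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+),\s*(?P<text>.*)"
)

# Xid codes seen in the log excerpt above; on this host both were
# followed by a reboot (assumption drawn from this document only).
REBOOT_CODES = {79, 154}


def parse_xid(line):
    """Return bus, code, and message for an Xid log line, or None."""
    match = XID_RE.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "bus": match.group("bus"),
        "code": code,
        "text": match.group("text").strip(),
        "needs_reboot": code in REBOOT_CODES,
    }
```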

Typical broken-state symptoms after reconnect:

Core X light comes back on
boltctl may return to authorized
lspci may still list the NVIDIA device, sometimes as rev ff
nvidia-smi fails with "No devices were found"
Server fan can ramp hard during the failure window

Operational rule:

Do not hot-unplug or hot-replug the Thunderbolt cable while GPU workloads are active
Treat the eGPU cable as effectively non-hot-swappable for production use on this host

Recovery:

Stop touching the Thunderbolt cable
Reboot the host
Re-check:

boltctl list
lspci | grep -i nvidia
nvidia-smi

If the reboot path does not recover cleanly, escalate to full power-off / power-on.

Notes from local testing:

Updating BIOS from 0073 to 0078 improved overall stability but did not make hot-unplug safe
pcie_port_pm=off is currently kept as part of the stable baseline while investigating
The issue matches known Linux/NVIDIA/Thunderbolt reports around Xid 79 and "fallen off the bus"

Recovery after dead PSU/GPU

Historical context: The Razer Core X PSU died on 2026-03-04, so GPU services were disabled for a while to avoid continuous errors and freezes.

Current state (restored on 2026-06-10)

| Service | State |
| --- | --- |
| `whisper-web` | enabled + active |
| `qwen3-tts` | enabled + active |
| `comfyui` | enabled + active |
| `nvidia-persistenced` | enabled + active |
| `egpu-watchdog.timer` | enabled + active |
| Telegraf `inputs.nvidia_smi` | enabled |

Current runtime after restore:

eGPU: Razer Core X authorized via Thunderbolt
GPU: NVIDIA GeForce RTX 2070 SUPER
Driver: 595.71.05
nvidia-smi: OK
Persistence mode: ON

Restore procedure (if the eGPU disappears again)

1. Prefer cold boot, not hot-plug:

Power off host
Connect/power the Razer Core X
Wait 5-10 seconds
Boot host

2. Verify the GPU is visible:

boltctl
lspci | grep -i nvidia
nvidia-smi

If nvidia-smi fails, try:

sudo modprobe nvidia
nvidia-smi

3. Re-enable GPU services if they were disabled:

sudo systemctl enable --now nvidia-persistenced
sudo systemctl enable --now whisper-web
sudo systemctl enable --now qwen3-tts
sudo systemctl enable --now comfyui
sudo systemctl enable --now egpu-watchdog.timer

4. Re-enable Telegraf monitoring if needed:

Ensure /etc/telegraf/telegraf.d/nuc-timescale.conf contains:

[[inputs.nvidia_smi]]
  bin_path = "/usr/local/bin/nvidia-smi-safe.sh"
  timeout = "5s"

Then:

sudo systemctl restart telegraf
sudo journalctl -u telegraf -n 10 --no-pager | grep -E "Error|nvidia"

5. Verify watchdog + heartbeat instrumentation:

systemctl status egpu-watchdog.timer host-heartbeat-log.timer --no-pager
tail -n 20 /var/log/host-heartbeat.log

Expected behavior:

one alert when the eGPU is lost
one alert when it recovers
no repeating half-hour alerts while it remains missing
heartbeat log includes explicit transition lines such as:

event=egpu_lost last_ok=2026-06-15T05:03:29Z detected_at=2026-06-15T05:04:33Z
event=egpu_recovered missing_since=2026-06-15T05:04:33Z detected_at=2026-06-15T05:18:12Z
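The "one alert per transition" behavior amounts to edge-triggered alerting: act only when the state changes, not on every check. The watchdog's actual implementation is not shown in this document; the sketch below, with the hypothetical function `transition_event`, illustrates the idea and reuses the log line format above.

```python
def transition_event(was_ok, is_ok, now, last_ok=None, missing_since=None):
    """Return a log line only when the eGPU state changes, else None.

    was_ok / is_ok: GPU visibility at the previous and current check.
    now, last_ok, missing_since: ISO 8601 timestamp strings.
    """
    if was_ok and not is_ok:
        return f"event=egpu_lost last_ok={last_ok} detected_at={now}"
    if not was_ok and is_ok:
        return (f"event=egpu_recovered missing_since={missing_since} "
                f"detected_at={now}")
    return None  # no change: no alert, so nothing repeats while it stays missing
```

A caller would persist the previous state between timer runs (for example in a small state file) and send an alert only when this function returns a line.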

6. Verify telemetry:

DRY_RUN=true bash /home/pink/.openclaw/workspace/scripts/publish_telemetry.sh | python3 -m json.tool | grep gpu

The gpu field should show real temp/util values instead of null.
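This check could be scripted rather than read by eye. The payload's exact shape is not documented here, so the sketch assumes only that the JSON has a top-level `gpu` field that is either null or an object of readings; the key names used in the usage comment (`temp`, `util`) are hypothetical.

```python
import json


def gpu_field_ok(payload_text):
    """Return True if the 'gpu' field exists and contains no null values."""
    gpu = json.loads(payload_text).get("gpu")
    if gpu is None:
        return False
    if isinstance(gpu, dict):
        return all(value is not None for value in gpu.values())
    return True


# Usage (hypothetical payload shape):
# gpu_field_ok('{"gpu": {"temp": 41, "util": 3}}')
```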

7. Quick service test:

curl -I http://localhost:8060/      # whisper-web
curl -I http://localhost:8070/      # qwen3-tts
curl -I http://localhost:8188/      # comfyui
curl http://localhost:8060/openapi.json | jq '.info'
curl http://localhost:8070/openapi.json | jq '.info'

Note: whisper-web and qwen3-tts do not expose /health; use /, /docs, or /openapi.json instead.

Log checks

journalctl -k -b | grep -iE 'NVRM|Xid|nvidia|thunderbolt|bolt'

On the 2026-06-10 restore there were no Xid errors after boot. Only one benign-looking line appeared during bring-up:

nvidia-gpu 0000:04:00.3: i2c timeout error e0000000
