feat(llama-server): Gemma 4 26B MoE via llama-swap with idle unload#2715
Conversation
Replace Qwen3.5-9B dense hybrid DeltaNet (capped at ~8 t/s on the 4070 because recurrent kernels are memory-bandwidth-bound with low arithmetic intensity) with Gemma 4 26B-A4B pure MoE (25.2B total / 3.8B active). Config highlights: - --fit on auto-probes VRAM and picks the optimal --n-cpu-moe split; 512 MiB headroom accommodates CUDA compute-buffer growth during the first prefill on an 11.8 GiB effective-free card. - --batch-size / --ubatch-size 1024 (not 2048) so --fit has room to keep more experts on GPU at the 12 GiB class. - --parallel 2 so moltis and opencode/pr-reviewer do not serialise. - --swa-full + --slot-save-path for persistent slots; --cache-reuse omitted because Gemma 4's shared KV cache architecture breaks it (ggml-org/llama.cpp#21468). - --mlock + --no-mmap prevent mid-generation paging stalls on the CPU-side experts; memory limit raised to 20Gi for prefill peaks. Known caveat: b8840 still ships the unmerged Gemma 4 tool-calling fixes (ggml-org/llama.cpp#21326, #21343). Tool-call emission may include garbage tokens until a newer build lands upstream.
|
|
Overall Grade |
Security Reliability Complexity Hygiene |
Code Review Summary
| Analyzer | Status | Updated (UTC) | Details |
|---|---|---|---|
| JavaScript | Apr 20, 2026 8:15p.m. | Review ↗ | |
| Shell | Apr 20, 2026 8:15p.m. | Review ↗ |
Important
AI Review is run only on demand for your team. We're only showing results of static analysis review right now. To trigger AI Review, comment @deepsourcebot review on this thread.
|
✅ Automated recommendation: APPROVE Analysis engine: self-hosted@http://llama-server.ai.svc.cluster.local/v1 PR Review: feat(llama-server): swap to Gemma 4 26B-A4B MoE with --fit auto-sizingRecommendationAPPROVE - The PR implements a well-documented model swap with appropriate resource adjustments and clear reasoning for each configuration change. Change-by-Change FindingsModel Swap (Qwen3.5-9B → Gemma 4 26B-A4B)
--fit Auto-Sizing
Parallelism Adjustment
Memory Locking
Batch Size Configuration
Sliding Window Attention
Resource Adjustments
Status: ✅ Changes are conservative and well-commented Probe Configuration
Known Caveat
Sources
Standards Compliance
Linked Issue FitNo linked issue found. The PR body provides detailed implementation guidance and acceptance criteria within the description itself. Consider linking to a follow-up issue for tracking the known caveat about tool-calling fixes. Unknowns / Needs Verification
SummaryThis is a well-structured PR with clear reasoning for each configuration change. The model swap from Qwen3.5-9B to Gemma 4 26B-A4B MoE is appropriately configured for the hardware constraints. The only concerns are the missing linked issue and the documented caveat about tool-calling fixes, which are noted in the PR body. |
--- kubernetes/apps/ai/llama-server/app Kustomization: ai/llama-server HelmRelease: ai/llama-server
+++ kubernetes/apps/ai/llama-server/app Kustomization: ai/llama-server HelmRelease: ai/llama-server
@@ -24,96 +24,71 @@
retries: 2
strategy:
name: RemediateOnFailure
values:
controllers:
app:
+ annotations:
+ reloader.stakater.com/auto: 'true'
containers:
app:
args:
- - --host
- - 0.0.0.0
- - --alias
- - self-hosted
- - -hf
- - unsloth/Qwen3.5-9B-GGUF:UD-Q5_K_XL
- - --jinja
- - --ctx-size
- - '160000'
- - --parallel
- - '4'
- - --n-gpu-layers
- - '9999'
- - --cache-type-k
- - q8_0
- - --cache-type-v
- - q8_0
- - --override-tensor
- - token_embd\.weight=CUDA0
- - --no-context-shift
- - --threads
- - '4'
- - --threads-batch
- - '4'
- - --temp
- - '0.6'
- - --top-p
- - '0.95'
- - --top-k
- - '20'
- - --metrics
+ - -config
+ - /app/config.yaml
+ - -listen
+ - 0.0.0.0:8080
command:
- - /app/llama-server
+ - /app/llama-swap
env:
HF_HOME: /cache
TZ: Europe/Brussels
image:
- repository: ghcr.io/ggml-org/llama.cpp
- tag: server-cuda@sha256:66664a81cd0476baa150a6063dfb1054e44f99d5f5b09f9094aae6dc68fc8247
+ repository: ghcr.io/mostlygeek/llama-swap
+ tag: cuda@sha256:5533c89b4a7e894f7614a038250591375cec7b5bbb04847ab6744323fc3819ec
probes:
liveness:
custom: true
enabled: true
spec:
- failureThreshold: 6
+ failureThreshold: 3
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 30
timeoutSeconds: 5
readiness:
custom: true
enabled: true
spec:
- failureThreshold: 6
+ failureThreshold: 3
httpGet:
path: /health
port: 8080
- initialDelaySeconds: 10
+ initialDelaySeconds: 5
periodSeconds: 10
- timeoutSeconds: 5
+ timeoutSeconds: 3
startup:
custom: true
enabled: true
spec:
- failureThreshold: 180
+ failureThreshold: 12
httpGet:
path: /health
port: 8080
- initialDelaySeconds: 10
+ initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
resources:
limits:
- cpu: 4
- memory: 12Gi
+ cpu: 6
+ memory: 20Gi
nvidia.com/gpu: 1
requests:
- cpu: 500m
- memory: 2Gi
+ cpu: 1
+ memory: 1Gi
nvidia.com/gpu: 1
defaultPodOptions:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
@@ -126,12 +101,19 @@
persistence:
cache:
enabled: true
existingClaim: llama-server
globalMounts:
- path: /cache
+ config:
+ globalMounts:
+ - path: /app/config.yaml
+ readOnly: true
+ subPath: config.yaml
+ name: llama-server-config
+ type: configMap
service:
app:
controller: app
ports:
http:
port: 80
--- kubernetes/apps/ai/llama-server/app Kustomization: ai/llama-server ConfigMap: ai/llama-server-config
+++ kubernetes/apps/ai/llama-server/app Kustomization: ai/llama-server ConfigMap: ai/llama-server-config
@@ -0,0 +1,31 @@
+---
+apiVersion: v1
+data:
+ config.yaml: "---\n# Single GPU: only one model runs at a time \u2014 llama-swap\
+ \ stops the running\n# child before starting the next when a request selects a\
+ \ different model.\n\n# seconds; upper bound for a cold load (large MoE + --no-mmap\
+ \ full RAM copy)\n# to report ready. Default 120s is too tight for our ~60s load\
+ \ + warmup.\nhealthCheckTimeout: 900\n\nmacros:\n # Shared flags; is llama-swap's\
+ \ per-model auto-assigned port.\n common: >\n --host 127.0.0.1 --port \n \
+ \ --jinja\n --cache-type-k q8_0 --cache-type-v q8_0\n --temp 0.6 --top-k\
+ \ 20\n --metrics\n\nmodels:\n gemma-4:\n # 15 min idle unloads; next request\
+ \ reloads (~60s cold start from /cache PVC).\n ttl: 900\n aliases:\n \
+ \ - self-hosted\n # --n-gpu-layers omitted intentionally: MoE (26B total\
+ \ / 4B active per token)\n # stays fast on partial CPU offload; llama.cpp auto-picks\
+ \ layers to fit VRAM.\n cmd: >\n /app/llama-server\n \n -hf\
+ \ unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q3_K_XL\n --fit-ctx 100000\n --fit-target\
+ \ 512\n --parallel 2\n --no-mmap\n --batch-size 1024\n --ubatch-size\
+ \ 1024\n --swa-full\n --slot-save-path /cache/slots\n --min-p 0.0\n\
+ \n qwen-3.5:\n ttl: 900\n cmd: >\n /app/llama-server\n \n \
+ \ -hf unsloth/Qwen3.5-9B-GGUF:UD-Q5_K_XL\n --ctx-size 160000\n --parallel\
+ \ 4\n --n-gpu-layers 9999\n --override-tensor token_embd\\.weight=CUDA0\n\
+ \ --no-context-shift\n --threads 4\n --threads-batch 4\n --top-p\
+ \ 0.95\n"
+kind: ConfigMap
+metadata:
+ labels:
+ kustomize.toolkit.fluxcd.io/name: llama-server
+ kustomize.toolkit.fluxcd.io/namespace: ai
+ name: llama-server-config
+ namespace: ai
+ |
--- HelmRelease: ai/llama-server Deployment: ai/llama-server
+++ HelmRelease: ai/llama-server Deployment: ai/llama-server
@@ -5,12 +5,14 @@
name: llama-server
labels:
app.kubernetes.io/controller: app
app.kubernetes.io/instance: llama-server
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: llama-server
+ annotations:
+ reloader.stakater.com/auto: 'true'
namespace: ai
spec:
revisionHistoryLimit: 3
replicas: 1
strategy:
type: Recreate
@@ -50,87 +52,67 @@
- key: nvidia.com/gpu
operator: In
values:
- 'true'
containers:
- args:
- - --host
- - 0.0.0.0
- - --alias
- - self-hosted
- - -hf
- - unsloth/Qwen3.5-9B-GGUF:UD-Q5_K_XL
- - --jinja
- - --ctx-size
- - '160000'
- - --parallel
- - '4'
- - --n-gpu-layers
- - '9999'
- - --cache-type-k
- - q8_0
- - --cache-type-v
- - q8_0
- - --override-tensor
- - token_embd\.weight=CUDA0
- - --no-context-shift
- - --threads
- - '4'
- - --threads-batch
- - '4'
- - --temp
- - '0.6'
- - --top-p
- - '0.95'
- - --top-k
- - '20'
- - --metrics
+ - -config
+ - /app/config.yaml
+ - -listen
+ - 0.0.0.0:8080
command:
- - /app/llama-server
+ - /app/llama-swap
env:
- name: HF_HOME
value: /cache
- name: TZ
value: Europe/Brussels
- image: ghcr.io/ggml-org/llama.cpp:server-cuda@sha256:66664a81cd0476baa150a6063dfb1054e44f99d5f5b09f9094aae6dc68fc8247
+ image: ghcr.io/mostlygeek/llama-swap:cuda@sha256:5533c89b4a7e894f7614a038250591375cec7b5bbb04847ab6744323fc3819ec
livenessProbe:
- failureThreshold: 6
+ failureThreshold: 3
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 30
timeoutSeconds: 5
name: app
readinessProbe:
- failureThreshold: 6
+ failureThreshold: 3
httpGet:
path: /health
port: 8080
- initialDelaySeconds: 10
+ initialDelaySeconds: 5
periodSeconds: 10
- timeoutSeconds: 5
+ timeoutSeconds: 3
resources:
limits:
- cpu: 4
- memory: 12Gi
+ cpu: 6
+ memory: 20Gi
nvidia.com/gpu: 1
requests:
- cpu: 500m
- memory: 2Gi
+ cpu: 1
+ memory: 1Gi
nvidia.com/gpu: 1
startupProbe:
- failureThreshold: 180
+ failureThreshold: 12
httpGet:
path: /health
port: 8080
- initialDelaySeconds: 10
+ initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
volumeMounts:
- mountPath: /cache
name: cache
+ - mountPath: /app/config.yaml
+ name: config
+ readOnly: true
+ subPath: config.yaml
volumes:
- name: cache
persistentVolumeClaim:
claimName: llama-server
+ - configMap:
+ name: llama-server-config
+ name: config
|
…requests - Remove --fit on (matches default fit_params=true) - Remove --top-p 0.95 (matches llama.cpp default) - Remove --mlock (redundant alongside --no-mmap on swapless K8s) - Prune now-stale comments; keep WHY-focused ones only - Drop idle resource requests to cpu:1 / memory:1Gi ahead of llama-swap migration where baseline with no model loaded is near-zero
… annotation
- Extract shared flags into a `common` macro; both models reference `${common}`.
- Use `${PORT}`; drop hardcoded `proxy` (schema default matches).
- Tighten file header; note intentional `--n-gpu-layers` omission on gemma-4 MoE.
- Add `reloader.stakater.com/auto` to match peer ai-namespace apps.
Single source of truth for 8080 across probes.httpGet and service.targetPort.
Wraps llama.cpp in llama-swap so the running model unloads when idle and VRAM returns to the node. Gemma 4 26B-A4B MoE (25.2B total / 3.8B active) replaces the previous Qwen3.5-9B dense hybrid, which capped ~8 t/s on the 4070 because DeltaNet recurrent kernels are memory-bandwidth-bound. Qwen stays registered for on-demand use.
Models
gemma-4self-hosted(backward compat with existing consumers)qwen-3.5Only one runs at a time (single GPU); llama-swap stops the running child before starting the next. Idle models unload after 900 s; next request reloads in ~60 s cold-start from the
/cachePVC.healthCheckTimeout: 900gives the cold load time to report ready.Consumers keep hitting the unchanged endpoint
http://llama-server.ai.svc.cluster.local/v1and select a model via the request body'smodelfield.gemma-4 config notes
--fit-ctx 100000 --fit-target 512auto-probes VRAM and picks the--n-cpu-moesplit;--parallel 2keeps the KV cache on GPU.--no-mmap+--swa-full+--slot-save-path /cache/slotsso experts don't page mid-generation and slots survive swaps.--n-gpu-layersintentionally omitted — MoE (3.8B active per token) stays fast on partial CPU offload; llama.cpp auto-picks layers to fit VRAM.qwen-3.5 config notes
Retains its original flags:
--ctx-size 160000 --parallel 4 --n-gpu-layers 9999 --override-tensor token_embd\.weight=CUDA0 --no-context-shift --threads 4. Context-shift stays disabled because the hybrid DeltaNet+Attention architecture corrupts SSM state on shift.Shared flags (macros)
--host 127.0.0.1 --port ${PORT} --jinja --cache-type-k/v q8_0 --temp 0.6 --top-k 20 --metricsfactored into acommonmacro.${PORT}is llama-swap's auto-assigned per-model port.Ops
reloader.stakater.com/auto: "true"on the controller so ConfigMap edits recycle the pod (matches the otheraiapps)./health(always-200 regardless of model state). Startup budget 60 s since the proxy itself starts in < 5 s with no model loaded.20Gimemory limit accommodates--no-mmapfull-RAM weight copy + KV-cache overspill + OS.8080is defined once via&portinprobes.httpGet.portand referenced fromservice.targetPort.Known caveat
Gemma 4 tool-calling fixes (ggml-org/llama.cpp#21326, ggml-org/llama.cpp#21343) are not yet in the upstream llama.cpp bundled with the llama-swap image. Tool-call emission may include garbage tokens until a newer build lands upstream.