feat(llama-server): Gemma 4 26B MoE via llama-swap with idle unload by Tanguille · Pull Request #2715 · Tanguille/cluster

Tanguille · 2026-04-19T20:06:21Z

Wraps llama.cpp in llama-swap so the running model unloads when idle and VRAM returns to the node. Gemma 4 26B-A4B MoE (25.2B total / 3.8B active) replaces the previous Qwen3.5-9B dense hybrid, which capped ~8 t/s on the 4070 because DeltaNet recurrent kernels are memory-bandwidth-bound. Qwen stays registered for on-demand use.

Models

key	role	alias
`gemma-4`	primary	`self-hosted` (backward compat with existing consumers)
`qwen-3.5`	on-demand	—

Only one runs at a time (single GPU); llama-swap stops the running child before starting the next. Idle models unload after 900 s; next request reloads in ~60 s cold-start from the /cache PVC. healthCheckTimeout: 900 gives the cold load time to report ready.

Consumers keep hitting the unchanged endpoint http://llama-server.ai.svc.cluster.local/v1 and select a model via the request body's model field.

gemma-4 config notes

--fit-ctx 100000 --fit-target 512 auto-probes VRAM and picks the --n-cpu-moe split; --parallel 2 keeps the KV cache on GPU.
--no-mmap + --swa-full + --slot-save-path /cache/slots so experts don't page mid-generation and slots survive swaps.
--n-gpu-layers intentionally omitted — MoE (3.8B active per token) stays fast on partial CPU offload; llama.cpp auto-picks layers to fit VRAM.

qwen-3.5 config notes

Retains its original flags: --ctx-size 160000 --parallel 4 --n-gpu-layers 9999 --override-tensor token_embd\.weight=CUDA0 --no-context-shift --threads 4. Context-shift stays disabled because the hybrid DeltaNet+Attention architecture corrupts SSM state on shift.

Shared flags (macros)

--host 127.0.0.1 --port ${PORT} --jinja --cache-type-k/v q8_0 --temp 0.6 --top-k 20 --metrics factored into a common macro. ${PORT} is llama-swap's auto-assigned per-model port.

Ops

reloader.stakater.com/auto: "true" on the controller so ConfigMap edits recycle the pod (matches the other ai apps).
Probes hit llama-swap's /health (always-200 regardless of model state). Startup budget 60 s since the proxy itself starts in < 5 s with no model loaded.
20Gi memory limit accommodates --no-mmap full-RAM weight copy + KV-cache overspill + OS.
Port 8080 is defined once via &port in probes.httpGet.port and referenced from service.targetPort.

Known caveat

Gemma 4 tool-calling fixes (ggml-org/llama.cpp#21326, ggml-org/llama.cpp#21343) are not yet in the upstream llama.cpp bundled with the llama-swap image. Tool-call emission may include garbage tokens until a newer build lands upstream.

Replace Qwen3.5-9B dense hybrid DeltaNet (capped at ~8 t/s on the 4070 because recurrent kernels are memory-bandwidth-bound with low arithmetic intensity) with Gemma 4 26B-A4B pure MoE (25.2B total / 3.8B active). Config highlights: - --fit on auto-probes VRAM and picks the optimal --n-cpu-moe split; 512 MiB headroom accommodates CUDA compute-buffer growth during the first prefill on an 11.8 GiB effective-free card. - --batch-size / --ubatch-size 1024 (not 2048) so --fit has room to keep more experts on GPU at the 12 GiB class. - --parallel 2 so moltis and opencode/pr-reviewer do not serialise. - --swa-full + --slot-save-path for persistent slots; --cache-reuse omitted because Gemma 4's shared KV cache architecture breaks it (ggml-org/llama.cpp#21468). - --mlock + --no-mmap prevent mid-generation paging stalls on the CPU-side experts; memory limit raised to 20Gi for prefill peaks. Known caveat: b8840 still ships the unmerged Gemma 4 tool-calling fixes (ggml-org/llama.cpp#21326, #21343). Tool-call emission may include garbage tokens until a newer build lands upstream.

deepsource-io · 2026-04-19T20:06:39Z

DeepSource Code Review

We reviewed changes in b98b385...a513d1b on this pull request. Below is the summary for the review, and you can see the individual issues we found as inline review comments.

See full review on DeepSource ↗

PR Report Card

Overall Grade	Security Reliability Complexity Hygiene

Code Review Summary

Analyzer	Status	Updated (UTC)	Details
JavaScript		Apr 20, 2026 8:15p.m.	Review ↗
Shell		Apr 20, 2026 8:15p.m.	Review ↗

Important

AI Review is run only on demand for your team. We're only showing results of static analysis review right now. To trigger AI Review, comment @deepsourcebot review on this thread.

github-actions · 2026-04-19T20:12:50Z

✅ Automated recommendation: APPROVE

Analysis engine: self-hosted@http://llama-server.ai.svc.cluster.local/v1

PR Review: feat(llama-server): swap to Gemma 4 26B-A4B MoE with --fit auto-sizing

Recommendation

APPROVE - The PR implements a well-documented model swap with appropriate resource adjustments and clear reasoning for each configuration change.

Change-by-Change Findings

Model Swap (Qwen3.5-9B → Gemma 4 26B-A4B)

Change: unsloth/Qwen3.5-9B-GGUF:UD-Q5_K_XL → unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q3_K_XL
Rationale: Switching from dense hybrid DeltaNet (memory-bandwidth-bound) to pure MoE architecture (25.2B total / 3.8B active)
Status: ✅ Correctly implemented

--fit Auto-Sizing

Change: Added --fit on with --fit-ctx 160000 and --fit-target 512
Rationale: Auto-probes VRAM and picks optimal --n-cpu-moe split; 512 MiB headroom accommodates CUDA compute-buffer growth during first prefill on 11.8 GiB effective-free card
Status: ✅ Well-documented with inline comments

Parallelism Adjustment

Change: --parallel 4 → --parallel 2
Rationale: Lets moltis and opencode/pr-reviewer overlap without serializing one behind the other's prefill
Status: ✅ Appropriate for multi-tenant workload

Memory Locking

Change: Added --no-mmap and --mlock
Rationale: Prevents mid-generation paging stalls on CPU-side experts
Status: ✅ Correct for MoE workloads

Batch Size Configuration

Change: --batch-size 1024 and --ubatch-size 1024 (removed --threads and --threads-batch)
Rationale: 1024 matches the 12 GiB VRAM class; 2048 crowds compute buffers and forces --fit to push more experts to CPU, hurting decode
Status: ✅ Well-reasoned tradeoff

Sliding Window Attention

Change: Added --swa-full and --slot-save-path /cache/slots
Rationale: Gemma 4 uses alternating sliding-window + global attention; full SWA cache avoids truncation on restored slots
Status: ✅ Correct for Gemma 4 architecture

Resource Adjustments

Resource	Old	New	Rationale
CPU Request	500m	4	MoE dispatch requires more cores
Memory Request	2Gi	4Gi	Base requirement for MoE
CPU Limit	4	6	Leave 2 cores for rook-ceph OSDs and kube-apiserver on 8-core control node
Memory Limit	12Gi	20Gi	mlocked experts (~3.7 GiB) + KV overflow + compute buffers at 160k ctx / ubatch 1024 peak near 18 GiB in prefill

Status: ✅ Changes are conservative and well-commented

Probe Configuration

Change: Updated failureThreshold for startup (180), liveness (6), readiness (6)
Rationale: First boot downloads ~13 GiB; 180 x 5s = 15 min startup time
Status: ✅ Appropriate for large model download

Known Caveat

Note: PR body mentions b8840 still ships unmerged Gemma 4 tool-calling fixes (Gemma 4 template parser fixes ggml-org/llama.cpp#21326, #21343)
Status: ⚠️ Documented but not mitigated - tool-call emission may include garbage tokens until newer build lands upstream

Sources

PR Body: Model swap rationale and configuration highlights
Inline Comments: Resource justification and tradeoff explanations
Issue References: cache reuse is not supported for Gemma 4 models despite -fa enabled and --swa-full ggml-org/llama.cpp#21468 (cache-reuse incompatibility)

Standards Compliance

Standard	Status	Notes
YAML Syntax	✅	2-space indent, LF endings, anchors used correctly
Kubernetes Naming	✅	lowercase-dashes, correct path structure
Resource Limits	✅	Well-documented with inline comments
Secrets	✅	No secrets exposed in manifest
Image Reference	✅	Digest-based reference for reproducibility

Linked Issue Fit

No linked issue found. The PR body provides detailed implementation guidance and acceptance criteria within the description itself. Consider linking to a follow-up issue for tracking the known caveat about tool-calling fixes.

Unknowns / Needs Verification

Image Digest Provenance: Metadata states "No image digest changes detected" but manifest clearly shows new image tag server-cuda@sha256:a92a756344dffd35e93a693f771113c0ec7753701d83bab5a65ec8aae325491c. Verify this is intentional and not a metadata fetch failure.
--fit Behavior: The --fit flag auto-probes VRAM - verify this works correctly on the target hardware (11.8 GiB effective-free card).
Tool-Calling Caveat: The known issue about unmerged tool-calling fixes should be tracked separately.

Summary

This is a well-structured PR with clear reasoning for each configuration change. The model swap from Qwen3.5-9B to Gemma 4 26B-A4B MoE is appropriately configured for the hardware constraints. The only concerns are the missing linked issue and the documented caveat about tool-calling fixes, which are noted in the PR body.

tanguille-cluster · 2026-04-20T18:29:57Z

--- kubernetes/apps/ai/llama-server/app Kustomization: ai/llama-server HelmRelease: ai/llama-server

+++ kubernetes/apps/ai/llama-server/app Kustomization: ai/llama-server HelmRelease: ai/llama-server

@@ -24,96 +24,71 @@

       retries: 2
     strategy:
       name: RemediateOnFailure
   values:
     controllers:
       app:
+        annotations:
+          reloader.stakater.com/auto: 'true'
         containers:
           app:
             args:
-            - --host
-            - 0.0.0.0
-            - --alias
-            - self-hosted
-            - -hf
-            - unsloth/Qwen3.5-9B-GGUF:UD-Q5_K_XL
-            - --jinja
-            - --ctx-size
-            - '160000'
-            - --parallel
-            - '4'
-            - --n-gpu-layers
-            - '9999'
-            - --cache-type-k
-            - q8_0
-            - --cache-type-v
-            - q8_0
-            - --override-tensor
-            - token_embd\.weight=CUDA0
-            - --no-context-shift
-            - --threads
-            - '4'
-            - --threads-batch
-            - '4'
-            - --temp
-            - '0.6'
-            - --top-p
-            - '0.95'
-            - --top-k
-            - '20'
-            - --metrics
+            - -config
+            - /app/config.yaml
+            - -listen
+            - 0.0.0.0:8080
             command:
-            - /app/llama-server
+            - /app/llama-swap
             env:
               HF_HOME: /cache
               TZ: Europe/Brussels
             image:
-              repository: ghcr.io/ggml-org/llama.cpp
-              tag: server-cuda@sha256:66664a81cd0476baa150a6063dfb1054e44f99d5f5b09f9094aae6dc68fc8247
+              repository: ghcr.io/mostlygeek/llama-swap
+              tag: cuda@sha256:5533c89b4a7e894f7614a038250591375cec7b5bbb04847ab6744323fc3819ec
             probes:
               liveness:
                 custom: true
                 enabled: true
                 spec:
-                  failureThreshold: 6
+                  failureThreshold: 3
                   httpGet:
                     path: /health
                     port: 8080
                   initialDelaySeconds: 30
                   periodSeconds: 30
                   timeoutSeconds: 5
               readiness:
                 custom: true
                 enabled: true
                 spec:
-                  failureThreshold: 6
+                  failureThreshold: 3
                   httpGet:
                     path: /health
                     port: 8080
-                  initialDelaySeconds: 10
+                  initialDelaySeconds: 5
                   periodSeconds: 10
-                  timeoutSeconds: 5
+                  timeoutSeconds: 3
               startup:
                 custom: true
                 enabled: true
                 spec:
-                  failureThreshold: 180
+                  failureThreshold: 12
                   httpGet:
                     path: /health
                     port: 8080
-                  initialDelaySeconds: 10
+                  initialDelaySeconds: 5
                   periodSeconds: 5
                   timeoutSeconds: 3
             resources:
               limits:
-                cpu: 4
-                memory: 12Gi
+                cpu: 6
+                memory: 20Gi
                 nvidia.com/gpu: 1
               requests:
-                cpu: 500m
-                memory: 2Gi
+                cpu: 1
+                memory: 1Gi
                 nvidia.com/gpu: 1
     defaultPodOptions:
       affinity:
         nodeAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
             nodeSelectorTerms:
@@ -126,12 +101,19 @@

     persistence:
       cache:
         enabled: true
         existingClaim: llama-server
         globalMounts:
         - path: /cache
+      config:
+        globalMounts:
+        - path: /app/config.yaml
+          readOnly: true
+          subPath: config.yaml
+        name: llama-server-config
+        type: configMap
     service:
       app:
         controller: app
         ports:
           http:
             port: 80
--- kubernetes/apps/ai/llama-server/app Kustomization: ai/llama-server ConfigMap: ai/llama-server-config

+++ kubernetes/apps/ai/llama-server/app Kustomization: ai/llama-server ConfigMap: ai/llama-server-config

@@ -0,0 +1,31 @@

+---
+apiVersion: v1
+data:
+  config.yaml: "---\n# Single GPU: only one model runs at a time \u2014 llama-swap\
+    \ stops the running\n# child before starting the next when a request selects a\
+    \ different model.\n\n# seconds; upper bound for a cold load (large MoE + --no-mmap\
+    \ full RAM copy)\n# to report ready. Default 120s is too tight for our ~60s load\
+    \ + warmup.\nhealthCheckTimeout: 900\n\nmacros:\n  # Shared flags;  is llama-swap's\
+    \ per-model auto-assigned port.\n  common: >\n    --host 127.0.0.1 --port \n \
+    \   --jinja\n    --cache-type-k q8_0 --cache-type-v q8_0\n    --temp 0.6 --top-k\
+    \ 20\n    --metrics\n\nmodels:\n  gemma-4:\n    # 15 min idle unloads; next request\
+    \ reloads (~60s cold start from /cache PVC).\n    ttl: 900\n    aliases:\n   \
+    \   - self-hosted\n    # --n-gpu-layers omitted intentionally: MoE (26B total\
+    \ / 4B active per token)\n    # stays fast on partial CPU offload; llama.cpp auto-picks\
+    \ layers to fit VRAM.\n    cmd: >\n      /app/llama-server\n      \n      -hf\
+    \ unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q3_K_XL\n      --fit-ctx 100000\n      --fit-target\
+    \ 512\n      --parallel 2\n      --no-mmap\n      --batch-size 1024\n      --ubatch-size\
+    \ 1024\n      --swa-full\n      --slot-save-path /cache/slots\n      --min-p 0.0\n\
+    \n  qwen-3.5:\n    ttl: 900\n    cmd: >\n      /app/llama-server\n      \n   \
+    \   -hf unsloth/Qwen3.5-9B-GGUF:UD-Q5_K_XL\n      --ctx-size 160000\n      --parallel\
+    \ 4\n      --n-gpu-layers 9999\n      --override-tensor token_embd\\.weight=CUDA0\n\
+    \      --no-context-shift\n      --threads 4\n      --threads-batch 4\n      --top-p\
+    \ 0.95\n"
+kind: ConfigMap
+metadata:
+  labels:
+    kustomize.toolkit.fluxcd.io/name: llama-server
+    kustomize.toolkit.fluxcd.io/namespace: ai
+  name: llama-server-config
+  namespace: ai
+

tanguille-cluster · 2026-04-20T18:30:03Z

--- HelmRelease: ai/llama-server Deployment: ai/llama-server

+++ HelmRelease: ai/llama-server Deployment: ai/llama-server

@@ -5,12 +5,14 @@

   name: llama-server
   labels:
     app.kubernetes.io/controller: app
     app.kubernetes.io/instance: llama-server
     app.kubernetes.io/managed-by: Helm
     app.kubernetes.io/name: llama-server
+  annotations:
+    reloader.stakater.com/auto: 'true'
   namespace: ai
 spec:
   revisionHistoryLimit: 3
   replicas: 1
   strategy:
     type: Recreate
@@ -50,87 +52,67 @@

               - key: nvidia.com/gpu
                 operator: In
                 values:
                 - 'true'
       containers:
       - args:
-        - --host
-        - 0.0.0.0
-        - --alias
-        - self-hosted
-        - -hf
-        - unsloth/Qwen3.5-9B-GGUF:UD-Q5_K_XL
-        - --jinja
-        - --ctx-size
-        - '160000'
-        - --parallel
-        - '4'
-        - --n-gpu-layers
-        - '9999'
-        - --cache-type-k
-        - q8_0
-        - --cache-type-v
-        - q8_0
-        - --override-tensor
-        - token_embd\.weight=CUDA0
-        - --no-context-shift
-        - --threads
-        - '4'
-        - --threads-batch
-        - '4'
-        - --temp
-        - '0.6'
-        - --top-p
-        - '0.95'
-        - --top-k
-        - '20'
-        - --metrics
+        - -config
+        - /app/config.yaml
+        - -listen
+        - 0.0.0.0:8080
         command:
-        - /app/llama-server
+        - /app/llama-swap
         env:
         - name: HF_HOME
           value: /cache
         - name: TZ
           value: Europe/Brussels
-        image: ghcr.io/ggml-org/llama.cpp:server-cuda@sha256:66664a81cd0476baa150a6063dfb1054e44f99d5f5b09f9094aae6dc68fc8247
+        image: ghcr.io/mostlygeek/llama-swap:cuda@sha256:5533c89b4a7e894f7614a038250591375cec7b5bbb04847ab6744323fc3819ec
         livenessProbe:
-          failureThreshold: 6
+          failureThreshold: 3
           httpGet:
             path: /health
             port: 8080
           initialDelaySeconds: 30
           periodSeconds: 30
           timeoutSeconds: 5
         name: app
         readinessProbe:
-          failureThreshold: 6
+          failureThreshold: 3
           httpGet:
             path: /health
             port: 8080
-          initialDelaySeconds: 10
+          initialDelaySeconds: 5
           periodSeconds: 10
-          timeoutSeconds: 5
+          timeoutSeconds: 3
         resources:
           limits:
-            cpu: 4
-            memory: 12Gi
+            cpu: 6
+            memory: 20Gi
             nvidia.com/gpu: 1
           requests:
-            cpu: 500m
-            memory: 2Gi
+            cpu: 1
+            memory: 1Gi
             nvidia.com/gpu: 1
         startupProbe:
-          failureThreshold: 180
+          failureThreshold: 12
           httpGet:
             path: /health
             port: 8080
-          initialDelaySeconds: 10
+          initialDelaySeconds: 5
           periodSeconds: 5
           timeoutSeconds: 3
         volumeMounts:
         - mountPath: /cache
           name: cache
+        - mountPath: /app/config.yaml
+          name: config
+          readOnly: true
+          subPath: config.yaml
       volumes:
       - name: cache
         persistentVolumeClaim:
           claimName: llama-server
+      - configMap:
+          name: llama-server-config
+        name: config

…requests - Remove --fit on (matches default fit_params=true) - Remove --top-p 0.95 (matches llama.cpp default) - Remove --mlock (redundant alongside --no-mmap on swapless K8s) - Prune now-stale comments; keep WHY-focused ones only - Drop idle resource requests to cpu:1 / memory:1Gi ahead of llama-swap migration where baseline with no model loaded is near-zero

…odels

…d qwen-3.5

… annotation - Extract shared flags into a `common` macro; both models reference `${common}`. - Use `${PORT}`; drop hardcoded `proxy` (schema default matches). - Tighten file header; note intentional `--n-gpu-layers` omission on gemma-4 MoE. - Add `reloader.stakater.com/auto` to match peer ai-namespace apps.

Single source of truth for 8080 across probes.httpGet and service.targetPort.

tanguille-cluster Bot added the area/kubernetes label Apr 19, 2026

Merge branch 'main' into feat/llama-server-gemma4-moe

993dce4

Tanguille added 7 commits April 20, 2026 21:29

feat(llama-server): add llama-swap config with gemma-4 and qwen-3.5 m…

d24969b

…odels

feat(llama-server): generate llama-server-config ConfigMap

b0d172a

feat(llama-server): run llama-swap with idle unload across gemma-4 an…

f4ac3e3

…d qwen-3.5

refactor(llama-server): anchor container port so service reuses it

278a25f

Single source of truth for 8080 across probes.httpGet and service.targetPort.

Merge branch 'main' into feat/llama-server-gemma4-moe

a513d1b

Tanguille changed the title ~~feat(llama-server): swap to Gemma 4 26B-A4B MoE with --fit auto-sizing~~ feat(llama-server): Gemma 4 26B MoE via llama-swap with idle unload Apr 20, 2026

Tanguille merged commit 6da5fe3 into main Apr 20, 2026
20 of 21 checks passed

Tanguille deleted the feat/llama-server-gemma4-moe branch April 20, 2026 20:19

github-actions Bot mentioned this pull request Apr 21, 2026

feat(ai/llama-server): migrate to ik_llama.cpp with explicit MoE expert offload #2739

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(llama-server): Gemma 4 26B MoE via llama-swap with idle unload#2715

feat(llama-server): Gemma 4 26B MoE via llama-swap with idle unload#2715
Tanguille merged 9 commits into
mainfrom
feat/llama-server-gemma4-moe

Tanguille commented Apr 19, 2026 •

edited

Loading

Uh oh!

deepsource-io Bot commented Apr 19, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

tanguille-cluster Bot commented Apr 20, 2026 •

edited

Loading

Uh oh!

tanguille-cluster Bot commented Apr 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Tanguille commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Models

gemma-4 config notes

qwen-3.5 config notes

Shared flags (macros)

Ops

Known caveat

Uh oh!

deepsource-io Bot commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

DeepSource Code Review

PR Report Card

Code Review Summary

Uh oh!

github-actions Bot commented Apr 19, 2026

PR Review: feat(llama-server): swap to Gemma 4 26B-A4B MoE with --fit auto-sizing

Recommendation

Change-by-Change Findings

Model Swap (Qwen3.5-9B → Gemma 4 26B-A4B)

--fit Auto-Sizing

Parallelism Adjustment

Memory Locking

Batch Size Configuration

Sliding Window Attention

Resource Adjustments

Probe Configuration

Known Caveat

Sources

Standards Compliance

Linked Issue Fit

Unknowns / Needs Verification

Summary

Uh oh!

tanguille-cluster Bot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tanguille-cluster Bot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Tanguille commented Apr 19, 2026 •

edited

Loading

deepsource-io Bot commented Apr 19, 2026 •

edited

Loading

tanguille-cluster Bot commented Apr 20, 2026 •

edited

Loading

tanguille-cluster Bot commented Apr 20, 2026 •

edited

Loading