Skip to content

feat(llama-server): Gemma 4 26B MoE via llama-swap with idle unload#2715

Merged
Tanguille merged 9 commits into
mainfrom
feat/llama-server-gemma4-moe
Apr 20, 2026
Merged

feat(llama-server): Gemma 4 26B MoE via llama-swap with idle unload#2715
Tanguille merged 9 commits into
mainfrom
feat/llama-server-gemma4-moe

Conversation

@Tanguille
Copy link
Copy Markdown
Owner

@Tanguille Tanguille commented Apr 19, 2026

Wraps llama.cpp in llama-swap so the running model unloads when idle and VRAM returns to the node. Gemma 4 26B-A4B MoE (25.2B total / 3.8B active) replaces the previous Qwen3.5-9B dense hybrid, which capped ~8 t/s on the 4070 because DeltaNet recurrent kernels are memory-bandwidth-bound. Qwen stays registered for on-demand use.

Models

key role alias
gemma-4 primary self-hosted (backward compat with existing consumers)
qwen-3.5 on-demand

Only one runs at a time (single GPU); llama-swap stops the running child before starting the next. Idle models unload after 900 s; next request reloads in ~60 s cold-start from the /cache PVC. healthCheckTimeout: 900 gives the cold load time to report ready.

Consumers keep hitting the unchanged endpoint http://llama-server.ai.svc.cluster.local/v1 and select a model via the request body's model field.

gemma-4 config notes

  • --fit-ctx 100000 --fit-target 512 auto-probes VRAM and picks the --n-cpu-moe split; --parallel 2 keeps the KV cache on GPU.
  • --no-mmap + --swa-full + --slot-save-path /cache/slots so experts don't page mid-generation and slots survive swaps.
  • --n-gpu-layers intentionally omitted — MoE (3.8B active per token) stays fast on partial CPU offload; llama.cpp auto-picks layers to fit VRAM.

qwen-3.5 config notes

Retains its original flags: --ctx-size 160000 --parallel 4 --n-gpu-layers 9999 --override-tensor token_embd\.weight=CUDA0 --no-context-shift --threads 4. Context-shift stays disabled because the hybrid DeltaNet+Attention architecture corrupts SSM state on shift.

Shared flags (macros)

--host 127.0.0.1 --port ${PORT} --jinja --cache-type-k/v q8_0 --temp 0.6 --top-k 20 --metrics factored into a common macro. ${PORT} is llama-swap's auto-assigned per-model port.

Ops

  • reloader.stakater.com/auto: "true" on the controller so ConfigMap edits recycle the pod (matches the other ai apps).
  • Probes hit llama-swap's /health (always-200 regardless of model state). Startup budget 60 s since the proxy itself starts in < 5 s with no model loaded.
  • 20Gi memory limit accommodates --no-mmap full-RAM weight copy + KV-cache overspill + OS.
  • Port 8080 is defined once via &port in probes.httpGet.port and referenced from service.targetPort.

Known caveat

Gemma 4 tool-calling fixes (ggml-org/llama.cpp#21326, ggml-org/llama.cpp#21343) are not yet in the upstream llama.cpp bundled with the llama-swap image. Tool-call emission may include garbage tokens until a newer build lands upstream.

Replace Qwen3.5-9B dense hybrid DeltaNet (capped at ~8 t/s on the 4070
because recurrent kernels are memory-bandwidth-bound with low arithmetic
intensity) with Gemma 4 26B-A4B pure MoE (25.2B total / 3.8B active).

Config highlights:
- --fit on auto-probes VRAM and picks the optimal --n-cpu-moe split;
  512 MiB headroom accommodates CUDA compute-buffer growth during the
  first prefill on an 11.8 GiB effective-free card.
- --batch-size / --ubatch-size 1024 (not 2048) so --fit has room to
  keep more experts on GPU at the 12 GiB class.
- --parallel 2 so moltis and opencode/pr-reviewer do not serialise.
- --swa-full + --slot-save-path for persistent slots; --cache-reuse
  omitted because Gemma 4's shared KV cache architecture breaks it
  (ggml-org/llama.cpp#21468).
- --mlock + --no-mmap prevent mid-generation paging stalls on the
  CPU-side experts; memory limit raised to 20Gi for prefill peaks.

Known caveat: b8840 still ships the unmerged Gemma 4 tool-calling fixes
(ggml-org/llama.cpp#21326, #21343). Tool-call emission may include
garbage tokens until a newer build lands upstream.
@deepsource-io
Copy link
Copy Markdown
Contributor

deepsource-io Bot commented Apr 19, 2026

DeepSource Code Review

We reviewed changes in b98b385...a513d1b on this pull request. Below is the summary for the review, and you can see the individual issues we found as inline review comments.

See full review on DeepSource ↗

PR Report Card

Overall Grade   Security  

Reliability  

Complexity  

Hygiene  

Code Review Summary

Analyzer Status Updated (UTC) Details
JavaScript Apr 20, 2026 8:15p.m. Review ↗
Shell Apr 20, 2026 8:15p.m. Review ↗

Important

AI Review is run only on demand for your team. We're only showing results of static analysis review right now. To trigger AI Review, comment @deepsourcebot review on this thread.

@github-actions
Copy link
Copy Markdown

Automated recommendation: APPROVE

Analysis engine: self-hosted@http://llama-server.ai.svc.cluster.local/v1

PR Review: feat(llama-server): swap to Gemma 4 26B-A4B MoE with --fit auto-sizing

Recommendation

APPROVE - The PR implements a well-documented model swap with appropriate resource adjustments and clear reasoning for each configuration change.

Change-by-Change Findings

Model Swap (Qwen3.5-9B → Gemma 4 26B-A4B)

  • Change: unsloth/Qwen3.5-9B-GGUF:UD-Q5_K_XLunsloth/gemma-4-26B-A4B-it-GGUF:UD-Q3_K_XL
  • Rationale: Switching from dense hybrid DeltaNet (memory-bandwidth-bound) to pure MoE architecture (25.2B total / 3.8B active)
  • Status: ✅ Correctly implemented

--fit Auto-Sizing

  • Change: Added --fit on with --fit-ctx 160000 and --fit-target 512
  • Rationale: Auto-probes VRAM and picks optimal --n-cpu-moe split; 512 MiB headroom accommodates CUDA compute-buffer growth during first prefill on 11.8 GiB effective-free card
  • Status: ✅ Well-documented with inline comments

Parallelism Adjustment

  • Change: --parallel 4--parallel 2
  • Rationale: Lets moltis and opencode/pr-reviewer overlap without serializing one behind the other's prefill
  • Status: ✅ Appropriate for multi-tenant workload

Memory Locking

  • Change: Added --no-mmap and --mlock
  • Rationale: Prevents mid-generation paging stalls on CPU-side experts
  • Status: ✅ Correct for MoE workloads

Batch Size Configuration

  • Change: --batch-size 1024 and --ubatch-size 1024 (removed --threads and --threads-batch)
  • Rationale: 1024 matches the 12 GiB VRAM class; 2048 crowds compute buffers and forces --fit to push more experts to CPU, hurting decode
  • Status: ✅ Well-reasoned tradeoff

Sliding Window Attention

  • Change: Added --swa-full and --slot-save-path /cache/slots
  • Rationale: Gemma 4 uses alternating sliding-window + global attention; full SWA cache avoids truncation on restored slots
  • Status: ✅ Correct for Gemma 4 architecture

Resource Adjustments

Resource Old New Rationale
CPU Request 500m 4 MoE dispatch requires more cores
Memory Request 2Gi 4Gi Base requirement for MoE
CPU Limit 4 6 Leave 2 cores for rook-ceph OSDs and kube-apiserver on 8-core control node
Memory Limit 12Gi 20Gi mlocked experts (~3.7 GiB) + KV overflow + compute buffers at 160k ctx / ubatch 1024 peak near 18 GiB in prefill

Status: ✅ Changes are conservative and well-commented

Probe Configuration

  • Change: Updated failureThreshold for startup (180), liveness (6), readiness (6)
  • Rationale: First boot downloads ~13 GiB; 180 x 5s = 15 min startup time
  • Status: ✅ Appropriate for large model download

Known Caveat

Sources

Standards Compliance

Standard Status Notes
YAML Syntax 2-space indent, LF endings, anchors used correctly
Kubernetes Naming lowercase-dashes, correct path structure
Resource Limits Well-documented with inline comments
Secrets No secrets exposed in manifest
Image Reference Digest-based reference for reproducibility

Linked Issue Fit

No linked issue found. The PR body provides detailed implementation guidance and acceptance criteria within the description itself. Consider linking to a follow-up issue for tracking the known caveat about tool-calling fixes.

Unknowns / Needs Verification

  1. Image Digest Provenance: Metadata states "No image digest changes detected" but manifest clearly shows new image tag server-cuda@sha256:a92a756344dffd35e93a693f771113c0ec7753701d83bab5a65ec8aae325491c. Verify this is intentional and not a metadata fetch failure.
  2. --fit Behavior: The --fit flag auto-probes VRAM - verify this works correctly on the target hardware (11.8 GiB effective-free card).
  3. Tool-Calling Caveat: The known issue about unmerged tool-calling fixes should be tracked separately.

Summary

This is a well-structured PR with clear reasoning for each configuration change. The model swap from Qwen3.5-9B to Gemma 4 26B-A4B MoE is appropriately configured for the hardware constraints. The only concerns are the missing linked issue and the documented caveat about tool-calling fixes, which are noted in the PR body.

@tanguille-cluster
Copy link
Copy Markdown

tanguille-cluster Bot commented Apr 20, 2026

--- kubernetes/apps/ai/llama-server/app Kustomization: ai/llama-server HelmRelease: ai/llama-server

+++ kubernetes/apps/ai/llama-server/app Kustomization: ai/llama-server HelmRelease: ai/llama-server

@@ -24,96 +24,71 @@

       retries: 2
     strategy:
       name: RemediateOnFailure
   values:
     controllers:
       app:
+        annotations:
+          reloader.stakater.com/auto: 'true'
         containers:
           app:
             args:
-            - --host
-            - 0.0.0.0
-            - --alias
-            - self-hosted
-            - -hf
-            - unsloth/Qwen3.5-9B-GGUF:UD-Q5_K_XL
-            - --jinja
-            - --ctx-size
-            - '160000'
-            - --parallel
-            - '4'
-            - --n-gpu-layers
-            - '9999'
-            - --cache-type-k
-            - q8_0
-            - --cache-type-v
-            - q8_0
-            - --override-tensor
-            - token_embd\.weight=CUDA0
-            - --no-context-shift
-            - --threads
-            - '4'
-            - --threads-batch
-            - '4'
-            - --temp
-            - '0.6'
-            - --top-p
-            - '0.95'
-            - --top-k
-            - '20'
-            - --metrics
+            - -config
+            - /app/config.yaml
+            - -listen
+            - 0.0.0.0:8080
             command:
-            - /app/llama-server
+            - /app/llama-swap
             env:
               HF_HOME: /cache
               TZ: Europe/Brussels
             image:
-              repository: ghcr.io/ggml-org/llama.cpp
-              tag: server-cuda@sha256:66664a81cd0476baa150a6063dfb1054e44f99d5f5b09f9094aae6dc68fc8247
+              repository: ghcr.io/mostlygeek/llama-swap
+              tag: cuda@sha256:5533c89b4a7e894f7614a038250591375cec7b5bbb04847ab6744323fc3819ec
             probes:
               liveness:
                 custom: true
                 enabled: true
                 spec:
-                  failureThreshold: 6
+                  failureThreshold: 3
                   httpGet:
                     path: /health
                     port: 8080
                   initialDelaySeconds: 30
                   periodSeconds: 30
                   timeoutSeconds: 5
               readiness:
                 custom: true
                 enabled: true
                 spec:
-                  failureThreshold: 6
+                  failureThreshold: 3
                   httpGet:
                     path: /health
                     port: 8080
-                  initialDelaySeconds: 10
+                  initialDelaySeconds: 5
                   periodSeconds: 10
-                  timeoutSeconds: 5
+                  timeoutSeconds: 3
               startup:
                 custom: true
                 enabled: true
                 spec:
-                  failureThreshold: 180
+                  failureThreshold: 12
                   httpGet:
                     path: /health
                     port: 8080
-                  initialDelaySeconds: 10
+                  initialDelaySeconds: 5
                   periodSeconds: 5
                   timeoutSeconds: 3
             resources:
               limits:
-                cpu: 4
-                memory: 12Gi
+                cpu: 6
+                memory: 20Gi
                 nvidia.com/gpu: 1
               requests:
-                cpu: 500m
-                memory: 2Gi
+                cpu: 1
+                memory: 1Gi
                 nvidia.com/gpu: 1
     defaultPodOptions:
       affinity:
         nodeAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
             nodeSelectorTerms:
@@ -126,12 +101,19 @@

     persistence:
       cache:
         enabled: true
         existingClaim: llama-server
         globalMounts:
         - path: /cache
+      config:
+        globalMounts:
+        - path: /app/config.yaml
+          readOnly: true
+          subPath: config.yaml
+        name: llama-server-config
+        type: configMap
     service:
       app:
         controller: app
         ports:
           http:
             port: 80
--- kubernetes/apps/ai/llama-server/app Kustomization: ai/llama-server ConfigMap: ai/llama-server-config

+++ kubernetes/apps/ai/llama-server/app Kustomization: ai/llama-server ConfigMap: ai/llama-server-config

@@ -0,0 +1,31 @@

+---
+apiVersion: v1
+data:
+  config.yaml: "---\n# Single GPU: only one model runs at a time \u2014 llama-swap\
+    \ stops the running\n# child before starting the next when a request selects a\
+    \ different model.\n\n# seconds; upper bound for a cold load (large MoE + --no-mmap\
+    \ full RAM copy)\n# to report ready. Default 120s is too tight for our ~60s load\
+    \ + warmup.\nhealthCheckTimeout: 900\n\nmacros:\n  # Shared flags;  is llama-swap's\
+    \ per-model auto-assigned port.\n  common: >\n    --host 127.0.0.1 --port \n \
+    \   --jinja\n    --cache-type-k q8_0 --cache-type-v q8_0\n    --temp 0.6 --top-k\
+    \ 20\n    --metrics\n\nmodels:\n  gemma-4:\n    # 15 min idle unloads; next request\
+    \ reloads (~60s cold start from /cache PVC).\n    ttl: 900\n    aliases:\n   \
+    \   - self-hosted\n    # --n-gpu-layers omitted intentionally: MoE (26B total\
+    \ / 4B active per token)\n    # stays fast on partial CPU offload; llama.cpp auto-picks\
+    \ layers to fit VRAM.\n    cmd: >\n      /app/llama-server\n      \n      -hf\
+    \ unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q3_K_XL\n      --fit-ctx 100000\n      --fit-target\
+    \ 512\n      --parallel 2\n      --no-mmap\n      --batch-size 1024\n      --ubatch-size\
+    \ 1024\n      --swa-full\n      --slot-save-path /cache/slots\n      --min-p 0.0\n\
+    \n  qwen-3.5:\n    ttl: 900\n    cmd: >\n      /app/llama-server\n      \n   \
+    \   -hf unsloth/Qwen3.5-9B-GGUF:UD-Q5_K_XL\n      --ctx-size 160000\n      --parallel\
+    \ 4\n      --n-gpu-layers 9999\n      --override-tensor token_embd\\.weight=CUDA0\n\
+    \      --no-context-shift\n      --threads 4\n      --threads-batch 4\n      --top-p\
+    \ 0.95\n"
+kind: ConfigMap
+metadata:
+  labels:
+    kustomize.toolkit.fluxcd.io/name: llama-server
+    kustomize.toolkit.fluxcd.io/namespace: ai
+  name: llama-server-config
+  namespace: ai
+

@tanguille-cluster
Copy link
Copy Markdown

tanguille-cluster Bot commented Apr 20, 2026

--- HelmRelease: ai/llama-server Deployment: ai/llama-server

+++ HelmRelease: ai/llama-server Deployment: ai/llama-server

@@ -5,12 +5,14 @@

   name: llama-server
   labels:
     app.kubernetes.io/controller: app
     app.kubernetes.io/instance: llama-server
     app.kubernetes.io/managed-by: Helm
     app.kubernetes.io/name: llama-server
+  annotations:
+    reloader.stakater.com/auto: 'true'
   namespace: ai
 spec:
   revisionHistoryLimit: 3
   replicas: 1
   strategy:
     type: Recreate
@@ -50,87 +52,67 @@

               - key: nvidia.com/gpu
                 operator: In
                 values:
                 - 'true'
       containers:
       - args:
-        - --host
-        - 0.0.0.0
-        - --alias
-        - self-hosted
-        - -hf
-        - unsloth/Qwen3.5-9B-GGUF:UD-Q5_K_XL
-        - --jinja
-        - --ctx-size
-        - '160000'
-        - --parallel
-        - '4'
-        - --n-gpu-layers
-        - '9999'
-        - --cache-type-k
-        - q8_0
-        - --cache-type-v
-        - q8_0
-        - --override-tensor
-        - token_embd\.weight=CUDA0
-        - --no-context-shift
-        - --threads
-        - '4'
-        - --threads-batch
-        - '4'
-        - --temp
-        - '0.6'
-        - --top-p
-        - '0.95'
-        - --top-k
-        - '20'
-        - --metrics
+        - -config
+        - /app/config.yaml
+        - -listen
+        - 0.0.0.0:8080
         command:
-        - /app/llama-server
+        - /app/llama-swap
         env:
         - name: HF_HOME
           value: /cache
         - name: TZ
           value: Europe/Brussels
-        image: ghcr.io/ggml-org/llama.cpp:server-cuda@sha256:66664a81cd0476baa150a6063dfb1054e44f99d5f5b09f9094aae6dc68fc8247
+        image: ghcr.io/mostlygeek/llama-swap:cuda@sha256:5533c89b4a7e894f7614a038250591375cec7b5bbb04847ab6744323fc3819ec
         livenessProbe:
-          failureThreshold: 6
+          failureThreshold: 3
           httpGet:
             path: /health
             port: 8080
           initialDelaySeconds: 30
           periodSeconds: 30
           timeoutSeconds: 5
         name: app
         readinessProbe:
-          failureThreshold: 6
+          failureThreshold: 3
           httpGet:
             path: /health
             port: 8080
-          initialDelaySeconds: 10
+          initialDelaySeconds: 5
           periodSeconds: 10
-          timeoutSeconds: 5
+          timeoutSeconds: 3
         resources:
           limits:
-            cpu: 4
-            memory: 12Gi
+            cpu: 6
+            memory: 20Gi
             nvidia.com/gpu: 1
           requests:
-            cpu: 500m
-            memory: 2Gi
+            cpu: 1
+            memory: 1Gi
             nvidia.com/gpu: 1
         startupProbe:
-          failureThreshold: 180
+          failureThreshold: 12
           httpGet:
             path: /health
             port: 8080
-          initialDelaySeconds: 10
+          initialDelaySeconds: 5
           periodSeconds: 5
           timeoutSeconds: 3
         volumeMounts:
         - mountPath: /cache
           name: cache
+        - mountPath: /app/config.yaml
+          name: config
+          readOnly: true
+          subPath: config.yaml
       volumes:
       - name: cache
         persistentVolumeClaim:
           claimName: llama-server
+      - configMap:
+          name: llama-server-config
+        name: config
 

…requests

- Remove --fit on (matches default fit_params=true)
- Remove --top-p 0.95 (matches llama.cpp default)
- Remove --mlock (redundant alongside --no-mmap on swapless K8s)
- Prune now-stale comments; keep WHY-focused ones only
- Drop idle resource requests to cpu:1 / memory:1Gi ahead of llama-swap migration
  where baseline with no model loaded is near-zero
… annotation

- Extract shared flags into a `common` macro; both models reference `${common}`.
- Use `${PORT}`; drop hardcoded `proxy` (schema default matches).
- Tighten file header; note intentional `--n-gpu-layers` omission on gemma-4 MoE.
- Add `reloader.stakater.com/auto` to match peer ai-namespace apps.
Single source of truth for 8080 across probes.httpGet and service.targetPort.
@Tanguille Tanguille changed the title feat(llama-server): swap to Gemma 4 26B-A4B MoE with --fit auto-sizing feat(llama-server): Gemma 4 26B MoE via llama-swap with idle unload Apr 20, 2026
@Tanguille Tanguille merged commit 6da5fe3 into main Apr 20, 2026
20 of 21 checks passed
@Tanguille Tanguille deleted the feat/llama-server-gemma4-moe branch April 20, 2026 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant