Local LLM inference on AMD APU hardware using llama.cpp. Self-contained - no system ROCm install required. Vulkan (RADV) is the default backend for best stability on RDNA3 iGPUs.
The goal is reasonably performing agentic AI development on an Ayaneo Flip KB (7840U / 32GB) handheld, usable when there is no network. No API keys, no per-token costs, no cloud dependency. Cached state survives reboots and power outages (the Flip has a battery).
CLIO is optimized for this implementation. It serializes tool definitions with deterministic JSON key ordering and reuses conversation state to maximize cache hits across agentic turns. System prompts, tool descriptions, and compressed context - the static content sent on every API call - are cached and persisted to disk so they're available immediately on the next request.
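Why deterministic key ordering matters: the cache matches on token prefixes, so two requests only share a prefix if their serialized tool definitions are byte-identical. The following is a minimal Python illustration of the idea, not CLIO's actual code; the tool definition and the `canonical_json` helper are hypothetical.

```python
import json


def canonical_json(obj) -> str:
    """Serialize with sorted keys and fixed separators so equal data yields equal text."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# The same tool definition built with two different key insertion orders.
tool_a = {"name": "read_file", "parameters": {"path": "string", "limit": "integer"}}
tool_b = {"parameters": {"limit": "integer", "path": "string"}, "name": "read_file"}

# Canonical serialization is identical, so the tokenized prefix is identical too.
same_prefix = canonical_json(tool_a) == canonical_json(tool_b)

# Default serialization follows insertion order and would break the shared prefix.
naive_differs = json.dumps(tool_a) != json.dumps(tool_b)
```

Any stable ordering would do; sorted keys are simply the easiest one to reproduce across runs.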
git clone --recurse-submodules https://github.com/fewtarius/llama-ai.git
cd llama-ai
# Build Vulkan backend (default)
./scripts/rebuild.sh
# Drop a GGUF model in models/, then:
./llama-run.sh --server
# -> http://localhost:9090
To build with ROCm support (optional, has stability issues on some architectures):
./scripts/rebuild.sh --both # Vulkan + ROCm
./scripts/rebuild.sh --rocm # ROCm only
AMD APUs share system RAM with the GPU. Use apply-ttm-kernel-params.sh to configure GTT:
# Set GTT to 18GB (total GPU memory: 6GB VRAM + 18GB GTT = 24GB)
sudo ./scripts/apply-ttm-kernel-params.sh 18
sudo reboot
The script writes kernel parameters (amdgpu.gttsize, amdgpu.vis_vramlimit, ttm.pages_limit) to your bootloader config. It also calls amd-smi set -G as a runtime hint, but the kernel parameters are the authoritative method and persist across reboots.
Supports GRUB (SteamFork 3.7) and systemd-boot (SteamFork 3.8+). Tested on SteamFork - may not work with other distributions.
GTT size defaults to an auto-detected value based on total system RAM (reserving 6GB for the OS). Override it with the first argument or the LLAMA_GTT_SIZE env var.
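The two size-related kernel parameters use different units: amdgpu.gttsize is expressed in MiB and ttm.pages_limit is a page count. The sketch below shows the unit conversion for the 18GB example, assuming 4 KiB pages. It is not the script's code, the exact values the script writes may differ, and amdgpu.vis_vramlimit is left out.

```python
PAGE_SIZE = 4096  # bytes per page; assumes the usual 4 KiB x86-64 page size


def ttm_params(gtt_gib: int) -> dict:
    """Convert a GTT size in GiB into the units the kernel parameters expect."""
    gtt_mib = gtt_gib * 1024                  # amdgpu.gttsize is given in MiB
    pages = gtt_gib * 1024**3 // PAGE_SIZE    # ttm.pages_limit is a number of pages
    return {"amdgpu.gttsize": gtt_mib, "ttm.pages_limit": pages}


params = ttm_params(18)
cmdline_fragment = " ".join(f"{key}={value}" for key, value in params.items())
```

The resulting fragment has the same shape as what the verification command after reboot should show.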
Verify after reboot:
cat /proc/cmdline | tr ' ' '\n' | grep -E "amdgpu|ttm"
detect-gpu.sh auto-detects the AMD GPU via PCI device ID and sets HSA_OVERRIDE_GFX_VERSION for ROCm.
Supported: Cezanne (5800H), Phoenix (780M), Hawk Point (890M/780M), Strix Point (890M/880M), Strix Halo, Sephiroth, Rembrandt (680M/660M), Mendocino (610M), Renoir, Lucienne. Falls back to amd-smi for authoritative detection when PCI IDs are ambiguous (e.g. Cezanne and Van Gogh share the same PCI ID). To add your device, edit the GPU_MAP in scripts/detect-gpu.sh.
Override detection:
LLAMA_GFX_VERSION_OVERRIDE=11.0.3 ./llama-run.sh --server
detect-gpu.sh also detects the CPU ISA level and generates optimal cmake flags:
| CPU | ISA Level | CMake Flags |
|---|---|---|
| Zen 4 (7840U) | avx512_bf16 | -DGGML_AVX512=ON -DGGML_AVX512_BF16=ON -DGGML_AVX512_VNNI=ON |
| Zen 3 (5800H) | avx2 | -DGGML_AVX2=ON -DGGML_AVX=ON -DGGML_FMA=ON |
| Apple Silicon | apple_silicon | (none - ARM NEON auto-detected) |
Previously, the Vulkan build was compiled with GGML_NATIVE=OFF and GGML_AVX512=OFF, leaving AVX-512 code paths compiled out on Zen 4 hardware that supports them. This cost 5-15% generation speed on Vulkan and 30-100% on CPU-offloaded layers. Now rebuild.sh uses $LLAMA_CMAKE_CPU_FLAGS to enable the right ISA level.
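The mapping in the table can be sketched as a lookup over the CPU feature flags listed in /proc/cpuinfo. This is a Python illustration of the decision only, not the logic in detect-gpu.sh (which is a shell script); the function names and the "generic" fallback are assumptions.

```python
def cmake_flags_for(cpu_flags: set) -> tuple:
    """Return (isa_level, cmake_flags) following the table above."""
    if "avx512_bf16" in cpu_flags:
        return "avx512_bf16", [
            "-DGGML_AVX512=ON",
            "-DGGML_AVX512_BF16=ON",
            "-DGGML_AVX512_VNNI=ON",
        ]
    if "avx2" in cpu_flags:
        return "avx2", ["-DGGML_AVX2=ON", "-DGGML_AVX=ON", "-DGGML_FMA=ON"]
    # Hypothetical fallback: no extra flags when neither level is detected.
    return "generic", []


def read_cpu_flags(path: str = "/proc/cpuinfo") -> set:
    """Read the feature flags of the first CPU; empty set if unavailable."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


isa_level, flags = cmake_flags_for(read_cpu_flags())
```

Checking the most capable level first matters: a Zen 4 CPU also reports avx2, so the order of the tests decides which row of the table wins.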
Override:
LLAMA_CPU_ISA_OVERRIDE=avx2 ./scripts/rebuild.sh
# List models found in models/
./llama-run.sh --list-models
# Start server (auto-detects model, Vulkan backend)
./llama-run.sh --server
# Specific model and backend
./llama-run.sh --server gemma-4-26b --backend vulkan
# Download a model
./llama-run.sh --download Qwen3-14B --quant Q4_K_M
# List available backends
./llama-run.sh --list-backends
# Rebuild options
./scripts/rebuild.sh # Vulkan only (default)
./scripts/rebuild.sh --rocm # ROCm only
./scripts/rebuild.sh --both # Vulkan + ROCm
./scripts/rebuild.sh --rebuild # Full rebuild from scratch
Models are auto-profiled based on filename characteristics. MoE models get checkpoint strategies and reasoning format; SSM/Mamba models get context-shift disabled; large dense models get optimized batch sizes. Profiles are assigned dynamically - no hard-coded model names.
SSD-backed KV cache persists conversation state across server restarts. Enabled by default for all non-SSM models - the cache directory is kv-cache/. When available, ROCm is auto-detected as a secondary backend option.
The bottleneck in agentic AI isn't generation speed (the model produces tokens as fast as the GPU allows). The bottleneck is prompt evaluation - reprocessing the entire prompt before the model can generate its first token.
Every API call in an agentic workflow sends static content: system prompt, tool definitions, compressed conversation context. Without caching, this content is re-evaluated from scratch on every single call. In the benchmarks below, a ~15K-token prompt takes roughly 2-7 minutes of prompt evaluation before the model starts responding. With the SSD cache, the same prompt evaluates in 1-3 seconds.
The SSD-backed KV cache has three tiers with automatic promotion and demotion:
- Hot tier - Checkpoints from the current session, kept in RAM. Instant restore when the same conversation continues. After 2 turns of inactivity, hot checkpoints are demoted to warm.
- Warm tier - Checkpoints from previous sessions in the same server run. In RAM until memory pressure forces demotion to cold. After 4 turns of inactivity, warm checkpoints are demoted to cold.
- Cold tier - On-disk checkpoints with token prefixes. Survives server restarts. Each conversation gets up to the ring buffer limit of cold checkpoints on disk. When the limit is exceeded, the oldest cold checkpoint is deleted. Up to 16 conversations are tracked simultaneously (configurable with --cache-ssd-max-conversations).
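The turn-based demotion and the cold-tier ring buffer can be sketched as follows. This is a simplified Python model, not the server's C++ implementation. It assumes both windows are counted from the checkpoint's last access and it ignores memory-pressure demotion; `tier_for` and `ColdRing` are hypothetical names.

```python
from collections import deque

HOT_WINDOW = 2   # idle turns before hot -> warm (from the text)
WARM_WINDOW = 4  # idle turns before warm -> cold (from the text)


def tier_for(idle_turns: int) -> str:
    """Classify a checkpoint by the number of turns since it was last used."""
    if idle_turns < HOT_WINDOW:
        return "hot"
    if idle_turns < WARM_WINDOW:
        return "warm"
    return "cold"


class ColdRing:
    """Per-conversation ring of on-disk checkpoints; the oldest is dropped first."""

    def __init__(self, limit: int):
        self.limit = limit
        self.files = deque()

    def add(self, name: str) -> list:
        """Record a new cold checkpoint and return the names that must be deleted."""
        self.files.append(name)
        evicted = []
        while len(self.files) > self.limit:
            evicted.append(self.files.popleft())
        return evicted


ring = ColdRing(limit=2)
deleted = [ring.add(name) for name in ("ckpt-0.bin", "ckpt-1.bin", "ckpt-2.bin")]
```

With a limit of 2, adding the third checkpoint evicts `ckpt-0.bin` and leaves the two newest files on disk.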
When an API request arrives, the server searches for a matching checkpoint in three stages:
- Same-conversation (Tier 1) - Matches by conversation hash (conv_hash), an FNV-1a hash of the first 1024 task tokens. This finds the checkpoint from a previous turn of the same conversation. Fast, accurate, and the most common hit path.
- Shared prefix (Tier 2) - Cross-conversation match using n_past (the common prefix length). This reuses cached system prompt evaluation across different conversations with the same model. Works because the first N tokens are identical - tool definitions, system instructions, etc.
- Cold-start token prefix (Tier 3) - Used on server restart when n_past == 0. The server compares the prompt's first tokens against every checkpoint's stored token prefix (up to 4096 tokens per checkpoint). This has two phases:
  - Chain match - Same conversation, full prefix matches. The largest checkpoint from the same conversation is preferred, even if it's large - the recurrent state is content-accurate.
  - Safe match - Cross-conversation or partial prefix. Only checkpoints whose n_tokens fits within the longest common prefix (LCP) are considered. This avoids restoring recurrent state computed from different conversation content.
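The safe-match rule can be sketched in a few lines. This is a Python illustration under stated assumptions, not the server's search code: checkpoints are modeled as `(n_tokens, token_prefix)` pairs, and `lcp_len` and `safe_match` are hypothetical names.

```python
def lcp_len(a, b) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def safe_match(prompt, checkpoints):
    """Pick the largest checkpoint that fits entirely inside the common prefix.

    checkpoints: iterable of (n_tokens, token_prefix) pairs.
    Returns the chosen n_tokens, or None if no checkpoint qualifies.
    """
    best = None
    for n_tokens, prefix in checkpoints:
        if n_tokens <= lcp_len(prompt, prefix) and (best is None or n_tokens > best):
            best = n_tokens
    return best


prompt = [1, 2, 3, 4, 5, 6]
candidates = [
    (3, [1, 2, 3]),        # fully inside the common prefix: acceptable
    (5, [1, 2, 3, 9, 9]),  # diverges after 3 tokens: rejected
    (2, [1, 2]),           # acceptable but smaller
]
chosen = safe_match(prompt, candidates)
```

Because the stored prefix holds at most 4096 tokens, the common prefix can never be verified beyond that length, so under this rule a safe match restores at most 4096 tokens.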
Overflow handling differs by match type. Same-conversation checkpoints (Tier 1 and Tier 3 chain) skip size and staleness checks entirely - the recurrent state is content-accurate, so any same-conv checkpoint is valid. If the checkpoint covers more tokens than the current task, n_past is capped in the restore layer to leave room for new token evaluation instead of resetting. Cross-conversation matches (Tier 2 and Tier 3 safe) skip oversized checkpoints at the search layer, since the recurrent state was computed from different conversation content.
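For the same-conversation case, the capping step amounts to clamping n_past so at least some tokens remain to decode. A minimal Python sketch follows; the one-token `margin` is an assumption, since the text does not state the actual overflow margin.

```python
def capped_n_past(ckpt_tokens: int, task_tokens: int, margin: int = 1) -> int:
    """Tokens treated as already evaluated after restoring a same-conversation checkpoint.

    The cap leaves `margin` tokens to evaluate, so the decode batch is never empty.
    """
    return max(0, min(ckpt_tokens, task_tokens - margin))


oversized = capped_n_past(ckpt_tokens=100, task_tokens=80)  # checkpoint covers more than the task
normal = capped_n_past(ckpt_tokens=50, task_tokens=80)      # checkpoint fits; no cap applied
```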
Each checkpoint is stored as a separate file (ckpt-N.bin) in kv-cache/{conv_hash}/ with metadata in index.bin. Turn tracking survives server restarts - the next turn counter is seeded from the maximum turn ID found on disk, so warm-tier entries from a previous server run start aging from turn 0 of the new run rather than being immediately demoted.
Every checkpoint carries:
- conv_hash - Conversation identity (first 1024 tokens)
- compat_hash - Model configuration hash (architecture, dimensions, cache types). Checkpoints with mismatched compat hashes are rejected, preventing silent corruption when switching between models.
- token_prefix - First 4096 tokens for cold-start prefix matching
- turn_id - Tracks when the checkpoint was last accessed for tier management
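The metadata and the conversation hash can be modeled as below. This is a sketch rather than the on-disk format: the 64-bit FNV-1a variant, the little-endian 32-bit token encoding, and the `CheckpointMeta` type are assumptions made for illustration.

```python
import struct
from dataclasses import dataclass

FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def conv_hash(tokens, n: int = 1024) -> int:
    """Hash only the first n tokens, so later turns of a conversation keep the same identity."""
    head = list(tokens[:n])
    return fnv1a_64(struct.pack(f"<{len(head)}i", *head))


@dataclass
class CheckpointMeta:
    conv_hash: int
    compat_hash: int
    token_prefix: list
    turn_id: int


def usable(meta: CheckpointMeta, model_compat_hash: int) -> bool:
    """Reject checkpoints written by a different model configuration."""
    return meta.compat_hash == model_compat_hash


turn_1 = list(range(1024)) + [5, 6, 7]
turn_2 = list(range(1024)) + [5, 6, 7, 8, 9]
same_conversation = conv_hash(turn_1) == conv_hash(turn_2)
```

Hashing a fixed-length head is what lets a growing conversation map to the same directory on every turn.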
When a cold checkpoint is identified for loading, the server issues posix_fadvise(POSIX_FADV_WILLNEED) on Linux (or readahead() on macOS) to trigger kernel page cache prefetch. This overlaps SSD I/O with CPU work (token matching, state restoration setup) and reduces cold TTFT by ~0.5-0.75s for typical checkpoint sizes.
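The Linux side of this hint can be tried from Python, which exposes the same call as `os.posix_fadvise`. The sketch below only illustrates the prefetch hint and is not the server's C++ code; `prefetch` is a hypothetical helper, and it returns False on platforms where the call is unavailable.

```python
import os
import tempfile


def prefetch(path: str) -> bool:
    """Ask the kernel to start reading a file into the page cache.

    Returns True if the hint was issued, False if the platform does not support it.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and length=0 cover the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)


# Stand-in for a checkpoint file such as ckpt-N.bin.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(b"\x00" * 65536)

hinted = prefetch(tmp.name)  # returns immediately; the read proceeds in the background
with open(tmp.name, "rb") as f:
    payload = f.read()       # a later read is served from the page cache where possible
os.unlink(tmp.name)
```

The hint is advisory: the call does not block on the I/O, which is what allows the read to overlap with other work.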
The KV cache (attention state) and recurrent state (for hybrid MoE models) are restored from the checkpoint. Only tokens beyond the checkpoint's coverage need evaluation. A 15K-token prompt might need just a handful of new tokens evaluated - the rest is restored from disk in under a second.
The cache is persisted automatically after each turn. No manual management needed.
Real agentic workloads send 12-20K tokens of system prompt and tool definitions on every API call, growing to 32-64K tokens with compressed conversation context. Every token is re-evaluated from scratch without caching.
The benchmark uses scaled-down prompts to demonstrate cache mechanics and prove the speedup is real. The same principles apply at production sizes - speedup ratios increase with prompt length.
| Size | Tokens | What it measures |
|---|---|---|
| Small | ~1,100 | Cache overhead and baseline speedup |
| Medium | ~5,200 | Checkpoint matching and partial restore |
| Large | ~15,500 | Full checkpoint restore with large prefix |
Each size runs twice:
- Cold - Empty cache, server starts fresh. The entire prompt is evaluated from scratch.
- Warm - Server restarts with existing SSD cache. The server restores the matching checkpoint from disk and evaluates only the delta.
The key metric is TTFT (Time To First Token) - how long before the model starts generating. Generation speed doesn't change with caching (same model, same hardware). What changes is the wait before generation begins.
Tested on Ayaneo Flip KB (7840U / 780M / 32GB / Vulkan). 128 output tokens, ctx 32768, all GPU layers.
GLM-4.7-Flash:
| Size | Tokens | Cold TTFT | Warm TTFT | Speedup | Gen TPS |
|---|---|---|---|---|---|
| Small | ~1145 | 9.1s | 0.4s | 22.2x | 20.3 |
| Medium | ~5237 | 66.6s (1.1min) | 1.1s | 59.3x | 12.5 |
| Large | ~15.5K | 419.5s (7.0min) | 2.8s | 148.8x | 6.3 |
Cold prompt eval: 36.9-126.3 t/s. Warm: 2802-5493 t/s. Cached: 15485/15489 tokens at large size.
Gemma 4 26B:
| Size | Tokens | Cold TTFT | Warm TTFT | Speedup | Gen TPS |
|---|---|---|---|---|---|
| Small | ~1413 | 8.1s | 0.8s | 10.7x | 16.5 |
| Medium | ~6083 | 34.3s | 1.0s | 32.8x | 15.8 |
| Large | ~17.3K | 114.8s (1.9min) | 1.5s | 79.1x | 14.5 |
Cold prompt eval: 151-174 t/s. Warm: 1855-11955 t/s. Cached: 17343/17347 tokens at large size.
Qwen3.6-35B:
| Size | Tokens | Cold TTFT | Warm TTFT | Speedup | Gen TPS |
|---|---|---|---|---|---|
| Small | ~1243 | 8.8s | 0.4s | 20.1x | 21.4 |
| Medium | ~5409 | 39.1s | 0.6s | 61.9x | 20.7 |
| Large | ~15.7K | 125.1s (2.1min) | 1.1s | 117.8x | 19.0 |
Cold prompt eval: 126-142 t/s. Warm: 2851-14808 t/s. Cached: 15717/15721 tokens at large size. 35B parameters with only 3B active - the fastest model tested on the Flip. The SSD cache restores both attention KV state and recurrent state from disk. Only 4 new tokens need evaluation at large size.
All models on Ayaneo Flip KB (7840U / 780M / 32GB / Vulkan):
| Model | Params | Large cold | Large warm | Speedup | Gen TPS | Type |
|---|---|---|---|---|---|---|
| GLM-4.7-Flash | 14B | 419.5s (7.0min) | 2.8s | 148.8x | 6.3 | Dense |
| Gemma 4 26B | 26B | 114.8s (1.9min) | 1.5s | 79.1x | 14.5 | Dense |
| Qwen3.6-35B | 35B | 125.1s (2.1min) | 1.1s | 117.8x | 19.0 | MoE hybrid |
Generation speed (t/s) is unaffected by caching - the speedup is entirely in prompt evaluation. What caching changes is whether you wait 2-7 minutes or 1-3 seconds before the model starts responding.
Full benchmark data (server logs, API responses, timing stats): benchmarks/20260602-0852/
# Full benchmark: all models, Vulkan backend
./scripts/benchmark.sh
# Single model
./scripts/benchmark.sh --model GLM-4.7-Flash-Q4_K_M.gguf
# Both backends
./scripts/benchmark.sh --backend both
The benchmark uses public-domain text from The Count of Monte Cristo (Project Gutenberg), cached locally in scratch/pg1184.txt. Each prompt appends "Summarize this passage in one sentence." to keep generation short (128 tokens).
benchmarks/YYYYMMDD-HHMM/
├── vulkan/
│ ├── GLM-4.7-Flash-Q4_K_M/
│ │ ├── server-small-cold.log # Server log (cold run)
│ │ ├── server-small-warm.log # Server log (warm run)
│ │ ├── small-cold-response.json # Raw API response
│ │ ├── small-cold-stats.json # Extracted timing stats
│ │ ├── small-warm-response.json
│ │ ├── small-warm-stats.json
│ │ ├── small-result.json # Cold vs warm comparison
│ │ ├── summary.json # All sizes aggregated
│ │ └── summary.md # Human-readable table
│ └── summary.json / summary.md # Aggregate across models
└── rocm/ ...
This cache was built for CLIO, an AI coding assistant that sends 20-32K tokens of system prompt, tool definitions, and compressed conversation context on every API call. Without caching, every turn would re-evaluate all 20K+ tokens from scratch.
A CLIO session consists of alternating tool call turns (the LLM decides what tool to run) and response turns (the LLM generates a user-visible message). Tool call turns are short - the model outputs a tool call JSON (~30-150 tokens). Response turns are longer - the model generates commands, code, and explanations.
Every turn includes the same static prefix: system prompt, tool definitions, project context. As the conversation grows, compressed summaries of earlier messages are appended. The static portion is ~20K tokens; the dynamic conversation portion grows from ~5K to ~12K.
A single prompt - "Please evaluate this project and share your opinion of it." - sent to CLIO running on Qwen3.6-35B-A3B (MoE hybrid, Q4_K_XL, Vulkan, Ayaneo Flip KB). Seven turns, 7 minutes total.
The model explores the project on its own: listing files, reading the README, checking git history, reading scripts, then writes a detailed evaluation. Each turn sends the full conversation context (17-29K tokens) to the API. The cache determines how much of that context needs re-evaluation.
| Turn | Action | Tokens | Cached | Cache% | TTFT | Est. Cold TTFT | Speedup |
|---|---|---|---|---|---|---|---|
| T0 | Cold start, explore project | 17,880 | 4,096 | 23% | 126s | 161s | 1.3x |
| T1 | Read files, git log | 18,851 | 17,965 | 95% | 15s | 170s | 11.6x |
| T2 | Read more files | 19,017 | 17,779 | 93% | 0s | 171s | - |
| T3 | Read + wc + git | 25,293 | 19,231 | 76% | 71s | 228s | 3.2x |
| T4 | Read more files | 26,637 | 25,510 | 96% | 19s | 240s | 12.7x |
| T5 | Read more files | 28,298 | 26,813 | 95% | 23s | 255s | 10.9x |
| T6 | Write final response | 29,316 | 17,878 | 61% | 161s | 264s | 1.6x |
Cache% = tokens restored from cache / total tokens. Est. Cold TTFT = tokens / 111 t/s (measured cold rate from T0). Generation speed: 17.6-19.1 t/s (unaffected by caching).
Total: 7 minutes actual vs ~25 minutes estimated without cache.
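The derived columns can be recomputed from the raw numbers in the table. The short Python check below uses only figures from the table and the 111 t/s cold rate; the totals are the sums of the TTFT and Est. Cold TTFT columns.

```python
COLD_RATE = 111  # tokens/s, the measured cold prompt-eval rate from T0

# (total tokens, cached tokens, measured TTFT in seconds) for T0..T6
turns = [
    (17880, 4096, 126),
    (18851, 17965, 15),
    (19017, 17779, 0),
    (25293, 19231, 71),
    (26637, 25510, 19),
    (28298, 26813, 23),
    (29316, 17878, 161),
]

# Estimated cold TTFT: every token evaluated at the cold rate.
est_cold = [round(tokens / COLD_RATE) for tokens, _, _ in turns]

# Cache%: share of the prompt restored from cache.
cache_pct = [round(100 * cached / tokens) for tokens, cached, _ in turns]

total_actual_min = sum(ttft for _, _, ttft in turns) / 60
total_cold_min = sum(est_cold) / 60
```

The totals come out at about 6.9 and 24.8 minutes, which the text rounds to 7 and ~25.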
The same prompt was evaluated against three cloud-hosted models; two completed the task and one did not:
MiniMax M2.7 - completed in 4 turns, ~42 seconds total:
| Turn | Action | Tokens In | Tokens Out | Duration | Tool Calls |
|---|---|---|---|---|---|
| T0 | Read README, list scripts | 17,129 | 115 | 6.1s | 2 |
| T1 | Read llama-run.sh, git log | 22,088 | 116 | 5.3s | 2 |
| T2 | Read more files | 23,847 | 176 | 5.7s | 2 |
| T3 | Write final response | 26,214 | 455 | 25.3s | 0 |
Qwen3.5-35B-A3B via OpenRouter - completed in 4 turns, ~39 seconds total:
| Turn | Action | Tokens In | Tokens Out | TTFT | Duration | Tool Calls |
|---|---|---|---|---|---|---|
| T0 | Read README, list dir | 17,880 | 194 | 3.4s | 3.4s | 3 |
| T1 | Read llama-run.sh, scripts | 34,916 | 200 | - | 3.4s | 3 |
| T2 | Read rebuild.sh, list scripts | 40,785 | 117 | - | 3.0s | 2 |
| T3 | Write final response | 45,656 | 1,154 | 18.1s | 20.1s | 0 |
Qwen3.6-35B-A3B via OpenRouter - failed to complete. The model produced thinking output but never issued tool calls, stalling after two attempts. The same model works correctly when run locally, suggesting this is an API/provider-specific issue rather than an architectural limitation.
Cloud models have much lower TTFT because the prompt is evaluated on clusters of GPUs. With the SSD cache, the local model's high-cache turns (TTFT of 0-23s) fall in the same range as the cloud models' longest turns (20-25s), but cache misses take far longer (71-161s). The tradeoff: local inference is private, offline-capable, and has no per-token cost.
T0 - Cold start (126s TTFT). Server just started. No in-memory cache. The SSD cache had a checkpoint from a previous conversation with 4,096 tokens of matching prefix (system prompt + tool definitions). The server restored those 4,096 tokens from disk and evaluated the remaining 13,784. Without any cache, all 17,880 tokens would need evaluation at ~111 t/s, taking ~161s. The partial SSD hit saved 35s.
T1 - Read files, git log (15s TTFT, 11.6x speedup). The in-memory checkpoint from T0 covers 17,880 tokens. T1's prompt shares the first 17,965 tokens with T0's context. Only 886 new tokens need evaluation. The cache divergence at token 17,965 was a minor difference (tool call format: recursive vs False). 11.6x faster than cold.
T2 - Read more files (0s TTFT, 93% cache). Similar to T1 - the in-memory checkpoint covers 93% of the prompt. Only 1,238 new tokens. The model produced 215 tokens of tool calls across 3 parallel tool invocations. TTFT was effectively instant because the prompt was almost entirely cached.
T3 - Read + wc + git (71s TTFT, 3.2x speedup). The conversation grew significantly - tool results from T2 added ~6K tokens. The in-memory checkpoint diverged at token 17,878 (the boundary between the static prefix and the dynamic conversation). Only 19,231 of 25,293 tokens were cached. Still 3.2x faster than the estimated 228s cold time.
T4 - Read more files (19s TTFT, 12.7x speedup). The checkpoint from T3 covers most of T4's prompt. Only 1,127 new tokens need evaluation. 12.7x faster than cold.
T5 - Read more files (23s TTFT, 10.9x speedup). Similar to T4. The conversation grew slightly. 1,485 new tokens. 10.9x faster than cold.
T6 - Write final response (161s TTFT, 1.6x speedup). The model wrote a 1,007-token evaluation. The conversation context diverged from the previous checkpoint at token 17,878 (same boundary as T3 - the static/dynamic split). Only 17,878 of 29,316 tokens were cached, leaving 11,438 to evaluate. Still 1.6x faster than the estimated 264s cold time.
Three patterns emerge across the seven turns:
-
High cache hit (93-96%, T1/T2/T4/T5): The prompt is nearly identical to the previous turn. The checkpoint covers the static prefix plus most of the conversation. Only 886-1,485 new tokens need evaluation. TTFT drops to 0-23 seconds (10-12x faster than cold). This is the common case for tool-call turns where the model reads files and the conversation grows by a small amount.
-
Moderate cache hit (76%, T3): The conversation grew significantly (tool results added ~6K tokens). The checkpoint covers the static prefix but the dynamic portion needs full evaluation. Still 3.2x faster than cold.
-
Low cache hit (23-61%, T0/T6): Cold start (T0) or context divergence at the static/dynamic boundary (T6). The cache still saves 35-103 seconds compared to full cold evaluation, but the majority of tokens need re-evaluation.
Multi-tenant deployments need isolation between users sharing the same server. This fork adds three dimensions of isolation:
The user_id is a first-class request parameter. Pass it in the request body as llama_user_id:
{
"model": "...",
"messages": [...],
"llama_user_id": "tenant-42-user-7"
}
OpenAI SDK callers pass it through extra_body:
client.chat.completions.create(
model="...",
messages=[...],
extra_body={"llama_user_id": "tenant-42-user-7"},
)
The value is validated against ^[a-zA-Z0-9\-_]+$ with a 512-character limit. An empty string is also valid and maps to the anonymous bucket.
When user_id is present, the SSD page manager routes checkpoints to a separate u/ namespace on disk:
{ssd_path}/{hash_hex}/ # anonymous (conv_hash)
{ssd_path}/u/{hash_hex}/ # user-scoped (fnv1a(user_id))
Cross-user lookup is disabled for user-scoped requests. A user can only access their own cached state, never another user's directory.
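Validation and directory routing can be sketched together. This Python model follows the layout shown above but is not the server's code: the 64-bit FNV-1a variant, the 16-digit hex formatting, and the function names are assumptions.

```python
import re
from pathlib import PurePosixPath

USER_ID_RE = re.compile(r"^[a-zA-Z0-9\-_]+$")
MAX_USER_ID_LEN = 512


def valid_user_id(user_id: str) -> bool:
    """An empty string is allowed (anonymous); otherwise enforce charset and length."""
    if user_id == "":
        return True
    return len(user_id) <= MAX_USER_ID_LEN and USER_ID_RE.fullmatch(user_id) is not None


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a hash."""
    h = 0xCBF29CE484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h


def checkpoint_dir(ssd_path: str, user_id: str, conv_hash_hex: str) -> str:
    """Anonymous requests use the conversation hash; user requests use u/<hash of user_id>."""
    if not valid_user_id(user_id):
        raise ValueError("invalid user_id")
    root = PurePosixPath(ssd_path)
    if user_id == "":
        return str(root / conv_hash_hex)
    return str(root / "u" / f"{fnv1a_64(user_id.encode()):016x}")


anon_dir = checkpoint_dir("kv-cache", "", "0123abcd")
user_dir = checkpoint_dir("kv-cache", "tenant-42-user-7", "0123abcd")
```

Because the character set excludes `/` and `.`, and only a hash of the ID reaches the filesystem, a crafted ID cannot escape the `u/` namespace in this model.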
--max-concurrent-per-user N caps the number of simultaneous slots a single user_id can occupy. When the cap is hit, the server returns HTTP 429 with a rate_limit_error type:
{
"error": {
"code": 429,
"message": "User 'tenant-42-user-7' has reached the concurrent request limit (2)",
"type": "rate_limit_error"
}
}
Slot allocation also prefers slots already owned by the requesting user (cache affinity). An empty slot (post-release) is fair game for any user.
Default: 0 (unlimited). Set to 1 for strict one-at-a-time, or 2-3 for concurrent with backpressure.
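The cap behaves like a per-user counter checked at admission time. The sketch below models that in Python, reusing the error shape shown above. It is not the server's implementation: `PerUserLimiter` is a hypothetical class, it is not thread-safe, and applying the cap to the anonymous bucket is an assumption.

```python
from collections import Counter


class PerUserLimiter:
    """Track active requests per user_id and refuse new ones above the cap."""

    def __init__(self, max_per_user: int = 0):
        self.max_per_user = max_per_user  # 0 means unlimited, matching the default
        self.active = Counter()

    def try_acquire(self, user_id: str):
        """Return (http_status, error_body); error_body is None on success."""
        if self.max_per_user > 0 and self.active[user_id] >= self.max_per_user:
            return 429, {
                "error": {
                    "code": 429,
                    "message": f"User '{user_id}' has reached the concurrent "
                               f"request limit ({self.max_per_user})",
                    "type": "rate_limit_error",
                }
            }
        self.active[user_id] += 1
        return 200, None

    def release(self, user_id: str) -> None:
        """Free one slot when a request finishes."""
        if self.active[user_id] > 0:
            self.active[user_id] -= 1


limiter = PerUserLimiter(max_per_user=2)
statuses = [limiter.try_acquire("tenant-42-user-7")[0] for _ in range(3)]
limiter.release("tenant-42-user-7")
after_release = limiter.try_acquire("tenant-42-user-7")[0]
```

Clients that receive the 429 are expected to back off and retry, which is the backpressure the setting of 2-3 is meant to provide.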
Design rationale: docs/development/user-isolation-design.md
MoE models (Qwen3.5/3.6, DeepSeek, Mixtral) activate only a subset of experts per token. This fork adds real-time expert activation tracking via two HTTP endpoints:
GET /expert-stats returns per-layer expert activation counts, frequencies, and token counts:
{
"n_expert": 256,
"n_expert_used": 8,
"total_tokens": 1500,
"tracking_enabled": true,
"layers": [
{
"layer": 0,
"activations": [
{"expert": 42, "count": 150, "frequency": 0.0125},
{"expert": 7, "count": 148, "frequency": 0.0123},
...
]
},
...
]
}
POST /expert-tracking enables or disables tracking and optionally resets counters:
{"enabled": true, "reset": true}
This is Phase 1 of the MoE expert tiering design - instrumentation only, no compute changes. Future phases will use this data to reorder experts for cache locality and offload cold experts to RAM/SSD.
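A client could summarize the /expert-stats payload as sketched below. The sample data mirrors the response shape shown above, truncated to one layer. The frequency formula (count divided by total_tokens times n_expert_used) is inferred from the sample values and is not confirmed by the text; the helper names are hypothetical.

```python
def hottest_experts(stats: dict, k: int = 1) -> dict:
    """Map each layer index to its k most frequently activated expert IDs."""
    result = {}
    for layer in stats["layers"]:
        ranked = sorted(layer["activations"], key=lambda a: a["count"], reverse=True)
        result[layer["layer"]] = [a["expert"] for a in ranked[:k]]
    return result


def inferred_frequency(count: int, stats: dict) -> float:
    """Share of all expert selections in a layer that went to one expert (inferred formula)."""
    return count / (stats["total_tokens"] * stats["n_expert_used"])


sample = {
    "n_expert": 256,
    "n_expert_used": 8,
    "total_tokens": 1500,
    "tracking_enabled": True,
    "layers": [
        {
            "layer": 0,
            "activations": [
                {"expert": 7, "count": 148, "frequency": 0.0123},
                {"expert": 42, "count": 150, "frequency": 0.0125},
            ],
        }
    ],
}

top = hottest_experts(sample, k=1)
freq_42 = inferred_frequency(150, sample)
```

Ranking by count rather than by list position keeps the helper correct even if the server does not return activations pre-sorted.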
This fork maintains patches on top of llama.cpp that improve performance of agentic AI workloads with hybrid MoE models on AMD APU hardware.
Persistent cross-session KV cache that survives server restarts. Hot/warm/cold tiering with automatic promotion and demotion keeps frequently-used conversation state in RAM while evicting stale entries to disk.
- Hot tier: Recently-used checkpoints kept in RAM for instant restore. Demoted to warm after 2 inactive turns.
- Warm tier: Checkpoints from previous sessions, kept in RAM until memory pressure forces demotion. Demoted to cold after 4 inactive turns.
- Cold tier: On-disk checkpoints with token prefixes for cross-session matching
- Ring buffer eviction: Per-conversation ring buffer prevents unbounded disk growth. Oldest checkpoints are evicted when space is needed.
- Three-tier search: Same-conversation match by conversation hash, shared-prefix match by n_past, and cold-start token prefix comparison with chain/safe phases
- Kernel readahead: posix_fadvise(POSIX_FADV_WILLNEED) on Linux, readahead() on macOS. Overlaps SSD I/O with CPU work for ~0.5-0.75s TTFT reduction on cold cache hits.
- Checkpoint overflow prevention: Same-conversation checkpoints are accepted regardless of size (recurrent state is content-accurate) and capped in the restore layer. Cross-conversation oversized checkpoints are skipped at the search layer. Prevents "no tokens to decode" crashes.
- Turn-based tiering: Checkpoints track turn activity across server restarts for accurate promotion/demotion
- Cold start recovery: On server restart, automatically searches SSD cache by token prefix match. Same-conversation checkpoints are restored even if larger than the current task - n_past is capped with overflow margin instead of falling through to full reprocessing.
- Conversation-aware matching: Checkpoints carry conversation hash and model compatibility hash. Mismatched checkpoints are rejected, so switching models or conversations doesn't corrupt cached state.
- Per-conversation directories: Each conversation gets its own directory (kv-cache/{conv_hash}/). Switching conversations doesn't corrupt cached state. Multiple independent chat threads operate in parallel without interference.
- Smart eviction: Scores checkpoints by age, size, and task overlap to preserve what's most useful
- MLA model support: DeepSeek2/DeepSeek3 MLA models get checkpoint support via llama_model_is_mla() detection
CLI flags: --cache-ssd, --cache-ssd-checkpoints, --cache-ssd-hot-window, --cache-ssd-warm-window, --cache-ssd-max-cold, --cache-ssd-page-size, --cache-ssd-max-conversations
Hybrid models (Qwen3.5/3.6 MoE) combine transformer attention with recurrent state (like Mamba). Upstream checkpoint restore was broken for these architectures, causing silent KV cache exhaustion and "no tokens to decode" crashes after 2-3 conversation turns. This fork carries 13 incremental fixes, including:
- KV cache shifting: Hybrid models need different position tracking than dense models - pos_min/pos_max don't capture recurrent state coverage
- Checkpoint erasure: When conversation content diverges, only attention cells are cleared, preserving recurrent state for reuse
- Checkpoint overflow prevention: Same-conversation checkpoints are accepted regardless of size (recurrent state is content-accurate) and capped in the restore layer. Cross-conversation oversized checkpoints are skipped at the search layer. Prevents the fatal batch.n_tokens = 0 crash
- seq_rm_attn_only: New API that clears attention KV entries without disturbing recurrent state - critical for checkpoint restore correctness
- QWEN35MOE architecture filter: Correctly identifies which layers are attention vs. recurrent for state management
- Checkpoint search condition: Hybrid checkpoint restore uses n_tokens <= n_past to prevent restoring recurrent state from stale (diverged) conversation content. The previous >= condition allowed checkpoints past the cache divergence point, causing degraded output on multi-turn conversations
- Per-user concurrency cap: --max-concurrent-per-user N limits simultaneous slots per user. Returns HTTP 429 when the cap is hit.
- User-scoped KV cache: user_id routes checkpoints to the u/ namespace on disk, preventing cross-user cache contamination
- Slot affinity: Slot allocation prefers slots already owned by the requesting user for cache locality
- Request threading: user_id is threaded from the HTTP request body through server_task to slot allocation and cache routing
- GET /expert-stats: Per-layer expert activation counts, frequencies, and token counts
- POST /expert-tracking: Enable/disable tracking and reset counters
- C API: llama_expert_tracking_enable(), llama_expert_stats_get(), llama_model_n_expert(), llama_model_n_expert_used()
- Reads ffn_moe_argsort tensors from the compute graph after each decode to track which experts are activated per token
- Scoring-based prompt cache eviction: Replaced FIFO eviction with scoring by age, size, and task token overlap. Conversations with long common prefixes stay cached longer
- Text context on cache divergence: Debug logging shows the actual tokens where cache diverged, making prompt engineering and tool output debugging tractable
- Checkpoint eviction under memory pressure: Automatically frees checkpoints when KV cache hits capacity limits
- Conversation-aware checkpoint matching: Uses model config validation to prevent mismatched checkpoint restoration
- CLIO integration: CLIO serializes tool definitions with deterministic JSON key ordering and reuses conversation state to maximize cache hits across agentic turns. System prompts, tool descriptions, and compressed context sent on every API call are cached and persisted to disk.
- Auto-mlock tuning: llama-run.sh compares model size against RLIMIT_MEMLOCK and disables --mlock when the limit is too small, eliminating startup warnings
- SSD cache defaults: Enabled by default for all non-SSM models in llama-run.sh. The --cache-ssd-max-conversations flag (default: 16) controls how many conversation directories are tracked simultaneously.
- CPU ISA auto-detection: detect-gpu.sh reads /proc/cpuinfo and generates optimal cmake flags for the detected CPU (AVX-512 BF16 on Zen 4, AVX2 on Zen 3, etc.). Previously, the Vulkan build was compiled with GGML_NATIVE=OFF and GGML_AVX512=OFF, leaving AVX-512 code paths disabled on hardware that supports them.
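The auto-mlock decision boils down to one comparison against the process's locked-memory limit. The Python sketch below illustrates that check and is not the shell logic in llama-run.sh; using the soft limit and treating an unlimited value as "fits" are assumptions.

```python
import resource


def mlock_fits(model_bytes: int, soft_limit: int) -> bool:
    """True if a model of this size can be locked in RAM under the given limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return model_bytes <= soft_limit


def should_use_mlock(model_bytes: int) -> bool:
    """Check the model size against this process's RLIMIT_MEMLOCK soft limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return mlock_fits(model_bytes, soft)


# Example: a 9 GiB model file against the current limit.
use_mlock = should_use_mlock(9 * 1024**3)
```

When the check fails, dropping the flag avoids the startup warning without changing anything else about how the model is loaded.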
├── llama-run.sh # Main entry point
├── llama.cpp/ # Submodule - ggml-org/llama.cpp
├── scripts/
│ ├── rebuild.sh # Build script (Vulkan default, optional ROCm)
│ ├── env.sh # Environment setup (source before using tools)
│ ├── detect-gpu.sh # GPU/APU and CPU ISA auto-detection library
│ ├── benchmark.sh # Prompt cache performance testing
│ └── apply-ttm-kernel-params.sh # GPU memory config (GRUB + systemd-boot)
├── src/
│ ├── llama-cpp-rocm/ # ROCm build output + build.sh
│ └── llama-cpp-vulkan/ # Vulkan build output + build.sh
├── deps/ # ROCm SDK (downloaded by rebuild.sh)
├── models/ # GGUF files
├── kv-cache/ # SSD-backed KV cache (per-conversation directories)
├── scratch/ # Transient working files (benchmark source text)
└── benchmarks/ # Benchmark results with full server logs
Source code: GPL-3.0-or-later
Documentation: CC-BY-NC-SA-4.0
llama.cpp is MIT-licensed. ROCm components carry AMD's license.