End-to-end documentation to set up your own local & fully private LLM server on Debian. Equipped with chat, web search, RAG, model management, MCP servers, image generation, and TTS.
Updated Mar 2, 2026
A robust, production-ready Python toolkit that keeps a directory of .gguf model files in sync with a llama-swap config.yaml (see the sketch after this listing).
The operations layer for your local LLM stack
Auto-configure opencode to use a local llama-swap instance with model and context detection
Config-driven local LLM toolkit for llama.cpp and llama-swap, with a FastAPI Web UI, eval/benchmark helpers, and deployment packaging.
LLM routing proxy for coding harnesses. Auto-routes requests to cloud or local inference via Bonsai LLM classification, with fallback, prompt rewriting, MCP code review, and Signal/Discord remote communication.
Custom Llama Swap Container Image
Launch and optimize llama.cpp servers automatically across Linux, macOS, and Windows using hardware detection and configuration tuning.
Autonomous overnight LLM eval pipeline for local GGUF models — multi-turn agentic tasks, dimension-routed dual-judge scoring, SQLite-backed comparison reports. Built for llama.cpp + llama-swap on dual-GPU rigs.
Start/stop your Llama Swap models with ulauncher
Cursor-Auto / Claude-tier-style serving for local GGUF models on a Mac (M4 Max, 64 GB). A FastAPI router fronts llama-swap + llama.cpp, classifying each request into a coder, planner, or uncensored-planner tier. OpenAI-compatible API, opencode integration, per-project subshell, and a single `llmstack` console script.
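The directory-to-config synchronization described in the listing above can be illustrated with a short sketch. The `models:`/`cmd:` keys and the `${PORT}` macro follow llama-swap's documented config layout, but every path, model name, and flag below is a placeholder for illustration, not output from any of the tools listed.

```python
"""
Minimal sketch: scan a directory of .gguf files and write a llama-swap-style
config.yaml. Paths, names, and llama-server flags are assumptions; check the
llama-swap README for the exact config schema before using anything like this.
"""
from pathlib import Path

import yaml  # pip install pyyaml

MODEL_DIR = Path("/srv/models")                    # hypothetical model directory
CONFIG_PATH = Path("/srv/llama-swap/config.yaml")  # hypothetical config location


def build_config(model_dir: Path) -> dict:
    models = {}
    for gguf in sorted(model_dir.glob("*.gguf")):
        name = gguf.stem  # e.g. "qwen2.5-7b-instruct-q4_k_m"
        models[name] = {
            # llama-swap substitutes ${PORT} with the port it assigns at swap time
            "cmd": f"llama-server --model {gguf} --port ${{PORT}}",
        }
    return {"models": models}


if __name__ == "__main__":
    config = build_config(MODEL_DIR)
    CONFIG_PATH.parent.mkdir(parents=True, exist_ok=True)
    CONFIG_PATH.write_text(yaml.safe_dump(config, sort_keys=True))
    print(f"Wrote {len(config['models'])} model entries to {CONFIG_PATH}")
```

Re-running the script after adding or removing .gguf files regenerates the config; llama-swap can then pick up the new model list on its next restart (or config reload, where the version in use supports it).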
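Several of the routers and front-ends above, like llama-swap itself, expose an OpenAI-compatible chat-completions endpoint and pick the backend from the `model` field of the request. A minimal client call might look like the following; the URL, port, and model name are placeholders.

```python
# Hypothetical request against a llama-swap (or router) endpoint that speaks
# the OpenAI chat-completions API; adjust host, port, and model name to match
# your own config.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "qwen2.5-7b-instruct-q4_k_m",  # the proxy routes by this name
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```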