- Multiple Backends: Switch between llama.cpp and Microsoft Foundry
- Web Dashboard: Browser-based UI for managing backends and models
- OpenAI-Compatible API: Drop-in replacement for OpenAI API at `/v1`
- RPC Clustering: Distribute inference across multiple machines (llama.cpp)
| Backend | Platform | Model Format |
|---|---|---|
| llama.cpp | macOS, Linux | GGUF |
| Foundry | macOS, Windows | ONNX |
- Nix package manager
- For llama.cpp: `llama-server` in PATH (installed via Nix flake)
- For Foundry: `foundry` CLI and `foundry-local-sdk` (macOS/Windows only)
```bash
# Clone the repository
git clone https://github.com/FHNW-Security-Lab/local-llm-toolbox.git
cd local-llm-toolbox
# Enter the development environment (auto-detects GPU)
./dev
# Start the toolbox
./toolbox serve
```

The `./dev` script auto-detects your GPU and selects the appropriate environment:
- macOS: Metal (Apple Silicon)
- Linux with NVIDIA: CUDA
- Linux with AMD/Intel: Vulkan
- No GPU: CPU-only with BLAS
You can also select a specific environment directly:
```bash
./dev nvidia # CUDA (NVIDIA GPU)
./dev vulkan # Vulkan (AMD/Intel)
./dev cpu # CPU only (BLAS)
```

Or use Nix directly:
```bash
nix develop # Default (Vulkan on Linux, Metal on macOS)
nix develop .#nvidia # CUDA
nix develop .#vulkan # Vulkan
nix develop .#cpu # CPU only
```

The dashboard will be available at http://localhost:8090.
For llama.cpp (GGUF models):
Download GGUF files to ~/.local/share/models/:
```bash
# Example: Download from HuggingFace
cd ~/.local/share/models
wget https://huggingface.co/bartowski/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
```
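If you prefer to script downloads, the `huggingface_hub` Python package can fetch the same file; this is an optional sketch (the package is not part of the toolbox, so install it separately):

```python
# Optional alternative to wget: fetch a GGUF file with huggingface_hub
# (pip install huggingface_hub). Repo and file names mirror the example above;
# swap in whichever model you actually want.
from pathlib import Path

from huggingface_hub import hf_hub_download

models_dir = Path.home() / ".local/share/models"  # default LLAMA_MODELS_DIR
models_dir.mkdir(parents=True, exist_ok=True)

hf_hub_download(
    repo_id="bartowski/Qwen2.5-1.5B-Instruct-GGUF",
    filename="Qwen2.5-1.5B-Instruct-Q4_K_M.gguf",
    local_dir=models_dir,
)
```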
For Foundry (ONNX models):

Use the download command or the web dashboard:
```bash
./toolbox download phi-4 foundry
```

Start the toolbox and open http://localhost:8090 in your browser:
```bash
./toolbox serve
```

From the dashboard you can:
- Start/stop backends
- Browse and load models
- Chat with loaded models
- Configure RPC clustering (llama.cpp)
- Monitor system resources
```bash
# Start the dashboard + API server
./toolbox serve
./toolbox serve --debug # Enable verbose logging
# Backend management
./toolbox start llama # Start llama.cpp backend
./toolbox start foundry # Start Foundry backend
./toolbox stop # Stop the active backend
./toolbox status # Show current status
# Model management
./toolbox models # List available models
./toolbox models llama # List models for specific backend
./toolbox load <model> # Load a model
./toolbox unload # Unload current model
./toolbox download <model> # Download a model (Foundry only)
# RPC worker (run on worker machines for distributed inference)
./toolbox rpc llama # Start as RPC worker
```

The API is available at http://localhost:8090/v1 and supports:
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions (streaming supported)
- `POST /v1/completions` - Text completions
- `POST /v1/embeddings` - Embeddings (if supported by backend)
Example with curl:
```bash
curl http://localhost:8090/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-1.5b-instruct-q4_k_m",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```

Example with OpenAI Python client:
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8090/v1",
api_key="not-needed" # No auth required
)
response = client.chat.completions.create(
model="qwen2.5-1.5b-instruct-q4_k_m",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
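Because the chat endpoint supports streaming, the same client can print tokens as they are generated. A minimal sketch, reusing the server address and model name from the example above:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8090/v1",
    api_key="not-needed",  # No auth required
)

# Request a streamed response and print tokens as they arrive
stream = client.chat.completions.create(
    model="qwen2.5-1.5b-instruct-q4_k_m",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```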
Copy config.env.example to .env and adjust as needed:

```bash
cp config.env.example .env
```

| Variable | Default | Description |
|---|---|---|
| `DASHBOARD_PORT` | `8090` | Web dashboard and API port |
| `LLAMA_MODELS_DIR` | `~/.local/share/models` | Where to look for GGUF models |
| `LLAMA_PORT` | `8080` | llama-server port |
| `LLAMA_CTX_SIZE` | `8192` | Context window size |
| `LLAMA_GPU_LAYERS` | `99` | Layers to offload to GPU |
| `FOUNDRY_PORT` | `5273` | Foundry service port |
See config.env.example for all available options.
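As an example, a small `.env` that overrides only a few of the settings above might look like this (values are illustrative, not recommendations):

```bash
# .env: override only what you need; anything unset keeps its default
DASHBOARD_PORT=9090
LLAMA_MODELS_DIR=/data/models
LLAMA_CTX_SIZE=16384
```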
Distribute inference across multiple machines using llama.cpp's RPC feature.
1. On each worker machine:
Clone the repo and start the RPC worker:
```bash
git clone https://github.com/your-org/local-llm-toolbox.git
cd local-llm-toolbox
./dev
./toolbox rpc llama
```

The worker will start:
- RPC server on port 50052 (tensor offload)
- Control API on port 50053 (management)
2. On the main machine:
Configure workers in .env:
```bash
LLAMA_RPC_WORKERS=192.168.1.10,192.168.1.11
```

Then start normally:
```bash
./toolbox serve
./toolbox start llama
./toolbox load my-model
```

The main node will automatically reset all workers before loading each model, ensuring a clean state.
- Workers expose GPU/CPU compute via llama.cpp's `rpc-server`
- The main node's control API resets workers before each model load
- Model weights are distributed across all nodes proportionally to available memory
- Tensor caching (`-c` flag) speeds up repeated model loads
| Variable | Default | Description |
|---|---|---|
| `LLAMA_RPC_WORKERS` | (empty) | Comma-separated worker IPs/hostnames |
| `LLAMA_RPC_CONTROL_PORT` | `50053` | Control API port on workers |
- Workers must be started manually on each machine
- If the main node crashes, workers may need a restart to clear leaked GPU memory
- Use `./toolbox rpc llama --help` for worker options