Run, benchmark, and serve Large Language Models locally with llama.cpp on GPU.
This repo gives you one-command scripts, persistent model management, OpenAI-compatible serving, and a repeatable benchmarking pipeline with plots.
- Local inference with CUDA via Docker (`ghcr.io/ggerganov/llama.cpp:full-cuda`)
- OpenAI-compatible server (`/v1/chat/completions`) for easy app integration
- Self-contained model workflow: the first run downloads the GGUF into `models/`, later runs reuse it
- Benchmarks that matter: automated sweeps + CSV + Markdown summary + charts
- Polished automation: `Makefile` + `venv` so anyone can reproduce the results
```bash
# 1) Clone
git clone https://github.com/shuvanon/local-llm-setup.git
cd local-llm-setup

# 2) Set up the Python env for analysis & plots (matplotlib, pandas)
make venv
source .venv/bin/activate

# 3) Try a single prompt (downloads the model on first run)
./scripts/run_llm.sh "Write an intro about federated learning." 64
```
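The exact contents of `scripts/run_llm.sh` aren't reproduced here, but given the image named above it presumably wraps a `docker run` call roughly like the following. The model filename and GPU layer count are placeholders, not the repo's actual values:

```bash
# Hypothetical sketch of what scripts/run_llm.sh might invoke (not the actual script).
docker run --rm --gpus all \
  -v "$PWD/models:/models" \
  ghcr.io/ggerganov/llama.cpp:full-cuda \
  --run -m /models/your-model.gguf \
  -p "Write an intro about federated learning." \
  -n 64 --n-gpu-layers 99
```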
```bash
# 4) Start API server (OpenAI-compatible)
./scripts/serve_llm.sh
```
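Similarly, `scripts/serve_llm.sh` likely starts llama.cpp's built-in HTTP server through the same image. A rough sketch, assuming the default port 8080 and a placeholder model file:

```bash
# Hypothetical sketch of what scripts/serve_llm.sh might invoke (not the actual script).
docker run --rm --gpus all \
  -p 8080:8080 \
  -v "$PWD/models:/models" \
  ghcr.io/ggerganov/llama.cpp:full-cuda \
  --server -m /models/your-model.gguf \
  --host 0.0.0.0 --port 8080 --n-gpu-layers 99
```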
# in another terminal:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d @examples/chat_request.json
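`examples/chat_request.json` isn't reproduced here, but the endpoint accepts the standard OpenAI chat-completions schema, so a payload of that shape can also be sent inline (the prompt and sampling values below are illustrative):

```bash
# Same call with an inline body instead of the JSON file (values are illustrative).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write an intro about federated learning."}
    ],
    "max_tokens": 64,
    "temperature": 0.7
  }'
```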
```bash
# 5) Run a batch sweep + summarize (CSV + Markdown + charts)
make benchmark
```
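Not part of the repo's own tooling, but a quick way to confirm the GPU is actually doing the work during the sweep is to watch utilization in a second terminal:

```bash
# Generic check (not specific to this repo): report GPU utilization/memory once per second.
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1
```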