
go-llama


Go bindings and a unified server/CLI for llama.cpp.

Run a local LLM server with a REST API, manage GGUF models, and use the go-llama CLI for chat, completion, embeddings, and tokenization.

Features

  • Command Line Interface: Interactive chat and completion tooling
  • HTTP API Server: REST endpoints for chat, completion, embeddings, and model management
  • Model Management: Pull, cache, load, unload, and delete GGUF models
  • Streaming: Incremental token streaming for chat and completion
  • GPU Support: CUDA, Vulkan, and Metal (macOS) acceleration via llama.cpp
  • Docker Support: Pre-built images for CPU, CUDA, and Vulkan targets

Some work remains on the chat endpoint. The following are not yet included, but will eventually be supported:

  • Multi-modal support (images, audio, PDFs, etc.)
  • Reasoning/Thinking support
  • OpenAI- or Anthropic-compatible API
  • Tool calling
  • Grammar (JSON format output)
  • Text-to-Speech (Audio output)

Quick Start

Start the server with Docker:

docker volume create go-llama
docker run -d --name go-llama \
  -v go-llama:/data -p 8083:8083 \
  ghcr.io/mutablelogic/go-llama run

Then use the CLI to interact with the server:

export GOLLAMA_ADDR="localhost:8083"

# Pull a model (Hugging Face URL or hf:// scheme)
go-llama pull hf://bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q4_K_M.gguf

# List models
go-llama models

# Load a model into memory
go-llama load Llama-3.2-1B-Instruct-Q4_K_M.gguf

# Chat (interactive)
go-llama chat Llama-3.2-1B-Instruct-Q4_K_M.gguf "You are a helpful assistant"
# Completion
go-llama complete Llama-3.2-1B-Instruct-Q4_K_M.gguf "Explain KV cache in two sentences"
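
The same operations are available over the server's REST API. The request below is only a sketch: the endpoint path is an assumption, not the documented route; see pkg/llamacpp/httphandler and pkg/llamacpp/schema for the actual routes and payload types.

# Hypothetical endpoint path; consult pkg/llamacpp/httphandler for the real routes
curl http://localhost:8083/api/models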

Model Support

go-llama works with GGUF models supported by llama.cpp. Models can be pulled from Hugging Face using:

  • https://huggingface.co/<org>/<repo>/blob/<branch>/<file>.gguf
  • hf://<org>/<repo>/<file>.gguf

The default model cache directory is ${XDG_CACHE_HOME}/go-llama (falling back to the system temporary directory) and can be overridden with the GOLLAMA_DIR environment variable.
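
For example, both commands below refer to the same file (assuming it lives on the repository's main branch), and the cache location can be redirected to an arbitrary directory before pulling:

# Equivalent pulls: full Hugging Face URL vs. hf:// scheme
go-llama pull https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/blob/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf
go-llama pull hf://bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q4_K_M.gguf

# Store pulled models somewhere other than ${XDG_CACHE_HOME}/go-llama
export GOLLAMA_DIR=/opt/models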

Docker Deployment

Docker images are published for Linux amd64 and arm64. Variants include:

  • CPU and Vulkan: ghcr.io/mutablelogic/go-llama
  • CUDA: ghcr.io/mutablelogic/go-llama-cuda

Use the run command inside the container to start the server. For GPU usage, ensure the host has the appropriate drivers and runtime.
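
For example, a CUDA deployment might look like the sketch below; it assumes the NVIDIA Container Toolkit is installed on the host so that --gpus all is available, and reuses the volume and port from the Quick Start:

docker run -d --name go-llama \
  --gpus all \
  -v go-llama:/data -p 8083:8083 \
  ghcr.io/mutablelogic/go-llama-cuda run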

CLI Usage Examples

Client-only commands:

| Command    | Description                | Example                                      |
|------------|----------------------------|----------------------------------------------|
| models     | List available models      | go-llama models                              |
| model      | Get model details          | go-llama model phi-4-q4_k_m.gguf             |
| pull       | Download a model           | go-llama pull hf://org/repo/model.gguf       |
| load       | Load a model into memory   | go-llama load phi-4-q4_k_m.gguf              |
| unload     | Unload a model from memory | go-llama unload phi-4-q4_k_m.gguf            |
| delete     | Delete a model             | go-llama delete phi-4-q4_k_m.gguf            |
| chat       | Interactive chat           | go-llama chat phi-4-q4_k_m.gguf "system"     |
| complete   | Text completion            | go-llama complete phi-4-q4_k_m.gguf "prompt" |
| embed      | Generate embeddings        | go-llama embed phi-4-q4_k_m.gguf "text"      |
| tokenize   | Convert text to tokens     | go-llama tokenize phi-4-q4_k_m.gguf "text"   |
| detokenize | Convert tokens to text     | go-llama detokenize phi-4-q4_k_m.gguf 1 2 3  |

Use go-llama --help or go-llama <command> --help for full options.

Server commands:

| Command | Description          | Example                                 |
|---------|----------------------|-----------------------------------------|
| gpuinfo | Show GPU information | go-llama gpuinfo                        |
| run     | Run the HTTP server  | go-llama run --http.addr localhost:8083 |

Development

Project Structure

  • cmd contains the CLI and server entrypoint
  • pkg/llamacpp contains the high-level service and HTTP handlers
    • httpclient/ - client for the server API
    • httphandler/ - HTTP handlers and routing
    • schema/ - API types
  • sys/llamacpp contains native bindings to llama.cpp
  • sys/gguf contains GGUF parsing helpers
  • third_party/llama.cpp is the upstream llama.cpp submodule
  • etc/ contains Dockerfiles

Building

# Build server binary
make go-llama

# Build client-only binary
make go-llama-client

# Build Docker images
make docker

Use GGML_CUDA=1 or GGML_VULKAN=1 to build GPU variants.
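
For example, assuming the Makefile picks these up as environment variables:

# CUDA-enabled server binary
GGML_CUDA=1 make go-llama

# Vulkan-enabled Docker images
GGML_VULKAN=1 make docker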

Contributing & License

Please file bug reports and feature requests on the GitHub issue tracker. Licensed under Apache 2.0.
