
go-llama


Go bindings and a unified server/CLI for llama.cpp.

Run a local LLM server with a REST API, manage GGUF models, and use the go-llama CLI for chat, completion, embeddings, and tokenization.

Features

  • Command Line Interface: Interactive chat and completion tooling
  • HTTP API Server: REST endpoints for chat, completion, embeddings, and model management
  • Model Management: Pull, cache, load, unload, and delete GGUF models
  • Streaming: Incremental token streaming for chat and completion
  • GPU Support: CUDA, Vulkan, and Metal (macOS) acceleration via llama.cpp
  • Docker Support: Pre-built images for CPU, CUDA, and Vulkan targets

Some work remains on the chat endpoint. The following are not yet included, but will eventually be supported:

  • Multi-modal support (images, audio, PDFs, etc.)
  • Reasoning/Thinking support
  • OpenAI- or Anthropic-compatible API
  • Tool calling
  • Grammar (JSON format output)
  • Text-to-Speech (Audio output)

Quick Start

Start the server with Docker:

docker volume create go-llama
docker run -d --name go-llama \
  -v go-llama:/data -p 8083:8083 \
  ghcr.io/mutablelogic/go-llama run

Then use the CLI to interact with the server:

export GOLLAMA_ADDR="localhost:8083"

# Pull a model (Hugging Face URL or hf:// scheme)
go-llama pull hf://bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q4_K_M.gguf

# List models
go-llama models

# Load a model into memory
go-llama load Llama-3.2-1B-Instruct-Q4_K_M.gguf

# Chat (interactive)
go-llama chat Llama-3.2-1B-Instruct-Q4_K_M.gguf "You are a helpful assistant"
# Completion
go-llama complete Llama-3.2-1B-Instruct-Q4_K_M.gguf "Explain KV cache in two sentences"
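
The same operations are available over the server's REST API. The request below is only a sketch: the endpoint path is an assumption, not the documented route; see pkg/llamacpp/httphandler and pkg/llamacpp/schema for the actual routes and payload types.

# Hypothetical endpoint path; consult pkg/llamacpp/httphandler for the real routes
curl http://localhost:8083/api/models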

Model Support

go-llama works with GGUF models supported by llama.cpp. Models can be pulled from Hugging Face using:

  • https://huggingface.co/<org>/<repo>/blob/<branch>/<file>.gguf
  • hf://<org>/<repo>/<file>.gguf

The default model cache directory is ${XDG_CACHE_HOME}/go-llama (falling back to the system temporary directory) and can be overridden with the GOLLAMA_DIR environment variable.
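
For example, both commands below refer to the same file (assuming it lives on the repository's main branch), and the cache location can be redirected to an arbitrary directory before pulling:

# Equivalent pulls: full Hugging Face URL vs. hf:// scheme
go-llama pull https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/blob/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf
go-llama pull hf://bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q4_K_M.gguf

# Store pulled models somewhere other than ${XDG_CACHE_HOME}/go-llama
export GOLLAMA_DIR=/opt/models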

Docker Deployment

Docker images are published for Linux amd64 and arm64. Variants include:

  • CPU and Vulkan: ghcr.io/mutablelogic/go-llama
  • CUDA: ghcr.io/mutablelogic/go-llama-cuda

Use the run command inside the container to start the server. For GPU usage, ensure the host has the appropriate drivers and runtime.
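
For example, a CUDA deployment might look like the sketch below; it assumes the NVIDIA Container Toolkit is installed on the host so that --gpus all is available, and reuses the volume and port from the Quick Start:

docker run -d --name go-llama \
  --gpus all \
  -v go-llama:/data -p 8083:8083 \
  ghcr.io/mutablelogic/go-llama-cuda run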

CLI Usage Examples

Client-only commands:

| Command    | Description                | Example                                      |
|------------|----------------------------|----------------------------------------------|
| models     | List available models      | go-llama models                              |
| model      | Get model details          | go-llama model phi-4-q4_k_m.gguf             |
| pull       | Download a model           | go-llama pull hf://org/repo/model.gguf       |
| load       | Load a model into memory   | go-llama load phi-4-q4_k_m.gguf              |
| unload     | Unload a model from memory | go-llama unload phi-4-q4_k_m.gguf            |
| delete     | Delete a model             | go-llama delete phi-4-q4_k_m.gguf            |
| chat       | Interactive chat           | go-llama chat phi-4-q4_k_m.gguf "system"     |
| complete   | Text completion            | go-llama complete phi-4-q4_k_m.gguf "prompt" |
| embed      | Generate embeddings        | go-llama embed phi-4-q4_k_m.gguf "text"      |
| tokenize   | Convert text to tokens     | go-llama tokenize phi-4-q4_k_m.gguf "text"   |
| detokenize | Convert tokens to text     | go-llama detokenize phi-4-q4_k_m.gguf 1 2 3  |

Use go-llama --help or go-llama <command> --help for full options.

Server commands:

| Command | Description          | Example                                 |
|---------|----------------------|-----------------------------------------|
| gpuinfo | Show GPU information | go-llama gpuinfo                        |
| run     | Run the HTTP server  | go-llama run --http.addr localhost:8083 |

Development

Project Structure

  • cmd contains the CLI and server entrypoint
  • pkg/llamacpp contains the high-level service and HTTP handlers
    • httpclient/ - client for the server API
    • httphandler/ - HTTP handlers and routing
    • schema/ - API types
  • sys/llamacpp contains native bindings to llama.cpp
  • sys/gguf contains GGUF parsing helpers
  • third_party/llama.cpp is the upstream llama.cpp submodule
  • etc/ contains Dockerfiles

Building

# Build server binary
make go-llama

# Build client-only binary
make go-llama-client

# Build Docker images
make docker

Use GGML_CUDA=1 or GGML_VULKAN=1 to build GPU variants.
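
For example, assuming the Makefile picks these up as environment variables:

# CUDA-enabled server binary
GGML_CUDA=1 make go-llama

# Vulkan-enabled Docker images
GGML_VULKAN=1 make docker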

Contributing & License

Please file bug reports and feature requests on the GitHub issue tracker. Licensed under Apache 2.0.
