GitHub - skylinesi/optiml: Acceleration library for LLM agents.

High-speed Large Language Model (LLM) inference on consumer-grade hardware—right on your PC. Large-scale agent deployment is no longer a datacenter privilege.

OptiML accelerates local inference by exploiting activation locality: a compact set of "hot" neurons fires frequently across inputs, while the long tail of "cold" neurons is input-dependent. OptiML places the hot subset on the GPU and schedules the cold subset on the CPU, delivering strong throughput with low VRAM on everyday hardware.

OptiML in Action

Demo video (demo.mp4): llama.cpp (left) vs. OptiML (right) on a single RTX 5080 (2.7x speedup).

Highlights

Run large models on a PC: Achieve server-class throughput with one consumer GPU + CPU.
Hybrid CPU/GPU execution: Keep frequently activated ("hot") neurons on the GPU; compute the long tail ("cold") on the CPU.
Lower VRAM pressure: Fit bigger models via quantization and activation-aware placement.
Practical & lightweight: Simple CLI, Python API, and an HTTP demo server for quick local deployment.

Project Motivation

LLMs exhibit power-law activation locality: a small, stable subset of neurons accounts for the majority of activations. OptiML identifies this subset and pins it to the GPU for fast reuse, while computing the less frequently activated neurons on the CPU. This co-design of placement and scheduling balances latency, throughput, and memory usage, enabling large-model serving on commodity PCs.
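To make the power-law claim concrete, here is a minimal sketch that estimates how small the "hot" subset can be. It uses synthetic Zipf-like counts (an assumption for illustration, not measurements from OptiML), and `hot_fraction` is a hypothetical helper, not part of the project.

```python
def hot_fraction(counts, coverage=0.8):
    """Fraction of neurons needed to account for `coverage` of all activations."""
    ordered = sorted(counts, reverse=True)  # most active neurons first
    total = sum(ordered)
    running = 0.0
    for k, count in enumerate(ordered, start=1):
        running += count
        if running >= coverage * total:
            return k / len(ordered)
    return 1.0


# Synthetic Zipf-like profile: the neuron with rank r fires in proportion to 1/r.
# Real profiles have to be measured on the model and workload of interest.
counts = [1.0 / rank for rank in range(1, 10001)]
frac = hot_fraction(counts, coverage=0.8)
print(f"{frac:.1%} of neurons cover 80% of activations")
```

With a uniform profile the same function returns the coverage itself (no locality to exploit), so the gap between the two numbers is a rough indicator of how much a hot/cold split can help.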

Supported Models (examples)

Decoder-only transformer families commonly distributed in GGUF or other quantized formats (e.g., LLaMA-style variants). Coverage expands with operator/back-end availability.

Requirements

A consumer GPU (NVIDIA/AMD/Apple Silicon) with recent drivers/toolkit
Modern CPU with AVX2 (or Apple Silicon)
CMake ≥ 3.20, a C/C++ toolchain
Python 3.9+ (optional, for bindings and scripts)

Quickstart

1) Build from source

git clone https://github.com/NU-QRG/optiml.git
cd optiml
mkdir build && cd build
# Enable the back-end that matches your GPU:
#   OPTIML_CUBLAS  CUDA (NVIDIA)
#   OPTIML_METAL   Apple Silicon
#   OPTIML_OPENCL  other GPU back-ends
cmake .. -DCMAKE_BUILD_TYPE=Release \
  -DOPTIML_CUBLAS=ON \
  -DOPTIML_METAL=OFF \
  -DOPTIML_OPENCL=OFF
cmake --build . -j

Tip: Toggle the back-ends that match your machine (e.g., set OPTIML_METAL=ON on Apple Silicon).

2) (Optional) Python bindings

cd bindings/python
pip install -e .

3) Prepare a model

OptiML works well with standard GGUF models. If you have original weights, first convert to GGUF, then optionally quantize:

# Example: quantize a GGUF model to Q4_K
./build/optiml-quantize --input <model path> --output model-q4_k.gguf --type q4_k

4) Run text generation (CLI)

./build/optiml-cli --model model-q4_k.gguf --prompt "Explain activation locality in one paragraph." --n-predict 128

5) Start the HTTP demo server

./examples/server/optiml-server --model model-q4_k.gguf --host 127.0.0.1 --port 8080

Open the provided minimal web UI and chat locally. The server exposes a simple REST API you can call from any client.
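A client call might look like the sketch below. The route and JSON fields are not documented in this README: the `/completion` path and the `prompt` / `n_predict` fields are assumptions modeled on llama.cpp-style servers, so check the server's own documentation before relying on them. `build_request` and `send` are hypothetical helpers.

```python
import json
import urllib.request


def build_request(base_url, path, prompt, n_predict=128):
    """Build a JSON POST request; path and field names are assumptions."""
    payload = {"prompt": prompt, "n_predict": n_predict}
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request to a running server and decode the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network.
req = build_request(
    "http://127.0.0.1:8080",
    "/completion",  # assumed route
    "Explain activation locality in one paragraph.",
)
print(req.get_method(), req.full_url)
```

Once the server from step 5 is running, `send(req)` would perform the call and return the decoded reply.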

How OptiML Works

Measure activation locality – Identify neurons that are consistently active across inputs.
Partition neurons – Tag a small "hot" set and a large "cold" set per layer.
Place & cache – Pin hot neurons and related weights on the GPU; compute cold activations on the CPU.
Hybrid scheduling – Overlap CPU/GPU compute and data movement; apply quantization to reduce memory and improve throughput.
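The four steps above can be sketched on a toy feed-forward layer. This is an illustration under simplifying assumptions (ReLU activation, random weights, plain Python lists, no real GPU), not OptiML's implementation. All function names are hypothetical. The point it shows is that computing the hot and cold neurons separately and summing the results reproduces the dense output.

```python
import random


def pre_activation(x, w_row):
    """Dot product of the input with one neuron's input weights."""
    return sum(w * xi for w, xi in zip(w_row, x))


def ffn_partial(x, w_in, w_out, neuron_ids):
    """Output contribution of the selected neurons (ReLU activation)."""
    out = [0.0] * len(w_out[0])
    for j in neuron_ids:
        a = max(0.0, pre_activation(x, w_in[j]))
        if a == 0.0:
            continue  # an inactive neuron contributes nothing
        for k, w in enumerate(w_out[j]):
            out[k] += a * w
    return out


def partition(freq, hot_budget):
    """Tag the `hot_budget` most frequently active neurons as hot."""
    ranked = sorted(range(len(freq)), key=lambda j: freq[j], reverse=True)
    return sorted(ranked[:hot_budget]), sorted(ranked[hot_budget:])


random.seed(0)
d_model, d_ff = 4, 16
w_in = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_out = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]

# Step 1: measure how often each neuron fires on sample inputs.
samples = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(200)]
freq = [sum(1 for x in samples if pre_activation(x, w_in[j]) > 0) for j in range(d_ff)]

# Step 2: partition into a small hot set and a larger cold set.
hot, cold = partition(freq, hot_budget=4)

# Steps 3-4: the hot part would run on the GPU and the cold part on the CPU;
# here both run on the CPU and the partial outputs are summed.
x = samples[0]
y_hot = ffn_partial(x, w_in, w_out, hot)
y_cold = ffn_partial(x, w_in, w_out, cold)
y = [a + b for a, b in zip(y_hot, y_cold)]
print("hot neurons:", hot)
```

Because each neuron's contribution is independent, the split changes where the work runs but not the result, which is what makes per-neuron placement possible.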

Quantization

OptiML supports common integer/block quantization schemes (e.g., Q4_K, others in GGUF ecosystems) to shrink model size with minimal quality loss. Use optiml-quantize and verify trade-offs with perplexity scripts.
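For intuition about block quantization, the sketch below implements a simplified symmetric 4-bit scheme with one scale per block of 32 values. It is deliberately not the Q4_K format (which has a more elaborate layout), and the function names are hypothetical.

```python
import math


def quantize_block(values, bits=4):
    """Symmetric quantization of one block: returns (scale, integer codes)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit codes
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def quantize(values, block_size=32, bits=4):
    """Split the values into blocks and quantize each block independently."""
    return [
        quantize_block(values[i:i + block_size], bits)
        for i in range(0, len(values), block_size)
    ]


def dequantize(blocks):
    """Reconstruct approximate values from (scale, codes) pairs."""
    return [scale * c for scale, codes in blocks for c in codes]


# Deterministic toy "weights" so the example is reproducible.
weights = [math.sin(i) for i in range(64)]
blocks = quantize(weights)
restored = dequantize(blocks)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"blocks: {len(blocks)}, max abs error: {max_err:.4f}")
```

The rounding error per value is bounded by half the block's scale, which is why smaller blocks (tighter scales) trade extra metadata for better accuracy.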

Benchmarking

We provide scripts to measure tokens/s, latency, and perplexity across quant levels, sequence lengths, and batch sizes.

# Throughput / latency
./build/optiml-bench --model <model path> --n-predict 256 --batch 1

# Perplexity
python examples/perplexity/perplexity.py --model <model path> --data <dataset file>

Record VRAM/RAM usage, tokens/s, and quality metrics to compare settings on your hardware.
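When recording results by hand, a small helper keeps the derived metrics consistent across runs. This sketch assumes you already have per-token latencies in seconds, however you collected them. `summarize` is a hypothetical helper and does not parse `optiml-bench` output.

```python
import statistics


def summarize(token_latencies_s):
    """Derive throughput and latency figures from per-token latencies (seconds)."""
    n = len(token_latencies_s)
    total = sum(token_latencies_s)
    return {
        "tokens": n,
        "tokens_per_s": n / total,
        "mean_latency_ms": 1000.0 * total / n,
        "median_latency_ms": 1000.0 * statistics.median(token_latencies_s),
        "max_latency_ms": 1000.0 * max(token_latencies_s),
    }


# Illustrative numbers only, not measurements.
report = summarize([0.1, 0.2, 0.3])
for key, value in report.items():
    print(f"{key}: {value}")
```

Reporting the median and maximum alongside the mean helps separate steady-state decoding speed from occasional stalls.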

Python API (preview)

You can also build OptiML-accelerated models using our Python APIs. A simple example is provided below. Note that Python bindings are still in an early stage. Expect bugs. If you run into any issue, file a bug report in the repository's issue section.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)

path = "Optiml/Optiml-7B-Instruct"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

# Alternatively, use the model's chat interface directly:
# response, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.9, top_p=0.9)
# print(response)
messages = [
    {"role": "user", "content": "Write an article about Artificial Intelligence."},
]
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)

model_outputs = model.generate(
    **model_inputs,
    do_sample=True,
    max_new_tokens=1024,
    top_p=0.9,
    temperature=0.9
)

# Strip the prompt tokens so that only the newly generated tokens are decoded
input_ids = model_inputs["input_ids"]
output_token_ids = [
    model_outputs[i][len(input_ids[i]):] for i in range(len(input_ids))
]

response = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(response)

Build Notes

Enable exactly one GPU back-end that matches your device (CUBLAS, METAL, OPENCL, …).
For very large models, more VRAM helps, but OptiML's hybrid placement reduces the requirement.
Use quantization to lower memory and often improve speed on PC-class hardware.
Ensure release builds (-DCMAKE_BUILD_TYPE=Release) for best performance.

FAQ

Which models work best? Decoder-only transformer families in GGUF with available kernels generally perform well.

Do I need a high-end GPU? Not necessarily. The hybrid layout reduces VRAM pressure by keeping the long tail on the CPU, making consumer GPUs practical.
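A back-of-the-envelope estimate shows why the hybrid layout helps. The figures below (7 billion parameters, about 4.5 bits per weight, 30% of weights kept on the GPU) are illustrative assumptions, not measurements, and the estimate ignores the KV cache, activations, and any weights that must stay on the GPU regardless of placement.

```python
def weight_bytes(n_params, bits_per_weight):
    """Storage needed for the weights alone, in bytes."""
    return n_params * bits_per_weight / 8


def gpu_resident_gib(n_params, bits_per_weight, hot_fraction):
    """GiB of weights kept on the GPU when only the hot fraction is resident."""
    return weight_bytes(n_params, bits_per_weight) * hot_fraction / 2**30


# Hypothetical configuration for illustration.
n_params = 7_000_000_000
full = gpu_resident_gib(n_params, bits_per_weight=4.5, hot_fraction=1.0)
hybrid = gpu_resident_gib(n_params, bits_per_weight=4.5, hot_fraction=0.3)
print(f"all weights on GPU: {full:.2f} GiB, hot subset only: {hybrid:.2f} GiB")
```

Substitute your own parameter count, quantization level, and measured hot fraction to see whether a given GPU is a realistic target.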

How is this different from pure-GPU engines? OptiML co-designs placement and scheduling around activation locality, trading a modest amount of CPU work for the ability to serve larger models efficiently on a PC.

Roadmap

Broader model/operator coverage
Additional quantization modes and calibration tools
Auto-tuning for more platforms
Extended demos (agents, RAG, function calling)

Track progress and propose features via issues/discussions.

Contributing

Contributions are welcome! When filing issues, include:

OS/driver/toolkit versions
CPU/GPU model and RAM/VRAM
Model/quant settings
Exact commands and logs

Acknowledgments

OptiML builds on community progress in activation-aware execution, hybrid CPU/GPU scheduling, quantization, and open model formats. Thanks to contributors who make local LLMs fast and accessible.

