πŸ—οΈ AI + BuildKit = AIKit: Build and deploy large language models easily

AIKit ✨


AIKit is a quick, easy, and local or cloud-agnostic way to get started hosting and deploying large language models (LLMs) for inference. No GPU, internet access, or additional tools are needed to get started, other than Docker!

AIKit uses LocalAI under the hood to run inference. LocalAI provides a drop-in replacement REST API that is OpenAI API compatible, so you can use any OpenAI API compatible client, such as Kubectl AI, Chatbot-UI, and many more, to send requests to open-source LLMs powered by AIKit!
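
In practice, most OpenAI API compatible clients only need their API base URL pointed at the AIKit endpoint (for example, http://localhost:8080/v1). As a quick sanity check once a model container is running (see the quick start below), you can list the available models through the standard OpenAI-style endpoint; a minimal sketch, assuming a container is already listening on port 8080:

curl http://localhost:8080/v1/models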

Features

  • No GPU or internet access required: Docker is the only prerequisite
  • OpenAI API compatible REST API, powered by LocalAI, so existing OpenAI clients work as-is
  • Pre-made model images (Llama 2, Orca 2, Mixtral, Phi 2) for CPU and NVIDIA CUDA
  • Build custom model images from a simple aikitfile with BuildKit
  • Easy deployment to Kubernetes
  • NVIDIA GPU acceleration support

Quick Start

You can get started with AIKit quickly on your local machine without a GPU!

docker run -d --rm -p 8080:8080 ghcr.io/sozercan/llama2:7b
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]
  }'

Output should be similar to:

{"created":1701236489,"object":"chat.completion","id":"dd1ff40b-31a7-4418-9e32-42151ab6875a","model":"llama-2-7b-chat","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"\nKubernetes is a container orchestration system that automates the deployment, scaling, and management of containerized applications in a microservices architecture."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

That's it! πŸŽ‰ The API is OpenAI compatible, so this works as a drop-in replacement for any OpenAI API compatible client.

Demos

See the demos page for demos and examples.

Pre-made Models

AIKit comes with pre-made models that you can use out-of-the-box!

CPU

| Model | Optimization | Parameters | Command | License |
|-------|--------------|------------|---------|---------|
| πŸ¦™ Llama 2 | Chat | 7B | docker run -d --rm -p 8080:8080 ghcr.io/sozercan/llama2:7b | Llama 2 |
| πŸ¦™ Llama 2 | Chat | 13B | docker run -d --rm -p 8080:8080 ghcr.io/sozercan/llama2:13b | Llama 2 |
| 🐬 Orca 2 | | 13B | docker run -d --rm -p 8080:8080 ghcr.io/sozercan/orca2:13b | Microsoft Research |
| Ⓜ️ Mixtral | Instruct | 8x7B | docker run -d --rm -p 8080:8080 ghcr.io/sozercan/mixtral:8x7b | Apache |
| πŸ…ΏοΈ Phi 2 | Instruct | 2.7B | docker run -d --rm -p 8080:8080 ghcr.io/sozercan/phi2:2.7b | MIT |

NVIDIA CUDA

| Model | Optimization | Parameters | Command | License |
|-------|--------------|------------|---------|---------|
| πŸ¦™ Llama 2 | Chat | 7B | docker run -d --rm --gpus all -p 8080:8080 ghcr.io/sozercan/llama2:7b-cuda | Llama 2 |
| πŸ¦™ Llama 2 | Chat | 13B | docker run -d --rm --gpus all -p 8080:8080 ghcr.io/sozercan/llama2:13b-cuda | Llama 2 |
| 🐬 Orca 2 | | 13B | docker run -d --rm --gpus all -p 8080:8080 ghcr.io/sozercan/orca2:13b-cuda | Microsoft Research |
| Ⓜ️ Mixtral | Instruct | 8x7B | docker run -d --rm --gpus all -p 8080:8080 ghcr.io/sozercan/mixtral:8x7b-cuda | Apache |
| πŸ…ΏοΈ Phi 2 | Instruct | 2.7B | docker run -d --rm --gpus all -p 8080:8080 ghcr.io/sozercan/phi2:2.7b-cuda | MIT |

Note

Please see models folder for pre-made model definitions.

If the model is not offloaded to GPU VRAM, a minimum of 8GB of RAM is required to run 7B models, 16GB of RAM to run 13B models, and 32GB of RAM to run 8x7B models.

CPU models require at least the AVX instruction set. You can check whether your CPU supports AVX by running grep avx /proc/cpuinfo.

CUDA models include CUDA v12 and are intended for use with NVIDIA GPU acceleration.
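
As a quick preflight check on Linux, you can confirm AVX support and see how much memory is available before picking a model size; these are standard Linux commands, not specific to AIKit:

# check for AVX support
grep -q avx /proc/cpuinfo && echo "AVX supported" || echo "AVX not found"

# check total and available memory in gigabytes
free -g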

Getting Started

Creating an image

Note

This section shows how to create a custom image with models of your choosing. If you want to use one of the pre-made models, skip to running models.

Create an aikitfile.yaml with the following structure:

#syntax=ghcr.io/sozercan/aikit:latest
apiVersion: v1alpha1
models:
  - name: llama-2-7b-chat
    source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf

Tip

This is the simplest way to get started building an image. For the full aikitfile specification, see the specs.

First, create a buildx BuildKit instance. Alternatively, if you are using Docker v24 with the containerd image store enabled, you can skip this step.

docker buildx create --use --name aikit-builder
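
You can verify that the new builder exists and is selected before building; these are standard buildx commands:

docker buildx ls
docker buildx inspect --bootstrap aikit-builder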

Then build your image with:

docker buildx build . -t my-model -f aikitfile.yaml --load

This will build a local container image with your model(s). You can see the image with:

docker images
REPOSITORY    TAG       IMAGE ID       CREATED             SIZE
my-model      latest    e7b7c5a4a2cb   About an hour ago   5.51GB
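
If you plan to run the image somewhere other than the machine it was built on (for example, the Kubernetes deployment covered below), you can push it to a container registry first. A sketch, using a hypothetical registry path that you would replace with your own:

docker tag my-model ghcr.io/<your-user>/my-model:latest
docker push ghcr.io/<your-user>/my-model:latest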

Running models

You can start the inference server for your models with:

# for pre-made models, replace "my-model" with the image name
docker run -d --rm -p 8080:8080 my-model
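
If you want to follow the server logs while testing (useful later for confirming GPU offloading), a small variation of the command above gives the container a name and tails its output:

docker run -d --rm --name my-llm -p 8080:8080 my-model
docker logs -f my-llm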

You can then send requests to localhost:8080 to run inference from your models. For example:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llama-2-7b-chat",
     "messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]
   }'
{"created":1701236489,"object":"chat.completion","id":"dd1ff40b-31a7-4418-9e32-42151ab6875a","model":"llama-2-7b-chat","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"\nKubernetes is a container orchestration system that automates the deployment, scaling, and management of containerized applications in a microservices architecture."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

Kubernetes Deployment

It is easy to get started deploying your models to Kubernetes!

Make sure you have a Kubernetes cluster running, that kubectl is configured to talk to it, and that your model images are accessible from the cluster.

Tip

You can use kind to create a local Kubernetes cluster for testing purposes.
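
For example, a throwaway kind cluster can be created and a locally built image loaded into it with the standard kind commands (not specific to AIKit):

# create a local test cluster
kind create cluster

# make a locally built image available to the cluster nodes
kind load docker-image my-model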

# create a deployment
# for pre-made models, replace "my-model" with the image name
kubectl create deployment my-llm-deployment --image=my-model

# expose it as a service
kubectl expose deployment my-llm-deployment --port=8080 --target-port=8080 --name=my-llm-service

# easy to scale up and down as needed
kubectl scale deployment my-llm-deployment --replicas=3

# port-forward for testing locally
kubectl port-forward service/my-llm-service 8080:8080

# send requests to your model
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llama-2-7b-chat",
     "messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]
   }'
{"created":1701236489,"object":"chat.completion","id":"dd1ff40b-31a7-4418-9e32-42151ab6875a","model":"llama-2-7b-chat","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"\nKubernetes is a container orchestration system that automates the deployment, scaling, and management of containerized applications in a microservices architecture."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

Tip

For an example Kubernetes deployment and service YAML, see the kubernetes folder. Please note that these are examples; you may need to customize them (for instance, with properly configured resource requests and limits) based on your needs.
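
As a rough illustration of what such a manifest might look like, here is a sketch of a Deployment and Service with resource requests and limits; the image name, replica count, and resource values are placeholders to adjust for your model:

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-llm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-llm
  template:
    metadata:
      labels:
        app: my-llm
    spec:
      containers:
        - name: my-llm
          image: my-model          # replace with your image name
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:
              cpu: "8"
              memory: 16Gi
              # for CUDA images on a GPU-enabled cluster, you would also request a GPU, e.g.:
              # nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: my-llm-service
spec:
  selector:
    app: my-llm
  ports:
    - port: 8080
      targetPort: 8080
EOF

This mirrors the kubectl create deployment and kubectl expose commands above, with resource settings added explicitly.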

GPU Acceleration Support

Note

At this time, only NVIDIA GPU acceleration is supported. Please open an issue if you'd like to see support for other GPU vendors.

NVIDIA

AIKit supports GPU-accelerated inferencing with the NVIDIA Container Toolkit. You must also have NVIDIA drivers installed on your host machine.

For Kubernetes, the NVIDIA GPU Operator provides a streamlined way to install the NVIDIA drivers and container toolkit and to configure your cluster to use GPUs.

To get started with GPU-accelerated inferencing, make sure to set the following in your aikitfile and build your model:

runtime: cuda         # use NVIDIA CUDA runtime

For the llama backend, set the following in your config:

f16: true             # use float16 precision
gpu_layers: 35        # number of layers to offload to GPU
low_vram: true        # for devices with low VRAM

Tip

Make sure to customize these values based on your model and GPU specs.

Note

For exllama and exllama2 backends, GPU acceleration is enabled by default and cannot be disabled.
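
Putting the pieces together, a CUDA-enabled build might look like the following sketch; it assumes runtime is a top-level field, as the snippet above suggests, and leaves the llama backend tuning as a comment since its exact placement is defined in the specs:

cat > aikitfile.yaml <<'EOF'
#syntax=ghcr.io/sozercan/aikit:latest
apiVersion: v1alpha1
runtime: cuda
models:
  - name: llama-2-7b-chat
    source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
# llama backend tuning (f16, gpu_layers, low_vram) goes in the model config
# as shown above; see the specs for the exact placement
EOF

docker buildx build . -t my-model -f aikitfile.yaml --load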

After building the model, you can run it with the --gpus all flag to enable GPU support:

# for pre-made models, replace "my-model" with the image name
docker run --rm --gpus all -p 8080:8080 my-model

If GPU acceleration is working, you'll see output similar to the following in the debug logs:

5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr ggml_init_cublas: found 1 CUDA devices:
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr   Device 0: Tesla T4, compute capability 7.5
...
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: using CUDA for GPU acceleration
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: mem required  =   70.41 MB (+ 2048.00 MB per state)
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading 32 repeating layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading non-repeating layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading v cache to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading k cache to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloaded 35/35 layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: VRAM used: 5869 MB
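
If you don't see lines like these, a common first check is whether the NVIDIA Container Toolkit is working at the Docker level at all; the standard NVIDIA verification (independent of AIKit, using an example CUDA base image tag) is:

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi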

Acknowledgements and Credits

AIKit is built on top of LocalAI for model inference and BuildKit for building model images.
