An index of self-hosted AI inference products
- vLLM: production serving system focused on high-throughput continuous batching (PagedAttention)
- SGLang: production serving system focused on structured generation and agentic workloads
- Aphrodite Engine: a fork of vLLM with broader quantization support
- LoRAX: production server by Predibase focused on serving many LoRA adapters dynamically on a single base model
- Ollama: llama.cpp wrapper with model management and a simple CLI/API, designed for developer laptops
- KoboldCpp: fork of llama.cpp designed for roleplay, with a built-in web UI
- LMDeploy: multimodal inference server by the InternLM team
- llama.cpp: lightweight LLM runtime for CPU and GPU (GGUF models)
- ExLlamaV2: lightweight LLM runtime for GPU; fast EXL2 quantization and tensor parallelism across any number of GPUs
- MLC-LLM: compiler-optimized LLM runtime targeting many backends; can run in the browser via WebAssembly/WebGPU
- TensorRT-LLM: NVIDIA's official inference runtime for their GPUs
- CTranslate2: C++ inference engine supporting many model architectures
- HF Transformers: not the fastest, but supports the widest range of models
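Most of the servers above (vLLM, SGLang, LoRAX, Ollama, llama.cpp's server, LMDeploy) expose an OpenAI-compatible chat completions endpoint, so a single client works across them. A minimal stdlib-only sketch, assuming a server on `localhost:8000`; the port and model name are placeholders that vary per product:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # adjust port per server (e.g. Ollama uses 11434)

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (local servers typically need no API key)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("llama-3.1-8b-instruct", "Hello!")
# Sending requires a running server:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
print(json.loads(req.data)["messages"][0]["content"])
```

Because the request shape is shared, switching between engines is usually just a matter of changing the base URL and model name.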