Krasis is a hybrid LLM runtime focused on running larger models efficiently on consumer-grade, VRAM-limited hardware
Updated May 19, 2026 - C++
GPU-aware inference mesh for large-scale AI serving
GPU-accelerated LLaMA inference wrapper for legacy Vulkan-capable systems: a Pythonic way to run AI with knowledge (Ilm) on fire (Vulkan).
Mixed-vendor GPU inference cluster manager with speculative decoding
🚀 ClipServe: A fast API server for embedding text, images, and performing zero-shot classification using OpenAI’s CLIP model. Powered by FastAPI, Redis, and CUDA for lightning-fast, scalable AI applications. Transform texts and images into embeddings or classify images with custom labels—all through easy-to-use endpoints. 🌐📊
A FastAPI server for querying Google's Gemma Translate AI models for translations
A high-performance deep learning model inference server based on TensorRT, supporting fast inference for Embedding, Reranker, and NLI models.
Open-source developer tool for testing deAPI.ai endpoints — unified AI inference API for image, video, audio, transcription, OCR and more
Docker-based GPU inference of machine learning models
Continuous batching for TTS — like vLLM, but for voice. Serve 10+ simultaneous text-to-speech requests on a single GPU.
Generating images with diffusion models on a mobile device, with an intranet GPU box as backend
End-to-end scalable ML inference on EKS: KEDA-driven pod autoscaling with Prometheus custom metrics, Cluster Autoscaler for GPU node scaling, and NVIDIA GPU time-slicing to run multiple pods per GPU.
Making TLB invalidation observable, attributable, and measurable in modern AI workloads.
ModelSpec is an open, declarative specification for describing how AI models, especially LLMs, are deployed, served, and operated in production. It captures execution, serving, and orchestration intent to enable validation, reasoning, and automation across modern AI infrastructure.
Instant setup scripts for cloud-based LLM development.
HAL 9000 voice cloning with Qwen3-TTS streaming - OpenAI-compatible REST API
Personal text-to-speech webapp powered by VoxCPM2 — voice design, controllable cloning, and ultimate cloning. Next.js on Vercel + Modal GPU.
4-tier asynchronous LLM cascade system achieving 120 tokens/sec on constrained hardware using llama.cpp, speculative decoding, and GPU+CPU parallel inference
Qwen3.6 27B MTP on Modal H100 with llama.cpp and an OpenAI-compatible API
High-performance Python architecture for multi-stream NVDEC decoding and GPU inference using DLPack and PyTorch CUDA IPC to bypass the GIL.