# vllm

Here are 1,799 public repositories matching this topic...

modelscope / FunASR

Open-source speech recognition toolkit for training, inference, streaming ASR, VAD, punctuation, speaker diarization pipelines, and OpenAI-compatible/MCP serving.

Updated Jul 24, 2026
Python

meta-llama / llama-cookbook

Welcome to the Llama Cookbook! This is your go-to guide for building with Llama: getting started with inference, fine-tuning, and RAG. We also show you how to solve end-to-end problems using the Llama model family on various provider services.

python machine-learning ai pytorch llama finetuning llm langchain vllm llama2

Updated May 19, 2026
Jupyter Notebook

halfrost / Halfrost-Field

✍🏻 Source Code Deep Dives, System Design & Engineering Blogs | Halfrost-Field 冰霜之地：源码解析、系统设计与工程实践笔记

Updated Jul 24, 2026
Go

Orchestra-Research / AI-Research-SKILLs

Comprehensive open-source library of AI research and engineering skills for any AI model. Package the skills and your Claude Code, Codex, or Gemini agent becomes a fully equipped AI research agent. Maintained by Orchestra Research.

ai skills gemini codex claude ai-research machine-leanring megatron huggingface gpt-5 vllm grpo claude-code claude-skills

Updated Jun 16, 2026
TeX

LMCache / LMCache

LMCache: Supercharge Your LLM with the Fastest KV Cache Layer

fast amd cuda inference pytorch speed rocm kv-cache llm vllm

Updated Jul 25, 2026
Python

OpenRLHF / OpenRLHF

An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Async RL)

reinforcement-learning raylib transformers proximal-policy-optimization large-language-models reinforcement-learning-from-human-feedback vllm visual-language-models

Updated Jul 14, 2026
Python

xorbitsai / inference

Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.

Updated Jul 25, 2026
Python

ai-dynamo / dynamo

A Datacenter Scale Distributed Inference Serving Framework

kubernetes rust routing-engine omni diffusion vllm llm-inference tensorrt-llm sglang disaggregated-serving

Updated Jul 25, 2026
Rust

kvcache-ai / Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

reinforcement-learning inference rdma disaggregation llm vllm sglang kvcache trt-llm tokenspeed

Updated Jul 25, 2026
C++

kserve / kserve

Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes

Updated Jul 25, 2026
Go

OpenBMB / UltraRAG

A Low-Code MCP Framework for Building Complex and Innovative RAG Pipelines

flask demo ui mcp openai easy gpt embedding vlm multimodal rag sentence-transformers huggingface-transformers llm vllm qwen deepseek

Updated Jul 25, 2026
Python

xlite-dev / Awesome-LLM-Inference

📚 A curated list of awesome LLM/VLM inference papers with code: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc. 🎉

mla vllm llm-inference awesome-llm flash-attention tensorrt-llm paged-attention deepseek flash-attention-3 deepseek-v3 minimax-01 deepseek-r1 flash-mla qwen3

Updated Jun 23, 2026
Python

gpustack / gpustack

A GPU cluster manager for high-performance AI model serving (vLLM, SGLang) and on-demand SSH-accessible GPU instances.

cuda inference openai llama maas rocm ascend llm llm-serving vllm genai llm-inference qwen deepseek sglang distributed-inference high-performance-inference mindie

Updated Jul 24, 2026
Python

katanaml / sparrow

Structured data extraction, instruction calling and agentic workflows with ML, LLM and Vision LLM

computer-vision machinelearning huggingface-transformers documentai llm vllm agentic-ai

Updated Jun 30, 2026
Python

mostlygeek / llama-swap

Reliable model swapping for any local OpenAI- or Anthropic-compatible server: llama.cpp, vllm, etc.

golang openai llama openai-api llamacpp vllm localllm localllama

Updated Jul 25, 2026
Go

vllm-project / semantic-router

Intelligent Mixture-of-Models Router for Efficient Heterogeneous LLMs Inference

kubernetes rust golang mcp fine-tuning pii-detection mixture-of-models huggingface-transformers bert-classification llm prompt-engineering vllm huggingface-candle ai-gateway semantic-router prompt-guard llmrouter openclaw

Updated Jul 25, 2026
Go

skyzh / tiny-llm

A course on LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.

python course serving llm large-language-model vllm qwen qwen2

Updated Jul 25, 2026
Python

PaddlePaddle / FastDeploy

High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

inference openai serving ernie llm llm-serving vllm ernie-45 ernie-45-vl

Updated Jul 21, 2026
Python

lemony-ai / cascadeflow

Cascading runtime for AI agents. Optimize cost, latency, quality, and policy decisions inside the agent loop.

Updated Jul 1, 2026
Python

containers / ramalama

RamaLama is an open-source developer tool that simplifies the local serving of AI models from any source and facilitates their use for inference in production, all through the familiar language of containers.

ai containers cuda intel hip hacktoberfest inference-server podman llm llamacpp vllm

Updated Jul 25, 2026
Python
