YAIE (Yet Another Inference Engine) is an educational project designed to help students and developers understand how modern LLM inference engines work. The implementation is inspired by state-of-the-art systems such as SGLang, vLLM, and FlashInfer, and focuses on continuous batching, radix attention, and FlashInfer-style optimizations.
Modern LLM inference engines like SGLang, vLLM, and TensorRT-LLM implement sophisticated techniques to maximize throughput and minimize latency. YAIE demonstrates these concepts through a simplified but educational implementation that focuses on:
- Continuous Batching: Dynamically batching incoming requests to maximize GPU utilization
- Radix Attention: Efficient attention mechanism with prefix sharing and a paged KV-cache
- OpenAI Compatibility: Server mode provides an OpenAI-compatible API
- Modular Design: Clean architecture separating concerns for easy learning
- Two Operation Modes:
  - Server mode (`yaie serve`) with an OpenAI-compatible API
  - CLI chat mode (`yaie chat`) for interactive conversations
- HuggingFace Integration: Automatic model downloading and caching
- Continuous Batching: Efficient request scheduling for better throughput
- Paged KV-Cache: Memory-efficient key-value cache management
- Radix Attention: Prefix sharing for similar requests
- Educational Focus: Clear, well-documented code with learning resources
The engine follows a modular architecture:
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Layer     │    │  Engine Core    │    │  Model/Kernels  │
│   (FastAPI)     │◄──►│  (Scheduler,    │◄──►│  (PyTorch/      │
│                 │    │   Attention)    │    │   CUDA)         │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         ▲                      ▲                      ▲
         │                      │                      │
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   CLI Layer     │    │  Model Loading  │    │  Memory Mgmt    │
│  (yaie serve/   │    │  (HuggingFace   │    │  (Paged Cache)  │
│   yaie chat)    │    │   Integration)  │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
- CLI Interface: Entry point for both server and chat modes
- API Server: FastAPI-based server with OpenAI-compatible endpoints
- Inference Engine: Core processing logic with scheduler and attention
- Scheduler: Continuous batching with request management (a minimal loop sketch follows this list)
- Radix Attention: Efficient attention with prefix sharing
- Model Loader: HuggingFace model and tokenizer management
- KV-Cache Manager: Paged cache for efficient memory usage
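To make the scheduler's role concrete, here is a minimal sketch of a continuous-batching loop. The names (`Request`, `Scheduler`, `forward_fn`) are illustrative, not the actual classes in YAIE's source: requests are admitted whenever the batch has room, every running request advances by one token per step, and finished requests are retired without stalling the others.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """One generation request as the scheduler sees it (illustrative)."""
    prompt_tokens: list[int]
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)


class Scheduler:
    """Minimal continuous-batching loop: admit new requests each step,
    advance the whole batch by one token, and retire finished requests."""

    def __init__(self, max_batch_size: int = 8):
        self.waiting: deque[Request] = deque()
        self.running: list[Request] = []
        self.max_batch_size = max_batch_size

    def add_request(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self, forward_fn) -> None:
        # Admit waiting requests while the batch has free slots.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        if not self.running:
            return
        # One forward pass yields the next token for every running request.
        next_tokens = forward_fn(self.running)
        for req, token in zip(self.running, next_tokens):
            req.generated.append(token)
        # Finished requests leave the batch; the rest continue next step.
        self.running = [r for r in self.running if len(r.generated) < r.max_new_tokens]
```

A real engine additionally distinguishes prefill from decode steps, tracks per-request KV-cache blocks, and honors stop tokens, but the admit/advance/retire cycle above is the core of continuous batching.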
To install YAIE:

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/YAIE.git
  cd YAIE
  ```

- Create a virtual environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install the package:

  ```bash
  pip install -e .
  ```
Start the server with a specific model:

```bash
yaie serve microsoft/DialoGPT-medium --host localhost --port 8000
```

The server will:
- Check for the model in local HuggingFace cache
- Download if not present
- Start an OpenAI-compatible API server
API endpoints:

- `POST /v1/chat/completions` - Chat completions
- `GET /v1/models` - List available models
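Because the request and response follow the standard OpenAI chat-completions shape, you can exercise the server with any OpenAI-compatible client. A minimal sketch using plain `requests`, assuming the server started above is listening on localhost:8000:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "microsoft/DialoGPT-medium",
        "messages": [{"role": "user", "content": "Hello! How are you?"}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
# Standard OpenAI-style response: the reply text lives under choices[0].message.content
print(resp.json()["choices"][0]["message"]["content"])
```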
Start an interactive chat session:

```bash
yaie chat microsoft/DialoGPT-medium
```

The chat will:
- Check for the model in local HuggingFace cache
- Download if not present
- Start an interactive chat session
To build the custom CUDA kernels for optimized performance:
```bash
# Using the build script
./build_kernels.sh

# Or using make
make build-kernels

# Or directly with Python
python setup_kernels.py build_ext --inplace
```

Note: Kernel building requires:
- CUDA toolkit installed
- PyTorch with CUDA support
- Compatible GPU with compute capability >= 6.0
If CUDA is not available, the engine will run in CPU-only mode.
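A common way to structure that fallback is to attempt to import the compiled extension and route each op through a pure-PyTorch path when it is unavailable. The extension name and entry point below (`yaie_kernels`, `yaie_kernels.rms_norm`) are assumptions for illustration, not the project's actual symbols:

```python
import torch

try:
    # Hypothetical compiled extension produced by setup_kernels.py (name assumed).
    import yaie_kernels
    _HAS_CUDA_KERNELS = torch.cuda.is_available()
except ImportError:
    _HAS_CUDA_KERNELS = False


def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Use the custom CUDA kernel when present, otherwise a pure-PyTorch fallback."""
    if _HAS_CUDA_KERNELS and x.is_cuda:
        return yaie_kernels.rms_norm(x, weight, eps)  # assumed entry point
    # CPU / pure-PyTorch reference: normalize by the root-mean-square, then scale.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight
```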
By implementing the kernels and components in this project, you will learn:
- Continuous Batching Concepts:
  - Problem: Traditional batching requires all requests in a batch to have the same length
  - Solution: Dynamically batch requests and handle them at different stages of generation
- Paged KV-Cache Management (sketched below):
  - Problem: KV-cache memory fragmentation with variable-length requests
  - Solution: Use paged memory management similar to OS virtual memory
- Radix Attention & Prefix Sharing (sketched below):
  - Problem: Redundant computation for requests with similar prefixes
  - Solution: Share computed attention across requests with common prefixes (SGLang-style)
- FlashInfer-Style Optimizations:
  - Problem: Inefficient memory access patterns during attention computation
  - Solution: Optimized attention kernels for both prefill and decode phases
- CUDA Kernel Programming:
  - Efficient GPU memory access patterns
  - Parallel computation for attention mechanisms
  - Memory bandwidth optimization
- System Performance Optimization:
  - Latency vs. throughput trade-offs
  - Memory management strategies
  - Batch size optimization
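To make the paged KV-cache idea concrete, here is a minimal, hypothetical block allocator: the cache is split into fixed-size blocks, each request owns a block table mapping its tokens to physical blocks, and blocks return to a free pool when the request finishes. Names and sizes are illustrative:

```python
class BlockAllocator:
    """Minimal paged KV-cache bookkeeping: fixed-size blocks, per-request block tables."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical block ids

    def allocate(self, request_id: str, num_tokens: int) -> list[int]:
        """Reserve enough blocks to hold num_tokens KV entries for one request."""
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free_blocks):
            raise RuntimeError("KV-cache is full; the request must wait or be preempted")
        blocks = [self.free_blocks.pop() for _ in range(needed)]
        self.block_tables[request_id] = blocks
        return blocks

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
```

Because any free block can serve any request, variable-length sequences no longer fragment one large contiguous cache.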
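And a toy illustration of prefix sharing: if a new request starts with the same tokens as an earlier one (for example a shared system prompt), the KV entries for that prefix can be reused instead of recomputed during prefill. A real engine uses a radix tree as in SGLang; this linear scan only demonstrates the matching idea, and the names are made up for the example:

```python
class PrefixCache:
    """Toy longest-prefix matcher over token-id sequences (a stand-in for a radix tree)."""

    def __init__(self):
        self.cached_prefixes: list[tuple[int, ...]] = []

    def insert(self, tokens: list[int]) -> None:
        self.cached_prefixes.append(tuple(tokens))

    def longest_match(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already cached and can skip prefill."""
        best = 0
        for prefix in self.cached_prefixes:
            matched = 0
            for a, b in zip(prefix, tokens):
                if a != b:
                    break
                matched += 1
            best = max(best, matched)
        return best


# Example: the second request can reuse the KV entries of the shared prefix.
cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])             # tokens of "system prompt + question A"
print(cache.longest_match([1, 2, 3, 9]))  # -> 3 tokens of prefill can be skipped
```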
This project provides a detailed guide for implementing the various kernels and components:
- Implementation Guide: Complete documentation of all kernels that need to be implemented
- Attention Kernels:
  - FlashAttention forward and backward
  - Paged attention
  - RoPE (Rotary Position Embedding)
- Normalization Kernels:
  - RMS normalization
- Activation Kernels:
  - SiLU and multiplication fusion (a reference sketch follows this list)
- Memory Management:
  - KV-cache management with paging
  - Block allocation and deallocation
- CPU Fallbacks:
  - CPU implementations for when a GPU is not available
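As an example of what one of the CPU fallbacks might look like, here is a pure-PyTorch reference for the fused SiLU-and-multiply activation; the exact signature and tensor layout in the project may differ:

```python
import torch


def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    """Reference for the fused SiLU * gate activation used in LLaMA-style MLPs.

    Expects the last dimension to hold the [gate, up] halves concatenated,
    i.e. shape [..., 2 * d]; returns silu(gate) * up with shape [..., d].
    """
    gate, up = x.chunk(2, dim=-1)
    return torch.nn.functional.silu(gate) * up
```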
This is an educational project, and contributions that improve the learning experience are welcome:
- Add more detailed comments explaining complex concepts
- Create additional examples or tutorials
- Improve documentation and explanations
- Add compatibility for more models
This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by SGLang, vLLM, and other efficient inference engines
- Built on top of HuggingFace Transformers library
- Educational reference implementation for learning purposes
We would like to acknowledge the significant contributions of several state-of-the-art inference engines that inspired this educational project:
- vLLM: For pioneering the concept of PagedAttention and efficient memory management in LLM serving
- SGLang: For introducing RadixAttention and highly optimized prompt processing techniques
- TensorRT-LLM: For demonstrating the power of optimized inference through NVIDIA's TensorRT technology
- LightLLM: For showing how to implement efficient inference with various optimization techniques
These projects have advanced the field of LLM inference significantly, and this educational engine draws concepts and inspiration from their innovative approaches to continuous batching, attention optimization, and memory management.
YAIE comes with a comprehensive educational guide (built with mdbook) that covers:
- Core Concepts (LLM Inference, Continuous Batching, Radix Attention)
- System Architecture Deep Dive
- Step-by-step Implementation Guides for Python and CUDA kernels
To serve the documentation locally:
```bash
# Install mdbook (if not already installed)
cargo install mdbook

# Serve the docs
make docs-serve

# Or manually
mdbook serve docs
```

Open http://localhost:3000 in your browser.
