YAIE (Yet Another Inference Engine) is an educational project designed to help students and developers understand how modern LLM inference engines work. The implementation is inspired by state-of-the-art systems such as SGLang, vLLM, and FlashInfer, and focuses on continuous batching, radix attention, and FlashInfer-style optimizations.
Modern LLM inference engines like SGLang, vLLM, and TensorRT-LLM implement sophisticated techniques to maximize throughput and minimize latency. YAIE demonstrates these concepts through a simplified but educational implementation that focuses on:
- Continuous Batching: Dynamically batching incoming requests to maximize GPU utilization
- Radix Attention: Efficient attention mechanism with prefix sharing and a paged KV-cache
- OpenAI Compatibility: Server mode provides an OpenAI-compatible API
- Modular Design: Clean architecture separating concerns for easy learning
- Two Operation Modes:
  - Server mode (`yaie serve`) with an OpenAI-compatible API
  - CLI chat mode (`yaie chat`) for interactive conversations
- HuggingFace Integration: Automatic model downloading and caching
- Continuous Batching: Efficient request scheduling for better throughput
- Paged KV-Cache: Memory-efficient key-value cache management
- Radix Attention: Prefix sharing for similar requests
- Educational Focus: Clear, well-documented code with learning resources
The engine follows a modular architecture:
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Layer     │    │  Engine Core    │    │  Model/Kernels  │
│   (FastAPI)     │◄──►│  (Scheduler,    │◄──►│  (PyTorch/      │
│                 │    │   Attention)    │    │   CUDA)         │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         ▲                      ▲                      ▲
         │                      │                      │
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   CLI Layer     │    │  Model Loading  │    │  Memory Mgmt    │
│  (yaie serve/   │    │  (HuggingFace   │    │  (Paged Cache)  │
│   yaie chat)    │    │   Integration)  │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
- CLI Interface: Entry point for both server and chat modes
- API Server: FastAPI-based server with OpenAI-compatible endpoints
- Inference Engine: Core processing logic with scheduler and attention
- Scheduler: Continuous batching with request management (a minimal loop sketch follows this list)
- Radix Attention: Efficient attention with prefix sharing
- Model Loader: HuggingFace model and tokenizer management
- KV-Cache Manager: Paged cache for efficient memory usage
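To make the scheduler's role concrete, here is a minimal sketch of a continuous-batching loop. The names (`Request`, `Scheduler`, `forward_fn`) are illustrative, not the actual classes in YAIE's source: requests are admitted whenever the batch has room, every running request advances by one token per step, and finished requests are retired without stalling the others.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """One generation request as the scheduler sees it (illustrative)."""
    prompt_tokens: list[int]
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)


class Scheduler:
    """Minimal continuous-batching loop: admit new requests each step,
    advance the whole batch by one token, and retire finished requests."""

    def __init__(self, max_batch_size: int = 8):
        self.waiting: deque[Request] = deque()
        self.running: list[Request] = []
        self.max_batch_size = max_batch_size

    def add_request(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self, forward_fn) -> None:
        # Admit waiting requests while the batch has free slots.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        if not self.running:
            return
        # One forward pass yields the next token for every running request.
        next_tokens = forward_fn(self.running)
        for req, token in zip(self.running, next_tokens):
            req.generated.append(token)
        # Finished requests leave the batch; the rest continue next step.
        self.running = [r for r in self.running if len(r.generated) < r.max_new_tokens]
```

A real engine additionally distinguishes prefill from decode steps, tracks per-request KV-cache blocks, and honors stop tokens, but the admit/advance/retire cycle above is the core of continuous batching.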
To install YAIE:

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/YAIE.git
  cd YAIE
  ```

- Create a virtual environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install the package:

  ```bash
  pip install -e .
  ```
Start the server with a specific model:

```bash
yaie serve microsoft/DialoGPT-medium --host localhost --port 8000
```

The server will:
- Check for the model in local HuggingFace cache
- Download if not present
- Start an OpenAI-compatible API server
API endpoints:

- `POST /v1/chat/completions` - Chat completions
- `GET /v1/models` - List available models
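Because the request and response follow the standard OpenAI chat-completions shape, you can exercise the server with any OpenAI-compatible client. A minimal sketch using plain `requests`, assuming the server started above is listening on localhost:8000:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "microsoft/DialoGPT-medium",
        "messages": [{"role": "user", "content": "Hello! How are you?"}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
# Standard OpenAI-style response: the reply text lives under choices[0].message.content
print(resp.json()["choices"][0]["message"]["content"])
```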
Start an interactive chat session:

```bash
yaie chat microsoft/DialoGPT-medium
```

The chat will:
- Check for the model in local HuggingFace cache
- Download if not present
- Start an interactive chat session
To build the custom CUDA kernels for optimized performance:
```bash
# Using the build script
./build_kernels.sh

# Or using make
make build-kernels

# Or directly with Python
python setup_kernels.py build_ext --inplace
```

Note: Kernel building requires:
- CUDA toolkit installed
- PyTorch with CUDA support
- Compatible GPU with compute capability >= 6.0
If CUDA is not available, the engine will run in CPU-only mode.
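A common way to structure that fallback is to attempt to import the compiled extension and route each op through a pure-PyTorch path when it is unavailable. The extension name and entry point below (`yaie_kernels`, `yaie_kernels.rms_norm`) are assumptions for illustration, not the project's actual symbols:

```python
import torch

try:
    # Hypothetical compiled extension produced by setup_kernels.py (name assumed).
    import yaie_kernels
    _HAS_CUDA_KERNELS = torch.cuda.is_available()
except ImportError:
    _HAS_CUDA_KERNELS = False


def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Use the custom CUDA kernel when present, otherwise a pure-PyTorch fallback."""
    if _HAS_CUDA_KERNELS and x.is_cuda:
        return yaie_kernels.rms_norm(x, weight, eps)  # assumed entry point
    # CPU / pure-PyTorch reference: normalize by the root-mean-square, then scale.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight
```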
By implementing the kernels and components in this project, you will learn:
- Continuous Batching Concepts:
  - Problem: Traditional batching requires all requests in a batch to have the same length
  - Solution: Dynamically batch requests and handle them at different stages of generation
- Paged KV-Cache Management (sketched below):
  - Problem: KV-cache memory fragmentation with variable-length requests
  - Solution: Use paged memory management similar to OS virtual memory
- Radix Attention & Prefix Sharing (sketched below):
  - Problem: Redundant computation for requests with similar prefixes
  - Solution: Share computed attention across requests with common prefixes (SGLang-style)
- FlashInfer-Style Optimizations:
  - Problem: Inefficient memory access patterns during attention computation
  - Solution: Optimized attention kernels for both prefill and decode phases
- CUDA Kernel Programming:
  - Efficient GPU memory access patterns
  - Parallel computation for attention mechanisms
  - Memory bandwidth optimization
- System Performance Optimization:
  - Latency vs. throughput trade-offs
  - Memory management strategies
  - Batch size optimization
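To make the paged KV-cache idea concrete, here is a minimal, hypothetical block allocator: the cache is split into fixed-size blocks, each request owns a block table mapping its tokens to physical blocks, and blocks return to a free pool when the request finishes. Names and sizes are illustrative:

```python
class BlockAllocator:
    """Minimal paged KV-cache bookkeeping: fixed-size blocks, per-request block tables."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical block ids

    def allocate(self, request_id: str, num_tokens: int) -> list[int]:
        """Reserve enough blocks to hold num_tokens KV entries for one request."""
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free_blocks):
            raise RuntimeError("KV-cache is full; the request must wait or be preempted")
        blocks = [self.free_blocks.pop() for _ in range(needed)]
        self.block_tables[request_id] = blocks
        return blocks

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
```

Because any free block can serve any request, variable-length sequences no longer fragment one large contiguous cache.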
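And a toy illustration of prefix sharing: if a new request starts with the same tokens as an earlier one (for example a shared system prompt), the KV entries for that prefix can be reused instead of recomputed during prefill. A real engine uses a radix tree as in SGLang; this linear scan only demonstrates the matching idea, and the names are made up for the example:

```python
class PrefixCache:
    """Toy longest-prefix matcher over token-id sequences (a stand-in for a radix tree)."""

    def __init__(self):
        self.cached_prefixes: list[tuple[int, ...]] = []

    def insert(self, tokens: list[int]) -> None:
        self.cached_prefixes.append(tuple(tokens))

    def longest_match(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already cached and can skip prefill."""
        best = 0
        for prefix in self.cached_prefixes:
            matched = 0
            for a, b in zip(prefix, tokens):
                if a != b:
                    break
                matched += 1
            best = max(best, matched)
        return best


# Example: the second request can reuse the KV entries of the shared prefix.
cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])             # tokens of "system prompt + question A"
print(cache.longest_match([1, 2, 3, 9]))  # -> 3 tokens of prefill can be skipped
```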
This project provides a detailed guide for implementing the various kernels and components:
- Implementation Guide: Complete documentation of all kernels that need to be implemented
- Attention Kernels:
  - FlashAttention forward and backward
  - Paged attention
  - RoPE (Rotary Position Embedding)
- Normalization Kernels:
  - RMS normalization
- Activation Kernels:
  - SiLU and multiplication fusion (a reference sketch follows this list)
- Memory Management:
  - KV-cache management with paging
  - Block allocation and deallocation
- CPU Fallbacks:
  - CPU implementations for when a GPU is not available
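As an example of what one of the CPU fallbacks might look like, here is a pure-PyTorch reference for the fused SiLU-and-multiply activation; the exact signature and tensor layout in the project may differ:

```python
import torch


def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    """Reference for the fused SiLU * gate activation used in LLaMA-style MLPs.

    Expects the last dimension to hold the [gate, up] halves concatenated,
    i.e. shape [..., 2 * d]; returns silu(gate) * up with shape [..., d].
    """
    gate, up = x.chunk(2, dim=-1)
    return torch.nn.functional.silu(gate) * up
```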
This is an educational project, and contributions that improve the learning experience are welcome:
- Add more detailed comments explaining complex concepts
- Create additional examples or tutorials
- Improve documentation and explanations
- Add compatibility for more models
This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by SGLang, vLLM, and other efficient inference engines
- Built on top of HuggingFace Transformers library
- Educational reference implementation for learning purposes
We would like to acknowledge the significant contributions of several state-of-the-art inference engines that inspired this educational project:
- vLLM: For pioneering the concept of PagedAttention and efficient memory management in LLM serving
- SGLang: For introducing RadixAttention and highly optimized prompt processing techniques
- TensorRT-LLM: For demonstrating the power of optimized inference through NVIDIA's TensorRT technology
- LightLLM: For showing how to implement efficient inference with various optimization techniques
These projects have advanced the field of LLM inference significantly, and this educational engine draws concepts and inspiration from their innovative approaches to continuous batching, attention optimization, and memory management.
YAIE comes with a comprehensive educational guide (built with mdbook) that covers:
- Core Concepts (LLM Inference, Continuous Batching, Radix Attention)
- System Architecture Deep Dive
- Step-by-step Implementation Guides for Python and CUDA kernels
To serve the documentation locally:
```bash
# Install mdbook (if not already installed)
cargo install mdbook

# Serve the docs
make docs-serve

# Or manually
mdbook serve docs
```

Open http://localhost:3000 in your browser.
