PhotonInfer

A High-Performance LLM Inference Engine with vLLM-Style Continuous Batching

English | 中文 | Live Demo

License: MIT | CUDA | C++20


🚀 Performance Highlights

PhotonInfer delivers production-grade LLM inference performance with vLLM-style continuous batching. It currently supports Llama 3.2 and Qwen3 models.

Single Request Inference

Model          PhotonInfer   llama.cpp   Speedup
Llama 3.2 1B   185 tok/s     252 tok/s   0.73× (llama.cpp faster)

TTFT (Time To First Token): 387ms @ 100-token prompt (INT8 quantization)

Batched Inference Throughput

Batch Size   PhotonInfer   llama.cpp   Speedup
4            410 tok/s     252 tok/s   1.63×
8            720 tok/s     255 tok/s   2.82×
16           787 tok/s     253 tok/s   3.07×

Tested on: NVIDIA A100, Llama 3.2 1B, Q8/INT8 quantization

✨ Key Features

🎯 vLLM-Style Continuous Batching

  • Token-level dynamic scheduling: Add new requests mid-generation without waiting for batch completion
  • Two-phase scheduler: Seamlessly continue running requests while admitting new ones
  • Request state tracking: Precise num_computed_tokens management for efficient resume
  • Perfect for production: High-concurrency inference services with real-time responsiveness
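
As a rough illustration of the request bookkeeping behind this (the names and fields below are assumptions, not PhotonInfer's actual API), each request can carry its prompt, its generated tokens, a state, and a num_computed_tokens counter so the scheduler can resume it at token granularity:

// Illustrative sketch only; not PhotonInfer's actual types.
#include <cstddef>
#include <string>
#include <vector>

enum class RequestState { Waiting, Running, Finished, Preempted };

struct Request {
    std::string id;
    std::vector<int> prompt_tokens;      // tokenized prompt
    std::vector<int> output_tokens;      // tokens generated so far
    std::size_t num_computed_tokens = 0; // how far the KV cache has been filled
    RequestState state = RequestState::Waiting;

    // Tokens still needing a forward pass: the prompt remainder during prefill,
    // or the single newest token during decode.
    std::size_t num_pending_tokens() const {
        return prompt_tokens.size() + output_tokens.size() - num_computed_tokens;
    }
};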

⚡ GPU-Optimized Kernels

  • Batched Paged Attention: Block-level KV cache management with efficient memory utilization
  • Vectorized Memory Access: float4 loads for 2–4× better memory-bandwidth efficiency (see the kernel sketch after this list)
  • Fused Operations: Zero-copy GPU sampling, batched RoPE, and fused normalization
  • INT8 Quantization: Group-wise quantization with cuBLASLt INT8 GEMM support
  • Optimized Softmax: CUB BlockReduce for numerically stable attention computation
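
As a concrete example of the float4-style vectorized access mentioned above, here is a simplified CUDA sketch (not one of PhotonInfer's actual kernels): each thread moves four floats per load/store instruction, which improves effective bandwidth on bandwidth-bound elementwise work.

// Simplified, illustrative CUDA kernel using float4-vectorized memory access.
__global__ void scale_f4(const float4* __restrict__ in, float4* __restrict__ out,
                         float alpha, int n_vec4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_vec4) {
        float4 v = in[i];                         // one 16-byte load
        v.x *= alpha; v.y *= alpha; v.z *= alpha; v.w *= alpha;
        out[i] = v;                               // one 16-byte store
    }
}

// Launch example (n must be a multiple of 4, or the tail handled separately):
//   int n_vec4 = n / 4;
//   scale_f4<<<(n_vec4 + 255) / 256, 256>>>(
//       reinterpret_cast<const float4*>(d_in),
//       reinterpret_cast<float4*>(d_out), alpha, n_vec4);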

πŸ—οΈ Modern C++20 Architecture

  • Type-Safe Error Handling: Rust-inspired Result<T, E> type for explicit error propagation (sketched after this list)
  • Zero-Copy Design: Extensive use of std::span and move semantics
  • Device Agnostic: Unified interface for CPU and CUDA backends
  • Concepts & Ranges: Compile-time constraints and expressive type safety
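
A minimal sketch of what a Rust-inspired Result<T, E> can look like in C++20; the type in the codebase may be defined differently, and parse_device_id is a made-up usage example:

// Minimal, illustrative Result<T, E>; not PhotonInfer's actual definition.
#include <string>
#include <utility>
#include <variant>

template <typename T, typename E>
class Result {
public:
    static Result ok(T v)  { return Result{std::variant<T, E>{std::in_place_index<0>, std::move(v)}}; }
    static Result err(E e) { return Result{std::variant<T, E>{std::in_place_index<1>, std::move(e)}}; }
    bool is_ok() const { return value_.index() == 0; }
    const T& value() const { return std::get<0>(value_); }
    const E& error() const { return std::get<1>(value_); }
private:
    explicit Result(std::variant<T, E> v) : value_(std::move(v)) {}
    std::variant<T, E> value_;
};

// Usage: errors propagate explicitly instead of via exceptions.
Result<int, std::string> parse_device_id(const std::string& s) {
    if (s.empty()) return Result<int, std::string>::err("empty device id");
    return Result<int, std::string>::ok(std::stoi(s));
}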

🚀 Quick Start

Prerequisites

  • Compiler: GCC 12+ (C++20 support required)
  • CMake: 3.20+
  • CUDA Toolkit: 12.0+ (tested on 12.5)
  • GPU: NVIDIA GPU with Compute Capability 7.0+

Download Model

Download a pre-quantized model to get started quickly:

https://huggingface.co/Lummy666/llama-3.2-1B-Instruct

Build

Option 1: Build from Source

# Clone repository
git clone https://github.com/lumia431/photon_infer.git
cd photon_infer

# Configure with CUDA
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DPHOTON_BUILD_CUDA=ON ..

# Build
cmake --build . -j$(nproc)

# Install (optional)
sudo cmake --install .

After installation, you can run the web server directly from anywhere:

photon_web_server \
    --port 5728 \
    --model /path/to/llama-3.2-1B-Instruct \
    --tokenizer /path/to/llama-3.2-1B-Instruct/tokenizer.json

The installation will place:

  • photon_web_server → /usr/local/bin/
  • Static web files → /photon_infer/web/static/
  • Core library → /usr/local/lib/

To uninstall:

cd build
sudo cmake --build . --target uninstall

Option 2: Use Docker (Recommended)

# Pull the pre-built Docker image
docker pull lumia431/photon_infer:latest

# Run the container with GPU support
docker run --rm --gpus all -p 5728:5728 -e PORT=5728 lumia431/photon_infer:latest

The web interface will be available at http://localhost:5728

🔬 Technical Details

INT8 Quantization

  • Group-wise quantization: Configurable group size (32, 64, 128)
  • cuBLASLt integration: Hardware-accelerated INT8 GEMM
  • Minimal accuracy loss: < 1% perplexity degradation on Llama models
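
To make the group-wise scheme concrete, here is a hedged sketch of per-group symmetric INT8 quantization; the group layout, rounding mode, and symmetric-scale choice are assumptions rather than the engine's exact implementation:

// Illustrative per-group symmetric INT8 quantization (not the engine's exact code).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

void quantize_groupwise(const std::vector<float>& w, int group_size,
                        std::vector<int8_t>& q, std::vector<float>& scales) {
    q.resize(w.size());
    scales.resize((w.size() + group_size - 1) / group_size);
    for (std::size_t g = 0; g * group_size < w.size(); ++g) {
        std::size_t begin = g * group_size;
        std::size_t end = std::min(begin + static_cast<std::size_t>(group_size), w.size());
        float max_abs = 0.f;
        for (std::size_t i = begin; i < end; ++i)
            max_abs = std::max(max_abs, std::fabs(w[i]));
        float scale = max_abs > 0.f ? max_abs / 127.f : 1.f;  // one scale per group
        scales[g] = scale;
        for (std::size_t i = begin; i < end; ++i)
            q[i] = static_cast<int8_t>(std::lround(std::clamp(w[i] / scale, -127.f, 127.f)));
    }
}
// Dequantization: w_hat[i] = scales[i / group_size] * q[i]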

Paged Attention

  • Block-level KV cache: Efficient memory allocation without fragmentation
  • Dynamic sequence management: Per-sequence cache offsets for flexible scheduling
  • Batched cache operations: Single kernel for multi-sequence K/V writes
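
A simplified sketch of the block-table idea behind paged KV caching; the block size, container choices, and names below are assumptions, not the actual cache layout:

// Illustrative paged KV-cache block table (assumed layout, for explanation only).
#include <cstddef>
#include <stdexcept>
#include <vector>

struct BlockTable {
    static constexpr std::size_t kBlockSize = 16;  // tokens per KV block (assumed)
    std::vector<int> free_blocks;                  // pool of free physical block ids
    std::vector<std::vector<int>> seq_blocks;      // per-sequence logical -> physical map

    // Ensure sequence `seq` has KV capacity for `num_tokens` tokens.
    void reserve(std::size_t seq, std::size_t num_tokens) {
        auto& blocks = seq_blocks.at(seq);
        std::size_t needed = (num_tokens + kBlockSize - 1) / kBlockSize;
        while (blocks.size() < needed) {
            if (free_blocks.empty()) throw std::runtime_error("KV cache exhausted");
            blocks.push_back(free_blocks.back());  // any free block works: no fragmentation
            free_blocks.pop_back();
        }
    }

    // Physical slot for token `pos` of sequence `seq`, as a kernel would compute it.
    std::size_t slot(std::size_t seq, std::size_t pos) const {
        return seq_blocks.at(seq).at(pos / kBlockSize) * kBlockSize + pos % kBlockSize;
    }
};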

Continuous Batching Scheduler

  • Two-phase scheduling:
    1. Phase 1: Continue all RUNNING requests (no interruption)
    2. Phase 2: Admit WAITING requests to fill remaining capacity
  • Request states: WAITING → RUNNING → FINISHED (with PREEMPTED support)
  • Token-level granularity: num_computed_tokens tracking for precise resume
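
A pseudocode-level sketch of one two-phase scheduling step, reusing the illustrative Request type from the Key Features sketch; the per-step token budget and admission policy shown here are assumptions, not the engine's exact scheduler:

// Rough, illustrative two-phase continuous-batching step (names are assumptions).
// Assumes the Request / RequestState types from the earlier sketch.
#include <cstddef>
#include <deque>
#include <vector>

struct Scheduler {
    std::deque<Request*> waiting;
    std::vector<Request*> running;
    std::size_t max_batch_tokens = 2048;  // assumed per-step token budget

    std::vector<Request*> schedule_step() {
        std::vector<Request*> batch;
        std::size_t budget = max_batch_tokens;

        // Phase 1: keep every RUNNING request going (one decode token each).
        for (Request* r : running) {
            if (budget == 0) break;
            batch.push_back(r);
            budget -= 1;
        }

        // Phase 2: admit WAITING requests into whatever budget remains (prefill).
        while (!waiting.empty() && budget >= waiting.front()->num_pending_tokens()) {
            Request* r = waiting.front();
            waiting.pop_front();
            r->state = RequestState::Running;
            budget -= r->num_pending_tokens();
            running.push_back(r);
            batch.push_back(r);
        }
        return batch;
    }
};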

πŸ›£οΈ Roadmap

  • Core Infrastructure: Tensor, operators, memory management
  • LLaMA Model: Full transformer implementation with CPU/GPU kernels
  • INT8 Quantization: Group-wise quantization with cuBLASLt
  • Paged Attention: Block-level KV cache management
  • Continuous Batching: vLLM-style dynamic request scheduling
  • Flash Attention 2: IO-aware attention for long sequences
  • Multi-GPU Support: Tensor parallelism for large models
  • FP16/BF16 Mixed Precision: Enhanced throughput on modern GPUs
  • Speculative Decoding: Multi-token generation with draft model

📖 Documentation

🤝 Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

πŸ“ License

MIT License - see LICENSE for details.

πŸ™ Acknowledgments

  • Architecture inspired by vLLM
  • Kernel optimizations reference llama.cpp
  • Error handling design from Rust's Result<T, E>

Built with ❤️ for high-performance LLM inference
