PhotonInfer

A High-Performance LLM Inference Engine with vLLM-Style Continuous Batching

English | 中文 | Live Demo

License: MIT | CUDA | C++20


🚀 Performance Highlights

PhotonInfer delivers production-grade LLM inference performance with vLLM-style continuous batching. It currently supports Llama 3.2 and Qwen3 models.

Single Request Inference

Model          PhotonInfer   llama.cpp   Speedup
Llama 3.2 1B   185 tok/s     252 tok/s   0.73× (llama.cpp faster)

TTFT (Time To First Token): 387ms @ 100-token prompt (INT8 quantization)

Batched Inference Throughput

Batch Size   PhotonInfer   llama.cpp   Speedup
4            410 tok/s     252 tok/s   1.63×
8            720 tok/s     255 tok/s   2.82×
16           787 tok/s     253 tok/s   3.07×

Tested on: NVIDIA A100, Llama 3.2 1B, Q8/INT8 quantization

✨ Key Features

🎯 vLLM-Style Continuous Batching

  • Token-level dynamic scheduling: Add new requests mid-generation without waiting for batch completion
  • Two-phase scheduler: Seamlessly continue running requests while admitting new ones
  • Request state tracking: Precise num_computed_tokens management for efficient resume
  • Perfect for production: High-concurrency inference services with real-time responsiveness
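
As a rough illustration of the request bookkeeping behind this (the names and fields below are assumptions, not PhotonInfer's actual API), each request can carry its prompt, its generated tokens, a state, and a num_computed_tokens counter so the scheduler can resume it at token granularity:

// Illustrative sketch only; not PhotonInfer's actual types.
#include <cstddef>
#include <string>
#include <vector>

enum class RequestState { Waiting, Running, Finished, Preempted };

struct Request {
    std::string id;
    std::vector<int> prompt_tokens;      // tokenized prompt
    std::vector<int> output_tokens;      // tokens generated so far
    std::size_t num_computed_tokens = 0; // how far the KV cache has been filled
    RequestState state = RequestState::Waiting;

    // Tokens still needing a forward pass: the prompt remainder during prefill,
    // or the single newest token during decode.
    std::size_t num_pending_tokens() const {
        return prompt_tokens.size() + output_tokens.size() - num_computed_tokens;
    }
};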

⚡ GPU-Optimized Kernels

  • Batched Paged Attention: Block-level KV cache management with efficient memory utilization
  • Vectorized Memory Access: float4 loads for 2–4× better memory-bandwidth efficiency (see the kernel sketch after this list)
  • Fused Operations: Zero-copy GPU sampling, batched RoPE, and fused normalization
  • INT8 Quantization: Group-wise quantization with cuBLASLt INT8 GEMM support
  • Optimized Softmax: CUB BlockReduce for numerically stable attention computation
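
As a concrete example of the float4-style vectorized access mentioned above, here is a simplified CUDA sketch (not one of PhotonInfer's actual kernels): each thread moves four floats per load/store instruction, which improves effective bandwidth on bandwidth-bound elementwise work.

// Simplified, illustrative CUDA kernel using float4-vectorized memory access.
__global__ void scale_f4(const float4* __restrict__ in, float4* __restrict__ out,
                         float alpha, int n_vec4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_vec4) {
        float4 v = in[i];                         // one 16-byte load
        v.x *= alpha; v.y *= alpha; v.z *= alpha; v.w *= alpha;
        out[i] = v;                               // one 16-byte store
    }
}

// Launch example (n must be a multiple of 4, or the tail handled separately):
//   int n_vec4 = n / 4;
//   scale_f4<<<(n_vec4 + 255) / 256, 256>>>(
//       reinterpret_cast<const float4*>(d_in),
//       reinterpret_cast<float4*>(d_out), alpha, n_vec4);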

πŸ—οΈ Modern C++20 Architecture

  • Type-Safe Error Handling: Rust-inspired Result<T, E> type for explicit error propagation (sketched after this list)
  • Zero-Copy Design: Extensive use of std::span and move semantics
  • Device Agnostic: Unified interface for CPU and CUDA backends
  • Concepts & Ranges: Compile-time constraints and expressive type safety
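
A minimal sketch of what a Rust-inspired Result<T, E> can look like in C++20; the type in the codebase may be defined differently, and parse_device_id is a made-up usage example:

// Minimal, illustrative Result<T, E>; not PhotonInfer's actual definition.
#include <string>
#include <utility>
#include <variant>

template <typename T, typename E>
class Result {
public:
    static Result ok(T v)  { return Result{std::variant<T, E>{std::in_place_index<0>, std::move(v)}}; }
    static Result err(E e) { return Result{std::variant<T, E>{std::in_place_index<1>, std::move(e)}}; }
    bool is_ok() const { return value_.index() == 0; }
    const T& value() const { return std::get<0>(value_); }
    const E& error() const { return std::get<1>(value_); }
private:
    explicit Result(std::variant<T, E> v) : value_(std::move(v)) {}
    std::variant<T, E> value_;
};

// Usage: errors propagate explicitly instead of via exceptions.
Result<int, std::string> parse_device_id(const std::string& s) {
    if (s.empty()) return Result<int, std::string>::err("empty device id");
    return Result<int, std::string>::ok(std::stoi(s));
}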

🚀 Quick Start

Prerequisites

  • Compiler: GCC 12+ (C++20 support required)
  • CMake: 3.20+
  • CUDA Toolkit: 12.0+ (tested on 12.5)
  • GPU: NVIDIA GPU with Compute Capability 7.0+

Download Model

Download a pre-quantized model to get started quickly:

https://huggingface.co/Lummy666/llama-3.2-1B-Instruct

Build

Option 1: Build from Source

# Clone repository
git clone https://github.com/lumia431/photon_infer.git
cd photon_infer

# Configure with CUDA
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DPHOTON_BUILD_CUDA=ON ..

# Build
cmake --build . -j$(nproc)

# Install (optional)
sudo cmake --install .

After installation, you can run the web server directly from anywhere:

photon_web_server \
    --port 5728 \
    --model /path/to/llama-3.2-1B-Instruct \
    --tokenizer /path/to/llama-3.2-1B-Instruct/tokenizer.json

The installation will place:

  • photon_web_server → /usr/local/bin/
  • Static web files → /photon_infer/web/static/
  • Core library → /usr/local/lib/

To uninstall:

cd build
sudo cmake --build . --target uninstall

Option 2: Use Docker (Recommended)

# Pull the pre-built Docker image
docker pull lumia431/photon_infer:latest

# Run the container with GPU support
docker run --rm --gpus all -p 5728:5728 -e PORT=5728 lumia431/photon_infer:latest

The web interface will be available at http://localhost:5728

🔬 Technical Details

INT8 Quantization

  • Group-wise quantization: Configurable group size (32, 64, 128)
  • cuBLASLt integration: Hardware-accelerated INT8 GEMM
  • Minimal accuracy loss: < 1% perplexity degradation on Llama models
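
To make the group-wise scheme concrete, here is a hedged sketch of per-group symmetric INT8 quantization; the group layout, rounding mode, and symmetric-scale choice are assumptions rather than the engine's exact implementation:

// Illustrative per-group symmetric INT8 quantization (not the engine's exact code).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

void quantize_groupwise(const std::vector<float>& w, int group_size,
                        std::vector<int8_t>& q, std::vector<float>& scales) {
    q.resize(w.size());
    scales.resize((w.size() + group_size - 1) / group_size);
    for (std::size_t g = 0; g * group_size < w.size(); ++g) {
        std::size_t begin = g * group_size;
        std::size_t end = std::min(begin + static_cast<std::size_t>(group_size), w.size());
        float max_abs = 0.f;
        for (std::size_t i = begin; i < end; ++i)
            max_abs = std::max(max_abs, std::fabs(w[i]));
        float scale = max_abs > 0.f ? max_abs / 127.f : 1.f;  // one scale per group
        scales[g] = scale;
        for (std::size_t i = begin; i < end; ++i)
            q[i] = static_cast<int8_t>(std::lround(std::clamp(w[i] / scale, -127.f, 127.f)));
    }
}
// Dequantization: w_hat[i] = scales[i / group_size] * q[i]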

Paged Attention

  • Block-level KV cache: Efficient memory allocation without fragmentation
  • Dynamic sequence management: Per-sequence cache offsets for flexible scheduling
  • Batched cache operations: Single kernel for multi-sequence K/V writes
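
A simplified sketch of the block-table idea behind paged KV caching; the block size, container choices, and names below are assumptions, not the actual cache layout:

// Illustrative paged KV-cache block table (assumed layout, for explanation only).
#include <cstddef>
#include <stdexcept>
#include <vector>

struct BlockTable {
    static constexpr std::size_t kBlockSize = 16;  // tokens per KV block (assumed)
    std::vector<int> free_blocks;                  // pool of free physical block ids
    std::vector<std::vector<int>> seq_blocks;      // per-sequence logical -> physical map

    // Ensure sequence `seq` has KV capacity for `num_tokens` tokens.
    void reserve(std::size_t seq, std::size_t num_tokens) {
        auto& blocks = seq_blocks.at(seq);
        std::size_t needed = (num_tokens + kBlockSize - 1) / kBlockSize;
        while (blocks.size() < needed) {
            if (free_blocks.empty()) throw std::runtime_error("KV cache exhausted");
            blocks.push_back(free_blocks.back());  // any free block works: no fragmentation
            free_blocks.pop_back();
        }
    }

    // Physical slot for token `pos` of sequence `seq`, as a kernel would compute it.
    std::size_t slot(std::size_t seq, std::size_t pos) const {
        return seq_blocks.at(seq).at(pos / kBlockSize) * kBlockSize + pos % kBlockSize;
    }
};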

Continuous Batching Scheduler

  • Two-phase scheduling:
    1. Phase 1: Continue all RUNNING requests (no interruption)
    2. Phase 2: Admit WAITING requests to fill remaining capacity
  • Request states: WAITING → RUNNING → FINISHED (with PREEMPTED support)
  • Token-level granularity: num_computed_tokens tracking for precise resume
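
A pseudocode-level sketch of one two-phase scheduling step, reusing the illustrative Request type from the Key Features sketch; the per-step token budget and admission policy shown here are assumptions, not the engine's exact scheduler:

// Rough, illustrative two-phase continuous-batching step (names are assumptions).
// Assumes the Request / RequestState types from the earlier sketch.
#include <cstddef>
#include <deque>
#include <vector>

struct Scheduler {
    std::deque<Request*> waiting;
    std::vector<Request*> running;
    std::size_t max_batch_tokens = 2048;  // assumed per-step token budget

    std::vector<Request*> schedule_step() {
        std::vector<Request*> batch;
        std::size_t budget = max_batch_tokens;

        // Phase 1: keep every RUNNING request going (one decode token each).
        for (Request* r : running) {
            if (budget == 0) break;
            batch.push_back(r);
            budget -= 1;
        }

        // Phase 2: admit WAITING requests into whatever budget remains (prefill).
        while (!waiting.empty() && budget >= waiting.front()->num_pending_tokens()) {
            Request* r = waiting.front();
            waiting.pop_front();
            r->state = RequestState::Running;
            budget -= r->num_pending_tokens();
            running.push_back(r);
            batch.push_back(r);
        }
        return batch;
    }
};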

πŸ›£οΈ Roadmap

  • Core Infrastructure: Tensor, operators, memory management
  • LLaMA Model: Full transformer implementation with CPU/GPU kernels
  • INT8 Quantization: Group-wise quantization with cuBLASLt
  • Paged Attention: Block-level KV cache management
  • Continuous Batching: vLLM-style dynamic request scheduling
  • Flash Attention 2: IO-aware attention for long sequences
  • Multi-GPU Support: Tensor parallelism for large models
  • FP16/BF16 Mixed Precision: Enhanced throughput on modern GPUs
  • Speculative Decoding: Multi-token generation with draft model

📖 Documentation

🤝 Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

πŸ“ License

MIT License - see LICENSE for details.

πŸ™ Acknowledgments

  • Architecture inspired by vLLM
  • Kernel optimizations reference llama.cpp
  • Error handling design from Rust's Result<T, E>

Built with ❤️ for high-performance LLM inference
