A High-Performance LLM Inference Engine with vLLM-Style Continuous Batching
PhotonInfer delivers production-grade inference performance for LLMs with advanced batching capabilities. Supports Llama-3.2 and Qwen3 models.
Single-request decoding throughput:

| Model | PhotonInfer | llama.cpp | Speedup |
|---|---|---|---|
| Llama 3.2 1B | 185 tok/s | 252 tok/s | 0.73× (llama.cpp faster) |
TTFT (Time To First Token): 387ms @ 100-token prompt (INT8 quantization)
Continuous-batching throughput:

| Batch Size | PhotonInfer | llama.cpp | Speedup |
|---|---|---|---|
| 4 | 410 tok/s | 252 tok/s | 1.63× |
| 8 | 720 tok/s | 255 tok/s | 2.82× |
| 16 | 787 tok/s | 253 tok/s | 3.07× |
Tested on: NVIDIA A100, Llama 3.2 1B, Q8/INT8 quantization
- Token-level dynamic scheduling: Add new requests mid-generation without waiting for batch completion
- Two-phase scheduler: Seamlessly continue running requests while admitting new ones
- Request state tracking: Precise `num_computed_tokens` management for efficient resume
- Perfect for production: High-concurrency inference services with real-time responsiveness
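As a rough illustration of the bookkeeping behind token-level scheduling, the sketch below tracks per-request state with a `num_computed_tokens` counter. The type and field names are assumptions made for illustration, not PhotonInfer's actual API.

```cpp
// Illustrative sketch only: names and fields are assumptions, not the real API.
#include <cstddef>
#include <string>
#include <vector>

enum class RequestState { WAITING, RUNNING, PREEMPTED, FINISHED };

struct Request {
    std::string id;
    std::vector<int> token_ids;           // prompt + tokens generated so far
    std::size_t num_computed_tokens = 0;  // tokens whose KV-cache entries already exist
    RequestState state = RequestState::WAITING;

    // Work left for the next forward pass: the remaining prompt on admission
    // (or after preemption), exactly one token once the request is decoding.
    std::size_t num_new_tokens() const {
        return token_ids.size() - num_computed_tokens;
    }
};
```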
- Batched Paged Attention: Block-level KV cache management with efficient memory utilization
- Vectorized Memory Access: `float4` loads for 2-4× bandwidth efficiency
- Fused Operations: Zero-copy GPU sampling, batched RoPE, and fused normalization
- INT8 Quantization: Group-wise quantization with cuBLASLt INT8 GEMM support
- Optimized Softmax: CUB BlockReduce for numerically stable attention computation
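To make the vectorized-memory-access bullet concrete, here is a standalone CUDA sketch of the general `float4` technique: each thread issues one 128-bit load and store instead of four scalar ones. It is a generic example, not PhotonInfer's actual kernel, and assumes 16-byte-aligned buffers whose length is a multiple of four.

```cuda
// Generic float4-vectorized element-wise scale; illustrative only.
#include <cuda_runtime.h>

__global__ void scale_vec4(const float4* __restrict__ in,
                           float4* __restrict__ out,
                           float alpha, int n_vec4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_vec4) {
        // One 128-bit load/store moves four floats per thread,
        // improving effective bandwidth versus scalar accesses.
        float4 v = in[i];
        v.x *= alpha; v.y *= alpha; v.z *= alpha; v.w *= alpha;
        out[i] = v;
    }
}

// Launch helper: n must be a multiple of 4 and pointers 16-byte aligned.
void launch_scale(const float* in, float* out, float alpha, int n, cudaStream_t s) {
    int n_vec4 = n / 4;
    int block = 256;
    int grid = (n_vec4 + block - 1) / block;
    scale_vec4<<<grid, block, 0, s>>>(reinterpret_cast<const float4*>(in),
                                      reinterpret_cast<float4*>(out),
                                      alpha, n_vec4);
}
```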
- Type-Safe Error Handling: Rust-inspired `Result<T, E>` type for explicit error propagation
- Zero-Copy Design: Extensive use of `std::span` and move semantics
- Device Agnostic: Unified interface for CPU and CUDA backends
- Concepts & Ranges: Compile-time constraints and expressive type safety
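For readers unfamiliar with the pattern, a minimal `Result<T, E>` can be built on `std::variant`, as sketched below. This is illustrative; PhotonInfer's actual type may expose a different interface (e.g. monadic helpers).

```cpp
// Minimal Rust-style Result<T, E> sketch; not PhotonInfer's actual type.
#include <cstddef>
#include <string>
#include <utility>
#include <variant>

template <typename T, typename E>
class Result {
public:
    static Result Ok(T value)  { return Result(std::in_place_index<0>, std::move(value)); }
    static Result Err(E error) { return Result(std::in_place_index<1>, std::move(error)); }

    bool is_ok() const { return value_.index() == 0; }
    const T& value() const { return std::get<0>(value_); }
    const E& error() const { return std::get<1>(value_); }

private:
    template <std::size_t I, typename V>
    Result(std::in_place_index_t<I> tag, V&& v) : value_(tag, std::forward<V>(v)) {}
    std::variant<T, E> value_;  // exactly one of: success value, error value
};

// Usage: callers must branch on the outcome instead of relying on exceptions.
Result<int, std::string> parse_positive(int x) {
    if (x > 0) return Result<int, std::string>::Ok(x);
    return Result<int, std::string>::Err("value must be positive");
}
```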
- Compiler: GCC 12+ (C++20 support required)
- CMake: 3.20+
- CUDA Toolkit: 12.0+ (tested on 12.5)
- GPU: NVIDIA GPU with Compute Capability 7.0+
Download a pre-quantized model to get started quickly:
https://huggingface.co/Lummy666/llama-3.2-1B-Instruct
To build from source:

```bash
# Clone repository
cd photon_infer

# Configure with CUDA
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DPHOTON_BUILD_CUDA=ON ..

# Build
cmake --build . -j$(nproc)

# Install (optional)
sudo cmake --install .
```

After installation, you can run the web server directly from anywhere:

```bash
photon_web_server \
    --port 5728 \
    --model /path/to/llama-3.2-1B-Instruct \
    --tokenizer /path/to/llama-3.2-1B-Instruct/tokenizer.json
```

The installation will place:
- `photon_web_server` → `/usr/local/bin/`
- Static web files → `/photon_infer/web/static/`
- Core library → `/usr/local/lib/`
To uninstall:
```bash
cd build
sudo cmake --build . --target uninstall
```

Alternatively, run the pre-built Docker image:

```bash
# Pull the pre-built Docker image
docker pull lumia431/photon_infer:latest

# Run the container with GPU support
docker run --rm --gpus all -p 5728:5728 -e PORT=5728 lumia431/photon_infer:latest
```

The web interface will be available at http://localhost:5728
- Group-wise quantization: Configurable group size (32, 64, 128)
- cuBLASLt integration: Hardware-accelerated INT8 GEMM
- Minimal accuracy loss: < 1% perplexity degradation on Llama models
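A minimal host-side sketch of group-wise symmetric INT8 quantization, with one scale per group derived from the group's maximum absolute value. The function and struct names are hypothetical, and the real kernels run on the GPU via cuBLASLt.

```cpp
// Group-wise symmetric INT8 quantization sketch; illustrative only.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct QuantizedRow {
    std::vector<int8_t> q;       // quantized values
    std::vector<float>  scales;  // one scale per group
};

QuantizedRow quantize_groupwise(const std::vector<float>& x, std::size_t group_size) {
    QuantizedRow out;
    out.q.resize(x.size());
    for (std::size_t g = 0; g < x.size(); g += group_size) {
        std::size_t end = std::min(g + group_size, x.size());
        // Per-group scale from the max absolute value in the group.
        float amax = 0.f;
        for (std::size_t i = g; i < end; ++i) amax = std::max(amax, std::fabs(x[i]));
        float scale = amax > 0.f ? amax / 127.f : 1.f;
        out.scales.push_back(scale);
        for (std::size_t i = g; i < end; ++i)
            out.q[i] = static_cast<int8_t>(std::lround(x[i] / scale));
    }
    return out;
}
// Dequantization multiplies each int8 value by its group's scale:
// x_hat[i] = scales[i / group_size] * q[i].
```

A smaller group size (e.g. 32) tracks local dynamic range more tightly at the cost of storing more scales; 128 trades the reverse way.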
- Block-level KV cache: Efficient memory allocation without fragmentation
- Dynamic sequence management: Per-sequence cache offsets for flexible scheduling
- Batched cache operations: Single kernel for multi-sequence K/V writes
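The sketch below shows the bookkeeping side of block-level KV cache management: a free list of fixed-size physical blocks plus a per-sequence block table. It is a simplified host-side illustration under assumed names, not PhotonInfer's allocator.

```cpp
// Toy block-level KV cache bookkeeping; names and policy are assumptions.
#include <cstddef>
#include <stdexcept>
#include <unordered_map>
#include <vector>

class BlockAllocator {
public:
    explicit BlockAllocator(std::size_t num_blocks) {
        for (std::size_t b = 0; b < num_blocks; ++b) free_.push_back(b);
    }

    // Append one block to a sequence's block table (e.g. when its KV cache
    // outgrows the last block). Throws if the pool is exhausted.
    std::size_t append_block(int seq_id) {
        if (free_.empty()) throw std::runtime_error("KV cache pool exhausted");
        std::size_t block = free_.back();
        free_.pop_back();
        tables_[seq_id].push_back(block);
        return block;
    }

    // Release all blocks of a finished (or preempted) sequence back to the pool.
    void free_sequence(int seq_id) {
        auto it = tables_.find(seq_id);
        if (it == tables_.end()) return;
        for (std::size_t b : it->second) free_.push_back(b);
        tables_.erase(it);
    }

    // Physical block ids backing a sequence, in logical order.
    const std::vector<std::size_t>& block_table(int seq_id) const {
        return tables_.at(seq_id);
    }

private:
    std::vector<std::size_t> free_;                              // free physical block ids
    std::unordered_map<int, std::vector<std::size_t>> tables_;   // seq -> block ids
};
```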
- Two-phase scheduling:
  - Phase 1: Continue all RUNNING requests (no interruption)
  - Phase 2: Admit WAITING requests to fill remaining capacity
- Request states: WAITING β RUNNING β FINISHED (with PREEMPTED support)
- Token-level granularity: `num_computed_tokens` tracking for precise resume
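Putting the pieces above together, a simplified two-phase scheduling step might look like the following; the token-budget policy and type names are assumptions for illustration, not the engine's real scheduler.

```cpp
// Simplified two-phase scheduling step: continue RUNNING, then admit WAITING.
#include <cstddef>
#include <deque>
#include <vector>

enum class State { WAITING, RUNNING, FINISHED };

struct Req {
    std::size_t prompt_len = 0;
    std::size_t num_computed_tokens = 0;  // resume point for prefill/decode
    State state = State::WAITING;
};

struct ScheduleResult { std::vector<Req*> batch; };

ScheduleResult schedule_step(std::vector<Req*>& running,
                             std::deque<Req*>& waiting,
                             std::size_t token_budget) {
    ScheduleResult out;
    // Phase 1: every RUNNING request decodes one token this step.
    for (Req* r : running) {
        if (token_budget == 0) break;
        out.batch.push_back(r);
        token_budget -= 1;
    }
    // Phase 2: admit WAITING requests into the leftover budget (prefill).
    while (!waiting.empty() && waiting.front()->prompt_len <= token_budget) {
        Req* r = waiting.front();
        waiting.pop_front();
        r->state = State::RUNNING;
        token_budget -= r->prompt_len;
        running.push_back(r);
        out.batch.push_back(r);
    }
    return out;
}
```

Because running requests are always served before any admission, a newly arriving request can never starve one that is already decoding.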
- Core Infrastructure: Tensor, operators, memory management
- LLaMA Model: Full transformer implementation with CPU/GPU kernels
- INT8 Quantization: Group-wise quantization with cuBLASLt
- Paged Attention: Block-level KV cache management
- Continuous Batching: vLLM-style dynamic request scheduling
- Flash Attention 2: IO-aware attention for long sequences
- Multi-GPU Support: Tensor parallelism for large models
- FP16/BF16 Mixed Precision: Enhanced throughput on modern GPUs
- Speculative Decoding: Multi-token generation with draft model
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
MIT License - see LICENSE for details.
- Architecture inspired by vLLM
- Kernel optimizations reference llama.cpp
- Error handling design from Rust's `Result<T, E>`
Built with ❤️ for high-performance LLM inference