Axion is a high-performance LLM serving platform built with Rust that provides OpenAI-compatible APIs for chat completions, embeddings, and reranking. Designed for production environments, Axion targets high throughput and low latency through request caching, continuous batching, and hardware-accelerated inference.
- Dual Backend System: Automatically uses MAX serve for supported models and falls back to Candle for models MAX does not support
- OpenAI-Compatible API: Drop-in replacement for OpenAI API endpoints with full compatibility
- Streaming Support: Real-time streaming responses using Server-Sent Events (SSE)
- Request Caching: LRU cache system for faster repeated requests
- Continuous Batching: Efficient request batching for improved throughput
- Multi-Model Support: Extensive support for Llama, Qwen3, Gemma, Mistral, GLM4, Granite, OLMo, and other architectures
- Hardware Acceleration: Automatic GPU detection and utilization
- Memory Efficiency: Optimized memory management and KV-cache reuse
- Concurrent Processing: High-throughput request handling
- Adaptive Batching: Dynamic batch formation based on request patterns
- Endpoint: POST /v1/chat/completions
- Features: Streaming and non-streaming responses, full OpenAI parameter compatibility
- Example:

```bash
curl -X POST http://localhost:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-3B-Instruct",
"messages": [
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 150,
"stream": false
}'
```
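To stream the response instead, set stream to true in the same request; tokens then arrive as Server-Sent Events (curl's -N flag disables output buffering):

```bash
curl -N -X POST http://localhost:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-3B-Instruct",
"messages": [
{"role": "user", "content": "Hello!"}
],
"stream": true
}'
```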
- Endpoint: POST /v1/embeddings
- Features: High-performance embeddings using fastembed
- Example:

```bash
curl -X POST http://localhost:3000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": "Hello, World!",
"model": "BAAI/bge-small-en-v1.5"
}'
```
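The OpenAI embeddings API also accepts a list of strings as input; if this deployment follows that convention, several texts can be embedded in one request (array support here is an assumption, not confirmed above):

```bash
curl -X POST http://localhost:3000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": ["Hello, World!", "Axion serves embeddings"],
"model": "BAAI/bge-small-en-v1.5"
}'
```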
- Endpoint: POST /v1/rerank
- Features: Semantic reranking using fastembed for improved search results
- Example:

```bash
curl -X POST http://localhost:3000/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"query": "what is a panda?",
"documents": ["A bear species", "A software library", "An animal"],
"model": "BAAI/bge-reranker-base",
"top_n": 3
}'
```
- Endpoint: GET /health
- Returns: Server status, backend availability, loaded model information, and system metrics
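- Example (assuming the default host and port):

```bash
curl http://localhost:3000/health
```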
```
┌─────────────┐
│   Client    │
└──────┬──────┘
       │ HTTPS
       ▼
┌─────────────────────────┐
│      Axion Server       │
│   (Axum + Tower HTTP)   │
└──────┬──────────────────┘
       │
       ├──► Cache Layer (LRU)
       │
       ├──► Continuous Batcher
       │
       ▼
┌─────────────────────────┐
│    Inference Engine     │
│     (Smart Routing)     │
└──────┬──────────────────┘
       │
       ├──► MAX Client ──────► max serve (OpenAI API)
       │                           │
       │                           ▼
       │                      Model Process
       │
       └──► Candle Backend ──► Native Inference
                                (Llama, Qwen, etc.)
                                   │
                                   └──► GPU/CPU Execution
```
- Rust: Latest stable version (1.70+)
- Git LFS: For large model files
- MAX CLI (Optional): For MAX backend support
- CUDA (Optional): For GPU acceleration
- Clone the repository:

```bash
git clone <repository-url>
cd axion
git lfs install
git lfs pull
```

- Build the project:

```bash
cargo build --release
```

- Run with default settings:

```bash
# Use default model
cargo run --release
# Specify a model
MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct" cargo run --release
# With custom configuration
MODEL_NAME="microsoft/Phi-3-mini-4k-instruct" \
SERVER_PORT=8080 \
RUST_LOG=axion=info \
cargo run --release
```

- MODEL_NAME: Primary model to serve (default: meta-llama/Llama-3.2-3B-Instruct)
- MAX_SEQ_LEN: Maximum sequence length (default: 4096)
- SERVER_HOST: Server host address (default: 0.0.0.0)
- SERVER_PORT: Server port (default: 3000)
- MAX_CONNECTIONS: Maximum concurrent connections (default: 100)
- CACHE_CAPACITY: Number of cached responses (default: 1000)
- BATCH_TIMEOUT_MS: Batching timeout in milliseconds (default: 50)
- MAX_BATCH_SIZE: Maximum batch size (default: 8)
- CONCURRENT_REQUESTS: Maximum concurrent requests (default: 10)
- RUST_LOG: Logging level (default: axion=info,tower_http=info)
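For example, to serve on a different port with a larger cache and batch budget (the values below are illustrative, not recommendations):

```bash
MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct" \
SERVER_PORT=8080 \
CACHE_CAPACITY=5000 \
MAX_BATCH_SIZE=16 \
CONCURRENT_REQUESTS=20 \
cargo run --release
```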
When a model is supported by MAX, Axion automatically:
- Spawns max serve --model {model_name}
- Waits for MAX to become ready
- Routes all requests to MAX's OpenAI-compatible endpoint
- Monitors health and manages process lifecycle
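To confirm that the MAX CLI is installed and handles a given model, the same command Axion spawns can be run by hand (the model name is only an example):

```bash
max serve --model meta-llama/Llama-3.2-3B-Instruct
```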
MAX supports:
- Llama models (Llama, Llama2, Llama3, Llama3.1, Llama3.2)
- Mistral models (Mistral, MistralNeMo, Mixtral)
- Qwen models (Qwen, Qwen2, Qwen3)
- Gemma models (Gemma, Gemma2)
- Phi models (Phi, Phi2, Phi3)
- DeepSeek models
- And other HuggingFace transformers
If MAX is unavailable or the model is not supported by it, Axion uses Candle:
- Loads model using model-specific implementation
- Performs native inference with Candle framework
- Automatically uses GPU if available
- Applies model-specific optimizations
Candle supports:
- Llama family models
- Qwen3 and quantized variants
- Gemma family models
- Mistral family models
- GLM4 family models
- IBM Granite models
- OLMo models
- And other transformer architectures
- LRU cache with configurable capacity (default: 1000 entries)
- Caches non-streaming chat completions based on model parameters
- Cache key includes model, messages, temperature, and other relevant parameters
- Thread-safe implementation for concurrent access
- Dynamic batch formation with configurable timeout
- Configurable maximum batch size
- Reduces computational overhead for concurrent requests
- Maintains low latency through intelligent batching
- Automatic GPU detection and utilization
- CUDA support for NVIDIA GPUs
- Optimized memory management for GPU inference
- CPU optimization with SIMD instructions
```
src/
├── main.rs               # Server entry point and HTTP handlers
├── api_types.rs          # OpenAI-compatible API type definitions
├── inference_engine.rs   # Main inference coordinator and backend routing
├── max_client.rs         # MAX serve integration and process management
├── candle_inference.rs   # Native Candle backend implementation
├── embedding_service.rs  # Embedding generation service
├── rerank_service.rs     # Document reranking service
├── cache.rs              # LRU cache implementation
├── batching.rs           # Continuous batching system
├── embed.rs              # Example embedding code
├── rerank.rs             # Example reranking code
└── models/               # Model-specific Candle implementations
    ├── llama.rs          # Llama architecture implementation
    ├── qwen3.rs          # Qwen3 architecture implementation
    ├── gemma.rs          # Gemma architecture implementation
    ├── mistral.rs        # Mistral architecture implementation
    ├── glm4.rs           # GLM4 architecture implementation
    ├── granite.rs        # Granite architecture implementation
    ├── olmo.rs           # OLMo architecture implementation
    └── quant_qwen3.rs    # Quantized Qwen3 implementation
```
New models become available through the MAX backend as soon as MAX itself adds support for them; simply use the model identifier.
To add support for a new transformer architecture:
- Create the model implementation in src/models/{architecture_name}.rs
- Add a variant to the ModelBackend enum in src/candle_inference.rs
- Implement model loading and generation methods
- Update configuration parsing if needed
```bash
# Run all tests
cargo test
# Run tests with detailed output
cargo test -- --nocapture
# Format code
cargo fmt
# Run linter
cargo clippy
# Run performance tests
cargo test --release -- --ignored performance
```

Complete documentation is available in the Docs/ directory, covering all aspects of the system:
- Architecture overview
- API reference
- Model-specific implementations
- Configuration guides
- Performance optimization
Typical performance characteristics:
- Throughput: 5-50+ requests per second depending on model and configuration
- Latency: 10ms-2s+ depending on request type and model
- Memory Usage: Model-dependent + runtime overhead
- GPU Utilization: 30-90% with proper batch sizing
```dockerfile
# Example Dockerfile
FROM rust:latest as builder
WORKDIR /app
COPY . .
RUN cargo build --release
FROM debian:bookworm-slim    # match the glibc of the rust:latest build stage
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/axion /usr/local/bin/axion
EXPOSE 3000
CMD ["axion"]- Supports Kubernetes deployments
- Configurable resource limits
- Health check endpoints for liveness/readiness probes
- Environment variable configuration for different environments
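With the Dockerfile above, a typical build-and-run sequence looks like this (image tag and model are illustrative; GPU passthrough and volume mounts depend on your environment):

```bash
docker build -t axion .
docker run -p 3000:3000 -e MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct" axion
```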
- Strict validation of all API parameters
- Size limits for input and output
- Sanitization of model identifiers
- Protection against injection attacks
- Optional API key authentication
- Rate limiting capabilities
- Network access controls
- Model access restrictions
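If API key authentication is enabled, the usual OpenAI-style convention is a bearer token in the Authorization header; the snippet below assumes that convention, and AXION_API_KEY is only a placeholder for however keys are provisioned in your deployment:

```bash
# AXION_API_KEY is a placeholder; key management is deployment-specific
curl -X POST http://localhost:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $AXION_API_KEY" \
-d '{"model": "meta-llama/Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
```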
This project is licensed under the terms specified in the LICENSE file.
We welcome contributions! Please see our contribution guidelines for details:
- Fork the repository
- Create a feature branch for your changes
- Add tests for new functionality
- Update documentation as needed
- Submit a pull request with a clear description
- Follow Rust coding standards and best practices
- Write comprehensive tests for new features
- Document public APIs thoroughly
- Maintain performance and security standards
- Enhanced monitoring and metrics
- Model quantization support
- Advanced caching strategies
- Improved error handling and recovery
- Multi-GPU support
- Model hot-swapping
- Custom backend plugins
- Advanced batching algorithms
- Distributed inference
- Enhanced security features
