Axion is a high-performance LLM serving platform built with Rust that provides OpenAI-compatible APIs for chat completions, embeddings, and reranking. Designed for production environments, Axion targets high throughput and low latency through request caching, continuous batching, and hardware-accelerated inference.
- Dual Backend System: Automatically uses MAX serve for supported models and falls back to Candle for models MAX does not support
- OpenAI-Compatible API: Drop-in replacement for OpenAI API endpoints with full compatibility
- Streaming Support: Real-time streaming responses using Server-Sent Events (SSE)
- Request Caching: LRU cache system for faster repeated requests
- Continuous Batching: Efficient request batching for improved throughput
- Multi-Model Support: Extensive support for Llama, Qwen3, Gemma, Mistral, GLM4, Granite, OLMo, and other architectures
- Hardware Acceleration: Automatic GPU detection and utilization
- Memory Efficiency: Optimized memory management and KV-cache reuse
- Concurrent Processing: High-throughput request handling
- Adaptive Batching: Dynamic batch formation based on request patterns
- Endpoint: POST /v1/chat/completions
- Features: Streaming and non-streaming responses, full OpenAI parameter compatibility
- Example:

```bash
curl -X POST http://localhost:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-3B-Instruct",
"messages": [
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 150,
"stream": false
}'
```
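To stream the response instead, set stream to true in the same request; tokens then arrive as Server-Sent Events (curl's -N flag disables output buffering):

```bash
curl -N -X POST http://localhost:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-3B-Instruct",
"messages": [
{"role": "user", "content": "Hello!"}
],
"stream": true
}'
```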
- Endpoint: POST /v1/embeddings
- Features: High-performance embeddings using fastembed
- Example:

```bash
curl -X POST http://localhost:3000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": "Hello, World!",
"model": "BAAI/bge-small-en-v1.5"
}'
```
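The OpenAI embeddings API also accepts a list of strings as input; if this deployment follows that convention, several texts can be embedded in one request (array support here is an assumption, not confirmed above):

```bash
curl -X POST http://localhost:3000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": ["Hello, World!", "Axion serves embeddings"],
"model": "BAAI/bge-small-en-v1.5"
}'
```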
- Endpoint: POST /v1/rerank
- Features: Semantic reranking using fastembed for improved search results
- Example:

```bash
curl -X POST http://localhost:3000/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"query": "what is a panda?",
"documents": ["A bear species", "A software library", "An animal"],
"model": "BAAI/bge-reranker-base",
"top_n": 3
}'
```
- Endpoint: GET /health
- Returns: Server status, backend availability, loaded model information, and system metrics
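- Example (assuming the default host and port):

```bash
curl http://localhost:3000/health
```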
```
┌─────────────┐
│   Client    │
└──────┬──────┘
       │ HTTPS
       ▼
┌─────────────────────────┐
│      Axion Server       │
│   (Axum + Tower HTTP)   │
└──────┬──────────────────┘
       │
       ├──► Cache Layer (LRU)
       │
       ├──► Continuous Batcher
       │
       ▼
┌─────────────────────────┐
│    Inference Engine     │
│     (Smart Routing)     │
└──────┬──────────────────┘
       │
       ├──► MAX Client ──────► max serve (OpenAI API)
       │                           │
       │                           ▼
       │                      Model Process
       │
       └──► Candle Backend ──► Native Inference
                                (Llama, Qwen, etc.)
                                   │
                                   └──► GPU/CPU Execution
```
- Rust: Latest stable version (1.70+)
- Git LFS: For large model files
- MAX CLI (Optional): For MAX backend support
- CUDA (Optional): For GPU acceleration
- Clone the repository:

```bash
git clone <repository-url>
cd axion
git lfs install
git lfs pull
```

- Build the project:

```bash
cargo build --release
```

- Run with default settings:

```bash
# Use default model
cargo run --release
# Specify a model
MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct" cargo run --release
# With custom configuration
MODEL_NAME="microsoft/Phi-3-mini-4k-instruct" \
SERVER_PORT=8080 \
RUST_LOG=axion=info \
cargo run --release
```

- MODEL_NAME: Primary model to serve (default: meta-llama/Llama-3.2-3B-Instruct)
- MAX_SEQ_LEN: Maximum sequence length (default: 4096)
- SERVER_HOST: Server host address (default: 0.0.0.0)
- SERVER_PORT: Server port (default: 3000)
- MAX_CONNECTIONS: Maximum concurrent connections (default: 100)
- CACHE_CAPACITY: Number of cached responses (default: 1000)
- BATCH_TIMEOUT_MS: Batching timeout in milliseconds (default: 50)
- MAX_BATCH_SIZE: Maximum batch size (default: 8)
- CONCURRENT_REQUESTS: Maximum concurrent requests (default: 10)
- RUST_LOG: Logging level (default: axion=info,tower_http=info)
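For example, to serve on a different port with a larger cache and batch budget (the values below are illustrative, not recommendations):

```bash
MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct" \
SERVER_PORT=8080 \
CACHE_CAPACITY=5000 \
MAX_BATCH_SIZE=16 \
CONCURRENT_REQUESTS=20 \
cargo run --release
```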
When a model is supported by MAX, Axion automatically:
- Spawns max serve --model {model_name}
- Waits for MAX to become ready
- Routes all requests to MAX's OpenAI-compatible endpoint
- Monitors health and manages process lifecycle
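To confirm that the MAX CLI is installed and handles a given model, the same command Axion spawns can be run by hand (the model name is only an example):

```bash
max serve --model meta-llama/Llama-3.2-3B-Instruct
```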
MAX supports:
- Llama models (Llama, Llama2, Llama3, Llama3.1, Llama3.2)
- Mistral models (Mistral, MistralNeMo, Mixtral)
- Qwen models (Qwen, Qwen2, Qwen3)
- Gemma models (Gemma, Gemma2)
- Phi models (Phi, Phi2, Phi3)
- DeepSeek models
- And other HuggingFace transformers
If MAX is unavailable or the model is not supported by it, Axion uses Candle:
- Loads model using model-specific implementation
- Performs native inference with Candle framework
- Automatically uses GPU if available
- Applies model-specific optimizations
Candle supports:
- Llama family models
- Qwen3 and quantized variants
- Gemma family models
- Mistral family models
- GLM4 family models
- IBM Granite models
- OLMo models
- And other transformer architectures
- LRU cache with configurable capacity (default: 1000 entries)
- Caches non-streaming chat completions based on model parameters
- Cache key includes model, messages, temperature, and other relevant parameters
- Thread-safe implementation for concurrent access
- Dynamic batch formation with configurable timeout
- Configurable maximum batch size
- Reduces computational overhead for concurrent requests
- Maintains low latency through intelligent batching
- Automatic GPU detection and utilization
- CUDA support for NVIDIA GPUs
- Optimized memory management for GPU inference
- CPU optimization with SIMD instructions
```
src/
├── main.rs               # Server entry point and HTTP handlers
├── api_types.rs          # OpenAI-compatible API type definitions
├── inference_engine.rs   # Main inference coordinator and backend routing
├── max_client.rs         # MAX serve integration and process management
├── candle_inference.rs   # Native Candle backend implementation
├── embedding_service.rs  # Embedding generation service
├── rerank_service.rs     # Document reranking service
├── cache.rs              # LRU cache implementation
├── batching.rs           # Continuous batching system
├── embed.rs              # Example embedding code
├── rerank.rs             # Example reranking code
└── models/               # Model-specific Candle implementations
    ├── llama.rs          # Llama architecture implementation
    ├── qwen3.rs          # Qwen3 architecture implementation
    ├── gemma.rs          # Gemma architecture implementation
    ├── mistral.rs        # Mistral architecture implementation
    ├── glm4.rs           # GLM4 architecture implementation
    ├── granite.rs        # Granite architecture implementation
    ├── olmo.rs           # OLMo architecture implementation
    └── quant_qwen3.rs    # Quantized Qwen3 implementation
```
New models become available through the MAX backend as soon as MAX itself adds support for them; simply use the model identifier.
To add support for a new transformer architecture:
- Create the model implementation in src/models/{architecture_name}.rs
- Add a variant to the ModelBackend enum in src/candle_inference.rs
- Implement model loading and generation methods
- Update configuration parsing if needed
```bash
# Run all tests
cargo test
# Run tests with detailed output
cargo test -- --nocapture
# Format code
cargo fmt
# Run linter
cargo clippy
# Run performance tests
cargo test --release -- --ignored performance
```

Complete documentation is available in the Docs/ directory, covering all aspects of the system:
- Architecture overview
- API reference
- Model-specific implementations
- Configuration guides
- Performance optimization
Typical performance characteristics:
- Throughput: 5-50+ requests per second depending on model and configuration
- Latency: 10ms-2s+ depending on request type and model
- Memory Usage: Model-dependent + runtime overhead
- GPU Utilization: 30-90% with proper batch sizing
```dockerfile
# Example Dockerfile
FROM rust:latest as builder
WORKDIR /app
COPY . .
RUN cargo build --release
FROM debian:bookworm-slim    # match the glibc of the rust:latest build stage
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/axion /usr/local/bin/axion
EXPOSE 3000
CMD ["axion"]- Supports Kubernetes deployments
- Configurable resource limits
- Health check endpoints for liveness/readiness probes
- Environment variable configuration for different environments
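With the Dockerfile above, a typical build-and-run sequence looks like this (image tag and model are illustrative; GPU passthrough and volume mounts depend on your environment):

```bash
docker build -t axion .
docker run -p 3000:3000 -e MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct" axion
```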
- Strict validation of all API parameters
- Size limits for input and output
- Sanitization of model identifiers
- Protection against injection attacks
- Optional API key authentication
- Rate limiting capabilities
- Network access controls
- Model access restrictions
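If API key authentication is enabled, the usual OpenAI-style convention is a bearer token in the Authorization header; the snippet below assumes that convention, and AXION_API_KEY is only a placeholder for however keys are provisioned in your deployment:

```bash
# AXION_API_KEY is a placeholder; key management is deployment-specific
curl -X POST http://localhost:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $AXION_API_KEY" \
-d '{"model": "meta-llama/Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
```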
This project is licensed under the terms specified in the LICENSE file.
We welcome contributions! Please see our contribution guidelines for details:
- Fork the repository
- Create a feature branch for your changes
- Add tests for new functionality
- Update documentation as needed
- Submit a pull request with a clear description
- Follow Rust coding standards and best practices
- Write comprehensive tests for new features
- Document public APIs thoroughly
- Maintain performance and security standards
- Enhanced monitoring and metrics
- Model quantization support
- Advanced caching strategies
- Improved error handling and recovery
- Multi-GPU support
- Model hot-swapping
- Custom backend plugins
- Advanced batching algorithms
- Distributed inference
- Enhanced security features
