
Axion - High-Performance LLM Serving Platform

OpenAI-compatible LLM serving built with Rust

Axion is a high-performance LLM serving platform built in Rust that provides OpenAI-compatible APIs for chat completions, embeddings, and reranking. Designed for production environments, Axion delivers high throughput and low latency through request caching, continuous batching, and hardware acceleration.

Key Features

Core Capabilities

  • Dual Backend System: Automatically uses MAX serve for supported models and falls back to Candle for everything else
  • OpenAI-Compatible API: Drop-in replacement for OpenAI API endpoints with full compatibility
  • Streaming Support: Real-time streaming responses using Server-Sent Events (SSE)
  • Request Caching: LRU cache system for faster repeated requests
  • Continuous Batching: Efficient request batching for improved throughput
  • Multi-Model Support: Extensive support for Llama, Qwen3, Gemma, Mistral, GLM4, Granite, OLMo, and other architectures

Performance Optimizations

  • Hardware Acceleration: Automatic GPU detection and utilization
  • Memory Efficiency: Optimized memory management and KV-cache reuse
  • Concurrent Processing: High-throughput request handling
  • Adaptive Batching: Dynamic batch formation based on request patterns

API Endpoints

Chat Completions

  • Endpoint: POST /v1/chat/completions
  • Features: Streaming and non-streaming responses, full OpenAI parameter compatibility
  • Example:
curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 150,
    "stream": false
  }'
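
For a streamed response, set "stream": true; the server then returns incremental chunks over Server-Sent Events in the OpenAI streaming format. A minimal example (the -N flag disables curl's output buffering so chunks appear as they arrive):

curl -N -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
      {"role": "user", "content": "Write a haiku about Rust."}
    ],
    "stream": true
  }'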

Embeddings

  • Endpoint: POST /v1/embeddings
  • Features: High-performance embeddings using fastembed
  • Example:
curl -X POST http://localhost:3000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, World!",
    "model": "BAAI/bge-small-en-v1.5"
  }'
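
If the endpoint also accepts the OpenAI-style array form of input (assumed here from the OpenAI compatibility above, not verified separately), several texts can be embedded in a single call:

curl -X POST http://localhost:3000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": ["Hello, World!", "Goodbye, World!"],
    "model": "BAAI/bge-small-en-v1.5"
  }'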

Reranking

  • Endpoint: POST /v1/rerank
  • Features: Semantic reranking using fastembed for improved search results
  • Example:
curl -X POST http://localhost:3000/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "what is a panda?",
    "documents": ["A bear species", "A software library", "An animal"],
    "model": "BAAI/bge-reranker-base",
    "top_n": 3
  }'

Health Check

  • Endpoint: GET /health
  • Returns: Server status, backend availability, loaded model information, and system metrics
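  • Example (default host and port):
curl http://localhost:3000/health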

Architecture Overview

┌─────────────┐
│   Client    │
└─────┬───────┘
      │ HTTPS
      ▼
┌─────────────────────────┐
│    Axion Server         │
│  (Axum + Tower HTTP)    │
└─────┬───────────────────┘
      │
      ├──► Cache Layer (LRU)
      │
      ├──► Continuous Batcher
      │
      ▼
┌─────────────────────────┐
│  Inference Engine       │
│  (Smart Routing)        │
└─────┬───────────────────┘
      │
      ├──► MAX Client ──────► max serve (OpenAI API)
      │                       │
      │                       ▼
      │                   Model Process
      │
      └──► Candle Backend ──► Native Inference
           │                  (Llama, Qwen, etc.)
           │
           └──► GPU/CPU Execution

Quick Start

Prerequisites

  • Rust: Latest stable version (1.70+)
  • Git LFS: For large model files
  • MAX CLI (Optional): For MAX backend support
  • CUDA (Optional): For GPU acceleration

Installation

  1. Clone the repository:
git clone <repository-url>
cd axion
git lfs install
git lfs pull
  2. Build the project:
cargo build --release
  3. Run with default settings:
# Use default model
cargo run --release

# Specify a model
MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct" cargo run --release

# With custom configuration
MODEL_NAME="microsoft/Phi-3-mini-4k-instruct" \
SERVER_PORT=8080 \
RUST_LOG=axion=info \
cargo run --release

Configuration

Environment Variables

Model Configuration

  • MODEL_NAME: Primary model to serve (default: meta-llama/Llama-3.2-3B-Instruct)
  • MAX_SEQ_LEN: Maximum sequence length (default: 4096)

Server Configuration

  • SERVER_HOST: Server host address (default: 0.0.0.0)
  • SERVER_PORT: Server port (default: 3000)
  • MAX_CONNECTIONS: Maximum concurrent connections (default: 100)

Performance Configuration

  • CACHE_CAPACITY: Number of cached responses (default: 1000)
  • BATCH_TIMEOUT_MS: Batching timeout in milliseconds (default: 50)
  • MAX_BATCH_SIZE: Maximum batch size (default: 8)
  • CONCURRENT_REQUESTS: Maximum concurrent requests (default: 10)

Logging

  • RUST_LOG: Logging level (default: axion=info,tower_http=info)
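
Any of these variables can be combined in a single launch command. The values below are illustrative, not tuned recommendations:

MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct" \
SERVER_PORT=3000 \
CACHE_CAPACITY=2000 \
BATCH_TIMEOUT_MS=25 \
MAX_BATCH_SIZE=16 \
CONCURRENT_REQUESTS=32 \
RUST_LOG=axion=info,tower_http=info \
cargo run --release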

Backend Selection

MAX Backend (Primary)

When a model is supported by MAX, Axion automatically:

  1. Spawns max serve --model {model_name}
  2. Waits for MAX to become ready
  3. Routes all requests to MAX's OpenAI-compatible endpoint
  4. Monitors health and manages process lifecycle

MAX supports:

  • Llama models (Llama, Llama2, Llama3, Llama3.1, Llama3.2)
  • Mistral models (Mistral, MistralNeMo, Mixtral)
  • Qwen models (Qwen, Qwen2, Qwen3)
  • Gemma models (Gemma, Gemma2)
  • Phi models (Phi, Phi2, Phi3)
  • DeepSeek models
  • And other HuggingFace transformers

Candle Backend (Fallback)

If MAX is unavailable or unsupported, Axion uses Candle:

  1. Loads model using model-specific implementation
  2. Performs native inference with Candle framework
  3. Automatically uses GPU if available
  4. Applies model-specific optimizations

Candle supports:

  • Llama family models
  • Qwen3 and quantized variants
  • Gemma family models
  • Mistral family models
  • GLM4 family models
  • IBM Granite models
  • OLMo models
  • And other transformer architectures

Performance Features

Request Caching

  • LRU cache with configurable capacity (default: 1000 entries)
  • Caches non-streaming chat completions based on model parameters
  • Cache key includes model, messages, temperature, and other relevant parameters
  • Thread-safe implementation for concurrent access
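
Because only non-streaming completions are cached, the cache is easy to observe from the command line: send the same request twice and compare wall-clock time (timings are illustrative and depend on the model and hardware):

# First call runs inference and populates the cache
time curl -s -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Hello!"}], "stream": false}' > /dev/null

# An identical second call should be answered from the LRU cache
time curl -s -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Hello!"}], "stream": false}' > /dev/null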

Continuous Batching

  • Dynamic batch formation with configurable timeout
  • Configurable maximum batch size
  • Reduces computational overhead for concurrent requests
  • Maintains low latency through intelligent batching

Hardware Acceleration

  • Automatic GPU detection and utilization
  • CUDA support for NVIDIA GPUs
  • Optimized memory management for GPU inference
  • CPU optimization with SIMD instructions

Project Structure

src/
├── main.rs                 # Server entry point and HTTP handlers
├── api_types.rs            # OpenAI-compatible API type definitions
├── inference_engine.rs     # Main inference coordinator and backend routing
├── max_client.rs           # MAX serve integration and process management
├── candle_inference.rs     # Native Candle backend implementation
├── embedding_service.rs    # Embedding generation service
├── rerank_service.rs       # Document reranking service
├── cache.rs                # LRU cache implementation
├── batching.rs             # Continuous batching system
├── embed.rs                # Example embedding code
├── rerank.rs               # Example reranking code
└── models/                 # Model-specific Candle implementations
    ├── llama.rs           # Llama architecture implementation
    ├── qwen3.rs           # Qwen3 architecture implementation
    ├── gemma.rs           # Gemma architecture implementation
    ├── mistral.rs         # Mistral architecture implementation
    ├── glm4.rs            # GLM4 architecture implementation
    ├── granite.rs         # Granite architecture implementation
    ├── olmo.rs            # OLMo architecture implementation
    └── quant_qwen3.rs     # Quantized Qwen3 implementation

Development

Adding New Models

For MAX-supported models

New models work as soon as MAX itself adds support for them; simply pass the model identifier (e.g. via MODEL_NAME) and Axion will route requests to MAX automatically.

For Candle models

To add support for a new transformer architecture:

  1. Create model implementation in src/models/{architecture_name}.rs
  2. Add variant to ModelBackend enum in src/candle_inference.rs
  3. Implement model loading and generation methods
  4. Update configuration parsing if needed

Testing

# Run all tests
cargo test

# Run tests with detailed output
cargo test -- --nocapture

# Format code
cargo fmt

# Run linter
cargo clippy

# Run performance tests
cargo test --release -- --ignored performance

Building Documentation

Complete documentation is available in the Docs/ directory, covering all aspects of the system:

  • Architecture overview
  • API reference
  • Model-specific implementations
  • Configuration guides
  • Performance optimization

Performance Benchmarks

Typical performance characteristics:

  • Throughput: 5-50+ requests per second depending on model and configuration
  • Latency: 10ms-2s+ depending on request type and model
  • Memory Usage: Model-dependent + runtime overhead
  • GPU Utilization: 30-90% with proper batch sizing

Production Deployment

Docker Support

# Example Dockerfile
FROM rust:latest AS builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/axion /usr/local/bin/axion
EXPOSE 3000
CMD ["axion"]

Container Orchestration

  • Supports Kubernetes deployments
  • Configurable resource limits
  • Health check endpoints for liveness/readiness probes
  • Environment variable configuration for different environments

Security Considerations

Input Validation

  • Strict validation of all API parameters
  • Size limits for input and output
  • Sanitization of model identifiers
  • Protection against injection attacks

Access Control

  • Optional API key authentication
  • Rate limiting capabilities
  • Network access controls
  • Model access restrictions

License

This project is licensed under the terms specified in the LICENSE file.

Contributing

We welcome contributions! Please see our contribution guidelines for details:

  1. Fork the repository
  2. Create a feature branch for your changes
  3. Add tests for new functionality
  4. Update documentation as needed
  5. Submit a pull request with a clear description

Development Guidelines

  • Follow Rust coding standards and best practices
  • Write comprehensive tests for new features
  • Document public APIs thoroughly
  • Maintain performance and security standards

Roadmap

Short-term Goals

  • Enhanced monitoring and metrics
  • Model quantization support
  • Advanced caching strategies
  • Improved error handling and recovery

Long-term Goals

  • Multi-GPU support
  • Model hot-swapping
  • Custom backend plugins
  • Advanced batching algorithms
  • Distributed inference
  • Enhanced security features
