defai-digital/ax-serving

AX Serving

Category: Department-Scale Private AI Fleet Control Plane

Product: The serving and orchestration layer for multi-model private AI fleets operated by SMEs and enterprise departments.

Requirements: macOS 14+, Rust 1.88+ · Tests: 384 passing · License: AGPL-3.0-or-later

AX Serving is the serving and orchestration control plane behind AX Fabric. It is designed for department-scale private AI fleets that need OpenAI-compatible APIs, runtime and model inventory, scheduling, metrics, audit surfaces, and multi-worker routing across heterogeneous runtime nodes.

AX Serving is not the token-generation engine. In the target architecture, inference execution is delegated to runtime nodes:

  • Mac nodes run ax-engine
  • PC CUDA nodes run vLLM
  • NVIDIA Thor nodes run vLLM

The existing embedded local worker path remains available as a compatibility bridge. New deployments should prefer runtime-node adapters such as ax-runtime-agent in front of ax-engine or vLLM endpoints.

AX Fabric is the product-facing layer for retrieval, knowledge, and grounded agent workflows. AX Serving is the infrastructure layer that makes that stack deployable and operable across Mac-led and mixed-worker environments.

Status: production-ready Rust workspace for Apple Silicon (aarch64-apple-darwin) with OpenAI-compatible REST, gRPC, runtime model management, and multi-worker orchestration oriented around department-scale private AI serving.

Market Focus

AX Serving is built to win in four adjacent niches:

  • department-scale private AI fleet control planes
  • Mac-native serving and orchestration for single-node and Mac-grid deployments
  • enterprise mixed-worker orchestration across NVIDIA / Thor-class, Mac Studio-class, and future workers
  • serving infrastructure for governed private AI stacks such as AX Fabric

Who it is for:

  • SMEs and enterprise departments with fewer than about 100 users or operators
  • platform and infra teams running private AI fleets
  • operators who need more than a single local runtime process
  • teams that care about model lifecycle, routing, metrics, health, audit, and fleet operations
  • private deployments that need an OpenAI-compatible serving layer without a cloud-first dependency

What it is not:

  • not an end-user desktop chat app
  • not a generic CUDA hyperscale serving stack
  • not the low-level token-generation engine itself

Deployment fit:

  • Single Mac: the default open-source deployment path
  • Mac grid: the default open-source multi-worker deployment path
  • Enterprise heterogeneous fleet: commercial path for NVIDIA / Thor-class workers, governed mixed-node deployments, and enterprise delivery requirements

Licensing And Commercial Use

AX Serving is dual-licensed:

  • Open-source use: AGPL-3.0-or-later
  • Commercial use: available under separate written license

Commercial licensing is intended for organizations that want to use AX Serving as a proprietary serving backend, private inference/control plane, embedded runtime, OEM component, managed fleet, or enterprise integration layer without AGPL obligations.

Commercial engagements may include:

  • commercial runtime licensing
  • private deployment rights
  • OEM / embedded redistribution rights
  • enterprise fleet and mixed-node integration work
  • support, service, and deployment terms

Open-Source And Enterprise Boundary

The public repository is the open-source core of AX Serving.

The default open-source product scope is:

  • single-Mac serving
  • Mac-led local serving
  • Mac worker grids
  • core serving, orchestration, worker, metrics, and admin protocols

Commercial offerings cover one or both of the following:

  • non-AGPL licensing rights for the AX Serving core itself
  • separate enterprise modules, deployment bundles, and supported integrations

The intended enterprise expansion path is:

  • NVIDIA / Thor-class workers
  • heterogeneous Mac + accelerator fleets
  • enterprise auth, governance, and deployment packaging
  • supported private integrations and fleet operations tooling

The public repository contains the public source distribution, including single-node and multi-worker serving/orchestration capabilities. Commercial agreements govern usage outside AGPL obligations, private packaging, and enterprise delivery terms. The recommended technical boundary is service-level integration, not private crates mixed into the public workspace.

See LICENSING.md and LICENSE-COMMERCIAL.md.

Quick Start

Prerequisites:

  • Apple Silicon macOS
  • Rust toolchain
  • one inference runtime node path:
    • Mac compatibility worker through ax-serving serve
    • Mac ax-engine node adapter path through ax-runtime-agent
    • PC CUDA or NVIDIA Thor vLLM node path through ax-runtime-agent

Validate your environment:

cargo check --workspace
cargo run -p ax-serving-cli --bin ax-serving -- doctor

Recommended topology:

  • run ax-serving-api as the API gateway and control plane
  • register runtime nodes through the worker/node contract
  • route requests by model, runtime class, node pool, health, and capacity

Generic runtime node adapter:

AXS_CONTROL_PLANE_URL=http://127.0.0.1:19090 \
AXS_NODE_RUNTIME=vllm \
AXS_NODE_RUNTIME_URL=http://127.0.0.1:8000 \
AXS_NODE_ADVERTISED_ADDR=127.0.0.1:18081 \
AXS_NODE_HARDWARE_CLASS=pc-cuda \
cargo run -p ax-thor-agent --bin ax-runtime-agent

Use AXS_NODE_RUNTIME=ax_engine and AXS_NODE_HARDWARE_CLASS=mac for a Mac ax-engine node. The legacy ax-thor-agent binary remains available as a Thor compatibility alias.

Compatibility local worker:

AXS_ALLOW_NO_AUTH=true \
AXS_WORKER_RUNTIME=ax_engine \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m ./models/<model>.gguf \
  --model-id default \
  --host 127.0.0.1 \
  --port 18080

Send a request:

curl -sS http://127.0.0.1:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Give me three short points about Rust."}],
    "stream": false,
    "max_tokens": 96
  }'
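
The same request can be issued from Python with only the standard library. This is a minimal sketch, not an official client: the helper names `build_chat_request` and `send` are hypothetical, and the response layout is assumed to follow the usual OpenAI chat-completion shape.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key=None, max_tokens=96):
    """Build (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:  # only needed when the server was started with AXS_API_KEY
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With the compatibility worker from the previous step running, `send(build_chat_request("http://127.0.0.1:18080", "default", "Give me three short points about Rust."))` should return the decoded response; under the assumed OpenAI layout the text would sit at `reply["choices"][0]["message"]["content"]`.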

For fuller setup paths, see QUICKSTART.md:

  • gateway + runtime nodes
  • local compatibility worker
  • authenticated offline deployment
  • model management
  • embeddings

TypeScript SDK (Zod-validated):

cd sdk/javascript
npm install
npm run build

Why AX Serving

Most local runtimes focus on single-process inference. AX Serving focuses on the operational layer above inference:

  • OpenAI-compatible REST and gRPC serving
  • runtime model load/unload/reload
  • admission queueing and concurrency control
  • metrics, dashboard, diagnostics, and audit surfaces
  • multi-worker orchestration in the public repo
  • benchmark and soak tooling in the same repo

Positioning:

  • AX Fabric is the product layer
  • AX Serving is the serving and orchestration layer underneath it
  • inference runtimes such as ax-engine, vLLM, and compatibility local backends remain lower-level execution systems

Runtime Architecture

AX Serving is not itself the token-generation engine. It is the serving layer that routes requests into runtime nodes.

  • Mac inference should be provided by ax-engine runtime nodes.
  • PC CUDA and NVIDIA Thor inference should be provided by vLLM runtime nodes.
  • The worker registry records runtime, runtime version, hardware class, runtime endpoint, supported operations, health, queue, and model inventory.
  • Fleet routing can use model, runtime class, worker pool, hardware class, health, load, queue state, and capability constraints.
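
To make the registry and routing inputs above concrete, here is a small Python sketch of a registry entry and a constraint filter. It is illustrative only: the `WorkerRecord` field names and the `eligible_workers` helper are hypothetical and do not describe the actual node contract or wire format.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerRecord:
    """One registry entry; field names are illustrative, not the wire format."""
    worker_id: str
    runtime: str            # e.g. "ax_engine" or "vllm"
    runtime_version: str
    hardware_class: str     # e.g. "mac" or "pc-cuda"
    runtime_endpoint: str
    healthy: bool = True
    queue_depth: int = 0
    model_ids: set = field(default_factory=set)
    supported_ops: set = field(default_factory=lambda: {"chat"})


def eligible_workers(workers, model_id, op="chat", hardware_class=None, max_queue=None):
    """Filter the registry down to workers that can take this request."""
    result = []
    for w in workers:
        if not w.healthy or model_id not in w.model_ids or op not in w.supported_ops:
            continue
        if hardware_class is not None and w.hardware_class != hardware_class:
            continue
        if max_queue is not None and w.queue_depth > max_queue:
            continue
        result.append(w)
    # Prefer the shortest queue; ties keep registry order (sorted is stable).
    return sorted(result, key=lambda w: w.queue_depth)
```

Filtering first and ranking second keeps hard constraints (health, model, capability) separate from soft preferences (load), which is one plausible way to combine the routing inputs listed above.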

The legacy embedded backend paths (llama.cpp, MLX subprocess, optional libllama, and direct native ax-engine integration) are compatibility paths. They remain available for migration and local testing, but new product work should use the public node contract instead of adding more inference-runtime responsibility to AX Serving.

In practice, this means AX Serving owns the APIs, scheduling, orchestration, fleet health, metrics, and lifecycle policy, while runtime nodes own inference execution.

Best With AX Fabric

AX Serving is designed to work with AX Fabric as part of one complete system.

  • AX Serving: execution control plane, model lifecycle, routing, scheduling, APIs
  • AX Fabric: document ingestion, vector search, BM25/hybrid retrieval, MCP-native data access
  • Together: AX Fabric is the product layer; AX Serving is the execution layer underneath it

Core Capabilities

  • OpenAI-compatible chat/completions/embeddings
  • Streaming SSE + non-streaming responses
  • Runtime model management (/v1/models)
  • Multi-worker orchestration (ax-serving-api)
  • Dispatch policies (least_inflight, weighted_round_robin, model_affinity, token_cost)
  • Scheduler queue/inflight controls
  • Prometheus + JSON metrics
  • Embedded dashboard (/dashboard)
  • Built-in benchmarking (ax-serving-bench)
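
As a rough illustration of what two of the dispatch policy names suggest, the Python sketch below shows one plausible reading of least_inflight and weighted_round_robin. The function signatures and tie-breaking rules are assumptions made for the example; the actual implementations in ax-serving-api may differ.

```python
import itertools


def least_inflight(workers):
    """Pick the worker with the fewest in-flight requests.

    `workers` maps worker id -> current in-flight count.
    """
    if not workers:
        return None
    # The worker id breaks ties so the choice is deterministic.
    return min(workers, key=lambda wid: (workers[wid], wid))


def weighted_round_robin(weights):
    """Return an endless iterator in which each id appears `weight` times per cycle.

    `weights` maps worker id -> positive integer weight.
    """
    cycle = [wid for wid, weight in sorted(weights.items()) for _ in range(weight)]
    if not cycle:
        raise ValueError("at least one worker with a positive weight is required")
    return itertools.cycle(cycle)
```

least_inflight reacts to live load, while weighted_round_robin only reflects static capacity ratios, so the former tends to suit workers whose request latency varies widely.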

Run Modes

1. Single Inference CLI

cargo run -p ax-serving-cli --bin ax-serving -- \
  -m ./models/<model>.gguf \
  -p "Hello from AX Serving" \
  -n 128

2. Single Runtime (ax-serving serve)

AXS_ALLOW_NO_AUTH=true \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m ./models/<model>.gguf \
  --model-id default \
  --port 18080

3. Gateway + Workers (ax-serving-api + workers)

Gateway:

AXS_ALLOW_NO_AUTH=true \
cargo run -p ax-serving-cli --bin ax-serving-api -- \
  --port 18080 \
  --internal-port 19090 \
  --policy least_inflight

Worker:

AXS_ALLOW_NO_AUTH=true \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m ./models/<model>.gguf \
  --model-id default \
  --port 18081 \
  --orchestrator http://127.0.0.1:19090

This gateway + worker path is part of the open-source Mac-native deployment story. Enterprise fleet products build on the same serving contracts while adding supported NVIDIA / Thor-class worker integrations, deployment bundles, and governance layers under commercial terms.


API Surface

Serving runtime (ax-serving serve)

  • POST /v1/chat/completions
  • POST /v1/completions
  • POST /v1/embeddings
  • GET /v1/models
  • POST /v1/models
  • DELETE /v1/models/{id}
  • POST /v1/models/{id}/reload
  • GET /health
  • GET /v1/metrics
  • GET /metrics
  • GET /dashboard
  • GET /v1/license
  • POST /v1/license
  • GET /v1/admin/status
  • GET /v1/admin/startup-report
  • GET /v1/admin/diagnostics
  • GET /v1/admin/audit
  • GET /v1/admin/policy

Orchestrator (ax-serving-api)

  • POST /v1/chat/completions
  • POST /v1/completions
  • POST /v1/embeddings
  • GET /v1/models
  • GET /health
  • GET /v1/metrics
  • GET /v1/license
  • POST /v1/license
  • GET /v1/admin/status
  • GET /v1/admin/startup-report
  • GET /v1/admin/diagnostics
  • GET /v1/admin/audit
  • GET /v1/admin/policy
  • GET /v1/admin/fleet
  • GET /v1/workers
  • GET /v1/workers/{id}
  • POST /v1/workers/{id}/drain
  • POST /v1/workers/{id}/drain-complete
  • DELETE /v1/workers/{id}

Runtime health contract:

  • GET /health returns current status, loaded model_ids, uptime_secs, and thermal state.
  • Orchestrators and monitoring use it for both liveness and readiness checks.
  • See docs/contracts/ax-fabric-runtime-contract.md for the formal integration contract.
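
A monitoring script might interpret the /health body along these lines. This is a sketch under assumptions: the JSON keys `status` and `model_ids` follow the wording above, the `"ok"` status string is taken from the health semantics named in the test suite description, and the readiness rule itself is hypothetical. The formal contract document is authoritative.

```python
def classify_health(body):
    """Map a decoded /health JSON body to (live, ready).

    Key names and status strings are assumptions based on the README wording.
    """
    status = body.get("status")
    model_ids = body.get("model_ids") or []
    live = status is not None              # the process answered with a status
    # Assumed rule: ready only when status is "ok" and a model is loaded.
    ready = status == "ok" and len(model_ids) > 0
    return live, ready
```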

Admin/control-plane notes:

  • all authenticated admin responses preserve X-Request-ID
  • GET /v1/admin/status gives an operational summary
  • GET /v1/admin/startup-report and GET /v1/admin/diagnostics are for runtime inspection
  • worker inventory and drain APIs are orchestrator-only

v1.4 Runtime Controls

  • AXS_SPLIT_SCHEDULER=true
    • enables prefill/decode activity tracking in scheduler metrics

Relevant scheduler metrics:

  • prefill_tokens_active
  • decode_sequences_active
  • split_scheduler_enabled

Authentication

  • If AXS_API_KEY is set, protected endpoints require bearer auth.
  • If AXS_API_KEY is unset, startup requires AXS_ALLOW_NO_AUTH=true.

Recommended offline enterprise startup:

AXS_CONFIG=config/serving.offline-enterprise.yaml \
AXS_API_KEY="change-me" \
AXS_MODEL_ALLOWED_DIRS="/absolute/path/to/models" \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m /absolute/path/to/models/<model>.gguf \
  --model-id default

With multiple keys (comma-separated):

AXS_API_KEY="token1,token2" \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m ./models/<model>.gguf

Client header:

Authorization: Bearer token1
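
The two startup rules and the bearer check can be pictured with the following Python sketch. It is not the server's code: `load_api_keys` and `is_authorized` are hypothetical helpers, and the handling of whitespace and of the comma-separated key list is an assumption based on the examples above.

```python
import hmac


def load_api_keys(env):
    """Return the accepted keys, or None when auth is explicitly disabled."""
    raw = env.get("AXS_API_KEY", "")
    keys = [k.strip() for k in raw.split(",") if k.strip()]
    if keys:
        return keys
    if env.get("AXS_ALLOW_NO_AUTH", "") == "true":
        return None
    raise RuntimeError("set AXS_API_KEY or AXS_ALLOW_NO_AUTH=true")


def is_authorized(header_value, keys):
    """Check an Authorization header value against the accepted keys."""
    if keys is None:          # auth disabled
        return True
    if not header_value:
        return False
    scheme, _, token = header_value.strip().partition(" ")
    if scheme.lower() != "bearer":
        return False
    token = token.strip().encode("utf-8")
    # Compare against every key in constant time; no early exit on a match.
    matches = [hmac.compare_digest(token, k.encode("utf-8")) for k in keys]
    return any(matches)
```

Using `hmac.compare_digest` rather than `==` avoids leaking key contents through comparison timing, which is a common precaution for token checks.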

Build, Lint, Test

cargo check --workspace
cargo fmt --all -- --check
cargo clippy --workspace --tests -- -D warnings
cargo test --workspace

Integration tests (no model required; they use in-process mock servers):

AXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test orchestration
AXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test model_management
AXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test graceful_shutdown

Release build:

cargo build --workspace --release

Test Coverage

All tests run automatically in CI on every push and pull request against main. No model file or GPU is required — tests use in-process backends (NullBackend, EchoBackend, FailingUnloadBackend) that exercise the full request path without hardware.

Exact test counts change over time. Use the linked CI badge and workflow runs as the source of truth.

Suites and what they cover:

  • Unit (serving API): Scheduler (permits, AIMD, TTFT histogram, split prefill/decode), model registry (lifecycle, idle eviction, capacity), orchestration (queue, dispatch policies, worker registry, DashMap), REST helpers (cache key normalisation, cache hit ratio), config (env layering, validation), gRPC status mapping, auth, metrics
  • Unit (engine): Backend routing, GGUF metadata parsing, thermal state, memory budget
  • Unit (C shim): Null-safe llama.h ABI compatibility
  • Integration (model_management): Auth (Bearer, whitespace tolerance, 401+WWW-Authenticate), model load/unload/reload (201/200/409/404/503), health semantics (ok/degraded/critical-thermal/no-models), input validation (400/422 on every field), full inference path (chat + completions via EchoBackend), embeddings, security response headers, metrics JSON keys, dashboard HTML, license GET/SET
  • Integration (orchestration): Worker register/heartbeat/eviction, dispatch (least-inflight, weighted round-robin, model-affinity, token-cost), queue admission and backpressure, reroute on 5xx, chaos (all workers fail → 503), overload (queue full → 429)
  • Integration (graceful_shutdown): In-flight request drains to completion before server exits
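
The overload behaviour exercised by the orchestration suite (queue full → 429) can be pictured with a small admission model in Python. This is a simplified sketch, not the scheduler's implementation: the `AdmissionQueue` class and its limits are hypothetical, and it leaves out AIMD permit adjustment entirely.

```python
from collections import deque


class AdmissionQueue:
    """Bounded in-flight slots plus a bounded wait queue."""

    def __init__(self, max_inflight, max_queue):
        self.max_inflight = max_inflight
        self.max_queue = max_queue
        self.inflight = 0
        self.waiting = deque()

    def submit(self, request_id):
        """Return 'run', 'queued', or 429 when both limits are reached."""
        if self.inflight < self.max_inflight:
            self.inflight += 1
            return "run"
        if len(self.waiting) < self.max_queue:
            self.waiting.append(request_id)
            return "queued"
        return 429

    def finish(self):
        """Release a slot; hand it to the oldest queued request, if any."""
        if self.waiting:
            # The slot passes straight to the waiter, so inflight is unchanged.
            return self.waiting.popleft()
        self.inflight = max(0, self.inflight - 1)
        return None
```

Rejecting with 429 once the wait queue is full keeps latency bounded for admitted requests instead of letting the backlog grow without limit.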

Every CI run posts a test summary to the GitHub Actions job summary page — see the Actions tab for per-run results.


Benchmarking

cargo run -p ax-serving-bench --release -- bench -m ./models/<model>.gguf

Other benchmark modes:

  • profile
  • mixed
  • cache-bench
  • soak
  • compare
  • regression-check
  • multi-worker

Repository Layout

  • crates/ax-serving-engine: backend abstraction, routing, model internals
  • crates/ax-serving-api: REST/gRPC serving, scheduler, orchestration
  • crates/ax-serving-cli: ax-serving and ax-serving-api binaries
  • crates/ax-serving-bench: benchmark and soak runners
  • crates/ax-serving-shim: C-compatible shim
  • crates/ax-serving-py: Python bindings
  • config/: serving and routing configuration
  • docs/: runbooks and architecture notes

About

Offline OpenAI-compatible serving and orchestration plane for AX Fabric on Apple Silicon, with runtime model lifecycle, routing, metrics, and multi-worker control.
