AX Serving

Category: Department-Scale Private AI Fleet Control Plane

Product: The serving and orchestration layer for multi-model private AI fleets operated by SMEs and enterprise departments.

AX Serving is the serving and orchestration control plane behind AX Fabric. It is designed for department-scale private AI fleets that need OpenAI-compatible APIs, runtime and model inventory, scheduling, metrics, audit surfaces, and multi-worker routing across heterogeneous runtime nodes.

AX Serving is not the token-generation engine. In the target architecture, inference execution is delegated to runtime nodes:

Mac nodes run ax-engine
PC CUDA nodes run vLLM
NVIDIA Thor nodes run vLLM

The existing embedded local worker path remains available as a compatibility bridge. New deployments should prefer runtime-node adapters such as ax-runtime-agent in front of ax-engine or vLLM endpoints.

AX Fabric is the product-facing layer for retrieval, knowledge, and grounded agent workflows. AX Serving is the infrastructure layer that makes that stack deployable and operable across Mac-led and mixed-worker environments.

Status: production-ready Rust workspace for Apple Silicon (aarch64-apple-darwin) with OpenAI-compatible REST, gRPC, runtime model management, and multi-worker orchestration oriented around department-scale private AI serving.

Market Focus

AX Serving is built to win in four adjacent niches:

department-scale private AI fleet control planes
Mac-native serving and orchestration for single-node and Mac-grid deployments
enterprise mixed-worker orchestration across NVIDIA / Thor-class, Mac Studio-class, and future workers
serving infrastructure for governed private AI stacks such as AX Fabric

Who it is for:

SMEs and enterprise departments with fewer than roughly 100 users or operators
platform and infra teams running private AI fleets
operators who need more than a single local runtime process
teams that care about model lifecycle, routing, metrics, health, audit, and fleet operations
private deployments that need an OpenAI-compatible serving layer without a cloud-first dependency

What it is not:

not an end-user desktop chat app
not a generic CUDA hyperscale serving stack
not the low-level token-generation engine itself

Deployment fit:

Single Mac: the default open-source deployment path
Mac grid: the default open-source multi-worker deployment path
Enterprise heterogeneous fleet: commercial path for NVIDIA / Thor-class workers, governed mixed-node deployments, and enterprise delivery requirements

For market positioning, competitive analysis, and ICP details, see docs/market-positioning.md, docs/competitive-landscape.md, and docs/icp-and-demand.md.

Licensing And Commercial Use

AX Serving is dual-licensed:

Open-source use: AGPL-3.0-or-later
Commercial use: available under separate written license

Commercial licensing is intended for organizations that want to use AX Serving as a proprietary serving backend, private inference/control plane, embedded runtime, OEM component, managed fleet, or enterprise integration layer without AGPL obligations.

Commercial engagements may include:

commercial runtime licensing
private deployment rights
OEM / embedded redistribution rights
enterprise fleet and mixed-node integration work
support, service, and deployment terms

Open-Source And Enterprise Boundary

The public repository is the open-source core of AX Serving.

The default open-source product scope is:

single-Mac serving
Mac-led local serving
Mac worker grids
core serving, orchestration, worker, metrics, and admin protocols

Commercial offerings cover one or both of the following:

non-AGPL licensing rights for the AX Serving core itself
separate enterprise modules, deployment bundles, and supported integrations

The intended enterprise expansion path is:

NVIDIA / Thor-class workers
heterogeneous Mac + accelerator fleets
enterprise auth, governance, and deployment packaging
supported private integrations and fleet operations tooling

The public repository contains the public source distribution, including single-node and multi-worker serving/orchestration capabilities. Commercial agreements govern usage outside AGPL obligations, private packaging, and enterprise delivery terms. The recommended technical boundary is service-level integration, not private crates mixed into the public workspace.

See LICENSING.md and LICENSE-COMMERCIAL.md.

Execution artifacts for the open-source / enterprise split:

Quick Start

Prerequisites:

Apple Silicon macOS
Rust toolchain
one inference runtime node path:
- Mac compatibility worker through ax-serving serve
- Mac ax-engine node adapter path through ax-runtime-agent
- PC CUDA or NVIDIA Thor vLLM node path through ax-runtime-agent

Validate your environment:

cargo check --workspace
cargo run -p ax-serving-cli --bin ax-serving -- doctor

Recommended topology:

run ax-serving-api as the API gateway and control plane
register runtime nodes through the worker/node contract
route requests by model, runtime class, node pool, health, and capacity

Generic runtime node adapter:

AXS_CONTROL_PLANE_URL=http://127.0.0.1:19090 \
AXS_NODE_RUNTIME=vllm \
AXS_NODE_RUNTIME_URL=http://127.0.0.1:8000 \
AXS_NODE_ADVERTISED_ADDR=127.0.0.1:18081 \
AXS_NODE_HARDWARE_CLASS=pc-cuda \
cargo run -p ax-thor-agent --bin ax-runtime-agent

Use AXS_NODE_RUNTIME=ax_engine and AXS_NODE_HARDWARE_CLASS=mac for a Mac ax-engine node. The legacy ax-thor-agent binary remains available as a Thor compatibility alias.

Compatibility local worker:

AXS_ALLOW_NO_AUTH=true \
AXS_WORKER_RUNTIME=ax_engine \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m ./models/<model>.gguf \
  --model-id default \
  --host 127.0.0.1 \
  --port 18080

Send a request:

curl -sS http://127.0.0.1:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Give me three short points about Rust."}],
    "stream": false,
    "max_tokens": 96
  }'
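The same request can be issued from a script. The following is a minimal sketch using only the Python standard library; the helper names build_chat_request and send are illustrative and are not part of any AX Serving SDK. It assumes the local worker from the previous step is listening on 127.0.0.1:18080.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key=None, max_tokens=96):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the server was started with AXS_API_KEY set.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling send(build_chat_request("http://127.0.0.1:18080", "default", "Give me three short points about Rust.")) mirrors the curl command above. Building the request separately from sending it keeps the payload easy to inspect and test without a running server.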

For fuller setup paths, see QUICKSTART.md:

gateway + runtime nodes
local compatibility worker
authenticated offline deployment
model management
embeddings

TypeScript SDK (Zod-validated):

cd sdk/javascript
npm install
npm run build

Why AX Serving

Most local runtimes focus on single-process inference. AX Serving focuses on the operational layer above inference:

OpenAI-compatible REST and gRPC serving
runtime model load/unload/reload
admission queueing and concurrency control
metrics, dashboard, diagnostics, and audit surfaces
multi-worker orchestration in the public repo
benchmark and soak tooling in the same repo
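To make "admission queueing and concurrency control" concrete, here is a small sketch of one common way such a layer can behave: a fixed number of in-flight slots, a bounded wait queue, and rejection once both are full. The class name, method names, and return values are hypothetical and do not describe the actual AX Serving scheduler.

```python
from collections import deque


class AdmissionController:
    """Bounded concurrency plus a bounded wait queue (illustrative only)."""

    def __init__(self, max_inflight, max_queue):
        self.max_inflight = max_inflight
        self.max_queue = max_queue
        self.inflight = 0
        self.queue = deque()

    def submit(self, request_id):
        """Return "run", "queued", or "rejected" for a new request."""
        if self.inflight < self.max_inflight:
            self.inflight += 1
            return "run"
        if len(self.queue) < self.max_queue:
            self.queue.append(request_id)
            return "queued"
        # Both the slots and the queue are full: the caller should back off.
        return "rejected"

    def finish(self):
        """Mark one running request as done.

        If a request is waiting, its id is returned and it inherits the
        freed slot, so the in-flight count stays the same.
        """
        if self.inflight == 0:
            raise RuntimeError("finish() called with nothing in flight")
        if self.queue:
            return self.queue.popleft()
        self.inflight -= 1
        return None
```

In an HTTP front end, a "rejected" result would typically be surfaced as an overload response; the test coverage table later in this README lists "queue full → 429" for the orchestrator.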

Positioning:

AX Fabric is the product layer
AX Serving is the serving and orchestration layer underneath it
inference runtimes such as ax-engine, vLLM, and compatibility local backends remain lower-level execution systems

Runtime Architecture

AX Serving is not itself the token-generation engine. It is the serving layer that routes requests into runtime nodes.

Mac inference should be provided by ax-engine runtime nodes.
PC CUDA and NVIDIA Thor inference should be provided by vLLM runtime nodes.
The worker registry records runtime, runtime version, hardware class, runtime endpoint, supported operations, health, queue, and model inventory.
Fleet routing can use model, runtime class, worker pool, hardware class, health, load, queue state, and capability constraints.

The legacy embedded backend paths (llama.cpp, MLX subprocess, optional libllama, and direct native ax-engine integration) are compatibility paths. They remain available for migration and local testing, but new product work should use the public node contract instead of adding more inference-runtime responsibility to AX Serving.

In practice, this means AX Serving owns the APIs, scheduling, orchestration, fleet health, metrics, and lifecycle policy, while runtime nodes own inference execution.
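The split between registry and routing can be sketched as a record type plus a filter. This is an illustration under stated assumptions: the field names follow the attributes listed above, but the WorkerRecord class, the eligible_workers function, and the "chat" operation label are hypothetical and are not the real node contract.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerRecord:
    worker_id: str
    runtime: str            # e.g. "ax_engine" or "vllm"
    runtime_version: str
    hardware_class: str     # e.g. "mac" or "pc-cuda"
    runtime_endpoint: str
    supported_ops: set = field(default_factory=set)
    healthy: bool = True
    queue_depth: int = 0
    models: set = field(default_factory=set)


def eligible_workers(workers, model, op="chat", runtime=None, hardware_class=None):
    """Return healthy workers that serve `model`, least-loaded first.

    `runtime` and `hardware_class` are optional constraints; None means
    "any". The op labels are assumed names for this sketch.
    """
    matches = []
    for w in workers:
        if not w.healthy or model not in w.models or op not in w.supported_ops:
            continue
        if runtime is not None and w.runtime != runtime:
            continue
        if hardware_class is not None and w.hardware_class != hardware_class:
            continue
        matches.append(w)
    return sorted(matches, key=lambda w: w.queue_depth)
```

A dispatch policy would then choose among the returned candidates. Keeping the filter separate from the policy means constraints (model, runtime, hardware) and load balancing can evolve independently.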

Best With AX Fabric

AX Serving is designed to work with AX Fabric as part of one complete system.

AX Serving: execution control plane, model lifecycle, routing, scheduling, APIs
AX Fabric: document ingestion, vector search, BM25/hybrid retrieval, MCP-native data access
Together: AX Fabric is the product layer; AX Serving is the execution layer underneath it

Core Capabilities

Capability	AX Serving
OpenAI-compatible chat/completions/embeddings	✅
Streaming SSE + non-streaming responses	✅
Runtime model management (`/v1/models`)	✅
Multi-worker orchestration (`ax-serving-api`)	✅
Dispatch policies (`least_inflight`, `weighted_round_robin`, `model_affinity`, `token_cost`)	✅
Scheduler queue/inflight controls	✅
Prometheus + JSON metrics	✅
Embedded dashboard (`/dashboard`)	✅
Built-in benchmarking (`ax-serving-bench`)	✅
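Two of the dispatch policies named in the table are easy to illustrate. The sketch below shows a generic least-in-flight choice and a smooth weighted round-robin; it is written from the policy names alone and is an assumption about their general behavior, not a description of the AX Serving implementation.

```python
def least_inflight(inflight_by_worker):
    """Pick the worker id with the fewest in-flight requests.

    `inflight_by_worker` maps worker id -> current in-flight count.
    Ties go to the worker that appears first in the mapping.
    """
    return min(inflight_by_worker, key=inflight_by_worker.get)


class WeightedRoundRobin:
    """Smooth weighted round-robin over worker id -> positive integer weight."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.current = {worker: 0 for worker in self.weights}

    def pick(self):
        total = sum(self.weights.values())
        # Every worker earns credit in proportion to its weight...
        for worker, weight in self.weights.items():
            self.current[worker] += weight
        # ...the worker with the most credit is chosen and pays the total back.
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= total
        return chosen
```

With weights {"a": 2, "b": 1}, successive calls to pick() interleave the workers rather than sending two requests to "a" back to back, which is the main reason to prefer the smooth variant over a naive expanded list.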

Run Modes

1. Single Inference CLI

cargo run -p ax-serving-cli --bin ax-serving -- \
  -m ./models/<model>.gguf \
  -p "Hello from AX Serving" \
  -n 128

2. Single Runtime (`ax-serving serve`)

AXS_ALLOW_NO_AUTH=true \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m ./models/<model>.gguf \
  --model-id default \
  --port 18080

3. Gateway + Workers (`ax-serving-api` + workers)

Gateway:

AXS_ALLOW_NO_AUTH=true \
cargo run -p ax-serving-cli --bin ax-serving-api -- \
  --port 18080 \
  --internal-port 19090 \
  --policy least_inflight

Worker:

AXS_ALLOW_NO_AUTH=true \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m ./models/<model>.gguf \
  --model-id default \
  --port 18081 \
  --orchestrator http://127.0.0.1:19090

This gateway + worker path is part of the open-source Mac-native deployment story. Enterprise fleet products build on the same serving contracts while adding supported NVIDIA / Thor-class worker integrations, deployment bundles, and governance layers under commercial terms.

API Surface

Serving runtime (`ax-serving serve`)

POST /v1/chat/completions
POST /v1/completions
POST /v1/embeddings
GET /v1/models
POST /v1/models
DELETE /v1/models/{id}
POST /v1/models/{id}/reload
GET /health
GET /v1/metrics
GET /metrics
GET /dashboard
GET /v1/license
POST /v1/license
GET /v1/admin/status
GET /v1/admin/startup-report
GET /v1/admin/diagnostics
GET /v1/admin/audit
GET /v1/admin/policy

Orchestrator (`ax-serving-api`)

POST /v1/chat/completions
POST /v1/completions
POST /v1/embeddings
GET /v1/models
GET /health
GET /v1/metrics
GET /v1/license
POST /v1/license
GET /v1/admin/status
GET /v1/admin/startup-report
GET /v1/admin/diagnostics
GET /v1/admin/audit
GET /v1/admin/policy
GET /v1/admin/fleet
GET /v1/workers
GET /v1/workers/{id}
POST /v1/workers/{id}/drain
POST /v1/workers/{id}/drain-complete
DELETE /v1/workers/{id}

Runtime health contract:

GET /health returns current status, loaded model_ids, uptime_secs, and thermal state.
Orchestrators and monitoring use it for both liveness and readiness checks.
See docs/contracts/ax-fabric-runtime-contract.md for the formal integration contract.
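A monitoring script might interpret the health payload along these lines. This is a sketch only: the keys status and model_ids follow the field names mentioned above, the "ok" value is taken from the health semantics listed in the test coverage table, and the readiness rule itself (only "ok" with at least one loaded model counts as ready) is an assumed policy. The formal contract document is the authority on the actual schema.

```python
def is_ready(health, required_model=None):
    """Decide readiness from a decoded GET /health JSON body.

    Assumed keys: "status" (string) and "model_ids" (list of strings).
    """
    if health.get("status") != "ok":
        return False
    models = health.get("model_ids") or []
    if not models:
        return False
    if required_model is not None and required_model not in models:
        return False
    return True
```

A stricter or looser rule (for example, treating a degraded node as still ready) is a deployment decision; the function is kept small so that policy is easy to change.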

AX Fabric integration contract:

documented in docs/contracts/ax-fabric-runtime-contract.md

Admin/control-plane notes:

all authenticated admin responses preserve X-Request-ID
GET /v1/admin/status gives an operational summary
GET /v1/admin/startup-report and GET /v1/admin/diagnostics are for runtime inspection
worker inventory and drain APIs are orchestrator-only

v1.4 Runtime Controls

AXS_SPLIT_SCHEDULER=true
- enables prefill/decode activity tracking in scheduler metrics

Relevant scheduler metrics:

prefill_tokens_active
decode_sequences_active
split_scheduler_enabled

Authentication

If AXS_API_KEY is set, protected endpoints require bearer auth.
If AXS_API_KEY is unset, startup requires AXS_ALLOW_NO_AUTH=true.

Recommended offline enterprise startup:

AXS_CONFIG=config/serving.offline-enterprise.yaml \
AXS_API_KEY="change-me" \
AXS_MODEL_ALLOWED_DIRS="/absolute/path/to/models" \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m /absolute/path/to/models/<model>.gguf \
  --model-id default

AXS_API_KEY can also carry several comma-separated tokens:

AXS_API_KEY="token1,token2" cargo run -p ax-serving-cli --bin ax-serving -- serve -m ./models/<model>.gguf

Client header:

Authorization: Bearer token1
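The startup rule and the bearer check can be sketched as two small functions. The environment variable names come from this section; the function names, the error message, and the exact parsing (comma-separated keys, case-insensitive scheme, tolerance for extra whitespace) are assumptions made for illustration and may differ from the server's behavior.

```python
import hmac


def load_api_keys(env):
    """Return the configured keys, or [] when auth is explicitly disabled.

    Raises if neither AXS_API_KEY nor AXS_ALLOW_NO_AUTH=true is present,
    mirroring the startup requirement described above.
    """
    raw = env.get("AXS_API_KEY", "")
    keys = [key.strip() for key in raw.split(",") if key.strip()]
    if keys:
        return keys
    if env.get("AXS_ALLOW_NO_AUTH", "").lower() == "true":
        return []
    raise RuntimeError("set AXS_API_KEY or AXS_ALLOW_NO_AUTH=true")


def is_authorized(authorization_header, keys):
    """Check an Authorization header value against the configured keys."""
    if not keys:
        return True  # auth disabled via AXS_ALLOW_NO_AUTH
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.strip().partition(" ")
    if scheme.lower() != "bearer":
        return False
    token = token.strip().encode("utf-8")
    # compare_digest avoids leaking key contents through comparison timing.
    return any(hmac.compare_digest(token, key.encode("utf-8")) for key in keys)
```

For example, load_api_keys({"AXS_API_KEY": "token1,token2"}) yields two keys, and is_authorized("Bearer token1", keys) then succeeds while a missing or malformed header does not.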

Build, Lint, Test

cargo check --workspace
cargo fmt --all -- --check
cargo clippy --workspace --tests -- -D warnings
cargo test --workspace

Integration tests (no model required; they use in-process mock servers):

AXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test orchestration
AXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test model_management
AXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test graceful_shutdown

Release build:

cargo build --workspace --release

Test Coverage

All tests run automatically in CI on every push and pull request against main. No model file or GPU is required — tests use in-process backends (NullBackend, EchoBackend, FailingUnloadBackend) that exercise the full request path without hardware.

Exact test counts change over time. Use the linked CI badge and workflow runs as the source of truth.

Suite	What It Covers
Unit — serving API	Scheduler (permits, AIMD, TTFT histogram, split prefill/decode), model registry (lifecycle, idle eviction, capacity), orchestration (queue, dispatch policies, worker registry, DashMap), REST helpers (cache key normalisation, cache hit ratio), config (env layering, validation), gRPC status mapping, auth, metrics
Unit — engine	Backend routing, GGUF metadata parsing, thermal state, memory budget
Unit — C shim	Null-safe llama.h ABI compatibility
Integration — model_management	Auth (Bearer, whitespace tolerance, 401+WWW-Authenticate), model load/unload/reload (201/200/409/404/503), health semantics (ok/degraded/critical-thermal/no-models), input validation (400/422 on every field), full inference path (chat + completions via EchoBackend), embeddings, security response headers, metrics JSON keys, dashboard HTML, license GET/SET
Integration — orchestration	Worker register/heartbeat/eviction, dispatch (least-inflight, weighted round-robin, model-affinity, token-cost), queue admission and backpressure, reroute on 5xx, chaos (all workers fail → 503), overload (queue full → 429)
Integration — graceful_shutdown	In-flight request drains to completion before server exits

Every CI run posts a test summary to the GitHub Actions job summary page — see the Actions tab for per-run results.

Benchmarking

cargo run -p ax-serving-bench --release -- bench -m ./models/<model>.gguf

Other benchmark modes:

profile
mixed
cache-bench
soak
compare
regression-check
multi-worker

Repository Layout

crates/ax-serving-engine: backend abstraction, routing, model internals
crates/ax-serving-api: REST/gRPC serving, scheduler, orchestration
crates/ax-serving-cli: ax-serving and ax-serving-api binaries
crates/ax-serving-bench: benchmark and soak runners
crates/ax-serving-shim: C-compatible shim
crates/ax-serving-py: Python bindings
config/: serving and routing configuration
docs/: runbooks and architecture notes

Documentation

QUICKSTART.md
docs/market-positioning.md
docs/competitive-landscape.md
docs/icp-and-demand.md
docs/ax-code-integration.md
docs/contracts/ax-serving-public-contract-inventory.md
docs/maintainability-refactor-plan.md
docs/contracts/ax-fabric-runtime-contract.md
sdk/javascript/README.md (TypeScript SDK with Zod validation)
sdk/python/ (Python SDK)
docs/runbooks/multi-worker.md
docs/perf/service-tuning.md

Licensing

Open-source terms: AGPL v3 text (LICENSE) and licensing guide (LICENSING.md)
Commercial terms: commercial licensing summary (LICENSE-COMMERCIAL.md)
Issue reporting policy: CONTRIBUTING.md
