Armada is a multi-Kubernetes cluster batch job scheduler.
Armada is designed to address the following issues:
- A single Kubernetes cluster cannot be scaled indefinitely, and managing very large Kubernetes clusters is challenging. Hence, Armada is a multi-cluster scheduler built on top of several Kubernetes clusters.
- Achieving very high throughput using the in-cluster storage backend, etcd, is challenging. Hence, queueing and scheduling are performed partly out-of-cluster using a specialized storage layer.
Armada is designed primarily for machine learning, AI, and data analytics workloads. Its design goals are to:
- Manage compute clusters composed of tens of thousands of nodes in total.
- Schedule a thousand or more pods per second, on average.
- Enqueue tens of thousands of jobs over a few seconds.
- Divide resources fairly between users.
- Provide visibility for users and admins.
- Ensure near-constant uptime.
Armada is a CNCF Sandbox project used in production at G-Research.
For an overview of Armada, see these videos:
- Armada - high-throughput batch scheduling
- Building Armada - Running Batch Jobs at Massive Scale on Kubernetes
Armada adheres to the CNCF Code of Conduct.
For an overview of the architecture and design of Armada, and instructions for submitting jobs, see:
There are two methods of setting Armada up for local development:
For API reference, see:
We expect readers of the documentation to have a basic understanding of Docker and Kubernetes; see, e.g., the following links: