Kubernetes-native auto-scaling and load balancing for OpenDroneMap.
ScaleODM is a Kubernetes-native orchestration layer for OpenDroneMap, designed to automatically scale processing workloads using native Kubernetes primitives such as Jobs, Deployments, and Horizontal Pod Autoscalers.
It aims to provide the same API surface as NodeODM, while replacing both NodeODM and ClusterODM with a single, cloud-native control plane.
Note

ScaleODM has no authentication mechanism and should not be exposed publicly.
Instead, your frontend should connect to a backend, and the backend then uses
PyODM or similar to reach the ScaleODM instance on the internal network.
To federate multiple ScaleODM instances, establish a secure network mesh with a tool such as Tailscale.
- ClusterODM --> NodeODM --> ODM are all fantastic tools, well tested and with a big community behind them.
- However, running these tools inside a Kubernetes cluster poses a few challenges:
  - Scaling relies on provisioning or deprovisioning VMs, not container replicas.
  - Kubernetes-native scaling (Deployments, Jobs, KEDA) doesn't map neatly onto this model.
  - Data ingestion depends on `zip_url` or uploading via HTTP.
  - S3 integration covers outputs only, not input data. Ideally we need a data 'pull' approach instead of data 'push'.
  - Built-in file-based queues are not distributed or Kubernetes-aware.
Our initial goal was to deploy ClusterODM and NodeODM as-is inside Kubernetes, scaling NodeODM instances dynamically via KEDA.
ScaleODM was introduced as a lightweight queueing API, backed by PostgreSQL (`SKIP LOCKED`), acting as a mediator for job scheduling and scaling triggers.
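The `SKIP LOCKED` pattern lets many workers pull from a shared PostgreSQL queue without blocking each other: each worker atomically claims the oldest queued row, skipping rows already locked by a concurrent worker. A minimal sketch of such a claim query, assuming a hypothetical `jobs` table — the actual ScaleODM schema may differ:

```python
# Hypothetical claim query in the style of a PostgreSQL-backed job queue.
# Table and column names (jobs, id, status, created_at) are illustrative,
# not ScaleODM's actual schema.
CLAIM_JOB_SQL = """
UPDATE jobs
SET status = 'RUNNING'
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'QUEUED'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
"""

def claim_job(cursor):
    """Atomically claim the oldest QUEUED job and mark it RUNNING.

    Returns the claimed job id, or None if the queue is empty.
    Concurrent workers skip each other's locked rows instead of blocking,
    so no two workers can claim the same job.
    """
    cursor.execute(CLAIM_JOB_SQL)
    row = cursor.fetchone()
    return row[0] if row else None
```

The statuses used here (`QUEUED`, `RUNNING`) match the NodeODM-compatible statuses listed in the roadmap below.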
However, two main challenges emerged:
- NodeODM's internal queueing is file-based and not easily abstracted for distributed scaling.
- Data ingestion still required either HTTP uploads or `zip_url` packaging, adding unnecessary I/O overhead.
NodeODM wasn't really designed for ephemeral or autoscaled container environments, and that's fine.
Rethinking the architecture: instead of orchestrating NodeODM instances, it makes more sense to orchestrate ODM workloads directly, as Kubernetes Jobs or Argo Workflows.
Key concepts:
- NodeODM-compatible API: ScaleODM exposes the same REST endpoints as NodeODM, ensuring compatibility with existing tools (e.g. PyODM).
- Kubernetes Jobs: Each processing task is executed in an ephemeral container that can be distributed by the control plane as needed.
- S3-native workflow: Each job downloads inputs, performs processing, uploads outputs, and exits cleanly - no persistent volumes required. (i.e. jobs include the S3 params / credentials).
- Federation: ScaleODM instances can be federated across clusters, enabling global load balancing and community resource sharing.
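Because the API mirrors NodeODM, existing clients only need to build the familiar NodeODM endpoint paths. A minimal sketch, assuming NodeODM-style routes (`/task/<uuid>/info`, `/task/<uuid>/download/<asset>`) — verify against the actual ScaleODM routes before relying on them:

```python
from urllib.parse import urljoin

class ScaleODMClient:
    """Sketch of a NodeODM-style client pointed at a ScaleODM instance.

    Only builds endpoint URLs; a real client (e.g. PyODM) would issue the
    HTTP requests. Paths follow NodeODM's public REST API and are assumed
    to match ScaleODM's compatible surface.
    """

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/") + "/"

    def url(self, path: str) -> str:
        return urljoin(self.base_url, path.lstrip("/"))

    def task_info_url(self, uuid: str) -> str:
        return self.url(f"task/{uuid}/info")

    def download_url(self, uuid: str, asset: str = "all.zip") -> str:
        return self.url(f"task/{uuid}/download/{asset}")

# Usage: point at the internal-network instance (never a public one).
client = ScaleODMClient("http://scaleodm.internal:3000")
print(client.task_info_url("abc-123"))
# http://scaleodm.internal:3000/task/abc-123/info
```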
The decision to take this approach was not taken lightly, as we are strong supporters of contributing to existing open-source projects.
Long term, hopefully the ODM community can steward this project as an alternative processing API (with different requirements).
For more details, see the decisions section in this repo.
| Feature | Release |
|---|---|
| NodeODM-compatible API (submit, status, download) | v1 |
| Processing pipeline using Argo workflows + ODM containers | v1 |
| Using the same job statuses as NodeODM (QUEUED, RUNNING, FAILED, COMPLETED, CANCELED) | v1 |
| Env var config for API / pipeline | v1 |
| Pre-processing to determine the required resource usage for the workflow (CPU / RAM allocated) | v1 |
| Accept both zipped and unzipped imagery via S3 dir | v1 |
| Progress monitoring via API by hooking into the ODM container logs | v2 |
| Split-merge workflow | v2 |
| Accept GCPs as part of job submission | v2 |
| Federation of ScaleODM instances and task distribution | v3 |
| Webhook triggering - send a notification to an external system when complete | v3 |
| Post-processing of the final artifacts - capability present in NodeODM | v4 |
| Consider a load balancing service across all ScaleODM instances in DB | v4 |
| Adding extra missing things from NodeODM implementation, if required* | v4 |
*missing NodeODM functionality
- Exposing all of the config options possible in ODM.
- Multi-step project creation endpoints, with direct file upload.
Details to come once API is stabilised.
ScaleODM supports two modes for S3 access:
- Set `SCALEODM_S3_ACCESS_KEY` and `SCALEODM_S3_SECRET_KEY` environment variables.
- These credentials are passed directly to all workflow jobs.
- Note: This is less secure, as long-lived credentials are stored in the cluster.
For better security, use AWS STS to generate temporary credentials per job:
- Set environment variables:

  ```shell
  SCALEODM_S3_ACCESS_KEY=<your-iam-user-access-key>
  SCALEODM_S3_SECRET_KEY=<your-iam-user-secret-key>
  SCALEODM_S3_STS_ROLE_ARN=arn:aws:iam::ACCOUNT_ID:role/scaleodm-workflow-role
  SCALEODM_S3_STS_ENDPOINT= # Optional: defaults to https://sts.{region}.amazonaws.com
  ```
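A sketch of how a service might read this configuration, applying the documented STS endpoint default when the variable is unset. The `SCALEODM_S3_REGION` variable and the `us-east-1` fallback are assumptions for illustration, not documented ScaleODM settings:

```python
import os

def load_s3_config(env=None):
    """Read ScaleODM S3/STS settings from environment variables.

    SCALEODM_S3_STS_ENDPOINT defaults to the regional STS endpoint, as
    documented above. The region source is an assumption here.
    """
    env = os.environ if env is None else env
    region = env.get("SCALEODM_S3_REGION", "us-east-1")  # assumed variable
    return {
        "access_key": env["SCALEODM_S3_ACCESS_KEY"],
        "secret_key": env["SCALEODM_S3_SECRET_KEY"],
        "sts_role_arn": env.get("SCALEODM_S3_STS_ROLE_ARN"),
        # An empty value also falls back to the regional default.
        "sts_endpoint": env.get("SCALEODM_S3_STS_ENDPOINT")
            or f"https://sts.{region}.amazonaws.com",
    }
```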
- IAM user permissions (for the user specified in `SCALEODM_S3_ACCESS_KEY`):

  The IAM user must have permission to assume the STS role:

  ```json
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": "arn:aws:iam::ACCOUNT_ID:role/scaleodm-workflow-role"
      }
    ]
  }
  ```

  Important: The `Resource` must match the exact role ARN specified in `SCALEODM_S3_STS_ROLE_ARN`. Using `"Resource": "*"` is less secure, as it allows assuming any role.

- IAM role trust policy (for the role specified in `SCALEODM_S3_STS_ROLE_ARN`):

  The role must trust the IAM user:

  ```json
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": "arn:aws:iam::ACCOUNT_ID:user/your-scaleodm-user"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }
  ```

- IAM role permissions (for the role):

  The role must have permissions to read/write to your S3 buckets:

  ```json
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "s3:GetObject",
          "s3:PutObject",
          "s3:DeleteObject",
          "s3:ListBucket"
        ],
        "Resource": [
          "arn:aws:s3:::your-bucket-name/*",
          "arn:aws:s3:::your-bucket-name"
        ]
      }
    ]
  }
  ```
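The `Resource` mismatch noted above is easy to catch before deployment. A minimal illustrative check that a user policy grants `sts:AssumeRole` on exactly the configured role ARN; it handles only the simple policy shape shown here:

```python
import json

def allows_assume_role(policy_json: str, role_arn: str) -> bool:
    """Return True if the policy grants sts:AssumeRole on the given role ARN.

    Illustrative pre-deployment sanity check; does not evaluate wildcards,
    conditions, or Deny statements the way IAM does.
    """
    policy = json.loads(policy_json)
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        if "sts:AssumeRole" in actions and role_arn in resources:
            return True
    return False
```

Running this against the user policy with the value of `SCALEODM_S3_STS_ROLE_ARN` catches the most common misconfiguration (a user ARN or wrong role ARN in `Resource`) before the first job fails.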
How it works:
- When a job is submitted, ScaleODM uses the IAM user credentials to call `sts:AssumeRole` on the specified role.
- Temporary credentials (valid for 24 hours) are generated and injected into the workflow.
- Each workflow job uses these temporary credentials to access S3.
- Credentials expire automatically, reducing security risk.
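The steps above can be sketched as the parameters an AssumeRole call would receive (e.g. via boto3, `boto3.client("sts").assume_role(**params)`). The per-job session-name scheme is a hypothetical illustration, and the credential lifetime actually granted is capped by the role's `MaxSessionDuration`:

```python
def assume_role_params(role_arn: str, job_id: str,
                       duration_seconds: int = 3600) -> dict:
    """Build the parameters for an STS AssumeRole call for one job.

    The session name tags the temporary credentials with the job that
    requested them (naming scheme assumed, not ScaleODM's actual one),
    which makes CloudTrail entries attributable to individual jobs.
    """
    return {
        "RoleArn": role_arn,
        "RoleSessionName": f"scaleodm-job-{job_id}",
        "DurationSeconds": duration_seconds,
    }
```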
Troubleshooting:
If you see errors like:
```
User: arn:aws:iam::ACCOUNT_ID:user/your-user is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::ACCOUNT_ID:role/your-role
```
Check:
- The IAM user has `sts:AssumeRole` permission for the role ARN.
- The role's trust policy allows the IAM user to assume it.
- `SCALEODM_S3_STS_ROLE_ARN` is set to a role ARN (not a user ARN).
- Binary and container image distribution is automated on new release.
For local development and testing, ScaleODM uses a Talos Kubernetes cluster
created via talosctl cluster create. This provides a real Kubernetes
environment for testing Argo Workflows integration.
Quick start:
```shell
# Setup Talos cluster and start all services
just dev
```

This will:
- Create a local Talos Kubernetes cluster
- Install Argo Workflows
- Start PostgreSQL, MinIO, and the ScaleODM API
Manual setup:
```shell
# 1. Setup Talos cluster (one-time)
just test-cluster-init

# 2. Start compose services
just start
```

Testing workflow:
```shell
just test-cluster-init    # Setup cluster
just test                 # Run tests
just test-cluster-destroy # Clean up
```

See compose.README.md for detailed setup instructions.
Prerequisites:
- `talosctl` installed (installation guide)
- Docker running
- At least 8GB free memory
The test suite depends on a database and Kubernetes cluster:
```shell
# With Talos cluster already running
just test

# Or manually
docker compose run --rm api go test -timeout=2m -v ./...
```