Argo Workflows-based lifecycle management for Kata Containers.
This chart installs a namespace-scoped WorkflowTemplate that performs controlled,
node-by-node upgrades of kata-deploy with verification and automatic rollback on failure.
## Prerequisites

- Kubernetes cluster with kata-deploy installed via Helm (chart 3.27.0 or higher required; the workflow uses `helm upgrade --install` and relies on the kata-deploy chart to set the DaemonSet `updateStrategy.type=OnDelete`)
- Argo Workflows v3.4+ installed (not Argo CD)
- `helm` CLI and `argo` CLI (the Argo Workflows CLI, not `argocd`)
- Verification pod spec (see Verification Pod)
## Installation

Install the chart from the OCI registry (published on GitHub Releases):

```sh
# Install latest (or pin a version with --version $version)
helm install kata-lifecycle-manager oci://ghcr.io/kata-containers/kata-lifecycle-manager-charts/kata-lifecycle-manager \
  --namespace argo
```

For development from a local clone:

```sh
helm install kata-lifecycle-manager . --namespace argo
```
## Verification Pod

A verification pod is required to validate each node after upgrade. The chart will fail to install without one.

Provide the verification pod when installing the chart:

```sh
helm install kata-lifecycle-manager oci://ghcr.io/kata-containers/kata-lifecycle-manager-charts/kata-lifecycle-manager \
  --namespace argo \
  --set-file defaults.verificationPod=./my-verification-pod.yaml
```

This verification pod is baked into the WorkflowTemplate and used for all upgrades.

For a one-off override on a specific upgrade run, pass the pod spec base64-encoded, because Argo workflow parameters don't handle multi-line YAML reliably:

```sh
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0 \
  -p verification-pod="$(base64 -w0 < ./my-verification-pod.yaml)"
```

Note: During `helm upgrade`, kata-deploy's own verification is disabled (`--set verification.pod=""`), because kata-deploy's verification is cluster-wide (designed for initial install), while kata-lifecycle-manager performs per-node verification with proper placeholder substitution.
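For intuition, per-node placeholder substitution amounts to plain text replacement before the pod is applied. The sketch below uses a hypothetical `render_pod` helper (illustrative only, not the actual workflow code):

```sh
#!/bin/sh
# Hypothetical sketch of per-node placeholder substitution; the real
# workflow step may differ. render_pod is illustrative, not chart code.
render_pod() {
  node="$1"; test_pod="$2"
  # Replace ${NODE} and ${TEST_POD} literally. sed's '|' delimiter avoids
  # clashing with '/' characters that can appear in the values.
  sed -e "s|\${NODE}|${node}|g" -e "s|\${TEST_POD}|${test_pod}|g"
}

spec='metadata:
  name: ${TEST_POD}
spec:
  nodeSelector:
    kubernetes.io/hostname: ${NODE}'

# Render the spec for one node; the workflow would pipe this to kubectl apply.
printf '%s\n' "$spec" | render_pod worker-1 kata-verify-worker-1-x7k2q
```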
Create a pod spec that validates your Kata deployment. The pod should exit 0 on success and non-zero on failure.

Example (`my-verification-pod.yaml`):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ${TEST_POD}
spec:
  runtimeClassName: kata-qemu
  restartPolicy: Never
  nodeSelector:
    kubernetes.io/hostname: ${NODE}
  tolerations:
    - operator: Exists
  containers:
    - name: verify
      image: quay.io/kata-containers/alpine-bash-curl:latest
      command:
        - sh
        - -c
        - |
          echo "=== Kata Verification ==="
          echo "Node: ${NODE}"
          echo "Kernel: $(uname -r)"
          echo "SUCCESS: Pod running with Kata runtime"
```

| Placeholder | Description |
|---|---|
| `${NODE}` | Node hostname being upgraded/verified |
| `${TEST_POD}` | Generated unique pod name |
You are responsible for:

- Setting the `runtimeClassName` in your pod spec
- Defining the verification logic in your container
- Using the exit code to indicate success (0) or failure (non-zero)

Failure modes detected:

- Pod stuck in `Pending`/`ContainerCreating` (runtime can't start the VM)
- Pod crashes immediately (containerd/CRI-O configuration issues)
- Pod times out (resource issues, image pull failures)
- Pod exits with non-zero code (verification logic failed)

All of these trigger automatic rollback.
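All four failure modes collapse into one rule: unless the verification pod completes with a clean zero exit, the node is treated as failed. A minimal sketch of that verdict logic (illustrative only, not the workflow's actual implementation):

```sh
#!/bin/sh
# Illustrative verdict rule: only a pod that completed with exit code 0
# passes; stuck (Pending/ContainerCreating), crashed, timed-out, or
# non-zero-exit pods all fail and would trigger rollback.
verdict() {
  phase="$1"      # final pod phase observed (or last phase at timeout)
  exit_code="$2"  # container exit code, empty if the pod never ran
  if [ "$phase" = "Succeeded" ] && [ "$exit_code" = "0" ]; then
    echo pass
  else
    echo fail
  fi
}

verdict Succeeded 0    # prints "pass"
verdict Pending ""     # prints "fail" (runtime could not start the VM)
verdict Failed 1       # prints "fail" (verification logic returned non-zero)
```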
## Node Selection

Nodes can be selected using labels, taints, or both.

Option A: Label-based selection (default)

```sh
# Label nodes for upgrade
kubectl label node worker-1 katacontainers.io/kata-lifecycle-manager-window=true

# Trigger upgrade
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0 \
  -p node-selector="katacontainers.io/kata-lifecycle-manager-window=true"
```

Option B: Taint-based selection

```sh
# Taint nodes for upgrade
kubectl taint nodes worker-1 kata-lifecycle-manager=pending:NoSchedule

# Trigger upgrade using taint selector
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0 \
  -p node-taint-key=kata-lifecycle-manager \
  -p node-taint-value=pending
```

Option C: Combined selection

```sh
# Use both labels and taints for precise targeting
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0 \
  -p node-selector="node-pool=kata-pool" \
  -p node-taint-key=kata-lifecycle-manager
```

## Running an Upgrade

```sh
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0

# Watch progress
argo watch @latest
```

Nodes are upgraded sequentially (one at a time) to ensure fleet consistency. If any node fails verification, the workflow stops immediately and that node is rolled back. This prevents ending up with a mixed fleet where some nodes run the new version and others the old.
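The fail-fast sequencing described above can be sketched as a loop over the selected nodes, where the stub functions stand in for the real workflow steps (illustrative only, not the workflow's implementation):

```sh
#!/bin/sh
# Sketch of fail-fast sequencing: one node at a time, stop on first failure.
# upgrade_node/verify_node/rollback_node are stand-ins for workflow steps.
upgrade_node()  { echo "upgrading $1"; }
rollback_node() { echo "rolling back $1"; }
verify_node()   { [ "$1" != "worker-3" ]; }   # simulate a failure on worker-3

for node in worker-1 worker-2 worker-3 worker-4; do
  upgrade_node "$node"
  if verify_node "$node"; then
    echo "$node verified"
  else
    rollback_node "$node"
    break   # stop immediately: worker-4 is never touched
  fi
done
```

With this ordering, every node before the failure runs the new, verified version, the failed node is rolled back, and every node after it still runs the old version untouched.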
## Configuration

| Parameter | Description | Default |
|---|---|---|
| `argoNamespace` | Namespace for Argo resources | `argo` |
| `defaults.helmRelease` | kata-deploy Helm release name | `kata-deploy` |
| `defaults.helmNamespace` | kata-deploy namespace | `kube-system` |
| `defaults.nodeSelector` | Node label selector (optional if using taints) | `""` |
| `defaults.nodeTaintKey` | Taint key for node selection | `""` |
| `defaults.nodeTaintValue` | Taint value filter (optional) | `""` |
| `defaults.verificationNamespace` | Namespace for verification pods | `default` |
| `defaults.verificationPod` | Pod YAML for verification (required) | `""` |
| `defaults.drainEnabled` | Enable node drain before upgrade | `false` |
| `defaults.drainTimeout` | Timeout for drain operation | `300s` |
| `defaults.helmSetValues` | Extra `--set` values for `helm upgrade` (see Custom Image) | `""` |
| `images.utils` | Image with Helm 4 and kubectl (multi-arch) | `ghcr.io/kata-containers/lifecycle-manager-utils:latest` |
When submitting a workflow, you can override:

| Parameter | Description |
|---|---|
| `target-version` | Required - Target Kata version |
| `helm-release` | Helm release name |
| `helm-namespace` | Namespace of kata-deploy |
| `node-selector` | Label selector for nodes |
| `node-taint-key` | Taint key for node selection |
| `node-taint-value` | Taint value filter |
| `verification-namespace` | Namespace for verification pods |
| `verification-pod` | Pod YAML with placeholders |
| `drain-enabled` | Whether to drain nodes before upgrade |
| `drain-timeout` | Timeout for drain operation |
| `helm-set-values` | Extra `--set` values for `helm upgrade` (see Custom Image) |
## Upgrade Process

For each selected node (via `node-selector` and/or taints):

1. Prepare: Annotate the node with deploy status
2. Cordon: Mark the node as unschedulable
3. Drain (optional): Evict pods if `drain-enabled=true`
4. Helm Upgrade: Run `helm upgrade --install` with `updateStrategy.type=OnDelete` (kata-deploy chart 3.27.0+ applies this)
   - This updates the DaemonSet spec but does NOT restart pods automatically
5. Trigger Pod Restart: Delete the kata-deploy pod on THIS node only
   - This triggers recreation with the new image on just this node
6. Wait: Wait for the new kata-deploy pod to be ready
7. Verify: Run the verification pod and check its exit code
8. On Success: Uncordon the node, proceed to the next node
9. On Failure: Automatic rollback (`helm rollback` + pod restart), uncordon, workflow stops

True node-by-node control: by using `updateStrategy: OnDelete`, the workflow ensures that only the current node's pod restarts. Other nodes continue running the previous version until explicitly upgraded.

Nodes are processed sequentially (one at a time). If verification fails on any node, the workflow stops immediately, preventing a mixed-version fleet.
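The "Trigger Pod Restart" step can be pictured as deleting only the DaemonSet pod scheduled on the node being upgraded. The sketch below only prints the command it would run; the `name=kata-deploy` pod label and the namespace are assumptions for illustration, not taken from the chart:

```sh
#!/bin/sh
# Sketch of the per-node restart under OnDelete: deleting only this node's
# kata-deploy pod makes the DaemonSet controller recreate it with the new
# image. The pod label (name=kata-deploy) and namespace are assumptions.
restart_cmd() {
  ns="$1"; node="$2"
  printf 'kubectl delete pod -n %s -l name=kata-deploy --field-selector spec.nodeName=%s\n' \
    "$ns" "$node"
}

# Print the command; the workflow would execute it against the cluster.
restart_cmd kube-system worker-1
```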
## Node Draining

Default (drain disabled): Drain is not required for Kata upgrades. Running Kata VMs continue using the in-memory binaries. Only new workloads use the upgraded binaries.

Optional drain: Enable drain if you prefer to evict all workloads before any maintenance operation, or if your organization's operational policies require it:

```sh
# Enable drain when installing the chart
helm install kata-lifecycle-manager oci://ghcr.io/kata-containers/kata-lifecycle-manager-charts/kata-lifecycle-manager \
  --namespace argo \
  --set defaults.drainEnabled=true \
  --set defaults.drainTimeout=600s \
  --set-file defaults.verificationPod=./my-verification-pod.yaml

# Or override at workflow submission time
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0 \
  -p drain-enabled=true \
  -p drain-timeout=600s
```

## Custom Image

By default, the workflow upgrades kata-deploy using the official chart images for the specified `target-version`. To deploy from a custom image (e.g., your own registry or a custom build), pass extra `--set` values that override the kata-deploy chart's image settings.

At workflow submission (one-off):

```sh
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0 \
  -p helm-set-values="image.repository=myregistry.io/kata-deploy,image.tag=my-custom-tag"
```

Baked into the chart (persistent default):

```sh
helm install kata-lifecycle-manager oci://ghcr.io/kata-containers/kata-lifecycle-manager-charts/kata-lifecycle-manager \
  --namespace argo \
  --set-file defaults.verificationPod=./my-verification-pod.yaml \
  --set 'defaults.helmSetValues=image.repository=myregistry.io/kata-deploy\,image.tag=my-custom-tag'
```

The value uses standard Helm `--set` comma-separated syntax (`key1=val1,key2=val2`). Any kata-deploy chart value can be overridden this way, not just the image.
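Conceptually, the comma-separated string decomposes into individual `key=value` overrides. A toy illustration of that split (Helm's real `--set` parser is far more complete, handling escaped commas, lists, and nested keys):

```sh
#!/bin/sh
# Toy illustration only: split a --set string on commas into individual
# key=value overrides. Helm's real parser also handles escaped commas
# (key=a\,b), list indices, and nested keys.
split_set_values() {
  printf '%s\n' "$1" | tr ',' '\n'
}

split_set_values "image.repository=myregistry.io/kata-deploy,image.tag=my-custom-tag"
# prints:
#   image.repository=myregistry.io/kata-deploy
#   image.tag=my-custom-tag
```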
## Rollback

Automatic rollback on verification failure: If the verification pod fails (non-zero exit), kata-lifecycle-manager automatically:

- Runs `helm rollback` to revert to the previous Helm release
- Waits for the kata-deploy DaemonSet to be ready with the previous version
- Uncordons the node
- Annotates the node with `rolled-back` status

This ensures nodes are never left in a broken state.

Manual rollback: For cases where you need to roll back a successfully upgraded node:

```sh
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  --entrypoint rollback-node \
  -p node-name=worker-1
```

## Monitoring

Check node annotations to monitor upgrade progress:
```sh
kubectl get nodes \
  -L katacontainers.io/kata-lifecycle-manager-status \
  -L katacontainers.io/kata-current-version
```

| Annotation | Description |
|---|---|
| `katacontainers.io/kata-lifecycle-manager-status` | Current upgrade phase |
| `katacontainers.io/kata-current-version` | Version after successful upgrade |

Status values:

- `preparing` - Upgrade starting
- `cordoned` - Node marked unschedulable
- `draining` - Draining pods (only if `drain-enabled=true`)
- `upgrading` - Helm upgrade in progress
- `verifying` - Verification pod running
- `completed` - Upgrade successful
- `rolling-back` - Rollback in progress (automatic on verification failure)
- `rolled-back` - Rollback completed
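To spot nodes that have an upgrade status but aren't `completed` yet, the status column can be filtered. The sketch below runs an awk filter over simplified two-column sample output (real `kubectl get nodes -L ...` output includes extra columns such as STATUS and AGE, so the field index would differ):

```sh
#!/bin/sh
# Filter nodes whose lifecycle-manager status is set but not "completed".
# The here-doc is simplified sample output standing in for kubectl's.
pending_nodes() { awk 'NR > 1 && $2 != "" && $2 != "completed" { print $1, $2 }'; }

pending_nodes <<'EOF'
NAME      KATA-LIFECYCLE-MANAGER-STATUS
worker-1  completed
worker-2  verifying
worker-3  rolled-back
EOF
```

This prints `worker-2 verifying` and `worker-3 rolled-back`, the two nodes still in flight or rolled back.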
## How It Works

The workflow uses `updateStrategy.type=OnDelete` to achieve true node-by-node control:

- Helm upgrade updates the DaemonSet spec, but pods don't restart automatically
- The workflow explicitly deletes the kata-deploy pod on the current node
- Kubernetes recreates the pod with the new image on just that node
- Other nodes continue running the previous version until their turn

This ensures that if verification fails on Node B, Node A is still running the new version (verified working) while the workflow stops. No automatic cluster-wide rollback occurs unless explicitly triggered.

Rollback behavior:

- On verification failure, `helm rollback` reverts the DaemonSet spec
- The pod on the failed node is deleted so it restarts with the previous version
- Already-verified nodes continue running the new version (their pods weren't touched)
## Using with kata-deploy

Any project that uses the kata-deploy Helm chart can install this companion chart to get upgrade orchestration:

```sh
# Install kata-deploy
helm install kata-deploy oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy \
  --namespace kube-system

# Install kata-lifecycle-manager from the published chart (see GitHub Releases)
helm install kata-lifecycle-manager oci://ghcr.io/kata-containers/kata-lifecycle-manager-charts/kata-lifecycle-manager \
  --namespace argo \
  --set-file defaults.verificationPod=./my-verification-pod.yaml

# Trigger upgrade
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.27.0
```

Note: `target-version` must be 3.27.0 or higher; the workflow will fail its prerequisite checks otherwise.
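The 3.27.0 floor can be sanity-checked locally before submitting. A sketch using `sort -V` for version ordering (the workflow's actual prerequisite check may be implemented differently):

```sh
#!/bin/sh
# Sketch of the target-version prerequisite: accept 3.27.0 and above.
# Uses sort's -V (version sort); the real workflow check may differ.
version_ok() {
  [ "$(printf '%s\n' 3.27.0 "$1" | sort -V | head -n 1)" = "3.27.0" ]
}

version_ok 3.27.0 && echo "accepted"
version_ok 3.26.0 || echo "rejected: below 3.27.0"
```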
## Testing from a Fork

Workflows use the repository owner for GHCR paths, so you can test from a fork (e.g. `fidencio/kata-lifecycle-manager`):

1. Build the workflow image: in your fork, go to Actions → "Build workflow image" → "Run workflow". This pushes `ghcr.io/<your-username>/lifecycle-manager-utils:latest`.
2. Release the chart (optional): Actions → "Release Helm Chart" → "Run workflow", set a version (e.g. `0.1.0-dev`). This pushes the chart to `ghcr.io/<your-username>/kata-lifecycle-manager-charts`.
3. Install from your fork:

   ```sh
   helm install kata-lifecycle-manager \
     oci://ghcr.io/<your-username>/kata-lifecycle-manager-charts/kata-lifecycle-manager \
     --version 0.1.0-dev \
     --set-file defaults.verificationPod=./verification-pod.yaml \
     --set images.utils=ghcr.io/<your-username>/lifecycle-manager-utils:latest \
     --namespace argo
   ```
- Design Document - Architecture and design decisions
## License

Apache License 2.0