Symptom
A freshly vended dev spoke (2× m7g.large bootstrap nodes) syncing the full ai-platform + addons catalog (argocd, cert-manager, external-secrets, falco, trivy, gpu-operator, grafana/prometheus, argo-events/rollouts/workflows, neuron, …) oversubscribes the bootstrap nodes before Karpenter can scale out:
- Cilium ENI-mode per-node IP limit (~35 on m7g.large) is exceeded → pods stuck in ContainerCreating with cilium-cni: no IPs available.
- Karpenter, itself scheduled on a saturated bootstrap node, loses DNS and can't resolve the EC2 API (lookup ec2.us-west-2.amazonaws.com: i/o timeout) → can't launch nodes → can't relieve the pressure. Deadlock.
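A rough way to see the oversubscription is to compare the pod-IP budget of the bootstrap nodes against the number of pods the catalog wants to schedule. The sketch below is illustrative only: the node count and the ~35 IPs/node figure come from this report, while `ip_budget` and the pod count of 90 are hypothetical and not measured from the spoke.

```python
import math


def ip_budget(nodes: int, ips_per_node: int, pods_needed: int) -> dict:
    """Compare available pod IPs with the pods that need one.

    pods_needed should count only pods that consume a pod IP
    (hostNetwork pods share the node's IP and do not).
    """
    capacity = nodes * ips_per_node
    shortfall = max(0, pods_needed - capacity)
    # Extra nodes of the same size needed to absorb the shortfall.
    extra_nodes = math.ceil(shortfall / ips_per_node)
    return {
        "capacity": capacity,
        "shortfall": shortfall,
        "extra_nodes": extra_nodes,
    }


# 2 bootstrap nodes at ~35 IPs each (figures from this report).
# 90 pods is an assumed catalog size, for illustration only.
budget = ip_budget(nodes=2, ips_per_node=35, pods_needed=90)
print(budget)  # {'capacity': 70, 'shortfall': 20, 'extra_nodes': 1}
```

Any shortfall greater than zero means some pods cannot get an IP until another node joins, which is exactly the step that is blocked when Karpenter itself is stranded.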
Fix options
- A dev/smoke addon profile that installs a minimal set (skip falco/trivy/gpu-operator/neuron/grafana-prometheus for non-prod), and/or
- Size the bootstrap node group for the catalog (more/bigger nodes, or prefix-delegation for more IPs/node), and/or
- Give Karpenter a priorityClass + anti-affinity so it isn't stranded on a saturated bootstrap node.
Evidence
Live-observed on a fresh dev spoke during an agent-platform validation; worked around by cordoning the saturated nodes and running the target workload on a Karpenter node.