Full addon catalog overwhelms small/dev clusters → IP-exhaustion + DNS/Karpenter deadlock #82

Description

@stxkxs

Symptom

A freshly vended dev spoke (2× m7g.large bootstrap nodes) syncing the full ai-platform + addons catalog (argocd, cert-manager, external-secrets, falco, trivy, gpu-operator, grafana/prometheus, argo-events/rollouts/workflows, neuron, …) oversubscribes the bootstrap nodes before Karpenter can scale out:

  • Cilium ENI-mode per-node IP limit (~35 on m7g.large) is exceeded → pods stuck ContainerCreating with cilium-cni: no IPs available.
  • Karpenter, itself scheduled on a saturated bootstrap node, loses DNS and can't resolve the EC2 API (lookup ec2.us-west-2.amazonaws.com: i/o timeout) → can't launch nodes → can't relieve the pressure. Deadlock.
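The IP pressure can be checked with back-of-the-envelope arithmetic. The sketch below assumes the figures stated above (2 bootstrap nodes, roughly 35 pod IPs each); the function name `ip_headroom` and the pod count in the usage line are hypothetical, not measurements from the affected cluster.

```python
def ip_headroom(pods_needing_ip: int, nodes: int = 2, ips_per_node: int = 35) -> int:
    """Return spare pod IPs on the bootstrap nodes.

    A negative result means the nodes are oversubscribed and some pods
    will wait for an IP. Defaults mirror the issue: 2 nodes, ~35 IPs each.
    Pods using hostNetwork do not consume a pod IP and should be excluded
    from pods_needing_ip by the caller.
    """
    return nodes * ips_per_node - pods_needing_ip


# Hypothetical example: a catalog that schedules 90 IP-consuming pods
# before Karpenter has added any capacity.
shortfall = ip_headroom(90)
print(shortfall)
```

Any negative headroom at bootstrap time means the catalog depends on Karpenter scaling out first, which is exactly the dependency that deadlocks here.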

Fix options

  • Add a dev/smoke addon profile that installs a minimal set (skip falco/trivy/gpu-operator/neuron/grafana-prometheus for non-prod), and/or
  • Size the bootstrap node group for the catalog (more/bigger nodes, or prefix delegation for more IPs per node), and/or
  • Give Karpenter a priorityClass and pod anti-affinity so it isn't stranded on a saturated bootstrap node.
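The first option could look roughly like the filter below. This is only a sketch: the addon keys are taken from the names mentioned in this issue (the real catalog is longer), and `DEV_SKIP`, `select_addons`, and the profile names are hypothetical, not existing code in the repo.

```python
# Addons named in this issue; the actual catalog contains more entries.
CATALOG = [
    "argocd", "cert-manager", "external-secrets", "falco", "trivy",
    "gpu-operator", "grafana-prometheus", "argo-events", "argo-rollouts",
    "argo-workflows", "neuron",
]

# Heavy addons proposed to be skipped for non-prod clusters.
DEV_SKIP = {"falco", "trivy", "gpu-operator", "neuron", "grafana-prometheus"}


def select_addons(catalog: list[str], profile: str) -> list[str]:
    """Return the addons to install for a profile, preserving catalog order.

    'prod' gets the full catalog; any other profile (e.g. 'dev', 'smoke')
    gets the catalog minus DEV_SKIP.
    """
    if profile == "prod":
        return list(catalog)
    return [name for name in catalog if name not in DEV_SKIP]


dev_addons = select_addons(CATALOG, "dev")
print(dev_addons)
```

Keeping the skip list as data rather than per-addon flags makes it easy to review which components a dev spoke gives up.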

Evidence

Observed live on a fresh dev spoke during an agent-platform validation; worked around by cordoning the saturated nodes and running the target workload on a Karpenter node.
