Keep a freshly vended spoke from deadlocking on IPs, arch, and Karpenter #84

Merged: stxkxs merged 1 commit into main from dev-spoke-deadlock-infra-fixes on Jun 30, 2026

Conversation

@stxkxs stxkxs commented Jun 30, 2026

Member

Symptom

A freshly vended dev spoke (2× m7g.large bootstrap nodes) syncing the full addon catalog deadlocked:

  • pods stuck in ContainerCreating with "cilium-cni: no IPs available"
  • Karpenter, stranded on a saturated bootstrap node with no DNS, couldn't reach the EC2 API to launch the nodes that would have relieved the pressure.

Three root causes, each fixed

Cilium ENI IP cap. ENI mode hands out single secondary IPs, capping m7g.large at ~35 pods — below what the catalog needs on the bootstrap nodes before Karpenter scales out. eni.awsEnablePrefixDelegation: true gives each ENI /28 prefixes instead, lifting the cap ~4x (~110) and removing the IP-exhaustion that also starved CoreDNS and Karpenter.
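
As a rough illustration of the change described above, the Cilium Helm values might look like the sketch below. Only `eni.awsEnablePrefixDelegation` is named in this PR; the `ipam.mode` and `eni.enabled` keys are assumptions about how the spoke already runs Cilium in ENI mode, and the actual values file is not shown here.

```yaml
# Cilium Helm values (sketch, not the PR's actual file).
ipam:
  mode: eni                        # assumption: ENI IPAM is already in use
eni:
  enabled: true                    # assumption: already set for ENI mode
  awsEnablePrefixDelegation: true  # the change: each ENI slot holds a /28 prefix instead of a single IP
```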

NodePool architecture. Both the default and sandbox NodePools required kubernetes.io/arch In [amd64], but Graviton/arm64 is the org default — the bootstrap nodes are m7g and the agent/sandbox images are arm64. An amd64 node provisioned here would exec-format-crash the arm64 pods scheduled onto it. Both pools now pin arm64.
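
A minimal sketch of the requirement change, shown as a NodePool fragment. The surrounding fields (apiVersion, metadata, nodeClassRef, other requirements) are omitted because the PR does not show them; only the key, operator, and values come from the description above.

```yaml
# Karpenter NodePool fragment (sketch); applies to both the default and sandbox pools.
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]   # previously ["amd64"]
```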

Karpenter priority. Karpenter is the only thing that can relieve a saturated cluster, so it must never be the pod that gets evicted or stranded. The controller now carries priorityClassName: system-cluster-critical.
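
For illustration, the field ends up on the controller's pod spec roughly as in the fragment below. Whether it is set through a Helm value or a kustomize patch is not shown in this PR, so the enclosing Deployment structure here is an assumption.

```yaml
# Karpenter controller Deployment fragment (sketch).
spec:
  template:
    spec:
      # Built-in Kubernetes PriorityClass; the scheduler can preempt
      # lower-priority pods to make room for pods that carry it.
      priorityClassName: system-cluster-critical
```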

Scope

Prefix delegation removes the IP-exhaustion root cause directly. A separate smoke/full addon profile, which would skip the heavy optional catalog on dev spokes entirely, along with the appset split and landing-zone label it needs, is tracked in #83.

task validate is green (YAML lint passes, all kustomize overlays build, and both NodePools resolve to arm64).

Closes #82

https://claude.ai/code/session_01R6rXpE1FZAVS14zanDdgb7

A freshly vended dev spoke (2× m7g.large bootstrap nodes) syncing the
full addon catalog deadlocked: pods stuck ContainerCreating with
"cilium-cni: no IPs available", and Karpenter — stranded on a saturated
bootstrap node with no DNS — couldn't reach the EC2 API to launch the
nodes that would have relieved the pressure.

Three independent root causes, each fixed here:

Cilium ENI IP cap. ENI mode hands out single secondary IPs, capping
m7g.large at ~35 pods — below what the catalog needs on the bootstrap
nodes before Karpenter scales out. Enable awsEnablePrefixDelegation so
each ENI carries /28 prefixes instead, lifting the cap ~4x (~110) and
removing the IP-exhaustion that also starved CoreDNS and Karpenter.

NodePool architecture. Both the default and sandbox NodePools required
kubernetes.io/arch In [amd64], but Graviton/arm64 is the org default —
the bootstrap nodes are m7g and the agent/sandbox images are arm64. An
amd64 node provisioned here would exec-format-crash the arm64 pods
scheduled onto it. Pin both pools to arm64.

Karpenter priority. Karpenter is the only thing that can relieve a
saturated cluster, so it must never be the pod that gets evicted or
stranded. Give the controller priorityClassName system-cluster-critical
so the scheduler preempts lower-priority pods to keep it running.

Prefix-delegation removes the IP-exhaustion root cause directly; a
separate smoke/full addon profile (skip the heavy optional catalog on
dev spokes entirely) is tracked as a follow-up.

Claude-Session: https://claude.ai/code/session_01R6rXpE1FZAVS14zanDdgb7
@github-actions


CI Results

| Check | Status |
| --- | --- |
| YAML Lint | ✅ success |
| Render + assert (all environments) | ✅ success |

All checks passed.

@stxkxs stxkxs merged commit 55be13c into main Jun 30, 2026
9 checks passed
@stxkxs stxkxs deleted the dev-spoke-deadlock-infra-fixes branch June 30, 2026 22:17

Development

Successfully merging this pull request may close these issues.

Full addon catalog overwhelms small/dev clusters → IP-exhaustion + DNS/Karpenter deadlock