CCM and K3S/Cilium CIDR mismatch #915

Closed
@pellepelster

Description

I am not 100% sure this is a bug, but I observed some weird behavior and cannot really pinpoint it to the CCM, Cilium, or the K3s cluster they are both running in.

Setup

A K3s cluster (v1.30.2+k3s2) with Cilium (1.17.3) as the network layer and the Hetzner CCM (1.24.0). Networking is configured exactly as described in https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/main/docs/deploy_with_networks.md, and native routing mode is active.
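
For completeness, this is roughly how I sanity-check that the route-controller side of the CCM is wired up. The deployment name and the exact flags below are assumptions based on the manifest from deploy_with_networks.md and may differ in other setups:

# Sketch only: the deployment name is assumed to be the default from the
# deploy_with_networks.md manifest.
kubectl -n kube-system get deployment hcloud-cloud-controller-manager -o yaml \
  | grep -E 'allocate-node-cidrs|cluster-cidr'
# I expect route management to be enabled here, i.e. something along the lines of
# --allocate-node-cidrs=true plus a --cluster-cidr that covers the
# clusterPoolIPv4PodCIDRList used by Cilium below (10.0.16.0/20).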

cilium values.yml

operator:
  replicas: 2
routingMode: native
k8sServiceHost: prod-k3s.xxx.de
k8sServicePort: 6443
ipv4NativeRoutingCIDR: 10.0.0.0/8
ipam:
  operator:
    clusterPoolIPv4PodCIDRList: 10.0.16.0/20
autoDirectNodeRoutes: true
kubeProxyReplacement: true
directRoutingSkipUnreachable: true
nodePort:
  enabled: true # https://docs.cilium.io/en/latest/network/servicemesh/ingress/#prerequisites
ingressController:
  enabled: true
  loadbalancerMode: shared
  enableProxyProtocol: true
  hostNetwork:
    enabled: true # https://docs.cilium.io/en/latest/network/servicemesh/ingress/#gs-ingress-host-network-mode
    sharedListenerPort: 8080
  externalTrafficPolicy: Local
  service:
    externalTrafficPolicy: null
    type: ClusterIP
  loadBalancer:
    l7:
      backend: envoy

Issue

The cluster was working fine and networking was behaving as expected. During a scale-up to add another node, Cilium suddenly started complaining that cluster health was degraded:

cilium-dbg status

[...]
Cluster health:          5/6 reachable
[...]

A verbose status query showed that, although host connectivity to the node itself was fine, the endpoint within the node's pod CIDR was not reachable:

cilium-dbg status --verbose

Name              IP              Node   Endpoints
  prod-xxx-agent-0 (localhost):
    Host connectivity to 10.0.1.10:
      ICMP to stack:   OK, RTT=124.186µs
      HTTP to agent:   OK, RTT=357.775µs
    Endpoint connectivity to 10.0.21.142:
      ICMP to stack:   OK, RTT=309.803µs
      HTTP to agent:   OK, RTT=310.313µs
  prod-xxx-k3s-agent-1:
    Host connectivity to 10.0.1.11:
      ICMP to stack:   OK, RTT=2.2799ms
      HTTP to agent:   OK, RTT=852.765µs
    Endpoint connectivity to 10.0.22.3:
      ICMP to stack:   OK, RTT=2.359048ms
      HTTP to agent:   OK, RTT=980.736µs
  prod-xxx-k3s-agent-2:
    Host connectivity to 10.0.1.12:
      ICMP to stack:   OK, RTT=3.352002ms
      HTTP to agent:   OK, RTT=933.567µs
    Endpoint connectivity to 10.0.25.241:
      ICMP to stack:   ERROR (exact message not recorded)
      HTTP to agent:   ERROR (exact message not recorded)

[...]
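
To get a quick overview of which pod CIDR and health IP Cilium has assigned to each node, the following works for me (the field paths are the ones from the CiliumNode object shown further down):

# Lists every node with the pod CIDR(s) and health endpoint IP from Cilium's point of view.
kubectl get ciliumnodes -o custom-columns='NAME:.metadata.name,PODCIDRS:.spec.ipam.podCIDRs[*],HEALTH:.spec.health.ipv4'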

Looking at the log of the cloud controller, I can see there is a mismatch in the node CIDRs:

I0428 09:24:29.280171       1 route_controller.go:214] action for Node "prod-xxx-k3s-agent-0" with CIDR "10.0.21.0/24": "keep"
I0428 09:24:29.280211       1 route_controller.go:214] action for Node "prod-xxx-k3s-agent-1" with CIDR "10.0.22.0/24": "keep"
I0428 09:24:29.280222       1 route_controller.go:214] action for Node "prod-xxx-k3s-agent-2" with CIDR "10.0.19.0/24": "keep"
I0428 09:24:29.280232       1 route_controller.go:214] action for Node "prod-xxx-k3s-server-0" with CIDR "10.0.16.0/24": "keep"
I0428 09:24:29.280242       1 route_controller.go:214] action for Node "prod-xxx-k3s-server-1" with CIDR "10.0.17.0/24": "keep"
I0428 09:24:29.280251       1 route_controller.go:214] action for Node "prod-xxx-k3s-server-2" with CIDR "10.0.18.0/24": "keep"
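
The routes the CCM has created in the Hetzner network can be cross-checked with the hcloud CLI and should line up with the route_controller log above (<network-name> is a placeholder; the routes field name follows the Hetzner Cloud API network object):

hcloud network describe <network-name> -o json | jq '.routes'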

So the culprit seems to be this mismatch: the Hetzner CCM thinks node prod-xxx-k3s-agent-2 has the pod CIDR 10.0.19.0/24, while Cilium expects it to be 10.0.25.0/24. This is confirmed by a look at the CiliumNode config:

apiVersion: cilium.io/v2
kind: CiliumNode
metadata:
  creationTimestamp: "2025-04-24T22:46:27Z"
  generation: 18
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    csi.hetzner.cloud/location: nbg1
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: prod-xxx-k3s-agent-2
    kubernetes.io/os: linux
  name: prod-xxx-k3s-agent-2
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: prod-xxx-k3s-agent-2
    uid: 8ac44d80-1509-4bd4-96a0-b89ae7bd37fc
  resourceVersion: "25412434"
  uid: a87adfb9-d50f-4213-85f7-3dcb8e33a305
spec:
  addresses:
  - ip: 10.0.1.12
    type: InternalIP
  - ip: 10.0.25.64
    type: CiliumInternalIP
  alibaba-cloud: {}
  azure: {}
  bootid: 152ec818-1c18-488f-9bcd-9c3b320b0d21
  encryption: {}
  eni: {}
  health:
    ipv4: 10.0.25.241
  ingress:
    ipv4: 10.0.25.17
  ipam:
    podCIDRs:
    - 10.0.25.0/24
    pools: {}
status:
  alibaba-cloud: {}
  azure: {}
  eni: {}
  ipam:
    operator-status: {}
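
A quick way to put the two views side by side (assuming, based on the route_controller log above, that the CCM works off the Node object's spec.podCIDR):

# What Kubernetes, and with it the CCM route controller, thinks:
kubectl get node prod-xxx-k3s-agent-2 -o jsonpath='{.spec.podCIDR}{"\n"}'
# What Cilium's operator handed out:
kubectl get ciliumnode prod-xxx-k3s-agent-2 -o jsonpath='{.spec.ipam.podCIDRs}{"\n"}'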

I was able to mitigate the issue by manually editing the CiliumNode, but I am still on the hunt for the root cause.
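
For anyone hitting the same thing: the field in question is spec.ipam.podCIDRs on the CiliumNode object. A patch along these lines touches it (sketch only, with a placeholder CIDR; I am not claiming this is the proper fix, and the operator may well reconcile it again):

# <agreed-pod-cidr> is a placeholder for whatever CIDR the two sides should agree on.
kubectl patch ciliumnode prod-xxx-k3s-agent-2 --type merge \
  -p '{"spec":{"ipam":{"podCIDRs":["<agreed-pod-cidr>"]}}}'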

I would appreciate any hints on where to look for the bug, or on how this mismatch can happen.
