Kubelet TLS Handshake Failures After Certificate Rotation #16850

Open
roman5595 opened this issue Sep 20, 2024 · 1 comment
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

roman5595 commented Sep 20, 2024

What happened?
We deploy to several kops clusters via pipelines. Since kops 1.23, some pipelines fail with the error below, so we implemented a temporary retry mechanism that retries the request. We are currently on kops 1.29 and the issue still persists. It is not causing any outage, but I would like to remove our temporary workaround and get rid of the error below (I also checked the PRs for kops 1.23 but did not find anything that might be related or could cause this issue; on kops 1.22 we never encountered this error):

/usr/bin/helm Error: unable to get pod logs for <APPLICATION>: Get "https://<WORKER NODE>:10250/containerLogs/default/<APPLICATION>/test-service": write tcp <CONTROL-PLANE NODE>:44194-><WORKER NODE>:10250: use of closed network connection
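
For illustration, here is a minimal sketch of the kind of retry we bolted on around the failing request. The names, attempt count, and delay are made up for this example and are not our actual pipeline code; the only point is that we retry solely when the error is the "use of closed network connection" failure shown above.

package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
	"time"
)

// retryOnClosedConn runs op and retries it up to attempts times, but only
// when the failure looks like the transient closed-connection error above.
func retryOnClosedConn(attempts int, delay time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if !errors.Is(err, net.ErrClosed) &&
			!strings.Contains(err.Error(), "use of closed network connection") {
			return err // some other failure: do not mask it by retrying
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("still failing after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retryOnClosedConn(3, 2*time.Second, func() error {
		calls++
		// Stand-in for the real call (fetching pod logs through helm/kubectl).
		if calls < 2 {
			return fmt.Errorf("write tcp: %w", net.ErrClosed)
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}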

At the exact same time, the kube-apiserver logs:
kube-apiserver I0920 16:30:39.542320 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
kube-apiserver E0920 16:30:39.543643 11 status.go:71] apiserver received an error that is not an metav1.Status: &url.Error{Op:"Get", URL:"https://<WORKER NODE>:10250/containerLogs/default/application/filebeat?sinceSeconds=300", Err:(*net.OpError)(0xc071c91090)}: Get "https://<WORKER NODE>:10250/containerLogs/default/application/filebeat?sinceSeconds=300": write tcp <CONTROL-PLANE NODE>:44194-><WORKER NODE>:10250: use of closed network connection

Every time this error happens, the same log appears in the kubelet:

kubelet[5111]: I0920 16:30:39.542666 5111 log.go:245] http: TLS handshake error from <CONTROL-PLANE NODE>:44194: EOF

I checked the validity of the certificates; they are all valid.
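
For completeness, this is roughly how I inspected the serving certificate the kubelet actually presents on port 10250. It is a throwaway sketch: <WORKER NODE> is a placeholder, and InsecureSkipVerify is only there so the handshake completes and the presented certificate can be read; it is not a validation against the cluster CA.

package main

import (
	"crypto/tls"
	"fmt"
	"log"
)

func main() {
	// <WORKER NODE> is a placeholder for the node name from the error above.
	conn, err := tls.Dial("tcp", "<WORKER NODE>:10250", &tls.Config{
		// Skip verification only so we can read the presented certificate.
		InsecureSkipVerify: true,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Print the validity window of each certificate in the presented chain.
	for _, cert := range conn.ConnectionState().PeerCertificates {
		fmt.Printf("subject=%s notBefore=%s notAfter=%s\n",
			cert.Subject, cert.NotBefore, cert.NotAfter)
	}
}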

apiserver logs:

I0920 16:10:03.110741 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 16:20:03.515003 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 16:30:39.542320 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 16:43:43.688572 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 16:53:43.688628 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 17:03:43.689273 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 17:14:04.170499 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 17:28:43.689063 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 17:38:43.688520 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 17:48:43.688942 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials

Is this normal behaviour? Certificate rotation roughly every 10 minutes?

What cloud provider are you using?
AWS

What did you expect to happen?

I expected that once the certificates are rotated, this would not cause any intermittent network issues.

Kubelet config
kubelet:
  containerLogMaxSize: "20Mi"
  containerLogMaxFiles: 5
  anonymousAuth: false
  authenticationTokenWebhook: true
  authorizationMode: Webhook
  readOnlyPort: 0
  protectKernelDefaults: true
  streamingConnectionIdleTimeout: "30m"
  eventQps: "0"
  featureGates:
    RotateKubeletServerCertificate: "true"
    HPAContainerMetrics: "true"
  kubeReserved:
    cpu: "100m"
    memory: "100Mi"
  kubeReservedCgroup: "/kube-reserved"
  systemReserved:
    cpu: "100m"
    memory: "100Mi"
  systemReservedCgroup: "/system-reserved"
  tlsCipherSuites:
    - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
    - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
    - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
    - TLS_RSA_WITH_AES_256_GCM_SHA384
    - TLS_RSA_WITH_AES_128_GCM_SHA256

Possible relation

Is there a chance that this issue might be related to golang/go#50984?
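
The only reason I suspect a relation is that the exact error string in our logs is the one behind net.ErrClosed in the Go standard library. The small snippet below just demonstrates that mapping; it does not prove the apiserver actually hits the code path discussed in that Go issue.

package main

import (
	"errors"
	"fmt"
	"net"
)

func main() {
	// Prints exactly the suffix of the error in our logs:
	// "use of closed network connection"
	fmt.Println(net.ErrClosed)

	// Errors wrapping net.ErrClosed keep that identity, so they can be
	// detected without matching on the error string.
	wrapped := fmt.Errorf("write tcp <CONTROL-PLANE NODE>:44194-><WORKER NODE>:10250: %w", net.ErrClosed)
	fmt.Println(errors.Is(wrapped, net.ErrClosed)) // true
}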

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Dec 20, 2024