TLS handshake error from: EOF #2142

Closed
ritazh opened this issue Jul 1, 2022 · 51 comments
Labels
wontfix This will not be worked on

Comments

@ritazh
Member

ritazh commented Jul 1, 2022

What steps did you take and what happened:

Getting the following intermittent errors in the gatekeeper-system logs:

http: TLS handshake error from 172.16.0.3:42672: EOF

kube-apiserver logs during the same time range do not have equivalent errors.
Everything is functioning. No impact on functionality.

NOTE:
There aren't any actual functional issues related to these error messages, and the policies are working as expected. Many other webhook projects have reported the same issue; the error is triggered by the kube-apiserver when it drops the connection prematurely and retries afterwards.

Please provide feedback in the following issues:

The EOF errors seem to be related to a Go bug, golang/go#50984, and appear on Kubernetes 1.23, 1.24, and later. See kubernetes/kubernetes#109022.

What did you expect to happen:
No TLS error in pod logs


Environment:

  • Gatekeeper version: v3.8.1 and v3.7.1
  • Kubernetes version: (use kubectl version): 1.23.5
@ritazh ritazh added the bug Something isn't working label Jul 1, 2022
@ritazh
Member Author

ritazh commented Jul 1, 2022

The EOF errors seem to be related to a Go bug, golang/go#50984, and appear on Kubernetes 1.23 and 1.24. See kubernetes/kubernetes#109022.

From the issue description, it does not seem like there are any actual functional issues related to these error messages (as the policies are working as expected). At the moment, there is nothing we can do to fix this, as the error is coming from Kubernetes core. We can continue to monitor this after the linked issue has been fixed and released as part of a future Kubernetes patch release.

@ritazh ritazh pinned this issue Jul 1, 2022
@ritazh
Member Author

ritazh commented Jul 1, 2022

xref: #866 (comment)

@stale

stale bot commented Aug 30, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 30, 2022
@punnarpulusu

This is not limited to Kubernetes 1.23 and 1.24; it is also happening on Kubernetes (AWS EKS) version 1.21.

@stale stale bot removed the stale label Sep 8, 2022
@ritazh
Member Author

ritazh commented Sep 9, 2022

@punnarpulusu Can you share the exact error in the log, along with your Kubernetes and Gatekeeper versions?

@punnarpulusu

punnarpulusu commented Sep 9, 2022

@ritazh Here is the error log. I redacted some information for security purposes.

gatekeeper version is 3.8.1

k logs -n gatekeeper deploy/gatekeeper-controller-manager -f
Found 3 pods, using pod/gatekeeper-controller-manager-xxxxxxx-ldsc7
2022/09/08 01:14:32 http: TLS handshake error from x.x.x.x:49070: EOF
2022/09/08 01:46:37 http: TLS handshake error from x.x.x.x:35184: EOF
2022/09/08 02:47:46 http: TLS handshake error from x.x.x.x:39938: EOF
2022/09/08 06:47:20 http: TLS handshake error from x.x.x.x:38652: EOF
2022/09/08 12:37:59 http: TLS handshake error from x.x.x.x:49956: EOF
2022/09/08 13:16:45 http: TLS handshake error from x.x.x.x:56032: EOF
2022/09/08 13:41:48 http: TLS handshake error from x.x.x.x:56232: EOF
2022/09/08 16:38:13 http: TLS handshake error from x.x.x.x:60828: EOF
2022/09/08 19:02:34 http: TLS handshake error from x.x.x.x:36744: EOF

sorry about the delayed response.

@punnarpulusu

@ritazh I am getting the same error on gatekeeper 3.9.0 as well

image: artifactory.dev.earnin.net/docker-remote/openpolicyagent/gatekeeper:v3.9.0

Here is the log

k logs -n gatekeeper deploy/gatekeeper-controller-manager -f
Found 3 pods, using pod/gatekeeper-controller-manager-69b88d77ff-v6fn8
2022/09/28 16:39:44 maxprocs: Updating GOMAXPROCS=5: determined from CPU quota
2022/09/28 16:40:01 http: TLS handshake error from x.x.x.x:48490: EOF

Any idea what's causing this issue and how I can get it fixed?

@meons
Contributor

meons commented Oct 10, 2022

Same here on GKE 1.22 + Gatekeeper 3.9.0:

kubectl -n gatekeeper logs deployment/gatekeeper-controller-manager | grep error
Found 3 pods, using pod/gatekeeper-controller-manager-888b9f574-h4vjz
2022/10/09 10:27:38 http: TLS handshake error from x.x.x.x:50782: EOF
2022/10/09 10:36:56 http: TLS handshake error from x.x.x.x:52720: EOF
2022/10/09 10:56:14 http: TLS handshake error from x.x.x.x:55364: EOF
2022/10/09 11:05:39 http: TLS handshake error from x.x.x.x:58102: EOF
2022/10/09 11:34:01 http: TLS handshake error from x.x.x.x:49868: EOF
2022/10/09 13:30:59 http: TLS handshake error from x.x.x.x:46064: EOF
2022/10/09 14:45:18 http: TLS handshake error from x.x.x.x:55056: EOF
2022/10/09 15:06:19 http: TLS handshake error from x.x.x.x:54452: EOF
2022/10/09 16:07:31 http: TLS handshake error from x.x.x.x:54824: EOF
2022/10/09 16:16:03 http: TLS handshake error from x.x.x.x:37644: EOF
2022/10/09 16:43:38 http: TLS handshake error from x.x.x.x:46590: EOF
2022/10/09 21:27:03 http: TLS handshake error from x.x.x.x:54706: EOF
2022/10/09 21:40:59 http: TLS handshake error from x.x.x.x:47458: EOF
2022/10/09 22:47:10 http: TLS handshake error from x.x.x.x:50688: EOF
2022/10/10 00:18:11 http: TLS handshake error from x.x.x.x:44814: EOF
2022/10/10 01:13:12 http: TLS handshake error from x.x.x.x:42378: EOF
2022/10/10 02:45:59 http: TLS handshake error from x.x.x.x:47150: EOF
2022/10/10 02:56:03 http: TLS handshake error from x.x.x.x:32800: EOF
2022/10/10 03:33:21 http: TLS handshake error from x.x.x.x:52332: EOF
2022/10/10 03:58:46 http: TLS handshake error from x.x.x.x:45042: EOF
2022/10/10 04:59:54 http: TLS handshake error from x.x.x.x:48482: EOF
2022/10/10 05:49:48 http: TLS handshake error from x.x.x.x:42376: EOF
2022/10/10 05:57:33 http: TLS handshake error from x.x.x.x:41712: EOF
2022/10/10 07:11:39 http: TLS handshake error from x.x.x.x:60302: EOF

The x.x.x.x addresses are actually GKE control plane IPs.

@ZiaUrRehman-GBI

Kubernetes version : 1.24.3-gke.2100

textPayload: "2022/10/31 11:20:06 http: TLS handshake error from x.x.x.x:41398: EOF"

@kfox1111

kfox1111 commented Nov 3, 2022

Seen on k8s 1.21.11:
2022/11/03 19:17:10 http: TLS handshake error from 10.17.0.0:52110: EOF

Not sure it's affecting anything.

@tspearconquest

Hello, I've noticed these before but not had time to do some proper investigation until now.

I found that these messages are coming from an IP belonging to the konnectivity pods in my kube-system namespace in Azure.

This pod facilitates control-plane-to-cluster communication, as per https://kubernetes.io/docs/tasks/extend-kubernetes/setup-konnectivity/

Digging into the kube-system namespace labels, I see that there is control-plane: true on that namespace.

I believe the cause is that konnectivity-agent looks for all namespaces where the label control-plane exists (regardless of the value) and tries to make a connection to the gatekeeper pods.

I found #1061, which covers the removal of the control-plane label. However, it has only been partially implemented, by removing the check from the validating webhook configuration in Gatekeeper (#758).

Is it safe to remove the control-plane: controller-manager label from the gatekeeper-system namespace currently, if we have already applied the admission.gatekeeper.sh/ignore: no-self-managing label?

In case it is safe, then we should push to have the control-plane label removed from the namespace as soon as possible, as this is really causing problems for teams with log monitoring agents like fluentd.

@ritazh
Member Author

ritazh commented Nov 14, 2022

NOTE

The EOF errors seem to be related to a Go bug, golang/go#50984, and appear on Kubernetes 1.23 and 1.24. See kubernetes/kubernetes#109022.

From the issue description, it does not seem like there are any actual functional issues related to these error messages (as the policies are working as expected). At the moment, there is nothing we can do to fix this, as the error is coming from Kubernetes core. We can continue to monitor this after the linked issue has been fixed and released as part of a future Kubernetes patch release.

@tspearconquest

tspearconquest commented Nov 14, 2022

Hi @ritazh, I believe that is incorrect. We also see these errors on Kubernetes 1.22, and others have noted in this issue that they happen on K8s 1.21. Quoting an earlier comment:

"This is not just related to on Kubernetes 1.23 and 1.24 this is happening on all kuberenetes ( AWS EKS ) version 1.21"

Furthermore, kubernetes/kubernetes#109022 clearly indicates that the errors come from 127.0.0.1.
The original post of this issue does not indicate 127.0.0.1, but rather has the IP addresses masked as x.x.x.x, which leads me to believe that the OP is experiencing this from their 10.x.x.x/8 subnet, the same as myself.

@ritazh
Member Author

ritazh commented Nov 14, 2022

Thanks for the additional data @tspearconquest! If you remove the control-plane label from the gatekeeper-system namespace as you suggested, do you still see the error in the log?

@tspearconquest

We're testing today and I will report back soon!

@Murtaza-Solangi

Quoting @tspearconquest: "We're testing today and I will report back soon!"

Was the test successful?

@tspearconquest

Quoting @tspearconquest: "We're testing today and I will report back soon!"

Quoting @Murtaza-Solangi: "Was the test successful?"

Hello, apologies as I put my update on the other issue: #1061

Hi @ritazh - It seems my suspicion was not correct, and removing the control-plane label did not help.

It's really interesting that this is only affecting Gatekeeper, as we do have other tools with mutating and validating webhooks (MWH and VWH) which do not see this problem, and the traffic causing the errors is 100% coming from the konnectivity-agent pods in kube-system.

I also took a look at the konnectivity ConfigMap and Deployment manifest in one of our clusters to see if I could find a log format option, but I'm afraid I couldn't find any. My main concern is that these messages are not in JSON format, so they cause a lot of spam when our fluentd instance tries to parse non-JSON log output as JSON.

@stale

stale bot commented Jan 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 22, 2023
@tspearconquest

tspearconquest commented Jan 22, 2023 via email

@wondywang

I also encountered these error logs and am looking forward to someone solving it.

@maxsmythe
Contributor

"Gatekeeper works but this error happens periodically. Maybe because of cert rotation."

I'd expect the same errors during cert rotation, though I would think the cert rotation frequency is low enough (O(years)) for the error to never repeat in the standard lifecycle of a pod.

In any case, having multiple concurrent writers with one "winner" is probably the best model. This is essentially how leader election works anyway, and avoids needing to worry about any one pod becoming a SPOF or figuring out who is eligible to become a leader. There is an edge case where there is the possibility of a controller fight if there is an incompatible change. This can be mitigated by gradually introducing a change and leaning on our "upgrades are N - 1 compatible" policy.


stale bot commented Nov 5, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 5, 2023
@part-time-githubber

still an open issue

@stale stale bot removed the stale label Nov 6, 2023
@resnostyle
Contributor

We're currently running Kubernetes 1.25.15 and OPA Gatekeeper v3.12.0 and are still seeing this error.

@cccsss01

cccsss01 commented Dec 2, 2023

Seeing this error on 1.27.1 with Gatekeeper v3.11.0. Not sure whether this is causing issues with timeouts for leader election.


stale bot commented Feb 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 1, 2024
@immae1

immae1 commented Feb 1, 2024

Still a topic on AKS (1.27.7) with the latest node images:

2024/02/01 08:28:47 http: TLS handshake error from 10.2.0.1:32914: EOF

@OrKarstoft

Same issue with Kubernetes v1.27.9-gke.1092000.

@aimbot31

still

@stale stale bot removed the stale label Feb 14, 2024
@rjbrown57

Seeing this in 1.26 as well.

@salaxander salaxander added wontfix This will not be worked on and removed bug Something isn't working labels Feb 21, 2024
@ritazh ritazh added wontfix This will not be worked on and removed wontfix This will not be worked on labels Feb 23, 2024
@ritazh
Member Author

ritazh commented Feb 23, 2024

Closing this issue, as there aren't any actual functional issues related to these error messages and the policies are working as expected. Many other webhook projects have reported the same issue; the error is triggered by the kube-apiserver when it drops the connection prematurely and retries afterwards.

@ritazh ritazh closed this as completed Feb 23, 2024
@ritazh ritazh closed this as not planned Feb 23, 2024
@cbugneac-nex

But it does pollute the logs with these errors in DataDog (in our case), creating unnecessary noise.

@sozercan
Member

sozercan commented Feb 23, 2024

@cbugneac-nex It would be good to provide that feedback in golang/go#50984, as this is not an issue Gatekeeper (or any Kubernetes webhook) can address on its own.

@ritazh ritazh unpinned this issue Mar 19, 2024
@pythonking6

This may be separate, but I am seeing this in the Loadbalancer controller on Kubernetes 1.29, terraformed from eks_blueprints:

2024/05/29 21:55:41 http: TLS handshake error from 10.0.119.240:46106: EOF
