Lots of clueless "Transport endpoint is not connected" messages in ingress controller #7266

Closed
cemsbr opened this issue Nov 11, 2021 · 11 comments

Comments

@cemsbr

cemsbr commented Nov 11, 2021

Bug Report

What is the issue?

In the nginx ingress controller proxy, there are 2 messages every 3 seconds:

WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)

I have no idea which endpoint it is talking about. I tried to find something related in the debug proxy, but I couldn't.

I disabled external health checks and the messages didn't stop, so the connections are probably from inside the cluster.

How can it be reproduced?

Maybe by creating another cluster with the same environment (see the Environment section).

Logs, error output, etc


linkerd check output

Linkerd core checks
===================                        
                                                
kubernetes-api                             
--------------                                  
√ can initialize the client                                                                     
√ can query the Kubernetes API 
                                                
kubernetes-version                                                                              
------------------                   
√ is running the minimum Kubernetes API version                                                 
√ is running the minimum kubectl version      
                                                                                                
linkerd-existence                               
-----------------         
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist  
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match

Status check results are √

Linkerd extensions checks
=========================

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
√ viz extension proxies are healthy
√ viz extension proxies are up-to-date
√ viz extension proxies and cli versions match
√ prometheus is installed and configured correctly
√ can initialize the client
√ viz extension self-check

Status check results are √

Environment

  • Kubernetes Version: v1.19.6
  • Cluster Environment: DOKS
  • Host OS:
  • Linkerd version: 2.11.1 (client and server)

Possible solution

Improve the logs to show both ends of the failing connection?

Additional context

Just some context on why I want to resolve this warning message. Sometimes, once every couple of days, Linkerd fails to contact a service due to a timeout. However, the service logs show no error. I'm not sure whether these warnings are related to the timeout error, but I'm trying to resolve every warning (and error) message I find.

@olix0r
Member

olix0r commented Nov 11, 2021

This error indicates that there was a problem accepting the inbound connection (i.e. from an external client). The error message originates here:

https://github.com/linkerd/linkerd2-proxy/blob/98fee53a3be85beb7a82571ece4a7e45470989c4/linkerd/app/core/src/serve.rs#L29-L39

This might indicate that the client has already disconnected when the connection is processed, or could indicate some OS-level issues.

@cemsbr
Author

cemsbr commented Nov 11, 2021

Is there a way to know the incoming IP and port of this inbound connection? I also searched the nginx access logs, but they were silent after disabling external health checks, and the proxy warnings kept happening at the same rate.

@olix0r
Member

olix0r commented Nov 11, 2021

These connections aren't making it to nginx. You can increase the proxy's log level by setting a pod annotation like:

config.linkerd.io/proxy-log-level: linkerd_app_core::serve=debug,info

This will configure the proxy to include the client address on these log lines.
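
For example, assuming the ingress controller runs as a Deployment named ingress-nginx-controller in the default namespace (the name is a guess; adjust it to your setup), the annotation can be added to the pod template with:

kubectl -n default patch deployment ingress-nginx-controller --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"linkerd_app_core::serve=debug,info"}}}}}'

The annotation has to go on the pod template rather than on the Deployment itself, so that the proxy injector picks it up on the next rollout.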

@cemsbr
Author

cemsbr commented Nov 12, 2021

I tried searching for pods with IPs 10.244.0.146 and .123 (kubectl get pods -o wide) but found none. There is also one client address that is the public IP of the node. Any idea which services are trying to connect?

[ 11992.906251s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.244.0.123:50254}: linkerd_app_core::serve: Connection closed
[ 11992.915758s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)
[ 11993.207810s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)                               
[ 11994.722659s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.146:38382}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[ 11995.918142s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)
[ 11995.963397s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.123:50834}: linkerd_app_core::serve: Connection closed
[ 11996.211307s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)
[ 11996.981061s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.146:38390}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[ 11998.386177s] DEBUG ThreadId(01) inbound:accept{client.addr=<public IP address>:46542}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[ 11998.920286s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)
[ 11999.213747s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)
[ 11999.878354s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.123:50914}: linkerd_app_core::serve: Connection closed
[ 12000.570219s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.146:38398}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[ 12001.145728s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.244.0.123:50438}: linkerd_app_core::serve: Connection closed

@olix0r
Member

olix0r commented Nov 12, 2021

Looking a little more closely at this: we won't actually have client IPs associated with the Server failed to accept connection error because this error indicates that we either failed to accept the socket or that we failed to determine the socket's client address.

Either way, though, this is an OS error encountered when the proxy is accepting the inbound connection (i.e. before the connection is processed or forwarded to nginx).

I've put up a branch that should at least improve the error messages a bit. If you want to try this build, you can use the following pod annotation:

config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
config.linkerd.io/proxy-version: accept.a4c185df
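
In the pod template these end up looking like the following (a sketch only, applied the same way as the log-level annotation above):

spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
        config.linkerd.io/proxy-version: accept.a4c185df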

@cemsbr
Author

cemsbr commented Nov 26, 2021

Thank you for the branch. Now we can see the client IPs in the error messages. However, they are the same ones I noticed before: the public IP address and 2 private ones, which I can't find by running kubectl get pods -Ao wide | grep 10.244.0.146 or kubectl get services -A | grep 10.244.0.146, for example. Am I missing another command to look up these IP addresses?

time="2021-11-26T23:06:47Z" level=info msg="running version edge-21.4.5"                                                                                                                        
[     0.000744s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime                                                                                                     
[     0.001367s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.001379s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.001381s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.001384s]  INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[     0.001389s]  INFO ThreadId(01) linkerd2_proxy: Local identity is ingress-nginx.default.serviceaccount.identity.linkerd.cluster.local
[     0.001394s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.001397s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.012202s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=ingress-nginx.default.serviceaccount.identity.linkerd.cluster.local
[     0.039016s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=127.0.0.1:57156}: linkerd_app_core::serve: Connection closed
[     7.629721s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.244.0.146:57832}: linkerd_app_core::serve: Connection closed
[    12.384783s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.146:57396}: linkerd_app_core::serve: Connection closed
[    14.818663s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.244.0.146:57970}: linkerd_app_core::serve: Connection closed
[    17.629491s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.244.0.146:58026}: linkerd_app_core::serve: Connection closed
[    17.661262s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.146:57520}: linkerd_app_core::serve: Connection closed
[    18.890740s] DEBUG ThreadId(01) inbound:accept{client.addr=<public IP address>:64123}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[    19.525010s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107)
[    19.767303s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107)
[    21.274532s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.123:43912}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[    22.017999s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.123:43914}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[    22.384536s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.146:57588}: linkerd_app_core::serve: Connection closed
[    22.526506s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107)
[    22.769601s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107)
[    23.945471s] DEBUG ThreadId(01) inbound:accept{client.addr=<public IP address>:56066}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[    24.327122s] DEBUG ThreadId(01) inbound:accept{client.addr=<public IP address>:6752}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[    24.347951s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.123:43924}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[    24.818402s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.244.0.146:58162}: linkerd_app_core::serve: Connection closed
[    25.529911s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107)
[    25.700883s] DEBUG ThreadId(01) inbound:accept{client.addr=<public IP address>:13240}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)

@olix0r
Member

olix0r commented Nov 28, 2021

Is 10.244.0.146 the node IP (kubectl get nodes -o wide | grep -F 10.244.0.146)? This seems likely since we also see connections from this IP address to the admin server. Could these connections be readiness/liveness probes? Does the workload have probes configured?
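
One way to check (again assuming a Deployment named ingress-nginx-controller in the default namespace; just a sketch):

kubectl -n default get deploy ingress-nginx-controller \
  -o jsonpath='{.spec.template.spec.containers[*].readinessProbe}{"\n"}{.spec.template.spec.containers[*].livenessProbe}{"\n"}'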

If everything's working properly except for the log messages, you can disable these warnings by setting a log level like linkerd=info,linkerd_app_core::serve[inbound]=error,warn.
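
That log level is set via the same annotation as before, e.g.:

config.linkerd.io/proxy-log-level: linkerd=info,linkerd_app_core::serve[inbound]=error,warn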

@cemsbr
Author

cemsbr commented Nov 28, 2021

No, the nodes have internal IPs in the 10.131 range. The exception is the public IP in the logs, which does match one of the nodes. Yes, we have some services with probes, and all of them are in the Running state.

About disabling the logs: I'm not sure whether this issue is related to some timeouts between nginx and one of our services, so I'm still not confident about ignoring it. However, I'm afraid the logs can't get more helpful than the ones from your branch.

I still don't know what 10.244.0.146 and 10.244.0.123 are. Identifying them may be the only way to understand what is really going on.

@cemsbr
Author

cemsbr commented Nov 28, 2021

Maybe these IP addresses belong to the DOKS-managed master nodes, and I should contact DigitalOcean.

@olix0r
Member

olix0r commented Nov 29, 2021

@cemsbr

I'm not sure whether this issue is related to some timeouts between nginx and one of our services.

For the warnings like:

WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107)

These connections are definitely not reaching nginx. This is an error from the operating system when the proxy tries to accept the connection, before it can be passed to nginx.

It's notable that there are other similar errors on other connections immediately preceding these accept errors, though:

DEBUG ThreadId(01) inbound:accept{client.addr=<public IP address>:6752}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.123:43924}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)

This may be a conntrack issue or some other OS configuration issue, or it may be a symptom of a lossy network. It's going to be hard for the Linkerd team to be more helpful, though. I'd recommend trying to engage the DigitalOcean team to see if they can help diagnose this further. We're happy to help if there's anything we can do in Linkerd to make this easier to diagnose.
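
If you want to rule out conntrack before contacting them, here is a rough sketch of what to look at on the affected node (plain Linux tooling, not Linkerd-specific; conntrack comes from conntrack-tools):

# per-CPU stats; look for non-zero drop/invalid counters
conntrack -S
# how close the table is to its limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max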

@cemsbr
Author

cemsbr commented Nov 30, 2021

Thank you very much for your support. I'll close this issue and reach out to DO.

@cemsbr cemsbr closed this as completed Nov 30, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 30, 2021