Lots of clueless "Transport endpoint is not connected" messages in ingress controller #7266

Closed
cemsbr opened this issue Nov 11, 2021 · 11 comments

Comments

@cemsbr

cemsbr commented Nov 11, 2021

Bug Report

What is the issue?

In the nginx ingress controller proxy, there are 2 messages every 3 seconds:

WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)

I have no idea which endpoint it is talking about. I tried to find something related in the debug proxy, but I couldn't.

I disabled external health checks and the messages didn't stop, so the connections are probably from inside the cluster.

How can it be reproduced?

Maybe by creating another cluster with the same environment (see the Environment section).

Logs, error output, etc


linkerd check output

Linkerd core checks
===================                        
                                                
kubernetes-api                             
--------------                                  
√ can initialize the client                                                                     
√ can query the Kubernetes API 
                                                
kubernetes-version                                                                              
------------------                   
√ is running the minimum Kubernetes API version                                                 
√ is running the minimum kubectl version      
                                                                                                
linkerd-existence                               
-----------------         
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist  
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match

Status check results are √

Linkerd extensions checks
=========================

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
√ viz extension proxies are healthy
√ viz extension proxies are up-to-date
√ viz extension proxies and cli versions match
√ prometheus is installed and configured correctly
√ can initialize the client
√ viz extension self-check

Status check results are √

Environment

  • Kubernetes Version: v1.19.6
  • Cluster Environment: DOKS
  • Host OS:
  • Linkerd version: 2.11.1 (client and server)

Possible solution

Improve the logs to show both ends of the failing connection?

Additional context

Just some context on why I want to resolve this warning message. Sometimes, once every couple of days, Linkerd fails to contact a service due to a timeout. However, the service logs show no error. I'm not sure whether these warnings are related to the timeout error, but I'm trying to resolve every warning (and error) message I find.

@olix0r
Member

olix0r commented Nov 11, 2021

This error indicates that there was a problem accepting the inbound connection (i.e. from an external client). The error message originates here:

https://github.com/linkerd/linkerd2-proxy/blob/98fee53a3be85beb7a82571ece4a7e45470989c4/linkerd/app/core/src/serve.rs#L29-L39

This might indicate that the client has already disconnected when the connection is processed, or could indicate some OS-level issues.

@cemsbr
Author

cemsbr commented Nov 11, 2021

Is there a way to know the incoming IP and port of this inbound connection? I also searched the nginx access logs, but they were silent after disabling external health checks, and the proxy warnings kept happening at the same rate.

@olix0r
Member

olix0r commented Nov 11, 2021

These connections aren't making it to nginx. You can increase the proxy's log level by setting a pod annotation like:

config.linkerd.io/proxy-log-level: linkerd_app_core::serve=debug,info

This will configure the proxy to include the client address on these log lines.
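
For example, assuming the ingress controller runs as a Deployment named ingress-nginx-controller in the default namespace (the name is a guess; adjust it to your setup), the annotation can be added to the pod template with:

kubectl -n default patch deployment ingress-nginx-controller --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"linkerd_app_core::serve=debug,info"}}}}}'

The annotation has to go on the pod template rather than on the Deployment itself, so that the proxy injector picks it up on the next rollout.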

@cemsbr
Author

cemsbr commented Nov 12, 2021

I tried searching for pods with IPs 10.244.0.146 and .123 (kubectl get pods -o wide) but found none. There is also one client address that is the public IP of the node. Any idea which services are trying to connect?

[ 11992.906251s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.244.0.123:50254}: linkerd_app_core::serve: Connection closed
[ 11992.915758s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)
[ 11993.207810s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)                               
[ 11994.722659s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.146:38382}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[ 11995.918142s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)
[ 11995.963397s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.123:50834}: linkerd_app_core::serve: Connection closed
[ 11996.211307s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)
[ 11996.981061s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.146:38390}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[ 11998.386177s] DEBUG ThreadId(01) inbound:accept{client.addr=<public IP address>:46542}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[ 11998.920286s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)
[ 11999.213747s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)
[ 11999.878354s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.123:50914}: linkerd_app_core::serve: Connection closed
[ 12000.570219s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.146:38398}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[ 12001.145728s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.244.0.123:50438}: linkerd_app_core::serve: Connection closed

@olix0r
Member

olix0r commented Nov 12, 2021

Looking a little more closely at this: we won't actually have client IPs associated with the Server failed to accept connection error because this error indicates that we either failed to accept the socket or that we failed to determine the socket's client address.

Either way, though, this is an OS error encountered when the proxy is accepting the inbound connection (i.e. before the connection is processed or forwarded to nginx).

I've put up a branch that should at least improve the error messages a bit. If you want to try this build, you can use the following pod annotation:

config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
config.linkerd.io/proxy-version: accept.a4c185df
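
In the pod template these end up looking like the following (a sketch only, applied the same way as the log-level annotation above):

spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
        config.linkerd.io/proxy-version: accept.a4c185df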

@cemsbr
Author

cemsbr commented Nov 26, 2021

Thank you for the branch. Now we can see the client IPs in the error messages. However, they are the same ones I noticed before: the public IP address and 2 private ones, which I can't find by running kubectl get pods -Ao wide | grep 10.244.0.146 or kubectl get services -A | grep 10.244.0.146, for example. Am I missing another command to look up these IP addresses?

time="2021-11-26T23:06:47Z" level=info msg="running version edge-21.4.5"                                                                                                                        
[     0.000744s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime                                                                                                     
[     0.001367s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.001379s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.001381s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.001384s]  INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[     0.001389s]  INFO ThreadId(01) linkerd2_proxy: Local identity is ingress-nginx.default.serviceaccount.identity.linkerd.cluster.local
[     0.001394s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.001397s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.012202s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=ingress-nginx.default.serviceaccount.identity.linkerd.cluster.local
[     0.039016s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=127.0.0.1:57156}: linkerd_app_core::serve: Connection closed
[     7.629721s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.244.0.146:57832}: linkerd_app_core::serve: Connection closed
[    12.384783s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.146:57396}: linkerd_app_core::serve: Connection closed
[    14.818663s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.244.0.146:57970}: linkerd_app_core::serve: Connection closed
[    17.629491s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.244.0.146:58026}: linkerd_app_core::serve: Connection closed
[    17.661262s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.146:57520}: linkerd_app_core::serve: Connection closed
[    18.890740s] DEBUG ThreadId(01) inbound:accept{client.addr=<public IP address>:64123}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[    19.525010s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107)
[    19.767303s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107)
[    21.274532s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.123:43912}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[    22.017999s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.123:43914}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[    22.384536s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.146:57588}: linkerd_app_core::serve: Connection closed
[    22.526506s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107)
[    22.769601s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107)
[    23.945471s] DEBUG ThreadId(01) inbound:accept{client.addr=<public IP address>:56066}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[    24.327122s] DEBUG ThreadId(01) inbound:accept{client.addr=<public IP address>:6752}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[    24.347951s] DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.123:43924}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
[    24.818402s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.244.0.146:58162}: linkerd_app_core::serve: Connection closed
[    25.529911s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107)
[    25.700883s] DEBUG ThreadId(01) inbound:accept{client.addr=<public IP address>:13240}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)

@olix0r
Member

olix0r commented Nov 28, 2021

Is 10.244.0.146 the node IP (kubectl get nodes -o wide | grep -F 10.244.0.146)? This seems likely since we also see connections from this IP address to the admin server. Could these connections be readiness/liveness probes? Does the workload have probes configured?
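
One way to check (again assuming a Deployment named ingress-nginx-controller in the default namespace; just a sketch):

kubectl -n default get deploy ingress-nginx-controller \
  -o jsonpath='{.spec.template.spec.containers[*].readinessProbe}{"\n"}{.spec.template.spec.containers[*].livenessProbe}{"\n"}'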

If everything's working properly except for the log messages, you can disable these warnings by setting a log level like linkerd=info,linkerd_app_core::serve[inbound]=error,warn.
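
That log level is set via the same annotation as before, e.g.:

config.linkerd.io/proxy-log-level: linkerd=info,linkerd_app_core::serve[inbound]=error,warn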

@cemsbr
Author

cemsbr commented Nov 28, 2021

No, the nodes have internal IPs in the 10.131 range. The exception is the public IP in the logs, which does match one of the nodes. Yes, we have some services with probes, and all of them are in the Running state.

About disabling the logs: I'm not sure whether this issue is related to some timeouts between nginx and one of our services, so I'm still not confident about ignoring it. However, I'm afraid the logs can't get more helpful than the ones from your branch.

I still don't know what 10.244.0.146 and 10.244.0.123 are. Identifying them may be the only way to understand what is really going on.

@cemsbr
Author

cemsbr commented Nov 28, 2021

Maybe these IP addresses belong to the DOKS-managed master nodes, and I should contact DigitalOcean.

@olix0r
Member

olix0r commented Nov 29, 2021

@cemsbr

I'm not sure whether this issue is related to some timeouts between nginx and one of our services.

For the warnings like:

WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107)

These connections are definitely not reaching nginx. This is an error from the operating system when the proxy tries to accept the connection, before it can be passed to nginx.

It's notable that there are other similar errors on other connections immediately preceding these accept errors, though:

DEBUG ThreadId(01) inbound:accept{client.addr=<public IP address>:6752}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)
DEBUG ThreadId(01) inbound:accept{client.addr=10.244.0.123:43924}: linkerd_app_core::serve: Connection closed reason=client: Transport endpoint is not connected (os error 107)

This may be a conntrack issue or some other OS configuration issue, or it may be a symptom of a lossy network. It's going to be hard for the Linkerd team to be more helpful, though. I'd recommend trying to engage the DigitalOcean team to see if they can help diagnose this further. We're happy to help if there's anything we can do in Linkerd to make this easier to diagnose.
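
If you want to rule out conntrack before contacting them, here is a rough sketch of what to look at on the affected node (plain Linux tooling, not Linkerd-specific; conntrack comes from conntrack-tools):

# per-CPU stats; look for non-zero drop/invalid counters
conntrack -S
# how close the table is to its limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max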

@cemsbr
Author

cemsbr commented Nov 30, 2021

Thank you very much for your support. I'll close this issue and reach out to DO.

@cemsbr cemsbr closed this as completed Nov 30, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 30, 2021