TCP sockets not closing properly when etcd is running proxy mode. #9009

Closed
@lnx-bsp

Description

Set up a small cluster of 4 nodes: 3 regular members plus 1 proxy (the proxy's config has ETCD_PROXY="on", which is not the default).

On the proxy node, check for open TCP sockets like this:

netstat -n -p -a | fgrep etcd

When local clients on the proxy node connect to 127.0.0.1:2379, a new connection appears from an ephemeral TCP port on the proxy node to port 2379 on one of the other three nodes. That is expected: it is the proxy forwarding the request. However, these proxy connections are not cleaned up while the etcd process keeps running. Repeating the netstat check above shows more and more lines of output as more traffic passes through the proxy. Over time the available file handles are exhausted, and etcd eventually refuses new connections.

etcd: http: Accept error: accept tcp [::]:2379: accept4: too many open files; retrying in 5ms
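To track the growth described above rather than eyeballing the netstat output, something like the following sketch could summarise etcd-owned sockets per remote endpoint and TCP state. It assumes the usual seven-column Linux `netstat -n -p -a` layout; the function name and the sample addresses are made up for illustration and are not taken from the affected cluster.

```python
from collections import Counter


def count_etcd_sockets(netstat_output):
    """Count etcd TCP sockets per (foreign address, state) pair."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        if not fields[6].endswith("/etcd"):
            continue
        counts[(fields[4], fields[5])] += 1
    return counts


# Hypothetical sample output; addresses, ports and PIDs are invented.
SAMPLE = """\
tcp        0      0 10.0.0.4:51234          10.0.0.1:2379           ESTABLISHED 1234/etcd
tcp        0      0 10.0.0.4:51236          10.0.0.1:2379           ESTABLISHED 1234/etcd
tcp        0      0 10.0.0.4:51240          10.0.0.2:2379           CLOSE_WAIT  1234/etcd
tcp6       0      0 :::2379                 :::*                    LISTEN      1234/etcd
tcp        0      0 10.0.0.4:22             10.0.0.9:40022          ESTABLISHED 900/sshd
"""

if __name__ == "__main__":
    for (peer, state), n in sorted(count_etcd_sockets(SAMPLE).items()):
        print(peer, state, n)
```

Feeding it the real netstat output at intervals and comparing the totals would show whether connections to the three members keep accumulating, and in which state.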

This appears to be related to an older issue, #1959, which has either come back or was never fully fixed in the first place.

Restarting etcd gets it working again temporarily, but the file handles are gradually consumed again and further restarts are needed. It is not necessary to restart the entire cluster; restarting etcd on the proxy node alone is sufficient, so the proxy is almost certainly the culprit.
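A rough way to see how close the proxy is to needing another restart is to compare its open file descriptors against its soft limit. The sketch below assumes Linux with a readable /proc/<pid> for the etcd process; the helper names and the sample limits text (including the value 1024) are illustrative only and do not reflect the actual limit on the affected host.

```python
import os


def soft_open_files_limit(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Remaining columns: soft limit, hard limit, units
            soft = line[len("Max open files"):].split()[0]
            return None if soft == "unlimited" else int(soft)
    return None


def fd_usage(pid):
    """Return (open_fds, soft_limit) for a process. Linux-only."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as fh:
        return open_fds, soft_open_files_limit(fh.read())


# Hypothetical excerpt of /proc/<pid>/limits; the numbers are invented.
SAMPLE_LIMITS = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            1024                 4096                 files
"""

if __name__ == "__main__":
    print(soft_open_files_limit(SAMPLE_LIMITS))
```

Calling `fd_usage(pid)` with the etcd PID (as root or as the etcd user) at intervals would show the descriptor count climbing toward the limit before the "too many open files" error appears.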

Platform is CentOS 7, using the packaged etcd installed with "yum" and launched via "systemd". Package details:

Name        : etcd
Arch        : x86_64
Version     : 3.2.7
Release     : 1.el7
Size        : 39 M
Repo        : installed
From repo   : extras
Summary     : A highly-available key value store for shared configuration
URL         : https://github.com/coreos/etcd
License     : ASL 2.0
Description : A highly-available key value store for shared configuration.

In the process memory maps I see that the etcd process is using the following libraries:

/usr/lib64/libc-2.17.so
/usr/lib64/libdl-2.17.so
/usr/lib64/libpthread-2.17.so
/usr/lib64/ld-2.17.so

These are all standard CentOS system libraries. I doubt the bug is in a library, but this should at least make the setup easy to reproduce. Many commenters on the older issues reported that a similar setup (3 nodes + 1 proxy) reproduced the problem, so it appears to happen quite consistently. We are running a group of web servers and a load balancer that share session data via etcd. These are fairly simple read/write operations that should be easy to simulate for testing. I'm guessing the content of the data is irrelevant; a typical data block is approximately 1 KB.
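For anyone trying to reproduce this, a workload along these lines might be a starting point. It is only a sketch: it assumes the clients talk to the proxy through the etcd v2 keys HTTP API at 127.0.0.1:2379, and the key names, iteration count and helper functions are invented for illustration. Our real clients may behave differently. Each `urlopen` call uses a fresh connection, which roughly mimics many short-lived local clients.

```python
import urllib.parse
import urllib.request

# Client endpoint of the local proxy; the /v2/keys path is an assumption.
BASE_URL = "http://127.0.0.1:2379/v2/keys"


def build_put(key, value, base_url=BASE_URL):
    """Build a v2 keys API PUT request that sets key to value."""
    body = urllib.parse.urlencode({"value": value}).encode("ascii")
    return urllib.request.Request(
        f"{base_url}/{key}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def build_get(key, base_url=BASE_URL):
    """Build a v2 keys API GET request for key."""
    return urllib.request.Request(f"{base_url}/{key}", method="GET")


def run_load(iterations, payload_size=1024):
    """Write and read back ~1 KB values through the proxy."""
    value = "x" * payload_size
    for i in range(iterations):
        key = f"sessions/test-{i % 100}"  # hypothetical key layout
        for req in (build_put(key, value), build_get(key)):
            with urllib.request.urlopen(req, timeout=5) as resp:
                resp.read()  # drain the body so the response completes
```

Running `run_load(10000)` on the proxy node while watching the netstat output should show whether the outbound connections to the members keep piling up.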
