Splunk driver not getting response from splunk makes docker unresponsive #55

Open
@Carles-Figuerola

What happened:
We have a cluster of nodes running Docker, managed by Marathon/Mesos. The containers running there use the Docker Splunk logging plugin to send logs to the Splunk event collector.

The load balancer in front of the Splunk event collector was having connectivity problems. From the logging plugin's point of view, HTTPS connections were opened but never answered, so every connection hung. This made the whole environment unstable: containers failed their health checks and could not serve the applications running on them.

An example of the log messages from dockerd:

Aug 12 12:50:34 dockerhost.local dockerd[10030]: time="2019-08-12T12:50:34.493818095-07:00" level=warning msg="Error while sending logs" error="Post https://splunk-ec:443/services/collector/event/1.0: context deadline exceeded" module=logger/splunk

A manual connection to splunk-ec shows that it hangs after the request headers are sent and never returns a response:

$ curl -vk https://splunk-ec:443/services/collector/event/1.0
* About to connect() to splunk-ec port 443 (#0)
*   Trying 10.0.0.1...
* Connected to splunk-ec (10.0.0.1) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_RSA_WITH_AES_256_CBC_SHA
* Server certificate:
*       subject: CN=<REDACTED>
*       start date: Jan 22 16:45:30 2010 GMT
*       expire date: Jan 23 01:36:42 2020 GMT
*       common name: <REDACTED>
*       issuer: CN=Entrust Certification Authority - L1C,OU="(c) 2009 Entrust, Inc.",OU=www.entrust.net/rpa is incorporated by reference,O="Entrust, Inc.",C=US
> GET /services/collector/event/1.0 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: splunk-ec
> Accept: */*
>
^C

What you expected to happen:
If the Splunk logging driver can't send logs for any reason, it should fill its buffer and drop logs once the buffer is full. It should not make the Docker agent unstable or the application inaccessible.
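
To make the expected behavior concrete, here is a minimal Python sketch of a bounded buffer that drops new messages when it is full instead of blocking the producer. It is only an illustration of the idea, not the driver's actual implementation; the names `DroppingLogBuffer`, `offer`, and `drain` are hypothetical, and the sizes 400 and 20 simply mirror the buffer and batch values from the environment variables in this report.

```python
import queue


class DroppingLogBuffer:
    """Bounded buffer that never blocks the producer (hypothetical sketch)."""

    def __init__(self, max_size):
        self._queue = queue.Queue(maxsize=max_size)
        self.dropped = 0  # number of messages discarded because the buffer was full

    def offer(self, message):
        """Try to enqueue without blocking; return False if the message was dropped."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, batch_size):
        """Remove and return up to batch_size messages for one send attempt."""
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def __len__(self):
        return self._queue.qsize()


if __name__ == "__main__":
    # Assumed sizes, taken from the environment variables listed in this issue.
    buf = DroppingLogBuffer(max_size=400)
    for i in range(450):
        buf.offer(f"log line {i}")  # returns immediately, even when full
    print("buffered:", len(buf), "dropped:", buf.dropped)
    print("next batch size:", len(buf.drain(20)))
```

The point of the sketch is that `offer` returns immediately in every case, so a container writing logs is never stalled by a backend that does not answer.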

How to reproduce it (as minimally and precisely as possible):
Run a small app (maybe just nc -l -p443) that listens for HTTPS connections but never sends any reply, successful or not, then point the Splunk logging plugin at it.
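
If nc is not convenient, a rough Python stand-in for such a silent endpoint could look like the sketch below. It is an assumption-laden approximation: it speaks plain TCP only and does not terminate TLS, so an HTTPS client would stall during the TLS handshake rather than after sending the request headers as in the curl trace above. The helper names `start_silent_server` and `probe` are made up for this example.

```python
import socket
import threading


def start_silent_server(host="127.0.0.1", port=0):
    """Accept TCP connections and never reply. Returns (server_socket, bound_port).

    port=0 lets the OS pick a free port; binding to 443 would normally need root.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(16)
    held = []  # keep accepted sockets referenced so they stay open and silent

    def accept_loop():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return server, server.getsockname()[1]


def probe(host, port, timeout):
    """Send a request and report whether any reply arrives within timeout seconds."""
    with socket.create_connection((host, port), timeout=timeout) as client:
        client.sendall(
            b"POST /services/collector/event/1.0 HTTP/1.1\r\nHost: test\r\n\r\n"
        )
        try:
            client.recv(1)
            return "replied"
        except socket.timeout:
            return "timed out"


if __name__ == "__main__":
    server, port = start_silent_server()
    print("silent server on port", port)
    print("probe result:", probe("127.0.0.1", port, timeout=1.0))
```

The probe connects and sends its request successfully but never receives a byte, which is the same "connection opened, no reply" condition described in this report.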

Anything else we need to know?:
The docker agent runs with these environment variables:

SPLUNK_LOGGING_DRIVER_BUFFER_MAX=400
SPLUNK_LOGGING_DRIVER_CHANNEL_SIZE=200
SPLUNK_LOGGING_DRIVER_POST_MESSAGES_BATCH_SIZE=20

The containers run with these options:

--log-driver=splunk
--log-opt=splunk-token=<token>
--log-opt=splunk-url=https://splunk-ec:443
--log-opt=splunk-index=app
--log-opt=splunk-sourcetype=<sourcetype>
--log-opt=splunk-insecureskipverify=true
--log-opt=env=APP_NAME,HOST,ACTIVE_VERSION
--log-opt=splunk-format=raw
--log-opt=splunk-verify-connection=false

Environment:

  • Docker version (use docker version):
Server: Docker Engine - Community
 Engine:
  Version:          18.09.2
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       6247962
  Built:            Sun Feb 10 03:47:25 2019
  OS/Arch:          linux/amd64
  Experimental:     false
  • OS (e.g: cat /etc/os-release):
CentOS Linux release 7.6.1810 (Core)
Linux hostname 3.10.0-957.12.1.el7.x86_64 #1 SMP Mon Apr 29 14:59:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Splunk version:
7.1.6

(The Splunk version shouldn't matter, since the problem was that the Splunk logging plugin got no HTTPS response from the load balancer.)

  • Others:
