
Fluentd 0.14.13 gets buffer overflow errors with the same settings as Fluentd 0.12.32 #1485

Closed
qingling128 opened this issue Mar 1, 2017 · 19 comments
Labels
bug, v0.14

Comments

@qingling128

During some basic load testing, we found that upgrading from Fluentd 0.12.32 to 0.14.13 caused buffer overflow errors to be triggered very frequently.

[Error]

2017-03-01 17:02:20 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:block

[Steps to reproduce]

  1. Create a GCP instance (1 vCPU, 3.75GB, Debian GNU/Linux 8)
  2. Install docker:
$ sudo apt-get install docker.io
  3. Create some test logs with the logs generator container:
$ mkdir ~/logs
$ sudo docker run -i -e "LOGS_GENERATOR_DURATION=1s" -e "LOGS_GENERATOR_LINES_TOTAL=1000000" gcr.io/google_containers/logs-generator:v0.1.0 2>&1 | awk '{print "{\"log\":\"" $0 "\"}"}' > ~/logs/log.log
  4. Create configs
$ mkdir ~/config.d
$ cat ~/config.d/config.conf
<match fluent.**>
  type null
</match>

<source>
  type tail
  format json
  time_key time
  path /var/log/containers/*.log
  pos_file /var/log/gcp-containers.log.pos
  time_format %Y-%m-%dT%H:%M:%S.%NZ
  tag reform.*
  read_from_head true
</source>

<match reform.**>
  type record_reformer
  enable_ruby true
  tag kubernetes.${tag_suffix[4].split('-')[0..-2].join('-')}
</match>

# We use 2 output stanzas - one to handle the container logs and one to handle
# the node daemon logs, the latter of which explicitly sends its logs to the
# compute.googleapis.com service rather than container.googleapis.com to keep
# them separate since most users don't care about the node logs.
<match kubernetes.**>
  type google_cloud
  # Set the buffer type to file to improve the reliability and reduce the memory consumption
  buffer_type file
  buffer_path /var/log/fluentd-buffers/kubernetes.containers.buffer
  # Set queue_full action to block because we want to pause gracefully
  # in case of the off-the-limits load instead of throwing an exception
  buffer_queue_full_action block
  # Set the chunk limit conservatively to avoid exceeding the GCL limit
  # of 10MiB per write request.
  buffer_chunk_limit 2M
  # Cap the combined memory usage of this buffer and the one below to
  # 2MiB/chunk * (6 + 2) chunks = 16 MiB
  buffer_queue_limit 6
  # Never wait more than 5 seconds before flushing logs in the non-error case.
  flush_interval 5s
  # Never wait longer than 30 seconds between retries.
  max_retry_wait 30
  # Disable the limit on the number of retries (retry forever).
  disable_retry_limit
  # Use multiple threads for processing.
  num_threads 2
</match>

# Keep a smaller buffer here since these logs are less important than the user's
# container logs.
<match **>
  type google_cloud
  detect_subservice false
  buffer_type file
  buffer_path /var/log/fluentd-buffers/kubernetes.system.buffer
  buffer_queue_full_action block
  buffer_chunk_limit 2M
  buffer_queue_limit 2
  flush_interval 5s
  max_retry_wait 30
  disable_retry_limit
  num_threads 2
</match>
  5. Start the container and check the logs.
export FLUENTD_ID=`sudo docker run -d -v $(pwd)/logs:/var/log/containers -v ~/config.d:/etc/fluent/config.d qingling128/testing:buffer-overflow-test-0-12-32 `
sudo docker logs -f $FLUENTD_ID

export FLUENTD_ID=`sudo docker run -d -v $(pwd)/logs:/var/log/containers -v ~/config.d:/etc/fluent/config.d qingling128/testing:buffer-overflow-test-0-14-13 `
sudo docker logs -f $FLUENTD_ID

Both images are built from fluentd-gcp-image. The only difference is the Fluentd version in the Gemfile. buffer-overflow-test-0-12-32 has Fluentd 0.12.32 and can process the logs successfully with the config settings above. buffer-overflow-test-0-14-13 has Fluentd 0.14.13, and it can't process the logs without buffer overflow errors unless we increase the buffer settings to:

buffer_chunk_limit 8M
buffer_queue_limit 32
num_threads 8
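
As an aside for readers: the tag rewrite in the record_reformer stanza of the config in step 4 is plain Ruby (hence enable_ruby true). A minimal sketch of what that expression does, using a hypothetical tag suffix of the usual Kubernetes container-log shape (the file name below is made up, not taken from the test data):

# Illustrative only: mirrors tag kubernetes.${tag_suffix[4].split('-')[0..-2].join('-')}
# The sample suffix is hypothetical.
tag_suffix_4 = 'synthetic-logger-pod_default_synth-lgr.log'
# split('-')[0..-2] drops the last dash-separated segment; join('-') reassembles the rest.
new_tag = "kubernetes.#{tag_suffix_4.split('-')[0..-2].join('-')}"
puts new_tag  # => kubernetes.synthetic-logger-pod_default_synth
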
@qingling128
Author

The 0.14 release note seems to imply there are some changes made to the buffer flushing mechanism. Would any of those changes cause this issue? We'd like to upgrade to 0.14 as soon as possible, but at the same time we are a bit concerned since it seems less robust.

@repeatedly
Member

Yes, the first implementation of v0.14's buffer focuses on flexibility and reliability.
Here is a rough benchmark showing that v0.14 is a bit slower than v0.12: https://www.slideshare.net/repeatedly/fluentd-v014-overview/17
After the reported issues are fixed, we will try to optimize the performance.

@repeatedly
Member

repeatedly commented Mar 2, 2017

If you give me the data set, configuration, and traffic details,
I will add it to the benchmark list for optimization.

@repeatedly
Member

Using Ruby 2.4 may mitigate the problem because Ruby 2.4 is faster than Ruby 2.3 in most cases.
Of course, this needs a benchmark test.

@qingling128
Author

Hi @repeatedly, thanks for taking a look!

For our load testing, we are tailing a 109M log file (1,000,024 lines) with logs like below:

lingshi@debian1:~/logs$ ls -lh
total 109M
-rw-r--r-- 1 lingshi lingshi 109M Mar  2 20:09 log.log
lingshi@debian1:~/logs$ tail log.log
{"log":"I0302 20:09:18.497373      11 logs_generator.go:67] 999990 GET /api/v1/namespaces/kube-system/pods/0zj 389"}
{"log":"I0302 20:09:18.497449      11 logs_generator.go:67] 999991 PUT /api/v1/namespaces/kube-system/pods/0uu 250"}
{"log":"I0302 20:09:18.497513      11 logs_generator.go:67] 999992 PUT /api/v1/namespaces/kube-system/pods/1mjc 597"}
{"log":"I0302 20:09:18.497605      11 logs_generator.go:67] 999993 POST /api/v1/namespaces/default/pods/l0j 472"}
{"log":"I0302 20:09:18.497675      11 logs_generator.go:67] 999994 GET /api/v1/namespaces/kube-system/pods/wzw 332"}
{"log":"I0302 20:09:18.497766      11 logs_generator.go:67] 999995 GET /api/v1/namespaces/ns/pods/51i 471"}
{"log":"I0302 20:09:18.497843      11 logs_generator.go:67] 999996 PUT /api/v1/namespaces/ns/pods/ym1 227"}
{"log":"I0302 20:09:18.497917      11 logs_generator.go:67] 999997 PUT /api/v1/namespaces/kube-system/pods/l2f7 322"}
{"log":"I0302 20:09:18.498011      11 logs_generator.go:67] 999998 GET /api/v1/namespaces/default/pods/mnvd 281"}
{"log":"I0302 20:09:18.498096      11 logs_generator.go:67] 999999 PUT /api/v1/namespaces/kube-system/pods/34i 297"}

The config is the same as in the previous comments. Without buffer overflow errors, it takes around 6 mins for our backend API to consume all these logs.
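
For a rough sense of the traffic this implies, here is some back-of-the-envelope arithmetic derived only from the numbers above:

lines   = 1_000_024
seconds = 6 * 60                  # ~6 minutes end to end
puts lines / seconds              # => 2777 lines/s on average
puts (109.0 / seconds).round(2)   # => 0.3 MB/s of raw log data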

@crassirostris

@repeatedly The problem is not the performance. Outages will happen and the queue will get blocked no matter how good fluentd is :) We want to make sure that fluentd survives this kind of situation.

@repeatedly added the bug and v0.14 labels Mar 6, 2017
@repeatedly
Member

I will work on it after another bug is fixed.

@qingling128
Author

@repeatedly Sounds good. Thanks!

@qingling128
Author

Hi @repeatedly, any luck so far?

@repeatedly
Member

I'm working on another issue right now, so please wait.

@repeatedly
Member

It seems the enqueue or flush thread has a problem handling the full-queue state.
I'm now investigating the problem in depth.

@repeatedly
Member

#1502 fixes the deadlock at startup.
Could you check this patch?

@qingling128
Author

@repeatedly Sure. Is there a pre-release gem I could check against?

@repeatedly
Member

Released v0.14.14.pre.1.
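
For anyone following along: one way to try the pre-release is to pin it in the image's Gemfile and rebuild, or install it with gem install fluentd -v 0.14.14.pre.1. A minimal Gemfile sketch, assuming the fluentd-gcp image's Gemfile declares the fluentd gem directly:

# Gemfile (sketch, hypothetical layout): pin the pre-release instead of the stable version
source 'https://rubygems.org'
gem 'fluentd', '0.14.14.pre.1'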

@qingling128
Author

Thanks @repeatedly !

I've just tested the new version, 0.14.14.pre.1. It seems to work fine with exactly the same config as 0.12.32.

There are no more buffer overflow errors in the log, and all the entries arrived just fine. I ran the container image 5 times to double-check. Feel free to close this ticket.

2017-03-16 15:42:55 +0000 [info]: gem 'fluentd' version '0.14.14.pre.1'
2017-03-16 15:42:55 +0000 [warn]: #0 'type' is deprecated parameter name. use '@type' instead.
2017-03-16 15:42:55 +0000 [info]: adding match pattern="fluent.**" type="null"
2017-03-16 15:42:55 +0000 [warn]: #0 'type' is deprecated parameter name. use '@type' instead.
2017-03-16 15:42:55 +0000 [info]: adding match pattern="fluent.**" type="null"
2017-03-16 15:42:55 +0000 [warn]: #0 'type' is deprecated parameter name. use '@type' instead.
2017-03-16 15:42:55 +0000 [info]: adding match pattern="reform.**" type="record_reformer"
2017-03-16 15:42:55 +0000 [warn]: #0 'type' is deprecated parameter name. use '@type' instead.
2017-03-16 15:42:55 +0000 [info]: adding match pattern="kubernetes.**" type="google_cloud"
2017-03-16 15:42:55 +0000 [info]: #0 Detected GCE platform
2017-03-16 15:42:55 +0000 [info]: #0 Logs viewer address: https://console.developers.google.com/project/google.com:sandbox-146415/logs?service=compute.googleapis.com&key1=instance&key2=3499515036930873161
2017-03-16 15:42:55 +0000 [warn]: #0 'type' is deprecated parameter name. use '@type' instead.
2017-03-16 15:42:55 +0000 [info]: adding match pattern="**" type="google_cloud"
2017-03-16 15:42:55 +0000 [info]: #0 Detected GCE platform
2017-03-16 15:42:55 +0000 [info]: #0 Logs viewer address: https://console.developers.google.com/project/google.com:sandbox-146415/logs?service=compute.googleapis.com&key1=instance&key2=3499515036930873161
2017-03-16 15:42:55 +0000 [warn]: #0 'type' is deprecated parameter name. use '@type' instead.
2017-03-16 15:42:55 +0000 [info]: adding source type="tail"
2017-03-16 15:42:55 +0000 [info]: using configuration file: <ROOT>
  <match fluent.**>
    type null
  </match>
  <match fluent.**>
    type null
  </match>
  <source>
    type tail
    format json
    time_key time
    path "/var/log/containers/*.log"
    pos_file "/var/log/gcp-containers.log.pos"
    time_format %Y-%m-%dT%H:%M:%S.%NZ
    tag "reform.*"
    read_from_head true
    <parse>
      time_key time
      time_format %Y-%m-%dT%H:%M:%S.%NZ
      @type json
      time_type string
    </parse>
  </source>
  <match reform.**>
    type record_reformer
    enable_ruby true
    tag "kubernetes.${tag_suffix[4].split(\'-\')[0..-2].join(\'-\')}"
  </match>
  <match kubernetes.**>
    type google_cloud
    buffer_type "file"
    buffer_path /var/log/fluentd-buffers/kubernetes.containers.buffer
    buffer_queue_full_action block
    buffer_chunk_limit 2M
    buffer_queue_limit 6
    flush_interval 5s
    max_retry_wait 30
    disable_retry_limit
    num_threads 2
    <buffer>
      flush_mode interval
      retry_type exponential_backoff
      @type file
      path /var/log/fluentd-buffers/kubernetes.containers.buffer
      flush_thread_count 2
      flush_interval 5s
      retry_forever
      retry_max_interval 30
      chunk_limit_size 2M
      queue_limit_length 6
      overflow_action block
    </buffer>
  </match>
  <match **>
    type google_cloud
    detect_subservice false
    buffer_type "file"
    buffer_path /var/log/fluentd-buffers/kubernetes.system.buffer
    buffer_queue_full_action block
    buffer_chunk_limit 2M
    buffer_queue_limit 2
    flush_interval 5s
    max_retry_wait 30
    disable_retry_limit
    num_threads 2
    <buffer>
      flush_mode interval
      retry_type exponential_backoff
      @type file
      path /var/log/fluentd-buffers/kubernetes.system.buffer
      flush_thread_count 2
      flush_interval 5s
      retry_forever
      retry_max_interval 30
      chunk_limit_size 2M
      queue_limit_length 2
      overflow_action block
    </buffer>
  </match>
</ROOT>
2017-03-16 15:42:55 +0000 [info]: #0 starting fluentd worker pid=17 ppid=13 worker=0
2017-03-16 15:42:55 +0000 [info]: #0 following tail of /var/log/containers/log.log
2017-03-16 15:43:05 +0000 [info]: #0 Successfully sent to Stackdriver Logging API.

@qingling128
Author

BTW, do we have a rough idea of when 0.14.14 will be released?

@repeatedly
Member

After #1515 is merged, I will release a new version.

@qingling128
Author

Sounds good. Thanks for the heads up!

@repeatedly
Member

The bug itself is fixed. Closed.
