
Fluentd 0.14.13 gets buffer overflow errors with the same settings as Fluentd 0.12.32 #1485

Closed
qingling128 opened this issue Mar 1, 2017 · 19 comments
Labels
bug, v0.14

Comments

@qingling128

During some basic load testing, we found that upgrading from Fluentd 0.12.32 to 0.14.13 caused buffer overflow errors to be triggered very frequently.

[Error]

2017-03-01 17:02:20 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:block

[Steps to reproduce]

  1. Create a GCP instance (1 vCPU, 3.75GB, Debian GNU/Linux 8)
  2. Install docker:
$ sudo apt-get install docker.io
  3. Create some test logs with the logs generator container:
$ mkdir ~/logs
$ sudo docker run -i -e "LOGS_GENERATOR_DURATION=1s" -e "LOGS_GENERATOR_LINES_TOTAL=1000000" gcr.io/google_containers/logs-generator:v0.1.0 2>&1 | awk '{print "{\"log\":\"" $0 "\"}"}' > ~/logs/log.log
  4. Create configs
$ mkdir ~/config.d
$ cat ~/config.d/config.conf
<match fluent.**>
  type null
</match>

<source>
  type tail
  format json
  time_key time
  path /var/log/containers/*.log
  pos_file /var/log/gcp-containers.log.pos
  time_format %Y-%m-%dT%H:%M:%S.%NZ
  tag reform.*
  read_from_head true
</source>

<match reform.**>
  type record_reformer
  enable_ruby true
  tag kubernetes.${tag_suffix[4].split('-')[0..-2].join('-')}
</match>

# We use 2 output stanzas - one to handle the container logs and one to handle
# the node daemon logs, the latter of which explicitly sends its logs to the
# compute.googleapis.com service rather than container.googleapis.com to keep
# them separate since most users don't care about the node logs.
<match kubernetes.**>
  type google_cloud
  # Set the buffer type to file to improve the reliability and reduce the memory consumption
  buffer_type file
  buffer_path /var/log/fluentd-buffers/kubernetes.containers.buffer
  # Set queue_full action to block because we want to pause gracefully
  # in case of the off-the-limits load instead of throwing an exception
  buffer_queue_full_action block
  # Set the chunk limit conservatively to avoid exceeding the GCL limit
  # of 10MiB per write request.
  buffer_chunk_limit 2M
  # Cap the combined memory usage of this buffer and the one below to
  # 2MiB/chunk * (6 + 2) chunks = 16 MiB
  buffer_queue_limit 6
  # Never wait more than 5 seconds before flushing logs in the non-error case.
  flush_interval 5s
  # Never wait longer than 30 seconds between retries.
  max_retry_wait 30
  # Disable the limit on the number of retries (retry forever).
  disable_retry_limit
  # Use multiple threads for processing.
  num_threads 2
</match>

# Keep a smaller buffer here since these logs are less important than the user's
# container logs.
<match **>
  type google_cloud
  detect_subservice false
  buffer_type file
  buffer_path /var/log/fluentd-buffers/kubernetes.system.buffer
  buffer_queue_full_action block
  buffer_chunk_limit 2M
  buffer_queue_limit 2
  flush_interval 5s
  max_retry_wait 30
  disable_retry_limit
  num_threads 2
</match>
  5. Start the container and check the logs.
export FLUENTD_ID=`sudo docker run -d -v $(pwd)/logs:/var/log/containers -v ~/config.d:/etc/fluent/config.d qingling128/testing:buffer-overflow-test-0-12-32 `
sudo docker logs -f $FLUENTD_ID

export FLUENTD_ID=`sudo docker run -d -v $(pwd)/logs:/var/log/containers -v ~/config.d:/etc/fluent/config.d qingling128/testing:buffer-overflow-test-0-14-13 `
sudo docker logs -f $FLUENTD_ID

Both images are built from fluentd-gcp-image. The only difference is the Fluentd version in the Gemfile. buffer-overflow-test-0-12-32 has Fluentd 0.12.32 and can process the logs successfully with the config settings above. buffer-overflow-test-0-14-13 has Fluentd 0.14.13, and it can't process the logs without buffer overflow errors unless we increase the buffer settings to:

buffer_chunk_limit 8M
buffer_queue_limit 32
num_threads 8
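
As an aside for readers: the tag rewrite in the record_reformer stanza of the config in step 4 is plain Ruby (hence enable_ruby true). A minimal sketch of what that expression does, using a hypothetical tag suffix of the usual Kubernetes container-log shape (the file name below is made up, not taken from the test data):

# Illustrative only: mirrors tag kubernetes.${tag_suffix[4].split('-')[0..-2].join('-')}
# The sample suffix is hypothetical.
tag_suffix_4 = 'synthetic-logger-pod_default_synth-lgr.log'
# split('-')[0..-2] drops the last dash-separated segment; join('-') reassembles the rest.
new_tag = "kubernetes.#{tag_suffix_4.split('-')[0..-2].join('-')}"
puts new_tag  # => kubernetes.synthetic-logger-pod_default_synth
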
@qingling128
Author

The 0.14 release note seems to imply there are some changes made to the buffer flushing mechanism. Would any of those changes cause this issue? We'd like to upgrade to 0.14 as soon as possible, but at the same time we are a bit concerned since it seems less robust.

@repeatedly
Member

Yes, the first implementation of v0.14's buffer focuses on flexibility and reliability.
Here is a rough benchmark showing that v0.14 is a bit slower than v0.12: https://www.slideshare.net/repeatedly/fluentd-v014-overview/17
After the reported issues are fixed, we will try to optimize the performance.

@repeatedly
Member

repeatedly commented Mar 2, 2017

If you give me the data set, configuration, and traffic details,
I will add it to the benchmark list for optimization.

@repeatedly
Member

Using Ruby 2.4 may mitigate the problem because Ruby 2.4 is faster than Ruby 2.3 in most cases.
Of course, this needs a benchmark test.

@qingling128
Author

Hi @repeatedly, thanks for taking a look!

For our load testing, we are tailing a 109M log file (1,000,024 lines) with logs like below:

lingshi@debian1:~/logs$ ls -lh
total 109M
-rw-r--r-- 1 lingshi lingshi 109M Mar  2 20:09 log.log
lingshi@debian1:~/logs$ tail log.log
{"log":"I0302 20:09:18.497373      11 logs_generator.go:67] 999990 GET /api/v1/namespaces/kube-system/pods/0zj 389"}
{"log":"I0302 20:09:18.497449      11 logs_generator.go:67] 999991 PUT /api/v1/namespaces/kube-system/pods/0uu 250"}
{"log":"I0302 20:09:18.497513      11 logs_generator.go:67] 999992 PUT /api/v1/namespaces/kube-system/pods/1mjc 597"}
{"log":"I0302 20:09:18.497605      11 logs_generator.go:67] 999993 POST /api/v1/namespaces/default/pods/l0j 472"}
{"log":"I0302 20:09:18.497675      11 logs_generator.go:67] 999994 GET /api/v1/namespaces/kube-system/pods/wzw 332"}
{"log":"I0302 20:09:18.497766      11 logs_generator.go:67] 999995 GET /api/v1/namespaces/ns/pods/51i 471"}
{"log":"I0302 20:09:18.497843      11 logs_generator.go:67] 999996 PUT /api/v1/namespaces/ns/pods/ym1 227"}
{"log":"I0302 20:09:18.497917      11 logs_generator.go:67] 999997 PUT /api/v1/namespaces/kube-system/pods/l2f7 322"}
{"log":"I0302 20:09:18.498011      11 logs_generator.go:67] 999998 GET /api/v1/namespaces/default/pods/mnvd 281"}
{"log":"I0302 20:09:18.498096      11 logs_generator.go:67] 999999 PUT /api/v1/namespaces/kube-system/pods/34i 297"}

The config is the same as in the previous comments. Without buffer overflow errors, it takes around 6 mins for our backend API to consume all these logs.
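
For a rough sense of the traffic this implies, here is some back-of-the-envelope arithmetic derived only from the numbers above:

lines   = 1_000_024
seconds = 6 * 60                  # ~6 minutes end to end
puts lines / seconds              # => 2777 lines/s on average
puts (109.0 / seconds).round(2)   # => 0.3 MB/s of raw log data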

@crassirostris

@repeatedly The problem is not the performance. Outages will happen and the queue will get blocked no matter how good fluentd is :) We want to make sure that fluentd survives this kind of situation.

@repeatedly added the bug and v0.14 labels Mar 6, 2017
@repeatedly
Member

I will work on it after another bug is fixed.

@qingling128
Author

@repeatedly Sounds good. Thanks!

@qingling128
Author

Hi @repeatedly, any luck so far?

@repeatedly
Member

I'm working on another issue right now, so please wait.

@repeatedly
Member

It seems the enqueue or flush thread has a problem handling the full-queue state.
I'm now investigating the problem in depth.

@repeatedly
Member

#1502 fixes the deadlock at startup.
Could you check this patch?

@qingling128
Author

@repeatedly Sure. Is there a pre-release gem I could check against?

@repeatedly
Member

Released v0.14.14.pre.1.
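
For anyone following along: one way to try the pre-release is to pin it in the image's Gemfile and rebuild, or install it with gem install fluentd -v 0.14.14.pre.1. A minimal Gemfile sketch, assuming the fluentd-gcp image's Gemfile declares the fluentd gem directly:

# Gemfile (sketch, hypothetical layout): pin the pre-release instead of the stable version
source 'https://rubygems.org'
gem 'fluentd', '0.14.14.pre.1'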

@qingling128
Author

Thanks @repeatedly !

I've just tested the new version, 0.14.14.pre.1. It seems to work fine with exactly the same config as 0.12.32.

There are no more buffer overflow errors in the log, and all the entries arrived just fine. I ran the container image 5 times to double-check. Feel free to close this ticket.

2017-03-16 15:42:55 +0000 [info]: gem 'fluentd' version '0.14.14.pre.1'
2017-03-16 15:42:55 +0000 [warn]: #0 'type' is deprecated parameter name. use '@type' instead.
2017-03-16 15:42:55 +0000 [info]: adding match pattern="fluent.**" type="null"
2017-03-16 15:42:55 +0000 [warn]: #0 'type' is deprecated parameter name. use '@type' instead.
2017-03-16 15:42:55 +0000 [info]: adding match pattern="fluent.**" type="null"
2017-03-16 15:42:55 +0000 [warn]: #0 'type' is deprecated parameter name. use '@type' instead.
2017-03-16 15:42:55 +0000 [info]: adding match pattern="reform.**" type="record_reformer"
2017-03-16 15:42:55 +0000 [warn]: #0 'type' is deprecated parameter name. use '@type' instead.
2017-03-16 15:42:55 +0000 [info]: adding match pattern="kubernetes.**" type="google_cloud"
2017-03-16 15:42:55 +0000 [info]: #0 Detected GCE platform
2017-03-16 15:42:55 +0000 [info]: #0 Logs viewer address: https://console.developers.google.com/project/google.com:sandbox-146415/logs?service=compute.googleapis.com&key1=instance&key2=3499515036930873161
2017-03-16 15:42:55 +0000 [warn]: #0 'type' is deprecated parameter name. use '@type' instead.
2017-03-16 15:42:55 +0000 [info]: adding match pattern="**" type="google_cloud"
2017-03-16 15:42:55 +0000 [info]: #0 Detected GCE platform
2017-03-16 15:42:55 +0000 [info]: #0 Logs viewer address: https://console.developers.google.com/project/google.com:sandbox-146415/logs?service=compute.googleapis.com&key1=instance&key2=3499515036930873161
2017-03-16 15:42:55 +0000 [warn]: #0 'type' is deprecated parameter name. use '@type' instead.
2017-03-16 15:42:55 +0000 [info]: adding source type="tail"
2017-03-16 15:42:55 +0000 [info]: using configuration file: <ROOT>
  <match fluent.**>
    type null
  </match>
  <match fluent.**>
    type null
  </match>
  <source>
    type tail
    format json
    time_key time
    path "/var/log/containers/*.log"
    pos_file "/var/log/gcp-containers.log.pos"
    time_format %Y-%m-%dT%H:%M:%S.%NZ
    tag "reform.*"
    read_from_head true
    <parse>
      time_key time
      time_format %Y-%m-%dT%H:%M:%S.%NZ
      @type json
      time_type string
    </parse>
  </source>
  <match reform.**>
    type record_reformer
    enable_ruby true
    tag "kubernetes.${tag_suffix[4].split(\'-\')[0..-2].join(\'-\')}"
  </match>
  <match kubernetes.**>
    type google_cloud
    buffer_type "file"
    buffer_path /var/log/fluentd-buffers/kubernetes.containers.buffer
    buffer_queue_full_action block
    buffer_chunk_limit 2M
    buffer_queue_limit 6
    flush_interval 5s
    max_retry_wait 30
    disable_retry_limit
    num_threads 2
    <buffer>
      flush_mode interval
      retry_type exponential_backoff
      @type file
      path /var/log/fluentd-buffers/kubernetes.containers.buffer
      flush_thread_count 2
      flush_interval 5s
      retry_forever
      retry_max_interval 30
      chunk_limit_size 2M
      queue_limit_length 6
      overflow_action block
    </buffer>
  </match>
  <match **>
    type google_cloud
    detect_subservice false
    buffer_type "file"
    buffer_path /var/log/fluentd-buffers/kubernetes.system.buffer
    buffer_queue_full_action block
    buffer_chunk_limit 2M
    buffer_queue_limit 2
    flush_interval 5s
    max_retry_wait 30
    disable_retry_limit
    num_threads 2
    <buffer>
      flush_mode interval
      retry_type exponential_backoff
      @type file
      path /var/log/fluentd-buffers/kubernetes.system.buffer
      flush_thread_count 2
      flush_interval 5s
      retry_forever
      retry_max_interval 30
      chunk_limit_size 2M
      queue_limit_length 2
      overflow_action block
    </buffer>
  </match>
</ROOT>
2017-03-16 15:42:55 +0000 [info]: #0 starting fluentd worker pid=17 ppid=13 worker=0
2017-03-16 15:42:55 +0000 [info]: #0 following tail of /var/log/containers/log.log
2017-03-16 15:43:05 +0000 [info]: #0 Successfully sent to Stackdriver Logging API.

@qingling128
Author

BTW, do we have a rough idea of when 0.14.14 will be released?

@repeatedly
Member

After #1515 is merged, I will release a new version.

@qingling128
Author

Sounds good. Thanks for the heads up!

@repeatedly
Member

The bug itself is fixed. Closed.
