Fluentd memory leak in Kubernetes with "buffer_type file" #2236
The problem is in the klog implementation, right?
I haven't managed to reproduce it outside the k8s world. Does Fluentd support this rotation mechanism? I just set up a regular VM and some logs to test with.
in_tail itself handles the truncated case: fluentd/lib/fluent/plugin/in_tail.rb, line 589 at commit 35ef6c7.
But I will re-check the object management for the truncated case.
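Not the actual in_tail source, but a minimal, self-contained sketch of the general truncation-handling idea being referenced here (path and polling interval are made up): track the last read position, and if the file suddenly becomes smaller than that position, assume it was truncated and restart from offset 0.

```ruby
# Hypothetical illustration only -- not fluentd's in_tail code.
path = "./output.log"   # assumed file for the example
pos  = 0                 # last byte offset we have read up to

loop do
  unless File.exist?(path)
    sleep 1
    next
  end
  pos = 0 if File.size(path) < pos   # file shrank below our position => truncated
  File.open(path) do |f|
    f.seek(pos)
    f.each_line { |line| print line }   # a real plugin would emit events here
    pos = f.pos                          # remember where we stopped
  end
  sleep 1
end
```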
Good. I want to know which part causes this problem: the configuration, klog, k8s, or GKE.
Still waiting for the memory trend to show up. I haven't seen an obvious leaking pattern yet. But I realized I forgot to specify one of the settings.
Hi, I work with @qingling128. We had a customer report high CPU usage with fluentd running outside Kubernetes, and it had in common with this issue that they were using a similar log rotation setup. I reached the same conclusion as you, that in_tail should handle this correctly, but I was able to confirm the CPU spikes in the Ruby interpreter (so not in logrotate or the user application), where no spikes were expected based on the configuration, which had fluentd watch a single file.
For reference, the related logrotate config and in_tail config that exhibit the log rotation CPU issue are like below:
Log Rotation Setting
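The attached snippet did not survive here; a hedged sketch of the kind of copytruncate-style rotation being described (copytruncate is suggested by the truncate-based reproduction scripts later in the thread; path, size, and count are assumptions):

```
# /etc/logrotate.d/app (hypothetical values)
/var/log/app/app.log {
    size 100M
    rotate 5
    copytruncate
    missingok
    notifempty
}
```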
In Tail Config
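Likewise, a hedged sketch of an in_tail source watching a single file, as described above (path, pos_file, and tag are assumptions):

```
<source>
  @type tail
  path /var/log/app/app.log
  pos_file /var/lib/fluentd/app.log.pos
  tag app.log
  <parse>
    @type none
  </parse>
</source>
```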
So there is no spike without it?
@repeatedly Correct.
@repeatedly Any luck with reproducing the memory leak or CPU spike?
Sorry for the late reply. I took new year holidays. I will start working tomorrow.
@repeatedly Cool cool. Thanks for the heads up!
I tested a simple truncate case for several hours, but saw no memory leak.

```ruby
# Five writer threads; each appends lines to its own log file and
# periodically truncates it, mimicking copytruncate-style rotation.
5.times.map { |i|
  Thread.new {
    loop {
      f = File.open("./output_#{i}.log", "a")
      10.times {
        f.puts "hello"
        f.flush
        sleep 1
      }
      f.truncate(0)
      f.close
    }
  }
}.each { |t| t.join }
```

I will test a more complicated truncate case next.
Now testing on my Ubuntu 18.04 box with fluentd v1.3.2 and ruby 2.4.5.
I tested for several hours, but saw no memory leak and no CPU spike.

Env:
OS: Ubuntu 18.04

conf:
logrotate:
Call logrotate every 10 seconds:
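The conf, logrotate settings, and the invocation command were attached as blocks that did not survive here; a hedged reconstruction of that kind of test loop (file names, sizes, and the output plugin are assumptions, not the exact test setup):

```
# fluent.conf (hypothetical)
<source>
  @type tail
  path ./log/output*.log
  pos_file ./log/test.pos
  tag test
  <parse>
    @type none
  </parse>
</source>

<match test>
  @type stdout
</match>
```

```
# logrotate.conf (hypothetical)
/path/to/log/output*.log {
    size 1M
    rotate 3
    copytruncate
    missingok
}
```

```
# call logrotate every 10 seconds
while true; do
  logrotate -f -s ./logrotate.status ./logrotate.conf
  sleep 10
done
```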
Generate logs with the following script:

```ruby
# Three writer threads, one per log file, each appending a long line
# every 10 ms to keep log rotation busy.
paths = 3.times.map { |i| "./log/output#{i + 1}.log" }
ths = paths.map { |path|
  Thread.new {
    f = File.open(path, "a")
    loop {
      f.puts "hello fluuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuend world! " * 10
      f.flush
      sleep 0.01
    }
    f.close   # never reached; the loop above runs forever
  }
}
ths.each { |t| t.join }
```
I'm not sure about the filter/output plugins. Does using a simpler output still show the problem?
Our client was experiencing this issue with older versions of Ubuntu (14.04.1).
Okay. Will check on Ubuntu 14.04.
I'll try the null output plugin.
Didn't reproduce the issue on Ubuntu 14.04. Need more detailed information.
Our client was using fluentd version 1.2.5 and Stackdriver agent 1.6.0-1 on Ubuntu 14.04.1.
How about the null plugin case on your side? And I want to know whether my test case is good or not: #2236 (comment)
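For readers following the thread, the "null plugin" test means replacing the real output with one that simply discards events, so in_tail and the buffering layer can be observed in isolation. A minimal sketch (the match pattern is an assumption):

```
<match **>
  @type null
</match>
```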
Sorry for the delay. I got pulled away by some urgent matters. Just set up two clusters, one with https://github.com/GoogleCloudPlatform/fluent-plugin-google-cloud and one with the null plugin. Will check the memory growth after a day.
Hmm... I tried the fake-output-plugin, and there seems to be no memory leak. I'm not exactly sure what caused the leak, but it might be something in our output plugin. I'll close this ticket for now and investigate more. If it turns out to be something Fluentd-core related, I'll reopen this ticket. BTW, if there are any recommended tools for Ruby memory debugging/profiling, please feel free to advise. I was using https://github.com/ruby-prof/ruby-prof before. It did not help much with the memory leak though (or maybe I used it the wrong way).
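Not something suggested in the thread, but one common approach for this kind of leak hunt is Ruby's built-in ObjectSpace heap dumps, taken at two points in time and diffed; a minimal sketch (output path is arbitrary):

```ruby
require "objspace"

# Record file/line for every allocation so heap dumps are attributable.
ObjectSpace.trace_object_allocations_start

# ... run the workload that shows the growth, then take a snapshot:
GC.start
File.open("/tmp/heap-#{Process.pid}-#{Time.now.to_i}.json", "w") do |io|
  ObjectSpace.dump_all(output: io)   # one JSON line per live Ruby object
end

# A quick live-object count is often enough to confirm growth over time.
puts GC.stat[:heap_live_slots]
```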
@nurse Do you have any ideas for debugging this memory leak? I assume a C extension or something has a problem.
We once had a memory leak (that one was fixed) in the C core code of the gRPC gem. Let me give it a quick try to turn gRPC off and use REST instead.
After some investigation, it turned out the measurement in #2236 (comment) did not have exactly the same buffer settings between the out_google_cloud plugin and the fake plugin (https://github.com/qingling128/fluent-plugin-buffer-output-stub/blob/master/lib/fluent/plugin/out_buffer_output_stub.rb). As I dug further, the memory leak exists for the fake gem as well. When the buffer type is memory, there does not seem to be a leak.
[lingshi-memory-buffer-fake-gem]
[lingshi-file-buffer-fake-gem]
However, when I changed the buffer configuration to specify a certain limit, the memory growth is quite obvious.
I'm not sure if it matters, but I also tried the following two configurations. They both showed a growing pattern, though it might be too early (only 3 hours) to draw a conclusion.
[lingshi-file-fake-gem-block]
Might there be some leak in the buffer file handler? BTW, not sure if this is related at all, but this reminds me of one observation we had with buffer files before:
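For context, a hedged sketch of the two buffer styles being compared above, written with v0.12-style buffer parameters like the fluentd-gcp configmap uses; the plugin type name (taken from the stub repo's file name), paths, and limits are assumptions, not the exact values from the test:

```
# Variant A: memory buffer (no leak observed)
<match **>
  @type buffer_output_stub   # assumed type name for the fake plugin
  buffer_type memory
  flush_interval 5s
</match>

# Variant B: file buffer with explicit limits (growth observed)
# (use one variant or the other, not both)
<match **>
  @type buffer_output_stub
  buffer_type file
  buffer_path /var/log/fluentd-buffers/test.buffer
  buffer_chunk_limit 512KB
  buffer_queue_limit 32
  flush_interval 5s
</match>
```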
Ah, I missed this notification.
Good to know. Thanks!
@repeatedly Any luck with reproducing this issue?
Currently, I have tested your configurations on our Mac, but no luck. Here is the test conf:
Not sure if this is related, but someone mentioned some CPU usage benefit with threading when using wildcards in the buffer_path: Stackdriver/kubernetes-configs#16 (comment).
On the other hand, I tried setting up a Kubernetes cluster with the dummy input plugin:
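The source block did not survive here; a hedged sketch of a dummy input at the message rate mentioned later in the thread (tag and record contents are assumptions):

```
<source>
  @type dummy
  tag test.dummy
  rate 1000                                        # ~1000 msg/sec, matching the figure discussed below
  dummy {"message": "dummy log line for leak testing"}
</source>
```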
That further confirmed the leak is probably in the buffered output plugin.
This is interesting. We will check it.
Other metrics are healthy, right? e.g. low CPU usage.
The dentry cache is not increasing?
Does this imply
This means 1000 msg/sec from the in_dummy and fake output combination consumes 70% CPU, right?
If your pipeline uses
Oops, missed the previous email.
Yes
Good to know. Will give that a try.
I observed that Kubernetes v1.10.0 includes kubernetes/kubernetes#58899, which suggests that the log file will be reopened after it is rotated.
Any progress on that issue?
I was just looking at this issue again recently. The latest hypothesis is that the Fluentd buffer files have the .log postfix, so the Kubernetes log rotation picks them up as well.
[Sample Fluentd buffer file directory]
@repeatedly Do you think this might be a possible cause?
Ah, really? It means k8s automatically scans directories and rotates the log files it finds?
@repeatedly - Seems like that is the cause: k8s scans files under the log directory. Can we change the buffer file suffix?
We will move the buffer file path out of the rotated log directory for now.
@repeatedly - Thanks for the quick fix. This is very helpful!
buf_file: Add path_suffix parameter. ref #2236 (Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>)
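For anyone hitting the same problem, a sketch of how the path_suffix option added above can be used so buffer chunk files no longer carry the default .log suffix; the surrounding match/buffer layout and paths are assumptions, and the exact behavior should be checked against the buf_file docs for your Fluentd version:

```
<match **>
  @type google_cloud
  <buffer>
    @type file
    path /var/log/fluentd-buffers/kubernetes.containers.buffer
    path_suffix .buf   # default is .log, which the kubelet log rotation can pick up
  </buffer>
</match>
```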
[Fluentd version]
v1.2.5
[Environment]
Kubernetes
[Configuration]
https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/fluentd-gcp/fluentd-gcp-configmap.yaml
[Problems]
We are seeing Fluentd memory leaks with Kubernetes versions >= v1.10. After some investigation, we realized the log rotation mechanism changed between Kubernetes v1.9 and v1.10, which might trigger the memory leak. Our guess is that the logs are rotated in a way that Fluentd does not support or handle well.
Some experiments with exactly the same Fluentd version and output plugin version, but different Kubernetes versions, are shown below:
Is there any known limitation of the log rotation mechanism that might trigger this? We reported a similar issue before, but I've verified that we are using a Fluentd version with that fix.
[The log rotation manager]
In case it helps, the log rotation mechanism is https://github.com/kubernetes/kubernetes/blob/a3ccea9d8743f2ff82e41b6c2af6dc2c41dc7b10/pkg/kubelet/logs/container_log_manager.go#L150.