Bug Report

Describe the bug
If you set a tail input with a large Buffer_Chunk_Size and Buffer_Max_Size, the chunks that are created and passed through Fluent Bit can be larger than the 10485760-byte maximum that Cloud Logging accepts per request; those chunks are rejected by Cloud Logging and dropped by the stackdriver output plugin with an error.

To Reproduce
[INPUT]
    Name              tail
    Read_From_Head    false
    Skip_Long_Lines   on
    Path              /var/log/containers/*.log
    Exclude_Path      /var/log/containers/*fluent*
    Tag               kube.*
    Buffer_Chunk_Size 5M
    Buffer_Max_Size   10M
    Refresh_Interval  1
    Mem_Buf_Limit     50MB
    Threaded          on
1. Have a high-volume logging container running on the same node as Fluent Bit.
2. The tail input successfully reads all messages from the container (this can be verified by checking the Prometheus metrics):
   fluentbit_input_records_total{name="tail.0"} 125000002
3. out_stackdriver fails to create properly sized requests to Cloud Logging, and most of the records are dropped by the out_stackdriver plugin.

(This can most likely happen in any situation where a Fluent Bit chunk is greater than 10485760 bytes; in Fluent Bit, chunks can be up to around 2MB.)

Expected behavior
The out_stackdriver plugin should batch Cloud Logging payloads itself and not rely on the incoming chunk being below the 10485760-byte limit. I believe Fluent Bit chunks can be around 2MB based on https://docs.fluentbit.io/manual/v/1.8/administration/buffering-and-storage#chunks

Your Environment

Additional context
FYI - #1938 mentions a potential solution, but work would need to be done at the base output plugin level if we didn't want to batch in out_stackdriver directly. That looks like an involved change, and that issue has been open for 4.5 years.
This is a problem we have tried but failed to fix in the past. I believe it affects numerous other output plugins.
The root of the problem is that the size of a chunk doesn't equate to the size of the Cloud Logging payload, and we can't predict it accurately enough to do any intelligent batching. A msgpack payload of a given size tells you very little about how large the payload will be once it is converted to JSON, since JSON is a much more expensive way to represent the same data.
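To make that size mismatch concrete, here is a minimal sketch (plain Python using the third-party msgpack package, nothing from the Fluent Bit codebase) that compares the msgpack size of some synthetic records with the size of a JSON request body built from the same records:

import json
import msgpack  # third-party msgpack-python package

# Synthetic records standing in for what a tail chunk might contain.
records = [
    {
        "time": 1700000000.0 + i,
        "stream": "stdout",
        "log": "x" * 200,
        "kubernetes": {"pod_name": "high-volume-app", "namespace_name": "default"},
    }
    for i in range(10000)
]

msgpack_size = sum(len(msgpack.packb(r)) for r in records)          # roughly what the chunk stores
json_size = len(json.dumps({"entries": records}).encode("utf-8"))   # roughly what the HTTP request carries

print(msgpack_size, json_size)  # the JSON body comes out larger, by a record-dependent factor

The ratio between the two depends entirely on the record contents (key names, nesting, string escaping), which is why chunk size alone can't tell the plugin whether the resulting request will stay under the 10485760-byte limit.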
The road I went down last time I tried to fix this was to come up with a rough heuristic for how big a chunk could be before it became too big for a Cloud Logging request payload. In that scenario, I would split the chunk in half, and recursively do this on each half of the payload until we end up with a list of Cloud Logging requests that would make it through. This change is non-trivial; in particular, I remember that trying to split the event chunks in half was a rat's nest. (Maybe it would be easier with the log_event_decoder API that exists now 🤔)
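For illustration only, the recursive split could look something like the sketch below (plain Python rather than the plugin's C, with estimate_request_size() as a hypothetical stand-in for whatever heuristic sizes the serialized Cloud Logging request):

import json

MAX_REQUEST_BYTES = 10485760  # Cloud Logging's per-request payload limit

def estimate_request_size(entries):
    # Hypothetical stand-in heuristic: size of a JSON-encoded request body.
    return len(json.dumps({"entries": entries}).encode("utf-8"))

def split_into_requests(entries):
    # Recursively halve the entry list until every piece is estimated to fit.
    if not entries:
        return []
    if len(entries) == 1 or estimate_request_size(entries) <= MAX_REQUEST_BYTES:
        return [entries]
    mid = len(entries) // 2
    return split_into_requests(entries[:mid]) + split_into_requests(entries[mid:])

Note the len(entries) == 1 case: a single entry that is itself over the limit can't be split any further, so it would still have to be dropped or truncated. The hard part in the real plugin is that a chunk is a packed msgpack event stream, so "split it in half" means walking and re-packing events rather than slicing a list.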
The idea in the issue mentioned above would probably be better. I'll see if I can engage the Fluent Bit maintainers and find out if they have any ideas as well.