Bug Report
Describe the bug
We are seeing fluent-bit crashes in our Kubernetes cluster.
The crashes started when we upgraded from fluent-bit version 4.0.3 to 4.0.7.
I also upgraded our clusters from Kubernetes v1.32.7 to v1.33.3 at the same time, but to rule that out I downgraded fluent-bit back to 4.0.3 and the crashes stopped.
I then worked through the fluent-bit point releases to narrow down which one introduced the issue.
There were no crashes with fluent-bit versions 4.0.4, 4.0.5, or 4.0.6 either.
It was only once I reached version 4.0.7 that the crashes started again.
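For each step I upgraded the DaemonSet to the next image tag and watched for restarts, roughly like this (release name, namespace, and values file here are ours, so treat it as a sketch):
$ helm repo add fluent https://fluent.github.io/helm-charts
$ helm upgrade fluent-bit fluent/fluent-bit --namespace logging \
    --set image.tag=4.0.4 --values values.yaml
$ kubectl -n logging get pods -l app.kubernetes.io/name=fluent-bit -w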
I have the output of a few of the crashes from yesterday using version 4.0.7 below.
Fluent Bit v4.0.7
* Copyright (C) 2015-2025 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
______ _ _ ______ _ _ ___ _____
| ___| | | | | ___ (_) | / || _ |
| |_ | |_ _ ___ _ __ | |_ | |_/ /_| |_ __ __/ /| || |/' |
| _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / / /_| || /| |
| | | | |_| | __/ | | | |_ | |_/ / | |_ \ V /\___ |\ |_/ /
\_| |_|\__,_|\___|_| |_|\__| \____/|_|\__| \_/ |_(_)___/
[2025/08/13 05:22:47] [engine] caught signal (SIGSEGV)
#0 0x7d364b5cbabb in ???() at ???:0
#1 0x7d364b49413e in ???() at ???:0
#2 0x7d364b76c705 in ???() at ???:0
#3 0x7d364b766c1f in ???() at ???:0
#4 0x7d364b767106 in ???() at ???:0
#5 0x7d364b7489f2 in ???() at ???:0
#6 0x5e229c9e0489 in tls_net_write() at src/tls/openssl.c:1011
#7 0x5e229c9e1216 in flb_tls_net_write_async() at src/tls/flb_tls.c:487
#8 0x5e229c9f09cc in flb_io_net_write() at src/flb_io.c:699
#9 0x5e229c9f1dd5 in flb_http_do_request() at src/flb_http_client.c:1396
#10 0x5e229c9f269c in flb_http_do() at src/flb_http_client.c:1530
#11 0x5e229caba233 in http_post() at plugins/out_http/http.c:284
#12 0x5e229cabb717 in cb_http_flush() at plugins/out_http/http.c:641
#13 0x5e229d037166 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
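(The ???() frames are in system libraries that the built-in stack trace could not symbolize. If a fully symbolized backtrace would help, I can try capturing a core dump and resolving it with something like the following; the core path is hypothetical, and this assumes gdb is available alongside the binary from the official image.)
$ gdb -batch -ex 'thread apply all bt full' /fluent-bit/bin/fluent-bit /tmp/core.fluent-bit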
The next one is somewhat similar in that its frames #0 to #5 correspond to frames #8 to #13 above.
[2025/08/13 05:46:27] [engine] caught signal (SIGSEGV)
#0 0x5f217362c9cc in flb_io_net_write() at src/flb_io.c:699
#1 0x5f217362ddd5 in flb_http_do_request() at src/flb_http_client.c:1396
#2 0x5f217362e69c in flb_http_do() at src/flb_http_client.c:1530
#3 0x5f21736f6233 in http_post() at plugins/out_http/http.c:284
#4 0x5f21736f7717 in cb_http_flush() at plugins/out_http/http.c:641
#5 0x5f2173c73166 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#6 0xffffffffffffffff in ???() at ???:0
And this crash is totally different from the others...
[2025/08/13 05:54:13] [engine] caught signal (SIGSEGV)
#0 0x5bd101c0abf1 in ares_htable_find() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable.c:184
#1 0x5bd101c0b2d1 in ares_htable_get() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable.c:381
#2 0x5bd101bfbc3d in ares_htable_szvp_get() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable_szvp.c:155
#3 0x5bd101bfbc88 in ares_htable_szvp_get_direct() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable_szvp.c:169
#4 0x5bd101bf3448 in process_answer() at lib/c-ares-1.34.4/src/lib/ares_process.c:730
#5 0x5bd101bf3448 in read_answers() at lib/c-ares-1.34.4/src/lib/ares_process.c:553
#6 0x5bd101bf3448 in process_read() at lib/c-ares-1.34.4/src/lib/ares_process.c:587
#7 0x5bd101bf3448 in ares_process_fds_nolock() at lib/c-ares-1.34.4/src/lib/ares_process.c:225
#8 0x5bd101bf3d28 in ares_process_fds_nolock() at lib/c-ares-1.34.4/src/lib/ares_process.c:260
#9 0x5bd101bf3d28 in ares_process_fds() at lib/c-ares-1.34.4/src/lib/ares_process.c:257
#10 0x5bd101bf3d8b in ares_process_fd() at lib/c-ares-1.34.4/src/lib/ares_process.c:284
#11 0x5bd10181fba3 in flb_net_getaddrinfo_event_handler() at src/flb_network.c:915
#12 0x5bd10181fba3 in flb_net_getaddrinfo_event_handler() at src/flb_network.c:904
#13 0x5bd10181635d in output_thread() at src/flb_output_thread.c:318
#14 0x5bd10183142d in step_callback() at src/flb_worker.c:43
#15 0x7b502ad551f4 in ???() at ???:0
#16 0x7b502add589b in ???() at ???:0
And here is a crash from today, after I had worked back up to version 4.0.7 again (having retried 4.0.3 onwards):
[2025/08/14 04:01:14] [ warn] [msgpack2json] unknown msgpack type 601716864
[2025/08/14 04:01:14] [engine] caught signal (SIGSEGV)
#0 0x631737167f12 in msgpack2json() at src/flb_pack.c:737
#1 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#2 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#3 0x631737167f12 in msgpack2json() at src/flb_pack.c:737
#4 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#5 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#6 0x631737167f12 in msgpack2json() at src/flb_pack.c:737
#7 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#8 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#9 0x631737167f12 in msgpack2json() at src/flb_pack.c:737
#10 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#11 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#12 0x631737167f12 in msgpack2json() at src/flb_pack.c:737
#13 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#14 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#15 0x631737167f12 in msgpack2json() at src/flb_pack.c:737
#16 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#17 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#18 0x631737167f12 in msgpack2json() at src/flb_pack.c:737
#19 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#20 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#21 0x631737167f12 in msgpack2json() at src/flb_pack.c:737
#22 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#23 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#24 0x631737167f12 in msgpack2json() at src/flb_pack.c:737
#25 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#26 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#27 0x631737167f12 in msgpack2json() at src/flb_pack.c:737
#28 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#29 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#30 0x631737167f12 in msgpack2json() at src/flb_pack.c:737
#31 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#32 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#33 0x631737167f12 in msgpack2json() at src/flb_pack.c:737
#34 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#35 0x6317371684c1 in msgpack2json() at src/flb_pack.c:775
#36 0x631737167f12 in msgpack2json() at src/flb_pack.c:737
#37 0x63173716970a in flb_msgpack_to_json() at src/flb_pack.c:812
#38 0x63173716981f in flb_msgpack_raw_to_json_sds() at src/flb_pack.c:852
#39 0x63173716a181 in flb_pack_msgpack_to_json_format() at src/flb_pack.c:1225
#40 0x6317372883c3 in compose_payload() at plugins/out_http/http.c:441
#41 0x6317372886b9 in cb_http_flush() at plugins/out_http/http.c:631
#42 0x631737804166 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
And another one that looks similar to one I gathered yesterday:
[2025/08/14 04:41:38] [engine] caught signal (SIGSEGV)
#0 0x5962cb909bf1 in ares_htable_find() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable.c:184
#1 0x5962cb90a2d1 in ares_htable_get() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable.c:381
#2 0x5962cb8fa6bc in ares_htable_asvp_get() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable_asvp.c:189
#3 0x5962cb8fa708 in ares_htable_asvp_get_direct() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable_asvp.c:204
#4 0x5962cb8ed41f in ares_conn_from_fd() at lib/c-ares-1.34.4/src/lib/ares_conn.c:505
#5 0x5962cb8f2148 in process_write() at lib/c-ares-1.34.4/src/lib/ares_process.c:384
#6 0x5962cb8f2148 in ares_process_fds_nolock() at lib/c-ares-1.34.4/src/lib/ares_process.c:211
#7 0x5962cb8f2d28 in ares_process_fds_nolock() at lib/c-ares-1.34.4/src/lib/ares_process.c:260
#8 0x5962cb8f2d28 in ares_process_fds() at lib/c-ares-1.34.4/src/lib/ares_process.c:257
#9 0x5962cb8f2d8b in ares_process_fd() at lib/c-ares-1.34.4/src/lib/ares_process.c:284
#10 0x5962cb51eba3 in flb_net_getaddrinfo_event_handler() at src/flb_network.c:915
#11 0x5962cb51eba3 in flb_net_getaddrinfo_event_handler() at src/flb_network.c:904
#12 0x5962cb51535d in output_thread() at src/flb_output_thread.c:318
#13 0x5962cb53042d in step_callback() at src/flb_worker.c:43
#14 0x709c3a2851f4 in ???() at ???:0
#15 0x709c3a30589b in ???() at ???:0
#16 0xffffffffffffffff in ???() at ???:0
And another:
[2025/08/14 05:02:35] [engine] caught signal (SIGSEGV)
#0 0x7c529a9f99f2 in ???() at ???:0
#1 0x56a381f1c489 in tls_net_write() at src/tls/openssl.c:1011
#2 0x56a381f1d216 in flb_tls_net_write_async() at src/tls/flb_tls.c:487
#3 0x56a381f2c9cc in flb_io_net_write() at src/flb_io.c:699
#4 0x56a381f2ddd5 in flb_http_do_request() at src/flb_http_client.c:1396
#5 0x56a381f2e69c in flb_http_do() at src/flb_http_client.c:1530
#6 0x56a381ff6233 in http_post() at plugins/out_http/http.c:284
#7 0x56a381ff7717 in cb_http_flush() at plugins/out_http/http.c:641
#8 0x56a382573166 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9 0xffffffffffffffff in ???() at ???:0
I noted that the crashes were only happening on 2 nodes.
These are the nodes where the fluent-bit pods run alongside the logstash install (which we call logstash-cluster-shipper) that they push their logs to.
Below you can see a cluster that was left running fluent-bit 4.0.7 overnight: 2 of the pods have been restarting while all the other pods are fine.
$ kubectl p
NAME READY STATUS RESTARTS AGE IP NODE SPOT AZ
fluent-bit-264df 1/1 Running 0 19h45m 10.9.63.16 i-014c6e1f605659839 x a
fluent-bit-2gfx5 1/1 Running 0 21h3m 10.9.78.226 i-0c5914c6ce4d8b721 x b
fluent-bit-2vdps 1/1 Running 0 21h1m 10.9.128.50 i-08185a3482cd5c50d x a
fluent-bit-4hjdf 1/1 Running 0 21h3m 10.9.101.82 i-0544ff5dd870e9d5f x c
fluent-bit-4lhgm 1/1 Running 86 (20m44s ago) 17h48m 10.9.125.114 i-0587f5f17e1f77b16 x c
fluent-bit-674h6 1/1 Running 0 21h3m 10.9.95.117 i-0658acb35f8b262dd x b
fluent-bit-69bhn 1/1 Running 0 20h34m 10.9.105.98 i-08977b516a22192d1 ✓ c
fluent-bit-6wpk9 1/1 Running 0 15h36m 10.9.94.48 i-07257d472830a7f69 x b
fluent-bit-7frwx 1/1 Running 0 21h1m 10.9.129.16 i-08d6d8dfcc1961978 x b
fluent-bit-7lkrw 1/1 Running 0 2h44m 10.9.64.192 i-029c96c9bad4cda1f x b
fluent-bit-7zz6d 1/1 Running 64 (1m25s ago) 17h41m 10.9.56.162 i-09cb920d6b96eea4d x a
fluent-bit-9x8k7 1/1 Running 0 57m57s 10.9.92.114 i-0bd41e2e19b941f16 x b
fluent-bit-bktkd 1/1 Running 0 2h47m 10.9.91.226 i-071599c31f37bba3c x b
fluent-bit-bkwrd 1/1 Running 0 21h1m 10.9.130.114 i-03d5d4dff4a8ed4ec x c
fluent-bit-bm4pc 1/1 Running 0 21h1m 10.9.128.146 i-0adaf0fc20c37badf x a
fluent-bit-bs9fd 1/1 Running 0 3h26m 10.9.55.194 i-017def9b9f61c94db x a
fluent-bit-fg2ml 1/1 Running 0 21h3m 10.9.48.194 i-018e5c83d1b546330 x a
fluent-bit-fr99k 1/1 Running 0 15h44m 10.9.87.130 i-0aadcf2c01253fd2e x b
fluent-bit-jgczr 1/1 Running 0 21h3m 10.9.61.66 i-0b29e10c5ec09ab49 x a
fluent-bit-lfdq2 1/1 Running 0 21h3m 10.9.126.242 i-072366caed49c7a13 x c
fluent-bit-lh66m 1/1 Running 0 21h1m 10.9.130.162 i-07e497070f1332aac x c
fluent-bit-mh8q9 1/1 Running 0 21h3m 10.9.78.208 i-0aead6b9f1caf47c9 x b
fluent-bit-n92c7 1/1 Running 0 14h36m 10.9.112.34 i-0089917f280ef866e x c
fluent-bit-nj8xl 1/1 Running 0 14h36m 10.9.120.194 i-00012cabe4badb501 x c
fluent-bit-pfw96 1/1 Running 0 20h38m 10.9.53.2 i-01a100b43add29f2b x a
fluent-bit-ptdx7 1/1 Running 0 21h3m 10.9.32.242 i-06884f474eb216593 x a
fluent-bit-q5kr7 1/1 Running 0 21h3m 10.9.87.233 i-05dafcdf99be41b65 x b
fluent-bit-qrbkw 1/1 Running 0 21h3m 10.9.69.82 i-09c3f9a8e06b8f9db x b
fluent-bit-r5zwh 1/1 Running 0 1h49m 10.9.109.162 i-033c0d47e41cde736 x c
fluent-bit-rkfdv 1/1 Running 0 21h3m 10.9.50.98 i-02806e6e0abb1fb22 x a
fluent-bit-sdmwk 1/1 Running 0 21h3m 10.9.97.194 i-09bdec1fa15a5c242 x c
fluent-bit-v48d8 1/1 Running 0 21h3m 10.9.124.210 i-0a9abd540639cbbd3 x c
fluent-bit-v7f4t 1/1 Running 0 3h44m 10.9.126.178 i-0137ba4598aec8e6b x c
fluent-bit-v9vsp 1/1 Running 0 21h3m 10.9.57.66 i-080e259d01d753d92 x a
fluent-bit-wrtrh 1/1 Running 0 21h1m 10.9.129.194 i-0fdad33b775beb177 x b
fluent-bit-xwvw4 1/1 Running 0 14h36m 10.9.47.34 i-07d3fa5d0c7440172 x a
And again today, the crashing 4.0.7 pods are the ones on the nodes where logstash runs.
When I first noticed the problem yesterday, I replaced the nodes hosting the crashing fluent-bit pods in case the nodes themselves had some sort of issue, but that didn't help.
I have 4 clusters and I noted the problem on all of them, and always on the nodes that were also running logstash.
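To double-check that correlation, I compared the nodes hosting the logstash-cluster-shipper pods against the fluent-bit pods that had restarted, along these lines (the namespace and label selector here are a sketch, not our exact values):
$ kubectl -n elasticsearch get pods -o wide | grep logstash-cluster-shipper
$ kubectl get pods -o wide | awk 'NR==1 || $4+0 > 0'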
Configuration
We deploy fluent-bit via the helm chart hosted at https://fluent.github.io/helm-charts
The values we provide to it follow:
config:
service: |
[SERVICE]
Daemon Off
Flush 5
HTTP_Server On
Log_Level warn
Parsers_File custom_parsers.conf
Parsers_File parsers.conf
inputs: |
[INPUT]
Buffer_Chunk_Size 800k
Buffer_Max_Size 20MB
DB /var/log/containers/fluent-bit.db
DB.locking true
Multiline.parser cri, docker
Name tail
Path /var/log/containers/*.log
Refresh_Interval 1
Skip_Long_Lines Off
Tag kube.*
[INPUT]
Buffer_Chunk_Size 512k
Buffer_Max_Size 5MB
DB /var/log/kube-audit.db
DB.locking true
Name tail
Parser kube_audit
Path /var/log/kube-audit.log
Tag kube_audit
[INPUT]
Name systemd
Read_From_Tail On
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
Tag host.*
filters: |
[FILTER]
# If the log has been broken up into multiple lines, rejoin them.
Match *
Mode partial_message
Multiline.key_content log
Name multiline
[FILTER]
# Enrich with kube data
Annotations Off
Buffer_Size 512k
K8S-Logging.Parser On
Keep_Log Off
Labels Off
Match kube.*
Merge_Log On
Name kubernetes
[FILTER]
Name lua
Match kube.*
# lua function to get the size of a log message
script /fluent-bit/scripts/functions.lua
call add_size
# Put rewrite_tag rules early to reduce the amount of work required.
# That's because a new message is produced with the new tag and gets processed by all the filters again starting from the top.
# However it needs to be after the `kubernetes` filter since that requires the original tag name to enrich the data.
# Ruby regex format can be tested at https://rubular.com/
# Spaces are not allowed in the regex since it needs to be one contiguous string so use \s instead.
[FILTER]
# Retag Apache logs where the timestamp has the format of %d/%b/%Y:%H:%M:%S.%L instead of just %d/%b/%Y:%H:%M:%S
# The new tag isn't called `apache-high-precision` because the *apache* match later chooses the heweb_apache parser.
Match *apache*
Name rewrite_tag
# $KEY REGEX NEW_TAG KEEP
Rule $log ^[^\[]+\[\d{1,2}\/[^\/]{3}\/\d{4}:\d\d:\d\d:\d\d\.\d+\s\+\d{4}\] web-high-precision false
[FILTER]
# Retag Nginx logs where the timestamp has the format of %d/%b/%Y:%H:%M:%S.%L instead of just %d/%b/%Y:%H:%M:%S
# The new tag isn't called `nginx-high-precision` because the *nginx* match later chooses the nginx_common parser.
Match *nginx*
Name rewrite_tag
# $KEY REGEX NEW_TAG KEEP
Rule $log ^\S+\s\S+\s\S+\s\[\d{1,2}\/[^\/]{3}\/\d{4}:\d\d:\d\d:\d\d\.\d+\s\+\d{4}\] ngx-high-precision false
[FILTER]
Match *php*
Name rewrite_tag
# $KEY REGEX NEW_TAG KEEP
Rule $channel AccessLogs Security_logs false
Rule $channel Security Security_logs false
[FILTER]
# Lift the kubernetes records so we can remove the unrequired ones
Add_prefix kubernetes.
Match kube.*
Name nest
Nested_under kubernetes
Operation lift
[FILTER]
# Remove unrequired kubernetes fields
Match kube.*
Name record_modifier
Remove_key kubernetes.container_hash
Remove_key kubernetes.docker_id
Remove_key kubernetes.container_image
[FILTER]
# We plan to remove the php-logs container. This is a hack to keep existing saved searches and alerts working when we do.
Condition Key_Value_Equals kubernetes.container_name php-fpm
Condition Key_Exists monolog_level
Match kube.*
Name modify
Set kubernetes.container_name php-logs
[FILTER]
# Lift the kubernetes.labels as well so we can modify the `app` one.
Add_prefix kubernetes.labels.
Match kube.*
Name nest
Nested_under kubernetes.labels
Operation lift
[FILTER]
# Rename the `app` label to `app.kubernetes.io/name` to stop the clash with all the `app.kubernetes.io/*` labels.
Hard_rename kubernetes.labels.app kubernetes.labels.app.kubernetes.io/name
Match kube.*
Name modify
[FILTER]
# Nest the Kubernetes labels back again.
Match kube.*
Name nest
Nest_under kubernetes.labels
Operation nest
Remove_prefix kubernetes.labels.
Wildcard kubernetes.labels.*
[FILTER]
# Nest the kubernetes records back under kubernetes
Match kube.*
Name nest
Nest_under kubernetes
Operation nest
Remove_prefix kubernetes.
Wildcard kubernetes.*
# Drop logs for coredns and node-local-dns ipv6 queries not found.
[FILTER]
Exclude log ^\[INFO\].*AAAA IN .*
Match_Regex .*(coredns|node-local-dns).*
Name grep
# Elasticsearch and Kibana produce logs with top-level keys of `log.level` and `log.logger` which are parsed by
# Fluent Bit into a top-level key of `log` as an object containing properties of `level` and `logger`.
#
# This conflicts with our top-level `log` key that we store the original log as plain text, so we need to rename
# these log.* keys into flat top-level keys that do not conflict.
[FILTER]
# Rename log.level to log_level.
Condition Key_exists log.level
Condition Key_value_equals kubernetes.namespace_name elasticsearch
Hard_rename log.level log_level
Match kube.*
Name modify
[FILTER]
# Rename log.logger to log_logger.
Condition Key_exists log.logger
Condition Key_value_equals kubernetes.namespace_name elasticsearch
Hard_rename log.logger log_logger
Match kube.*
Name modify
[FILTER]
# Send apache logs to the heweb_apache parser
Key_Name log
Match *apache*
Name parser
Parser heweb_apache
Reserve_Data True
[FILTER]
# Send apache logs with a high time precision to the heweb_apache parser
Key_Name log
Match *web-high-precision*
Name parser
Parser heweb_apache_high_precision
Reserve_Data True
# Fix up logs so that they can be ingested by Elasticsearch.
[FILTER]
Name lua
Match *
script /fluent-bit/scripts/functions.lua
call clean_logs
[FILTER]
# A number of fields share their names with other JSON logs being imported but with clashing data types.
# Just place all of them below a `kube_audit` key to avoid that.
Match kube_audit
Name nest
Nest_Under kube_audit
Operation nest
Wildcard *
[FILTER]
# Send nginx logs to the nginx_common parser
Key_Name log
Match *nginx*
Name parser
Parser nginx_common
Reserve_Data True
[FILTER]
# Send Nginx logs with a high time precision to the nginx_high_precision parser
Key_Name log
Match *ngx-high-precision*
Name parser
Parser nginx_high_precision
Reserve_Data True
[FILTER]
# regex stdout resque log format
Exclude log ^\[notice\]*
Match *heweb_resque*
Name grep
[FILTER]
Add_prefix logEvent_
Match *elasticsearch*
Name nest
Nested_under logEvent.url
Operation lift
[FILTER]
# regex stdout resque log format
Key_Name log
Match *heweb_resque*
Name parser
Parser resque_stdout
Reserve_Data True
[FILTER]
# Populate cluster_name field
Match *
Name record_modifier
Record cluster_name prod2.he0.io
[FILTER]
# Drop unneeded records as docker handles ts
# These fields result in ES errors
Match *
Name record_modifier
Remove_key params
Remove_key time
Remove_key ts
Remove_key url
Remove_key headers
Remove_key host
Remove_key _p
[FILTER]
# Parse coredns and node-local-dns records
Key_Name log
Match_Regex .*(coredns|node-cache).*
Name parser
Parser coredns
Reserve_Data True
[FILTER]
# parse redash logs
Key_Name log
Match *redash*
Name parser
Parser redash
Reserve_Data True
[FILTER]
Name record_modifier
Match *system*
Remove_key node
Remove_key pod
Remove_key resource
Remove_key service
[FILTER]
Name modify
Match Security_logs
Remove_wildcard kubernetes
[FILTER]
# Rename the source field from kubernetes-event-exporter to avoid a type clash.
Match *kube-event-exporter*
Name modify
Rename source kube_event_source
[FILTER]
# Rename the policy field from kyverno to avoid a type clash.
Match *kyverno*
Name modify
Rename policy kyverno_policy
[FILTER]
# Lift the extra records so we can rename the trace id
Add_prefix extra.
Match *
Name nest
Nested_under extra
Operation lift
[FILTER]
Match *
Name modify
Rename extra.X-Amzn-Trace-Id X-Amzn-Trace-Id
[FILTER]
# Nest the extra records back under extra
Match *
Name nest
Nest_under extra
Operation nest
Remove_prefix extra.
Wildcard extra.*
[FILTER]
# Give Amazon trace IDs a consistent name.
# This is at the end after all the custom parsers have been chosen since some of them produce X_Amzn_Trace_Id
# Unfortunately they can't produce `X-Amzn-Trace-Id` because a hyphen is an illegal character for the regex group names.
Match *
Name modify
Rename request_X-Amzn-Trace-Id X-Amzn-Trace-Id
Rename X_Amzn_Trace_Id X-Amzn-Trace-Id
outputs: |
[OUTPUT]
compress gzip
Format json
Host logstash-cluster-shipper.elasticsearch.svc.cluster.local
Match *
Name http
Port 443
Retry_Limit 120
tls On
tls.verify Off
json_date_format iso8601
[OUTPUT]
# Send to Elasticsearch security logs
#Retry_Limit False
Generate_ID On
Host elasticsearch.prod1.apps.he0.io
HTTP_User ***REDACTED***
HTTP_Passwd ***REDACTED***
Include_Tag_Key On
Index security-logs-ilm
Match Security_logs
Name es
Port 443
Replace_Dots On
storage.total_limit_size 10G
Suppress_Type_Name On
tls On
tls.verify Off
Trace_Error On
customParsers: |
[PARSER]
Name coredns
Format regex
Regex ^\[(?<level>\S*)\] (?<remote>.*):(?<port>\S*) - (?<coredns_id>\S*) "(?<type>\S*) (?<coredns_class>\S*) (?<coredns_name>\S*) (?<coredns_proto>\S*) (?<size>\S*) (?<do>\S*) (?<buffer_size>\S*)" (?<rcode>\S*) (?<rflags>\S*) (?<rsize>\S*) (?<duration>\S*)
[PARSER]
Name heweb_apache
Format regex
Key_Name log
Regex ^(?<hosts>(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+)(?:(?:,\s*(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+))?)*|-) - (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+) (?<url_path>.*?) HTTP[^/]*/[.\d]+" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*) "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)" (?<http_request_time_us>\S*) "(?<Origin>[^\"]*)" "(?<X_Amzn_Trace_Id>[^\"]*)"
Time_Format %d/%b/%Y:%H:%M:%S %z
Time_Key time
[PARSER]
Name heweb_apache_high_precision
Format regex
Key_Name log
Regex ^(?<hosts>(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+)(?:(?:,\s*(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+))?)*|-) - (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+) (?<url_path>.*?) HTTP[^/]*/[.\d]+" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*) "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)" (?<http_request_time_us>\S*) "(?<Origin>[^\"]*)" "(?<X_Amzn_Trace_Id>[^\"]*)"
Time_Format %d/%b/%Y:%H:%M:%S.%L %z
Time_Key time
[PARSER]
Name klog
Format regex
Regex (?<ID>\S*)\s(?<ts>\d{2}:\d{2}:\d{2}\.\d{6})\s*(?<line>\d*)\s(?<file>\S*])\s(?<message>.*)
# Command | Decoder | Field | Optional Action |
# ==============|=============|=======|===================|
Decode_Field_As escaped log try_next
Decode_Field_As escaped_utf8 log
[PARSER]
Format json
Name kube_audit
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
Time_Keep true
Time_Key requestReceivedTimestamp
[PARSER]
Name nginx_common
Format regex
Regex ^(?<client_ip>\S*) (?<server_domain>\S*) (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+)(?: +(?<url_path>[^\"]*?)(?: +\S*)?)?" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*)(?: "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)")? (?<http_request_time>\S*) (?<http_request_bytes>\S*) (?<X_Amzn_Trace_Id>\S*) (?<X_Forwarded_For>.*)
Time_Format %d/%b/%Y:%H:%M:%S %z
Time_Key time
[PARSER]
Name nginx_high_precision
Format regex
Regex ^(?<client_ip>\S*) (?<server_domain>\S*) (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+)(?: +(?<url_path>[^\"]*?)(?: +\S*)?)?" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*)(?: "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)")? (?<http_request_time>\S*) (?<http_request_bytes>\S*) (?<X_Amzn_Trace_Id>\S*) (?<X_Forwarded_For>.*)
Time_Format %d/%b/%Y:%H:%M:%S.%L %z
Time_Key time
[PARSER]
Name resque_stdout
Format regex
Regex ^\[(?<severity>\S*)\] .*\(Job\{(?<queue>\S*)\} \| ID\: (?<JobID>\S*) \| (?<class>\S*) \| \[(?<jsonData>.*)\]\)(?<message>.*)
[PARSER]
Name redash
Format regex
Regex ^\[(?<timestamp>.*)\]\[PID\:(?<pid>.*)\]\[(?<level>.*)\]\[(?<output>.*)\] method=(?<method>\S*) path=(?<path>\S*) endpoint=(?<endpoint>\S*) status=(?<status>\S*) content_type=(?<content_type>\S*)( charset=(?<charset>\S*))? content_length=(?<content_length>\S*) duration=(?<duration>\S*) query_count=(?<query_count>\S*) query_duration=(?<query_duration>\S*)
dnsConfig:
options:
- name: ndots
value: "1"
luaScripts:
functions.lua: |
{{- (readFile "fluent-bit.lua") | nindent 4 }}
priorityClassName: daemon-sets
resources:
limits:
memory: 256Mi
requests:
cpu: 120m
memory: 48Mi
service:
annotations:
service.kubernetes.io/topology-mode: auto
tolerations:
- operator: Exists
effect: NoExecute
- operator: Exists
effect: NoSchedule
updateStrategy:
rollingUpdate:
maxUnavailable: 20%
type: RollingUpdate
To Reproduce
I don't know how to reliably reproduce the problem. All I know is that 4.0.7 became unstable on our clusters with various segfaults, while version 4.0.6 is fine.
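In case it is useful, a stripped-down config that exercises the same output path as the crashing stack traces (tail input into the http output with TLS, gzip compression, and JSON formatting) would look roughly like the following, although I have not confirmed that this minimal form reproduces the crash:
[SERVICE]
    Flush            5
    Log_Level        warn
[INPUT]
    Name             tail
    Path             /var/log/containers/*.log
    Tag              kube.*
[OUTPUT]
    Name             http
    Match            *
    Host             logstash-cluster-shipper.elasticsearch.svc.cluster.local
    Port             443
    Format           json
    compress         gzip
    tls              On
    tls.verify       Off
    json_date_format iso8601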
Expected behavior
No crashes
Your Environment
- Version used:
4.0.7
- Configuration:
See the problem description.
- Environment name and version (e.g. Kubernetes? What version?):
Kubernetes version: v1.33.3
- Server type and version:
AWS EC2 instances.
- Operating System and version:
Kubernetes is running on Ubuntu 22.04 EC2 instances
- Filters and plugins:
Should be covered by the configuration shown above.
Additional context
If you think the crashes could be caused by functions.lua then I can also supply the content for that.
Since the problem only happens on pods talking to the Service that directs traffic to the logstash pods on the same nodes, it is probably worth mentioning that our kube-proxy is configured to use ipvs mode (rather than iptables) with the lc scheduler.
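For completeness, the relevant fragment of our kube-proxy configuration looks like this (paraphrased from our KubeProxyConfiguration, not a verbatim copy):
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  scheduler: lc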