
Various SIGSEGV crashes in fluent-bit version 4.0.7 #10729

@jim-barber-he

Bug Report

Describe the bug

We are seeing fluent-bit crashes in our Kubernetes cluster.
The crashes started when we upgraded from fluent-bit version 4.0.3 to 4.0.7.

I also upgraded our clusters from Kubernetes v1.32.7 to v1.33.3 at the same time, but to rule that out I downgraded fluent-bit back to 4.0.3 and the crashes stopped.

I then worked through the fluent-bit point releases to try to work out which release introduced the issue.
There were no crashes with fluent-bit versions 4.0.4, 4.0.5, or 4.0.6 either.
It was only once I reached version 4.0.7 that the crashes started again.

I have the output of a few of the crashes from yesterday using version 4.0.7 below.

Fluent Bit v4.0.7
* Copyright (C) 2015-2025 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _             ___  _____ 
|  ___| |                | |   | ___ (_) |           /   ||  _  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __/ /| || |/' |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / / /_| ||  /| |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /\___  |\ |_/ /
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/     |_(_)___/ 


[2025/08/13 05:22:47] [engine] caught signal (SIGSEGV)
#0  0x7d364b5cbabb      in  ???() at ???:0
#1  0x7d364b49413e      in  ???() at ???:0
#2  0x7d364b76c705      in  ???() at ???:0
#3  0x7d364b766c1f      in  ???() at ???:0
#4  0x7d364b767106      in  ???() at ???:0
#5  0x7d364b7489f2      in  ???() at ???:0
#6  0x5e229c9e0489      in  tls_net_write() at src/tls/openssl.c:1011
#7  0x5e229c9e1216      in  flb_tls_net_write_async() at src/tls/flb_tls.c:487
#8  0x5e229c9f09cc      in  flb_io_net_write() at src/flb_io.c:699
#9  0x5e229c9f1dd5      in  flb_http_do_request() at src/flb_http_client.c:1396
#10 0x5e229c9f269c      in  flb_http_do() at src/flb_http_client.c:1530
#11 0x5e229caba233      in  http_post() at plugins/out_http/http.c:284
#12 0x5e229cabb717      in  cb_http_flush() at plugins/out_http/http.c:641
#13 0x5e229d037166      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117

This one is somewhat similar in that frames #0 to #5 correspond to frames #8 to #13 above.

[2025/08/13 05:46:27] [engine] caught signal (SIGSEGV)
#0  0x5f217362c9cc      in  flb_io_net_write() at src/flb_io.c:699
#1  0x5f217362ddd5      in  flb_http_do_request() at src/flb_http_client.c:1396
#2  0x5f217362e69c      in  flb_http_do() at src/flb_http_client.c:1530
#3  0x5f21736f6233      in  http_post() at plugins/out_http/http.c:284
#4  0x5f21736f7717      in  cb_http_flush() at plugins/out_http/http.c:641
#5  0x5f2173c73166      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#6  0xffffffffffffffff  in  ???() at ???:0

And this crash is totally different from the others...

[2025/08/13 05:54:13] [engine] caught signal (SIGSEGV)
#0  0x5bd101c0abf1      in  ares_htable_find() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable.c:184
#1  0x5bd101c0b2d1      in  ares_htable_get() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable.c:381
#2  0x5bd101bfbc3d      in  ares_htable_szvp_get() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable_szvp.c:155
#3  0x5bd101bfbc88      in  ares_htable_szvp_get_direct() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable_szvp.c:169
#4  0x5bd101bf3448      in  process_answer() at lib/c-ares-1.34.4/src/lib/ares_process.c:730
#5  0x5bd101bf3448      in  read_answers() at lib/c-ares-1.34.4/src/lib/ares_process.c:553
#6  0x5bd101bf3448      in  process_read() at lib/c-ares-1.34.4/src/lib/ares_process.c:587
#7  0x5bd101bf3448      in  ares_process_fds_nolock() at lib/c-ares-1.34.4/src/lib/ares_process.c:225
#8  0x5bd101bf3d28      in  ares_process_fds_nolock() at lib/c-ares-1.34.4/src/lib/ares_process.c:260
#9  0x5bd101bf3d28      in  ares_process_fds() at lib/c-ares-1.34.4/src/lib/ares_process.c:257
#10 0x5bd101bf3d8b      in  ares_process_fd() at lib/c-ares-1.34.4/src/lib/ares_process.c:284
#11 0x5bd10181fba3      in  flb_net_getaddrinfo_event_handler() at src/flb_network.c:915
#12 0x5bd10181fba3      in  flb_net_getaddrinfo_event_handler() at src/flb_network.c:904
#13 0x5bd10181635d      in  output_thread() at src/flb_output_thread.c:318
#14 0x5bd10183142d      in  step_callback() at src/flb_worker.c:43
#15 0x7b502ad551f4      in  ???() at ???:0
#16 0x7b502add589b      in  ???() at ???:0

And then a crash from today, after I had worked back up to version 4.0.7 again (having retested each release from 4.0.3 onwards):

[2025/08/14 04:01:14] [ warn] [msgpack2json] unknown msgpack type 601716864
[2025/08/14 04:01:14] [engine] caught signal (SIGSEGV)
#0  0x631737167f12      in  msgpack2json() at src/flb_pack.c:737
#1  0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#2  0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#3  0x631737167f12      in  msgpack2json() at src/flb_pack.c:737
#4  0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#5  0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#6  0x631737167f12      in  msgpack2json() at src/flb_pack.c:737
#7  0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#8  0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#9  0x631737167f12      in  msgpack2json() at src/flb_pack.c:737
#10 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#11 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#12 0x631737167f12      in  msgpack2json() at src/flb_pack.c:737
#13 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#14 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#15 0x631737167f12      in  msgpack2json() at src/flb_pack.c:737
#16 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#17 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#18 0x631737167f12      in  msgpack2json() at src/flb_pack.c:737
#19 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#20 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#21 0x631737167f12      in  msgpack2json() at src/flb_pack.c:737
#22 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#23 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#24 0x631737167f12      in  msgpack2json() at src/flb_pack.c:737
#25 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#26 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#27 0x631737167f12      in  msgpack2json() at src/flb_pack.c:737
#28 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#29 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#30 0x631737167f12      in  msgpack2json() at src/flb_pack.c:737
#31 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#32 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#33 0x631737167f12      in  msgpack2json() at src/flb_pack.c:737
#34 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#35 0x6317371684c1      in  msgpack2json() at src/flb_pack.c:775
#36 0x631737167f12      in  msgpack2json() at src/flb_pack.c:737
#37 0x63173716970a      in  flb_msgpack_to_json() at src/flb_pack.c:812
#38 0x63173716981f      in  flb_msgpack_raw_to_json_sds() at src/flb_pack.c:852
#39 0x63173716a181      in  flb_pack_msgpack_to_json_format() at src/flb_pack.c:1225
#40 0x6317372883c3      in  compose_payload() at plugins/out_http/http.c:441
#41 0x6317372886b9      in  cb_http_flush() at plugins/out_http/http.c:631
#42 0x631737804166      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117

And another one that looks similar to one I gathered yesterday:

[2025/08/14 04:41:38] [engine] caught signal (SIGSEGV)
#0  0x5962cb909bf1      in  ares_htable_find() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable.c:184
#1  0x5962cb90a2d1      in  ares_htable_get() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable.c:381
#2  0x5962cb8fa6bc      in  ares_htable_asvp_get() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable_asvp.c:189
#3  0x5962cb8fa708      in  ares_htable_asvp_get_direct() at lib/c-ares-1.34.4/src/lib/dsa/ares_htable_asvp.c:204
#4  0x5962cb8ed41f      in  ares_conn_from_fd() at lib/c-ares-1.34.4/src/lib/ares_conn.c:505
#5  0x5962cb8f2148      in  process_write() at lib/c-ares-1.34.4/src/lib/ares_process.c:384
#6  0x5962cb8f2148      in  ares_process_fds_nolock() at lib/c-ares-1.34.4/src/lib/ares_process.c:211
#7  0x5962cb8f2d28      in  ares_process_fds_nolock() at lib/c-ares-1.34.4/src/lib/ares_process.c:260
#8  0x5962cb8f2d28      in  ares_process_fds() at lib/c-ares-1.34.4/src/lib/ares_process.c:257
#9  0x5962cb8f2d8b      in  ares_process_fd() at lib/c-ares-1.34.4/src/lib/ares_process.c:284
#10 0x5962cb51eba3      in  flb_net_getaddrinfo_event_handler() at src/flb_network.c:915
#11 0x5962cb51eba3      in  flb_net_getaddrinfo_event_handler() at src/flb_network.c:904
#12 0x5962cb51535d      in  output_thread() at src/flb_output_thread.c:318
#13 0x5962cb53042d      in  step_callback() at src/flb_worker.c:43
#14 0x709c3a2851f4      in  ???() at ???:0
#15 0x709c3a30589b      in  ???() at ???:0
#16 0xffffffffffffffff  in  ???() at ???:0

And another:

[2025/08/14 05:02:35] [engine] caught signal (SIGSEGV)
#0  0x7c529a9f99f2      in  ???() at ???:0
#1  0x56a381f1c489      in  tls_net_write() at src/tls/openssl.c:1011
#2  0x56a381f1d216      in  flb_tls_net_write_async() at src/tls/flb_tls.c:487
#3  0x56a381f2c9cc      in  flb_io_net_write() at src/flb_io.c:699
#4  0x56a381f2ddd5      in  flb_http_do_request() at src/flb_http_client.c:1396
#5  0x56a381f2e69c      in  flb_http_do() at src/flb_http_client.c:1530
#6  0x56a381ff6233      in  http_post() at plugins/out_http/http.c:284
#7  0x56a381ff7717      in  cb_http_flush() at plugins/out_http/http.c:641
#8  0x56a382573166      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9  0xffffffffffffffff  in  ???() at ???:0

I noted that the crashes were only happening on 2 nodes: the ones where the fluent-bit pods run alongside the logstash installation (which we call logstash-cluster-shipper) that they push their logs to.
Below you can see a cluster that was left running fluent-bit 4.0.7 overnight; 2 of the pods have been restarting while all the other pods are fine.

$ kubectl p
NAME              READY  STATUS   RESTARTS         AGE     IP            NODE                 SPOT  AZ
fluent-bit-264df  1/1    Running  0                19h45m  10.9.63.16    i-014c6e1f605659839  x     a
fluent-bit-2gfx5  1/1    Running  0                21h3m   10.9.78.226   i-0c5914c6ce4d8b721  x     b
fluent-bit-2vdps  1/1    Running  0                21h1m   10.9.128.50   i-08185a3482cd5c50d  x     a
fluent-bit-4hjdf  1/1    Running  0                21h3m   10.9.101.82   i-0544ff5dd870e9d5f  x     c
fluent-bit-4lhgm  1/1    Running  86 (20m44s ago)  17h48m  10.9.125.114  i-0587f5f17e1f77b16  x     c
fluent-bit-674h6  1/1    Running  0                21h3m   10.9.95.117   i-0658acb35f8b262dd  x     b
fluent-bit-69bhn  1/1    Running  0                20h34m  10.9.105.98   i-08977b516a22192d1  ✓     c
fluent-bit-6wpk9  1/1    Running  0                15h36m  10.9.94.48    i-07257d472830a7f69  x     b
fluent-bit-7frwx  1/1    Running  0                21h1m   10.9.129.16   i-08d6d8dfcc1961978  x     b
fluent-bit-7lkrw  1/1    Running  0                2h44m   10.9.64.192   i-029c96c9bad4cda1f  x     b
fluent-bit-7zz6d  1/1    Running  64 (1m25s ago)   17h41m  10.9.56.162   i-09cb920d6b96eea4d  x     a
fluent-bit-9x8k7  1/1    Running  0                57m57s  10.9.92.114   i-0bd41e2e19b941f16  x     b
fluent-bit-bktkd  1/1    Running  0                2h47m   10.9.91.226   i-071599c31f37bba3c  x     b
fluent-bit-bkwrd  1/1    Running  0                21h1m   10.9.130.114  i-03d5d4dff4a8ed4ec  x     c
fluent-bit-bm4pc  1/1    Running  0                21h1m   10.9.128.146  i-0adaf0fc20c37badf  x     a
fluent-bit-bs9fd  1/1    Running  0                3h26m   10.9.55.194   i-017def9b9f61c94db  x     a
fluent-bit-fg2ml  1/1    Running  0                21h3m   10.9.48.194   i-018e5c83d1b546330  x     a
fluent-bit-fr99k  1/1    Running  0                15h44m  10.9.87.130   i-0aadcf2c01253fd2e  x     b
fluent-bit-jgczr  1/1    Running  0                21h3m   10.9.61.66    i-0b29e10c5ec09ab49  x     a
fluent-bit-lfdq2  1/1    Running  0                21h3m   10.9.126.242  i-072366caed49c7a13  x     c
fluent-bit-lh66m  1/1    Running  0                21h1m   10.9.130.162  i-07e497070f1332aac  x     c
fluent-bit-mh8q9  1/1    Running  0                21h3m   10.9.78.208   i-0aead6b9f1caf47c9  x     b
fluent-bit-n92c7  1/1    Running  0                14h36m  10.9.112.34   i-0089917f280ef866e  x     c
fluent-bit-nj8xl  1/1    Running  0                14h36m  10.9.120.194  i-00012cabe4badb501  x     c
fluent-bit-pfw96  1/1    Running  0                20h38m  10.9.53.2     i-01a100b43add29f2b  x     a
fluent-bit-ptdx7  1/1    Running  0                21h3m   10.9.32.242   i-06884f474eb216593  x     a
fluent-bit-q5kr7  1/1    Running  0                21h3m   10.9.87.233   i-05dafcdf99be41b65  x     b
fluent-bit-qrbkw  1/1    Running  0                21h3m   10.9.69.82    i-09c3f9a8e06b8f9db  x     b
fluent-bit-r5zwh  1/1    Running  0                1h49m   10.9.109.162  i-033c0d47e41cde736  x     c
fluent-bit-rkfdv  1/1    Running  0                21h3m   10.9.50.98    i-02806e6e0abb1fb22  x     a
fluent-bit-sdmwk  1/1    Running  0                21h3m   10.9.97.194   i-09bdec1fa15a5c242  x     c
fluent-bit-v48d8  1/1    Running  0                21h3m   10.9.124.210  i-0a9abd540639cbbd3  x     c
fluent-bit-v7f4t  1/1    Running  0                3h44m   10.9.126.178  i-0137ba4598aec8e6b  x     c
fluent-bit-v9vsp  1/1    Running  0                21h3m   10.9.57.66    i-080e259d01d753d92  x     a
fluent-bit-wrtrh  1/1    Running  0                21h1m   10.9.129.194  i-0fdad33b775beb177  x     b
fluent-bit-xwvw4  1/1    Running  0                14h36m  10.9.47.34    i-07d3fa5d0c7440172  x     a

And again today, the crashing 4.0.7 pods are the ones running on the nodes where logstash is running.

When I first noticed the problem yesterday, I replaced the nodes hosting the crashing fluent-bit pods in case the nodes themselves had some sort of issue, but that didn't help.
I have 4 clusters and I saw the problem on all of them, always on the nodes that were also running logstash.

Configuration

We deploy fluent-bit via the Helm chart hosted at https://fluent.github.io/helm-charts.
The values we provide to it follow.

config:
  service: |
    [SERVICE]
        Daemon Off
        Flush 5
        HTTP_Server On
        Log_Level warn
        Parsers_File custom_parsers.conf
        Parsers_File parsers.conf

  inputs: |
    [INPUT]
        Buffer_Chunk_Size 800k
        Buffer_Max_Size 20MB
        DB /var/log/containers/fluent-bit.db
        DB.locking true
        Multiline.parser cri, docker
        Name tail
        Path /var/log/containers/*.log
        Refresh_Interval 1
        Skip_Long_Lines Off
        Tag kube.*
    [INPUT]
        Buffer_Chunk_Size 512k
        Buffer_Max_Size 5MB
        DB /var/log/kube-audit.db
        DB.locking true
        Name tail
        Parser kube_audit
        Path /var/log/kube-audit.log
        Tag kube_audit
    [INPUT]
        Name systemd
        Read_From_Tail On
        Systemd_Filter _SYSTEMD_UNIT=kubelet.service
        Tag host.*

  filters: |
    [FILTER]
        # If the log has been broken up into multiple lines, rejoin them.
        Match                 *
        Mode                  partial_message
        Multiline.key_content log
        Name                  multiline

    [FILTER]
        # Enrich with kube data
        Annotations Off
        Buffer_Size 512k
        K8S-Logging.Parser On
        Keep_Log Off
        Labels Off
        Match kube.*
        Merge_Log On
        Name kubernetes

    [FILTER]
        Name                lua
        Match               kube.*
        # lua function to get the size of a log message
        script              /fluent-bit/scripts/functions.lua
        call                add_size

    # Put rewrite_tag rules early to reduce the amount of work required.
    # That's because a new message is produced with the new tag and gets processed by all the filters again starting from the top.
    # However it needs to be after the `kubernetes` filter since that requires the original tag name to enrich the data.
    # Ruby regex format can be tested at https://rubular.com/
    # Spaces are not allowed in the regex since it needs to be one contiguous string so use \s instead.
    [FILTER]
        # Retag Apache logs where the timestamp has the format of %d/%b/%Y:%H:%M:%S.%L instead of just %d/%b/%Y:%H:%M:%S
        # The new tag isn't called `apache-high-precision` because the *apache* match later chooses the heweb_apache parser.
        Match *apache*
        Name rewrite_tag
        #    $KEY  REGEX                                                                     NEW_TAG             KEEP
        Rule $log  ^[^\[]+\[\d{1,2}\/[^\/]{3}\/\d{4}:\d\d:\d\d:\d\d\.\d+\s\+\d{4}\]          web-high-precision  false
    [FILTER]
        # Retag Nginx logs where the timestamp has the format of %d/%b/%Y:%H:%M:%S.%L instead of just %d/%b/%Y:%H:%M:%S
        # The new tag isn't called `nginx-high-precision` because the *nginx* match later chooses the nginx_common parser.
        Match *nginx*
        Name rewrite_tag
        #    $KEY  REGEX                                                                     NEW_TAG             KEEP
        Rule $log  ^\S+\s\S+\s\S+\s\[\d{1,2}\/[^\/]{3}\/\d{4}:\d\d:\d\d:\d\d\.\d+\s\+\d{4}\] ngx-high-precision  false
    [FILTER]
        Match *php*
        Name rewrite_tag
        #    $KEY     REGEX      NEW_TAG       KEEP
        Rule $channel AccessLogs Security_logs false
        Rule $channel Security   Security_logs false

    [FILTER]
        # Lift the kubernetes records so we can remove the unrequired ones
        Add_prefix kubernetes.
        Match kube.*
        Name nest
        Nested_under kubernetes
        Operation lift
    [FILTER]
        # Remove unrequired kubernetes fields
        Match kube.*
        Name record_modifier
        Remove_key kubernetes.container_hash
        Remove_key kubernetes.docker_id
        Remove_key kubernetes.container_image
    [FILTER]
        # We plan to remove the php-logs container. This is a hack to keep existing saved searches and alerts working when we do.
        Condition Key_Value_Equals kubernetes.container_name php-fpm
        Condition Key_Exists monolog_level
        Match kube.*
        Name modify
        Set kubernetes.container_name php-logs
    [FILTER]
        # Lift the kubernetes.labels as well so we can modify the `app` one.
        Add_prefix kubernetes.labels.
        Match kube.*
        Name nest
        Nested_under kubernetes.labels
        Operation lift
    [FILTER]
        # Rename the `app` label to `app.kubernetes.io/name` to stop the clash with all the `app.kubernetes.io/*` labels.
        Hard_rename kubernetes.labels.app kubernetes.labels.app.kubernetes.io/name
        Match kube.*
        Name modify
    [FILTER]
        # Nest the Kubernetes labels back again.
        Match kube.*
        Name nest
        Nest_under kubernetes.labels
        Operation nest
        Remove_prefix kubernetes.labels.
        Wildcard kubernetes.labels.*
    [FILTER]
        # Nest the kubernetes records back under kubernetes
        Match kube.*
        Name nest
        Nest_under kubernetes
        Operation nest
        Remove_prefix kubernetes.
        Wildcard kubernetes.*

    # Drop logs for coredns and node-local-dns ipv6 queries not found.
    [FILTER]
        Exclude log ^\[INFO\].*AAAA IN .*
        Match_Regex  .*(coredns|node-local-dns).*
        Name   grep

    # Elasticsearch and Kibana produce logs with top-level keys of `log.level` and `log.logger` which are parsed by
    # Fluent Bit into a top-level key of `log` as an object containing properties of `level` and `logger`.
    #
    # This conflicts with our top-level `log` key that we store the original log as plain text, so we need to rename
    # these log.* keys into flat top-level keys that do not conflict.
    [FILTER]
        # Rename log.level to log_level.
        Condition Key_exists log.level
        Condition Key_value_equals kubernetes.namespace_name elasticsearch
        Hard_rename log.level log_level
        Match kube.*
        Name modify
    [FILTER]
        # Rename log.logger to log_logger.
        Condition Key_exists log.logger
        Condition Key_value_equals kubernetes.namespace_name elasticsearch
        Hard_rename log.logger log_logger
        Match kube.*
        Name modify

    [FILTER]
        # Send apache logs to the heweb_apache parser
        Key_Name log
        Match *apache*
        Name parser
        Parser heweb_apache
        Reserve_Data True
    [FILTER]
        # Send apache logs with a high time precision to the heweb_apache parser
        Key_Name log
        Match *web-high-precision*
        Name parser
        Parser heweb_apache_high_precision
        Reserve_Data True

    # Fix up logs so that they can be ingested by Elasticsearch.
    [FILTER]
        Name lua
        Match *
        script /fluent-bit/scripts/functions.lua
        call clean_logs

    [FILTER]
        # A number of fields share their names with other JSON logs being imported but with clashing data types.
        # Just place all of them below a `kube_audit` key to avoid that.
        Match kube_audit
        Name nest
        Nest_Under kube_audit
        Operation nest
        Wildcard *

    [FILTER]
        # Send nginx logs to the nginx_common parser
        Key_Name log
        Match *nginx*
        Name parser
        Parser nginx_common
        Reserve_Data True
    [FILTER]
        # Send Nginx logs with a high time precision to the nginx_high_precision parser
        Key_Name log
        Match *ngx-high-precision*
        Name parser
        Parser nginx_high_precision
        Reserve_Data True
    [FILTER]
        # regex stdout resque log format
        Exclude log ^\[notice\]*
        Match  *heweb_resque*
        Name   grep
    [FILTER]
        Add_prefix logEvent_
        Match  *elasticsearch*
        Name   nest
        Nested_under logEvent.url
        Operation lift
    [FILTER]
        # regex stdout resque log format
        Key_Name log
        Match *heweb_resque*
        Name parser
        Parser resque_stdout
        Reserve_Data True
    [FILTER]
        # Populate cluster_name field
        Match *
        Name record_modifier
        Record cluster_name prod2.he0.io
    [FILTER]
        # Drop unneeded records as docker handles ts
        # These fields result in ES errors
        Match *
        Name record_modifier
        Remove_key params
        Remove_key time
        Remove_key ts
        Remove_key url
        Remove_key headers
        Remove_key host
        Remove_key _p
    [FILTER]
        # Parse coredns and node-local-dns records
        Key_Name log
        Match_Regex .*(coredns|node-cache).*
        Name parser
        Parser coredns
        Reserve_Data True
    [FILTER]
        # parse redash logs
        Key_Name log
        Match *redash*
        Name parser
        Parser redash
        Reserve_Data True
    [FILTER]
        Name record_modifier
        Match *system*
        Remove_key node
        Remove_key pod
        Remove_key resource
        Remove_key service
    [FILTER]
        Name modify
        Match Security_logs
        Remove_wildcard kubernetes
    [FILTER]
        # Rename the source field from kubernetes-event-exporter to avoid a type clash.
        Match *kube-event-exporter*
        Name modify
        Rename source kube_event_source
    [FILTER]
        # Rename the policy field from kyverno to avoid a type clash.
        Match *kyverno*
        Name modify
        Rename policy kyverno_policy
    [FILTER]
        # Lift the extra records so we can rename the trace id
        Add_prefix extra.
        Match *
        Name nest
        Nested_under extra
        Operation lift
    [FILTER]
        Match *
        Name modify
        Rename extra.X-Amzn-Trace-Id X-Amzn-Trace-Id
    [FILTER]
        # Nest the extra records back under extra
        Match *
        Name nest
        Nest_under extra
        Operation nest
        Remove_prefix extra.
        Wildcard extra.*
    [FILTER]
        # Give Amazon trace IDs a consistent name.
        # This is at the end after all the custom parsers have been chosen since some of them produce X_Amzn_Trace_Id
        # Unfortunately they can't produce `X-Amzn-Trace-Id` because a hyphen is an illegal character for the regex group names.
        Match *
        Name modify
        Rename request_X-Amzn-Trace-Id X-Amzn-Trace-Id
        Rename X_Amzn_Trace_Id X-Amzn-Trace-Id

  outputs: |
    [OUTPUT]
        compress gzip
        Format json
        Host logstash-cluster-shipper.elasticsearch.svc.cluster.local
        Match  *
        Name   http
        Port   443
        Retry_Limit 120
        tls On
        tls.verify Off
        json_date_format iso8601
    [OUTPUT]
        # Send to Elasticsearch security logs
        #Retry_Limit False
        Generate_ID On
        Host elasticsearch.prod1.apps.he0.io
        HTTP_User ***REDACTED***
        HTTP_Passwd ***REDACTED***
        Include_Tag_Key On
        Index security-logs-ilm
        Match Security_logs
        Name es
        Port 443
        Replace_Dots On
        storage.total_limit_size 10G
        Suppress_Type_Name On
        tls On
        tls.verify Off
        Trace_Error On

  customParsers: |
    [PARSER]
        Name coredns
        Format regex
        Regex ^\[(?<level>\S*)\] (?<remote>.*):(?<port>\S*) - (?<coredns_id>\S*) "(?<type>\S*) (?<coredns_class>\S*) (?<coredns_name>\S*) (?<coredns_proto>\S*) (?<size>\S*) (?<do>\S*) (?<buffer_size>\S*)" (?<rcode>\S*) (?<rflags>\S*) (?<rsize>\S*) (?<duration>\S*)
    [PARSER]
        Name heweb_apache
        Format regex
        Key_Name log
        Regex ^(?<hosts>(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+)(?:(?:,\s*(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+))?)*|-) - (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+) (?<url_path>.*?) HTTP[^/]*/[.\d]+" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*) "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)" (?<http_request_time_us>\S*) "(?<Origin>[^\"]*)" "(?<X_Amzn_Trace_Id>[^\"]*)"
        Time_Format %d/%b/%Y:%H:%M:%S %z
        Time_Key time
    [PARSER]
        Name heweb_apache_high_precision
        Format regex
        Key_Name log
        Regex ^(?<hosts>(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+)(?:(?:,\s*(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+))?)*|-) - (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+) (?<url_path>.*?) HTTP[^/]*/[.\d]+" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*) "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)" (?<http_request_time_us>\S*) "(?<Origin>[^\"]*)" "(?<X_Amzn_Trace_Id>[^\"]*)"
        Time_Format %d/%b/%Y:%H:%M:%S.%L %z
        Time_Key time
    [PARSER]
        Name klog
        Format regex
        Regex (?<ID>\S*)\s(?<ts>\d{2}:\d{2}:\d{2}\.\d{6})\s*(?<line>\d*)\s(?<file>\S*])\s(?<message>.*)
        # Command       |  Decoder    | Field | Optional Action   |
        # ==============|=============|=======|===================|
        Decode_Field_As    escaped       log        try_next
        Decode_Field_As    escaped_utf8  log
    [PARSER]
        Format json
        Name kube_audit
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z
        Time_Keep true
        Time_Key requestReceivedTimestamp
    [PARSER]
        Name nginx_common
        Format regex
        Regex ^(?<client_ip>\S*) (?<server_domain>\S*) (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+)(?: +(?<url_path>[^\"]*?)(?: +\S*)?)?" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*)(?: "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)")? (?<http_request_time>\S*) (?<http_request_bytes>\S*) (?<X_Amzn_Trace_Id>\S*) (?<X_Forwarded_For>.*)
        Time_Format %d/%b/%Y:%H:%M:%S %z
        Time_Key time
    [PARSER]
        Name nginx_high_precision
        Format regex
        Regex ^(?<client_ip>\S*) (?<server_domain>\S*) (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+)(?: +(?<url_path>[^\"]*?)(?: +\S*)?)?" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*)(?: "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)")? (?<http_request_time>\S*) (?<http_request_bytes>\S*) (?<X_Amzn_Trace_Id>\S*) (?<X_Forwarded_For>.*)
        Time_Format %d/%b/%Y:%H:%M:%S.%L %z
        Time_Key time
    [PARSER]
        Name resque_stdout
        Format regex
        Regex ^\[(?<severity>\S*)\] .*\(Job\{(?<queue>\S*)\} \| ID\: (?<JobID>\S*) \| (?<class>\S*) \| \[(?<jsonData>.*)\]\)(?<message>.*)
    [PARSER]
        Name redash
        Format regex
        Regex ^\[(?<timestamp>.*)\]\[PID\:(?<pid>.*)\]\[(?<level>.*)\]\[(?<output>.*)\] method=(?<method>\S*) path=(?<path>\S*) endpoint=(?<endpoint>\S*) status=(?<status>\S*) content_type=(?<content_type>\S*)( charset=(?<charset>\S*))? content_length=(?<content_length>\S*) duration=(?<duration>\S*) query_count=(?<query_count>\S*) query_duration=(?<query_duration>\S*)

dnsConfig:
  options:
    - name: ndots
      value: "1"

luaScripts:
  functions.lua: |
    {{- (readFile "fluent-bit.lua") | nindent 4 }}

priorityClassName: daemon-sets

resources:
  limits:
    memory: 256Mi
  requests:
    cpu: 120m
    memory: 48Mi

service:
  annotations:
    service.kubernetes.io/topology-mode: auto

tolerations:
  - operator: Exists
    effect: NoExecute
  - operator: Exists
    effect: NoSchedule

updateStrategy:
  rollingUpdate:
    maxUnavailable: 20%
  type: RollingUpdate
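
For completeness, this is roughly how the chart can be applied and how a specific point release can be pinned while bisecting. It's a minimal sketch: the release name, namespace, values file name, and the chart's image.tag override are assumptions here, not copied from our actual tooling.

$ helm repo add fluent https://fluent.github.io/helm-charts
$ helm repo update
$ # Pin a specific fluent-bit point release while bisecting (assumes the chart's standard image.tag value)
$ helm upgrade --install fluent-bit fluent/fluent-bit \
      --namespace logging \
      --values values.yaml \
      --set image.tag=4.0.6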

To Reproduce

I don't know how to reliably reproduce the problem. All I know is that 4.0.7 became unstable on our clusters with various segfaults, while version 4.0.6 is fine.

Expected behavior

No crashes

Your Environment

  • Version used:

4.0.7

  • Configuration:

See the problem description.

  • Environment name and version (e.g. Kubernetes? What version?):

Kubernetes version: v1.33.3

  • Server type and version:

AWS EC2 instances.

  • Operating System and version:

Kubernetes is running on Ubuntu 22.04 EC2 instances

  • Filters and plugins:

Should be covered by the configuration shown above.

Additional context

If you think the crashes could be caused by functions.lua, I can also supply its contents.

Since the problem only happens on pods talking to the Service that directs traffic to the logstash pods on the same nodes, it is probably worth mentioning that our kube-proxy is configured to use ipvs mode (rather than iptables) with the lc scheduler.
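
In case it is useful, this is roughly how the ipvs/lc setup can be confirmed on an affected node. It's a hedged sketch: the kube-proxy ConfigMap name and the placeholder Service cluster IP below are assumptions.

$ kubectl -n kube-system get configmap kube-proxy -o yaml | grep -E 'mode|scheduler'
$ # On an affected node, list the ipvs virtual servers with their scheduler and real servers
$ sudo ipvsadm -Ln | grep -A 5 '<logstash-service-cluster-ip>'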
