Bug Report
Describe the bug
Fluent Bit hangs and stops collecting logs after a hot reload is triggered.
Context
We observed that 24 of the 46 pods of a DaemonSet present the same log pattern:
[2025/06/25 20:13:32] [engine] caught signal (SIGHUP)
[2025/06/25 20:13:32] [ info] reloading instance pid=1 tid=0x7f37941dae40
[2025/06/25 20:13:32] [ info] [reload] stop everything of the old context
[2025/06/25 20:13:32] [ warn] [engine] service will shutdown when all remaining tasks are flushed
[2025/06/25 20:13:32] [ info] [reload] start everything
The behaviour is very similar to what is reported here:
- Fluent-bit hanging and stopping operation #9927
- Hot reload stuck in progress after pausing inputs #9354 (but we do not see the pause logs, and it gets stuck after "[ info] [reload] start everything")
Fluent Bit is hung, consumes almost no resources (CPU, memory), and no logs are collected.
It seems to be spending its time sleeping:
cat /proc/1548315/stack
[<0>] hrtimer_nanosleep+0x95/0x120
[<0>] common_nsleep+0x40/0x50
[<0>] __x64_sys_clock_nanosleep+0xc7/0x130
[<0>] do_syscall_64+0x35/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0xd6
strace: Process 1548315 attached
restart_syscall(<... resuming interrupted clock_nanosleep ...>) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe5b0605c0) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe5b0605c0) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe5b0605c0) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe5b0605c0) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe5b0605c0) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe5b0605c0) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe5b0605c0) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe5b0605c0) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe5b0605c0) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe5b0605c0) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe5b0605c0) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe5b0605c0) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe5b0605c0) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe5b0605c0) = 0
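If it helps with triage, userspace backtraces can also be captured from the hung process. A minimal sketch, assuming gdb is available on the node or in an ephemeral debug container (1548315 is the PID from the traces above):

```sh
# Attach to the hung Fluent Bit process, dump backtraces for all threads, then detach.
# 1548315 is the PID observed above; adjust as needed.
gdb -p 1548315 -batch -ex "set pagination off" -ex "thread apply all bt"
```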
Expected behavior
The reload should complete and logs should continue to be collected.
Your Environment
- Version used: 3.2.4
- Deployed with the Fluent Bit Helm chart fluent-bit-0.47.10 on Kubernetes v1.2.6
Additional context
We trigger the hot reload when the secret containing the certificate used by the Fluent Bit kafka input (mTLS is required) is updated (see the sketch below).
It affects 24 of 46 pods, so the impact on our log collection pipelines is significant :(
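For reference, this is roughly how the reload is triggered on each pod. A sketch, not our exact tooling: <namespace>, <pod-name>, and the container name are placeholders, and the curl alternative assumes Hot_Reload and the HTTP server are enabled in the [SERVICE] section:

```sh
# Send SIGHUP to the fluent-bit process (PID 1 in the container) to trigger
# a hot reload after the certificate secret has been rotated.
kubectl exec -n <namespace> <pod-name> -c fluent-bit -- kill -HUP 1

# Alternative: use the built-in HTTP endpoint (requires Hot_Reload On and HTTP_Server On).
curl -s -X POST http://127.0.0.1:2020/api/v2/reload
```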