-
Notifications
You must be signed in to change notification settings - Fork 1.9k
crash on hot reload hangs #10840
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
crash on hot reload hangs #10840
Conversation
WalkthroughAdds hot-reload safety timeouts and abort paths to the engine shutdown tick: recounts running tasks and memory/filesystem backlog, logs counts, aborts if hot-reload timeout reached, and on final grace attempts chunk segregation or flush; otherwise proceeds with normal stop/cleanup. (≤50 words) Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant T as Shutdown Timer
participant E as Engine
participant S as Storage
participant L as Logger
T->>E: shutdown_tick()
E->>E: recount running tasks + mem_chunks + fs_chunks
E->>L: log backlog counts and grace state
alt Hot-reload timeout (shutdown_by_hot_reloading && grace_count >= 10)
E->>L: log error + running tasks
E-->>E: abort (return -1)
else Backlog remains within grace
opt grace_count == 1
alt storage_bl_flush_on_shutdown enabled
E->>S: sb_segregate_chunks()
alt segregation fails
E-->>E: return -2
else segregation OK
E->>E: flb_engine_flush(NULL)
end
else
E->>E: flb_engine_flush(NULL)
end
end
E-->>T: continue waiting (increment grace_count)
else No outstanding work
E->>L: log service stop
E-->>E: set exit status, destroy timer, shutdown
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Pre-merge checks (1 passed, 2 warnings)❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Poem
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).Please share your feedback with us on this Discord post. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
3820327 to
3bab772
Compare
3bab772 to
40e38d9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/flb_engine.c (1)
1101-1108: Don’t return -2 during shutdown; log and continueIn src/flb_engine.c at the shutdown loop (around line 1104), change the error path so it only logs and proceeds, ensuring the full shutdown sequence (including flb_engine_shutdown()) always runs:
- if (ret < 0) { - flb_error("[engine] could not segregate backlog chunks during shutdown"); - return -2; - } + if (ret < 0) { + flb_error("[engine] could not segregate backlog chunks during shutdown; proceeding without segregation"); + /* continue shutdown path without early exit */ + }This avoids exiting flb_engine_start() mid-shutdown on a segregation failure—callers only special‐case -1 and would otherwise skip cleanup.
🧹 Nitpick comments (3)
src/flb_engine.c (3)
1087-1094: Hot-reload timeout: make 300s configurable, improve messaging, and best-effort notify before abortGood safeguard. Recommend:
- Use a constant or config knob instead of hardcoded 300.
- Log “reached (>= %ds)” to match the condition and include the value.
- Best-effort notify library users before abort via flb_engine_failed(config).
Apply within range:
- if (ret > 0 && config->shutdown_by_hot_reloading && config->grace_count >= 300) { - flb_error("[engine] hot reload timeout exceeded (300s), crashing to prevent " - "indefinite hang"); - flb_task_running_print(config); - abort(); - } + if (ret > 0 && config->shutdown_by_hot_reloading && + config->grace_count >= FLB_HOT_RELOAD_TIMEOUT_SECS) { + flb_error("[engine] hot reload timeout reached (>= %ds), crashing to prevent indefinite hang", + FLB_HOT_RELOAD_TIMEOUT_SECS); + flb_task_running_print(config); + (void) flb_engine_failed(config); /* best-effort notification */ + abort(); + }Outside this hunk (add once near includes):
#ifndef FLB_HOT_RELOAD_TIMEOUT_SECS #define FLB_HOT_RELOAD_TIMEOUT_SECS 300 #endif
1076-1084: Rate-limit shutdown backlog logging to avoid log spamThese paths run every second; printing can be noisy/heavy in large backlogs. Log on first tick and then periodically.
- if ((mem_chunks + fs_chunks) > 0) { + if ((mem_chunks + fs_chunks) > 0 && + (config->grace_count == 1 || (config->grace_count % 10) == 0)) { flb_info("[engine] pending chunk count: memory=%d, filesystem=%d; grace_timer=%d", mem_chunks, fs_chunks, config->grace_count); } - if (tasks > 0) { - flb_task_running_print(config); - } + if (tasks > 0 && + (config->grace_count == 1 || (config->grace_count % 30) == 0)) { + flb_task_running_print(config); + }
1043-1054: Consistency nit: treat invalid fd as -1 (like flush_fd) rather than <= 0Elsewhere timer fds use -1 for invalid; using <= 0 here is slightly inconsistent and could mask rare fd=0 edge cases.
- if (config->shutdown_fd <= 0) { + if (config->shutdown_fd == -1) { config->shutdown_fd = mk_event_timeout_create(evl, 1, 0, event);Please verify initial value of config->shutdown_fd is -1.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
src/flb_engine.c(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/flb_engine.c (1)
src/flb_task.c (1)
flb_task_running_print(299-345)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
- GitHub Check: PR - fuzzing test
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
- GitHub Check: pr-compile-centos-7
🔇 Additional comments (1)
src/flb_engine.c (1)
1094-1112: LGTM: preserves existing grace semantics while adding safe first-tick actionsThe preservation of “ret > 0 && (grace_count < grace || grace == -1)” is correct, and the first-tick segregation/flush hook is a pragmatic addition.
cosmo0920
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks useful and making more safety for infinite stuck.
|
@stoksc But we need to Signed-off line in each commit. So, could you git commit -s and push force again? |
src/flb_engine.c
Outdated
| flb_error("[engine] hot reload timeout exceeded (300s), crashing to prevent " | ||
| "indefinite hang"); | ||
| flb_task_running_print(config); | ||
| abort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about returning -1 instead of abort() ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in process of writing a test that reproduces the issue. I'll try that once I'm done. Tried to follow the code, looks like as long as something in the error flow sets the flb_ctx_t->status to !ok it should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not easy to test the sort of hangs we were seeing that way in a unit test; do you have a suggestion? i started trying to sleep or flb_time_sleep the thread w/ dummy but it's on the main thread so that doesn't work either way.
i've changed it to return -1 and i can test with the repro i have internally but a unit test for hotreload would be nice.
i'm going to hack on a watchdog thread version that could handle a truly hung plugin on the main thread too
40e38d9 to
66933d6
Compare
Signed-off-by: Bradley Laney <bradley.laney@chronosphere.io>
66933d6 to
9100ac8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/flb_engine.c (1)
1085-1091: Hot-reload timeout guard looks good and aligns with earlier feedback to avoid abort().Returning
-1instead of aborting prevents a hard crash and still lets callers decide teardown behavior. Nice improvement.
🧹 Nitpick comments (2)
src/flb_engine.c (2)
1085-1091: Clarify log text and avoid unnecessary task-print when none exist.Current message says “crashing” but we’re returning an error. Also we always print running tasks even if
tasks == 0.Apply this minimal tweak:
- flb_error("[engine] hot reload timeout exceeded (10s), crashing to prevent " - "indefinite hang"); - flb_task_running_print(config); + flb_error("[engine] hot-reload shutdown timed out after 10s; aborting shutdown with error"); + if (tasks > 0) { + flb_task_running_print(config); + }
1085-1091: Make timeout configurable instead of hard-coded 10s.Consider a config knob (e.g.,
HotReloadTimeoutor env var) defaulting to 10s, so operators can tune for slow backends or stricter SLOs.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/flb_engine.c(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/flb_engine.c (1)
src/flb_task.c (1)
flb_task_running_print(299-345)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (24)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
- GitHub Check: PR - fuzzing test
🔇 Additional comments (1)
src/flb_engine.c (1)
1085-1091: No action required forflb_engine_starterror handling
The library mode path insrc/flb_lib.cchecks forret == -1and invokesflb_engine_failed(config)followed byflb_engine_shutdown(config), ensuring proper teardown without resource leaks.
Summary by CodeRabbit
New Features
Bug Fixes