crash on hot reload hangs #10840

stoksc · 2025-09-04T23:16:15Z

Summary by CodeRabbit

New Features
- Added a hard safety timeout for hot-reloads to abort after extended grace when work remains.
- Recounts pending work during shutdown and improves diagnostics with backlog, memory and filesystem chunk details.
- When configured, pending data is segregated or flushed during shutdown to ensure data handling.
Bug Fixes
- Prevents rare stalls by aborting hot-reloads after repeated grace timeouts.
- Preserves clean, graceful shutdown when no backlog exists.

coderabbitai · 2025-09-04T23:16:23Z

Walkthrough

Adds hot-reload safety timeouts and abort paths to the engine shutdown tick: recounts running tasks and memory/filesystem backlog, logs counts, aborts if hot-reload timeout reached, and on final grace attempts chunk segregation or flush; otherwise proceeds with normal stop/cleanup. (≤50 words)

Changes

Cohort / File(s)	Summary
Engine shutdown and hot-reload handling `src/flb_engine.c`	Add hot-reload hard timeout/abort when shutdown_by_hot_reloading is enabled and grace_count >= 10; perform a recount of running tasks and storage (memory/fs) chunks and log backlog; on final grace attempt `sb_segregate_chunks` if `storage_bl_flush_on_shutdown` enabled (return -2 on failure) or call `flb_engine_flush`; retain existing grace wait and normal stop/cleanup paths.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant T as Shutdown Timer
  participant E as Engine
  participant S as Storage
  participant L as Logger

  T->>E: shutdown_tick()
  E->>E: recount running tasks + mem_chunks + fs_chunks
  E->>L: log backlog counts and grace state
  alt Hot-reload timeout (shutdown_by_hot_reloading && grace_count >= 10)
    E->>L: log error + running tasks
    E-->>E: abort (return -1)
  else Backlog remains within grace
    opt grace_count == 1
      alt storage_bl_flush_on_shutdown enabled
        E->>S: sb_segregate_chunks()
        alt segregation fails
          E-->>E: return -2
        else segregation OK
          E->>E: flb_engine_flush(NULL)
        end
      else
        E->>E: flb_engine_flush(NULL)
      end
    end
    E-->>T: continue waiting (increment grace_count)
  else No outstanding work
    E->>L: log service stop
    E-->>E: set exit status, destroy timer, shutdown
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

engine: fix busy chunks on reload. #10720 — Modifies the same src/flb_engine.c shutdown/reload flow and pending-chunk handling; likely directly related.

Suggested reviewers

edsiper
koleini
fujimotos

Pre-merge checks (1 passed, 2 warnings)

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Title Check	⚠️ Warning	The current title “crash on hot reload hangs” is vague and grammatically awkward, and it does not clearly convey the primary change introduced by this PR, which is the addition of a hot-reload safety timeout in the engine shutdown path to prevent hangs. As a result, a teammate scanning the history would not immediately understand the key feature or fix implemented. The title fails to meet the requirement for concise and specific phrasing that highlights the most important change.	Please rename the title to clearly reflect the main change and its intent, for example: “Add hot-reload safety timeout to engine shutdown to prevent hangs,” so that it succinctly describes the feature and its purpose.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

Poem

A rabbit taps the graceful clock, counts chunks with whiskered care,
If reloads hang beyond the time, we sound the final flare.
Segregate, then flush away, or bail with tidy logs in tow,
Hop home, the engine quiets — down the burrow we go. 🐇

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/flb_engine.c (1)
1101-1108: Don’t return -2 during shutdown; log and continue

In src/flb_engine.c at the shutdown loop (around line 1104), change the error path so it only logs and proceeds, ensuring the full shutdown sequence (including flb_engine_shutdown()) always runs:
-    if (ret < 0) {
-        flb_error("[engine] could not segregate backlog chunks during shutdown");
-        return -2;
-    }
+    if (ret < 0) {
+        flb_error("[engine] could not segregate backlog chunks during shutdown; proceeding without segregation");
+        /* continue shutdown path without early exit */
+    }
This avoids exiting flb_engine_start() mid-shutdown on a segregation failure—callers only special‐case -1 and would otherwise skip cleanup.

🧹 Nitpick comments (3)

src/flb_engine.c (3)

1087-1094: Hot-reload timeout: make 300s configurable, improve messaging, and best-effort notify before abort

Good safeguard. Recommend:

Use a constant or config knob instead of hardcoded 300.
Log “reached (>= %ds)” to match the condition and include the value.
Best-effort notify library users before abort via flb_engine_failed(config).

Apply within range:

-                    if (ret > 0 && config->shutdown_by_hot_reloading && config->grace_count >= 300) {
-                        flb_error("[engine] hot reload timeout exceeded (300s), crashing to prevent "
-                                  "indefinite hang");
-                        flb_task_running_print(config);
-                        abort();
-                    }
+                    if (ret > 0 && config->shutdown_by_hot_reloading &&
+                        config->grace_count >= FLB_HOT_RELOAD_TIMEOUT_SECS) {
+                        flb_error("[engine] hot reload timeout reached (>= %ds), crashing to prevent indefinite hang",
+                                  FLB_HOT_RELOAD_TIMEOUT_SECS);
+                        flb_task_running_print(config);
+                        (void) flb_engine_failed(config); /* best-effort notification */
+                        abort();
+                    }

Outside this hunk (add once near includes):

#ifndef FLB_HOT_RELOAD_TIMEOUT_SECS
#define FLB_HOT_RELOAD_TIMEOUT_SECS 300
#endif

1076-1084: Rate-limit shutdown backlog logging to avoid log spam

These paths run every second; printing can be noisy/heavy in large backlogs. Log on first tick and then periodically.

-                    if ((mem_chunks + fs_chunks) > 0) {
+                    if ((mem_chunks + fs_chunks) > 0 &&
+                        (config->grace_count == 1 || (config->grace_count % 10) == 0)) {
                         flb_info("[engine] pending chunk count: memory=%d, filesystem=%d; grace_timer=%d",
                                  mem_chunks, fs_chunks, config->grace_count);
                     }
 
-                    if (tasks > 0) {
-                        flb_task_running_print(config);
-                    }
+                    if (tasks > 0 &&
+                        (config->grace_count == 1 || (config->grace_count % 30) == 0)) {
+                        flb_task_running_print(config);
+                    }

1043-1054: Consistency nit: treat invalid fd as -1 (like flush_fd) rather than <= 0

Elsewhere timer fds use -1 for invalid; using <= 0 here is slightly inconsistent and could mask rare fd=0 edge cases.

-                    if (config->shutdown_fd <= 0) {
+                    if (config->shutdown_fd == -1) {
                         config->shutdown_fd = mk_event_timeout_create(evl,
                                                                       1,
                                                                       0,
                                                                       event);

Please verify initial value of config->shutdown_fd is -1.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c01c62c and 3820327.

📒 Files selected for processing (1)

src/flb_engine.c (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

src/flb_engine.c (1)

src/flb_task.c (1)

flb_task_running_print (299-345)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)

GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
GitHub Check: PR - fuzzing test
GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
GitHub Check: pr-compile-centos-7

🔇 Additional comments (1)

src/flb_engine.c (1)

1094-1112: LGTM: preserves existing grace semantics while adding safe first-tick actions

The preservation of “ret > 0 && (grace_count < grace || grace == -1)” is correct, and the first-tick segregation/flush hook is a pragmatic addition.

cosmo0920

Looks useful and making more safety for infinite stuck.

cosmo0920 · 2025-09-05T05:50:23Z

@stoksc But we need to Signed-off line in each commit. So, could you git commit -s and push force again?

edsiper · 2025-09-05T16:58:57Z

src/flb_engine.c

+                        flb_error("[engine] hot reload timeout exceeded (300s), crashing to prevent "
+                                  "indefinite hang");
+                        flb_task_running_print(config);
+                        abort();


what about returning -1 instead of abort() ?

I'm in process of writing a test that reproduces the issue. I'll try that once I'm done. Tried to follow the code, looks like as long as something in the error flow sets the flb_ctx_t->status to !ok it should work.

it's not easy to test the sort of hangs we were seeing that way in a unit test; do you have a suggestion? i started trying to sleep or flb_time_sleep the thread w/ dummy but it's on the main thread so that doesn't work either way.

i've changed it to return -1 and i can test with the repro i have internally but a unit test for hotreload would be nice.

i'm going to hack on a watchdog thread version that could handle a truly hung plugin on the main thread too

Signed-off-by: Bradley Laney <bradley.laney@chronosphere.io>

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

src/flb_engine.c (1)

1085-1091: Hot-reload timeout guard looks good and aligns with earlier feedback to avoid abort().

Returning -1 instead of aborting prevents a hard crash and still lets callers decide teardown behavior. Nice improvement.

🧹 Nitpick comments (2)

src/flb_engine.c (2)
1085-1091: Clarify log text and avoid unnecessary task-print when none exist.

Current message says “crashing” but we’re returning an error. Also we always print running tasks even if tasks == 0.

Apply this minimal tweak:
-                        flb_error("[engine] hot reload timeout exceeded (10s), crashing to prevent "
-                                  "indefinite hang");
-                        flb_task_running_print(config);
+                        flb_error("[engine] hot-reload shutdown timed out after 10s; aborting shutdown with error");
+                        if (tasks > 0) {
+                            flb_task_running_print(config);
+                        }
1085-1091: Make timeout configurable instead of hard-coded 10s.

Consider a config knob (e.g., HotReloadTimeout or env var) defaulting to 10s, so operators can tune for slow backends or stricter SLOs.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 40e38d9 and 9100ac8.

📒 Files selected for processing (1)

src/flb_engine.c (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

src/flb_engine.c (1)

src/flb_task.c (1)

flb_task_running_print (299-345)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (24)

GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
GitHub Check: PR - fuzzing test

🔇 Additional comments (1)

src/flb_engine.c (1)

1085-1091: No action required for flb_engine_start error handling
The library mode path in src/flb_lib.c checks for ret == -1 and invokes flb_engine_failed(config) followed by flb_engine_shutdown(config), ensuring proper teardown without resource leaks.

stoksc requested review from edsiper, fujimotos and koleini as code owners September 4, 2025 23:16

stoksc temporarily deployed to pr September 4, 2025 23:16 — with GitHub Actions Inactive

github-actions bot added the docs-required label Sep 4, 2025

stoksc force-pushed the stoksc/crash-on-hotreload-hangs branch from 3820327 to 3bab772 Compare September 4, 2025 23:18

stoksc temporarily deployed to pr September 4, 2025 23:18 — with GitHub Actions Inactive

stoksc force-pushed the stoksc/crash-on-hotreload-hangs branch from 3bab772 to 40e38d9 Compare September 4, 2025 23:19

stoksc temporarily deployed to pr September 4, 2025 23:19 — with GitHub Actions Inactive

coderabbitai bot reviewed Sep 4, 2025

View reviewed changes

stoksc temporarily deployed to pr September 4, 2025 23:36 — with GitHub Actions Inactive

cosmo0920 approved these changes Sep 5, 2025

View reviewed changes

edsiper reviewed Sep 5, 2025

View reviewed changes

stoksc force-pushed the stoksc/crash-on-hotreload-hangs branch from 40e38d9 to 66933d6 Compare September 10, 2025 13:15

stoksc temporarily deployed to pr September 10, 2025 13:15 — with GitHub Actions Inactive

crash on hot reload hangs

9100ac8

Signed-off-by: Bradley Laney <bradley.laney@chronosphere.io>

stoksc force-pushed the stoksc/crash-on-hotreload-hangs branch from 66933d6 to 9100ac8 Compare September 10, 2025 13:19

stoksc temporarily deployed to pr September 10, 2025 13:19 — with GitHub Actions Inactive

coderabbitai bot reviewed Sep 10, 2025

View reviewed changes

stoksc temporarily deployed to pr September 10, 2025 13:37 — with GitHub Actions Inactive

stoksc closed this Sep 18, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

crash on hot reload hangs #10840

crash on hot reload hangs #10840

Uh oh!

stoksc commented Sep 4, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Sep 4, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

cosmo0920 left a comment

Uh oh!

cosmo0920 commented Sep 5, 2025

Uh oh!

edsiper Sep 5, 2025

Uh oh!

stoksc Sep 8, 2025

Uh oh!

stoksc Sep 10, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

crash on hot reload hangs #10840

crash on hot reload hangs #10840

Uh oh!

Conversation

stoksc commented Sep 4, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Sep 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Pre-merge checks (1 passed, 2 warnings)

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

cosmo0920 left a comment

Choose a reason for hiding this comment

Uh oh!

cosmo0920 commented Sep 5, 2025

Uh oh!

edsiper Sep 5, 2025

Choose a reason for hiding this comment

Uh oh!

stoksc Sep 8, 2025

Choose a reason for hiding this comment

Uh oh!

stoksc Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

stoksc commented Sep 4, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Sep 4, 2025 •

edited

Loading