Skip to content

Conversation

@stoksc
Copy link
Contributor

@stoksc stoksc commented Sep 4, 2025

Summary by CodeRabbit

  • New Features

    • Added a hard safety timeout for hot-reloads to abort after extended grace when work remains.
    • Recounts pending work during shutdown and improves diagnostics with backlog, memory and filesystem chunk details.
    • When configured, pending data is segregated or flushed during shutdown to ensure data handling.
  • Bug Fixes

    • Prevents rare stalls by aborting hot-reloads after repeated grace timeouts.
    • Preserves clean, graceful shutdown when no backlog exists.

@coderabbitai
Copy link

coderabbitai bot commented Sep 4, 2025

Walkthrough

Adds hot-reload safety timeouts and abort paths to the engine shutdown tick: recounts running tasks and memory/filesystem backlog, logs counts, aborts if hot-reload timeout reached, and on final grace attempts chunk segregation or flush; otherwise proceeds with normal stop/cleanup. (≤50 words)

Changes

Cohort / File(s) Summary
Engine shutdown and hot-reload handling
src/flb_engine.c
Add hot-reload hard timeout/abort when shutdown_by_hot_reloading is enabled and grace_count >= 10; perform a recount of running tasks and storage (memory/fs) chunks and log backlog; on final grace attempt sb_segregate_chunks if storage_bl_flush_on_shutdown enabled (return -2 on failure) or call flb_engine_flush; retain existing grace wait and normal stop/cleanup paths.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant T as Shutdown Timer
  participant E as Engine
  participant S as Storage
  participant L as Logger

  T->>E: shutdown_tick()
  E->>E: recount running tasks + mem_chunks + fs_chunks
  E->>L: log backlog counts and grace state
  alt Hot-reload timeout (shutdown_by_hot_reloading && grace_count >= 10)
    E->>L: log error + running tasks
    E-->>E: abort (return -1)
  else Backlog remains within grace
    opt grace_count == 1
      alt storage_bl_flush_on_shutdown enabled
        E->>S: sb_segregate_chunks()
        alt segregation fails
          E-->>E: return -2
        else segregation OK
          E->>E: flb_engine_flush(NULL)
        end
      else
        E->>E: flb_engine_flush(NULL)
      end
    end
    E-->>T: continue waiting (increment grace_count)
  else No outstanding work
    E->>L: log service stop
    E-->>E: set exit status, destroy timer, shutdown
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • edsiper
  • koleini
  • fujimotos

Pre-merge checks (1 passed, 2 warnings)

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Title Check ⚠️ Warning The current title “crash on hot reload hangs” is vague and grammatically awkward, and it does not clearly convey the primary change introduced by this PR, which is the addition of a hot-reload safety timeout in the engine shutdown path to prevent hangs. As a result, a teammate scanning the history would not immediately understand the key feature or fix implemented. The title fails to meet the requirement for concise and specific phrasing that highlights the most important change. Please rename the title to clearly reflect the main change and its intent, for example: “Add hot-reload safety timeout to engine shutdown to prevent hangs,” so that it succinctly describes the feature and its purpose.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

Poem

A rabbit taps the graceful clock, counts chunks with whiskered care,
If reloads hang beyond the time, we sound the final flare.
Segregate, then flush away, or bail with tidy logs in tow,
Hop home, the engine quiets — down the burrow we go. 🐇

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/flb_engine.c (1)

1101-1108: Don’t return -2 during shutdown; log and continue

In src/flb_engine.c at the shutdown loop (around line 1104), change the error path so it only logs and proceeds, ensuring the full shutdown sequence (including flb_engine_shutdown()) always runs:

-    if (ret < 0) {
-        flb_error("[engine] could not segregate backlog chunks during shutdown");
-        return -2;
-    }
+    if (ret < 0) {
+        flb_error("[engine] could not segregate backlog chunks during shutdown; proceeding without segregation");
+        /* continue shutdown path without early exit */
+    }

This avoids exiting flb_engine_start() mid-shutdown on a segregation failure—callers only special‐case -1 and would otherwise skip cleanup.

🧹 Nitpick comments (3)
src/flb_engine.c (3)

1087-1094: Hot-reload timeout: make 300s configurable, improve messaging, and best-effort notify before abort

Good safeguard. Recommend:

  • Use a constant or config knob instead of hardcoded 300.
  • Log “reached (>= %ds)” to match the condition and include the value.
  • Best-effort notify library users before abort via flb_engine_failed(config).

Apply within range:

-                    if (ret > 0 && config->shutdown_by_hot_reloading && config->grace_count >= 300) {
-                        flb_error("[engine] hot reload timeout exceeded (300s), crashing to prevent "
-                                  "indefinite hang");
-                        flb_task_running_print(config);
-                        abort();
-                    }
+                    if (ret > 0 && config->shutdown_by_hot_reloading &&
+                        config->grace_count >= FLB_HOT_RELOAD_TIMEOUT_SECS) {
+                        flb_error("[engine] hot reload timeout reached (>= %ds), crashing to prevent indefinite hang",
+                                  FLB_HOT_RELOAD_TIMEOUT_SECS);
+                        flb_task_running_print(config);
+                        (void) flb_engine_failed(config); /* best-effort notification */
+                        abort();
+                    }

Outside this hunk (add once near includes):

#ifndef FLB_HOT_RELOAD_TIMEOUT_SECS
#define FLB_HOT_RELOAD_TIMEOUT_SECS 300
#endif

1076-1084: Rate-limit shutdown backlog logging to avoid log spam

These paths run every second; printing can be noisy/heavy in large backlogs. Log on first tick and then periodically.

-                    if ((mem_chunks + fs_chunks) > 0) {
+                    if ((mem_chunks + fs_chunks) > 0 &&
+                        (config->grace_count == 1 || (config->grace_count % 10) == 0)) {
                         flb_info("[engine] pending chunk count: memory=%d, filesystem=%d; grace_timer=%d",
                                  mem_chunks, fs_chunks, config->grace_count);
                     }
 
-                    if (tasks > 0) {
-                        flb_task_running_print(config);
-                    }
+                    if (tasks > 0 &&
+                        (config->grace_count == 1 || (config->grace_count % 30) == 0)) {
+                        flb_task_running_print(config);
+                    }

1043-1054: Consistency nit: treat invalid fd as -1 (like flush_fd) rather than <= 0

Elsewhere timer fds use -1 for invalid; using <= 0 here is slightly inconsistent and could mask rare fd=0 edge cases.

-                    if (config->shutdown_fd <= 0) {
+                    if (config->shutdown_fd == -1) {
                         config->shutdown_fd = mk_event_timeout_create(evl,
                                                                       1,
                                                                       0,
                                                                       event);

Please verify initial value of config->shutdown_fd is -1.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c01c62c and 3820327.

📒 Files selected for processing (1)
  • src/flb_engine.c (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/flb_engine.c (1)
src/flb_task.c (1)
  • flb_task_running_print (299-345)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
  • GitHub Check: PR - fuzzing test
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
  • GitHub Check: pr-compile-centos-7
🔇 Additional comments (1)
src/flb_engine.c (1)

1094-1112: LGTM: preserves existing grace semantics while adding safe first-tick actions

The preservation of “ret > 0 && (grace_count < grace || grace == -1)” is correct, and the first-tick segregation/flush hook is a pragmatic addition.

Copy link
Contributor

@cosmo0920 cosmo0920 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks useful and making more safety for infinite stuck.

@cosmo0920
Copy link
Contributor

@stoksc But we need to Signed-off line in each commit. So, could you git commit -s and push force again?

src/flb_engine.c Outdated
flb_error("[engine] hot reload timeout exceeded (300s), crashing to prevent "
"indefinite hang");
flb_task_running_print(config);
abort();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about returning -1 instead of abort() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in process of writing a test that reproduces the issue. I'll try that once I'm done. Tried to follow the code, looks like as long as something in the error flow sets the flb_ctx_t->status to !ok it should work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not easy to test the sort of hangs we were seeing that way in a unit test; do you have a suggestion? i started trying to sleep or flb_time_sleep the thread w/ dummy but it's on the main thread so that doesn't work either way.

i've changed it to return -1 and i can test with the repro i have internally but a unit test for hotreload would be nice.

i'm going to hack on a watchdog thread version that could handle a truly hung plugin on the main thread too

Signed-off-by: Bradley Laney <bradley.laney@chronosphere.io>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/flb_engine.c (1)

1085-1091: Hot-reload timeout guard looks good and aligns with earlier feedback to avoid abort().

Returning -1 instead of aborting prevents a hard crash and still lets callers decide teardown behavior. Nice improvement.

🧹 Nitpick comments (2)
src/flb_engine.c (2)

1085-1091: Clarify log text and avoid unnecessary task-print when none exist.

Current message says “crashing” but we’re returning an error. Also we always print running tasks even if tasks == 0.

Apply this minimal tweak:

-                        flb_error("[engine] hot reload timeout exceeded (10s), crashing to prevent "
-                                  "indefinite hang");
-                        flb_task_running_print(config);
+                        flb_error("[engine] hot-reload shutdown timed out after 10s; aborting shutdown with error");
+                        if (tasks > 0) {
+                            flb_task_running_print(config);
+                        }

1085-1091: Make timeout configurable instead of hard-coded 10s.

Consider a config knob (e.g., HotReloadTimeout or env var) defaulting to 10s, so operators can tune for slow backends or stricter SLOs.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 40e38d9 and 9100ac8.

📒 Files selected for processing (1)
  • src/flb_engine.c (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/flb_engine.c (1)
src/flb_task.c (1)
  • flb_task_running_print (299-345)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (24)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
  • GitHub Check: PR - fuzzing test
🔇 Additional comments (1)
src/flb_engine.c (1)

1085-1091: No action required for flb_engine_start error handling
The library mode path in src/flb_lib.c checks for ret == -1 and invokes flb_engine_failed(config) followed by flb_engine_shutdown(config), ensuring proper teardown without resource leaks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants