Skip to content

Conversation

timmilesdw
Copy link
Contributor

@timmilesdw timmilesdw commented Sep 4, 2025

Overview

Implement graceful shutdown for client-go informers and add structured shutdown logs across shell-operator components.

What this PR does / why we need it

  • Introduces proper lifecycle tracking and waiting for informer goroutines to exit:
    • Adds a done channel to Factory and starts informer.Run(...) exactly once per factory; closes done when the goroutine exits.
    • FactoryStore.Stop(...) now, upon removing the last handler, cancels the shared context and waits for <-done before deleting the factory entry.
    • Adds FactoryStore.WaitStopped(index) to block until a factory for a given FactoryIndex fully stops.
  • Adds completion signaling to informers:
    • resourceInformer.wait() waits for shared informer termination via FactoryStore.WaitStopped.
    • namespaceInformer wraps Run() with a done channel and exposes wait().
  • Adds upper-level waits:
    • Monitor now has Wait() to wait for all resource/namespace informers to stop.
    • KubeEventsManager now has Wait() to aggregate Monitor.Wait() calls without holding locks while blocking.
  • Makes operator shutdown truly graceful with logs:
    • Shutdown() now logs each stage, cancels KubeEventsManager, then waits for completion (Wait()), then stops and waits for queues.

Special notes for your reviewer

  • Concurrency:
    • KubeEventsManager.Wait() snapshots monitors under RLock and releases the lock before waiting, avoiding deadlocks and map iteration races.
    • FactoryStore.Stop(...) waits outside the lock for the done channel, then reacquires the lock to delete the factory and broadcast to waiters.
  • Logging:
    • Added clear shutdown logs to operator.Shutdown(); feel free to request additional log details or levels.
  • Risk/mitigation:
    • If an informer fails to exit after cancel(), waits will block. Current behavior intentionally favors correctness. If desired, we can add timeouts to Wait() paths in a follow-up.
  • How to validate:
    • Start shell-operator; send SIGTERM/SIGINT; observe logs:
      • “shutdown: begin” → “schedule manager stopped” → “kube events manager canceled, waiting for informers” → “kube events manager done” → “task queues stop signaled, waiting” → “task queues stopped”.

Signed-off-by: Timur Tuktamyshev <timur.tuktamyshev@flant.com>
Signed-off-by: Timur Tuktamyshev <timur.tuktamyshev@flant.com>
@timmilesdw timmilesdw added the enhancement New feature or request label Sep 4, 2025
@timmilesdw timmilesdw self-assigned this Sep 4, 2025
@timmilesdw timmilesdw force-pushed the fix/stop-informer branch 2 times, most recently from 71b5130 to cde1198 Compare September 4, 2025 10:21
Signed-off-by: Timur Tuktamyshev <timur.tuktamyshev@flant.com>
Signed-off-by: Timur Tuktamyshev <timur.tuktamyshev@flant.com>
@timmilesdw timmilesdw marked this pull request as ready for review September 4, 2025 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant