Skip to content

Conversation

@timmilesdw
Copy link
Contributor

@timmilesdw timmilesdw commented Sep 4, 2025

Overview

Implement graceful shutdown for client-go informers and add structured shutdown logs across shell-operator components.

What this PR does / why we need it

  • Introduces proper lifecycle tracking and waiting for informer goroutines to exit:
    • Adds a done channel to Factory and starts informer.Run(...) exactly once per factory; closes done when the goroutine exits.
    • FactoryStore.Stop(...) now, upon removing the last handler, cancels the shared context and waits for <-done before deleting the factory entry.
    • Adds FactoryStore.WaitStopped(index) to block until a factory for a given FactoryIndex fully stops.
  • Adds completion signaling to informers:
    • resourceInformer.wait() waits for shared informer termination via FactoryStore.WaitStopped.
    • namespaceInformer wraps Run() with a done channel and exposes wait().
  • Adds upper-level waits:
    • Monitor now has Wait() to wait for all resource/namespace informers to stop.
    • KubeEventsManager now has Wait() to aggregate Monitor.Wait() calls without holding locks while blocking.
  • Makes operator shutdown truly graceful with logs:
    • Shutdown() now logs each stage, cancels KubeEventsManager, then waits for completion (Wait()), then stops and waits for queues.

Special notes for your reviewer

  • Concurrency:
    • KubeEventsManager.Wait() snapshots monitors under RLock and releases the lock before waiting, avoiding deadlocks and map iteration races.
    • FactoryStore.Stop(...) waits outside the lock for the done channel, then reacquires the lock to delete the factory and broadcast to waiters.
  • Logging:
    • Added clear shutdown logs to operator.Shutdown(); feel free to request additional log details or levels.
  • Risk/mitigation:
    • If an informer fails to exit after cancel(), waits will block. Current behavior intentionally favors correctness. If desired, we can add timeouts to Wait() paths in a follow-up.
  • How to validate:
    • Start shell-operator; send SIGTERM/SIGINT; observe logs:
      • “shutdown: begin” → “schedule manager stopped” → “kube events manager canceled, waiting for informers” → “kube events manager done” → “task queues stop signaled, waiting” → “task queues stopped”.

Signed-off-by: Timur Tuktamyshev <timur.tuktamyshev@flant.com>
Signed-off-by: Timur Tuktamyshev <timur.tuktamyshev@flant.com>
@timmilesdw timmilesdw added the enhancement New feature or request label Sep 4, 2025
@timmilesdw timmilesdw self-assigned this Sep 4, 2025
@timmilesdw timmilesdw force-pushed the fix/stop-informer branch 2 times, most recently from 71b5130 to cde1198 Compare September 4, 2025 10:21
Signed-off-by: Timur Tuktamyshev <timur.tuktamyshev@flant.com>
Signed-off-by: Timur Tuktamyshev <timur.tuktamyshev@flant.com>
@timmilesdw timmilesdw marked this pull request as ready for review September 4, 2025 14:11
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements graceful shutdown for client-go informers with proper lifecycle tracking and completion signaling. The changes ensure that all informer goroutines fully exit before resources are cleaned up, preventing potential race conditions and panics during shutdown.

Changes:

  • Replaced PauseHandleEvents() method with Stop() and Wait() methods across the kube events manager stack
  • Added proper lifecycle tracking to Factory with a done channel that signals when informer goroutines exit
  • Implemented WaitStopped() in FactoryStore using condition variables to allow blocking until factory cleanup completes
  • Enhanced operator shutdown with structured logging at each phase of the shutdown sequence

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/kube_events_manager/factory.go Added done channel to Factory, implemented WaitStopped() with condition variable, ensured single informer.Run() per factory
pkg/kube_events_manager/kube_events_manager.go Replaced PauseHandleEvents() with Stop() and Wait(), snapshots monitors before waiting to avoid deadlocks
pkg/kube_events_manager/monitor.go Converted PauseHandleEvents() to Wait() that blocks until all informers stop
pkg/kube_events_manager/resource_informer.go Replaced pauseHandleEvents() with wait() that blocks on FactoryStore.WaitStopped()
pkg/kube_events_manager/namespace_informer.go Added done channel and wait() method, wrapped Run() to signal completion
pkg/shell-operator/operator.go Added structured shutdown logs and proper Wait() call after Stop() for informers
test/integration/kube_event_manager/kube_event_manager_test.go Updated test cleanup to use Stop() instead of PauseHandleEvents()
Comments suppressed due to low confidence (1)

pkg/kube_events_manager/resource_informer.go:64

  • The stopped field is still checked in handleWatchEvent() (line 281), but it's never set to true in the new implementation since pauseHandleEvents() was removed. This appears to be dead code that should either be removed or properly set when the context is canceled.
	stopped bool

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: Pavel Okhlopkov <pavel.okhlopkov@flant.com>
Signed-off-by: Pavel Okhlopkov <pavel.okhlopkov@flant.com>
@ldmonster ldmonster requested a review from Copilot January 14, 2026 13:40
@ldmonster ldmonster changed the title Fix/stop informer [shell-operator] fix: stop informer correctly Jan 14, 2026
@ldmonster ldmonster merged commit 9581c0f into main Jan 14, 2026
15 checks passed
@ldmonster ldmonster deleted the fix/stop-informer branch January 14, 2026 13:44
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ldmonster ldmonster added the run/tests Run tests on full matrix of k8s versions label Jan 14, 2026
@github-actions github-actions bot removed the run/tests Run tests on full matrix of k8s versions label Jan 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants