Rethinking termination signals and timeout behavior

The current termination model supports graceful shutdown via single Ctrl+C and forced shutdown via double Ctrl+C, but it oversimplifies the reality of how processes terminate in production environments. We need to acknowledge and handle three distinct termination scenarios: graceful (single signal), forced (double signal), and abrupt (kill -9, OOM, power loss), each with different guarantees and behaviors.

Graceful shutdown begins when the process receives a first termination signal (SIGINT from Ctrl+C, or SIGTERM from a supervisor); it initiates orderly cleanup and waits for components to finish their work. Forced shutdown is triggered by a second signal, typically when the user loses patience, and tells the process to stop waiting and terminate immediately. Abrupt termination happens without warning through external events that the process cannot intercept, such as SIGKILL, which cannot be caught or handled. Currently, we document and handle the first two but largely ignore the third, despite it being common in production.
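The escalation from graceful to forced can be sketched as a small state tracker. This is a minimal illustration in Python, not the project's actual implementation; `Mode`, `ShutdownTracker`, and its methods are hypothetical names introduced here.

```python
import enum
import signal
import threading


class Mode(enum.Enum):
    RUNNING = "running"
    GRACEFUL = "graceful"
    FORCED = "forced"


class ShutdownTracker:
    """Hypothetical helper: escalates from graceful to forced on repeated signals."""

    def __init__(self):
        self.mode = Mode.RUNNING
        self.graceful = threading.Event()  # set by the first signal
        self.forced = threading.Event()    # set by the second signal

    def handle(self, signum=None, frame=None):
        # First signal starts orderly cleanup; any later signal forces exit.
        if self.mode is Mode.RUNNING:
            self.mode = Mode.GRACEFUL
            self.graceful.set()
        else:
            self.mode = Mode.FORCED
            self.forced.set()

    def install(self):
        # Must be called from the main thread.
        # SIGKILL cannot be caught, so abrupt termination never reaches handle().
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self.handle)
```

Note that there is no `ABRUPT` state to enter: by definition the process gets no chance to observe that mode.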

The Kubernetes context adds another dimension. When Kubernetes decides to terminate a pod, it sends SIGTERM and waits for the pod's `terminationGracePeriodSeconds` (30 seconds by default) before sending SIGKILL. The value can be changed in the pod spec, but once the period elapses the outcome is not negotiable: the process dies abruptly regardless of cleanup state. Our current forced shutdown timeout doesn't account for this constraint. If we configure a 30-second forced shutdown timeout and Kubernetes kills us at 30 seconds, we never complete the forced shutdown phase. We're racing against a deadline we can't see.
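One way to make that deadline visible inside the process is to fix an explicit time budget when the first signal arrives and let each cleanup step ask how much is left. The sketch below is an assumption about how this could look; `Deadline` is a hypothetical helper, and the injectable `clock` exists only to make it testable.

```python
import time


class Deadline:
    """Hypothetical helper: tracks how much of a shutdown budget remains."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        # Never negative: once the budget is spent, callers see 0.0.
        return max(self._expires_at - self._clock(), 0.0)

    def expired(self):
        return self.remaining() == 0.0
```

A cleanup step could then pass `deadline.remaining()` as the timeout to any blocking call instead of using its own fixed wait.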

A production-ready lifecycle manager should understand its execution environment and adjust termination behavior accordingly. When running under Kubernetes, the forced shutdown timeout should be slightly less than the pod's `terminationGracePeriodSeconds` so that we can complete orderly termination before SIGKILL arrives. With the default 30-second grace period, this might mean defaulting to 24 seconds when Kubernetes is detected, leaving 6 seconds of buffer for cleanup and logging. The exact values should be configurable, but the defaults should be environment-aware.
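A minimal sketch of environment-aware defaults follows, under a few assumptions. Detection relies on the `KUBERNETES_SERVICE_HOST` environment variable that Kubernetes injects into pod containers. The function names, the constants, and the 30-second default outside Kubernetes are hypothetical. The process generally has to be told the pod's grace period through configuration, so it is modeled here as a parameter.

```python
import os

DEFAULT_TIMEOUT_SECONDS = 30.0   # assumed default outside Kubernetes
KUBERNETES_GRACE_SECONDS = 30.0  # Kubernetes default terminationGracePeriodSeconds
SAFETY_BUFFER_SECONDS = 6.0      # buffer from the proposal: 30 - 6 = 24


def running_in_kubernetes(env=None):
    # Kubernetes sets KUBERNETES_SERVICE_HOST in every pod container.
    env = os.environ if env is None else env
    return "KUBERNETES_SERVICE_HOST" in env


def forced_shutdown_timeout(env=None, override=None,
                            grace_seconds=KUBERNETES_GRACE_SECONDS):
    # An explicit override always wins over environment-based defaults.
    if override is not None:
        return float(override)
    if running_in_kubernetes(env):
        return max(grace_seconds - SAFETY_BUFFER_SECONDS, 0.0)
    return DEFAULT_TIMEOUT_SECONDS
```

For example, `forced_shutdown_timeout(env={"KUBERNETES_SERVICE_HOST": "10.0.0.1"})` yields 24 seconds, while passing `override=10` yields 10 seconds regardless of environment.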

Beyond timeout tuning, we need to document the reality of abrupt termination. Component authors must understand that cleanup is not guaranteed to run. Cleanup functions handle the graceful and forced cases, but abrupt termination bypasses them entirely. This affects design decisions around data durability, transaction boundaries, and external system state. Documentation should explicitly call out operations that require guaranteed cleanup versus those that can tolerate abrupt termination.
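As an illustration of designing for abrupt termination, a component that persists state can write to a temporary file and rename it into place, so that a kill at any instant leaves either the old file or the new one, never a half-written one. This is a generic pattern and not something the project currently ships; `atomic_write` is a hypothetical name.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Replace `path` with `data` without ever exposing a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push contents to disk before the rename
        os.replace(tmp_path, path)  # atomic swap on POSIX filesystems
    except BaseException:
        # Best-effort removal of the temp file on the graceful/forced paths.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The sketch does not fsync the containing directory, so after a power loss the rename itself may not have been persisted. An abrupt kill can also leave a stray temporary file behind, which is the kind of residue a component would need to tolerate or sweep up at startup.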

## Acceptance Criteria

- Three termination modes are clearly documented: graceful, forced, and abrupt
- Forced shutdown timeout becomes configurable with environment-aware defaults
- Kubernetes detection sets appropriate timeout (e.g., 24 seconds for 30-second grace period)
- Configuration allows override of environment-based timeout defaults
- Documentation explains cleanup guarantees and design implications for each mode
- Component examples demonstrate proper handling of all termination scenarios
