Improve System Resilience: Prevent Updates Without Initial Signal After Disturbances (Crash/OOM/Trace Loss) #166

Description

@usfalami

Background

Currently, when the system is disturbed (for example by a crash, an OOM event, or the loss or deletion of traces), it can emit updates without ever having sent the expected initial signal. This can compromise trace reliability and create inconsistencies on the server side.

Problem

  • System disturbances (e.g., accidental trace deletions, process crashes, out-of-memory events) may cause trace entries to be lost.
  • After such a disturbance, the system may continue to send update traces even though the corresponding initial/start signal was never sent.
  • This behavior can cause serious issues on the server, such as update operations that reference non-existent or unsynchronized traces, or data corruption.

Proposal

  • Implement resilience mechanisms to detect system disturbances and prevent emission of update traces if the departure/start signal was never sent.
  • Ensure that after an OOM event, a crash, or a loss of traces, no update traces are generated until a new valid departure/start signal is recorded.
  • Optionally, log the event or send a notification to facilitate observability and debugging.
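
To make the proposal concrete, here is a minimal sketch of a guard in front of the emitter. It is an illustration only, not the project's actual API: `TraceGuard`, `emit_start`, `emit_update`, `invalidate`, and the message fields are all hypothetical names. The guard keeps its "started" state in memory only, so a restart after a crash or OOM begins with no known start signals and updates are suppressed until a new start is emitted.

```python
import logging
from typing import Callable

logger = logging.getLogger("trace_guard")


class TraceGuard:
    """Hypothetical emitter wrapper: lets updates through only after a start signal."""

    def __init__(self, send: Callable[[dict], None]) -> None:
        self._send = send
        # In-memory only, so it is empty again after a crash or OOM restart.
        self._started = set()

    def emit_start(self, session_id: str) -> None:
        self._send({"type": "start", "session": session_id})
        # Mark as started only after send() returned without raising.
        self._started.add(session_id)

    def emit_update(self, session_id: str, payload: dict) -> bool:
        """Send an update if a start was sent; otherwise log and return False."""
        if session_id not in self._started:
            logger.warning("update suppressed: no start signal for session %s", session_id)
            return False
        self._send({"type": "update", "session": session_id, "data": payload})
        return True

    def invalidate(self, session_id: str) -> None:
        """Call when the start trace is known to be lost or deleted."""
        self._started.discard(session_id)
```

Creating a fresh `TraceGuard` stands in for a process restart here. Whether the real system should also persist this state, and how trace deletion would be detected in order to call `invalidate`, are open design questions this sketch does not answer.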

Acceptance Criteria

  • System does not emit update traces without a valid initial signal after disturbance scenarios (crash, OOM, trace loss, deletion, etc.).
  • Disturbances are detected, and appropriate fallback or fail-safe measures are put in place.
  • Relevant logging or notifications are added as needed for post-incident analysis.
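
As one possible fail-safe for the criteria above, pending traces could be filtered just before they are sent, so that an update whose start trace was lost or deleted never reaches the server. The sketch below assumes a hypothetical trace shape (dicts with `type` and `session` keys) and a hypothetical `acknowledged` set of session ids whose start trace the server already has. None of these names come from the project.

```python
import logging

logger = logging.getLogger("trace_flush")


def filter_batch(batch, acknowledged):
    """Return the traces that are safe to send, dropping orphan updates.

    batch: ordered list of dicts with hypothetical keys "type" and "session".
    acknowledged: session ids whose start trace was already accepted.
    """
    seen = set(acknowledged)
    safe = []
    for trace in batch:
        if trace["type"] == "start":
            # A start earlier in the same batch legitimizes later updates.
            seen.add(trace["session"])
        elif trace["type"] == "update" and trace["session"] not in seen:
            # Log the dropped update to support post-incident analysis.
            logger.warning("dropping orphan update for session %s", trace["session"])
            continue
        safe.append(trace)
    return safe
```

The filter is order-aware: an update that precedes its own start in the batch is still dropped, which errs on the side of never sending an update the server cannot match.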

Metadata

Labels

enhancement: New feature or request

Projects

Status
Todo

Relationships

None yet

Development

No branches or pull requests
