Jobs stopping unexpectedly #962

@sheerlox

Supersedes the following issues:

Bug description

Under certain conditions, cron jobs stop executing unexpectedly without any error or log output. This behavior has been observed in jobs running at high frequency (e.g., every second or every few seconds) as well as jobs scheduled less frequently. The job ceases operation after a variable period (from as little as 15 minutes to several days) and does not resume until the application or container is restarted. In some cases, the issue appears to coincide with low available machine resources or system time adjustments, while in others it occurs without any apparent external trigger.

Root cause analysis

The root cause of these issues is the way the library calculates the time until the next scheduled execution. The calculation reads the current time twice, once inside sendAt and once in getTimeout after sendAt returns, which opens a window for a race condition:

1. Initial time fetch

When sendAt starts, it fetches the current time to calculate the next execution time based on the cron expression (source).

sendAt(i?: number): DateTime | DateTime[] {
  // First read of the current time: the basis for finding the next execution.
  let date =
    this.realDate && this.source instanceof DateTime
      ? this.source
      : DateTime.local();
  // ...
}

2. Timeout calculation

After sendAt completes, the library fetches the current time again to compute the remaining delay (source).

getTimeout() {
  return Math.max(-1, this.sendAt().toMillis() - DateTime.local().toMillis());
}

If the delay between these two time fetches is sufficiently large (due to execution delays or a system time change), the computed timeout can become negative. In such cases, the job is stopped immediately (source).

if (timeout >= 0) {
  // ...
  setCronTimeout(timeout);
} else {
  this.stop();
}
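To make the arithmetic concrete, here is a minimal sketch of the same two-read pattern with an injectable clock. It is not the library's code: `Clock`, `computeTimeout`, and `fakeClock` are hypothetical names, and the next execution time is approximated as the next multiple of a fixed interval rather than derived from a cron expression.

```typescript
// Hypothetical stand-in for the library's clock; not part of cron's API.
type Clock = () => number;

// Same shape as getTimeout: one clock read picks the next fire time,
// a second clock read computes the remaining delay.
function computeTimeout(now: Clock, intervalMs: number): number {
  const first = now(); // read 1: basis for the next execution time
  const next = Math.floor(first / intervalMs) * intervalMs + intervalMs;
  const second = now(); // read 2: basis for the delay
  return Math.max(-1, next - second);
}

// A clock that returns each scripted reading in turn, repeating the last one.
function fakeClock(readings: number[]): Clock {
  let i = 0;
  return () => readings[Math.min(i++, readings.length - 1)];
}

// 5 ms between the two reads: the next tick (11000) is still in the future.
const healthy = computeTimeout(fakeClock([10200, 10205]), 1000);

// 1100 ms between the two reads: the next tick is already in the past,
// so the result is clamped to -1 and the else branch above would stop the job.
const stalled = computeTimeout(fakeClock([10200, 11300]), 1000);
```

With these readings, `healthy` is 795 and `stalled` is -1.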

Reproduction

The bug can be reproduced by stubbing Date.now() to simulate either a prolonged execution of sendAt or a system time jump between the initial time fetch and the subsequent calculation. This controlled manipulation forces getTimeout to compute a negative timeout, which triggers the job to stop.

See the implemented test case.
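The stubbing technique can be sketched roughly as follows. This is not the linked test case: `withStubbedNow` and `delayUntilNextSecond` are hypothetical helpers, and the sketch assumes both clock reads ultimately go through Date.now().

```typescript
// Hypothetical test helper: replaces Date.now with scripted readings for the
// duration of `body`, then restores the real implementation.
function withStubbedNow<T>(readings: number[], body: () => T): T {
  const realNow = Date.now;
  let call = 0;
  Date.now = () => readings[Math.min(call++, readings.length - 1)];
  try {
    return body();
  } finally {
    Date.now = realNow; // always restore, even if body throws
  }
}

// Stand-in for the library's two clock reads (assumption: not cron's code).
function delayUntilNextSecond(): number {
  const start = Date.now();
  const nextTick = (Math.floor(start / 1000) + 1) * 1000;
  return Math.max(-1, nextTick - Date.now());
}

// Simulate a 5-second forward clock jump between the two reads.
const jumped = withStubbedNow([60000, 65000], delayUntilNextSecond);

// Control case: only 10 ms pass between the two reads.
const steady = withStubbedNow([60000, 60010], delayUntilNextSecond);
```

Restoring Date.now in a finally block keeps the stub from leaking into other tests if the body throws.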

Proposed fix

Introduce a configurable threshold, with a sensible default, to distinguish minor delays from significant timing discrepancies. When a negative timeout is detected and its magnitude is:

  • Within the threshold: run the job immediately (and log a warning)
  • Outside the threshold: skip the current execution (and log a warning)

This solution combines the strengths of the two simpler strategies (always executing immediately, or always skipping the missed execution) while mitigating their drawbacks:

  • Balanced execution: in real-world scenarios, minor delays (for example, those under 500 ms or 1000 ms) are often acceptable. Executing the job immediately in these cases prevents unnecessary stoppages and keeps high-frequency tasks running with minimal interruption.

  • Flexibility: making the threshold configurable allows developers to tailor the behavior based on the criticality and frequency of their cron jobs.

  • Observability: logging warnings when a large negative timeout is encountered provides critical insight into timing issues. This transparency aids in diagnostics and helps developers fine-tune the threshold settings based on observed behavior in production.
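The decision rule described above could look roughly like this. It is a sketch, not the final implementation: `Decision`, `decide`, and the 250 ms default are hypothetical, since the issue does not fix a default value. It also assumes the raw, unclamped difference is available, because a value already clamped to -1 no longer says how late the job is.

```typescript
// Hypothetical result type for the scheduling decision.
type Decision =
  | { action: "schedule"; delayMs: number }
  | { action: "run-now"; warning: string }
  | { action: "skip"; warning: string };

// rawTimeoutMs is the unclamped (next execution time - current time).
// thresholdMs is the configurable tolerance; 250 is an assumed default.
function decide(rawTimeoutMs: number, thresholdMs = 250): Decision {
  if (rawTimeoutMs >= 0) {
    // Normal case: the next execution is still in the future.
    return { action: "schedule", delayMs: rawTimeoutMs };
  }
  const lateByMs = -rawTimeoutMs;
  if (lateByMs <= thresholdMs) {
    // Minor delay: run immediately instead of stopping the job.
    return {
      action: "run-now",
      warning: `execution is ${lateByMs}ms late, running immediately`,
    };
  }
  // Significant discrepancy: skip this execution; the caller would then
  // schedule the following one rather than stop the job.
  return {
    action: "skip",
    warning: `execution is ${lateByMs}ms late (threshold ${thresholdMs}ms), skipping`,
  };
}

const onTime = decide(500);          // schedule in 500 ms
const slightlyLate = decide(-100);   // within threshold: run now
const veryLate = decide(-5000);      // outside threshold: skip
```

Returning a decision object instead of acting directly keeps the rule easy to unit test and leaves logging of the warning to the caller.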
