Description
Supersedes the following issues:
- Cron job stops after random hours/days #805
- The job stops if the time in the system changes at an unfortunate moment #514
- cron job stops after certain hours. #232
- Cron doesn't run after 15 minutes of running every 2 minutes #231
Bug description
Under certain conditions, cron jobs stop executing unexpectedly without any error or log output. This behavior has been observed in jobs running at high frequency (e.g., every second or every few seconds) as well as jobs scheduled less frequently. The job ceases operation after a variable period (from as little as 15 minutes to several days) and does not resume until the application or container is restarted. In some cases, the issue appears to coincide with low available machine resources or system time adjustments, while in others it occurs without any apparent external trigger.
Root cause analysis
The root cause of these issues is the way the library calculates the time until the next scheduled execution. The process involves two separate fetches of the current time within the sendAt function, leading to a potential race condition:
1. Initial time fetch
When sendAt starts, it fetches the current time to calculate the next execution time based on the cron expression (source).
sendAt(i?: number): DateTime | DateTime[] {
let date =
this.realDate && this.source instanceof DateTime
? this.source
: DateTime.local();
2. Timeout calculation
After sendAt completes, the library fetches the current time again to compute the remaining delay (source).
getTimeout() {
return Math.max(-1, this.sendAt().toMillis() - DateTime.local().toMillis());
}
If enough time passes between these two fetches (because of execution delays or a system time change), the second reading can land past the computed execution time. The difference is then negative, and getTimeout returns -1. In that case, the job is stopped immediately (source).
if (timeout >= 0) {
// ...
setCronTimeout(timeout);
} else {
this.stop();
}
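The arithmetic behind the two clock reads can be illustrated with a minimal sketch. It does not use the library or Luxon: `nextRunAt`, `getTimeout`, and `scriptedClock` are hypothetical stand-ins, and the schedule is assumed to be "every whole second" purely to keep the numbers easy to follow.

```typescript
// Hypothetical stand-ins for the library's internals; not the real cron API.
type Clock = () => number;

// Mimics the first time fetch: reads the clock once and returns the next
// whole-second boundary (assumed schedule: every second).
function nextRunAt(clock: Clock): number {
  const now = clock();
  return Math.floor(now / 1000) * 1000 + 1000;
}

// Mimics the second time fetch: the clock is read again after nextRunAt
// has returned, and the result is clamped to -1 as in the quoted code.
function getTimeout(clock: Clock): number {
  return Math.max(-1, nextRunAt(clock) - clock());
}

// A clock that returns a scripted sequence of readings, one per call.
function scriptedClock(readings: number[]): Clock {
  let i = 0;
  return () => readings[Math.min(i++, readings.length - 1)];
}

// 1 ms passes between the two reads: 11000 - 10401 = 599.
const normal = getTimeout(scriptedClock([10400, 10401]));

// 850 ms pass between the two reads: 11000 - 11250 = -250, clamped to -1.
const delayed = getTimeout(scriptedClock([10400, 11250]));

console.log({ normal, delayed });
```

The second case is the one that reaches the `this.stop()` branch: the next run was computed relative to the first reading, but the delay was measured against the second.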
Reproduction
The bug can be reproduced by stubbing Date.now() to simulate either a prolonged execution of sendAt or a system time jump between the initial time fetch and the subsequent calculation. This controlled manipulation forces getTimeout to compute a negative timeout, which triggers the job to stop.
See the implemented test case.
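For readers without the test case at hand, the stubbing approach can be sketched roughly as follows. This is not the linked test: `MiniJob` and `withStubbedNow` are hypothetical helpers that only mirror the shape of the quoted snippets, and the schedule is again assumed to be "every whole second".

```typescript
// Hypothetical miniature job; it only mirrors the quoted snippets and is
// not the library's CronJob class.
class MiniJob {
  running = false;
  scheduledDelay: number | null = null;

  // First clock read: compute the next whole-second boundary.
  private nextRun(): number {
    const now = Date.now();
    return Math.floor(now / 1000) * 1000 + 1000;
  }

  // Second clock read: compute the remaining delay, clamped to -1.
  private getTimeout(): number {
    return Math.max(-1, this.nextRun() - Date.now());
  }

  start(): void {
    this.running = true;
    const timeout = this.getTimeout();
    if (timeout >= 0) {
      this.scheduledDelay = timeout; // a real job would arm a timer here
    } else {
      this.running = false; // mirrors the this.stop() branch
    }
  }
}

// Temporarily replaces Date.now with a scripted sequence of readings and
// restores the original even if fn throws.
function withStubbedNow<T>(readings: number[], fn: () => T): T {
  const original = Date.now;
  let i = 0;
  Date.now = () => readings[Math.min(i++, readings.length - 1)];
  try {
    return fn();
  } finally {
    Date.now = original;
  }
}

// Healthy case: 1 ms between the two reads, so the job gets scheduled.
const healthyJob = new MiniJob();
withStubbedNow([5000, 5001], () => healthyJob.start());

// Simulated time jump of 2.5 s between the two reads: the job stops.
const jumpedJob = new MiniJob();
withStubbedNow([5000, 7500], () => jumpedJob.start());

console.log(healthyJob.running, healthyJob.scheduledDelay); // true 999
console.log(jumpedJob.running, jumpedJob.scheduledDelay); // false null
```

Note that the stopped job produces no error and no log line, which matches the silent behavior described in the bug report.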
Proposed fix
Introduce a configurable threshold with a sensible default to distinguish between minor delays and significant timing discrepancies. When a negative timeout is detected, the behavior depends on its magnitude:
- Within threshold: schedule the job immediately (and log a warning)
- Outside threshold: skip the current execution (and log a warning)
This approach combines the strengths of the two simpler alternatives (always executing immediately, or always skipping the execution) while mitigating their drawbacks:
- Balanced execution: in real-world scenarios, minor delays (for example, those under 500ms/1000ms) are often acceptable. Executing the job immediately in these cases prevents unnecessary stoppages and keeps high-frequency tasks running with minimal interruption.
- Flexibility: making the threshold configurable allows developers to tailor the behavior based on the criticality and frequency of their cron jobs.
- Observability: logging warnings when a large negative timeout is encountered provides critical insight into timing issues. This transparency aids in diagnostics and helps developers fine-tune the threshold settings based on observed behavior in production.
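The threshold rule could look roughly like the sketch below. It is an illustration rather than the final implementation: `decide`, `thresholdMs`, and the 1000 ms default are assumptions made for this example. Because the current getTimeout clamps negative values to -1, the sketch works from the raw, unclamped difference so that the size of the delay is still available.

```typescript
// Hypothetical decision type and option name; neither exists in the library.
type Decision =
  | { action: "schedule"; delayMs: number }
  | { action: "run-now"; lateByMs: number }
  | { action: "skip"; lateByMs: number };

function decide(
  nextRunMs: number,
  nowMs: number,
  thresholdMs: number = 1000, // assumed default, for illustration only
  warn: (message: string) => void = console.warn
): Decision {
  // Raw difference, deliberately not clamped to -1.
  const delta = nextRunMs - nowMs;
  if (delta >= 0) {
    return { action: "schedule", delayMs: delta };
  }

  const lateByMs = -delta;
  if (lateByMs <= thresholdMs) {
    // Minor delay: run the job right away instead of stopping it.
    warn(`Execution is ${lateByMs} ms late; running immediately.`);
    return { action: "run-now", lateByMs };
  }

  // Significant discrepancy: skip this execution but keep the job alive.
  warn(`Execution is ${lateByMs} ms late; skipping this run.`);
  return { action: "skip", lateByMs };
}

// Collect warnings instead of printing them, so the example stays quiet.
const warnings: string[] = [];
const collect = (message: string): void => {
  warnings.push(message);
};

const onTime = decide(6000, 5001, 1000, collect); // schedule in 999 ms
const slightlyLate = decide(6000, 6100, 1000, collect); // run now, 100 ms late
const veryLate = decide(6000, 7500, 1000, collect); // skip, 1500 ms late

console.log(onTime, slightlyLate, veryLate, warnings.length);
```

In both negative branches the job would stay alive: after a skip, the caller would presumably compute the following occurrence and schedule that instead of calling stop().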