#111433 Watch Next Run Interval Resets On Shard Move or Node Restart #115102
Conversation
Switch Watcher scheduler to use last exec time when restarting, moving shards or resuming from stopped.
Hi @lukewhiting, I've created a changelog YAML for you.
Pinging @elastic/es-data-management (Team:Data Management)
```java
        fail("waiting too long for all watches to be triggered");
    }

    advanceClockIfNeeded(clock.instant().plusMillis(1100).atZone(ZoneOffset.UTC));
```
It would be good to have short comments on each of these 4 tests describing what they're for. It took me a few minutes to figure out how they all differed. And in the case of this test, are you trying to show that it does not execute too many times if you advance the clock a good bit? Is it worth adding another latch or two to prove that the watch hasn't run when you expect it not to have run yet?
So there are essentially 2 conditions tested across the 4 tests here: watches with a `lastCheckedTime`, and those that don't have a `lastCheckedTime` but do have a `lastActivationTime`. For each of those, we test both the startup of the watcher service and adding those watches to an already running service.
For the startup tests, we use the latches to ensure each watch runs once before the interval elapses (to verify it picked up the last run time). For each add-to-running-service test, we start up Watcher, tick the clock forward to show some time passing, then add the watches and check they execute once before the interval time and once again after the interval time.
I have added some comments to better explain each test. Good idea on adding a check to make sure they don't run more times than expected; I have added that as well.
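The latch-plus-counter pattern discussed above can be sketched as follows. This is a hypothetical illustration, not the actual test code: `WatchRunRecorder`, `onExecution`, `ranOnce`, and `executionCount` are invented names. The latch proves the watch fired at least once before a deadline, while the counter proves it did not fire more times than expected.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class WatchRunRecorder {
    // Released the first time the watch fires.
    private final CountDownLatch firstRun = new CountDownLatch(1);
    // Counts every firing, so the test can assert there were no extras.
    private final AtomicInteger executions = new AtomicInteger();

    // Called by the (simulated) scheduler each time the watch executes.
    void onExecution() {
        executions.incrementAndGet();
        firstRun.countDown();
    }

    // True if the watch fired at least once within the timeout.
    boolean ranOnce(long timeout, TimeUnit unit) {
        try {
            return firstRun.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    int executionCount() {
        return executions.get();
    }
}
```

A test would block on `ranOnce` with a generous timeout (failing with a message like "waiting too long for all watches to be triggered"), then, after advancing the clock, assert that `executionCount()` still matches the expected number of runs.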
…ns happen during test
…atcher-interval-on-shard-move
LGTM
💚 Backport successful
…estart (elastic#115102)
* Switch Watcher scheduler to use last exec time when restarting, moving shards or resuming from stopped.
* Add tests for last runtime calculation
* Update docs/changelog/115102.yaml
* Add counter to watcher job executions to check no additional executions happen during test
This PR fixes #111433 by making the initial scheduling of `IntervalSchedule`s within the watcher `TickerScheduleTriggerEngine` aware of the correct last run time of a watch, rather than using the current time. This allows it to correctly calculate the next run time instead of waiting the full duration of the watch's interval before starting the first run.

This code uses the last time the watch started execution (`lastCheckedTime`) by default, falling back first to the time the watch was last activated (which is relevant for newly created or recently unpaused tasks) before finally falling back to `now()` as a last resort.

Using the last start time rather than the last completion time (which is not persistently stored at the moment) means we don't need to store a new item for every watch in the cluster state, reducing bloat, but it does mean that a watch migrated to a new node mid-run may execute early on its next run. I think this tradeoff is acceptable given the rarity and low impact of such a scenario vs. the impact of increasing the cluster state size for every watch added.
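The fallback chain described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `TickerScheduleTriggerEngine` code: `NextRunSketch`, `baseTime`, and `nextRun` are hypothetical names, and the clamp-to-now behavior for overdue watches is an assumption for the sketch.

```java
import java.time.Clock;
import java.time.Instant;

final class NextRunSketch {
    // Prefer lastCheckedTime, then lastActivationTime, then the current clock
    // time as a last resort, mirroring the fallback chain in the description.
    static Instant baseTime(Instant lastCheckedTime, Instant lastActivationTime, Clock clock) {
        if (lastCheckedTime != null) {
            return lastCheckedTime;
        }
        if (lastActivationTime != null) {
            return lastActivationTime;
        }
        return clock.instant();
    }

    // Next run = base time + interval. If that moment has already passed
    // (e.g. the node was down longer than the interval), run as soon as
    // possible rather than waiting a full interval again.
    static Instant nextRun(Instant base, long intervalMillis, Clock clock) {
        Instant candidate = base.plusMillis(intervalMillis);
        Instant now = clock.instant();
        return candidate.isBefore(now) ? now : candidate;
    }
}
```

With this shape, a watch whose `lastCheckedTime` was 30 seconds ago on a 60-second interval fires roughly 30 seconds after a restart, rather than a full 60 seconds later.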