#111433 Watch Next Run Interval Resets On Shard Move or Node Restart #115102
Conversation
Switch Watcher scheduler to use last exec time when restarting, moving shards or resuming from stopped.
Hi @lukewhiting, I've created a changelog YAML for you.
Pinging @elastic/es-data-management (Team:Data Management)
```java
        fail("waiting too long for all watches to be triggered");
    }

    advanceClockIfNeeded(clock.instant().plusMillis(1100).atZone(ZoneOffset.UTC));
```
It would be good to have short comments on each of these 4 tests describing what they're for. It took me a few minutes to figure out how they all differed. And in the case of this test, are you trying to show that it does not execute too many times if you advance the clock a good bit? Is it worth adding another latch or two to prove that the watch hasn't run when you expect it not to have run yet?
So there are essentially 2 conditions tested across the 4 tests here: watches with a `lastCheckedTime`, and those that don't have a `lastCheckedTime` but do have a `lastActivationTime`. For each of those, we test both the startup of the watcher service and adding those watches to an already running service.
For the startup tests, we use the latches to ensure each watch runs once before the interval elapses (to verify it picked up the last run time). For each add-to-running-service test, we start up Watcher, tick the clock forward to show some time passing, then add the watches and check they execute once before the interval time and once again after the interval time.
I have added some comments to better explain each test. Good idea on adding a check to make sure they don't run more times than expected; I have added that as well.
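The latch-plus-counter pattern discussed above can be sketched as follows. This is a hypothetical illustration, not the actual test code: `WatchRunRecorder`, `onExecution`, `ranOnce`, and `executionCount` are invented names. The latch proves the watch fired at least once before a deadline, while the counter proves it did not fire more times than expected.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class WatchRunRecorder {
    // Released the first time the watch fires.
    private final CountDownLatch firstRun = new CountDownLatch(1);
    // Counts every firing, so the test can assert there were no extras.
    private final AtomicInteger executions = new AtomicInteger();

    // Called by the (simulated) scheduler each time the watch executes.
    void onExecution() {
        executions.incrementAndGet();
        firstRun.countDown();
    }

    // True if the watch fired at least once within the timeout.
    boolean ranOnce(long timeout, TimeUnit unit) {
        try {
            return firstRun.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    int executionCount() {
        return executions.get();
    }
}
```

A test would block on `ranOnce` with a generous timeout (failing with a message like "waiting too long for all watches to be triggered"), then, after advancing the clock, assert that `executionCount()` still matches the expected number of runs.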
…ns happen during test
…atcher-interval-on-shard-move
LGTM
💚 Backport successful
…estart (elastic#115102)
* Switch Watcher scheduler to use last exec time when restarting, moving shards or resuming from stopped.
* Add tests for last runtime calculation
* Update docs/changelog/115102.yaml
* Add counter to watcher job executions to check no additional executions happen during test
This PR fixes #111433 by making the initial scheduling of `IntervalSchedule`s within the watcher `TickerScheduleTriggerEngine` aware of the correct last run time of a watch, rather than using the current time. This allows it to correctly calculate the next run time instead of waiting the full duration of the watch's interval before starting the first run.

This code uses the last time the watch started execution (`lastCheckedTime`) by default, falling back first to the time the watch was last activated (which is relevant for newly created or recently unpaused tasks) before finally falling back to `now()` as a last resort.

Using the last start time rather than the last completion time (which is not persistently stored at the moment) means we don't need to store a new item for every watch in the cluster state, reducing bloat, but it does mean that a watch migrated to a new node mid-run may execute early on its next run. I think this tradeoff is acceptable given the rarity and low impact of such a scenario vs. the impact of increasing the cluster state size for every watch added.
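The fallback chain described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `TickerScheduleTriggerEngine` code: `NextRunSketch`, `baseTime`, and `nextRun` are hypothetical names, and the clamp-to-now behavior for overdue watches is an assumption for the sketch.

```java
import java.time.Clock;
import java.time.Instant;

final class NextRunSketch {
    // Prefer lastCheckedTime, then lastActivationTime, then the current clock
    // time as a last resort, mirroring the fallback chain in the description.
    static Instant baseTime(Instant lastCheckedTime, Instant lastActivationTime, Clock clock) {
        if (lastCheckedTime != null) {
            return lastCheckedTime;
        }
        if (lastActivationTime != null) {
            return lastActivationTime;
        }
        return clock.instant();
    }

    // Next run = base time + interval. If that moment has already passed
    // (e.g. the node was down longer than the interval), run as soon as
    // possible rather than waiting a full interval again.
    static Instant nextRun(Instant base, long intervalMillis, Clock clock) {
        Instant candidate = base.plusMillis(intervalMillis);
        Instant now = clock.instant();
        return candidate.isBefore(now) ? now : candidate;
    }
}
```

With this shape, a watch whose `lastCheckedTime` was 30 seconds ago on a 60-second interval fires roughly 30 seconds after a restart, rather than a full 60 seconds later.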