Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#111433 Watch Next Run Interval Resets On Shard Move or Node Restart #115102

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

lukewhiting
Copy link
Contributor

This PR fixes #111433 by making initial scheduling of IntervalSchedules within the watcher TickerScheduleTriggerEngine aware of the correct last run time of a watch, rather than using the current time. This allows it to correctly calculate the next runtime rather than waiting the full duration of the watch's interval before starting the first run.

This code uses the last time the watch started execution (lastCheckedTime) by default, falling back first to the last time the watch was last activated (which is relevant for newly created or recently unpaused tasks) before finally falling back to now() as a last resort.

The use of the last start time vs last completion time (which is not persistently stored at the moment) does mean we don't need to store a new item for every watch in the cluster state, reducing bloat, but it does mean that a watch that is migrated to a new node mid run may execute early on it's next run. I think this tradeoff is acceptable given the rarity and low impact of such a scenario VS the impact of increasing the cluster state size for every watch added.

@lukewhiting lukewhiting added >bug :Data Management/Watcher auto-backport Automatically create backport pull requests when merged v9.0.0 v8.17.0 labels Oct 18, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @lukewhiting, I've created a changelog YAML for you.

@lukewhiting lukewhiting requested review from a team and removed request for nielsbauman October 21, 2024 10:08
@lukewhiting lukewhiting marked this pull request as ready for review October 21, 2024 10:08
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Oct 21, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

fail("waiting too long for all watches to be triggered");
}

advanceClockIfNeeded(clock.instant().plusMillis(1100).atZone(ZoneOffset.UTC));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to have short comments on each of these 4 tests describing what they're for. It took me a few minutes to figure out how they all differed. And in the case of this test, are you trying to show that it does not execute too many times if you advance the clock a good bit? Is it worth adding another latch or two to prove that the watch hasn't run when you expect it not to have run yet?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >bug :Data Management/Watcher Team:Data Management Meta label for data/management team v8.17.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reallocation of .watches shards resets interval schedule watches
3 participants