increase watcher stop test max to 60s #45679

Merged
merged 1 commit into elastic:master from jakelandis:watcher_test_timeout_increase on Aug 19, 2019

Conversation

@jakelandis (Contributor) commented Aug 16, 2019

As of #43939, Watcher tests now correctly block until all watch executions
kicked off by that test have finished. Previously, we allowed tests to finish with
outstanding watch executions. It was known that this would increase the
time needed to finish a test; however, the tests can run slowly on CI,
and on at least one occasion it took 60s to actually finish.

This PR simply increases the max allowable timeout for Watcher tests
to clean up after themselves.
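For context, here is a minimal sketch of the kind of change involved, assuming (as the failure message suggests) that the test cleanup polls the Watcher stats API with an assertBusy-style wait until every node reports the "stopped" state. The class name, method name, and the exact shape of the stats response below are illustrative assumptions, not the exact code in the repository:

```java
// Sketch only: illustrative names, not the exact diff in this PR.
// Assumes the REST test cleanup polls the Watcher stats API until every
// node reports "stopped", and that the change simply raises the upper bound.
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.test.rest.ESRestTestCase;

import static org.hamcrest.Matchers.equalTo;

public class WatcherStopTimeoutSketch extends ESRestTestCase {

    @SuppressWarnings("unchecked")
    void ensureWatcherStopped() throws Exception {
        assertBusy(() -> {
            Response response = client().performRequest(new Request("GET", "/_watcher/stats"));
            Map<String, Object> body = entityAsMap(response);
            List<Map<String, Object>> nodes = (List<Map<String, Object>>) body.get("stats");
            for (Map<String, Object> node : nodes) {
                // Each node must have fully reached "stopped", not just "stopping".
                assertThat(node.get("watcher_state"), equalTo("stopped"));
            }
        }, 60, TimeUnit.SECONDS); // previously a 30s maximum; slow CI workers can need close to 60s
    }
}
```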


Example of logs that illustrate why the timeout needs to be increased:
Tests (with 30s max):

13:25:12   1> [2019-08-17T03:23:53,392][INFO ][o.e.s.WatcherRestIT      ] [test] [yaml=painless/30_inline_watch/Test execute watch api with an inline watch] before test
13:25:12   1> [2019-08-17T03:24:26,457][INFO ][o.e.s.WatcherRestIT      ] [test] There are still tasks running after this test that might break subsequent tests [indices:data/write/bulk, indices:data/write/bulk[s]].
13:25:12   1> [2019-08-17T03:24:26,457][INFO ][o.e.s.WatcherRestIT      ] [test] [yaml=painless/30_inline_watch/Test execute watch api with an inline watch] after test
13:25:12   2> REPRODUCE WITH: ./gradlew :x-pack:qa:smoke-test-watcher:integTestRunner --tests "org.elasticsearch.smoketest.WatcherRestIT.test {yaml=painless/30_inline_watch/Test execute watch api with an inline watch}" -Dtests.seed=CDFB722B1E040D96 -Dtests.security.manager=true -Dtests.locale=so-DJ -Dtests.timezone=Japan -Dcompiler.java=12 -Druntime.java=11
13:25:12   2> java.lang.AssertionError: waiting until stopping state reached stopped state
13:25:12         at __randomizedtesting.SeedInfo.seed([CDFB722B1E040D96:45AF4DF1B0F8606E]:0)

Server

[2019-08-16T18:23:53,489][DEBUG][o.e.x.w.e.ExecutionService] [integTest-0] executing watch [_inlined_] 
[2019-08-16T18:23:53,493][DEBUG][o.e.x.w.e.ExecutionService] [integTest-0] finished [_inlined_]/[_inlined__a3a9c868-cac7-45e1-8916-f212e792c754-2019-08-16T18:23:53.488691Z]
[2019-08-16T18:23:53,524][INFO ][o.e.x.w.WatcherService   ] [integTest-0] stopping watch service, reason [watcher manually marked to shutdown by cluster state update].
[2019-08-16T18:24:26,395][INFO ][o.e.c.m.MetaDataIndexTemplateService] [integTest-0] removing template [.logstash-management]
[2019-08-16T18:24:26,414][INFO ][o.e.c.m.TemplateUpgradeService] [integTest-0] Starting template upgrade to version 8.0.0, 1 templates will be updated and 0 will be removed
[2019-08-16T18:24:26,424][INFO ][o.e.c.m.MetaDataIndexTemplateService] [integTest-0] adding template [.logstash-management] for index patterns [.logstash]
[2019-08-16T18:24:26,447][INFO ][o.e.c.m.TemplateUpgradeService] [integTest-0] Templates were upgraded successfully to version 8.0.0
[2019-08-16T18:24:53,016][ERROR][o.e.x.w.Watcher          ] [integTest-0] watch history could not be written [test_watch_eeb836eb-0cd1-427c-9c98-0d7398cfae08-2019-08-16T18:23:52.926447Z], failure [[.watcher-history-10-2019.08.16] IndexNotFoundException[no such index [.watcher-history-10-2019.08.16]]]   
[2019-08-16T18:24:53,017][ERROR][o.e.x.w.Watcher          ] [integTest-0] triggered watches could not be deleted [test_watch_eeb836eb-0cd1-427c-9c98-0d7398cfae08-2019-08-16T18:23:52.926447Z], failure [[.triggered_watches] IndexNotFoundException[no such index [.triggered_watches]]]
[2019-08-16T18:24:53,018][DEBUG][o.e.x.w.e.ExecutionService] [integTest-0] finished [test_watch]/[test_watch_eeb836eb-0cd1-427c-9c98-0d7398cfae08-2019-08-16T18:23:52.926447Z] 
[2019-08-16T18:24:53,018][INFO ][o.e.x.w.WatcherLifeCycleService] [integTest-0] watcher has stopped  

Since these are run on the same machine, it is OK to compare timestamps with precision.

The test logs above say:
The test started at 23:53,392 and finished at 24:26,457.
Since the timeout in this example is 30s and the test took just over 30s (about 33s), we can reasonably assume a timeout.
"waiting until stopping state reached stopped state" says that we told Watcher to shut down, but it never fully reached the stopped state within the allowed time (30s in this case).
"There are still tasks running after this test" is another indication that the test timed out with outstanding work still to be done.

Looking at the server logs:
We start looking just after 23:53,392 and see that the watch itself executed very quickly: 23:53,489 -> 23:53,493.
So the test framework executed the watch and started the shutdown process at 23:53,524 (which, as of #43939, ensures all outstanding work is done before the shutdown finishes).
We can also see that Watcher did not fully stop until 24:53,018, so this test, on this machine, at this moment in time, took just shy of 60s from the start of shutdown until it actually shut down (on my local machine it takes milliseconds).
"watch history could not be written" and "triggered watches could not be deleted" are artifacts of the test framework cleaning up while the lingering task continues to run in the background. With #43939 this should only happen if the stop timeout is exceeded.
"finished [test_watch]" is an indication that the next test started running before the prior test had actually finished. Again, with #43939 this should only happen if the stop timeout is exceeded.

@jakelandis added the >test (Issues or PRs that are addressing/adding tests) and :Data Management/Watcher labels on Aug 16, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-core-features

@dakrone (Member) left a comment

LGTM

@jakelandis jakelandis merged commit 0e3e8c6 into elastic:master Aug 19, 2019
@jakelandis jakelandis deleted the watcher_test_timeout_increase branch August 19, 2019 19:24
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Aug 23, 2019
jakelandis added a commit that referenced this pull request Aug 26, 2019
Labels
:Data Management/Watcher, >test (Issues or PRs that are addressing/adding tests), v7.4.0, v8.0.0-alpha1