org.elasticsearch.xpack.slm.SnapshotLifecycleIT Fails due to Rate Limiting #46205

Closed
@original-brownbear

There are numerous failures of org.elasticsearch.xpack.slm.SnapshotLifecycleIT at the moment. The reason is that these tests set very low rate limits on the snapshot repository to simulate snapshot aborts and other concurrent scenarios.

Example Failure -> https://gradle-enterprise.elastic.co/s/sscyvnvkf23gy/console-log


Suite: Test class org.elasticsearch.xpack.slm.SnapshotLifecycleIT
--
1> [2019-08-31T12:22:27,369][INFO ][o.e.x.s.SnapshotLifecycleIT] [testPolicyFailure] before test
1> [2019-08-31T12:22:27,385][INFO ][o.e.x.s.SnapshotLifecycleIT] [testPolicyFailure] initializing REST clients against [http://[::1]:32827, http://127.0.0.1:33495, http://[::1]:43357, http://127.0.0.1:37929, http://[::1]:34097, http://127.0.0.1:43543, http://[::1]:36011, http://127.0.0.1:36555]
1> [2019-08-31T12:22:31,480][INFO ][o.e.x.s.SnapshotLifecycleIT] [testPolicyFailure] after test
1> [2019-08-31T12:22:31,539][INFO ][o.e.x.s.SnapshotLifecycleIT] [testPolicyManualExecution] before test
1> [2019-08-31T12:22:35,781][INFO ][o.e.x.s.SnapshotLifecycleIT] [testPolicyManualExecution] after test
1> [2019-08-31T12:22:35,824][INFO ][o.e.x.s.SnapshotLifecycleIT] [testSnapshotInProgress] before test
1> [2019-08-31T12:22:38,715][INFO ][o.e.x.s.SnapshotLifecycleIT] [testSnapshotInProgress] after test
1> [2019-08-31T12:22:38,769][INFO ][o.e.x.s.SnapshotLifecycleIT] [testFullPolicySnapshot] before test
1> [2019-08-31T12:24:40,221][INFO ][o.e.x.s.SnapshotLifecycleIT] [testFullPolicySnapshot] There are still tasks running after this test that might break subsequent tests [cluster:admin/repository/put].
1> [2019-08-31T12:24:40,222][INFO ][o.e.x.s.SnapshotLifecycleIT] [testFullPolicySnapshot] after test
2> REPRODUCE WITH: ./gradlew :x-pack:plugin:ilm:qa:multi-node:integTestRunner --tests "org.elasticsearch.xpack.slm.SnapshotLifecycleIT.testFullPolicySnapshot" -Dtests.seed=4A8659A9FC9A2C94 -Dtests.security.manager=true -Dtests.locale=fy -Dtests.timezone=Atlantic/Azores -Dcompiler.java=12 -Druntime.java=11
2> java.net.SocketTimeoutException: 30.000 milliseconds timeout on connection http-outgoing-40 [ACTIVE]
at __randomizedtesting.SeedInfo.seed([4A8659A9FC9A2C94:435FF4A6DC4174C1]:0)
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:778)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:218)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:221)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:221)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:221)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:205)
at org.elasticsearch.xpack.slm.SnapshotLifecycleIT.inializeRepo(SnapshotLifecycleIT.java:379)
at org.elasticsearch.xpack.slm.SnapshotLifecycleIT.inializeRepo(SnapshotLifecycleIT.java:364)
at org.elasticsearch.xpack.slm.SnapshotLifecycleIT.testFullPolicySnapshot(SnapshotLifecycleIT.java:85)
 
Caused by:
java.net.SocketTimeoutException: 30.000 milliseconds timeout on connection http-outgoing-40 [ACTIVE]

The problem with this approach is that low rate limits can lead to extremely long sleep times in the rate limiter. In one spot we limit to 1b/s but read 8k in one go, so we get minutes of sleeping. These tests passed more often before #42791 and #45689, but those changes altered timings in a way that made this trigger more often. I think this is because we now write data in the first step of snapshotting and thus build up long waits before the concurrent action is tested, whereas before we had quite a bit of delay from first writing the snapshot metadata.
I tried improving this situation by using only a single snapshot thread in #46195, but evidently it wasn't enough.
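
To make the arithmetic concrete, here is a minimal sketch of how long a naive rate limiter would sleep for a single read. It assumes "1b/s" means 1 byte per second, "8k" means 8192 bytes, and that the pause is simply bytes divided by rate. The class and method names (`RateLimitPause`, `pauseFor`) are hypothetical and are not the limiter Elasticsearch actually uses, whose accounting may differ.

```java
import java.time.Duration;

// Hypothetical illustration only: a naive model of rate-limiter sleep time.
public class RateLimitPause {

    // Time a naive limiter would pause after `bytes` were read at `bytesPerSecond`.
    static Duration pauseFor(long bytes, long bytesPerSecond) {
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("bytesPerSecond must be positive");
        }
        long nanos = Math.multiplyExact(bytes, 1_000_000_000L) / bytesPerSecond;
        return Duration.ofNanos(nanos);
    }

    public static void main(String[] args) {
        Duration pause = pauseFor(8 * 1024, 1);          // one 8k read at 1 byte/s (assumed units)
        Duration clientTimeout = Duration.ofSeconds(30); // socket timeout seen in the failure log
        System.out.println("pause = " + pause.toMinutes() + " min, exceeds timeout: "
            + (pause.compareTo(clientTimeout) > 0));
    }
}
```

Under this naive model a single 8k read already pauses far longer than the 30 second socket timeout in the trace above, so any request that has to wait behind such a pause would be expected to time out.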

I'll see what better solution to these tests I can find here.
