-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry remote state download while bootstrap #15950
base: main
Are you sure you want to change the base?
Conversation
❌ Gradle check result for 3e02411: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Sooraj Sinha <soosinha@amazon.com>
3e02411
to
8f79550
Compare
❕ Gradle check result for 8f79550: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15950 +/- ##
============================================
- Coverage 71.90% 71.68% -0.23%
+ Complexity 64216 64112 -104
============================================
Files 5272 5277 +5
Lines 300597 300708 +111
Branches 43440 43451 +11
============================================
- Hits 216151 215567 -584
- Misses 66680 67385 +705
+ Partials 17766 17756 -10 ☔ View full report in Codecov by Sentry. |
server/src/main/java/org/opensearch/gateway/GatewayMetaState.java
Outdated
Show resolved
Hide resolved
...rc/internalClusterTest/java/org/opensearch/remotestore/RemoteStoreClusterStateRestoreIT.java
Outdated
Show resolved
Hide resolved
int delayInMills = 100; | ||
for (int attempt = 1; attempt <= maxAttempts; attempt++) { | ||
try { | ||
return restoreClusterState(remoteStoreRestoreService, clusterState, lastKnownClusterUUID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a info log for every attempt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe reuse some retryable entity already present in cluster manager
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Bukhtawar I do not see a suitable retryable entity already present. Let me know if you have any reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about extending this Action
OpenSearch/server/src/main/java/org/opensearch/action/support/RetryableAction.java
Lines 59 to 81 in 09bacee
public abstract class RetryableAction<Response> { | |
private final Logger logger; | |
private final AtomicBoolean isDone = new AtomicBoolean(false); | |
private final ThreadPool threadPool; | |
private final long initialDelayMillis; | |
private final long timeoutMillis; | |
private final long startMillis; | |
private final ActionListener<Response> finalListener; | |
private final String executor; | |
private final BackoffPolicy backoffPolicy; | |
private volatile Scheduler.ScheduledCancellable retryTask; | |
public RetryableAction( | |
Logger logger, | |
ThreadPool threadPool, | |
TimeValue initialDelay, | |
TimeValue timeoutValue, | |
ActionListener<Response> listener | |
) { | |
this( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Bukhtawar This RetryableAction is mostly used for transport action to provide support for retries. It needs the threadpool as well to to execute the action asynchronously. Also we need to provide a total timeout during which the retries to happen.
We will need provide total timeout and add plumbing for providing threadpool to use this. But the total timeout is difficult to predict as it depends on the total number of files to download due to which the download time will vary.
Also, we need to way to wait for the action to complete as the download of remote state is pre-requisite to move forward.
I think we should keep the logic simple as it is now.
Signed-off-by: Sooraj Sinha <soosinha@amazon.com>
Description
When a remote state enabled cluster manager node boots-up, it tries to download the remote state. But if there some issue while downloading like file not present then the download fails with an exception. The OpenSearch process stays active but unresponsive. This PR addresses this by:
Error
so that process exists and a new process is spawned.Related Issues
NA
Check List
API changes companion pull request created, if applicable.Public documentation issue/PR created, if applicable.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.