retry when killing s3 based segments #14776
Conversation
(I am not 100% sure, but) this may not work because we are using deleteObjects, which throws MultiObjectDeleteException. MultiObjectDeleteException's error code is null (https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/MultiObjectDeleteException.html#getErrorCode--), which may not cause us to retry. Basically, the "retry-able" error is actually inside the MultiObjectDeleteException's list of Errors, but the MultiObjectDeleteException itself may not be retry-able. I took a look at AWSClientUtil.isClientExceptionRecoverable and it seems like it would depend on ...
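For reference, here is a minimal sketch (using the AWS SDK v1 types; the helper name is hypothetical) of where the per-key error codes actually live, and why a check against the top-level exception alone sees nothing useful:

```java
import com.amazonaws.services.s3.model.MultiObjectDeleteException;
import com.amazonaws.services.s3.model.MultiObjectDeleteException.DeleteError;

// Hypothetical helper: MultiObjectDeleteException.getErrorCode() is typically null,
// so a generic "is this exception recoverable?" check finds nothing; the per-key
// S3 error codes live in getErrors().
static boolean anyFailureHasCode(MultiObjectDeleteException e, String errorCode)
{
  for (DeleteError error : e.getErrors()) {
    if (errorCode.equals(error.getCode())) {
      return true;
    }
  }
  return false;
}
```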
See comment
LGTM, but one question: maybe the code can be simplified.
s3Client.deleteObjects(deleteObjectsRequest);
return null;
}
catch (Exception e) { |
I don't understand the point of the try/catch around it; it doesn't seem to do anything?
Thanks, fixed
Darn, I think you're right. Hmm. In this case, should we just always retry up to 3 times regardless of the error?
Maybe you can iterate the errors in the MultiObjectDeleteException object and either ...
@maytasm, there doesn't seem to be any public method that the AWS SDK exposes to check whether a particular errorCode is recoverable, only whether an exception is recoverable. This exception type, as you stated, exposes a list of Errors, which do not include the particular exception type, only the code, message, and a few other things. There are several other ways for this retry mechanism to view an exception as retryable besides the errorCode, as you stated, such as the statusCode, etc. Let me know what you think.
This retry looks like it will turn the MultiObjectDeleteException into a generic exception, so we won't ever catch the MultiObjectDeleteException and won't have a list of the files it wasn't able to delete. I was wrong.
@@ -150,7 +151,14 @@ private boolean deleteKeysForBucket(
s3Bucket,
keysToDeleteStrings
);
s3Client.deleteObjects(deleteObjectsRequest);
RetryUtils.retry(
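For readers following the diff, a rough sketch of the wrapping being discussed, assuming the RetryUtils.retry(task, shouldRetry, maxTries) overload; the helper name, retry predicate, and try count here are illustrative, not copied from the PR:

```java
import com.amazonaws.AmazonClientException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import org.apache.druid.java.util.common.RetryUtils;

// Sketch only: retry the whole batch delete. Re-sending keys that were already
// deleted is harmless, since S3 treats a delete of a missing key as a success.
static void deleteBatchWithRetry(AmazonS3 s3Client, DeleteObjectsRequest deleteObjectsRequest) throws Exception
{
  RetryUtils.retry(
      () -> {
        s3Client.deleteObjects(deleteObjectsRequest);
        return null;
      },
      // Illustrative predicate; the PR's actual check is S3-specific.
      e -> e instanceof AmazonClientException && ((AmazonClientException) e).isRetryable(),
      3
  );
}
```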
retryS3Operation is cleaner
thanks updated
You are right. It seems the Errors in MultiObjectDeleteException are not that useful in terms of determining whether we can retry. I have another idea: if a batch delete (deleteObjects) fails, we get the individual failed keys and use the single delete (deleteObject) inside a RetryUtils.retry block for each failed key. The single delete can work with RetryUtils.retry, since we can determine whether it is retry-able from the exception thrown. The other option is to retry on MultiObjectDeleteException regardless of the individual Errors contained in the error list. @jasonk000 @TSFenwick Any other suggestions?
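A rough sketch of that per-key fallback idea, under the assumption that a failed single-object delete surfaces an exception a retry predicate can inspect (the method name and retry details here are hypothetical):

```java
import com.amazonaws.AmazonClientException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import com.amazonaws.services.s3.model.MultiObjectDeleteException;
import org.apache.druid.java.util.common.RetryUtils;

// Hypothetical fallback: issue the batch delete once, then retry only the keys
// the batch could not delete, one deleteObject call at a time.
static void deleteWithPerKeyFallback(AmazonS3 s3Client, String s3Bucket, DeleteObjectsRequest deleteObjectsRequest)
    throws Exception
{
  try {
    s3Client.deleteObjects(deleteObjectsRequest);
  }
  catch (MultiObjectDeleteException e) {
    for (MultiObjectDeleteException.DeleteError error : e.getErrors()) {
      RetryUtils.retry(
          () -> {
            // A failed single delete throws AmazonServiceException, whose error
            // code and status code a retry predicate can actually look at.
            s3Client.deleteObject(s3Bucket, error.getKey());
            return null;
          },
          // Illustrative predicate only.
          e2 -> e2 instanceof AmazonClientException && ((AmazonClientException) e2).isRetryable(),
          3
      );
    }
  }
}
```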
@maytasm @zachjsh I have two thoughts ...
Do you know when and how we would get an AmazonServiceException (and not a MultiObjectDeleteException)? I have only seen MultiObjectDeleteException when I was running kill tasks. For #2, we do know the list of files that we failed to delete. However, knowing whether each of those failures is retry-able is not so easy.
@maytasm For 2), AWS returns a string version of the HTTP code in the list of failed deletes, so we could make a list of reasonable error codes to retry on.
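If one did want to filter that way, a sketch of the idea might look like the following (the error-code set here is illustrative, not the list that was merged):

```java
import com.amazonaws.services.s3.model.MultiObjectDeleteException;
import com.amazonaws.services.s3.model.MultiObjectDeleteException.DeleteError;
import com.google.common.collect.ImmutableSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch: keep only the keys whose per-key error code looks transient, so that a
// follow-up delete request targets just those keys.
static List<String> keysWorthRetrying(MultiObjectDeleteException e)
{
  Set<String> retryableCodes = ImmutableSet.of("InternalError", "ServiceUnavailable", "SlowDown");
  return e.getErrors()
          .stream()
          .filter(error -> retryableCodes.contains(error.getCode()))
          .map(DeleteError::getKey)
          .collect(Collectors.toList());
}
```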
I'm not sure there is any value in filtering the list to delete; if a file has already been deleted and a delete is issued again, S3 will just ignore the request, i.e., there's no harm in asking for a file to be deleted twice. Unless we are sure that all errors are not retryable, it should suffice to send a retry.
I agree with this. It seems the AWS SDK library has some complex logic for determining whether an error is retryable based on the error code, which I'm not sure is worth replicating (the AWS library doesn't expose the methods that determine whether an errorCode is retryable, only whether an exception is). My thought is to just retry up to 3 times regardless of the failure. What do you think?
@jasonk000 The value is little; I was just over-engineering it. I was thinking that if you are deleting 100,000 segments, there isn't a point in retrying each whole batch delete if only 1 file in each batch fails for a retryable reason; you would get a more performant call by only calling delete on the files that you know failed for a retryable reason. The odds of this happening are low and the code complexity might not be worth it. I think your solution is a simple and efficient solution to this.
@zachjsh I think this is a bad idea. We shouldn't retry in all cases, since we can and should easily know what's retry-able and what's not. Doing unnecessary network calls is something that should be avoided when it's easy to avoid; here it would just mean iterating over a list of at most 1000 deleteErrors.
...nsions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/S3DataSegmentKiller.java
cloud/aws-common/src/main/java/org/apache/druid/common/aws/AWSClientUtil.java
codes to retryable set in future
* request is used in org.apache.druid.storage.s3.S3DataSegmentKiller to delete a batch of segments from deep
* storage.
*/
private static final Set<String> RECOVERABLE_ERROR_CODES = ImmutableSet.of(
Can you add "InternalError" to this list?
"ServiceUnavailable" too
"503 SlowDown" too
Basically all the 500, 502, 503, 504 from https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html
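An illustrative version of the set with just the codes requested in this thread (the exact contents merged may differ, and the comments map each code to its HTTP status per the AWS error-responses documentation):

```java
import com.google.common.collect.ImmutableSet;
import java.util.Set;

// Sketch limited to the codes named in this review thread; the merged list may
// contain additional entries.
private static final Set<String> RECOVERABLE_ERROR_CODES = ImmutableSet.of(
    "InternalError",       // HTTP 500
    "ServiceUnavailable",  // HTTP 503
    "SlowDown",            // HTTP 503
    "503 SlowDown"
);
```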
thanks added
Please see comment
Description
The S3 deleteObjects request sent when killing S3-based segments is now retried, if the failure is retry-able.
This PR has: