
Retry when killing S3-based segments #14776

Merged: 9 commits into apache:master from s3-segment-killer-retry on Aug 10, 2023

Conversation

@zachjsh (Contributor) commented Aug 8, 2023

Description

The S3 deleteObjects request sent when killing S3-based segments is now retried if the failure is retryable.
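In rough terms, the change wraps the existing batch delete in the S3 extension's retry helper. A minimal sketch, assuming the retryS3Operation helper discussed in the review below and the deleteObjectsRequest built inside deleteKeysForBucket (this is not the exact diff):

```java
// Sketch only: retry the batch delete when the thrown exception is one the
// S3 retry predicate considers recoverable. retryS3Operation wraps Druid's
// RetryUtils with an S3-specific shouldRetry check.
S3Utils.retryS3Operation(
    () -> {
      s3Client.deleteObjects(deleteObjectsRequest);
      return null;
    }
);
```

The open question in the review below is whether the exception thrown by deleteObjects (MultiObjectDeleteException) is ever classified as retryable by that helper.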

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@maytasm (Contributor) commented Aug 8, 2023

(I am not 100% sure, but) this may not work because we are using deleteObjects, which returns MultiObjectDeleteException. The MultiObjectDeleteException error code is null (https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/MultiObjectDeleteException.html#getErrorCode--), which may not cause us to retry.

Basically, the "retry-able" error is actually inside the MultiObjectDeleteException list of Errors, but MultiObjectDeleteException itself may not be retryable. I took a look at AWSClientUtil.isClientExceptionRecoverable and it seems like it would depend on

    public static boolean isRetryableServiceException(AmazonServiceException exception)
    {
      return RETRYABLE_STATUS_CODES.contains(exception.getStatusCode())
          || RETRYABLE_ERROR_CODES.contains(exception.getErrorCode())
          || reasonPhraseMatchesErrorCode(exception, RETRYABLE_ERROR_CODES);
    }

Since MultiObjectDeleteException getErrorCode always returns null (as indicated by the doc), the second and third checks would always be false. I am not sure about the status code, but it might be 200.

@maytasm (Contributor) left a comment

See comment

@jasonk000 (Contributor) left a comment

LGTM, but one question: maybe the code can be simplified.

s3Client.deleteObjects(deleteObjectsRequest);
return null;
}
catch (Exception e) {
Contributor:

I don't understand the point of try/catch around it, it doesn't seem to do anything?

Contributor (Author):

Thanks, fixed

@zachjsh (Contributor, Author) commented Aug 8, 2023

(quoting @maytasm's comment above)

Darn, I think you're right. Hmm. In this case, should we just always retry up to 3 times regardless of the error?
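A sketch of that blunt approach, assuming Druid's RetryUtils.retry(task, shouldRetry, maxTries) overload; this is illustrative only, and not what the PR ends up doing:

```java
// Illustrative: retry the batch delete up to 3 times no matter what failed.
RetryUtils.retry(
    () -> {
      s3Client.deleteObjects(deleteObjectsRequest);
      return null;
    },
    (Throwable e) -> true, // retry on any exception
    3                      // assumed maximum number of attempts
);
```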

@maytasm (Contributor) commented Aug 8, 2023

(quoting @zachjsh's question above)

Maybe you can iterate the errors in the MultiObjectDeleteException object and either

  • (smarter way?) filter out all non-retryable errors before retrying (only retry the paths that have a retryable error)
  • (super simple way) just retry if any of the errors is retryable (so you don't have to modify the request); this is still a little better than always retrying blindly (a rough sketch of this option follows below)
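A rough sketch of the second option, assuming the AWS SDK v1 MultiObjectDeleteException#getErrors() accessor; the method name and the retryableCodes parameter are illustrative, not from this PR:

```java
import com.amazonaws.services.s3.model.MultiObjectDeleteException;
import java.util.Set;

// Illustrative: retry the whole batch if any per-key error looks retryable,
// leaving the original DeleteObjectsRequest unchanged. retryableCodes would be
// some set of S3 error-code strings worth retrying.
static boolean containsRetryableError(MultiObjectDeleteException e, Set<String> retryableCodes)
{
  for (MultiObjectDeleteException.DeleteError error : e.getErrors()) {
    if (retryableCodes.contains(error.getCode())) {
      return true;
    }
  }
  return false;
}
```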

@zachjsh (Contributor, Author) commented Aug 8, 2023

(quoting @maytasm's suggestions above)

@maytasm, there doesn't seem to be any public method that the AWS SDK exposes to check whether a particular errorCode is recoverable, only whether an exception is recoverable. This exception type, as you stated, exposes a list of Errors, which do not include the particular exception type, only the code, message, and a few other things. There are several other ways for this retry mechanism to view the exception as retryable besides the errorCode, as you stated, such as the statusCode, etc. Let me know what you think.

@zachjsh zachjsh requested a review from maytasm August 8, 2023 16:47
@TSFenwick (Contributor) left a comment

This retry looks like it will turn the MultiObjectDeleteException into a generic exception, so we won't ever catch the MultiObjectDeleteException and won't have a list of the files it wasn't able to delete. (I was wrong.)

@@ -150,7 +151,14 @@ private boolean deleteKeysForBucket(
s3Bucket,
keysToDeleteStrings
);
s3Client.deleteObjects(deleteObjectsRequest);
RetryUtils.retry(
Contributor:

retryS3Operation is cleaner

Contributor (Author):

Thanks, updated.

@maytasm (Contributor) commented Aug 9, 2023

(quoting the discussion above)

You are right. It seems like the Errors in the MultiObjectDeleteException are not that useful in terms of determining whether we can retry or not. I have another idea: if a batch delete (deleteObjects) fails, we get the individual failed keys and use the single delete (deleteObject) with the RetryUtils.retry block for each single delete call. The single delete can work with RetryUtils.retry, since we can determine whether it is retryable or not from the exception thrown.
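A rough sketch of that fallback, assuming a single-object deleteObject(bucket, key) call on the same client and the per-key errors exposed by MultiObjectDeleteException; this is illustrative, not code from this PR, and the enclosing method would have to propagate the checked Exception from the retry helper:

```java
try {
  s3Client.deleteObjects(deleteObjectsRequest);
}
catch (MultiObjectDeleteException e) {
  // Fall back to one delete per failed key; a single delete surfaces an
  // AmazonServiceException that the retry predicate can classify directly.
  for (MultiObjectDeleteException.DeleteError error : e.getErrors()) {
    final String failedKey = error.getKey();
    S3Utils.retryS3Operation(() -> {
      s3Client.deleteObject(s3Bucket, failedKey);
      return null;
    });
  }
}
```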

The other option is to retry on MultiObjectDeleteException regardless, without checking the individual Errors contained in the error list.

@jasonk000 @TSFenwick Any other suggestions?

@TSFenwick (Contributor) commented Aug 9, 2023

@maytasm @zachjsh I have two thoughts:

  1. The retry as it is in this PR, even if it only retries AmazonServiceException and doesn't retry MultiObjectDeleteException, is a benefit compared to the current code. We would just need to comment that it won't retry on MultiObjectDeleteException and figure that out later.
  2. We are essentially getting a list of all buckets and files that can't be deleted, since we are logging them. We could just try another delete call on that information without using the retry utils...

@maytasm (Contributor) commented Aug 9, 2023

(quoting @TSFenwick's two thoughts above)

Do you know when and how we would get an AmazonServiceException (and not a MultiObjectDeleteException)? I have only seen the MultiObjectDeleteException when I was running kill tasks.

For #2, we do know the list of files that we failed to delete. However, knowing whether each of those failures is retryable or not is not so easy.

@TSFenwick (Contributor)
@maytasm For AmazonServiceException, I believe those are more global things. I was able to trigger it in my testing by using an incorrect AWS API key. I see in https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html#ErrorCodeList that we could get a 500 error, which could be argued to be retryable, or a 503, where we would want to retry but make sure we back off correctly.

For 2), AWS returns a string version of the HTTP code in the list of failed deletes, so we could make a list of reasonable error codes to retry on.
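A minimal sketch of that idea, keeping only the codes reviewers ask for later in this thread ("InternalError", "ServiceUnavailable", and "503 SlowDown", here as the "SlowDown" code string); the PR's actual RECOVERABLE_ERROR_CODES set is only partially visible in the excerpt further down, so the contents and helper below are illustrative:

```java
import com.amazonaws.services.s3.model.MultiObjectDeleteException;
import com.google.common.collect.ImmutableSet;
import java.util.Set;

// Illustrative: per-key error codes considered worth retrying.
static final Set<String> RECOVERABLE_ERROR_CODES = ImmutableSet.of(
    "InternalError",
    "ServiceUnavailable",
    "SlowDown"
);

// Hypothetical helper: decide per failed key whether a retry is worthwhile.
static boolean isRecoverable(MultiObjectDeleteException.DeleteError error)
{
  return RECOVERABLE_ERROR_CODES.contains(error.getCode());
}
```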

@jasonk000 (Contributor)

I'm not sure there is any value in filtering the list to delete; if a file has already been deleted and a delete is issued again, S3 will just ignore the request, i.e., there's no harm in asking for a file to be deleted twice. Unless we are sure that all errors are not retryable, it should suffice to send a retry.

@zachjsh (Contributor, Author) commented Aug 9, 2023

(quoting @jasonk000's comment above)

I agree with this. It seems the AWS SDK has some complex logic for determining whether an error is retryable based on the error code, which I'm not sure is worth replicating (the AWS library doesn't expose the methods that determine whether an errorCode is retryable, only whether an exception is). My thought is to just retry up to 3 times regardless of the failure. What do you think?

@TSFenwick (Contributor) commented Aug 9, 2023

(quoting @jasonk000's comment above)

@jasonk000 The value is little; I was just over-engineering it. I was thinking that if you are deleting 100,000 segments, there isn't a point in retrying each batch delete if only 1 file in each batch fails for a retryable reason. You would get a more performant call by only calling delete on the files that you know weren't deleted for a retryable reason. The odds of this happening are low, and the code complexity might not be worth it.

I think your solution is a simple and efficient solution to this.

(replying to @zachjsh's suggestion above to just retry up to 3 times regardless of failure)

@zachjsh I think this is a bad idea. We shouldn't retry in all cases, since we can and should easily know what's retryable and what's not. Doing unnecessary network calls is something that should be avoided when it's easy to avoid, since it would just mean iterating over a list of at most 1000 deleteErrors.

* request is used in org.apache.druid.storage.s3.S3DataSegmentKiller to delete a batch of segments from deep
* storage.
*/
private static final Set<String> RECOVERABLE_ERROR_CODES = ImmutableSet.of(
Contributor:

Can you add "InternalError" to this list?

Contributor:

"ServiceUnavailable" too

@maytasm (Contributor), Aug 9, 2023:

"503 SlowDown" too

Contributor (Author):

Thanks, added.

@maytasm (Contributor) left a comment

Please see comment

@zachjsh zachjsh requested a review from maytasm August 9, 2023 21:34
@zachjsh zachjsh merged commit 23306c4 into apache:master Aug 10, 2023
74 checks passed
@zachjsh zachjsh deleted the s3-segment-killer-retry branch August 10, 2023 18:04
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023