
Retry when killing S3-based segments #14776

Merged: 9 commits into apache:master from s3-segment-killer-retry on Aug 10, 2023

Conversation

@zachjsh (Contributor) commented Aug 8, 2023

Description

The S3 deleteObjects request sent when killing S3-based segments is now retried if the failure is retryable.
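In rough terms, the change wraps the existing batch delete in the S3 extension's retry helper. A minimal sketch, assuming the retryS3Operation helper discussed in the review below and the deleteObjectsRequest built inside deleteKeysForBucket (this is not the exact diff):

```java
// Sketch only: retry the batch delete when the thrown exception is one the
// S3 retry predicate considers recoverable. retryS3Operation wraps Druid's
// RetryUtils with an S3-specific shouldRetry check.
S3Utils.retryS3Operation(
    () -> {
      s3Client.deleteObjects(deleteObjectsRequest);
      return null;
    }
);
```

The open question in the review below is whether the exception thrown by deleteObjects (MultiObjectDeleteException) is ever classified as retryable by that helper.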

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@maytasm (Contributor) commented Aug 8, 2023

(I am not 100% sure, but) this may not work because we are using deleteObjects, which returns MultiObjectDeleteException. The MultiObjectDeleteException error code is null (https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/MultiObjectDeleteException.html#getErrorCode--), which may not cause us to retry.

Basically, the "retry-able" error is actually inside the MultiObjectDeleteException list of Errors, but MultiObjectDeleteException itself may not be retryable. I took a look at AWSClientUtil.isClientExceptionRecoverable and it seems like it would depend on

    public static boolean isRetryableServiceException(AmazonServiceException exception)
    {
      return RETRYABLE_STATUS_CODES.contains(exception.getStatusCode())
          || RETRYABLE_ERROR_CODES.contains(exception.getErrorCode())
          || reasonPhraseMatchesErrorCode(exception, RETRYABLE_ERROR_CODES);
    }

Since MultiObjectDeleteException getErrorCode always returns null (as indicated by the doc), the second and third checks would always be false. I am not sure about the status code, but it might be 200.

@maytasm (Contributor) left a comment

See comment

@jasonk000 (Contributor) left a comment

LGTM, but one question: maybe the code can be simplified.

s3Client.deleteObjects(deleteObjectsRequest);
return null;
}
catch (Exception e) {
Contributor:

I don't understand the point of try/catch around it, it doesn't seem to do anything?

Contributor (Author):

Thanks, fixed

@zachjsh (Contributor, Author) commented Aug 8, 2023

(quoting @maytasm's comment above)

Darn, I think you're right. Hmm. In this case, should we just always retry up to 3 times regardless of the error?
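A sketch of that blunt approach, assuming Druid's RetryUtils.retry(task, shouldRetry, maxTries) overload; this is illustrative only, and not what the PR ends up doing:

```java
// Illustrative: retry the batch delete up to 3 times no matter what failed.
RetryUtils.retry(
    () -> {
      s3Client.deleteObjects(deleteObjectsRequest);
      return null;
    },
    (Throwable e) -> true, // retry on any exception
    3                      // assumed maximum number of attempts
);
```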

@maytasm (Contributor) commented Aug 8, 2023

(quoting @zachjsh's question above)

Maybe you can iterate the errors in the MultiObjectDeleteException object and either

  • (smarter way?) filter out all non-retryable errors before retrying (only retry the paths that have a retryable error)
  • (super simple way) just retry if any of the errors is retryable (so you don't have to modify the request); this is still a little better than always retrying blindly (a rough sketch of this option follows below)
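A rough sketch of the second option, assuming the AWS SDK v1 MultiObjectDeleteException#getErrors() accessor; the method name and the retryableCodes parameter are illustrative, not from this PR:

```java
import com.amazonaws.services.s3.model.MultiObjectDeleteException;
import java.util.Set;

// Illustrative: retry the whole batch if any per-key error looks retryable,
// leaving the original DeleteObjectsRequest unchanged. retryableCodes would be
// some set of S3 error-code strings worth retrying.
static boolean containsRetryableError(MultiObjectDeleteException e, Set<String> retryableCodes)
{
  for (MultiObjectDeleteException.DeleteError error : e.getErrors()) {
    if (retryableCodes.contains(error.getCode())) {
      return true;
    }
  }
  return false;
}
```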

@zachjsh (Contributor, Author) commented Aug 8, 2023

(quoting @maytasm's suggestions above)

@maytasm, there doesn't seem to be any public method that the AWS SDK exposes to check whether a particular errorCode is recoverable, only whether an exception is recoverable. This exception type, as you stated, exposes a list of Errors, which do not include the particular exception type, only the code, message, and a few other things. There are several other ways for this retry mechanism to view the exception as retryable besides the errorCode, as you stated, such as the statusCode, etc. Let me know what you think.

@zachjsh zachjsh requested a review from maytasm August 8, 2023 16:47
@TSFenwick (Contributor) left a comment

This retry looks like it will turn the MultiObjectDeleteException into a generic exception, so we won't ever catch the MultiObjectDeleteException and won't have a list of the files it wasn't able to delete. (I was wrong.)

@@ -150,7 +151,14 @@ private boolean deleteKeysForBucket(
s3Bucket,
keysToDeleteStrings
);
s3Client.deleteObjects(deleteObjectsRequest);
RetryUtils.retry(
Contributor:

retryS3Operation is cleaner

Contributor (Author):

Thanks, updated.

@maytasm (Contributor) commented Aug 9, 2023

(quoting the discussion above)

You are right. It seems like the Errors in the MultiObjectDeleteException are not that useful in terms of determining whether we can retry or not. I have another idea: if a batch delete (deleteObjects) fails, we get the individual failed keys and use the single delete (deleteObject) with the RetryUtils.retry block for each single delete call. The single delete can work with RetryUtils.retry, since we can determine whether it is retryable or not from the exception thrown.
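A rough sketch of that fallback, assuming a single-object deleteObject(bucket, key) call on the same client and the per-key errors exposed by MultiObjectDeleteException; this is illustrative, not code from this PR, and the enclosing method would have to propagate the checked Exception from the retry helper:

```java
try {
  s3Client.deleteObjects(deleteObjectsRequest);
}
catch (MultiObjectDeleteException e) {
  // Fall back to one delete per failed key; a single delete surfaces an
  // AmazonServiceException that the retry predicate can classify directly.
  for (MultiObjectDeleteException.DeleteError error : e.getErrors()) {
    final String failedKey = error.getKey();
    S3Utils.retryS3Operation(() -> {
      s3Client.deleteObject(s3Bucket, failedKey);
      return null;
    });
  }
}
```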

The other option is to retry on MultiObjectDeleteException regardless, without checking the individual Errors contained in the error list.

@jasonk000 @TSFenwick Any other suggestions?

@TSFenwick (Contributor) commented Aug 9, 2023

@maytasm @zachjsh I have two thoughts:

  1. The retry as it is in this PR, even if it only retries AmazonServiceException and doesn't retry MultiObjectDeleteException, is a benefit compared to the current code. We would just need to comment that it won't retry on MultiObjectDeleteException and figure that out later.
  2. We are essentially getting a list of all buckets and files that can't be deleted, since we are logging them. We could just try another delete call on that information without using the retry utils...

@maytasm (Contributor) commented Aug 9, 2023

(quoting @TSFenwick's two thoughts above)

Do you know when and how we would get an AmazonServiceException (and not a MultiObjectDeleteException)? I have only seen the MultiObjectDeleteException when I was running kill tasks.

For #2, we do know the list of files that we failed to delete. However, knowing whether each of those failures is retryable or not is not so easy.

@TSFenwick (Contributor)
@maytasm For AmazonServiceException, I believe those are more global things. I was able to trigger it in my testing by using an incorrect AWS API key. I see in https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html#ErrorCodeList that we could get a 500 error, which could be argued to be retryable, or a 503, where we would want to retry but make sure we back off correctly.

For 2), AWS returns a string version of the HTTP code in the list of failed deletes, so we could make a list of reasonable error codes to retry on.
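A minimal sketch of that idea, keeping only the codes reviewers ask for later in this thread ("InternalError", "ServiceUnavailable", and "503 SlowDown", here as the "SlowDown" code string); the PR's actual RECOVERABLE_ERROR_CODES set is only partially visible in the excerpt further down, so the contents and helper below are illustrative:

```java
import com.amazonaws.services.s3.model.MultiObjectDeleteException;
import com.google.common.collect.ImmutableSet;
import java.util.Set;

// Illustrative: per-key error codes considered worth retrying.
static final Set<String> RECOVERABLE_ERROR_CODES = ImmutableSet.of(
    "InternalError",
    "ServiceUnavailable",
    "SlowDown"
);

// Hypothetical helper: decide per failed key whether a retry is worthwhile.
static boolean isRecoverable(MultiObjectDeleteException.DeleteError error)
{
  return RECOVERABLE_ERROR_CODES.contains(error.getCode());
}
```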

@jasonk000 (Contributor)

I'm not sure there is any value in filtering the list to delete; if a file has already been deleted and a delete is issued again, S3 will just ignore the request, i.e., there's no harm in asking for a file to be deleted twice. Unless we are sure that all errors are not retryable, it should suffice to send a retry.

@zachjsh (Contributor, Author) commented Aug 9, 2023

(quoting @jasonk000's comment above)

I agree with this. It seems the AWS SDK has some complex logic for determining whether an error is retryable based on the error code, which I'm not sure is worth replicating (the AWS library doesn't expose the methods that determine whether an errorCode is retryable, only whether an exception is). My thought is to just retry up to 3 times regardless of the failure. What do you think?

@TSFenwick (Contributor) commented Aug 9, 2023

(quoting @jasonk000's comment above)

@jasonk000 The value is little; I was just over-engineering it. I was thinking that if you are deleting 100,000 segments, there isn't a point in retrying each batch delete if only 1 file in each batch fails for a retryable reason. You would get a more performant call by only calling delete on the files that you know weren't deleted for a retryable reason. The odds of this happening are low, and the code complexity might not be worth it.

I think your solution is a simple and efficient solution to this.

(replying to @zachjsh's suggestion above to just retry up to 3 times regardless of failure)

@zachjsh I think this is a bad idea. We shouldn't retry in all cases, since we can and should easily know what's retryable and what's not. Doing unnecessary network calls is something that should be avoided when it's easy to avoid, since it would just mean iterating over a list of at most 1000 deleteErrors.

* request is used in org.apache.druid.storage.s3.S3DataSegmentKiller to delete a batch of segments from deep
* storage.
*/
private static final Set<String> RECOVERABLE_ERROR_CODES = ImmutableSet.of(
Contributor:

Can you add "InternalError" to this list?

Contributor:

"ServiceUnavailable" too

@maytasm (Contributor), Aug 9, 2023:

"503 SlowDown" too

Contributor (Author):

Thanks, added.

@maytasm (Contributor) left a comment

Please see comment

@zachjsh zachjsh requested a review from maytasm August 9, 2023 21:34
@zachjsh zachjsh merged commit 23306c4 into apache:master Aug 10, 2023
74 checks passed
@zachjsh zachjsh deleted the s3-segment-killer-retry branch August 10, 2023 18:04
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023