-
Notifications
You must be signed in to change notification settings - Fork 9.1k
HADOOP-19096: [ABFS] [CST Optimization] Enhancing Client-Side Throttling Metrics Updating Logic #6276
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
...ols/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsRestOperation.java
Outdated
Show resolved
Hide resolved
...ols/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsRestOperation.java
Outdated
Show resolved
Hide resolved
...ava/org/apache/hadoop/fs/azurebfs/services/retryReasonCategories/ServerErrorRetryReason.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/apache/hadoop/fs/azurebfs/services/TestAbfsRestOperationMockFailures.java
Outdated
Show resolved
Hide resolved
import static org.mockito.ArgumentMatchers.any; | ||
import static org.mockito.ArgumentMatchers.nullable; | ||
|
||
public class TestAbfsRestOperationMockFailures { | ||
// In these tests a request first fails with given exceptions and then succeed on retry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a test where number of time metrics are updated for say 500 kind of error code is 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testClientRequestIdFor500Retry
asserts this.
Here the call first fails with 500 and then succeed. CST metrics are updated only once after success.
...tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/TestRetryReason.java
Outdated
Show resolved
Hide resolved
🎊 +1 overall
This message was automatically generated. |
...ols/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsRestOperation.java
Outdated
Show resolved
Hide resolved
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Hi @steveloughran , @mukund-thakur Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @anujmodi2021 for the patch. Added a few comments.
@@ -56,10 +58,14 @@ String getAbbreviation(final Integer statusCode, | |||
splitedServerErrorMessage)) { | |||
return EGRESS_LIMIT_BREACH_ABBREVIATION; | |||
} | |||
if (OPERATION_BREACH_MESSAGE.equalsIgnoreCase( | |||
if (TPS_OVER_ACCOUNT_LIMIT.getErrorMessage().equalsIgnoreCase( | |||
splitedServerErrorMessage)) { | |||
return OPERATION_LIMIT_BREACH_ABBREVIATION; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please rename OPERATION_LIMIT_BREACH_ABBREVIATION
to TPS_LIMIT_BREACH_ABBREVIATION
, this would maintain consistency with the naming convention with other constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taken
|| INGRESS_LIMIT_BREACH_ABBREVIATION.equals(failureReason) // Case 4.a | ||
|| EGRESS_LIMIT_BREACH_ABBREVIATION.equals(failureReason) // Case 4.b | ||
|| OPERATION_LIMIT_BREACH_ABBREVIATION.equals(failureReason)) // Case 4.c | ||
&& !wasExceptionThrown; // Case 6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please encapsulate the chain of conditions to a method #shouldUpdateCSTMetrics()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taken
// If no exception occurred till here it means http operation was successfully complete and | ||
// a response from server has been received which might be failure or success. | ||
// If any kind of exception has occurred it will be caught below. | ||
// If request failed determine failure reason and retry policy here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit : failed to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taken
} catch (UnknownHostException ex) { | ||
wasExceptionThrown = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we rename this? because in case of other exceptions, we update the metric below in finally block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to wasKnownExceptionThrown.
So that if any other exception is thrown we will still update metrics
@@ -283,7 +285,7 @@ String getClientLatency() { | |||
private boolean executeHttpOperation(final int retryCount, | |||
TracingContext tracingContext) throws AzureBlobFileSystemException { | |||
AbfsHttpOperation httpOperation; | |||
boolean wasIOExceptionThrown = false; | |||
boolean wasExceptionThrown = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a comment on the usage of this,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment
Thanks for the review. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
thanks, in trunk. @anmolanmol1234 can you do a pr and retest for 3.4? I'll rebase my work on this and see how it goes |
That would be me, not anmol. |
…g Metrics Logic (apache#6276) ABFS has a client-side throttling mechanism which works on the metrics collected from past requests When requests are fail due to server-side throttling it updates its metrics and recalculates any client side backoff. The choice of which requests should be used to compute client side backoff interval is based on the http status code: - Status code in 2xx range: Successful Operations should contribute. - Status code in 3xx range: Redirection Operations should not contribute. - Status code in 4xx range: User Errors should not contribute. - Status code is 503: Throttling Error should contribute only if they are due to client limits breach as follows: * 503, Ingress Over Account Limit: Should Contribute * 503, Egress Over Account Limit: Should Contribute * 503, TPS Over Account Limit: Should Contribute * 503, Other Server Throttling: Should not Contribute. - Status code in 5xx range other than 503: Should not Contribute. - IOException and UnknownHostExceptions: Should not Contribute. Contributed by Anuj Modi
…g Metrics Logic (#6276) ABFS has a client-side throttling mechanism which works on the metrics collected from past requests When requests are fail due to server-side throttling it updates its metrics and recalculates any client side backoff. The choice of which requests should be used to compute client side backoff interval is based on the http status code: - Status code in 2xx range: Successful Operations should contribute. - Status code in 3xx range: Redirection Operations should not contribute. - Status code in 4xx range: User Errors should not contribute. - Status code is 503: Throttling Error should contribute only if they are due to client limits breach as follows: * 503, Ingress Over Account Limit: Should Contribute * 503, Egress Over Account Limit: Should Contribute * 503, TPS Over Account Limit: Should Contribute * 503, Other Server Throttling: Should not Contribute. - Status code in 5xx range other than 503: Should not Contribute. - IOException and UnknownHostExceptions: Should not Contribute. Contributed by Anuj Modi
Description of PR
Jira: https://issues.apache.org/jira/browse/HADOOP-19096
ABFS has a client-side throttling mechanism which works on the metrics collected from past requests made. I requests are getting failed due to throttling at server, we update our metrics and client side backoff is calculated based on those metrics.
This PR enhances the logic to decide which requests should be considered to compute client side backoff interval as follows:
For each request made by ABFS driver, we will determine if they should contribute to Client-Side Throttling based on the status code and result:
How was this patch tested?
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?