HADOOP-16823. Manage S3 Throttling exclusively in S3A client. #1814
Conversation
Not yet tested against anything.
It might make sense to add throttling as part of the fault injection in the unreliable S3 client; we could then test the counters, and it would aid production testing.
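Purely as an illustration of that fault-injection idea (not the existing hadoop-aws fault-injection client, whose structure may differ), a test client could throw a 503 "SlowDown" error on some fraction of calls. The class name and probability hook below are assumptions:

```java
import java.util.Random;

import com.amazonaws.services.s3.model.AmazonS3Exception;

/**
 * Illustrative sketch only: inject S3 throttling failures for testing.
 * The class name, probability field and hook point are assumptions,
 * not the actual fault-injection client in hadoop-aws.
 */
public class ThrottlingFaultInjector {

  private final Random random = new Random();

  /** Fraction of calls to throttle, e.g. 0.1 == 10%. */
  private final double throttleProbability;

  public ThrottlingFaultInjector(double throttleProbability) {
    this.throttleProbability = throttleProbability;
  }

  /** Call before delegating to the real S3 operation. */
  public void maybeThrottle(String operation) {
    if (random.nextDouble() < throttleProbability) {
      AmazonS3Exception ex = new AmazonS3Exception(
          "throttled (injected) during " + operation);
      ex.setStatusCode(503);        // S3 throttling status code
      ex.setErrorCode("SlowDown");  // S3's throttle error code
      throw ex;
    }
  }
}
```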
Currently AWS S3 throttling is initially handled in the AWS SDK, only reaching the S3A client code after the SDK has given up. This means we don't always directly observe when throttling is taking place.

Proposed:

* disable throttling retries in the AWS client library
* add a quantile for the S3 throttle events, as DDB has
* isolate counters of S3 and DDB throttle events to classify issues better

Because we are taking over the AWS retries, we will need to expand the initial delay on retries and the number of retries we should support before giving up.

Also: should we log throttling events? It could be useful, but there is a risk of overloading the logs, especially if many threads in the same process are triggering the problem. Proposed: log at debug.

Change-Id: I386928cd478a6a9fbb91f15b9185a1ea91878680
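A minimal sketch of what disabling SDK-side throttle retries could look like with the AWS SDK v1 ClientConfiguration; the factory class and region here are illustrative, not the actual S3A client-factory code:

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

/**
 * Illustrative sketch: stop the AWS SDK from silently retrying
 * throttled (503) requests, so every throttle event reaches the
 * S3A layer where it can be counted and retried with its own policy.
 * The wiring below is an assumption, not the S3A factory code.
 */
public final class ThrottleAwareS3ClientFactory {

  private ThrottleAwareS3ClientFactory() {
  }

  public static AmazonS3 createClient() {
    ClientConfiguration conf = new ClientConfiguration();
    // don't let the SDK absorb throttle responses with its own backoff
    conf.setUseThrottleRetries(false);
    return AmazonS3ClientBuilder.standard()
        .withClientConfiguration(conf)
        .withRegion("us-east-1")   // region chosen purely for the example
        .build();
  }
}
```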
fix checkstyle Change-Id: I19f3848b298a8656ee5f986a2ba1cde50a106814
bfecd39 to 22708cd
Did manage to overload one of the DDB scale tests here; those throttling values are clearly too low to recover from a big mismatch of DDB capacity and load. We are going to need some bigger values.
...oh, and note that the failure was in the SCAN operation. I'm not sure we have enough wrapping there.
With retries on the scans in the test teardown (and in dump/purge DDB), I'm getting a failure in queryVersionMarker().
What is funny is that other tests are failing because they aren't detecting throttling.
Turning off throttling in the AWS client causes problems for the DDBMetastore, including showing where tests were making non-retrying operations against the table. Mostly addressed, though ITestDynamoDBMetadataStoreScale is still petulant: either it takes too long to finish or it doesn't throttle. Oh, and lag means that while a test may fail because throttling wasn't raised, the next IO may fail.

Change-Id: I37bbcb67023f4cb3ebdcba978602be58099ad306
My "little" fix to turn off retries in the AWS client causes issues in the DDB clients where there's a significant mismatch between prepaid IO and load; ITestDynamoDBMetadataStoreScale is the example of this. Looking at the AWS metrics, part of the fun is that the way bursty traffic is handled, you may get your capacity at the time of the initial load, but get blocked after. That is: the throttling may not happen under load, but during the next time a low-load API call is made. Also, S3GuardTableAccess isn't retrying, and some code in tests and the purge/dump table entry points go on to fail when throttling happens when iterating through scans. Fix: you can ask a DDBMetastore to wrap your scan with one bonded to its retry and metrics...plus use of this where appropriate. ITestDynamoDBMetadataStoreScale is really slow; either the changes make it worse, or its always been really slow and we haven't noticed as it was happening during the (slow) parallel test runs. Proposed: we review it, look at what we want to show and then see if we can make things fail faster Latest Patch makes the SDK throttling disablement exclusive to S3, fixed up DDB clients to retry better and tries to make a better case for that ITestDynamoDBMetadataStoreScale suite. I think I'm going to tune those tests to always downgrade if none is detected.
For the setup failure (here in test_070_putDirMarker): not sure. We either skip the test or retry. It's always surfacing in test_070; test_060 tests list scale. Looking at that code, I think the retry logic is too coarse: it retries the entire list, when we may want to just retry on the hasNext()/next() calls. That is: push it down. This will avoid so much load on any retry. See the sketch below.
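A sketch of that push-down idea, which also illustrates the earlier point about wrapping a scan with something bonded to retries and metrics. The class name, backoff values and counter are assumptions rather than the actual S3Guard code:

```java
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughputExceededException;

/**
 * Illustrative sketch of "pushing the retry down": rather than retrying
 * an entire listing, wrap the iterator so each individual hasNext()/next()
 * call is retried on throttling, with every throttle event counted.
 */
public class RetryingScanIterator<T> implements Iterator<T> {

  private final Iterator<T> inner;
  private final AtomicLong throttleEvents;
  private final int maxAttempts;

  public RetryingScanIterator(Iterator<T> inner,
      AtomicLong throttleEvents,
      int maxAttempts) {
    this.inner = inner;
    this.throttleEvents = throttleEvents;
    this.maxAttempts = maxAttempts;
  }

  @Override
  public boolean hasNext() {
    // only this one probe is retried, not the whole scan
    return retried(inner::hasNext);
  }

  @Override
  public T next() {
    return retried(inner::next);
  }

  /** Retry one short call on DDB throttling, with exponential backoff. */
  private <R> R retried(Supplier<R> call) {
    long delayMs = 100;   // illustrative starting backoff
    for (int attempt = 1; ; attempt++) {
      try {
        return call.get();
      } catch (ProvisionedThroughputExceededException e) {
        throttleEvents.incrementAndGet();
        if (attempt >= maxAttempts) {
          throw e;
        }
        try {
          Thread.sleep(delayMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw e;
        }
        delayMs *= 2;
      }
    }
  }
}
```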
* Split out where/how we retry listChildren
* Trying to speed up the DDB scale tests (though the latest change there triggers an NPE...)

For anyone curious why the tests take so long: it's probably the setup of the per-test-case FS instance, because that has full retry, and once one test has throttled, that spin/wait goes on until DDB is letting the client at it. Which is a PITA, but it does at least mean that "usually" each test case starts in a recovered state. Do we care? Should we just run them back to back and be happy overloading things? I think so.

Change-Id: Ib35d450449fffaa2379d62ca12180eaa70c38584
Quick followup: I noticed there are no javadocs here. I know not all the existing constants have them, but they should. Not worth fixing now, but for future patches, can you add a javadoc with the {@value} reference, so that anyone looking at the javadocs gets to see what the constant does and what its value is. Thanks.
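For future patches, the shape being asked for looks something like this; the constant name and value are made up for illustration:

```java
/** Hypothetical constants class used only to show the javadoc style. */
public final class S3AThrottleConstants {

  private S3AThrottleConstants() {
  }

  /**
   * Initial delay before retrying a throttled request, in milliseconds.
   * Value: {@value}.
   */
  public static final long THROTTLE_RETRY_DELAY_MS = 500;
}
```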
Oops, accidentally closed this. Will resubmit.