
HADOOP-19221. S3A: Unable to recover from failure of multipart block upload attempt #6938

Merged

Conversation

steveloughran
Contributor

@steveloughran steveloughran commented Jul 10, 2024

HADOOP-19221

Adds a custom set of content providers in UploadContentProviders which:

  • restart on failures (see the sketch below)
  • do not copy buffers/byte buffers into new private byte arrays, so they avoid exacerbating memory problems.

org.apache.hadoop.fs.store.ByteBufferInputStream has been pulled out of org.apache.hadoop.fs.store.DataBlocks to assist.
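For a rough idea of what a restartable provider involves, here is a sketch with invented names (not the actual UploadContentProviders implementation): the SDK's ContentStreamProvider.newStream() may be called again on every retry, and each call returns a fresh stream over the same backing bytes.

```java
// Hedged sketch only: illustrative names, not the PR's actual classes.
// ContentStreamProvider exposes a single newStream() call which the SDK
// may invoke again on every retry; returning a fresh stream over the same
// backing array makes the upload replayable without copying the data.
import java.io.ByteArrayInputStream;
import java.io.InputStream;

import software.amazon.awssdk.http.ContentStreamProvider;

public class ReplayableByteArrayProvider implements ContentStreamProvider {

  private final byte[] data;   // shared buffer, never duplicated
  private final int offset;
  private final int length;

  public ReplayableByteArrayProvider(byte[] data, int offset, int length) {
    this.data = data;
    this.offset = offset;
    this.length = length;
  }

  @Override
  public InputStream newStream() {
    // each (re)attempt reads the same bytes from the same start position
    return new ByteArrayInputStream(data, offset, length);
  }
}
```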

  • The S3A FS collects statistics on HTTP 400+ error codes received by the SDK, implemented through the logging auditor. Note: 404s are not collected, as they are so common during normal operation.
  • Improved handling of interrupt exceptions raised while waiting for block uploads to complete when Spark wants to abort a speculative task.
  • A fault injection test is used to recreate the failure (could only do this in CommitOperations) and verify the fix is good.

Description of PR

How was this patch tested?

Fault injection through an AWS SDK extension point which changes the status from 200 to 400 after the targeted operation completes. This puts the SDK into retry/recovery mode.

Some new unit tests

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@steveloughran steveloughran force-pushed the s3/HADOOP-19221-multipart-put-failures branch from bf5d5ec to a93afe6 Compare July 12, 2024 18:30
@steveloughran steveloughran marked this pull request as draft July 12, 2024 18:39
@steveloughran steveloughran force-pushed the s3/HADOOP-19221-multipart-put-failures branch from a93afe6 to 9b942c7 Compare July 12, 2024 18:47
@steveloughran
Contributor Author

I believe I have a way to test this by injecting 500 exceptions with a custom execution interceptor added to the audit chain.
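For context, a minimal sketch of what such an interceptor can look like with the v2 SDK (names and the failure-count logic here are illustrative, not the PR's actual test code):

```java
// Illustrative sketch, not the PR's fault-injection class.
// An ExecutionInterceptor can rewrite the HTTP response before the SDK
// interprets it, so a successful response can be turned into a 500 to
// force the SDK (and S3A) into their retry paths.
import java.util.concurrent.atomic.AtomicInteger;

import software.amazon.awssdk.core.interceptor.Context;
import software.amazon.awssdk.core.interceptor.ExecutionAttributes;
import software.amazon.awssdk.core.interceptor.ExecutionInterceptor;
import software.amazon.awssdk.http.SdkHttpResponse;

public class Inject500Interceptor implements ExecutionInterceptor {

  /** Number of requests to fail before letting traffic through. */
  private final AtomicInteger failuresLeft = new AtomicInteger(2);

  @Override
  public SdkHttpResponse modifyHttpResponse(
      Context.ModifyHttpResponse context,
      ExecutionAttributes executionAttributes) {
    SdkHttpResponse response = context.httpResponse();
    if (response.isSuccessful() && failuresLeft.getAndDecrement() > 0) {
      // replace the 200 with a 500 so the caller sees a server error
      return response.toBuilder()
          .statusCode(500)
          .statusText("Internal Server Error")
          .build();
    }
    return response;
  }
}
```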

@steveloughran steveloughran force-pushed the s3/HADOOP-19221-multipart-put-failures branch from 76deb75 to 65fd797 Compare July 18, 2024 18:37
@steveloughran
Contributor Author

Tested s3 london with -Dparallel-tests -DtestsThreadCount=12 -Dscale

Prefetch failures, timing related (a thread count of 12 is too big...)

[ERROR] Failures: 
[ERROR]   ITestS3APrefetchingInputStream.testReadLargeFileFully:136 [Maxiumum named action_executor_acquired.max] 
Expecting:
 <0L>
to be greater than:
 <0L> 
[ERROR] Errors: 
[ERROR]   ITestS3APrefetchingLruEviction.testSeeksWithLruEviction:162 » TestTimedOut tes...

@apache apache deleted a comment from hadoop-yetus Jul 22, 2024
@apache apache deleted a comment from hadoop-yetus Jul 22, 2024
@apache apache deleted a comment from hadoop-yetus Jul 22, 2024
@apache apache deleted a comment from hadoop-yetus Jul 22, 2024
@apache apache deleted a comment from hadoop-yetus Jul 22, 2024
@apache apache deleted a comment from hadoop-yetus Jul 22, 2024
@apache apache deleted a comment from hadoop-yetus Jul 22, 2024
@steveloughran steveloughran marked this pull request as ready for review July 22, 2024 17:37
@steveloughran steveloughran force-pushed the s3/HADOOP-19221-multipart-put-failures branch 2 times, most recently from 2790d56 to b95ee7c Compare July 23, 2024 16:09
@steveloughran
Contributor Author

s3 london with: -Dparallel-tests -DtestsThreadCount=8 -Dscale

This is ready to be reviewed. @mukund-thakur, @HarshitGupta11 and @shameersss1 could you all look at this?

@steveloughran steveloughran force-pushed the s3/HADOOP-19221-multipart-put-failures branch from b187942 to 1fb04e9 Compare July 24, 2024 14:58
*/
public final class ByteBufferInputStream extends InputStream {
private static final Logger LOG =
LoggerFactory.getLogger(DataBlocks.class);
Contributor

Shouldn't this be ByteBufferInputStream.class?
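The change being suggested would look something like this (illustrative):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// log against the class that owns the logger, not DataBlocks
private static final Logger LOG =
    LoggerFactory.getLogger(ByteBufferInputStream.class);
```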

} catch (ExecutionException ee) {
//there is no way of recovering so abort
//cancel all partUploads
Contributor

Aren't we cancelling all the uploads here?

@steveloughran steveloughran (Contributor Author) commented Jul 29, 2024

Looking at this; may need some more review to be confident we are doing the abort here properly.

Contributor Author

...done a lot more work on aborting.

*/
public class AWSStatus500Exception extends AWSServiceIOException {
public AWSStatus500Exception(String operation,
AwsServiceException cause) {
super(operation, cause);
}

@Override
public boolean retryable() {
Contributor

Will this make all 500s retriable? I mean, if S3 throws an exception like a 500 S3 Internal Server Error, do we need to retry from the S3A client as well?

Contributor Author

See the latest change... I've made it an option.

// there is specific handling for some 5XX codes (501, 503);
// this is for everything else
policyMap.put(AWSStatus500Exception.class, fail);
policyMap.put(AWSStatus500Exception.class, retryAwsClientExceptions);
Contributor

Do we need to selectively retry the 500 exception? Say, only when the cause is "Your socket connection..."?

Contributor Author

See the full comment below. Along with that, I really don't like looking in error strings; way too brittle for production code. Even in tests I like to share the text across production and test classes as constants.

(Yes, I know about org.apache.hadoop.fs.s3a.impl.ErrorTranslation... doesn't mean I like it.)

@shameersss1 shameersss1 (Contributor) left a comment

The changes look mostly good to me. I have left some minor comments and questions.

@mukund-thakur mukund-thakur (Contributor) left a comment

Production code looks good overall. Have to look at the tests.

@Override
protected ByteBufferInputStream createNewStream() {
// set the buffer up from reading from the beginning
blockBuffer.limit(initialPosition);
Contributor

Wondering why setting the limit is important?

Contributor Author

We want to start reading from the initial position every time the stream is opened.
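As a hedged illustration of that idea (invented names, not the PR's code), each new stream works on a view of the buffer whose position is reset to the block's start, so a retried upload re-reads the same bytes:

```java
// Sketch only: an input stream over a ByteBuffer that always starts
// from the same initial position, so each createNewStream()-style call
// replays the block from the beginning without copying it.
import java.io.InputStream;
import java.nio.ByteBuffer;

final class ReplayableByteBufferStream extends InputStream {

  private final ByteBuffer view;

  ReplayableByteBufferStream(ByteBuffer source, int initialPosition) {
    // duplicate() shares the bytes but has an independent position/limit
    this.view = source.duplicate();
    this.view.position(initialPosition);
  }

  @Override
  public int read() {
    return view.hasRemaining() ? (view.get() & 0xFF) : -1;
  }
}
```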

Retrying _should_ make it go away.

The 500 error is considered retryable by the AWS SDK, which will have already
tried it `fs.s3a.attempts.maximum` times before reaching the S3A client -which
Contributor

nit: retried?

@steveloughran
Contributor Author

@shameersss1
I really don't know what best to do here.

We have massively cut back on the number of retries which take place in the V2 SDK compared to V1, even though we have discussed in the past turning it off completely and handling it all ourselves. However, that would break things the transfer manager does in separate threads.

The thing is, I do not know how often we see 500 errors against AWS S3 stores (rather than third-party ones with unrecoverable issues) -and now that we have seen them I don't know what the right policy should be. The only documentation on what to do seems more focused on 503s, and doesn't provide any hints about why a 500 could happen or what to do other than "keep trying, maybe it'll go away": https://repost.aws/knowledge-center/http-5xx-errors-s3 . I do suspect it is very rare -otherwise the AWS team might have noticed their lack of resilience here, and we would've found it during our own testing. Any 500 error at any point other than multipart uploads probably gets recovered from nicely, so there could have been a background noise of these which we have never noticed before. S3A FS stats will now track these, which may be informative.

I don't want to introduce another configuration switch if possible, because that adds more documentation, testing, maintenance, etc. One thing I was considering: should we treat this exactly the same as a throttling exception, which has its own configuration settings for retries?

Anyway, if you could talk to your colleagues and make some suggestions based on real knowledge of what can happen, that would be really nice. Note that we are treating 500 as idempotent, the way we do with all the other failures, even though from a distributed-computing purist's perspective that is not in fact true.

Not looked at the other comments yet; will do later. Based on a code walk-through with Mukund, Harshit and Saikat, I've realised we should make absolutely sure that the stream providing a subset of a file fails immediately if a read() goes past the allocated space. With tests, obviously.

@steveloughran
Contributor Author

@shameersss1 here is what I propose:

Add a boolean config option fs.s3a.retry.all.http.errors.

Retry on all "maybe unrecoverable" HTTP errors; default is false.

I did think about a comma-separated list ("500, 4xx, 510") but decided it was overly complex.

@steveloughran
Contributor Author

FYI, just reviewing the block output stream and simplifying it... no need to make the single-block PUT async, and it simplifies that code once there's no need to worry about interruptions and cancelling. Also going up the execution chain to make sure CancellationException is processed in the output stream.
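The pattern being described is roughly this (a sketch under invented names, not the actual S3ABlockOutputStream code):

```java
// Sketch: waiting for an upload future in close() and translating
// cancellation/interruption into InterruptedIOException for callers.
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class UploadWaiter {

  static void awaitUpload(Future<?> upload) throws IOException {
    try {
      upload.get();
    } catch (CancellationException e) {
      // e.g. a speculative task was aborted and its uploads cancelled
      throw (InterruptedIOException)
          new InterruptedIOException("upload cancelled").initCause(e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw (InterruptedIOException)
          new InterruptedIOException("interrupted awaiting upload").initCause(e);
    } catch (ExecutionException e) {
      // no way of recovering here; surface the underlying failure
      throw new IOException(e.getCause());
    }
  }
}
```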

Performing the PUT in the same thread as S3ABlockOutputStream.close()
simplifies it a lot (no cancel/interrupt).

+close() maps CancellationException to InterruptedIOException
+updated javadocs to emphasise CancellationException can be raised
+tweaked some problematic javadocs

Change-Id: I266697cd722fcfab0f9a98450d84abcdd38cb883
Change-Id: I42eabb4e9348c6c2c88a284a33b6230947914695
@steveloughran
Contributor Author

Had to do a merge commit, as all PRs whose dependencies on h-thirdparty are 1.3.0-SNAPSHOT will break.

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 21s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 20 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 23s Maven dependency ordering for branch
+1 💚 mvninstall 20m 17s trunk passed
+1 💚 compile 9m 3s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 compile 8m 30s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 checkstyle 2m 8s trunk passed
+1 💚 mvnsite 1m 35s trunk passed
+1 💚 javadoc 1m 14s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javadoc 1m 9s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 spotbugs 2m 20s trunk passed
+1 💚 shadedclient 21m 14s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 21s Maven dependency ordering for patch
+1 💚 mvninstall 0m 53s the patch passed
+1 💚 compile 8m 47s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javac 8m 47s the patch passed
+1 💚 compile 9m 1s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 javac 9m 1s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 2m 8s /results-checkstyle-root.txt root: The patch generated 3 new + 25 unchanged - 1 fixed = 28 total (was 26)
+1 💚 mvnsite 1m 21s the patch passed
+1 💚 javadoc 0m 58s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javadoc 0m 54s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 spotbugs 2m 19s the patch passed
+1 💚 shadedclient 23m 38s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 15m 38s hadoop-common in the patch passed.
+1 💚 unit 2m 0s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 37s The patch does not generate ASF License warnings.
153m 23s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/22/artifact/out/Dockerfile
GITHUB PR #6938
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux e1ade8fef27e 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / c797a83
Default Java Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/22/testReport/
Max. process+thread count 1276 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/22/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 18m 10s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 20 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 40s Maven dependency ordering for branch
+1 💚 mvninstall 32m 28s trunk passed
+1 💚 compile 17m 40s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 compile 16m 4s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 checkstyle 4m 18s trunk passed
+1 💚 mvnsite 2m 46s trunk passed
+1 💚 javadoc 1m 58s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javadoc 1m 43s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 spotbugs 3m 56s trunk passed
+1 💚 shadedclient 35m 26s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 33s Maven dependency ordering for patch
+1 💚 mvninstall 1m 28s the patch passed
+1 💚 compile 17m 9s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javac 17m 9s the patch passed
+1 💚 compile 16m 18s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 javac 16m 18s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 23s /results-checkstyle-root.txt root: The patch generated 3 new + 25 unchanged - 1 fixed = 28 total (was 26)
+1 💚 mvnsite 2m 40s the patch passed
+1 💚 javadoc 1m 53s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javadoc 1m 45s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 spotbugs 4m 20s the patch passed
+1 💚 shadedclient 35m 8s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 20m 33s hadoop-common in the patch passed.
+1 💚 unit 3m 3s hadoop-aws in the patch passed.
+1 💚 asflicense 1m 4s The patch does not generate ASF License warnings.
263m 47s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/21/artifact/out/Dockerfile
GITHUB PR #6938
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux 42d0683f7cd5 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / c797a83
Default Java Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/21/testReport/
Max. process+thread count 3152 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/21/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran
Contributor Author

This is funny: placing the simple PUT in the same thread as close() breaks ITestS3AConcurrentOps.

Why so? That test looks for at least one thread called s3a-transfer, then asserts that after the thread timeout that count goes to zero. It is meant to assert that after renames the pool is drained, but we've made two changes this year to reduce the number of threads:

  1. Small file renames don't use the copy manager. This invalidated the test -we just never noticed.
  2. This PR: the small PUT is no longer async. This caused the latent regression from change 1 to surface.

As a result: no threads to assert on.

I'm fixing it by shrinking the multipart upload size to its minimum -this seems to work, though if problems surface in future we should look at the test and decide whether it is obsolete, or whether we could redesign the tests to include more parallelized operations (tree renames?).
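For reference, shrinking the part size in a test configuration looks roughly like this. fs.s3a.multipart.size and fs.s3a.multipart.threshold are real S3A options; the class name and the exact values are only for illustration (5 MB is S3's minimum part size):

```java
// Sketch: forcing multipart uploads in a test by shrinking the part size
// to the S3 minimum, so even modest files exercise the multipart path.
import org.apache.hadoop.conf.Configuration;

public class SmallMultipartTestConfig {
  public static Configuration create() {
    Configuration conf = new Configuration();
    long fiveMB = 5L * 1024 * 1024;            // S3 minimum part size
    conf.setLong("fs.s3a.multipart.size", fiveMB);
    conf.setLong("fs.s3a.multipart.threshold", fiveMB);
    return conf;
  }
}
```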

PUT changes caused a latent regression in this test to surface

Also: more tuning of S3ABlockOutputStream, including
comments, logging and @RetryPolicy tags

Change-Id: I7c9920a3bb835d6993b5e1f84faf1286e2194fd0
* General review of S3ABlockOutputStream, inc javadocs
  and Retries attributes
* S3ADataBlocks also has its methods' exceptions reviewed and
  javadocs updated
* Default value of fs.s3a.retry.http.5xx.errors is true
* Troubleshooting and third-party docs updated

Change-Id: I764943cdc0c867875be807ee6f4bd27600aae275
@steveloughran
Contributor Author

today's changes.

  • fix failing test ITestS3AConcurrentOps
  • General review of S3ABlockOutputStream, inc javadocs
    and Retries attributes
  • S3ADataBlocks also has its methods' exceptions reviewed and
    javadocs updated
  • Default value of fs.s3a.retry.http.5xx.errors is true
  • Troubleshooting and third-party docs updated

The javadoc changes are mainly @throws clauses, as I went through them all to see what could be thrown and why, just to reassure myself there are no other causes of cancellation.

The change of the fs.s3a.retry.http.5xx.errors default to true is probably the most controversial, as Mukund felt the SDK retried anyway and the error is so rare.
I think we should do it just in case -and I've updated the third-party and troubleshooting
docs to match and to discuss turning it off.
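For anyone wanting to turn it off programmatically, a minimal sketch (the bucket URI is a placeholder):

```java
// Sketch: disabling S3A's own retry of 5xx responses (503 throttling is
// still handled by its separate throttling retry policy).
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class DisableS3a5xxRetries {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.s3a.retry.http.5xx.errors", false);
    // "s3a://example-bucket/" is a placeholder bucket
    try (FileSystem fs = FileSystem.get(new URI("s3a://example-bucket/"), conf)) {
      // use the filesystem as normal; S3A itself will no longer retry
      // 5xx responses other than 503
    }
  }
}
```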

@steveloughran
Contributor Author

Test-wise, something transient:

[ERROR] testDirProbes[keep-markers](org.apache.hadoop.fs.s3a.ITestS3AFileOperationCost)  Time elapsed: 0.856 s  <<< ERROR!
java.nio.file.AccessDeniedException: s3a://stevel-london/job-00-fork-0005/test/testDirProbes[keep-markers]: org.apache.hadoop.fs.s3a.audit.AuditFailureException: efb58954-16ca-40d2-8f6a-2aef61bba339-00000038 unaudited operation executing a request outside an audit span {action_http_head_request 'job-00-fork-0005/test/testDirProbes[keep-markers]' size=0, mutating=false}
        at org.apache.hadoop.fs.s3a.audit.AuditIntegration.translateAuditException(AuditIntegration.java:161)
        at org.apache.hadoop.fs.s3a.audit.AuditIntegration.maybeTranslateAuditException(AuditIntegration.java:175)
        at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:200)
        at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:157)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:4102)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:4005)
        at org.apache.hadoop.fs.s3a.S3ATestUtils.innerGetFileStatus(S3ATestUtils.java:1641)
        at org.apache.hadoop.fs.s3a.performance.AbstractS3ACostTest.lambda$interceptGetFileStatusFNFE$5(AbstractS3ACostTest.java:459)
        at org.apache.hadoop.test.LambdaTestUtils.intercept(LambdaTestUtils.java:500)
        at org.apache.hadoop.test.LambdaTestUtils.intercept(LambdaTestUtils.java:386)
        at org.apache.hadoop.test.LambdaTestUtils.intercept(LambdaTestUtils.java:455)
        at org.apache.hadoop.fs.s3a.performance.OperationCostValidator.lambda$intercepting$1(OperationCostValidator.java:221)
        at org.apache.hadoop.fs.s3a.performance.OperationCostValidator.exec(OperationCostValidator.java:167)
        at org.apache.hadoop.fs.s3a.performance.OperationCostValidator.intercepting(OperationCostValidator.java:220)
        at org.apache.hadoop.fs.s3a.performance.AbstractS3ACostTest.verifyMetricsIntercepting(AbstractS3ACostTest.java:342)
        at org.apache.hadoop.fs.s3a.performance.AbstractS3ACostTest.interceptOperation(AbstractS3ACostTest.java:361)
        at org.apache.hadoop.fs.s3a.performance.AbstractS3ACostTest.interceptGetFileStatusFNFE(AbstractS3ACostTest.java:457)
        at org.apache.hadoop.fs.s3a.ITestS3AFileOperationCost.testDirProbes(ITestS3AFileOperationCost.java:339)

This says the call wasn't in a span, but it is, unless the span source is null -and the span source is set to the FS in setup(), and the FS sets its audit span in initialize() to a real or stub audit manager.

So I have no idea how this can be reached. I've seen something like this before, so I consider it completely unrelated.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 53s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 21 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 15m 27s Maven dependency ordering for branch
+1 💚 mvninstall 32m 39s trunk passed
+1 💚 compile 17m 24s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 compile 16m 22s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 checkstyle 4m 20s trunk passed
+1 💚 mvnsite 2m 45s trunk passed
+1 💚 javadoc 1m 57s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javadoc 1m 44s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 spotbugs 3m 57s trunk passed
+1 💚 shadedclient 34m 56s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 32s Maven dependency ordering for patch
+1 💚 mvninstall 1m 26s the patch passed
+1 💚 compile 16m 57s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javac 16m 57s the patch passed
+1 💚 compile 16m 2s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 javac 16m 2s the patch passed
-1 ❌ blanks 0m 0s /blanks-eol.txt The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
+1 💚 checkstyle 4m 17s root: The patch generated 0 new + 26 unchanged - 1 fixed = 26 total (was 27)
+1 💚 mvnsite 2m 39s the patch passed
+1 💚 javadoc 1m 54s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javadoc 1m 44s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 spotbugs 4m 17s the patch passed
+1 💚 shadedclient 34m 58s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 20m 2s hadoop-common in the patch passed.
-1 ❌ unit 3m 27s /patch-unit-hadoop-tools_hadoop-aws.txt hadoop-aws in the patch passed.
+1 💚 asflicense 1m 4s The patch does not generate ASF License warnings.
246m 23s
Reason Tests
Failed junit tests hadoop.fs.s3a.TestS3ABlockOutputStream
hadoop.fs.s3a.TestInvoker
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/23/artifact/out/Dockerfile
GITHUB PR #6938
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux 05fb40c90697 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 299675d
Default Java Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/23/testReport/
Max. process+thread count 1486 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/23/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Mockito test failed because Mockito tests are so brittle.

Change-Id: Ia5e8a4fdb74b08a04af58f3b5c868392758cf9f3
Change-Id: I74994e4c41205db836df18ff53777219467a98cd
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 55s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 21 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 52s Maven dependency ordering for branch
-1 ❌ mvninstall 33m 6s /branch-mvninstall-root.txt root in trunk failed.
+1 💚 compile 17m 32s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 compile 16m 15s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 checkstyle 4m 21s trunk passed
+1 💚 mvnsite 2m 43s trunk passed
+1 💚 javadoc 2m 8s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javadoc 1m 47s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 spotbugs 3m 55s trunk passed
-1 ❌ shadedclient 35m 53s branch has errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 34s Maven dependency ordering for patch
+1 💚 mvninstall 1m 28s the patch passed
+1 💚 compile 16m 50s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javac 16m 50s the patch passed
+1 💚 compile 16m 33s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 javac 16m 33s the patch passed
+1 💚 blanks 0m 1s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 16s /results-checkstyle-root.txt root: The patch generated 1 new + 26 unchanged - 1 fixed = 27 total (was 27)
+1 💚 mvnsite 2m 38s the patch passed
+1 💚 javadoc 1m 15s hadoop-common in the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.
+1 💚 javadoc 0m 53s hadoop-tools_hadoop-aws-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 generated 0 new + 0 unchanged - 2 fixed = 0 total (was 2)
+1 💚 javadoc 1m 46s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 spotbugs 4m 16s the patch passed
-1 ❌ shadedclient 35m 40s patch has errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 19m 58s hadoop-common in the patch passed.
-1 ❌ unit 2m 57s /patch-unit-hadoop-tools_hadoop-aws.txt hadoop-aws in the patch passed.
+1 💚 asflicense 1m 4s The patch does not generate ASF License warnings.
248m 13s
Reason Tests
Failed junit tests hadoop.fs.s3a.TestInvoker
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/24/artifact/out/Dockerfile
GITHUB PR #6938
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux 4dec884b8597 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 5c915db
Default Java Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/24/testReport/
Max. process+thread count 2147 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/24/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran
Contributor Author

[ERROR] Failures: 
[ERROR] org.apache.hadoop.fs.s3a.TestInvoker.test500isMappedTooAWSStatus500Exception(org.apache.hadoop.fs.s3a.TestInvoker)
[ERROR]   Run 1: TestInvoker.test500isMappedTooAWSStatus500Exception:191 [should retry org.apache.hadoop.fs.s3a.AWSStatus500Exception: test on /: software.amazon.awssdk.services.s3.model.S3Exception: We encountered an internal error. Please try again: We encountered an internal error. Please try again] expected:<[FAIL]> but was:<[RETRY]>
[ERROR]   Run 2: TestInvoker.test500isMappedTooAWSStatus500Exception:191 [should retry org.apache.hadoop.fs.s3a.AWSStatus500Exception: test on /: software.amazon.awssdk.services.s3.model.S3Exception: We encountered an internal error. Please try again: We encountered an internal error. Please try again] expected:<[FAIL]> but was:<[RETRY]>
[ERROR]   Run 3: TestInvoker.test500isMappedTooAWSStatus500Exception:191 [should retry org.apache.hadoop.fs.s3a.AWSStatus500Exception: test on /: software.amazon.awssdk.services.s3.model.S3Exception: We encountered an internal error. Please try again: We encountered an internal error. Please try again] expected:<[FAIL]> but was:<[RETRY]>
[INFO] 
[ERROR] org.apache.hadoop.fs.s3a.TestInvoker.test5xxRetriesDisabled(org.apache.hadoop.fs.s3a.TestInvoker)
[ERROR]   Run 1: TestInvoker.test5xxRetriesDisabled:240->assertRetryAction:367 500 Expected action RetryAction(action=FAIL, delayMillis=0, reason=null) from shouldRetry(software.amazon.awssdk.services.s3.model.S3Exception: We encountered an internal error. Please try again, 1, true), but got RETRY
[ERROR]   Run 2: TestInvoker.test5xxRetriesDisabled:240->assertRetryAction:367 500 Expected action RetryAction(action=FAIL, delayMillis=0, reason=null) from shouldRetry(software.amazon.awssdk.services.s3.model.S3Exception: We encountered an internal error. Please try again, 1, true), but got RETRY
[ERROR]   Run 3: TestInvoker.test5xxRetriesDisabled:240->assertRetryAction:367 500 Expected action RetryAction(action=FAIL, delayMillis=0, reason=null) from shouldRetry(software.amazon.awssdk.services.s3.model.S3Exception: We encountered an internal error. Please try again, 1, true), but got RETRY
[INFO] 

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 54s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 21 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 33s Maven dependency ordering for branch
-1 ❌ mvninstall 33m 3s /branch-mvninstall-root.txt root in trunk failed.
+1 💚 compile 17m 39s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 compile 16m 22s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 checkstyle 4m 18s trunk passed
+1 💚 mvnsite 2m 41s trunk passed
+1 💚 javadoc 2m 13s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javadoc 1m 44s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 spotbugs 3m 56s trunk passed
-1 ❌ shadedclient 35m 9s branch has errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 33s Maven dependency ordering for patch
+1 💚 mvninstall 1m 31s the patch passed
+1 💚 compile 16m 51s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javac 16m 51s the patch passed
+1 💚 compile 16m 17s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 javac 16m 17s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 20s /results-checkstyle-root.txt root: The patch generated 1 new + 26 unchanged - 1 fixed = 27 total (was 27)
+1 💚 mvnsite 2m 41s the patch passed
+1 💚 javadoc 1m 13s hadoop-common in the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.
+1 💚 javadoc 0m 54s hadoop-tools_hadoop-aws-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 generated 0 new + 0 unchanged - 2 fixed = 0 total (was 2)
+1 💚 javadoc 1m 45s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 spotbugs 4m 17s the patch passed
-1 ❌ shadedclient 36m 24s patch has errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 19m 53s hadoop-common in the patch passed.
-1 ❌ unit 2m 58s /patch-unit-hadoop-tools_hadoop-aws.txt hadoop-aws in the patch passed.
+1 💚 asflicense 1m 6s The patch does not generate ASF License warnings.
247m 56s
Reason Tests
Failed junit tests hadoop.fs.s3a.TestInvoker
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/25/artifact/out/Dockerfile
GITHUB PR #6938
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux 498c67c4f546 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / cc81630
Default Java Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/25/testReport/
Max. process+thread count 1263 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/25/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Change-Id: Ib993974ce16df24a2da47223c3cc2e35336176a1
Changing the default retry policy failed a test
* implemented variants of the test case for policies with/without
  500 retry.
* little bit of cleanup on an old test suite.

Change-Id: Iefce2e9c7623f94644f1ad3f07bb581d4e707765
@steveloughran
Contributor Author

Tested s3 london with -Dparallel-tests -DtestsThreadCount=8 -Dscale

core:
[WARNING] Tests run: 1360, Failures: 0, Errors: 0, Skipped: 102
root:
[INFO] Tests run: 131, Failures: 0, Errors: 0, Skipped: 0

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 22s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 21 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 4s Maven dependency ordering for branch
+1 💚 mvninstall 20m 4s trunk passed
+1 💚 compile 9m 6s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 compile 8m 23s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 checkstyle 2m 11s trunk passed
+1 💚 mvnsite 1m 30s trunk passed
+1 💚 javadoc 1m 19s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javadoc 1m 4s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 spotbugs 2m 19s trunk passed
+1 💚 shadedclient 20m 9s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 21s Maven dependency ordering for patch
+1 💚 mvninstall 0m 52s the patch passed
+1 💚 compile 8m 49s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javac 8m 49s the patch passed
+1 💚 compile 8m 12s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 javac 8m 12s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 2m 5s root: The patch generated 0 new + 26 unchanged - 1 fixed = 26 total (was 27)
+1 💚 mvnsite 1m 36s the patch passed
+1 💚 javadoc 0m 46s hadoop-common in the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.
+1 💚 javadoc 0m 33s hadoop-tools_hadoop-aws-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 generated 0 new + 0 unchanged - 2 fixed = 0 total (was 2)
+1 💚 javadoc 1m 8s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 spotbugs 2m 38s the patch passed
+1 💚 shadedclient 20m 39s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 16m 59s hadoop-common in the patch passed.
+1 💚 unit 2m 11s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 43s The patch does not generate ASF License warnings.
151m 19s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/26/artifact/out/Dockerfile
GITHUB PR #6938
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux 62e1e1f661a3 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 6780fa5
Default Java Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/26/testReport/
Max. process+thread count 3153 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6938/26/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran
Contributor Author

PR all green; got Ahmar's approval. Merging to trunk and 3.4, but not 3.4.1.

@steveloughran steveloughran merged commit ea6e0f7 into apache:trunk Sep 13, 2024
4 checks passed
steveloughran added a commit to steveloughran/hadoop that referenced this pull request Sep 13, 2024
…upload attempt (apache#6938)


This is a major change which handles 400 error responses when uploading
large files from the memory heap/buffer (or the staging committer) and the remote S3
store returns a 500 response from an upload of a block in a multipart upload.

The SDK's own streaming code seems unable to fully replay the upload;
it attempts to, but then blocks, and the S3 store returns a 400 response:

    "Your socket connection to the server was not read from or written to
     within the timeout period. Idle connections will be closed.
     (Service: S3, Status Code: 400...)"

There is an option to control whether or not the S3A client itself
attempts to retry on a 50x error other than 503 throttling events
(which are independently processed as before)

Option:  fs.s3a.retry.http.5xx.errors
Default: true

500 errors are very rare from standard AWS S3, which has a five nines
SLA. It may be more common against S3 Express which has lower
guarantees.

Third party stores have unknown guarantees, and the exception may
indicate a bad server configuration. Consider setting
fs.s3a.retry.http.5xx.errors to false when working with
such stores.

Significant code changes:

There is now a custom set of implementations of
software.amazon.awssdk.http.ContentStreamProvider in
the class org.apache.hadoop.fs.s3a.impl.UploadContentProviders.

These:

* Restart on failures
* Do not copy buffers/byte buffers into new private byte arrays,
  so they avoid exacerbating memory problems.

There are new IOStatistics for specific HTTP error codes; these are collected
even when all recovery is performed within the SDK.
  
S3ABlockOutputStream has major changes, including handling of
Thread.interrupt() on the main thread, which now triggers and briefly
awaits cancellation of any ongoing uploads.

If the writing thread is interrupted in close(), it is mapped to
an InterruptedIOException. Applications like Hive and Spark must
catch these after cancelling a worker thread.

Contributed by Steve Loughran
steveloughran added a commit that referenced this pull request Sep 16, 2024
…upload attempt (#6938) (#7044)


steveloughran added a commit to steveloughran/hadoop that referenced this pull request Oct 2, 2024
…upload attempt (apache#6938) (apache#7044)


steveloughran added a commit that referenced this pull request Oct 3, 2024
…upload attempt (#6938) (#7044) (#7094)

