HADOOP-19295. S3A: large uploads can timeout over slow links #7089
Conversation
This sets a different timeout for put/post calls to all other requests:

option fs.s3a.connection.part.upload.timeout, default 15m

Although itests show the option is being applied to put/part uploads, and interpreted by the SDK, a full command-line test is showing a failure at 60s, as before.

fMrSmr7TgYCqPw1C2tI_tM0A_TzaYJWPfwVxnE9MyC on hadoop-3.4.1.tar.gz._COPYING_:
Retried 1: org.apache.hadoop.fs.s3a.AWSApiCallTimeoutException:
upload part #3 on hadoop-3.4.1.tar.gz._COPYING_:
software.amazon.awssdk.core.exception.ApiCallAttemptTimeoutException:
HTTP request execution did not complete before the specified timeout configuration: 60000 millis

This actually validates that the upload recovery is good, which makes me happy.

2024-10-01 18:30:50,287 [s3a-transfer-stevel-london-bounded-pool1-t1] INFO impl.UploadContentProviders (UploadContentProviders.java:newStream(278)) -
Stream created more than once: FileWithOffsetContentProvider{file=/tmp/hadoop-stevel/s3a/s3ablock-0001-751923718162888182.tmp, offset=0} BaseContentProvider{size=67108864, streamCreationCount=7, currentStream=null}

Change-Id: I84e594eae55746a85f58b05ad376173ddbbc3ad1
Log of a failure with this PR: it is timing out after 60s, just doing it differently from 3.4.1, as we log at INFO the recoveries of the content provider (whose log message I'm now tuning). This is complex enough that I think #7087 should be the 3.4.1 solution; this will be the stable one.
🎊 +1 overall
This message was automatically generated.
Changing the different timeouts from 60s shows that the request timeout is still the timeout problem, and that on trunk recovery is different from 3.4.1, as stream recreation is logged. Going to improve diags there (start time, and stack at debug).
Next iteration will improve the toString() of the upload with the start time; and if debug logging is enabled, the full stack is generated.
This sets a different timeout for put/post calls to all other requests. This commit sets apiCallAttemptTimeout() as well as the apiCallTimeout(); both need to be set for the extended timeouts to get picked up. Change-Id: I30b3832c5240ba3d655c5bfd550aab18c5767b4f
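The commit above notes that both apiCallAttemptTimeout() and apiCallTimeout() must be raised together for the extended timeout to take effect. A minimal stdlib-only sketch of that rule, using illustrative names (this is not the actual S3A or AWS SDK v2 API):

```java
import java.time.Duration;

// Hedged sketch: models the "set both timeouts" rule from the commit above.
// If only one of the two bounds is raised, the shorter default still fires
// first. Class and method names are illustrative.
final class UploadTimeouts {
  final Duration apiCallTimeout;          // bound on the whole request, retries included
  final Duration apiCallAttemptTimeout;   // bound on a single HTTP attempt

  UploadTimeouts(Duration call, Duration attempt) {
    this.apiCallTimeout = call;
    this.apiCallAttemptTimeout = attempt;
  }

  /** Duration.ZERO is the "no custom timeout" sentinel, as in the PR diff. */
  static UploadTimeouts forPartUpload(Duration partUploadTimeout, Duration defaultTimeout) {
    Duration effective = partUploadTimeout.isZero() ? defaultTimeout : partUploadTimeout;
    // set BOTH bounds; setting only one leaves the other default to fire first
    return new UploadTimeouts(effective, effective);
  }
}
```

With the default of 15m for part uploads, both bounds move to 15m; with the zero sentinel, the general request timeout is used unchanged.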
Working test run. Note that even though this was (accidentally) through a VPN, upload performance is closer to v1; will retest without the VPN after running the itests.
💔 -1 overall
This message was automatically generated.
Testing against S3 London; the unrelated failure is from an assert we should cut.
@@ -1295,6 +1302,7 @@ protected RequestFactory createRequestFactory() {
        .withContentEncoding(contentEncoding)
        .withStorageClass(storageClass)
        .withMultipartUploadEnabled(isMultipartUploadEnabled)
+       .withPartUploadTimeout(partUploadTimeout)
How is this part upload timeout different from the multipart upload timeout?
It's my new option: the same value is used for a simple PUT as for multipart uploads; we patch the individual requests.
Anyway:
- will cut
- will modify info log to only print once per stream, to keep that log noise down.
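The once-per-stream gating mentioned above could look like this minimal sketch (illustrative class, not the actual UploadContentProviders code; the thresholds are assumptions):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch of the "log noise" fix discussed above: log recreation at
// INFO only the first time a provider recreates its stream, and demote
// later recreations to DEBUG so retry-heavy uploads stay readable.
final class RecreationLogGate {
  private final AtomicInteger streamCreationCount = new AtomicInteger();

  /**
   * Record a stream creation and return the level the caller should log at:
   * "NONE" for the first creation (nothing to report), "INFO" for the first
   * recreation, "DEBUG" for every recreation after that.
   */
  String levelForNewStream() {
    int count = streamCreationCount.incrementAndGet();
    if (count <= 1) {
      return "NONE";
    }
    return count == 2 ? "INFO" : "DEBUG";
  }
}
```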
 * This will be set on data put/post operations only.
 * A zero value means "no custom timeout".
 */
private Duration partUploadTimeout = Duration.ZERO;
I believe you have tested this explicitly. Hopefully zero doesn't mean infinite.
Yes, I will set it to the default... even though it's set up properly in production, there are some test cases which didn't.
-LOG.info("Stream created more than once: {}", this);
+LOG.info("Stream recreated: {}", this);
 if (LOG.isDebugEnabled()) {
   LOG.debug("Stream creation stack", new Exception("here"));
I think this is left over from some testing.
Good question. Should we delete it? It's actually pretty handy for troubleshooting networking issues.
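For context on why the `new Exception("here")` idiom above is useful: merely constructing a Throwable captures the current stack, so it can be attached to a DEBUG log without ever being thrown. A small runnable sketch (illustrative names, not the Hadoop code):

```java
// Hedged sketch: a Throwable records its creation-site stack at construction
// time, which is what lets the DEBUG branch above log where the stream was
// recreated without throwing anything.
final class StackProbe {
  static String creationSite() {
    StackTraceElement[] stack = new Exception("here").getStackTrace();
    // frame 0 is the method that constructed the exception
    return stack.length > 0 ? stack[0].getMethodName() : "<unknown>";
  }
}
```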
* default timeout set in builder
* tune logging of content provider on recovery
* new tests to verify timeout propagation
* discovered a new wrapping of failures in read(), so relaxed intercept exception class more

Change-Id: I43e2822e4dbd684d2c0469650b07369b731a2e7c
Force-pushed 7298e92 to 04a18d3.
found that sometimes the timeout is wrapped in an UncheckedIOException; so relaxed the exception intercepted.
🎊 +1 overall
This message was automatically generated.
+1 LGTM other than one question below
15 mins seems too much to me. Any specific reason for selecting this high value as opposed to 5 mins ?
No, just something to cope with a slow block upload with maybe a transient network error. We still have that retry count limit, so repeated network errors will fail in less than 15 min.
…7089)

This sets a different timeout for data upload PUT/POST calls to all other requests, so that slow block uploads do not trigger timeouts as rapidly as normal requests. This was always the behavior in the V1 AWS SDK; for V2 we have to explicitly set it on the operations we want to give extended timeouts.

Option: fs.s3a.connection.part.upload.timeout
Default: 15m

Contributed by Steve Loughran
…7100)

This sets a different timeout for data upload PUT/POST calls to all other requests, so that slow block uploads do not trigger timeouts as rapidly as normal requests. This was always the behavior in the V1 AWS SDK; for V2 we have to explicitly set it on the operations we want to give extended timeouts.

Option: fs.s3a.connection.part.upload.timeout
Default: 15m

Contributed by Steve Loughran
Long term fix; #7087 is the quick one
This sets a different timeout for put/post calls to all other requests, so that they don't time out uploads.
option fs.s3a.connection.part.upload.timeout
default 15m (just a guess... we could make it bigger)
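For reference, the new option would be set in core-site.xml like other S3A options; a hedged sketch (the `15m` value format assumes S3A's usual duration-suffix parsing):

```xml
<!-- Sketch: raise the data upload PUT/POST timeout for slow links. -->
<property>
  <name>fs.s3a.connection.part.upload.timeout</name>
  <value>15m</value>
</property>
```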
Although itests show the option is being applied to put/part uploads, and interpreted by the SDK, a full command-line test is showing a failure at 60s, as before.
fMrSmr7TgYCqPw1C2tI_tM0A_TzaYJWPfwVxnE9MyC on hadoop-3.4.1.tar.gz._COPYING_:
Retried 1: org.apache.hadoop.fs.s3a.AWSApiCallTimeoutException:
upload part #3
on hadoop-3.4.1.tar.gz._COPYING_:
software.amazon.awssdk.core.exception.ApiCallAttemptTimeoutException:
HTTP request execution did not complete before the specified timeout
configuration: 60000 millis
This actually validates that the upload recovery is good, which makes me happy
2024-10-01 18:30:50,287 [s3a-transfer-stevel-london-bounded-pool1-t1] INFO impl.UploadContentProviders
(UploadContentProviders.java:newStream(278)) -
Stream created more than once: FileWithOffsetContentProvider{file=/tmp/hadoop-stevel/s3a/s3ablock-0001-751923718162888182.tmp, offset=0} BaseContentProvider{size=67108864, streamCreationCount=7, currentStream=null}
Even though the upload doesn't actually take.
I've modified the different 60s timeouts to help identify which is causing the HTTP request timeouts. Hopefully it is something we are actually setting ourselves.
How was this patch tested?
new ITest and manual upload of a gigabyte file
For code changes: if applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?