-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HADOOP-16490. Improve S3Guard handling of FNFEs in copy #1229
HADOOP-16490. Improve S3Guard handling of FNFEs in copy #1229
Conversation
af4baa6
to
36d4d21
Compare
Full test run in progress. |
This currently conflicts with #1246; that should go in first as it is the basic resilience; here i want to eliminate that HEAD entirely |
9704f87
to
379778a
Compare
|
||
When the S3A connector attempts to open a file for which it has an entry in | ||
its database, it will retry if the desired file is not found. This is | ||
done if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace:end of line
its database, it will retry if the desired file is not found. This is | ||
done if | ||
* No file is found in S3. | ||
* There is a file but its version or etag is not consistent with S3Guard table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace:end of line
tested s3 ireland -Ds3guard -Ddynamo. all tests happy. |
|
||
When the S3A connector attempts to open a file for which it has an entry in | ||
its database, it will retry if the desired file is not found. This is | ||
done if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace:end of line
its database, it will retry if the desired file is not found. This is | ||
done if | ||
* No file is found in S3. | ||
* There is a file but its version or etag is not consistent with S3Guard table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace:end of line
Thinking about the new flag passed in to s3GetFileStatus to skip the object probe it should be an enumset of probes:
we can also optimise some directory ops by skipping the list and doing the real list if the (object, marker) calls failed |
Proposed changes to this PR then
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, I have some questions and notes with my review.
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/RemoteFileChangedException.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ARetryPolicy.java
Outdated
Show resolved
Hide resolved
|
||
When the S3A connector attempts to open a file for which it has an entry in | ||
its database, it will retry if the desired file is not found. This is | ||
done if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace:end of line
its database, it will retry if the desired file is not found. This is | ||
done if | ||
* No file is found in S3. | ||
* There is a file but its version or etag is not consistent with S3Guard table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace:end of line
💔 -1 overall
This message was automatically generated. |
|
||
When the S3A connector attempts to open a file for which it has an entry in | ||
its database, it will retry if the desired file is not found. This is | ||
done if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace:end of line
its database, it will retry if the desired file is not found. This is | ||
done if | ||
* No file is found in S3. | ||
* There is a file but its version or etag is not consistent with S3Guard table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace:end of line
218489e
to
a0cb226
Compare
last update moves to a probe set so you can choose what to look for; maybeCreateFakeDirectory uses this to only do the LIST call, as we dont care if a fake dir marker is there...saves one HEAD per operation. not tested; Gabors comments not addressed |
|
||
When the S3A connector attempts to open a file for which it has an entry in | ||
its database, it will retry if the desired file is not found. This is | ||
done if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace:end of line
🎊 +1 overall
This message was automatically generated. |
S3Guard retry policy on failures in open/select/copy to be independently configurable from other retry policies, * and use a setting with exponential backoff * new config names * copy raises a RemoteFileChangedException which is *not* caught in rename() and downgraded to false. Thus: when a rename is unrecoverable, this fact is propagated * tests for this Also: tests turning auth mode on/off have to handle the auth state being set through an authoritative path, rather than a single flag. Caught me out as of course the first test I saw with this was the ITestS3ARemoteFileChanged rename ones, and I assumed that it was my new code. It was actually due to me setting an auth path last week. I'm unsure about the use of "RemoteFileChangedException", but it could be a remote file change, especially for an etag, where the file has been deleted since being recorded.I don't know if/how we should adapt to that in S3Guard if the problem persists. The file is either *no longer there* or *changes not yet visible*. Maybe consider marking parent dir as nonauth? Logging error, always? Change-Id: I7bb468aca0f4019537d82bc083f0a9887eaa282b
- different text when s3guard is off - better tostring values for logging Change-Id: I559d7a099e2720fee48d432c4fc7ae18c6af137f
…ently by not using deleteOnExit we save an RPC call on the normal path and reduce the risk of negative cached entries. Change-Id: Ie067c3b5179a1602488eadf35419805c089a3373
Change-Id: I3163cd3b561da0ab6bfb9f9d56d9097ec6d4438c
-initial delay = 2s -tweaked the new error messages to make clear this happened during rename and to highlight when a table is unguarded. -checkstyles Change-Id: Ibf71c433fc8342a20ffbf4d822cd9ea834a492a8
Change-Id: I1d08d64b27afd348261e0fa93444f14510b47746
no explicit testing of those probe switches -it will be straightforward Change-Id: I00f8987b445bfd192845a69d10a2ba6fd02bd2a1
…s at debug, rather than lose them. Change-Id: Icced1ac648c7a97e322362bbc03a245cf496edcd
Change-Id: Ie4a7c9aca2a505fce3b8eedd07fbb0127ad4379a
TestCopy was failing because of the removal of deleteOnExit broke the mock invocation count. -correct for current values -reinstate deleteOnExit before rename for the existing behaviour -add a few extra probes -and a test on what happens on a fail in write, rather than create. Change-Id: Ib8362ac3a5e36819255a35b8f54c8cfc523c1eb1
Gabor reported an NPE here on a test run of this PR only; added asserts that the lists from the MS are never null Change-Id: I9154eadabb486f601f86e51121d2f05d912c8e46
2d5d5a8
to
f7c878a
Compare
committed to trunk -thanks for all reviews |
🎊 +1 overall
This message was automatically generated. |
I closed the wrong PR! |
S3Guard retry policy independently configurable from other retry policies,
Also: tests turning auth mode on/off have to handle the auth state being set through an authoritative path, rather than a single flag. Caught me out as of course the first test I saw with this was the ITestS3ARemoteFileChanged rename ones, and I assumed that it was my new code. It was actually due to me setting an auth path last week.
I'm unsure about the use of "RemoteFileChangedException", but it could be a remote file change, especially for an etag, where the file has been deleted since being recorded.I don't know if/how we should adapt to that in S3Guard if the problem persists. The file is either no longer there or changes not yet visible. Maybe consider marking parent dir as nonauth?
Also, how about we log the RemoteFileChangedException error message before we throw it, so that there's more details in the log of the problem, irrespective of the action in the app calling it (i.e. fsshell discarding it, possibly)
Change-Id: I7bb468aca0f4019537d82bc083f0a9887eaa282b