
HADOOP-16490. Improve S3Guard handling of FNFEs in copy #1229

Closed

Conversation

steveloughran
Contributor

@steveloughran steveloughran commented Aug 5, 2019

S3Guard retry policy independently configurable from other retry policies:

  • use a setting with exponential backoff
  • new config names (a tuning sketch follows below)
  • copy raises a RemoteFileChangedException which is not caught in rename() and downgraded to false. Thus: when a rename is unrecoverable, this fact is propagated
  • tests for this
  • more logging at debug in the change policies, stating the policy type and when options are set as well as when they are not. Currently, working out the policy means looking for the absence of messages rather than their presence. This makes the log more verbose, but will aid in debugging these problems.
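For illustration, a minimal sketch of how the new retry settings might be tuned from client code. The key names and defaults shown are assumptions based on this patch, not authoritative; the committed core-default.xml is the reference:

```java
import org.apache.hadoop.conf.Configuration;

public class S3GuardRetryTuning {
  public static Configuration tuned() {
    Configuration conf = new Configuration();
    // Assumed key names for the S3Guard inconsistency retry policy;
    // retries back off exponentially from the initial interval.
    conf.set("fs.s3a.s3guard.consistency.retry.interval", "2s");
    conf.setInt("fs.s3a.s3guard.consistency.retry.limit", 7);
    return conf;
  }
}
```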

Also: tests turning auth mode on/off have to handle the auth state being set through an authoritative path, rather than a single flag. This caught me out: of course the first tests I saw fail were the ITestS3ARemoteFileChanged rename ones, and I assumed it was my new code. It was actually due to me setting an auth path last week.

I'm unsure about the use of "RemoteFileChangedException", but it could be a remote file change, especially for an etag, where the file has been deleted since being recorded. I don't know if/how we should adapt to that in S3Guard if the problem persists: the file is either no longer there or its changes are not yet visible. Maybe consider marking the parent dir as non-authoritative?

Also, how about we log the RemoteFileChangedException error message before we throw it, so that there are more details of the problem in the log, irrespective of what the calling application does with it (e.g. fsshell possibly discarding it)?

Change-Id: I7bb468aca0f4019537d82bc083f0a9887eaa282b

@steveloughran
Contributor Author

Full test run in progress.

@steveloughran steveloughran added bug fs/s3 changes related to hadoop-aws; submitter must declare test endpoint work in progress PRs still Work in Progress; reviews not expected but still welcome labels Aug 8, 2019
@steveloughran
Contributor Author

This currently conflicts with #1246; that should go in first as it is the basic resilience. Here I want to eliminate that HEAD entirely.

@steveloughran steveloughran force-pushed the s3/HADOOP-16490-s3guard-retry branch 2 times, most recently from 9704f87 to 379778a Compare August 9, 2019 18:02

When the S3A connector attempts to open a file for which it has an entry in
its database, it will retry if the desired file is not found. This is
done if
* No file is found in S3.
* There is a file but its version or etag is not consistent with the S3Guard table.

whitespace: end of line (flagged twice in the documentation lines above)

@steveloughran steveloughran added enhancement fs/s3 changes related to hadoop-aws; submitter must declare test endpoint and removed work in progress PRs still Work in Progress; reviews not expected but still welcome fs/s3 changes related to hadoop-aws; submitter must declare test endpoint bug labels Aug 10, 2019
@steveloughran
Contributor Author

Tested: S3 Ireland with -Ds3guard -Ddynamo. All tests happy.
This is ready for review.



@steveloughran
Contributor Author

Thinking about the new flag passed in to s3GetFileStatus to skip the object probe:

it should be an EnumSet of probes:

  • object: HEAD the object itself
  • marker: look for the directory marker
  • list: make a LIST call

  1. When we create a file, we only want (marker, list).
  2. For mkdir we would only do (object, list) and, if those find nothing, create the marker without ever probing for it. This avoids creating a 404 on that probe.

We can also optimise some directory ops by skipping the LIST probe and only doing the real list if the (object, marker) probes find nothing. A rough sketch of such a probe set is below.
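A sketch of what such a probe set could look like; the names are illustrative, not the final API:

```java
import java.util.EnumSet;
import java.util.Set;

/** Sketch only: which probes s3GetFileStatus should issue. */
public enum StatusProbe {
  OBJECT,      // HEAD the object itself
  DIR_MARKER,  // HEAD the trailing-slash directory marker
  LIST;        // LIST under the path

  /** create(): skip the object HEAD so no 404 gets cached for the new file. */
  public static final Set<StatusProbe> FOR_CREATE = EnumSet.of(DIR_MARKER, LIST);

  /** mkdir(): probe the object and the listing, but never HEAD the marker itself. */
  public static final Set<StatusProbe> FOR_MKDIR = EnumSet.of(OBJECT, LIST);
}
```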

@steveloughran
Contributor Author

Proposed changes to this PR, then:

  • enum to control which probes to make
  • tests to verify that the create() codepath doesn't do a HEAD probe when overwrite=true (different asserts for s3guard / s3guard auth); a test sketch follows below
  • not in scope for the initial PR: expanding to mkdir and other stat ops; keeping them as separate PRs with their own tests makes the changes more incremental.
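A sketch of the kind of assertion intended for the create() probe test. The helper and metric names (MetricDiff, OBJECT_METADATA_REQUESTS) are assumptions, and the expected count will differ between raw, s3guard and s3guard-auth runs:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.S3AFileSystem;
import org.apache.hadoop.fs.s3a.S3ATestUtils.MetricDiff;
import org.apache.hadoop.fs.s3a.Statistic;

public class CreateProbeCheck {
  /** Assert how many HEAD requests a create(overwrite=true) call issues. */
  public static void verifyHeadCountOnOverwrite(S3AFileSystem fs, Path path,
      long expectedHeads) throws Exception {
    MetricDiff headRequests =
        new MetricDiff(fs, Statistic.OBJECT_METADATA_REQUESTS);
    fs.create(path, true).close();     // overwrite=true: no need to probe the object
    headRequests.assertDiffEquals(expectedHeads);
  }
}
```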


@bgaborg bgaborg left a comment


Looks good overall; I have some questions and notes in my review.



@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
0 reexec 44 Docker mode activated.
_ Prechecks _
+1 dupname 1 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 4 new or modified test files.
_ trunk Compile Tests _
0 mvndep 85 Maven dependency ordering for branch
+1 mvninstall 1447 trunk passed
+1 compile 1386 trunk passed
+1 checkstyle 184 trunk passed
+1 mvnsite 156 trunk passed
+1 shadedclient 1265 branch has no errors when building and testing our client artifacts.
+1 javadoc 113 trunk passed
0 spotbugs 93 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 241 trunk passed
_ Patch Compile Tests _
0 mvndep 24 Maven dependency ordering for patch
+1 mvninstall 101 the patch passed
+1 compile 1089 the patch passed
+1 javac 1089 the patch passed
+1 checkstyle 144 root: The patch generated 0 new + 46 unchanged - 2 fixed = 46 total (was 48)
+1 mvnsite 117 the patch passed
-1 whitespace 0 The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
+1 xml 2 The patch has no ill-formed XML file.
+1 shadedclient 716 patch has no errors when building and testing our client artifacts.
+1 javadoc 95 the patch passed
+1 findbugs 219 the patch passed
_ Other Tests _
-1 unit 576 hadoop-common in the patch failed.
+1 unit 74 hadoop-aws in the patch passed.
+1 asflicense 42 The patch does not generate ASF License warnings.
8097
Reason Tests
Failed junit tests hadoop.fs.viewfs.TestViewFsTrash
hadoop.ha.TestZKFailoverController
hadoop.fs.shell.TestCopy
Subsystem Report/Notes
Docker Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1229/10/artifact/out/Dockerfile
GITHUB PR #1229
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux 45f30d875ee4 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 9b8359b
Default Java 1.8.0_222
whitespace https://builds.apache.org/job/hadoop-multibranch/job/PR-1229/10/artifact/out/whitespace-eol.txt
unit https://builds.apache.org/job/hadoop-multibranch/job/PR-1229/10/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-1229/10/testReport/
Max. process+thread count 1499 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-1229/10/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.



@steveloughran
Contributor Author

The last update moves to a probe set so you can choose what to look for; maybeCreateFakeDirectory uses this to only do the LIST call, as we don't care whether a fake dir marker is there... saves one HEAD per operation.

Not tested; Gabor's comments not addressed.
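For illustration only, reusing the hypothetical StatusProbe names from the earlier comment (s3GetFileStatus and createFakeDirectoryMarker stand in for the real internals), the intent is roughly:

```java
// Sketch: only a LIST probe is needed to decide whether to put a fake parent
// directory marker; whether an old marker object exists is irrelevant, so the
// object/marker HEAD probes (and their cacheable 404s) are skipped.
void maybeCreateFakeDirectory(Path parent, String key) throws IOException {
  try {
    s3GetFileStatus(parent, key, EnumSet.of(StatusProbe.LIST));  // LIST only
    // something already exists under the parent: no marker needed
  } catch (FileNotFoundException e) {
    createFakeDirectoryMarker(parent);  // hypothetical helper
  }
}
```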



@apache apache deleted a comment from hadoop-yetus Aug 29, 2019
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
0 reexec 39 Docker mode activated.
_ Prechecks _
+1 dupname 1 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 5 new or modified test files.
_ trunk Compile Tests _
0 mvndep 65 Maven dependency ordering for branch
+1 mvninstall 1038 trunk passed
+1 compile 994 trunk passed
+1 checkstyle 145 trunk passed
+1 mvnsite 135 trunk passed
+1 shadedclient 1026 branch has no errors when building and testing our client artifacts.
+1 javadoc 114 trunk passed
0 spotbugs 69 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 186 trunk passed
_ Patch Compile Tests _
0 mvndep 25 Maven dependency ordering for patch
+1 mvninstall 79 the patch passed
+1 compile 932 the patch passed
+1 javac 932 the patch passed
+1 checkstyle 143 root: The patch generated 0 new + 97 unchanged - 2 fixed = 97 total (was 99)
+1 mvnsite 132 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 2 The patch has no ill-formed XML file.
+1 shadedclient 683 patch has no errors when building and testing our client artifacts.
+1 javadoc 117 the patch passed
+1 findbugs 208 the patch passed
_ Other Tests _
+1 unit 540 hadoop-common in the patch passed.
+1 unit 91 hadoop-aws in the patch passed.
+1 asflicense 52 The patch does not generate ASF License warnings.
6774
Subsystem Report/Notes
Docker Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1229/18/artifact/out/Dockerfile
GITHUB PR #1229
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux 56aeb84dcda3 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 915cbc9
Default Java 1.8.0_222
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-1229/18/testReport/
Max. process+thread count 1549 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-1229/18/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

@apache apache deleted 11 comments from hadoop-yetus Sep 4, 2019
S3Guard retry policy on failures in open/select/copy to be independently configurable from other retry policies:

* use a setting with exponential backoff
* new config names
* copy raises a RemoteFileChangedException which is *not* caught in rename() and downgraded to false. Thus: when a rename is unrecoverable, this fact is propagated
* tests for this

Also: tests turning auth mode on/off have to handle the auth state being set through an authoritative path, rather than a single flag. This caught me out: of course the first tests I saw fail were the ITestS3ARemoteFileChanged rename ones, and I assumed it was my new code. It was actually due to me setting an auth path last week.

I'm unsure about the use of "RemoteFileChangedException", but it could be a remote file change, especially for an etag, where the file has been deleted since being recorded. I don't know if/how we should adapt to that in S3Guard if the problem persists: the file is either *no longer there* or *changes not yet visible*. Maybe consider marking the parent dir as non-authoritative? Log the error, always?

Change-Id: I7bb468aca0f4019537d82bc083f0a9887eaa282b
- different text when s3guard is off
- better tostring values for logging

Change-Id: I559d7a099e2720fee48d432c4fc7ae18c6af137f
…ently

By not using deleteOnExit we save an RPC call on the normal path and reduce the risk of negative cached entries.

Change-Id: Ie067c3b5179a1602488eadf35419805c089a3373
Change-Id: I3163cd3b561da0ab6bfb9f9d56d9097ec6d4438c
- initial delay = 2s
- tweaked the new error messages to make clear this happened during rename and to highlight when a table is unguarded
- checkstyles

Change-Id: Ibf71c433fc8342a20ffbf4d822cd9ea834a492a8
Change-Id: I1d08d64b27afd348261e0fa93444f14510b47746
No explicit testing of those probe switches; it will be straightforward.

Change-Id: I00f8987b445bfd192845a69d10a2ba6fd02bd2a1
…s at debug, rather than lose them.

Change-Id: Icced1ac648c7a97e322362bbc03a245cf496edcd
Change-Id: Ie4a7c9aca2a505fce3b8eedd07fbb0127ad4379a
TestCopy was failing because the removal of deleteOnExit broke the mock invocation count.
- correct for current values
- reinstate deleteOnExit before rename for the existing behaviour
- add a few extra probes
- add a test for what happens on a failure in write, rather than create.

Change-Id: Ib8362ac3a5e36819255a35b8f54c8cfc523c1eb1
Gabor reported an NPE here on a test run of this PR only; added asserts that the lists from the MetadataStore are never null.

Change-Id: I9154eadabb486f601f86e51121d2f05d912c8e46
@steveloughran
Contributor Author

Committed to trunk. Thanks for all the reviews.

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
0 reexec 39 Docker mode activated.
_ Prechecks _
+1 dupname 1 No case conflicting files found.
+1 @author 0 The patch does not contain any @author tags.
+1 test4tests 0 The patch appears to include 6 new or modified test files.
_ trunk Compile Tests _
0 mvndep 65 Maven dependency ordering for branch
+1 mvninstall 1020 trunk passed
+1 compile 982 trunk passed
+1 checkstyle 140 trunk passed
+1 mvnsite 133 trunk passed
+1 shadedclient 991 branch has no errors when building and testing our client artifacts.
+1 javadoc 112 trunk passed
0 spotbugs 69 Used deprecated FindBugs config; considering switching to SpotBugs.
+1 findbugs 183 trunk passed
_ Patch Compile Tests _
0 mvndep 24 Maven dependency ordering for patch
+1 mvninstall 80 the patch passed
+1 compile 934 the patch passed
+1 javac 934 the patch passed
+1 checkstyle 149 root: The patch generated 0 new + 97 unchanged - 2 fixed = 97 total (was 99)
+1 mvnsite 129 the patch passed
+1 whitespace 0 The patch has no whitespace issues.
+1 xml 2 The patch has no ill-formed XML file.
+1 shadedclient 684 patch has no errors when building and testing our client artifacts.
+1 javadoc 112 the patch passed
+1 findbugs 203 the patch passed
_ Other Tests _
+1 unit 550 hadoop-common in the patch passed.
+1 unit 94 hadoop-aws in the patch passed.
+1 asflicense 52 The patch does not generate ASF License warnings.
6724
Subsystem Report/Notes
Docker Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1229/19/artifact/out/Dockerfile
GITHUB PR #1229
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml
uname Linux 9a81c091b006 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 172bcd8
Default Java 1.8.0_222
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-1229/19/testReport/
Max. process+thread count 1399 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-1229/19/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

@steveloughran
Contributor Author

I closed the wrong PR!

Labels
enhancement fs/s3 changes related to hadoop-aws; submitter must declare test endpoint