
HADOOP-16202 Enhance S3A openFile() #2046


Conversation

steveloughran (Contributor)

  • the S3A FS will use any FileStatus passed in to openFile(); needed for mounted/wrapped filesystems
  • supports an option "fs.s3a.open.option.length" to set the length, which will also trigger the HEAD request being skipped
  • builder spec updated to say the status path is ignored
  • adds tracking of all option key names passed to the builder

I had intended to support etag and version, but it's too tricky: you'd also need to supply the length, or issue a HEAD with the given etag/version to extract that parameter too. The tracking of all option keys was for that; not yet used, but left for others.
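For illustration, roughly how a caller would use this (a sketch only: the option key is the one from this PR, the helper name is mine, and the long overload of `opt()` is the new one added here):

```java
// Sketch, assuming the fs.s3a.open.option.length key from this PR and the
// FutureDataInputStreamBuilder API; openWithKnownLength is an illustrative name.
FSDataInputStream openWithKnownLength(FileSystem fs, Path path, FileStatus status)
    throws Exception {
  return fs.openFile(path)
      .withFileStatus(status)  // any FileStatus: its path/type are hints only
      .opt("fs.s3a.open.option.length", status.getLen())  // HEAD probe is skipped
      .build()                 // returns CompletableFuture<FSDataInputStream>
      .get();
}
```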

Needs new tests:

  • pass in a status with a different path (in the contract tests)
  • the new option skips the HEAD request (trivial test: open a missing file, seek to below the declared length, and only expect failure on read(); see the sketch below)
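Something like this for the second test (a sketch; assumes a static import of `LambdaTestUtils.intercept`, and `missingPath` is a path with no object behind it):

```java
// Sketch: with a declared length there is no HEAD request, so opening a
// missing file and seeking within range both succeed; only read() fails.
FSDataInputStream in = fs.openFile(missingPath)
    .opt("fs.s3a.open.option.length", 1024L)
    .build()
    .get();
in.seek(512);  // below the declared length: still no store interaction
intercept(FileNotFoundException.class, () -> in.read());
```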

@steveloughran (Contributor Author)

Tested: S3 London, unguarded.

@steveloughran (Contributor Author)

I plan to make the length option a generic FS one so that other filesystems (ABFS) can also skip a probe here. A standard option makes it easier to justify Hive/ORC/Parquet switching to this API.
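Purely illustrative, with a hypothetical standard key (final name to be decided); under the builder contract, an unknown `opt()` key is simply ignored, so stores which don't understand it would just fall back to their normal probe:

```java
// Hypothetical generic key: any FS which understands it can skip its probe,
// others ignore it as an unknown optional setting.
fs.openFile(path)
    .opt("fs.option.openfile.length", knownLength)  // key name is a placeholder
    .build()
    .get();
```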

@apache apache deleted a comment from hadoop-yetus Jun 30, 2020
@@ -68,6 +68,11 @@ This is relevant with those stores which return version/etag information,
including the S3A and ABFS connectors -they MAY use this to guarantee that
the file they opened is exactly the one returned in the listing.

There is no requirement for `status.getPath()` to match the resolved path of
the file being opened.
Contributor

Won't this cause uncertain failures?

Contributor Author

The reason for this is that viewfs and the Hive FS wrappers have their own schemes/paths. If we insisted on the paths matching, you wouldn't be able to use the API through Hive or a mounted viewfs, which would be too brittle. Saying "it's just a hint" avoids this issue. Same reason I'm not worrying about the type of the status:

  • s3a FS status: use etag and version ID, if known
  • other status: use the length
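Roughly this, as a sketch (the helper names are illustrative, not the actual S3A code):

```java
// Sketch of the dispatch policy above.
if (status instanceof S3AFileStatus) {
  // one of our own: etag and version ID (if known) pin the exact object
  S3AFileStatus s3aStatus = (S3AFileStatus) status;
  // ... use s3aStatus.getETag() / s3aStatus.getVersionId() ...
} else if (status != null) {
  // foreign status (viewfs, Hive wrapper): only the length is trusted
  long declaredLength = status.getLen();
}
```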

@mukund-thakur (Contributor)

Overall this looks really good.

@steveloughran steveloughran force-pushed the s3/HADOOP-16202-enhance-openfile branch from 6233b54 to b16769f Compare July 13, 2020 19:54
@apache apache deleted a comment from hadoop-yetus Jul 18, 2020
1. openFile(path).opt(fs.s3a.input.length, len) lets the caller specify the
file length, which is all we actually need.

2. If a FileStatus of a different type is passed in, we create an
S3AFileStatus instance from it.

3. Use the path of the command, not that of the FileStatus.

This helps viewfs support the API.

Change-Id: Ib9f0f98f701176124436aeaaa0360a426c257bc6
Change-Id: I9f0c11e2fa5cc0dbb84115b6824caf160e1abb62
This has the tests for the changes in the specification
(the path of the supplied status is ignored),
and for the S3A enhancements, where the cost of the
operations against the raw store is used to determine whether or not
any probes for a file were performed when opening it.

The patch also adds tests for when the file length is set in
the option fs.s3a.open.option.length (spelled out in the sketch below):

  • in a shorter file, read() will return -1 when the declared length is reached,
    not the actual one;
  • in a longer file, read() will return -1 once the real file length
    is exceeded, and readFully() will raise an EOFException.
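To spell those semantics out (a sketch; assumes a file whose real length is 100 bytes):

```java
// Declared length shorter than the real file: declared 50, actual 100.
FSDataInputStream in = fs.openFile(path)
    .opt("fs.s3a.open.option.length", 50L)
    .build()
    .get();
in.seek(49);
in.read();                     // last byte within the declared length
assertEquals(-1, in.read());   // EOF at the declared length, not the actual one

// Declared length longer than the real file: declared 200, actual 100.
// read() returns -1 once the real data is exhausted, while
// in.readFully(150, buffer) raises an EOFException.
```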

Also fixes a bug in CTU.readStream(), which didn't close the stream after
reading.

Now, the trouble spot: I had to add new opt()/must() operations for long values
(illustrated below). This is bad because:

  • we already have an int one which we now overload; I'd delete that one, except
    nothing anyone has already compiled would link;
  • you can't add a long value and have your code compile against an older version
    of the code.

Not sure what the best action is there.

Change-Id: I77ebaa07515f5e2e18fd610ce1f04ef286126f12
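The linkage problem in miniature (illustrative):

```java
FutureDataInputStreamBuilder b = fs.openFile(path);
b.opt("fs.s3a.open.option.length", 1);   // opt(String, int): links against old releases
b.opt("fs.s3a.open.option.length", 1L);  // needs the new opt(String, long); code
                                         // compiled with it throws NoSuchMethodError
                                         // when run against an older release
```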
@steveloughran steveloughran force-pushed the s3/HADOOP-16202-enhance-openfile branch from b16769f to 0e6c3fa Compare July 23, 2020 08:52
* The file length and seek policy options are now defined in hadoop-common and
  the specification, including an evolution policy for seek (example below).
* The S3A FS uses the new options, retaining support for the older s3a
  seek policy.
* Moved as much as possible into S3AOpenFileHelper, which has unit tests
  to verify its extraction of properties and file status.

Change-Id: I6981394c48583c33b706f9b5eb6c58c5ca078715
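For instance (a sketch: the legacy key is the existing s3a fadvise option; whether it remains the spelling accepted by openFile() is exactly what the evolution policy covers):

```java
// Sketch: seek/read policy set per-stream via the builder; the legacy
// fs.s3a.experimental.input.fadvise key is still honoured by S3A.
fs.openFile(path)
    .opt("fs.s3a.experimental.input.fadvise", "random")  // legacy s3a policy
    .build()
    .get();
```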
@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|:----------|--------:|:--------|
| +0 🆗 | reexec | 1m 9s | Docker mode activated. |
| | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 1s | No case conflicting files found. |
| +0 🆗 | markdownlint | 0m 0s | markdownlint was not available. |
| +1 💚 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | The patch appears to include 5 new or modified test files. |
| | | | _ trunk Compile Tests _ |
| +0 🆗 | mvndep | 0m 22s | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 21m 30s | trunk passed |
| +1 💚 | compile | 21m 20s | trunk passed with JDK Ubuntu-11.0.7+10-post-Ubuntu-2ubuntu218.04 |
| +1 💚 | compile | 17m 44s | trunk passed with JDK Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| +1 💚 | checkstyle | 2m 54s | trunk passed |
| +1 💚 | mvnsite | 2m 7s | trunk passed |
| +1 💚 | shadedclient | 20m 43s | branch has no errors when building and testing our client artifacts. |
| -1 ❌ | javadoc | 0m 36s | hadoop-common in trunk failed with JDK Ubuntu-11.0.7+10-post-Ubuntu-2ubuntu218.04. |
| -1 ❌ | javadoc | 0m 36s | hadoop-aws in trunk failed with JDK Ubuntu-11.0.7+10-post-Ubuntu-2ubuntu218.04. |
| +1 💚 | javadoc | 1m 28s | trunk passed with JDK Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| +0 🆗 | spotbugs | 1m 8s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 💚 | findbugs | 3m 18s | trunk passed |
| | | | _ Patch Compile Tests _ |
| +0 🆗 | mvndep | 0m 22s | Maven dependency ordering for patch |
| -1 ❌ | mvninstall | 0m 17s | hadoop-aws in the patch failed. |
| -1 ❌ | compile | 19m 5s | root in the patch failed with JDK Ubuntu-11.0.7+10-post-Ubuntu-2ubuntu218.04. |
| -1 ❌ | javac | 19m 5s | root in the patch failed with JDK Ubuntu-11.0.7+10-post-Ubuntu-2ubuntu218.04. |
| -1 ❌ | compile | 16m 36s | root in the patch failed with JDK Private Build-1.8.0_252-8u252-b09-1~18.04-b09. |
| -1 ❌ | javac | 16m 36s | root in the patch failed with JDK Private Build-1.8.0_252-8u252-b09-1~18.04-b09. |
| -0 ⚠️ | checkstyle | 2m 50s | root: The patch generated 6 new + 152 unchanged - 0 fixed = 158 total (was 152) |
| -1 ❌ | mvnsite | 0m 32s | hadoop-aws in the patch failed. |
| -1 ❌ | whitespace | 0m 0s | The patch has 5 line(s) that end in whitespace. Use `git apply --whitespace=fix <<patch_file>>`. Refer https://git-scm.com/docs/git-apply |
| +1 💚 | shadedclient | 15m 41s | patch has no errors when building and testing our client artifacts. |
| -1 ❌ | javadoc | 0m 35s | hadoop-common in the patch failed with JDK Ubuntu-11.0.7+10-post-Ubuntu-2ubuntu218.04. |
| -1 ❌ | javadoc | 0m 30s | hadoop-aws in the patch failed with JDK Ubuntu-11.0.7+10-post-Ubuntu-2ubuntu218.04. |
| -1 ❌ | javadoc | 0m 34s | hadoop-tools_hadoop-aws-jdkPrivateBuild-1.8.0_252-8u252-b09-1~18.04-b09 with JDK Private Build-1.8.0_252-8u252-b09-1~18.04-b09 generated 1 new + 4 unchanged - 0 fixed = 5 total (was 4) |
| -1 ❌ | findbugs | 0m 30s | hadoop-aws in the patch failed. |
| | | | _ Other Tests _ |
| -1 ❌ | unit | 9m 17s | hadoop-common in the patch passed. |
| -1 ❌ | unit | 0m 32s | hadoop-aws in the patch failed. |
| +1 💚 | asflicense | 0m 46s | The patch does not generate ASF License warnings. |
| | | 166m 58s | |
| Reason | Tests |
|:-------|:------|
| Failed junit tests | hadoop.fs.contract.rawlocal.TestRawlocalContractOpen |
| | hadoop.fs.contract.localfs.TestLocalFSContractOpen |
| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/Dockerfile |
| GITHUB PR | #2046 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle markdownlint |
| uname | Linux 9cc9b3bb8291 4.15.0-91-generic #92-Ubuntu SMP Fri Feb 28 11:09:48 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / bfcd775 |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.7+10-post-Ubuntu-2ubuntu218.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| javadoc | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/branch-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.7+10-post-Ubuntu-2ubuntu218.04.txt |
| javadoc | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/branch-javadoc-hadoop-tools_hadoop-aws-jdkUbuntu-11.0.7+10-post-Ubuntu-2ubuntu218.04.txt |
| mvninstall | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/patch-mvninstall-hadoop-tools_hadoop-aws.txt |
| compile | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/patch-compile-root-jdkUbuntu-11.0.7+10-post-Ubuntu-2ubuntu218.04.txt |
| javac | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/patch-compile-root-jdkUbuntu-11.0.7+10-post-Ubuntu-2ubuntu218.04.txt |
| compile | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/patch-compile-root-jdkPrivateBuild-1.8.0_252-8u252-b09-1~18.04-b09.txt |
| javac | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/patch-compile-root-jdkPrivateBuild-1.8.0_252-8u252-b09-1~18.04-b09.txt |
| checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/diff-checkstyle-root.txt |
| mvnsite | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/patch-mvnsite-hadoop-tools_hadoop-aws.txt |
| whitespace | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/whitespace-eol.txt |
| javadoc | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/patch-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.7+10-post-Ubuntu-2ubuntu218.04.txt |
| javadoc | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/patch-javadoc-hadoop-tools_hadoop-aws-jdkUbuntu-11.0.7+10-post-Ubuntu-2ubuntu218.04.txt |
| javadoc | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/diff-javadoc-javadoc-hadoop-tools_hadoop-aws-jdkPrivateBuild-1.8.0_252-8u252-b09-1~18.04-b09.txt |
| findbugs | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/patch-findbugs-hadoop-tools_hadoop-aws.txt |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/artifact/out/patch-unit-hadoop-tools_hadoop-aws.txt |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/testReport/ |
| Max. process+thread count | 2883 (vs. ulimit of 5500) |
| modules | C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: . |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-2046/5/console |
| versions | git=2.17.1 maven=3.6.0 findbugs=4.0.6 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@apache apache deleted a comment from hadoop-yetus Jul 24, 2020
@steveloughran (Contributor Author)

Things aren't complaining, but only because the patch chain is confused: a file which was added and then renamed is still present.

I'm going to have to close this PR and create a new one.
