Skip to content

HADOOP-18378. Implement lazy seek in S3A prefetching. #4955

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Oct 6, 2022

Conversation

passaro
Copy link
Contributor

@passaro passaro commented Sep 29, 2022

Description of PR

HADOOP-18378. Make S3APrefetchingInputStream.seek() completely lazy. Calls to seek() will not affect the current buffer nor interfere with prefetching, until read() is called. This change allows various usage patterns to benefit from prefetching, e.g. when calling readFully(position, buffer) in a loop for contiguous positions, the intermediate internal calls to seek() will be noops and prefetching will have the same performance as in a sequential read.

How was this patch tested?

Tested with mvn clean verify on a bucket in eu-west-2.

For code changes:

  • Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

Make S3APrefetchingInputStream.seek() completely lazy. Calls to seek() will not affect the current buffer nor interfere with prefetching, until read() is called.
This change allows various usage patterns to benefit from prefetching, e.g. when calling readFully(position, buffer) in a loop for contiguous positions the intermediate internal calls to seek() will be noops and prefetching will have the same performance as in a sequential read.
Copy link
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks OK on a first read, but it is complex and goes near to off-by-one bugs in the past. would suggest review by @mukund-thakur , @mehakmeet and anyone else

assertEquals(0, inputStream.read());
assertEquals(bufferSize - 1, inputStream.available());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add string of what is being tested. the assertEquals prints the values, but it is important to know what is being validated. same below.

actually, given the #of calls, what about a method assertAvailable(InputStream, int) to do the probe everywhere

@@ -116,7 +116,7 @@ public void setData(BufferData bufferData,
readOffset,
"readOffset",
startOffset,
startOffset + bufferData.getBuffer().limit() - 1);
startOffset + bufferData.getBuffer().limit());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going to trust you here. we have shipped the s3a input stream with an off by one error in the past. surfaced in parquet but not ORC, and only in some cases...

@hadoop-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 49s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 3 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 15m 28s Maven dependency ordering for branch
+1 💚 mvninstall 28m 44s trunk passed
+1 💚 compile 25m 36s trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 compile 22m 21s trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 4m 22s trunk passed
+1 💚 mvnsite 3m 0s trunk passed
+1 💚 javadoc 2m 9s trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 52s trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 4m 26s trunk passed
+1 💚 shadedclient 24m 26s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 25s Maven dependency ordering for patch
+1 💚 mvninstall 1m 41s the patch passed
+1 💚 compile 24m 49s the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javac 24m 49s the patch passed
+1 💚 compile 22m 11s the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 javac 22m 11s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 18s /results-checkstyle-root.txt root: The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)
+1 💚 mvnsite 2m 57s the patch passed
+1 💚 javadoc 2m 0s the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 50s the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 4m 37s the patch passed
+1 💚 shadedclient 24m 16s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 37s hadoop-common in the patch passed.
+1 💚 unit 2m 51s hadoop-aws in the patch passed.
+1 💚 asflicense 1m 9s The patch does not generate ASF License warnings.
248m 40s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4955/3/artifact/out/Dockerfile
GITHUB PR #4955
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux 2cf76fb33e30 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 1f769cd
Default Java Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4955/3/testReport/
Max. process+thread count 3137 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4955/3/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 1m 35s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 4s codespell was not available.
+0 🆗 detsecrets 0m 4s detect-secrets was not available.
+0 🆗 markdownlint 0m 4s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 3 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 16m 51s Maven dependency ordering for branch
+1 💚 mvninstall 41m 40s trunk passed
+1 💚 compile 37m 7s trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
-1 ❌ compile 1m 15s /branch-compile-root-jdkPrivateBuild-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07.txt root in trunk failed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07.
+1 💚 checkstyle 5m 26s trunk passed
+1 💚 mvnsite 4m 36s trunk passed
+1 💚 javadoc 3m 15s trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 2m 49s trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 6m 17s trunk passed
+1 💚 shadedclient 31m 31s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 31s Maven dependency ordering for patch
+1 💚 mvninstall 2m 2s the patch passed
+1 💚 compile 25m 11s the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javac 25m 11s the patch passed
+1 💚 compile 22m 8s the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
-1 ❌ javac 22m 8s /results-compile-javac-root-jdkPrivateBuild-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07.txt root-jdkPrivateBuild-1.8.0_342-8u342-b07-0ubuntu120.04-b07 with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu120.04-b07 generated 2633 new + 0 unchanged - 0 fixed = 2633 total (was 0)
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 4m 13s the patch passed
+1 💚 mvnsite 2m 55s the patch passed
+1 💚 javadoc 2m 1s the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 51s the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 4m 36s the patch passed
+1 💚 shadedclient 24m 58s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 59s hadoop-common in the patch passed.
+1 💚 unit 2m 58s hadoop-aws in the patch passed.
+1 💚 asflicense 1m 14s The patch does not generate ASF License warnings.
271m 48s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4955/4/artifact/out/Dockerfile
GITHUB PR #4955
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux ea05a1987fc9 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 7edb2f0
Default Java Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4955/4/testReport/
Max. process+thread count 3137 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4955/4/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@apache apache deleted a comment from hadoop-yetus Oct 4, 2022
@apache apache deleted a comment from hadoop-yetus Oct 4, 2022
Copy link
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks good, one minor change for the new assertAvailable proposed

private static void assertAvailable(int expected, InputStream inputStream)
throws IOException {
assertThat(inputStream.available())
.describedAs("Check available bytes")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in description, include inputStream string value, as it should be reporting its iostats

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, S3ARemoteInputStream.toString() only reports things like nextReadPos and fpos (no iostats!). Still useful, so I've included it anyway. But happy to also review/improve toString() if required, here or on a separate PR.

Note that:

  • this PR already modifies toString(), but only to take the lazy seek logic into account (and make it consistent across S3ARemoteInputStream and its sub-classes)
  • S3AInputStream also does not seem to report iostats (unless I'm missing something)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

S3AInputStream.toString():

String s = streamStatistics.toString();

the cloudstore bandwidth command does the toString calls if you ask for a -verbose run, which lets it print stats while still compiling against older releases

@hadoop-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 17m 50s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 3 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 15m 27s Maven dependency ordering for branch
+1 💚 mvninstall 28m 49s trunk passed
+1 💚 compile 25m 25s trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 compile 22m 13s trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 4m 23s trunk passed
+1 💚 mvnsite 2m 58s trunk passed
+1 💚 javadoc 2m 8s trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 53s trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 4m 24s trunk passed
+1 💚 shadedclient 24m 27s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 25s Maven dependency ordering for patch
+1 💚 mvninstall 1m 44s the patch passed
+1 💚 compile 24m 41s the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javac 24m 41s the patch passed
+1 💚 compile 22m 11s the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 javac 22m 11s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 4m 25s the patch passed
+1 💚 mvnsite 2m 57s the patch passed
+1 💚 javadoc 2m 1s the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 52s the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 4m 39s the patch passed
+1 💚 shadedclient 25m 1s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 40s hadoop-common in the patch passed.
+1 💚 unit 2m 54s hadoop-aws in the patch passed.
+1 💚 asflicense 1m 10s The patch does not generate ASF License warnings.
266m 22s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4955/5/artifact/out/Dockerfile
GITHUB PR #4955
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux d1a18663a363 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 42b51c4
Default Java Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4955/5/testReport/
Max. process+thread count 3137 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4955/5/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Copy link
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, merging.

congratulations on your first pr into the project

@steveloughran steveloughran merged commit 1675a28 into apache:trunk Oct 6, 2022
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022
Make S3APrefetchingInputStream.seek() completely lazy. Calls to seek() will not affect the current buffer nor interfere with prefetching, until read() is called.

This change allows various usage patterns to benefit from prefetching, e.g. when calling readFully(position, buffer) in a loop for contiguous positions the intermediate internal calls to seek() will be noops and prefetching will have the same performance as in a sequential read.

Contributed by Alessandro Passaro.
asfgit pushed a commit that referenced this pull request Apr 26, 2023
Make S3APrefetchingInputStream.seek() completely lazy. Calls to seek() will not affect the current buffer nor interfere with prefetching, until read() is called.

This change allows various usage patterns to benefit from prefetching, e.g. when calling readFully(position, buffer) in a loop for contiguous positions the intermediate internal calls to seek() will be noops and prefetching will have the same performance as in a sequential read.

Contributed by Alessandro Passaro.
asfgit pushed a commit that referenced this pull request Apr 28, 2023
Make S3APrefetchingInputStream.seek() completely lazy. Calls to seek() will not affect the current buffer nor interfere with prefetching, until read() is called.

This change allows various usage patterns to benefit from prefetching, e.g. when calling readFully(position, buffer) in a loop for contiguous positions the intermediate internal calls to seek() will be noops and prefetching will have the same performance as in a sequential read.

Contributed by Alessandro Passaro.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants