-
Notifications
You must be signed in to change notification settings - Fork 9.1k
HADOOP-18378. Implement lazy seek in S3A prefetching. #4955
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Make S3APrefetchingInputStream.seek() completely lazy. Calls to seek() will not affect the current buffer nor interfere with prefetching, until read() is called. This change allows various usage patterns to benefit from prefetching, e.g. when calling readFully(position, buffer) in a loop for contiguous positions the intermediate internal calls to seek() will be noops and prefetching will have the same performance as in a sequential read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks OK on a first read, but it is complex and goes near to off-by-one bugs in the past. would suggest review by @mukund-thakur , @mehakmeet and anyone else
...project/hadoop-common/src/test/java/org/apache/hadoop/fs/impl/prefetch/TestFilePosition.java
Outdated
Show resolved
Hide resolved
...-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/prefetch/S3ACachingInputStream.java
Outdated
Show resolved
Hide resolved
...-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3APrefetchingInputStream.java
Outdated
Show resolved
Hide resolved
assertEquals(0, inputStream.read()); | ||
assertEquals(bufferSize - 1, inputStream.available()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add string of what is being tested. the assertEquals prints the values, but it is important to know what is being validated. same below.
actually, given the #of calls, what about a method assertAvailable(InputStream, int)
to do the probe everywhere
@@ -116,7 +116,7 @@ public void setData(BufferData bufferData, | |||
readOffset, | |||
"readOffset", | |||
startOffset, | |||
startOffset + bufferData.getBuffer().limit() - 1); | |||
startOffset + bufferData.getBuffer().limit()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going to trust you here. we have shipped the s3a input stream with an off by one error in the past. surfaced in parquet but not ORC, and only in some cases...
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code looks good, one minor change for the new assertAvailable proposed
private static void assertAvailable(int expected, InputStream inputStream) | ||
throws IOException { | ||
assertThat(inputStream.available()) | ||
.describedAs("Check available bytes") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in description, include inputStream string value, as it should be reporting its iostats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, S3ARemoteInputStream.toString()
only reports things like nextReadPos
and fpos
(no iostats!). Still useful, so I've included it anyway. But happy to also review/improve toString()
if required, here or on a separate PR.
Note that:
- this PR already modifies
toString()
, but only to take the lazy seek logic into account (and make it consistent acrossS3ARemoteInputStream
and its sub-classes) S3AInputStream
also does not seem to report iostats (unless I'm missing something)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
S3AInputStream.toString()
:
String s = streamStatistics.toString();
the cloudstore bandwidth command does the toString calls if you ask for a -verbose run, which lets it print stats while still compiling against older releases
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, merging.
congratulations on your first pr into the project
Make S3APrefetchingInputStream.seek() completely lazy. Calls to seek() will not affect the current buffer nor interfere with prefetching, until read() is called. This change allows various usage patterns to benefit from prefetching, e.g. when calling readFully(position, buffer) in a loop for contiguous positions the intermediate internal calls to seek() will be noops and prefetching will have the same performance as in a sequential read. Contributed by Alessandro Passaro.
Make S3APrefetchingInputStream.seek() completely lazy. Calls to seek() will not affect the current buffer nor interfere with prefetching, until read() is called. This change allows various usage patterns to benefit from prefetching, e.g. when calling readFully(position, buffer) in a loop for contiguous positions the intermediate internal calls to seek() will be noops and prefetching will have the same performance as in a sequential read. Contributed by Alessandro Passaro.
Make S3APrefetchingInputStream.seek() completely lazy. Calls to seek() will not affect the current buffer nor interfere with prefetching, until read() is called. This change allows various usage patterns to benefit from prefetching, e.g. when calling readFully(position, buffer) in a loop for contiguous positions the intermediate internal calls to seek() will be noops and prefetching will have the same performance as in a sequential read. Contributed by Alessandro Passaro.
Description of PR
HADOOP-18378. Make S3APrefetchingInputStream.seek() completely lazy. Calls to seek() will not affect the current buffer nor interfere with prefetching, until read() is called. This change allows various usage patterns to benefit from prefetching, e.g. when calling readFully(position, buffer) in a loop for contiguous positions, the intermediate internal calls to seek() will be noops and prefetching will have the same performance as in a sequential read.
How was this patch tested?
Tested with
mvn clean verify
on a bucket ineu-west-2
.For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?