HADOOP-17250 Lot of short reads can be merged with readahead. #3110

Merged: 4 commits merged on Jul 5, 2021

mukund-thakur
Contributor

Introduces the fs.azure.readahead.range parameter, which can be set by the user.
Data is now populated in the buffer for random reads as well, which leads to fewer
remote calls.
This patch also changes the seek implementation to perform a lazy seek. The actual
seek is done only when a read is initiated and the data is not present in the buffer;
otherwise data is returned from the buffer, reducing the number of remote calls.
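
For readers new to the idea, here is a minimal, self-contained sketch of lazy seek combined with a readahead range. It is not the AbfsInputStream code from this patch: the class `LazySeekSketch`, its fields, and the choice to fetch `len + readAheadRange` bytes per remote call are all assumptions made for illustration, and a byte array stands in for the remote store.

```java
import java.util.Arrays;

/** Hypothetical illustration of lazy seek; not the patch's implementation. */
final class LazySeekSketch {
  private final byte[] remote;       // stands in for the remote file
  private final int readAheadRange;  // plays the role of fs.azure.readahead.range
  private byte[] buffer = new byte[0];
  private long bufferStart = 0;      // file offset of buffer[0]
  private long nextReadPos = 0;      // position recorded by seek()
  private int remoteCalls = 0;       // how many simulated remote reads happened

  LazySeekSketch(byte[] remote, int readAheadRange) {
    this.remote = remote;
    this.readAheadRange = readAheadRange;
  }

  /** Lazy seek: only records the target position, no remote call is made. */
  void seek(long pos) {
    if (pos < 0 || pos > remote.length) {
      throw new IllegalArgumentException("bad position " + pos);
    }
    nextReadPos = pos;
  }

  /** Reads up to len bytes into dst; returns the count, or -1 at end of file. */
  int read(byte[] dst, int off, int len) {
    if (len == 0) {
      return 0;
    }
    if (nextReadPos >= remote.length) {
      return -1;
    }
    long bufferEnd = bufferStart + buffer.length;
    // The deferred seek is resolved here: go remote only on a buffer miss.
    if (nextReadPos < bufferStart || nextReadPos >= bufferEnd) {
      fill(nextReadPos, len);
    }
    int posInBuffer = (int) (nextReadPos - bufferStart);
    int n = Math.min(len, buffer.length - posInBuffer);
    System.arraycopy(buffer, posInBuffer, dst, off, n);
    nextReadPos += n;
    return n;
  }

  /** Simulated remote read of the requested bytes plus the readahead range. */
  private void fill(long pos, int len) {
    long end = Math.min(pos + len + readAheadRange, remote.length);
    buffer = Arrays.copyOfRange(remote, (int) pos, (int) end);
    bufferStart = pos;
    remoteCalls++;
  }

  int remoteCalls() {
    return remoteCalls;
  }

  public static void main(String[] args) {
    byte[] data = new byte[100];
    for (int i = 0; i < data.length; i++) {
      data[i] = (byte) i;
    }
    LazySeekSketch in = new LazySeekSketch(data, 16);
    byte[] out = new byte[4];
    in.seek(50);
    in.read(out, 0, 4);   // buffer miss: one simulated remote call
    in.seek(60);
    in.read(out, 0, 4);   // served from the readahead data, no new call
    System.out.println("remote calls: " + in.remoteCalls());
  }
}
```

In this sketch a short read that lands inside the previously fetched range costs no extra remote call, which is the effect the description above is aiming for.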

Rebased with trunk. Base patch is #2307
Ran all tests, including the scale ones, against my bucket in us-east. All good.

@mukund-thakur
Contributor Author

CC @mehakmeet @snvijaya @bilaharith Please review this. #2307 was already reviewed a long time ago. Thanks

Contributor

@steveloughran steveloughran left a comment

Not looked at this in any detail yet; made a few minor comments.

Contributor

@steveloughran steveloughran left a comment

No changes to production code except some comment suggestions and one proposed variable, just to make it slightly easier to read a very complex piece of code. Having got the S3A one wrong, I need these comments.

Tests all good too

+1 pending these

@apache apache deleted a comment from hadoop-yetus Jun 25, 2021
@mukund-thakur mukund-thakur requested review from steveloughran and ThomasMarquardt and removed request for ThomasMarquardt June 25, 2021 11:49
Contributor

@steveloughran steveloughran left a comment

+1 from me; merge at your leisure

// fCursor is the current file pointer. Thus the furthest back we can
// go and still read from the buffer is fCursor - limit.
// There may be a case where we read less than the requested data.
long bytesPresentInBuffer = fCursor - limit;
Contributor

Rename bytesPresentInBuffer to filePosAtStartOfBuffer
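
To see why the suggested name fits, here is a small sketch of the arithmetic. It assumes that `fCursor` is the file offset just past the last buffered byte and that `limit` is the number of valid bytes in the buffer; the class `BufferWindow` and the helper `isInBuffer` are hypothetical names, not part of the patch.

```java
/** Hypothetical helper showing what fCursor - limit represents. */
final class BufferWindow {

  /** File offset of the first byte currently held in the buffer. */
  static long filePosAtStartOfBuffer(long fCursor, int limit) {
    return fCursor - limit;
  }

  /** True when target lies in [fCursor - limit, fCursor), so no remote read is needed. */
  static boolean isInBuffer(long fCursor, int limit, long target) {
    long start = filePosAtStartOfBuffer(fCursor, limit);
    return target >= start && target < fCursor;
  }

  public static void main(String[] args) {
    long fCursor = 100; // assumed: offset just past the buffered data
    int limit = 40;     // assumed: number of valid bytes in the buffer
    System.out.println("buffer starts at " + filePosAtStartOfBuffer(fCursor, limit));
    System.out.println("offset 75 in buffer? " + isInBuffer(fCursor, limit, 75));
  }
}
```

Under those assumptions the value is a file position rather than a byte count, which is what the rename makes explicit.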

Contributor

@snvijaya snvijaya left a comment

Have just a minor comment on a variable rename.
Thanks for the patch @mukund-thakur. LGTM.

@mukund-thakur mukund-thakur force-pushed the azure_readahead_rebased branch from e81dc7b to 0ba21aa Compare July 5, 2021 06:51
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 0s Docker mode activated.
-1 ❌ patch 0m 16s #3110 does not apply to trunk. Rebase required? Wrong Branch? See https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help.
Subsystem Report/Notes
GITHUB PR #3110
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3110/4/console
versions git=2.17.1
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

Mukund Madhav Thakur and others added 3 commits July 5, 2021 12:28
Introducing fs.azure.readahead.range parameter which can be set by user.
Data will be populated in buffer for random reads as well which leads to fewer
remote calls.
This patch also changes the seek implementation to perform a lazy seek. Actual
seek is done when a read is initiated and data is not present in buffer else
data is returned from buffer thus reducing the number of remote calls.
@mukund-thakur mukund-thakur force-pushed the azure_readahead_rebased branch from 0ba21aa to 5d44ac5 Compare July 5, 2021 06:58
@apache apache deleted a comment from hadoop-yetus Jul 5, 2021
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 30s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 4 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 31m 27s trunk passed
+1 💚 compile 0m 38s trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 0m 37s trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 checkstyle 0m 28s trunk passed
+1 💚 mvnsite 0m 42s trunk passed
+1 💚 javadoc 0m 34s trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 0m 31s trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 spotbugs 1m 0s trunk passed
+1 💚 shadedclient 14m 8s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 30s the patch passed
+1 💚 compile 0m 31s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javac 0m 31s the patch passed
+1 💚 compile 0m 27s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 javac 0m 27s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 18s /results-checkstyle-hadoop-tools_hadoop-azure.txt hadoop-tools/hadoop-azure: The patch generated 1 new + 2 unchanged - 0 fixed = 3 total (was 2)
+1 💚 mvnsite 0m 31s the patch passed
+1 💚 javadoc 0m 24s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 0m 22s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 spotbugs 0m 59s the patch passed
+1 💚 shadedclient 13m 48s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 1m 56s hadoop-azure in the patch passed.
+1 💚 asflicense 0m 33s The patch does not generate ASF License warnings.
72m 14s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3110/5/artifact/out/Dockerfile
GITHUB PR #3110
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux d613b60758af 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 5d44ac5
Default Java Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3110/5/testReport/
Max. process+thread count 543 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3110/5/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@mukund-thakur mukund-thakur merged commit 93ad7c3 into apache:trunk Jul 5, 2021
asfgit pushed a commit that referenced this pull request Jul 5, 2021
Introducing fs.azure.readahead.range parameter which can be set by the user.
Data will be populated in buffer for random reads as well which leads to fewer
remote calls.

This patch also changes the seek implementation to perform a lazy seek. The
actual seek is done when a read is initiated and data is not present in the buffer else
data is returned from the buffer thus reducing the number of remote storage calls.

Contributed By: Mukund Thakur

Change-Id: Ib920eedd0087caa150afa4d4c23e89df56b29e83
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 31s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 4 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 31m 31s trunk passed
+1 💚 compile 0m 38s trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 0m 33s trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 checkstyle 0m 27s trunk passed
+1 💚 mvnsite 0m 40s trunk passed
+1 💚 javadoc 0m 30s trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 0m 28s trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 spotbugs 1m 2s trunk passed
+1 💚 shadedclient 13m 57s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 30s the patch passed
+1 💚 compile 0m 30s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javac 0m 30s the patch passed
+1 💚 compile 0m 26s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 javac 0m 26s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 19s /results-checkstyle-hadoop-tools_hadoop-azure.txt hadoop-tools/hadoop-azure: The patch generated 1 new + 2 unchanged - 0 fixed = 3 total (was 2)
+1 💚 mvnsite 0m 30s the patch passed
+1 💚 javadoc 0m 23s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 0m 21s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 spotbugs 1m 1s the patch passed
+1 💚 shadedclient 13m 53s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 1m 56s hadoop-azure in the patch passed.
+1 💚 asflicense 0m 33s The patch does not generate ASF License warnings.
71m 57s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3110/6/artifact/out/Dockerfile
GITHUB PR #3110
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 3e167b4cf1d2 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / cc79131
Default Java Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3110/6/testReport/
Max. process+thread count 545 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3110/6/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

kiran-maturi pushed a commit to kiran-maturi/hadoop that referenced this pull request Nov 24, 2021
…#3110)


Introducing fs.azure.readahead.range parameter which can be set by the user.
Data will be populated in buffer for random reads as well which leads to fewer
remote calls.

This patch also changes the seek implementation to perform a lazy seek. The 
actual seek is done when a read is initiated and data is not present in the buffer else
data is returned from the buffer thus reducing the number of remote storage calls.

Contributed By: Mukund Thakur
jojochuang pushed a commit to jojochuang/hadoop that referenced this pull request May 23, 2023
…ad. (apache#3110)

Introducing fs.azure.readahead.range parameter which can be set by the user.
Data will be populated in buffer for random reads as well which leads to fewer
remote calls.

This patch also changes the seek implementation to perform a lazy seek. The
actual seek is done when a read is initiated and data is not present in the buffer else
data is returned from the buffer thus reducing the number of remote storage calls.

Contributed By: Mukund Thakur
 Conflicts:
	hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AbfsConfiguration.java
	hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsInputStream.java
	hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsInputStreamContext.java
	hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/ITestAbfsInputStreamStatistics.java
	hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/TestAbfsConfigurationFieldsValidation.java

Change-Id: I89d44ee72bf65c410a3e72fb7cce15c545d9de41
4 participants