HBASE-26273 Force ReadType.STREAM when the user does not explicitly s… #3675
Conversation
LGTM.
[nit] If possible, can you fix the checkstyle and whitespace issues?
+1
/**
 * The {@link ReadType} which should be set on the {@link Scan} to read the HBase Snapshot, default STREAM.
 */
public static final String SNAPSHOT_INPUTFORMAT_SCANNER_READTYPE = "hbase.TableSnapshotinputFormat.scanner.readtype";
TableSnapshotinputFormat => TableSnapshotInputFormat ?
hah! Thanks for noticing.
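For anyone reading along, a minimal sketch of how a job could override the new STREAM default back to PREAD via this property. The corrected key spelling, the plain Configuration wiring, and the accepted enum-name values ("PREAD"/"STREAM") are assumptions here, not taken verbatim from the patch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class SnapshotReadTypeConfigSketch {
  public static void main(String[] args) {
    // Hypothetical usage: ask the snapshot InputFormat to keep using PREAD
    // instead of the STREAM default introduced by this change.
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.TableSnapshotInputFormat.scanner.readtype", "PREAD");
    System.out.println(conf.get("hbase.TableSnapshotInputFormat.scanner.readtype"));
  }
}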
I think this is an improvement so give a +1 first.
But just curious, how much could we gain from this change?
By default, we will switch to stream after reading a small amount of data, several hundreds of KBs? If we read several hundreds of MBs of data in a MapReduce job, I do not think it will affect the performance too much?
For a standalone Java program reading a ~5G file in a single JVM (... using the mapreduce snapshot APIs), this change improved run time from 90s to 30s. In a distributed system, it only gave about a 15% improvement (network became the bottleneck -- that's where HBASE-26274 came into play).
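As a side note on the switch point mentioned above: if memory serves, the pread-to-stream threshold is governed by hbase.storescanner.pread.max.bytes, defaulting to roughly four HFile blocks (a few hundred KB). Both the key and the default here are recollections rather than anything taken from this patch; a tiny sketch of raising it for experimentation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PreadSwitchThresholdSketch {
  public static void main(String[] args) {
    // Assumed property: bytes read via pread before the scanner falls back
    // to streaming reads. The 1 MB value is arbitrary, for illustration only.
    Configuration conf = HBaseConfiguration.create();
    conf.setLong("hbase.storescanner.pread.max.bytes", 1024L * 1024);
    System.out.println(conf.get("hbase.storescanner.pread.max.bytes"));
  }
}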
You bet. I hadn't looked at that yet. Thanks for calling it out!
…et a ReadType on the Scan for a Snapshot-based Job HBase 2 moved over Scans to use PREAD by default instead of STREAM like HBase 1. In the context of a MapReduce job, we can generally expect that clients using the InputFormat (batch job) would be reading most of the data for a job. Cater to them, but still give users who want PREAD the ability to do so.
Force-pushed from 1d00647 to a808a2d.
a808a2d has the fixes requested. If QA comes back happy, I'll just merge this. No need to bother y'all for a re-review. Thanks for the eyes!
The numbers are impressive. For the standalone Java program, was it an HDFS local read with short-circuit read enabled, or through a local TCP connection?
This should have been a remote TCP connection. I ran these numbers on a multi-node cluster. It's possible that part of the data was hosted by a local DataNode, but, if memory serves, it was largely remote reads. I have the steps I was using to test in a private Git repository. I can post it if you're curious to reproduce what I did.
Merged to master, branch-2, and branch-2.4.
Thanks @joshelser. I was trying to understand how much improvement it brings for regions with locality. If it is OK to share your private Git repo, I'd like to run the test on regions with 100% locality and share the numbers on the JIRA.
@huaxiangsun https://github.com/joshelser/stream-repro this is the rough outline of what I was doing. Pretty straightforward (hbase pe to make data in a table, take a snapshot, and a …
It is a bit of a surprise to me that there could be a 15% impact on performance. I supposed there should be little difference, as we only read a very small amount of data with pread. Mind sharing more details here, such as the HFile block size or anything else? IIRC, the default config is to switch to stream after reading 4 HFile blocks. And I saw you have already provided the test code, let me also take a look. Maybe we should file an issue about the performance issue with pread switching to stream.
I think we also need this on branch-2.3? It has not been EOLed yet. Thanks.
Oh, just noticed that this is for reading snapshots... Let me take a look at the input format implementation.
Yup! Just for reading HFiles directly from the filesystem in a local JVM.
My bad. I'll apply it there too.
Thanks @joshelser, will report back.
…et a ReadType on the Scan for a Snapshot-based Job
HBase 2 moved over Scans to use PREAD by default instead of STREAM like
HBase 1. In the context of a MapReduce job, we can generally expect that
clients using the InputFormat (batch job) would be reading most of the
data for a job. Cater to them, but still give users who want PREAD the
ability to do so.
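A minimal caller-side sketch of the escape hatch described above, assuming the standard HBase 2 client API (Scan.setReadType); the surrounding snapshot job setup is omitted:

import org.apache.hadoop.hbase.client.Scan;

public class ExplicitPreadScanSketch {
  public static void main(String[] args) {
    // Callers who still want pread can say so explicitly on the Scan;
    // the InputFormat only forces STREAM when no ReadType was set explicitly.
    Scan scan = new Scan();
    scan.setReadType(Scan.ReadType.PREAD);
    System.out.println("Explicit ReadType: " + scan.getReadType());
  }
}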