HDFS-16479. EC: NameNode should not send a reconstruction work when the source datanodes are insufficient #4138

tasanuma · 2022-04-05T03:04:02Z

Description of PR

NameNode should not send a reconstruction work when the source datanodes are insufficient.
Otherwise, DataNodes receive the order and throw the following exception.

java.lang.IllegalArgumentException: No enough live striped blocks.
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.<init>(StripedReader.java:128)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReconstructor.<init>(StripedReconstructor.java:135)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.<init>(StripedBlockReconstructor.java:41)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.ErasureCodingWorker.processErasureCodingTasks(ErasureCodingWorker.java:133)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:796)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1314)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1360)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1287)

How was this patch tested?

unit test

For code changes:

Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

hadoop-yetus · 2022-04-05T11:17:25Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 54s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	test4tests	0m 0s		The patch appears to include 1 new or modified test files.
			_ trunk Compile Tests _
+1 💚	mvninstall	42m 24s		trunk passed
+1 💚	compile	1m 34s		trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚	compile	1m 22s		trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	checkstyle	1m 5s		trunk passed
+1 💚	mvnsite	1m 36s		trunk passed
+1 💚	javadoc	1m 9s		trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚	javadoc	1m 29s		trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	spotbugs	3m 43s		trunk passed
+1 💚	shadedclient	26m 23s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	1m 22s		the patch passed
+1 💚	compile	1m 26s		the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚	javac	1m 26s		the patch passed
+1 💚	compile	1m 15s		the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	javac	1m 15s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	checkstyle	0m 52s		the patch passed
+1 💚	mvnsite	1m 27s		the patch passed
+1 💚	javadoc	0m 56s		the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚	javadoc	1m 25s		the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	spotbugs	3m 29s		the patch passed
+1 💚	shadedclient	25m 58s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
-1 ❌	unit	374m 24s	/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt	hadoop-hdfs in the patch passed.
+1 💚	asflicense	1m 3s		The patch does not generate ASF License warnings.
		492m 1s

Reason	Tests
Failed junit tests	hadoop.hdfs.TestDecommissionWithStripedBackoffMonitor
	hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics
	hadoop.hdfs.TestDecommissionWithStriped

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/1/artifact/out/Dockerfile
GITHUB PR	#4138
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname	Linux f4fc1a63c310 4.15.0-153-generic #160-Ubuntu SMP Thu Jul 29 06:54:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	trunk / `8a4daad`
Default Java	Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions	/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/1/testReport/
Max. process+thread count	1961 (vs. ulimit of 5500)
modules	C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/1/console
versions	git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by	Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

ayushtkn · 2022-04-05T12:36:27Z

...ct/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java

+    // skip if source datanodes for reconstructing ec block are not enough
+    if (block.isStriped()) {
+      BlockInfoStriped stripedBlock = (BlockInfoStriped) block;
+      if (stripedBlock.getDataBlockNum() > srcNodes.length) {


Had a very quick look.
Just thinking about a scenario with, say, RS-6-3-1024k where we write just 1 MB. In that case the block group (BG) itself will contain only 4 blocks in total: 1 data block + 3 parity blocks. Will this code start returning null? I'm not sure whether getRealDataBlockNum helps here, if this is actually a problem.

@ayushtkn Thanks for your review. You're right, it's a problem.
I updated the PR to calculate the real data block number. It is the same logic used in StripedReader. I also added one more unit test to cover the case.
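
As a rough illustration of the calculation mentioned above, the sketch below rounds the written bytes up to whole cells and caps the result at the policy's data block count, then applies it to the RS-6-3-1024k example. The class and method names are hypothetical and not part of HDFS; only the arithmetic follows the formula used in this PR.

```java
/** Sketch only: sources needed to reconstruct a striped block group. */
final class MinRequiredSourcesSketch {

  /**
   * @param numBytes     bytes written to the block group (assumed > 0)
   * @param cellSize     striping cell size in bytes
   * @param dataBlockNum data units in the EC policy (6 for RS-6-3)
   */
  static int minRequiredSources(long numBytes, int cellSize, int dataBlockNum) {
    // Number of cells that actually hold data, rounded up.
    int cellsNum = (int) ((numBytes - 1) / cellSize + 1);
    // A short block group has fewer real data blocks than the policy allows.
    return Math.min(cellsNum, dataBlockNum);
  }

  public static void main(String[] args) {
    int cellSize = 1024 * 1024;
    // 1 MB written: a single data block, so one source is enough.
    System.out.println(minRequiredSources(cellSize, cellSize, 6));
    // 100 MB written: every data block is in use, so all 6 are required.
    System.out.println(minRequiredSources(100L * cellSize, cellSize, 6));
  }
}
```

With 1 data block + 3 parity blocks available, the 1 MB case needs only one source, so a check against this value no longer rejects short block groups.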

hadoop-yetus · 2022-04-06T01:38:15Z

🎊 +1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 51s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 1s		codespell was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	test4tests	0m 0s		The patch appears to include 1 new or modified test files.
			_ trunk Compile Tests _
+1 💚	mvninstall	41m 41s		trunk passed
+1 💚	compile	1m 31s		trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚	compile	1m 21s		trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	checkstyle	1m 3s		trunk passed
+1 💚	mvnsite	1m 29s		trunk passed
+1 💚	javadoc	1m 6s		trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚	javadoc	1m 33s		trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	spotbugs	3m 40s		trunk passed
+1 💚	shadedclient	25m 57s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	1m 17s		the patch passed
+1 💚	compile	1m 25s		the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚	javac	1m 25s		the patch passed
+1 💚	compile	1m 16s		the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	javac	1m 16s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	checkstyle	0m 53s		the patch passed
+1 💚	mvnsite	1m 21s		the patch passed
+1 💚	javadoc	0m 54s		the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚	javadoc	1m 24s		the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	spotbugs	3m 25s		the patch passed
+1 💚	shadedclient	25m 24s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
+1 💚	unit	336m 0s		hadoop-hdfs in the patch passed.
+1 💚	asflicense	0m 42s		The patch does not generate ASF License warnings.
		451m 2s

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/2/artifact/out/Dockerfile
GITHUB PR	#4138
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname	Linux 1820a20e528f 4.15.0-153-generic #160-Ubuntu SMP Thu Jul 29 06:54:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	trunk / `1b40cf5`
Default Java	Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions	/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/2/testReport/
Max. process+thread count	2331 (vs. ulimit of 5500)
modules	C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/2/console
versions	git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by	Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

ayushtkn

Thanks @tasanuma for the fix. The changes make sense to me; I've left some minor comments.

ayushtkn · 2022-04-10T08:51:44Z

...ct/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java

+      int cellsNum = (int) ((stripedBlock.getNumBytes() - 1) / stripedBlock.getCellSize() + 1);
+      int minRequiredSources = Math.min(cellsNum, stripedBlock.getDataBlockNum());


Is this logic the same as BlockInfoStriped.getRealDataBlockNum()? Can we use that method, extract the logic from there, or do some refactoring there? I'm just trying to keep the logic in one place, so that if there is an issue with it, fixing it in one place fixes it everywhere.

ayushtkn · 2022-04-10T08:52:26Z

...ct/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java

+      int minRequiredSources = Math.min(cellsNum, stripedBlock.getDataBlockNum());
+      if (minRequiredSources > srcNodes.length) {
+        LOG.debug("Block {} cannot be reconstructed due to shortage of source datanodes ", block);
+        return null;


Should we increment the metrics before returning null?

NameNode.getNameNodeMetrics().incNumTimesReReplicationNotScheduled();
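
A minimal sketch of the suggestion above, assuming the counter is bumped just before the early return. SkipReconstructionSketch, its notScheduled counter, and the String return value are hypothetical stand-ins used only for illustration; they are not HDFS classes or the actual BlockManager code.

```java
/** Sketch only: count skipped reconstructions before returning null. */
final class SkipReconstructionSketch {
  // Hypothetical stand-in for the NameNode metric named in the review comment.
  private final java.util.concurrent.atomic.AtomicLong notScheduled =
      new java.util.concurrent.atomic.AtomicLong();

  /** Returns null when the work is skipped, otherwise a description of it. */
  String scheduleReconstruction(int minRequiredSources, int availableSources) {
    if (minRequiredSources > availableSources) {
      // Record the skip first so it stays visible in the metrics.
      notScheduled.incrementAndGet();
      return null;
    }
    return "reconstruct using " + availableSources + " sources";
  }

  long notScheduledCount() {
    return notScheduled.get();
  }

  public static void main(String[] args) {
    SkipReconstructionSketch sketch = new SkipReconstructionSketch();
    System.out.println(sketch.scheduleReconstruction(6, 5)); // skipped
    System.out.println(sketch.notScheduledCount());
  }
}
```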

ayushtkn · 2022-04-10T09:11:35Z

...adoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockManager.java

+
+    // striped blockInfo: 2 data blocks + 2 paritys
+    Block aBlock = new Block(blockId, ecPolicy.getCellSize() * (ecPolicy.getNumDataUnits() - 1), 0);
+    BlockInfoStriped aBlockInfoStriped = new BlockInfoStriped(aBlock, ecPolicy);


nit: Can you use a better variable name? I couldn't work out what the "a" stands for. Alternatively, add a comment above.

I updated the variable name. I want to keep the comment to clarify the difference between testSkipReconstructionWithManyBusyNodes and testSkipReconstructionWithManyBusyNodes2.

ayushtkn · 2022-04-10T09:12:06Z

...adoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockManager.java

+    ErasureCodingPolicy ecPolicy =
+        SystemErasureCodingPolicies.getPolicies().get(1);
+
+    // striped blockInfo: 2 data blocks + 2 paritys


typo: paritys

tasanuma · 2022-04-12T01:57:00Z

@ayushtkn Thanks for your reviews. I updated the PR to address your comments.

hadoop-yetus · 2022-04-12T09:53:13Z

🎊 +1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	17m 30s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 1s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	test4tests	0m 0s		The patch appears to include 1 new or modified test files.
			_ trunk Compile Tests _
+1 💚	mvninstall	42m 14s		trunk passed
+1 💚	compile	1m 31s		trunk passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚	compile	1m 21s		trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	checkstyle	1m 3s		trunk passed
+1 💚	mvnsite	1m 31s		trunk passed
+1 💚	javadoc	1m 8s		trunk passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚	javadoc	1m 36s		trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	spotbugs	3m 43s		trunk passed
+1 💚	shadedclient	26m 17s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	1m 21s		the patch passed
+1 💚	compile	1m 26s		the patch passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚	javac	1m 26s		the patch passed
+1 💚	compile	1m 16s		the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	javac	1m 16s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	checkstyle	0m 52s		the patch passed
+1 💚	mvnsite	1m 22s		the patch passed
+1 💚	javadoc	0m 54s		the patch passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚	javadoc	1m 28s		the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	spotbugs	3m 29s		the patch passed
+1 💚	shadedclient	25m 46s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
+1 💚	unit	343m 15s		hadoop-hdfs in the patch passed.
+1 💚	asflicense	0m 42s		The patch does not generate ASF License warnings.
		476m 25s

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/3/artifact/out/Dockerfile
GITHUB PR	#4138
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname	Linux 50fed07d0f97 4.15.0-153-generic #160-Ubuntu SMP Thu Jul 29 06:54:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	trunk / `a0d5756`
Default Java	Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions	/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/3/testReport/
Max. process+thread count	1962 (vs. ulimit of 5500)
modules	C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/3/console
versions	git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by	Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

ayushtkn

LGTM

tasanuma · 2022-04-14T02:23:29Z

@ayushtkn Thanks for your review! I'll merge it.

Upstream JIRA ID: HDFS-16479. EC: NameNode should not send a reconstruction work when the source datanodes are insufficient (apache#4138) (cherry picked from commit 2efab92) Change-Id: Icadc749f06d350255ed9297eb7183db2dcfa08a5

zhengchenyu · 2024-06-07T06:59:54Z

@tasanuma @ayushtkn
I think that after this PR, the simple copy for a decommissioning EC block will be ignored.
For example, with 6 + 3 storages, if one storage is decommissioning and other storages are busy, the simple copy from the decommissioning storage will be ignored.

tasanuma · 2024-06-17T07:58:12Z

@zhengchenyu Could you elaborate on the situation a bit more? In your example, do you mean you have only 9 storage units?

zhengchenyu · 2024-07-02T13:31:02Z

@tasanuma Sorry for missing your comment.

In the case of a 6+3 EC policy, if 4 blocks are unavailable because their nodes are busy, the size of srcNodes is 5. If one of these 5 blocks is in the decommissioning state, I think a block copy for the decommissioning block should be triggered. However, this simple block copy cannot be triggered now.

I created a PR for HDFS-17542. Test case 1.7 there shows the problem I described. On current trunk, scheduleReconstruction returns null, but with the HDFS-17542 PR it returns copy work.

HDFS-17542 reorganized the code structure. Would you be interested in taking a look at HDFS-17542?
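
The scenario described above can be sketched as follows. This is an illustration under stated assumptions, not the BlockManager code and not the HDFS-17542 change: DecommissionCopySketch and both of its methods are hypothetical, and the copy rule (one live decommissioning replica is enough to serve as the copy source) is a simplification of the comment.

```java
/** Sketch only: the source-count guard versus a plain copy for decommissioning. */
final class DecommissionCopySketch {

  /** The guard discussed in this PR: skip when sources are fewer than real data blocks. */
  static boolean skippedByGuard(int realDataBlockNum, int srcNodes) {
    return realDataBlockNum > srcNodes;
  }

  /** Assumption: a plain copy only needs the decommissioning replica as its source. */
  static boolean copyPossible(boolean decommissioningReplicaInSources) {
    return decommissioningReplicaInSources;
  }

  public static void main(String[] args) {
    int dataBlocks = 6;
    int parityBlocks = 3;
    int busy = 4;
    // 9 internal blocks minus 4 busy ones leaves 5 usable sources.
    int srcNodes = dataBlocks + parityBlocks - busy;
    System.out.println("skipped by guard: " + skippedByGuard(dataBlocks, srcNodes));
    System.out.println("copy possible: " + copyPossible(true));
  }
}
```

With 5 sources for 6 data blocks the guard skips the block entirely, even though the copy alone would not need the other sources.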

HDFS-16479. EC: NameNode should not send a reconstruction work when t…

8a4daad

…he source datanodes are insufficient

ayushtkn reviewed Apr 5, 2022

addressing short data blocks

1b40cf5

ayushtkn reviewed Apr 10, 2022

tasanuma added 4 commits April 12, 2022 10:31

use BlockInfoStriped.getRealDataBlockNum()

7784d02

add NameNode.getNameNodeMetrics().incNumTimesReReplicationNotScheduled()

f6c9d6b

fix typo comment

494a921

update comments and variable names

a0d5756

ayushtkn approved these changes Apr 12, 2022

tasanuma merged commit 2efab92 into apache:trunk Apr 14, 2022

tasanuma deleted the HDFS-16479 branch April 14, 2022 02:23

tasanuma added a commit that referenced this pull request Apr 14, 2022

HDFS-16479. EC: NameNode should not send a reconstruction work when t…

52abc9f

…he source datanodes are insufficient (#4138) (cherry picked from commit 2efab92)

tasanuma added a commit that referenced this pull request Apr 14, 2022

HDFS-16479. EC: NameNode should not send a reconstruction work when t…

b8c6ba6

…he source datanodes are insufficient (#4138) (cherry picked from commit 2efab92)

tasanuma mentioned this pull request Apr 22, 2022

HDFS-16552. Fix NPE for TestBlockManager #4210

Merged

HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022

HDFS-16479. EC: NameNode should not send a reconstruction work when t…

1e35313

…he source datanodes are insufficient (apache#4138)
