Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDFS-16479. EC: NameNode should not send a reconstruction work when the source datanodes are insufficient #4138

Merged
merged 6 commits into from
Apr 14, 2022

Conversation

tasanuma
Copy link
Member

@tasanuma tasanuma commented Apr 5, 2022

Description of PR

NameNode should not send a reconstruction work when the source datanodes are insufficient.
Otherwise, DataNodes receive the order and throw the following exception.

java.lang.IllegalArgumentException: No enough live striped blocks.
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.<init>(StripedReader.java:128)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReconstructor.<init>(StripedReconstructor.java:135)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.<init>(StripedBlockReconstructor.java:41)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.ErasureCodingWorker.processErasureCodingTasks(ErasureCodingWorker.java:133)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:796)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1314)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1360)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1287)

How was this patch tested?

unit test

For code changes:

  • Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 54s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 42m 24s trunk passed
+1 💚 compile 1m 34s trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 1m 22s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 1m 5s trunk passed
+1 💚 mvnsite 1m 36s trunk passed
+1 💚 javadoc 1m 9s trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 1m 29s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 43s trunk passed
+1 💚 shadedclient 26m 23s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 22s the patch passed
+1 💚 compile 1m 26s the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javac 1m 26s the patch passed
+1 💚 compile 1m 15s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 1m 15s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 52s the patch passed
+1 💚 mvnsite 1m 27s the patch passed
+1 💚 javadoc 0m 56s the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 1m 25s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 29s the patch passed
+1 💚 shadedclient 25m 58s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 374m 24s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 1m 3s The patch does not generate ASF License warnings.
492m 1s
Reason Tests
Failed junit tests hadoop.hdfs.TestDecommissionWithStripedBackoffMonitor
hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics
hadoop.hdfs.TestDecommissionWithStriped
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/1/artifact/out/Dockerfile
GITHUB PR #4138
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux f4fc1a63c310 4.15.0-153-generic #160-Ubuntu SMP Thu Jul 29 06:54:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 8a4daad
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/1/testReport/
Max. process+thread count 1961 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

// skip if source datanodes for reconstructing ec block are not enough
if (block.isStriped()) {
BlockInfoStriped stripedBlock = (BlockInfoStriped) block;
if (stripedBlock.getDataBlockNum() > srcNodes.length) {
Copy link
Member

@ayushtkn ayushtkn Apr 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a very quick look.
Just thinking about a scenario with say RS-6-3-1024k, and we just write 1 mb, in that case the total number of blocks available will be 1 Datablock + 3 Parity. In that case BG itself will have total 4 Blocks. Will this code start returning null? Not sure if getRealDataBlockNum helps here or not. If it is actually a problem

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ayushtkn Thanks for your review. You're right, it's a problem.
I updated the PR to calculate the real data block number. It is the same logic used in StripedReader. I also added one more unit test to cover the case.

@hadoop-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 51s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 41m 41s trunk passed
+1 💚 compile 1m 31s trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 1m 21s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 1m 3s trunk passed
+1 💚 mvnsite 1m 29s trunk passed
+1 💚 javadoc 1m 6s trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 1m 33s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 40s trunk passed
+1 💚 shadedclient 25m 57s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 17s the patch passed
+1 💚 compile 1m 25s the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javac 1m 25s the patch passed
+1 💚 compile 1m 16s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 1m 16s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 53s the patch passed
+1 💚 mvnsite 1m 21s the patch passed
+1 💚 javadoc 0m 54s the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 1m 24s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 25s the patch passed
+1 💚 shadedclient 25m 24s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 336m 0s hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 42s The patch does not generate ASF License warnings.
451m 2s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/2/artifact/out/Dockerfile
GITHUB PR #4138
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 1820a20e528f 4.15.0-153-generic #160-Ubuntu SMP Thu Jul 29 06:54:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 1b40cf5
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/2/testReport/
Max. process+thread count 2331 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

Copy link
Member

@ayushtkn ayushtkn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanx @tasanuma for the fix, the changes makes sense to me, dropped some minor comments.

Comment on lines 2169 to 2170
int cellsNum = (int) ((stripedBlock.getNumBytes() - 1) / stripedBlock.getCellSize() + 1);
int minRequiredSources = Math.min(cellsNum, stripedBlock.getDataBlockNum());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this logic same as BlockInfoStriped.getRealDataBlockNum() can we use or extract the logic from there? or do some refactoring there, just trying if we can keep the logic at one place, in case there is some issue in the logic changing at one places fixes all the places..

int minRequiredSources = Math.min(cellsNum, stripedBlock.getDataBlockNum());
if (minRequiredSources > srcNodes.length) {
LOG.debug("Block {} cannot be reconstructed due to shortage of source datanodes ", block);
return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we increment the metrics before returning null

    NameNode.getNameNodeMetrics().incNumTimesReReplicationNotScheduled();


// striped blockInfo: 2 data blocks + 2 paritys
Block aBlock = new Block(blockId, ecPolicy.getCellSize() * (ecPolicy.getNumDataUnits() - 1), 0);
BlockInfoStriped aBlockInfoStriped = new BlockInfoStriped(aBlock, ecPolicy);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can you use a better variable name, couldn't decode what does a stands for, or drop a comment above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the variable name. I want to keep the comment to clarify the difference between testSkipReconstructionWithManyBusyNodes and testSkipReconstructionWithManyBusyNodes2.

ErasureCodingPolicy ecPolicy =
SystemErasureCodingPolicies.getPolicies().get(1);

// striped blockInfo: 2 data blocks + 2 paritys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo paritys

@tasanuma
Copy link
Member Author

@ayushtkn Thanks for your reviews. I update the PR addressing your comments.

@hadoop-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 17m 30s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 42m 14s trunk passed
+1 💚 compile 1m 31s trunk passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚 compile 1m 21s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 1m 3s trunk passed
+1 💚 mvnsite 1m 31s trunk passed
+1 💚 javadoc 1m 8s trunk passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 36s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 43s trunk passed
+1 💚 shadedclient 26m 17s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 21s the patch passed
+1 💚 compile 1m 26s the patch passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚 javac 1m 26s the patch passed
+1 💚 compile 1m 16s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 1m 16s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 52s the patch passed
+1 💚 mvnsite 1m 22s the patch passed
+1 💚 javadoc 0m 54s the patch passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 28s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 29s the patch passed
+1 💚 shadedclient 25m 46s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 343m 15s hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 42s The patch does not generate ASF License warnings.
476m 25s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/3/artifact/out/Dockerfile
GITHUB PR #4138
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 50fed07d0f97 4.15.0-153-generic #160-Ubuntu SMP Thu Jul 29 06:54:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / a0d5756
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/3/testReport/
Max. process+thread count 1962 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4138/3/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

Copy link
Member

@ayushtkn ayushtkn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tasanuma
Copy link
Member Author

@ayushtkn Thanks for your review! I'll merge it.

@tasanuma tasanuma merged commit 2efab92 into apache:trunk Apr 14, 2022
@tasanuma tasanuma deleted the HDFS-16479 branch April 14, 2022 02:23
tasanuma added a commit that referenced this pull request Apr 14, 2022
…he source datanodes are insufficient (#4138)

(cherry picked from commit 2efab92)
tasanuma added a commit that referenced this pull request Apr 14, 2022
…he source datanodes are insufficient (#4138)

(cherry picked from commit 2efab92)
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022
jojochuang pushed a commit to jojochuang/hadoop that referenced this pull request May 23, 2023
Upstream JIRA ID:
HDFS-16479. EC: NameNode should not send a reconstruction work when the source datanodes are insufficient (apache#4138)

(cherry picked from commit 2efab92)
Change-Id: Icadc749f06d350255ed9297eb7183db2dcfa08a5
@zhengchenyu
Copy link
Contributor

@tasanuma @ayushtkn
I think after this PR, the simple copy for decommissioning ec block will be ingored.
For example, we have 6 + 3 storage. If the one storage is decommissioning, and the other storage is busy. The simple copy from decommissioning storage will be ignored.

@tasanuma
Copy link
Member Author

@zhengchenyu Could you elaborate on the situation a bit more? In your example, do you mean you have only 9 storage units?

@zhengchenyu
Copy link
Contributor

@tasanuma Sorry for miss your comment.

In the case of a 6+3 ec policy, if 4 blocks are unavailable due to busy, the size of srcNodes is 5. If one of these 5 blocks is in the decommissioning state, I think block copy for the decommissioning block should be triggered. However, this simple block copy cannot be triggered now.

I create PR HDFS-17542. The test Case 1.7 show the problem I described. For current trunk, scheduleReconstruction will return null. But after this PR HDFS-17542, will return a work for copy.

HDFS-17542 reorganized the code structure. Would you be interested in taking a look at HDFS-17542?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants