
HDFS-17886. Fix namenode storageDirectory errors when doCheckpoint updateStorageVersion failed because of doCheckpoint thread interrupted when standby namenode ha failover to active #8277

Open
LiuGuH wants to merge 2 commits into apache:trunk from LiuGuH:trunk-nn-failover-storage-errors

Conversation

@LiuGuH
Contributor

@LiuGuH LiuGuH commented Feb 25, 2026

Fix NameNode storage directory errors that occur when updateStorageVersion() in doCheckpoint fails because the doCheckpoint thread is interrupted while the standby NameNode fails over to active.

Description of PR

As described in HDFS-17886:

When a NameNode HA failover occurs and the standby NameNode transitions to active, it interrupts the doCheckpoint thread. With extremely small probability, updateStorageVersion() in doCheckpoint then throws java.nio.channels.ClosedByInterruptException. As a result, the storage directory is reported as failed and removed from the available list.
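To illustrate the mechanism in isolation, here is a minimal sketch that does not use any Hadoop code. It relies on the documented behavior of interruptible channels: an I/O call on a FileChannel made by a thread whose interrupt status is set closes the channel and throws ClosedByInterruptException. The class name, method name, and temp-file prefix are made up for this example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class InterruptedChannelDemo {

  /**
   * Returns true if a FileChannel write issued by an already-interrupted
   * thread fails with ClosedByInterruptException. The disk is healthy in
   * this scenario; only the thread's interrupt status causes the failure.
   */
  public static boolean writeFailsWhenInterrupted() throws IOException {
    Path tmp = Files.createTempFile("version-demo", ".tmp");
    try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
      // Simulate the failover path interrupting the checkpointer thread.
      Thread.currentThread().interrupt();
      try {
        ch.write(ByteBuffer.wrap("demo".getBytes(StandardCharsets.UTF_8)));
        return false;
      } catch (ClosedByInterruptException e) {
        return true;
      } finally {
        // Clear the interrupt status so the cleanup below is unaffected.
        Thread.interrupted();
      }
    } finally {
      Files.deleteIfExists(tmp);
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(writeFailsWhenInterrupted());
  }
}
```

The point of the sketch is that the exception says nothing about the health of the underlying directory, which is why treating it as a storage failure is misleading.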

The relevant error log is as follows:

2026-01-29 20:13:38,234 WARN org.apache.hadoop.hdfs.server.common.Storage: Error during write properties to the VERSION file to Storage Directory root= /data/hadoop/hdfs/namenode; location= null
java.nio.channels.ClosedByInterruptException
        at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199)
        at java.base/sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:162)
        at java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:342)
        at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1284)
        at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1263)
        at org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1254)
        at org.apache.hadoop.hdfs.server.namenode.NNStorage.writeAll(NNStorage.java:1169)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.updateStorageVersion(FSImage.java:1106)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1165)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:503)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)
2026-01-29 20:13:38,238 ERROR org.apache.hadoop.hdfs.server.common.Storage: Error reported on storage directory Storage Directory root= /data/hadoop/hdfs/namenode; location= null
2026-01-29 20:13:38,238 WARN org.apache.hadoop.hdfs.server.common.Storage: About to remove corresponding storage: /data/hadoop/hdfs/namenode
2026-01-29 20:13:38,245 ERROR org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in doCheckpoint
java.io.IOException: All the storage failed while writing properties to VERSION file
        at org.apache.hadoop.hdfs.server.namenode.NNStorage.writeAll(NNStorage.java:1175)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.updateStorageVersion(FSImage.java:1106)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1165)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:503)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)

However, java.nio.channels.ClosedByInterruptException is not a disk error, so the storage directory should not be removed from the available storage list.
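One way to express this idea is sketched below. This is not the patch in this PR and does not use Hadoop's Storage or NNStorage classes; StorageWriteSketch, DirWriter, and writeAll here are hypothetical stand-ins. The sketch assumes the desired behavior is to propagate an interrupt-caused failure to the caller without marking the directory as failed, while ordinary I/O errors still mark it.

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;
import java.util.ArrayList;
import java.util.List;

public class StorageWriteSketch {

  /** Hypothetical stand-in for writing the VERSION file of one directory. */
  public interface DirWriter {
    void write(String dir) throws IOException;
  }

  /**
   * Writes to every directory and returns the ones that failed with a
   * genuine I/O error. A ClosedByInterruptException is rethrown at once
   * and the directory is NOT added to the failed list, because the cause
   * is a thread interrupt rather than a bad disk.
   */
  public static List<String> writeAll(List<String> dirs, DirWriter writer)
      throws IOException {
    List<String> failed = new ArrayList<>();
    for (String dir : dirs) {
      try {
        writer.write(dir);
      } catch (ClosedByInterruptException e) {
        // Interrupt-induced: let the caller abort the checkpoint.
        throw e;
      } catch (IOException e) {
        // Treated as a real storage problem in this sketch.
        failed.add(dir);
      }
    }
    return failed;
  }
}
```

The more specific catch clause comes first because ClosedByInterruptException is a subclass of IOException.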

…ersion failed because of doCheckpoint thread interrupted when standby namenode ha failover to active
@LiuGuH LiuGuH changed the title Fix namenode storageDirectory errors when doCheckpoint updateStorageV… Fix namenode storageDirectory errors when doCheckpoint updateStorageVersion failed because of doCheckpoint thread interrupted when standby namenode ha failover to active Feb 25, 2026
@LiuGuH LiuGuH changed the title Fix namenode storageDirectory errors when doCheckpoint updateStorageVersion failed because of doCheckpoint thread interrupted when standby namenode ha failover to active HDFS-17886. Fix namenode storageDirectory errors when doCheckpoint updateStorageVersion failed because of doCheckpoint thread interrupted when standby namenode ha failover to active Feb 25, 2026
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 35s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 43m 59s trunk passed
+1 💚 compile 1m 48s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 compile 1m 48s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 checkstyle 1m 50s trunk passed
+1 💚 mvnsite 1m 55s trunk passed
+1 💚 javadoc 1m 33s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javadoc 1m 29s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 spotbugs 4m 4s trunk passed
+1 💚 shadedclient 31m 9s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 21s the patch passed
+1 💚 compile 1m 15s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javac 1m 15s the patch passed
+1 💚 compile 1m 17s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 javac 1m 17s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 1m 13s /results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs-project/hadoop-hdfs: The patch generated 6 new + 13 unchanged - 0 fixed = 19 total (was 13)
+1 💚 mvnsite 1m 28s the patch passed
+1 💚 javadoc 0m 58s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javadoc 1m 0s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 spotbugs 3m 48s the patch passed
+1 💚 shadedclient 30m 0s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 217m 29s hadoop-hdfs in the patch passed.
-1 ❌ asflicense 0m 49s /results-asflicense.txt The patch generated 1 ASF License warnings.
349m 8s
Subsystem Report/Notes
Docker ClientAPI=1.53 ServerAPI=1.53 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8277/1/artifact/out/Dockerfile
GITHUB PR #8277
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux a3acc2c9fe43 5.15.0-164-generic #174-Ubuntu SMP Fri Nov 14 20:25:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 69962a6
Default Java Ubuntu-17.0.18+8-Ubuntu-124.04.1
Multi-JDK versions /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.10+7-Ubuntu-124.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.18+8-Ubuntu-124.04.1
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8277/1/testReport/
Max. process+thread count 3847 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8277/1/console
versions git=2.43.0 maven=3.9.11 spotbugs=4.9.7
Powered by Apache Yetus 0.14.1 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 37s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 41m 49s trunk passed
+1 💚 compile 1m 51s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 compile 1m 45s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 checkstyle 1m 50s trunk passed
+1 💚 mvnsite 1m 58s trunk passed
+1 💚 javadoc 1m 32s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javadoc 1m 31s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 spotbugs 4m 15s trunk passed
+1 💚 shadedclient 31m 15s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 22s the patch passed
+1 💚 compile 1m 15s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javac 1m 15s the patch passed
+1 💚 compile 1m 19s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 javac 1m 19s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 1m 13s /results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 13 unchanged - 0 fixed = 15 total (was 13)
+1 💚 mvnsite 1m 29s the patch passed
+1 💚 javadoc 0m 59s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javadoc 1m 0s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 spotbugs 3m 56s the patch passed
+1 💚 shadedclient 30m 15s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 216m 35s hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 44s The patch does not generate ASF License warnings.
346m 12s
Subsystem Report/Notes
Docker ClientAPI=1.53 ServerAPI=1.53 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8277/2/artifact/out/Dockerfile
GITHUB PR #8277
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 589d9ae89b19 5.15.0-164-generic #174-Ubuntu SMP Fri Nov 14 20:25:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 5829291
Default Java Ubuntu-17.0.18+8-Ubuntu-124.04.1
Multi-JDK versions /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.10+7-Ubuntu-124.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.18+8-Ubuntu-124.04.1
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8277/2/testReport/
Max. process+thread count 3670 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8277/2/console
versions git=2.43.0 maven=3.9.11 spotbugs=4.9.7
Powered by Apache Yetus 0.14.1 https://yetus.apache.org

This message was automatically generated.

@LiuGuH
Contributor Author

LiuGuH commented Feb 26, 2026

@Hexiaoqiao @hfutatzhanghb Hello, could you please review this PR when you have time? Thanks.


2 participants