Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDFS-17250. EditLogTailer#triggerActiveLogRoll should handle thread Interrupted #6266

Merged
merged 2 commits into from
Dec 4, 2023

Conversation

haiyang1987
Copy link
Contributor

@haiyang1987 haiyang1987 commented Nov 11, 2023

Description of PR

https://issues.apache.org/jira/browse/HDFS-17250

issue:
When the NameNode attempts to trigger a log roll and the cachedActiveProxy is a "the machine has been shut down of the namenode" it is unable to establish a network connection. This results in a timeout during the socket connection phase, which has a set timeout of 90 seconds. Since the asynchronous call for "Triggering log roll" has a waiting time of 60 seconds, it triggers a timeout and initiates a "cancel" operation, causing the executing thread to receive an "Interrupted" signal and throwing a "java.io.InterruptedIOException" exception.

Currently, the logic not to handle interrupted signal, and the "getActiveNodeProxy" method hasn't reached the maximum retry limit, the overall execution process doesn't exit and it continues to attempt to
call the "rollEditLog" on the next NameNode in the list. However when a socket connection is established, it throws a "java.nio.channels.ClosedByInterruptException" exception due to the thread being in an "Interrupted" state.
this cycle repeats until it reaches the maximum retry limit (nnCount * maxRetries) will exits.

However in the next cycle of "Triggering log roll," it continues to traverse the NameNode list and encounters the same issue and the cachedActiveProxy is still a "shut down NameNode."
This eventually results in the NameNode being unable to successfully complete the "Triggering log roll" operation.

To optimize this, we need to handle the thread being interrupted and exit the execution.

Detailed logs such as:
the Observer node "ob1" will execute "Triggering log roll" is as follows:
nns list is [ob2(shut down machine),nn1(active),nn2(standy)]

  • The Observer node "ob1" periodically executes "triggerActiveLogRoll" and asynchronously calls "getNameNodeProxy" to request "ob2" for the "rollEditLog" operation, since the "ob2" machine is shut down, it cannot establish a network connection, this results in a timeout during the socket connection phase (here set timeout is 90 seconds).
2023-11-03 10:27:41,734 INFO ha.EditLogTailer (EditLogTailer.java:triggerActiveLogRoll(465)) [Edit log tailer] - Triggering log roll on remote NameNode

2023-11-03 10:28:41,734 WARN ha.EditLogTailer (EditLogTailer.java:triggerActiveLogRoll(478)) [Edit log tailer] - Unable to finish rolling edits in 60000 ms

As the asynchronous call for "Triggering log roll" has a waiting time of 60 seconds, it triggers a timeout and initiates a "cancel" operation, causing the executing thread to receive an "Interrupted" signal and will throw "java.io.InterruptedIOException".

2023-11-03 10:28:41,753 WARN ipc.Client (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted while trying for connection
2023-11-03 10:28:41,972 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo [nnId=ob2, ipcAddress=xxx, httpAddress=http://xxx], try next.
java.io.InterruptedIOException: DestHost:destPort xxx , LocalHost:localPort xxx. Failed on local exception: java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connection-pending remote=xxx:8040]. Total timeout mills is 90000, 30038 millis timeout left.
at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:906)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583)
at org.apache.hadoop.ipc.Client.call(Client.java:1511)
at org.apache.hadoop.ipc.Client.call(Client.java:1402)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:152)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:454)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:451)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:615)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connection-pending remote=xxx:8040]. Total timeout mills is 90000, 30038 millis timeout left.
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:351)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:662)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:783)
at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:346)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1653)

  • Since current logic not to handle interrupted signal and hasn't reached the maximum retry limit, the overall execution process doesn't exit and it continues to attempt to call the "rollEditLog" on the next NameNode in the list. However when a socket connection is established, it throws a "java.nio.channels.ClosedByInterruptException" exception due to the thread being in an "Interrupted" state. this cycle repeats until it reaches 9 retry limit (3*3) and then exits.
2023-11-03 10:28:41,991 WARN ipc.Client (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted while trying for connection
2023-11-03 10:28:41,991 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo [nnId=nn1, ipcAddress=xxx:8040, httpAddress=http://xxx:50070], try next.
java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:656)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:662)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:783)
at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:346)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1653)
at org.apache.hadoop.ipc.Client.call(Client.java:1449)
at org.apache.hadoop.ipc.Client.call(Client.java:1402)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:152)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:454)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:451)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:615)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

2023-11-03 10:28:42,095 WARN ipc.Client (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted while trying for connection
2023-11-03 10:28:42,095 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
java.nio.channels.ClosedByInterruptException

2023-11-03 10:28:42,104 WARN ipc.Client (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted while trying for connection
2023-11-03 10:28:42,104 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo [nnId=ob2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
java.nio.channels.ClosedByInterruptException

2023-11-03 10:28:42,118 WARN ipc.Client (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted while trying for connection
2023-11-03 10:28:42,118 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo [nnId=nn1, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.

2023-11-03 10:28:42,119 WARN ipc.Client (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted while trying for connection
2023-11-03 10:28:42,119 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
java.nio.channels.ClosedByInterruptException

2023-11-03 10:28:42,148 WARN ipc.Client (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted while trying for connection
2023-11-03 10:28:42,148 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo [nnId=ob2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
java.nio.channels.ClosedByInterruptException

2023-11-03 10:28:42,148 WARN ipc.Client (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted while trying for connection
2023-11-03 10:28:42,148 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo [nnId=nn1, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
java.nio.channels.ClosedByInterruptException

2023-11-03 10:28:42,149 WARN ipc.Client (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted while trying for connection
2023-11-03 10:28:42,149 WARN ha.EditLogTailer (EditLogTailer.java:call(618)) [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
java.nio.channels.ClosedByInterruptException
  • In the next cycle of "Triggering log roll," it continues to traverse the NameNode list and proceeds to "ob2," causing the same issue to recur. This eventually results in the NameNode being unable to successfully complete the "Triggering log roll" operation.

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 54s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 49m 22s trunk passed
+1 💚 compile 1m 24s trunk passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚 compile 1m 14s trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚 checkstyle 1m 11s trunk passed
+1 💚 mvnsite 1m 24s trunk passed
+1 💚 javadoc 1m 9s trunk passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 36s trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚 spotbugs 3m 17s trunk passed
+1 💚 shadedclient 40m 44s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 12s the patch passed
+1 💚 compile 1m 15s the patch passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚 javac 1m 15s the patch passed
+1 💚 compile 1m 7s the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚 javac 1m 7s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 1s the patch passed
+1 💚 mvnsite 1m 13s the patch passed
+1 💚 javadoc 0m 57s the patch passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 30s the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚 spotbugs 3m 21s the patch passed
+1 💚 shadedclient 40m 30s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 240m 56s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 42s The patch does not generate ASF License warnings.
397m 27s
Reason Tests
Failed junit tests hadoop.hdfs.server.datanode.TestDirectoryScanner
hadoop.hdfs.TestDFSUtil
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6266/1/artifact/out/Dockerfile
GITHUB PR #6266
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 91bee368e523 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 4774224
Default Java Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6266/1/testReport/
Max. process+thread count 2363 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6266/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 18m 11s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 49m 12s trunk passed
+1 💚 compile 1m 22s trunk passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚 compile 1m 15s trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚 checkstyle 1m 11s trunk passed
+1 💚 mvnsite 1m 22s trunk passed
+1 💚 javadoc 1m 10s trunk passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 36s trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚 spotbugs 3m 18s trunk passed
+1 💚 shadedclient 40m 28s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 10s the patch passed
+1 💚 compile 1m 15s the patch passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚 javac 1m 15s the patch passed
+1 💚 compile 1m 8s the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚 javac 1m 8s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 2s the patch passed
+1 💚 mvnsite 1m 14s the patch passed
+1 💚 javadoc 0m 56s the patch passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 32s the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚 spotbugs 3m 21s the patch passed
+1 💚 shadedclient 40m 33s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 239m 56s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 43s The patch does not generate ASF License warnings.
412m 54s
Reason Tests
Failed junit tests hadoop.hdfs.server.datanode.TestDirectoryScanner
hadoop.hdfs.TestDFSUtil
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6266/2/artifact/out/Dockerfile
GITHUB PR #6266
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux f5319d6216b0 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 27e67f4
Default Java Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6266/2/testReport/
Max. process+thread count 2354 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6266/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@haiyang1987
Copy link
Contributor Author

Hi @Hexiaoqiao @ZanderXu @ayushtkn @zhangshuyan0 Would you mind to take a review this pr when you have free time? thank you very much~

@haiyang1987
Copy link
Contributor Author

Hi @tomscut @xinglin Would you mind to also take a review this pr when you have free time? thank you very much~

Copy link
Contributor

@Hexiaoqiao Hexiaoqiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @haiyang1987 for your good work here. LGTM. +1 from my side. Let's wait if other folks would like to be involved here. Thanks.

@haiyang1987
Copy link
Contributor Author

Thanks @haiyang1987 for your good work here. LGTM. +1 from my side. Let's wait if other folks would like to be involved here. Thanks.

Thanks @Hexiaoqiao for your review it.

@xinglin
Copy link
Contributor

xinglin commented Nov 18, 2023

qq: which version of hadoop did you run? Does triggerActiveLogRoll() actually try to connect to other NNs as well or only activeNN? Is your issue happening for trunk? By reading the source code, that does not seem to be the case: It seems triggerActiveLogRoll should only contact activeNN.

log line of "Triggering log rolling to the remote NameNode, ": is this truncated in your paste? seems to be incomplete compared to trunk.

  /**
   * Trigger the active node to roll its logs.
   */
  @VisibleForTesting
  void triggerActiveLogRoll() {
    Future<Void> future = null;
    try {
     ...
   }

  Callable<Void> getNameNodeProxy() {
    return new MultipleNameNodeProxy<Void>() {
      @Override
      protected Void doWork() throws IOException {
        LOG.info("Triggering log rolling to the remote NameNode, " +
            "active NameNode = {}", currentNN.getIpcAddress());
        cachedActiveProxy.rollEditLog();
        return null;
      }
    };

@haiyang1987
Copy link
Contributor Author

qq: which version of hadoop did you run? Does triggerActiveLogRoll() actually try to connect to other NNs as well or only activeNN? Is your issue happening for trunk? By reading the source code, that does not seem to be the case: It seems triggerActiveLogRoll should only contact activeNN.

log line of "Triggering log rolling to the remote NameNode, ": is this truncated in your paste? seems to be incomplete compared to trunk.

  /**
   * Trigger the active node to roll its logs.
   */
  @VisibleForTesting
  void triggerActiveLogRoll() {
    Future<Void> future = null;
    try {
     ...
   }

  Callable<Void> getNameNodeProxy() {
    return new MultipleNameNodeProxy<Void>() {
      @Override
      protected Void doWork() throws IOException {
        LOG.info("Triggering log rolling to the remote NameNode, " +
            "active NameNode = {}", currentNN.getIpcAddress());
        cachedActiveProxy.rollEditLog();
        return null;
      }
    };

Thanks @xinglin for your comment.
The current version is hadoop 3.3, triggerActiveLogRoll() will actually try to connect to other NNs until active nn is requested, and recorded as cachedActiveProxy

for this log

LOG.info("Triggering log rolling to the remote NameNode, " +  "active NameNode = {}", currentNN.getIpcAddress());

It is from by PR: HDFS-1630, it only record the remote NameNode address in logs. this PR is not introduced in our current version.
trunk branch still has the problem described in the current issue.

@xinglin
Copy link
Contributor

xinglin commented Nov 18, 2023

thanks @haiyang1987. The name ofgetActiveNodeProxy() caused the confusion. It should probably have been named as getAliveNodeProxy. ActiveNN has a specific meaning in the context of HA-NN. getActiveNodeProxy indeed iterates over all NNs.

Neither MultipleNameNodeProxy.call() or MultipleNameNodeProxy.getActiveNodeProxy() catch InterruptedExecution in their try_catch block.

have two dummy questions

Question 1:
when we submit a task to an executorService and later on we cancel it due to timeout, and we don't capture interrupted exception in our task, what state will the thread from executorService be left? Do we have to capture interrupted exception in our task in this case?

ob1 trying to connect to ob2. we timed out after 60 sec. Are you saying the thread in the executorService would still be in Interrupted state, even though we have throw up the Interrupted Execution, as showed in the log (also because it is not captured within MultipleNameNodeProxy)? This seems to be contradicting to a statement I found from below blog, which says when an interruptedExecution is threw, the interruption status of that thread will be cleared.

"Before a blocking code throws an InterruptedException, it marks the interruption status as false."
https://praveer09.github.io/technology/2015/12/06/understanding-thread-interruption-in-java/

Another question: assuming the thread in the executorService is still in interrupted state, then how is it gotten cleared by your PR? Thread.currentThread().isInterrupted() will not clear the interrupt flag for a thread, as the Oracle Java doc says.

When a thread checks for an interrupt by invoking the static method Thread.interrupted, interrupt status is cleared. The non-static isInterrupted method, which is used by one thread to query the interrupt status of another, does not change the interrupt status flag.
https://docs.oracle.com/javase/tutorial/essential/concurrency/interrupt.html#:~:text=When%20a%20thread%20checks%20for,change%20the%20interrupt%20status%20flag.

@xinglin
Copy link
Contributor

xinglin commented Nov 19, 2023

InterruptedIOException is a subClass of IOException. In the try_catch block in getActiveNodeProxy, we do catch for IOException. So, interruptedIOException will be captured there.

https://docs.oracle.com/javase/8/docs/api/java/io/InterruptedIOException.html

What I think should be happening is as following.

main_Thread calling `triggerActiveLogRoll`, wait for 60 secs, timeout, cancel this task, and return. 

MultipleNameNodeProxy.call() thread: 
   -> getActiveNodeProxy()
        -> nnLookup.next = ob2 (down node)
        -> RPC.waitForProxy(ob2)
        -> after 60 secs, interrupted.  
        -> output "Failed to reach ob2", increment nnLoopCount. Ideally, we should just stop here, since we already time out.
        -> nnLookup.next = n1 (live node). 
             then, it should succeed to connect to n1.

This does not seem to be the case from the logs you shared. thoughts?

A possible fix might be to explicitly capture InterruptedIOException in getActiveNodeProxy, and just finish for this thread (assuming all InterruptedIOExceptions are invoked from triggerActiveLogRoll). For the following triggerActiveLogRoll calls, we should be good, since we will move the nnLookup to next one.

Copy link
Contributor

@ZanderXu ZanderXu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @haiyang1987 for your report. LGTM +1

@ZanderXu
Copy link
Contributor

@xinglin Thanks for your review.

when we submit a task to an executorService and later on we cancel it due to timeout, and we don't capture interrupted exception in our task, what state will the thread from executorService be left? Do we have to capture interrupted exception in our task in this case?

This is very nice question, you can refer to the source code of ThreadPoolExecutor#Worker#runWorker(), the worker in the ThreadPoolExecutor will clear the Interrupt flag.

@haiyang1987
Copy link
Contributor Author

Thanks @xinglin @ZanderXu for your detailed review.

when we submit a task to an executorService and later on we cancel it due to timeout, and we don't capture interrupted exception in our task, what state will the thread from executorService be left? Do we have to capture interrupted exception in our task in this case?

This is very nice question, you can refer to the source code of ThreadPoolExecutor#Worker#runWorker(), the worker in the ThreadPoolExecutor will clear the Interrupt flag.

Yeah, totally agree with this.

@haiyang1987
Copy link
Contributor Author

haiyang1987 commented Nov 20, 2023

InterruptedIOException is a subClass of IOException. In the try_catch block in getActiveNodeProxy, we do catch for IOException. So, interruptedIOException will be captured there.

https://docs.oracle.com/javase/8/docs/api/java/io/InterruptedIOException.html

What I think should be happening is as following.

main_Thread calling `triggerActiveLogRoll`, wait for 60 secs, timeout, cancel this task, and return. 

MultipleNameNodeProxy.call() thread: 
   -> getActiveNodeProxy()
        -> nnLookup.next = ob2 (down node)
        -> RPC.waitForProxy(ob2)
        -> after 60 secs, interrupted.  
        -> output "Failed to reach ob2", increment nnLoopCount. Ideally, we should just stop here, since we already time out.
        -> nnLookup.next = n1 (live node). 
             then, it should succeed to connect to n1.

This does not seem to be the case from the logs you shared. thoughts?

A possible fix might be to explicitly capture InterruptedIOException in getActiveNodeProxy, and just finish for this thread (assuming all InterruptedIOExceptions are invoked from triggerActiveLogRoll). For the following triggerActiveLogRoll calls, we should be good, since we will move the nnLookup to next one.

Yeah, ForMultipleNameNodeProxy#call()explicitly capture InterruptedIOException and then exit execution is also a solution

@xinglin
Copy link
Contributor

xinglin commented Nov 22, 2023

Feel free to commit this PR. did not intend to block here. I will spend more time on my own trying to understand the code/and the change .

@haiyang1987
Copy link
Contributor Author

Hi @Hexiaoqiao @ZanderXu could you mind to push this modification forward when you have free time ? Thank you very much.

@Hexiaoqiao
Copy link
Contributor

Hi @xinglin Do you have anymore concerns? If not, I will to push this RP forwards. Thanks.

@xinglin
Copy link
Contributor

xinglin commented Dec 3, 2023

@Hexiaoqiao, no, feel free to merge this PR.

@Hexiaoqiao Hexiaoqiao merged commit 1fb80ef into apache:trunk Dec 4, 2023
1 of 4 checks passed
@Hexiaoqiao
Copy link
Contributor

Committed to trunk. Thanks @haiyang1987 @ZanderXu and @xinglin .

@haiyang1987
Copy link
Contributor Author

Thanks @Hexiaoqiao @ZanderXu @xinglin for your review and merge.

jiajunmao pushed a commit to jiajunmao/hadoop-MLEC that referenced this pull request Feb 6, 2024
…nterrupted (apache#6266). Contributed by Haiyang Hu.

Reviewed-by: ZanderXu <zanderxu@apache.org>
Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants