HDFS-17250. EditLogTailer#triggerActiveLogRoll should handle thread Interrupted #6266
💔 -1 overall
This message was automatically generated. |
Hi @Hexiaoqiao @ZanderXu @ayushtkn @zhangshuyan0, would you mind reviewing this PR when you have free time? Thank you very much~ |
Thanks @haiyang1987 for your good work here. LGTM. +1 from my side. Let's wait to see if other folks would like to be involved here. Thanks.
Thanks @Hexiaoqiao for your review. |
qq: which version of Hadoop did you run? Regarding the log line "Triggering log rolling to the remote NameNode, ": is it truncated in your paste? It seems incomplete compared to trunk.
|
Thanks @xinglin for your comment. For this log:
It comes from PR HDFS-1630, which only records the remote NameNode address in the logs. That PR has not been introduced in our current version. |
Thanks @haiyang1987. I have two dummy questions.
Question 1: ob1 is trying to connect to ob2, and we time out after 60 sec. Are you saying the thread in the executorService would still be in the interrupted state even though the interrupted exception has already been thrown, as shown in the log (also because it is not caught within MultipleNameNodeProxy)? This seems to contradict a statement I found in the blog below, which says that when an InterruptedException is thrown, the interruption status of that thread is cleared: "Before a blocking code throws an InterruptedException, it marks the interruption status as false."
Question 2: assuming the thread in the executorService is still in the interrupted state, how does your PR clear it?
|
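The question above turns on what happens to a thread's interrupt status once an exception has been thrown. As a minimal sketch (the class and method names are hypothetical and not part of Hadoop), the following shows the behavior the quoted blog describes for `Thread.sleep`: the status is cleared when `InterruptedException` is thrown, and stays cleared unless the handler restores it. This is not universal, though. The JDK documentation for `java.nio.channels.InterruptibleChannel` states that when a blocking channel operation fails with `ClosedByInterruptException`, the thread's interrupt status remains set, which is consistent with the repeated `ClosedByInterruptException` failures described in this PR.

```java
// Hypothetical demo class; not Hadoop code.
final class InterruptStatusDemo {

    // Returns {status before sleep, status inside the catch block, status after restoring}.
    static boolean[] observe() {
        Thread current = Thread.currentThread();
        current.interrupt();                        // simulate a cancel() arriving
        boolean before = current.isInterrupted();   // status is set

        boolean insideCatch = true;
        try {
            Thread.sleep(10);                       // throws immediately because the status is set
        } catch (InterruptedException e) {
            insideCatch = current.isInterrupted();  // sleep() cleared the status before throwing
        }

        current.interrupt();                        // a handler can restore the status explicitly
        boolean afterRestore = Thread.interrupted(); // reads the status and clears it again
        return new boolean[] {before, insideCatch, afterRestore};
    }

    public static void main(String[] args) {
        boolean[] r = observe();
        System.out.println("before=" + r[0] + " insideCatch=" + r[1] + " afterRestore=" + r[2]);
    }
}
```

Whether a given I/O path clears or keeps the status therefore depends on which API raised the exception, so code that retries after an interrupt-related failure cannot assume the status is gone.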
https://docs.oracle.com/javase/8/docs/api/java/io/InterruptedIOException.html What I think should be happening is as follows.
This does not seem to be the case from the logs you shared. Thoughts? A possible fix might be to explicitly capture |
Thanks @haiyang1987 for your report. LGTM +1
@xinglin Thanks for your review.
This is a very nice question. You can refer to the source code of |
Thanks @xinglin @ZanderXu for your detailed review.
Yeah, totally agree with this. |
Yeah, For |
Feel free to commit this PR. I did not intend to block here. I will spend more time on my own trying to understand the code and the change. |
Hi @Hexiaoqiao @ZanderXu, would you mind pushing this change forward when you have free time? Thank you very much. |
Hi @xinglin, do you have any more concerns? If not, I will push this PR forward. Thanks. |
@Hexiaoqiao, no, feel free to merge this PR. |
Committed to trunk. Thanks @haiyang1987, @ZanderXu and @xinglin. |
Thanks @Hexiaoqiao @ZanderXu @xinglin for your review and merge. |
…nterrupted (apache#6266). Contributed by Haiyang Hu. Reviewed-by: ZanderXu <zanderxu@apache.org> Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
Description of PR
https://issues.apache.org/jira/browse/HDFS-17250
Issue:
When the NameNode attempts to trigger a log roll and the cachedActiveProxy points to a NameNode whose machine has been shut down, it is unable to establish a network connection. The attempt therefore blocks in the socket connection phase, which has a timeout of 90 seconds. Since the asynchronous call for "Triggering log roll" waits only 60 seconds, it times out first and initiates a "cancel" operation, which sends an "Interrupted" signal to the executing thread and makes it throw a "java.io.InterruptedIOException".
Currently, the logic does not handle the interrupted signal. Because the "getActiveNodeProxy" method has not yet reached the maximum retry limit, the overall execution does not exit and instead goes on to call "rollEditLog" on the next NameNode in the list. However, when it tries to establish a socket connection, a "java.nio.channels.ClosedByInterruptException" is thrown because the thread is still in the "Interrupted" state.
This cycle repeats until the maximum retry limit (nnCount * maxRetries) is reached, and only then does the execution exit.
However, in the next cycle of "Triggering log roll", it traverses the NameNode list again and encounters the same issue, because the cachedActiveProxy is still the shut-down NameNode.
This eventually results in the NameNode being unable to successfully complete the "Triggering log roll" operation.
To fix this, we need to handle the thread interruption and exit the execution.
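The idea of exiting as soon as the thread is interrupted can be illustrated with a small standalone sketch. It is not the code of this PR and does not use Hadoop classes: `InterruptAwareRetry`, `Attempt` and `callWithRetries` are hypothetical names, and the node list is a plain list of strings. The loop mirrors the "nnCount * maxRetries" retry budget described above, but checks the interrupt status before each attempt and gives up immediately instead of moving on to the next node.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the EditLogTailer implementation.
final class InterruptAwareRetry {

    // One RPC-like attempt against a node; may fail with an IOException.
    interface Attempt {
        void run(String node) throws IOException;
    }

    // Tries nodes round-robin; returns the number of attempts made on success.
    static int callWithRetries(List<String> nodes, int maxRetries, Attempt attempt)
            throws IOException {
        int limit = nodes.size() * maxRetries;   // same budget as nnCount * maxRetries
        int attempts = 0;
        while (attempts < limit) {
            // Exit early: once cancelled, every further connect would fail anyway.
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedIOException(
                    "Interrupted after " + attempts + " attempt(s), giving up");
            }
            String node = nodes.get(attempts % nodes.size());
            attempts++;
            try {
                attempt.run(node);
                return attempts;
            } catch (InterruptedIOException e) {
                Thread.currentThread().interrupt(); // keep the status visible to callers
                throw e;
            } catch (IOException e) {
                // Ordinary failure: fall through and try the next node.
            }
        }
        throw new IOException("No node reachable after " + attempts + " attempt(s)");
    }

    public static void main(String[] args) throws IOException {
        List<String> nodes = Arrays.asList("ob2", "nn1", "nn2");
        int used = callWithRetries(nodes, 3, node -> {
            if (node.equals("ob2")) {
                throw new IOException("connect failed: " + node);
            }
        });
        System.out.println("succeeded after " + used + " attempt(s)");
    }
}
```

Without the check at the top of the loop, an interrupted thread would burn through the whole retry budget, which is the cycle this PR description reports.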
Detailed logs:
The Observer node "ob1" executes "Triggering log roll" as follows:
The nns list is [ob2 (shut-down machine), nn1 (active), nn2 (standby)]
As the asynchronous call for "Triggering log roll" has a waiting time of 60 seconds, it times out and initiates a "cancel" operation, which causes the executing thread to receive an "Interrupted" signal and throw a "java.io.InterruptedIOException".
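The timeout-then-cancel mechanism described here can be reproduced with plain `java.util.concurrent` primitives. The sketch below is an illustration under stated assumptions, not the EditLogTailer code: the class name is hypothetical, `Thread.sleep` stands in for the slow socket connect, and the 60 and 90 second values are scaled down to milliseconds. It shows that `Future.get` with a timeout followed by `Future.cancel(true)` interrupts the worker thread that is still running the task.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class; not Hadoop code.
final class CancelInterruptDemo {

    // Returns true if the worker thread observed an interrupt after cancel(true).
    static boolean workerSawInterrupt() throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicBoolean interrupted = new AtomicBoolean(false);

        Future<?> future = executor.submit(() -> {
            started.countDown();
            try {
                Thread.sleep(5_000);           // stands in for the long socket connect
            } catch (InterruptedException e) {
                interrupted.set(true);         // the cancel reached the worker thread
            } finally {
                finished.countDown();
            }
        });

        started.await();                       // make sure the task is actually running
        try {
            future.get(100, TimeUnit.MILLISECONDS);  // the caller waits less than the task needs
        } catch (TimeoutException e) {
            future.cancel(true);               // true = interrupt the running thread
        }

        finished.await(2, TimeUnit.SECONDS);
        executor.shutdownNow();
        return interrupted.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("worker interrupted: " + workerSawInterrupt());
    }
}
```

The caller only sees the `TimeoutException`; what the worker does with the interrupt is up to the task code, which is why the retry logic has to handle it explicitly.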