HDFS-16703. Enable RPC Timeout for some protocols of NameNode. #4660
base: trunk
💔 -1 overall
This message was automatically generated. |
@ZanderXu I feel that this change is a bit risky. Could it make the service unstable? Sometimes it is reasonable to run without a timeout. The main question: is it reasonable to use the same timeout configuration for different protocols? |
@slfan1989 Thanks for your review.
Sorry, I didn't follow your point. Can you share more details or concrete cases?
According to my practical experience, using one configuration |
I mean, now that you have increased the timeout configuration, will some protocols time out frequently? Frequent timeouts cause some operations to fail. I personally think that the performance of NN RPC is limited by the NN FsSystemLock, so an RPC does not necessarily return within a predictable time. Setting this may lead to a large number of RPC failures in some cases, resulting in overall instability. Let's see what other people think. |
Thanks @slfan1989 for your explanation.
If the timeout is 0, the client can be blocked for a very long time, such as 15 minutes, because of an NN crash, a bad network, or other reasons. |
Thanks for the explanation. Personally, I don't really agree with this change, but I would still like to hear what others suggest. I suspect there was a reason no timeout was set before. |
@goiri Sir, can you help me review this patch? Thanks |
I tend to agree that we should avoid changing the previous behavior. |
@goiri Thanks for your review.
Sure, I will modify it to use a different configuration for each protocol. cc @slfan1989 |
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
@goiri Master, thanks for helping with the review. |
@ZanderXu Do we need to update the md file to explain what these parameters do? |
@slfan1989 Thanks for your review.
Which MD file do you mean? |
I'm not sure what the best place to document this is. |
@slfan1989 Master, ping. Could you review it and either make some suggestions or approve it? |
@slfan1989 Master, ping. |
@ZanderXu Sorry for the late reply, the code looks fine. Thank you very much for your contribution. |
I am more worried about how these timeout configurations will be used because, personally, I don't know how to configure them. Can we provide a configuration guide? Such as |
@slfan1989 Master, thanks for your review.
I have added some suggested values in hdfs-default.xml, such as:
What do you think of this? |
@slfan1989 Sir, can you help take a look at this PR? It would be helpful for Hadoop admins. |
@slfan1989 Sir, sorry to ping you again. I still think this is a useful PR for Hadoop admins, so I hope you can give me some more suggestions and help push this PR forward. Thanks again. |
@slfan1989 Sir, sorry to ping you again. I'm looking forward to your feedback. An RPC timeout is needed to prevent long-term blocking when the network is bad or a machine crashes. Thanks again. |
@slfan1989 Sir, about this PR: if you think it is difficult to configure, how about just enabling a configurable timeout for NamenodeProtocolPB? We have hit this problem many times in our prod environment: RBF cannot detect a crashed NameNode in time, because the |
Description of PR
While reading the protocol code, I found that only the ClientNamenodeProtocolPB proxy has an RPC timeout. The other ProtocolPB proxies, such as RefreshAuthorizationPolicyProtocolPB, RefreshUserMappingsProtocolPB, RefreshCallQueueProtocolPB, GetUserMappingsProtocolPB and NamenodeProtocolPB, do not.
If a proxy has no RPC timeout, it will block for a long time if the NN machine crashes or the network goes bad while it is reading from or writing to the NN.
So I think we should enable an RPC timeout for all ProtocolPB proxies.
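
The blocking behavior described above can be illustrated with a minimal sketch that uses only plain `java.net` sockets. It does not use the Hadoop RPC client or any code from this patch; the class `RpcTimeoutSketch` and the method `readTimesOut` are hypothetical names chosen for illustration. A local server socket that never replies stands in for a crashed or unreachable NN, and the client's read timeout plays the role of the RPC timeout.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration only: plain sockets, not the Hadoop IPC client.
public class RpcTimeoutSketch {

    /**
     * Connects to a local server that never sends a reply and tries to read
     * one byte. Returns true if the read gave up with a timeout.
     * A timeoutMs of 0 means "wait forever", so this method would never
     * return for that value; pass a positive number of milliseconds.
     */
    static boolean readTimesOut(int timeoutMs) {
        try (ServerSocket silentServer =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(),
                                        silentServer.getLocalPort())) {
            // Without this call the default read timeout is 0 (infinite).
            client.setSoTimeout(timeoutMs);
            try {
                client.getInputStream().read(); // blocks: the server never answers
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the caller regains control and can fail over or retry
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

With a positive timeout the caller gets an exception after a bounded wait instead of hanging, which is the property the PR wants for the other ProtocolPB proxies. The trade-off raised in the conversation still applies: a timeout that is too short for a slow but healthy server turns long calls into failures.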