HDFS-15957. The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang #2878
Conversation
💔 -1 overall
This message was automatically generated.
call.sendResponse();
} else {
  call.abortResponse(syncEx);
for (int retries = 0; retries <= RESPONSE_SEND_RETRIES; retries++) {
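For context, here is a rough reconstruction of what the patched notification path appears to do, based only on the diff lines visible above; the loop placement, the break on success, and the catch behavior are assumptions, not the verbatim patch:

```java
// Sketch reconstructed from the visible diff lines; not the verbatim patch.
// RESPONSE_SEND_RETRIES is the constant shown in the excerpt; its value is not shown here.
for (int retries = 0; retries <= RESPONSE_SEND_RETRIES; retries++) {
  try {
    if (syncEx == null) {
      call.sendResponse();        // reply to the RPC caller after the edit-log sync
    } else {
      call.abortResponse(syncEx); // propagate the sync failure to the caller
    }
    break;                        // sent (or aborted) successfully; stop retrying
  } catch (Exception e) {
    // previously swallowed; the patch retries up to RESPONSE_SEND_RETRIES times (assumed)
  }
}
```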
This appears to be a "solution" to the contrived fault injection. If the response cannot be sent it's either because the connection is already closed or there's a bug preventing the encoding of the response. In either case, retrying is not going to help.
Have you observed a real problem or just noticed this by casual inspection of the code?
@daryn-sharp, in our report (HDFS-15957) we are doing fault injection testing: we inject an IOException in call.sendResponse(), and then we observe the symptom that the client gets stuck. In this scenario, a retry can help.

> If the response cannot be sent it's either because the connection is already closed or there's a bug preventing the encoding of the response.

There is a lot of code inside call.sendResponse(); some of it touches OS kernel I/O, and some of it may change in the future, so I think there are (now and in the future) multiple possible reasons for an IOException beyond the two you list here. For example, for a transient connection issue a retry would help. Furthermore, in the scenarios you describe, a retry wouldn't make things worse.
Actually, it's not that a retry can't help with a connection issue; it's a matter of whether our fix implements the retry correctly, i.e., whether we should re-create the connection object or not. Therefore, I think it's still worth adding the retry logic here, even though it might not handle the two scenarios you describe. Do you agree?
@functioner Thanks for the explanation.
Throwing an explicit IOException alters the purpose of the implementation. To observe the symptom on the client, the fault injection needs to be closer to what the code was implemented for. For example, instead of throwing an exception directly, the fault injection should be added to the RPC call itself, so that call.sendResponse()/call.abortResponse(syncEx) throws the appropriate exception.
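One way to do that in a test, as a rough sketch only: the spy-based wiring, the injected message, and the assumption that the test can obtain the Server.Call and hand it to the edit-log path are all illustrative, not part of this patch.

```java
// Illustrative only: make the IOException originate inside sendResponse() itself,
// e.g. via a Mockito spy, instead of adding an explicit throw to production code.
Server.Call spiedCall = Mockito.spy(realCall);   // realCall: a Call captured by the test (assumed)
Mockito.doThrow(new IOException("injected send failure"))
    .when(spiedCall).sendResponse();
// Drive the FSEditLogAsync notification path with spiedCall; the failure now
// surfaces from the RPC layer, at the point the production code would see it.
```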
> For example, for a transient connection issue a retry would help ... it's not that a retry can't help with a connection issue; it's a matter of whether our fix implements the retry correctly, i.e., whether we should re-create the connection object or not.
There is no scenario in which an IOE is recoverable except in an artificial fault injection condition. There is no "transient" failure writing to a socket – it's game over. An IOE bubbling up to this point means the underlying low-level ipc code has already closed the connection and simply re-thrown the error to inform the caller.
Re-creating a connection object is fundamentally impossible. Ignoring that impossibility, even if a speculated future bug left the connection open, it would be in an unknown/inconsistent state with possibly partial data written, so writing anything more would produce a corrupted or duplicate response for the client.
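In other words, the low-level pattern being described is roughly the following; this is an illustration of the argument, not the actual org.apache.hadoop.ipc.Server code:

```java
// Illustrative pattern only: by the time the IOException reaches the caller,
// the connection has already been torn down, so there is nothing left to retry against.
try {
  channel.write(responseBuffer);
} catch (IOException ioe) {
  closeConnection();  // the low-level code closes the connection first...
  throw ioe;          // ...then re-throws purely to inform the caller
}
```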
> Therefore, I think it's still worth adding the retry logic here, even though it might not handle the two scenarios you describe. Do you agree?
I 100% disagree with a broken solution in search of a non-existent problem. Any retry for any reason is inherently a bug.
Which brings us back to:
> we are doing fault injection testing: we inject an IOException in call.sendResponse(), and then we observe the symptom that the client gets stuck. In this scenario, a retry can help.
This is a perfect illustration of the danger of using fault injection to create an artificial bug, and then "solving" that artificial bug with a real bug.
-1 on this patch.
@daryn-sharp Thanks for your explanation!
Now I understand that:
- If the response cannot be sent, it's either because the connection is already closed or there's a bug preventing the encoding of the response.
- If a speculated future bug left the connection open, it would be in an unknown/inconsistent state with possibly partial data written, so writing anything more would produce a corrupted or duplicate response for the client.
And I have more questions:
- What kind of messages will be sent in this specific call.sendResponse()?
- Among these messages, are there critical ones? I mean, maybe some messages are important to the protocol and we don't want to lose them. If we delegated them to a dedicated RPC thread/service/framework, the disconnection/exception could be handled automatically so that those important messages reliably reach the receiver's side.
The call.sendResponse() is invoked after the edit log sync finishes, so the transaction has already been committed. However, the code swallows the possible exception, so it doesn't care whether the reply (to the datanode/client) was successfully sent. Even granting that an exception here implies the connection is gone, I was wondering whether a bug could be hidden by this silently dropped message. For example, if the receiver (datanode/client) is doing some end-to-end retry (e.g., it never gets the reply and requests the same transaction again), that identical transaction may get rejected because it's already committed in the edit log. Probably we should at least add some warning/logging when call.sendResponse() fails to send the message.
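Something along these lines would already help, as a minimal sketch; it assumes the class's existing LOG field and keeps the current control flow, and the message wording is just an example:

```java
try {
  if (syncEx == null) {
    call.sendResponse();
  } else {
    call.abortResponse(syncEx);
  }
} catch (Exception e) {
  // Instead of silently dropping the failure, record it so a stuck client
  // can later be correlated with the unsent response.
  LOG.warn("Failed to send RPC response for " + call, e);
}
```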
I propose a fix for HDFS-15957, and probably we should make RESPONSE_SEND_RETRIES configurable.
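For instance, something like the following; the key name and default are hypothetical, just to illustrate, and are not existing DFSConfigKeys entries:

```java
// Hypothetical configuration hook for the retry count; key name and default are made up.
static final String RESPONSE_SEND_RETRIES_KEY =
    "dfs.namenode.edits.asynclogging.response-send-retries";  // hypothetical key
static final int RESPONSE_SEND_RETRIES_DEFAULT = 3;           // hypothetical default

int responseSendRetries =
    conf.getInt(RESPONSE_SEND_RETRIES_KEY, RESPONSE_SEND_RETRIES_DEFAULT);
```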