-
Notifications
You must be signed in to change notification settings - Fork 9.1k
HDFS-15957. The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang #2878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
functioner
wants to merge
1
commit into
apache:trunk
Choose a base branch
from
functioner:HDFS-15957
base: trunk
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be a "solution" to the contrived fault injection. If the response cannot be sent it's either because the connection is already closed or there's a bug preventing the encoding of the response. In either case, retrying is not going to help.
Have you observed a real problem or just noticed this by casual inspection of the code?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daryn-sharp, in our report (HDFS-15957), we are doing fault injection testing, and we inject an IOException in
call.sendResponse()
, and then we observe the symptom that the client gets stuck. In this scenario, retry can help.Within
call.sendResponse()
, there are lots of code; some of them is related to the OS kernel I/O, and some of them may get changed in the future, so I think (now & future) there should be multiple reasons for an IOException, besides the two you list here. For example, for transient connection issues, retry would help. Furthermore, in the scenarios you describe, retry wouldn't make it worst.Actually, it's not like retry won't help with connection issue, it's just a matter of whether our fix implements the retry correctly or not, i.e., whether we should re-create a connection object or not. Therefore, I think it's still worth adding the retry logic here, although it might not be able to handle the two scenarios you describe here. Do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@functioner Thanks for the explanation.
Adding an explicit IOException alters the purpose of the implementation. In order to observe the symptom of client, the faultInjection needs to be closer to what the code was implemented for. For example, instead of throwing an exception, a faultijection should be added to the rpc-call so that
call.sendResponse()/call.abortResponse(syncEx)
would throw the appropriate exception.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no scenario in which an IOE is recoverable except in an artificial fault injection condition. There is no "transient" failure writing to a socket – it's game over. An IOE bubbling up to this point means the underlying low-level ipc code has already closed the connection and simply re-thrown the error to inform the caller.
Re-creating a connection object is fundamentally impossible. Ignoring that impossibility, even if a speculated future bug left the connection open it's in an unknown/inconsistent state with possible partial data written so writing anything more is a corrupted or duplicate response for the client.
I 100% disagree with a broken solution in search of a non-existent problem. Any retry for any reason is inherently a bug.
Which brings us back to:
This is a perfect illustration of the danger of using fault injection to create an artificial bug, and then "solving" that artificial bug with a real bug.
-1 on this patch.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daryn-sharp Thanks for your explanation!
Now I understand that:
And I have more questions:
call.sendResponse()
?The
call.sendResponse()
is invoked after the edit log sync finishes, so the transaction has been done. However, it swallows the possible exception so it doesn't care whether it successfully replies (to datanode/client). Even if we have the implication of disconnection in this scenario, I was wondering if it's possible that a bug is hidden by this possibly ignored message?For example, If the receiver (datanode/client) is doing some end-to-end retry (e.g., can't get the reply and request the same transaction again), then this identical transaction may get rejected because it's already committed in the edit log. Probably we should at least add some warning/logging when
call.sendResponse()
fails to send the message.