
HDFS-15957. The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang #2878

Open
functioner wants to merge 1 commit into trunk

@@ -47,6 +47,9 @@ class FSEditLogAsync extends FSEditLog implements Runnable {
   private Thread syncThread;
   private static final ThreadLocal<Edit> THREAD_EDIT = new ThreadLocal<Edit>();
 
+  // number of retry times for sending edit log sync response.
+  private static final int RESPONSE_SEND_RETRIES = 3;
+
   // requires concurrent access from caller threads and syncing thread.
   private final BlockingQueue<Edit> editPendingQ;
 
@@ -378,13 +381,18 @@ public void logSyncWait() {
 
   @Override
   public void logSyncNotify(RuntimeException syncEx) {
-    try {
-      if (syncEx == null) {
-        call.sendResponse();
-      } else {
-        call.abortResponse(syncEx);
+    for (int retries = 0; retries <= RESPONSE_SEND_RETRIES; retries++) {

daryn-sharp (Contributor) commented:

This appears to be a "solution" to the contrived fault injection. If the response cannot be sent, it's either because the connection is already closed or there's a bug preventing the encoding of the response. In either case, retrying is not going to help.

Have you observed a real problem or just noticed this by casual inspection of the code?

@functioner (Contributor, Author) commented on Apr 14, 2021:

@daryn-sharp, in our report (HDFS-15957) we are doing fault injection testing: we inject an IOException in call.sendResponse() and then observe that the client gets stuck. In this scenario, a retry can help.

> If the response cannot be sent it's either because the connection is already closed or there's a bug preventing the encoding of the response.

Within call.sendResponse() there is a lot of code; some of it touches OS kernel I/O, and some of it may change in the future, so I think there are (now and in the future) multiple possible reasons for an IOException besides the two you list here. For example, for a transient connection issue a retry would help. Furthermore, in the scenarios you describe, retrying wouldn't make things worse.

Actually, it's not that a retry won't help with a connection issue; it's a matter of whether our fix implements the retry correctly, i.e., whether we should re-create a connection object or not. Therefore, I think it's still worth adding the retry logic here, even though it might not handle the two scenarios you describe. Do you agree?

daryn-sharp (Contributor) commented:

@functioner Thanks for the explanation.

Adding an explicit IOException alters the purpose of the implementation. In order to observe the symptom on the client, the fault injection needs to be closer to what the code was implemented for. For example, instead of throwing an exception directly, a fault injection should be added to the RPC call so that call.sendResponse()/call.abortResponse(syncEx) would throw the appropriate exception.
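
For illustration, a minimal, self-contained sketch of the kind of fault injection being suggested: make the RPC call object itself fail, so the production path in FSEditLogAsync runs unchanged. This is not Hadoop's actual test code; the RpcCall interface and class name are illustrative stand-ins, and only sendResponse()/abortResponse() come from the code under discussion.

import static org.mockito.Mockito.doThrow;
import static org.mockito.Mockito.mock;

import java.io.IOException;

public class SendResponseFaultInjectionSketch {

  // Illustrative stand-in for the real RPC call object; only the two
  // response methods discussed above are modeled.
  public interface RpcCall {
    void sendResponse() throws IOException;
    void abortResponse(Throwable t) throws IOException;
  }

  // Returns a call whose response path fails the way a closed connection
  // would, instead of hard-coding a throw inside FSEditLogAsync itself.
  static RpcCall failingCall() throws IOException {
    RpcCall call = mock(RpcCall.class);
    doThrow(new IOException("injected: connection already closed"))
        .when(call).sendResponse();
    return call;
  }
}

A test could then drive the response-notification path against such a call to reproduce the hang without editing production code.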

daryn-sharp (Contributor) commented:

> For example, for transient connection issues, retry would help ... it's not like retry won't help with connection issue, it's just a matter of whether our fix implements the retry correctly or not, i.e., whether we should re-create a connection object or not.

There is no scenario in which an IOE is recoverable except in an artificial fault injection condition. There is no "transient" failure writing to a socket; it's game over. An IOE bubbling up to this point means the underlying low-level IPC code has already closed the connection and simply re-thrown the error to inform the caller.

Re-creating a connection object is fundamentally impossible. Ignoring that impossibility, even if a speculated future bug left the connection open, it would be in an unknown/inconsistent state with possibly partial data written, so writing anything more would produce a corrupted or duplicate response for the client.
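
As a generic illustration of that close-then-rethrow behavior (this is not Hadoop's actual ipc.Server code, only a sketch of the pattern being described):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class CloseThenRethrowSketch {

  // Once a write fails, the connection is torn down before the IOException
  // reaches the caller, so by the time FSEditLogAsync sees it there is
  // nothing left to retry against.
  static void sendOrClose(SocketChannel channel, ByteBuffer response) throws IOException {
    try {
      while (response.hasRemaining()) {
        channel.write(response);   // may fail partway through the response
      }
    } catch (IOException e) {
      try {
        channel.close();           // tear down the broken connection first
      } catch (IOException ignored) {
        // already failing; nothing more useful to do
      }
      throw e;                     // then re-throw so the caller is informed
    }
  }
}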

> Therefore, I think it's still worth adding the retry logic here, although it might not be able to handle the two scenarios you describe here. Do you agree?

I 100% disagree with a broken solution in search of a non-existent problem. Any retry for any reason is inherently a bug.

Which brings us back to:

> we are doing fault injection testing, and we inject an IOException in call.sendResponse(), and then we observe the symptom that the client gets stuck. In this scenario, retry can help.

This is a perfect illustration of the danger of using fault injection to create an artificial bug, and then "solving" that artificial bug with a real bug.

-1 on this patch.

@functioner (Contributor, Author) commented on Apr 19, 2021:

@daryn-sharp Thanks for your explanation!
Now I understand that:

  1. If the response cannot be sent it's either because the connection is already closed or there's a bug preventing the encoding of the response.
  2. If a speculated future bug left the connection open it's in an unknown/inconsistent state with possible partial data written so writing anything more is a corrupted or duplicate response for the client.

And I have more questions:

  1. What kind of messages will be sent in this specific call.sendResponse()?
  2. Among these messages, are there critical ones? I mean, maybe some messages are important to the protocol and we don't want to lose them. If we delegated them to a dedicated RPC thread/service/framework, the disconnection/exception could be handled automatically so that those important messages reliably reach the receiver.

call.sendResponse() is invoked after the edit log sync finishes, so the transaction has already been committed. However, the surrounding code swallows any exception, so it does not care whether the reply (to the DataNode/client) was actually delivered. Even if an exception here implies a disconnection, I wonder whether a bug could be hidden by this silently dropped message.

For example, if the receiver (DataNode/client) does some end-to-end retry (e.g., it doesn't get the reply and requests the same transaction again), that identical transaction may get rejected because it's already committed in the edit log. Probably we should at least add some warning/logging when call.sendResponse() fails to send the message.
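
A minimal sketch of that logging-only alternative (no retry), assuming LOG is the logger FSEditLogAsync already uses; this is an illustration of the suggestion above, not code from the patch:

  @Override
  public void logSyncNotify(RuntimeException syncEx) {
    try {
      if (syncEx == null) {
        call.sendResponse();
      } else {
        call.abortResponse(syncEx);
      }
    } catch (Exception e) {
      // The edit log transaction is already durable at this point; the caller
      // simply never receives the reply, so make the dropped response visible.
      LOG.warn("Failed to send edit log sync response for " + call, e);
    }
  }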

+      try {
+        if (syncEx == null) {
+          call.sendResponse();
+        } else {
+          call.abortResponse(syncEx);
+        }
+        break;
+      } catch (Exception e) {
+        LOG.info("Error in sending log sync response, retry " + retries, e);
       }
-    } catch (Exception e) {} // don't care if not sent.
+    }
   }
 
   @Override