
HDFS-15957. The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang #2878


Open
wants to merge 1 commit into base: trunk

Conversation

functioner
Contributor

I propose a fix for HDFS-15957. We should probably also make RESPONSE_SEND_RETRIES configurable.
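
For reference, a minimal sketch of the retry idea (RESPONSE_SEND_RETRIES is the constant this patch introduces; the RpcCall placeholder interface, the helper method, and the retry count value below are illustrative assumptions, not the actual patch, which operates inside FSEditLogAsync's sync handling):

import java.io.IOException;

// Minimal sketch of a bounded retry around sending an RPC response.
// RpcCall here is only a stand-in for the real IPC call object.
class ResponseRetrySketch {
  static final int RESPONSE_SEND_RETRIES = 3; // illustrative value only

  interface RpcCall {
    void sendResponse() throws IOException;
  }

  void sendResponseWithRetries(RpcCall call) throws IOException {
    IOException lastFailure = null;
    for (int retries = 0; retries <= RESPONSE_SEND_RETRIES; retries++) {
      try {
        call.sendResponse();
        return; // response went out; nothing more to do
      } catch (IOException e) {
        lastFailure = e; // remember the failure and try again
      }
    }
    throw lastFailure; // all attempts failed; surface the last error
  }
}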

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 31m 39s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 36m 10s trunk passed
+1 💚 compile 1m 19s trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 compile 1m 13s trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 checkstyle 1m 1s trunk passed
+1 💚 mvnsite 1m 24s trunk passed
+1 💚 javadoc 0m 53s trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 27s trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 spotbugs 3m 17s trunk passed
+1 💚 shadedclient 18m 54s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 11s the patch passed
+1 💚 compile 1m 17s the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 javac 1m 17s the patch passed
+1 💚 compile 1m 12s the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 javac 1m 12s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 59s the patch passed
+1 💚 mvnsite 1m 12s the patch passed
+1 💚 javadoc 0m 48s the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 15s the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 spotbugs 3m 20s the patch passed
+1 💚 shadedclient 18m 55s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 382m 4s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch failed.
+1 💚 asflicense 0m 37s The patch does not generate ASF License warnings.
507m 32s
Reason Tests
Failed junit tests hadoop.hdfs.server.namenode.ha.TestBootstrapStandby
hadoop.hdfs.TestDFSShell
hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA
hadoop.hdfs.TestLeaseRecovery
hadoop.hdfs.server.namenode.ha.TestEditLogTailer
hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList
hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots
hadoop.hdfs.server.datanode.TestBlockScanner
hadoop.hdfs.TestViewDistributedFileSystemWithMountLinks
hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor
hadoop.hdfs.TestSnapshotCommands
hadoop.hdfs.server.datanode.TestDirectoryScanner
hadoop.hdfs.TestPersistBlocks
hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys
hadoop.hdfs.TestStateAlignmentContextWithHA
hadoop.hdfs.server.namenode.TestDecommissioningStatus
hadoop.hdfs.server.diskbalancer.TestDiskBalancer
hadoop.hdfs.TestLeaseRecovery2
hadoop.hdfs.server.namenode.TestProcessCorruptBlocks
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2878/1/artifact/out/Dockerfile
GITHUB PR #2878
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux ba58ce5c8970 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 794ee75
Default Java Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2878/1/testReport/
Max. process+thread count 2036 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2878/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

call.sendResponse();
} else {
call.abortResponse(syncEx);
for (int retries = 0; retries <= RESPONSE_SEND_RETRIES; retries++) {
Contributor

This appears to be a "solution" to the contrived fault injection. If the response cannot be sent it's either because the connection is already closed or there's a bug preventing the encoding of the response. In either case, retrying is not going to help.

Have you observed a real problem or just noticed this by casual inspection of the code?

Contributor Author

@functioner Apr 14, 2021

@daryn-sharp, in our report (HDFS-15957) we are doing fault injection testing: we inject an IOException in call.sendResponse() and then observe that the client gets stuck. In this scenario, retrying can help.

If the response cannot be sent it's either because the connection is already closed or there's a bug preventing the encoding of the response.

call.sendResponse() contains a lot of code; some of it touches OS kernel I/O, and some of it may change in the future, so I think there are (now and in the future) more possible causes of an IOException than the two you list here. For example, a retry would help with transient connection issues. Furthermore, in the scenarios you describe, retrying wouldn't make things worse.

Actually, it's not that a retry can't help with connection issues; it's a matter of whether our fix implements the retry correctly, i.e., whether we should re-create the connection object. Therefore, I think it's still worth adding the retry logic here, even though it might not handle the two scenarios you describe. Do you agree?

Contributor

@functioner Thanks for the explanation.

Adding an explicit IOException alters the purpose of the implementation. To observe the symptom on the client, the fault injection needs to be closer to what the code was implemented for. For example, instead of throwing an exception directly, a fault injection hook should be added to the RPC call so that call.sendResponse()/call.abortResponse(syncEx) would throw the appropriate exception.
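
A minimal sketch of that suggestion, assuming a hypothetical test-only injector class (the class and method names below are made up for illustration and modeled on the injector pattern used elsewhere in Hadoop; they are not existing Hadoop code):

import java.io.IOException;

// Hypothetical test-only hook: production code keeps a no-op instance, and a
// test swaps in a subclass whose beforeSendResponse() throws IOException, so
// the failure surfaces from inside the response path the way a real socket
// error would, instead of being thrown explicitly around call.sendResponse().
public class RpcResponseFaultInjector {
  private static RpcResponseFaultInjector instance =
      new RpcResponseFaultInjector();

  public static RpcResponseFaultInjector get() {
    return instance;
  }

  public static void set(RpcResponseFaultInjector injector) { // test use only
    instance = injector;
  }

  public void beforeSendResponse() throws IOException {
    // no-op in production
  }
}

The response path would then invoke RpcResponseFaultInjector.get().beforeSendResponse() just before writing, so a test triggers the same exception the real code would see.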

Contributor

For example, a retry would help with transient connection issues ... it's not that a retry can't help with connection issues; it's a matter of whether our fix implements the retry correctly, i.e., whether we should re-create the connection object.

There is no scenario in which an IOE is recoverable except under an artificial fault injection condition. There is no "transient" failure when writing to a socket; it's game over. An IOE bubbling up to this point means the underlying low-level IPC code has already closed the connection and simply re-thrown the error to inform the caller.

Re-creating a connection object is fundamentally impossible. Ignoring that impossibility, even if a speculated future bug left the connection open, it would be in an unknown/inconsistent state with possibly partial data already written, so writing anything more would produce a corrupted or duplicate response for the client.

Therefore, I think it's still worth adding the retry logic here, even though it might not handle the two scenarios you describe. Do you agree?

I 100% disagree with a broken solution in search of a non-existent problem. Any retry for any reason is inherently a bug.

Which brings us back to:

we are doing fault injection testing: we inject an IOException in call.sendResponse() and then observe that the client gets stuck. In this scenario, retrying can help.

This is a perfect illustration of the danger of using fault injection to create an artificial bug, and then "solving" that artificial bug with a real bug.

-1 on this patch.

Contributor Author

@functioner Apr 19, 2021

@daryn-sharp Thanks for your explanation!
Now I understand that:

  1. If the response cannot be sent it's either because the connection is already closed or there's a bug preventing the encoding of the response.
  2. If a speculated future bug left the connection open, it would be in an unknown/inconsistent state with possibly partial data already written, so writing anything more would produce a corrupted or duplicate response for the client.

And I have more questions:

  1. What kind of messages will be sent in this specific call.sendResponse()?
  2. Among these messages, are any of them critical? Some messages may be important to the protocol, and we don't want to lose them. If we delegated them to a dedicated RPC thread/service/framework, the disconnection/exception could be handled automatically so that those important messages reliably reach the receiver's side.

call.sendResponse() is invoked after the edit log sync finishes, so the transaction has already been committed. However, the surrounding code swallows the possible exception, so it doesn't care whether the reply (to the datanode/client) is actually delivered. Even if an exception here implies a disconnection, I wonder whether a bug could be hidden by this silently dropped message.

For example, if the receiver (datanode/client) is doing some end-to-end retry (e.g., it never gets the reply and requests the same transaction again), the identical transaction may get rejected because it has already been committed in the edit log. We should probably at least add a warning/log message when call.sendResponse() fails to send the message.
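
A minimal sketch of that logging idea, assuming placeholder types (RpcCall stands in for the real call object, and the wrapper class exists only to make the fragment self-contained; this is not the committed Hadoop code):

import java.io.IOException;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch: log the send failure instead of silently swallowing it, so a hung or
// retrying client can be correlated with a dropped reply on the NameNode side.
class ResponseSendLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(ResponseSendLoggingSketch.class);

  interface RpcCall {
    void sendResponse() throws IOException;
  }

  void sendResponseBestEffort(RpcCall call) {
    try {
      call.sendResponse();
    } catch (IOException e) {
      // The edit log transaction is already committed at this point and cannot
      // be undone; at minimum record that the reply never reached the client.
      LOG.warn("Failed to send RPC response after edit log sync", e);
    }
  }
}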
