-
Notifications
You must be signed in to change notification settings - Fork 7.3k
ZOOKEEPER-4074: Network issue while Learner is executing writePacket can cause the follower to hang #1582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…can cause the follower to hang
|
@maoling I've fixed the failing test cases in my local machine where all test cases pass. But there are still some failing test cases in the CI server, and I was wondering what could be the reason? Do you have any suggestion? |
@functioner The CI failure may not your blame. We're looking at this patch. Thanks for your contribution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM. However, could you add a test for the specific issue?
Also it is a major config change which deserve a release note IMO.
@@ -1149,6 +1149,11 @@ property, when available, is noted below. | |||
**New in 3.6.2:** | |||
When enabled, a learner will close the quorum socket asynchronously. This is useful for TLS connections where closing a socket might take a long time, block the shutdown process, potentially delay a new leader election, and leave the quorum unavailabe. Closing the socket asynchronously avoids blocking the shutdown process despite the long socket closing time and a new leader election can be started while the socket being closed. The default is false. | |||
|
|||
* *learner.asyncSending* | |||
(Jave system property only: **learner.closeSocketAsync**) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yeah, it's a typo. I'll fix it.
@tisonkun To add a test for this issue, we may consider adding a fault injection tool, e.g., Byteman (ZOOKEEPER-3601 by @maoling ). I have provided the script to reproduce this bug (ZOOKEEPER-4074) using Byteman. |
I am not sure whether we should enable this functionality by default in branch-3.6. @functioner what you say shall we keep disabled it by default ? |
The asynchronous sending feature has been introduced to the codebase for a while with proper testing. So my thinking is that it might be mainly out of temporary development purposes that the property was not enabled by default. If that's the case, it might be useful to keep it enabled by default now because this feature makes the system more resilient to network failures. Users may not change default parameters (especially given that this is a Java system property), so they may encounter the failures; and it may take a while for them to realize that surprisingly the system actually has a resilient feature that could have prevented that failure. So I'm inclined to advocate changing the default to be on. But I understand your concern that in general changing default parameter should be done carefully. |
There is no reason to keep the configuration only a java system property. It should be configurable through zoo.cfg. As the changes are not released yet we can change and make it configurable through zoo.cfg . I will create a bug |
I provide a fix for the issue ZOOKEEPER-4074 that I proposed.
I basically by default enable learner.asyncSending introduced by ZOOKEEPER-3575, which is already tested. I also update the document for the learner.asyncSending configuration.
I modify a test case in LearnerTest a little bit because enabling learner.asyncSending may require more initialization in Learner for testing.
ZOOKEEPER-3575 will not take effect until 3.7.0, so I am also wondering if we need to backport this patch to 3.6.x?