Skip to content

ZOOKEEPER-4074: Network issue while Learner is executing writePacket can cause the follower to hang #1582

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

functioner
Copy link

@functioner functioner commented Jan 18, 2021

I provide a fix for the issue ZOOKEEPER-4074 that I proposed.
I basically by default enable learner.asyncSending introduced by ZOOKEEPER-3575, which is already tested. I also update the document for the learner.asyncSending configuration.
I modify a test case in LearnerTest a little bit because enabling learner.asyncSending may require more initialization in Learner for testing.
ZOOKEEPER-3575 will not take effect until 3.7.0, so I am also wondering if we need to backport this patch to 3.6.x?

@maoling maoling self-requested a review January 22, 2021 08:43
@functioner
Copy link
Author

I provide a fix for the issue ZOOKEEPER-4074 that I proposed.
I basically by default enable learner.asyncSending introduced by ZOOKEEPER-3575, which is already tested. I also update the document for the learner.asyncSending configuration.
I modify a test case in LearnerTest a little bit because enabling learner.asyncSending may require more initialization in Learner for testing.
ZOOKEEPER-3575 will not take effect until 3.7.0, so I am also wondering if we need to backport this patch to 3.6.x?

@functioner functioner closed this Feb 4, 2021
@functioner functioner reopened this Feb 4, 2021
@functioner
Copy link
Author

@maoling I've fixed the failing test cases in my local machine where all test cases pass. But there are still some failing test cases in the CI server, and I was wondering what could be the reason? Do you have any suggestion?

@maoling
Copy link
Member

maoling commented Feb 6, 2021

@functioner The CI failure may not your blame. We're looking at this patch. Thanks for your contribution.

Copy link
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM. However, could you add a test for the specific issue?

Also it is a major config change which deserve a release note IMO.

@@ -1149,6 +1149,11 @@ property, when available, is noted below.
**New in 3.6.2:**
When enabled, a learner will close the quorum socket asynchronously. This is useful for TLS connections where closing a socket might take a long time, block the shutdown process, potentially delay a new leader election, and leave the quorum unavailabe. Closing the socket asynchronously avoids blocking the shutdown process despite the long socket closing time and a new leader election can be started while the socket being closed. The default is false.

* *learner.asyncSending*
(Jave system property only: **learner.closeSocketAsync**)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yeah, it's a typo. I'll fix it.

@functioner
Copy link
Author

functioner commented Feb 24, 2021

@tisonkun To add a test for this issue, we may consider adding a fault injection tool, e.g., Byteman (ZOOKEEPER-3601 by @maoling ). I have provided the script to reproduce this bug (ZOOKEEPER-4074) using Byteman.

@arshadmohammad
Copy link
Contributor

I am not sure whether we should enable this functionality by default in branch-3.6.
This functionality is new and it has changed core ZK code, I think it is better not to enable it by default to avoid impact on other functionalities.
But I am sure we should backport to branch-3.6.

@functioner what you say shall we keep disabled it by default ?

@functioner
Copy link
Author

I am not sure whether we should enable this functionality by default in branch-3.6.
This functionality is new and it has changed core ZK code, I think it is better not to enable it by default to avoid impact on other functionalities.
But I am sure we should backport to branch-3.6.

@functioner what you say shall we keep disabled it by default ?

The asynchronous sending feature has been introduced to the codebase for a while with proper testing. So my thinking is that it might be mainly out of temporary development purposes that the property was not enabled by default. If that's the case, it might be useful to keep it enabled by default now because this feature makes the system more resilient to network failures. Users may not change default parameters (especially given that this is a Java system property), so they may encounter the failures; and it may take a while for them to realize that surprisingly the system actually has a resilient feature that could have prevented that failure. So I'm inclined to advocate changing the default to be on. But I understand your concern that in general changing default parameter should be done carefully.

@arshadmohammad
Copy link
Contributor

Users may not change default parameters (especially given that this is a Java system property),

There is no reason to keep the configuration only a java system property. It should be configurable through zoo.cfg. As the changes are not released yet we can change and make it configurable through zoo.cfg . I will create a bug

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants