-
Notifications
You must be signed in to change notification settings - Fork 25.3k
[Zen2] Introduce PreVoteCollector #32847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Zen2] Introduce PreVoteCollector #32847
Conversation
An election requires a node to select a term that is higher than all previously-seen terms. If nodes are too enthusiastic about starting elections then they can effectively excludes itself from the cluster until the leader can bump to a still-higher term, and if this process repeats then a single faulty node can prevent the cluster from making useful progress. The solution is to start the election with a pre-voting round to ensure that there is at least a quorum of nodes who believe there to be no leader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some initial feedback. I'm missing a test that checks that a subsequent election (simulated on coordinationstate) will succeed given a successful prevoting round.
|
||
final long currentMaxTermSeen = updateMaxTermSeen(response.getCurrentTerm()); | ||
|
||
final PreVoteResponse currentPreVoteResponse = preVoteResponse; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should capture this in the PreVotingRound constructor. This should very rarely be changing (similarly to cluster state) if there is an ongoing prevoting round. This makes this instance also internally more consistent. Finally, I would like to see if we can make this inner class static, so that we know exactly about all the dependencies it has.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use the current (i.e. last-accepted) cluster state for this, I think.
The inner class uses these from the outer class:
logger
transportService
startElection.run()
toString()
preVoteResponse.getCurrentTerm()
(in the constructor)
I got rid of the getCurrentTerm()
call in the constructor, but I would prefer to use the others as-is rather than making it static.
|
||
public static final String REQUEST_PRE_VOTE_ACTION_NAME = "internal:cluster/request_pre_vote"; | ||
|
||
private final AtomicLong maxTermSeen = new AtomicLong(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should leave this out of the PR for now. I'm expecting this "maxTermSeen" thingy to appear in more components, and would like to treat it uniformly (maybe just a callback).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
|
||
void start(final Iterable<DiscoveryNode> broadcastNodes) { | ||
|
||
final boolean isRunningChanged = isRunning.compareAndSet(false, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this becomes obsolete by changing to isClosed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
final long currentMaxTermSeen = updateMaxTermSeen(response.getCurrentTerm()); | ||
|
||
final PreVoteResponse currentPreVoteResponse = preVoteResponse; | ||
if (response.getLastAcceptedTerm() > currentPreVoteResponse.getLastAcceptedTerm() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can do something similar as for the isElectionQuorum
trick where we share implementation of these checks with CoordinationState
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer not to do so. The machinery to share these few lines of code obscures what they do, and the extra abstraction in CoordinationState
takes it further away from the formal model.
import static org.elasticsearch.cluster.coordination.CoordinationState.isElectionQuorum; | ||
import static org.elasticsearch.common.util.concurrent.ConcurrentCollections.newConcurrentSet; | ||
|
||
public class PreVoteCollectorFactory extends AbstractComponent { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe omit the "Factory" here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few nits, looks good o.w.
this.transportService = transportService; | ||
this.startElection = startElection; | ||
|
||
transportService.registerRequestHandler(REQUEST_PRE_VOTE_ACTION_NAME, Names.GENERIC, false, false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can later on think whether we even need to go on the generic thread pool if there's no blocking action and only cheap computations here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added 68c4b84 to note this.
* @return the pre-voting round, which can be closed to end the round early. | ||
*/ | ||
public Releasable start(final ClusterState clusterState, final Iterable<DiscoveryNode> broadcastNodes) { | ||
logger.debug("{} starting", this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe sufficient to log this in the start method of PreVotingRound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved and expanded in 63b221c
public static final String REQUEST_PRE_VOTE_ACTION_NAME = "internal:cluster/request_pre_vote"; | ||
|
||
private final TransportService transportService; | ||
private final Runnable startElection; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure yet if it's sufficient to have a unique runnable for all voting rounds or whether we want specific ones, different for each voting round. We can check these though once we integrate these things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will become clear later...
public void update(final PreVoteResponse preVoteResponse, final DiscoveryNode currentLeader) { | ||
logger.trace("updating with preVoteResponse={}, currentLeader={}", preVoteResponse, currentLeader); | ||
this.preVoteResponse = preVoteResponse; | ||
this.currentLeader = currentLeader; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to do some lightweight synchronization around these two variables so that you always have a consistent snapshot of those? Alternatively, we could store them in a tuple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not really necessary, but might help a bit. I used a Tuple in 232b4e9
// TODO if we are a leader and the max term seen exceeds our term then we need to bump our term | ||
|
||
final DiscoveryNode currentLeader = this.currentLeader; | ||
if (currentLeader == null || currentLeader.equals(request.getSourceNode())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment while we treat the currentLeader.equals(request.getSourceNode())
situation specially
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 232b4e9
if (response.getLastAcceptedTerm() > clusterState.term() | ||
|| (response.getLastAcceptedTerm() == clusterState.term() | ||
&& response.getLastAcceptedVersion() > clusterState.version())) { | ||
logger.debug("{} ignoring {} from {} as it is fresher", this, response, sender); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why debug instead of trace logging everywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to keep trace logging for things that are expected to happen in a stable cluster, and anything unexpected to appear at debug or higher. In a stable cluster, any election activity is unexpected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok makes sense
|
||
@Override | ||
public void close() { | ||
final boolean isClosedChanged = isClosed.compareAndSet(false, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe name this isNotAlreadyClosed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 4f61eff
public Releasable start(final ClusterState clusterState, final Iterable<DiscoveryNode> broadcastNodes) { | ||
logger.debug("{} starting", this); | ||
PreVotingRound currentRound = new PreVotingRound(clusterState, preVoteResponse.getCurrentTerm()); | ||
currentRound.start(broadcastNodes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a fan of the "current"Round.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed in 4ec40e1
An election requires a node to select a term that is higher than all
previously-seen terms. If nodes are too enthusiastic about starting elections
then they can effectively excludes itself from the cluster until the leader can
bump to a still-higher term, and if this process repeats then a single faulty
node can prevent the cluster from making useful progress.
The solution is to start the election with a pre-voting round to ensure that
there is at least a quorum of nodes who believe there to be no leader.