Description
In what version(s) of Spring Integration are you seeing this issue?
6.3.2
Describe the bug
This problem arises when the Hazelcast distributed fenced lock is destroyed in the cluster and a new lock is created.
The problem lies in the call() method of org.springframework.integration.hazelcast.leader.LeaderInitiator.LeaderSelector.
When a node tries to acquire the lock, it first checks at row 297 whether it already holds it. If it does, the node is the leader and ends up at row 303, "sleeping" on the yield sign for 500ms. All other nodes fail the getLock().isLockedByCurrentThread() check and end up at row 313, where they try to acquire the lock for 500ms.
The race condition arises when the distributed lock is destroyed. On every node that is not the leader, getLock().tryLock() fails with an exception, and the node goes through the exception handling at rows 322-338, including a 50ms sleep. The nodes then go back into the loop and try to acquire the lock again. One of them succeeds and becomes the new leader.
However, the old leader is not interrupted by the exception following the lock destruction, since it is "sleeping" waiting on the yieldSign. The next time the old leader goes through the loop, it fails the check at row 297 and goes to row 313 to try to acquire the lock. Keep in mind that its leader status, saved in the local field leader at row 288, still has the value true. So the old leader cannot get the lock, yet its local leader field is still true and revokeLeadership() has never been run. The cluster now has two nodes that think they are the leader, but only one (the new one) is the "true leader", since it is the one holding the Hazelcast lock.
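To make the control flow easier to follow, here is a minimal, self-contained sketch that models the loop described above. A local ReentrantLock stands in for the Hazelcast FencedLock, the class and helper names are placeholders, and the row comments map to the description above; this is an approximation of the flow as I read it, not the actual LeaderSelector source.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Simplified model of the LeaderSelector.call() flow described above.
 * A local ReentrantLock stands in for the Hazelcast FencedLock; this is an
 * illustration of the control flow, not the real source.
 */
public class LeaderLoopSketch {

    private final ReentrantLock lock = new ReentrantLock();     // stand-in for getLock()
    private final ReentrantLock yieldLock = new ReentrantLock();
    private final Condition yieldSign = this.yieldLock.newCondition();
    private final long heartBeatMillis = 500;                   // default heartbeat
    private volatile boolean leader;                            // the local flag from row 288
    private volatile boolean running = true;

    public void run() throws InterruptedException {
        while (this.running) {
            try {
                // Row 297: corresponds to getLock().isLockedByCurrentThread().
                if (this.lock.isHeldByCurrentThread()) {
                    // Row 303: the leader "sleeps" on the yield sign for heartBeatMillis.
                    // Destroying the distributed lock does not wake this wait up.
                    this.yieldLock.lock();
                    try {
                        this.yieldSign.await(this.heartBeatMillis, TimeUnit.MILLISECONDS);
                    }
                    finally {
                        this.yieldLock.unlock();
                    }
                }
                else {
                    // Row 313: non-leaders block here trying to acquire the lock.
                    boolean acquired = this.lock.tryLock(this.heartBeatMillis, TimeUnit.MILLISECONDS);
                    if (acquired && !this.leader) {
                        this.leader = true;
                        handleGranted();
                    }
                    // The gap: when the old leader lands here after the lock was destroyed,
                    // acquired is false but this.leader is still true, and nothing ever
                    // calls revokeLeadership().
                }
            }
            catch (Exception ex) {
                // Rows 322-338: exception handling plus a short pause, then retry the loop.
                Thread.sleep(50);
            }
        }
    }

    private void handleGranted() {
        // In the real class this fires the granted event for the candidate.
    }
}
```

The important part is the else branch: nothing there resets a stale leader flag when the acquisition fails.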
We noticed this problem in a production environment where the locks were destroyed and the heartBeatMillis value had been overridden to 100 seconds (for no good reason; someone set it that way two years before this error happened). I'm not sure whether this problem can arise with the default heartBeatMillis of 500ms, but it can certainly happen with a larger value.
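For context, this is roughly what the override looked like, sketched from memory. It assumes heartBeatMillis is configurable through a setter on LeaderInitiator, which is how it was overridden in our setup; the 100-second value is just what we had, not a recommendation.

```java
import com.hazelcast.core.HazelcastInstance;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.hazelcast.leader.LeaderInitiator;

@Configuration
public class LeaderElectionConfig {

    // Sketch of the override we had in production. Assumes heartBeatMillis is
    // exposed via a setter; 100 seconds is what was configured when we hit
    // this bug, not a recommendation.
    @Bean
    public LeaderInitiator leaderInitiator(HazelcastInstance hazelcastInstance) {
        LeaderInitiator initiator = new LeaderInitiator(hazelcastInstance);
        initiator.setHeartBeatMillis(100_000); // default is 500 ms
        return initiator;
    }
}
```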
To Reproduce
This is hard to reproduce deterministically, since it depends on the cluster destroying the lock.
Expected behavior
When a leader can no longer acquire the lock, it should run the revokeLeadership() method and stop running code that only the "true leader" should run.
Sample
N/A
Proposed solution
The else case on row 311 could add a check for the situation where the node still has the leader field set to true but can no longer acquire the lock:
```java
else {
    // We try to acquire the lock
    boolean acquired = getLock()
            .tryLock(LeaderInitiator.this.heartBeatMillis, TimeUnit.MILLISECONDS);
    if (acquired && !this.leader) {
        // Success: we are now leader
        this.leader = true;
        handleGranted();
    }
    if (!acquired) {
        // revokeLeadership() already checks that the leader field is true
        // before changing anything and firing the revocation events.
        revokeLeadership();
    }
}
```
What are your thoughts on this?