
Possibility to have multiple leaders in Hazelcast - LeaderInitiator - LeaderSelector #9398

Closed
@c11epm

Description


In what version(s) of Spring Integration are you seeing this issue?

6.3.2.RELEASE

Describe the bug

This problem arises when a Hazelcast distributed fenced lock is destroyed in the Hazelcast cluster and a new lock is created.

The problem lies in the class org.springframework.integration.hazelcast.leader.LeaderInitiator.LeaderSelector in the call() method.

When a node tries to acquire the lock, it first checks at row 297 whether it already holds the lock. If it does, the node is the leader and ends up at row 303, "sleeping" for 500 ms. All other nodes fail the getLock().isLockedByCurrentThread() check and end up at row 313, where each node tries to acquire the lock for 500 ms.

The race condition arises when the distributed lock is destroyed. On every node that is not the leader, getLock().tryLock() fails with an exception, and the node goes through the exception handling at rows 322-338, including a 50 ms sleep. The nodes then go back into the loop and try to acquire the lock again. One of them succeeds and becomes the new leader.

However, the old leader is not interrupted by the exception following the lock destruction, since it is "sleeping" while trying to acquire the yieldSign. The next time the old leader goes through the loop, it fails the check at row 297 and goes to row 313 to try to acquire the lock; keep in mind that its leadership status, saved in the local field leader at row 288, still has the value true. This means the old leader cannot get the lock, yet still has the local leader field set to true and has never run the revokeLeadership() method. The cluster now has two nodes that think they are the leader, but only one (the new one) is the "true leader", since it is the one holding the Hazelcast lock.
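
To make the sequence above easier to follow, here is a minimal sketch of the loop as I understand it from the description, not the actual Spring Integration source: a plain ReentrantLock stands in for the Hazelcast FencedLock so the snippet compiles on its own, and handleGranted()/revokeLeadership() are reduced to comments.

// Simplified, hedged paraphrase of the loop described above -- NOT the actual
// LeaderSelector.call() source. It only illustrates where the stale-leader
// state appears after the distributed lock is destroyed.
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class LeaderLoopSketch {

	private final long heartBeatMillis = 500; // default; 100 s in the production case below
	private final long busyWaitMillis = 50;   // back-off used in the exception handling

	private final ReentrantLock lock = new ReentrantLock();          // stand-in for the FencedLock
	private final ReentrantLock yieldMonitor = new ReentrantLock();
	private final Condition yieldSign = yieldMonitor.newCondition(); // stand-in for the yieldSign
	private volatile boolean leader = false;                         // row 288: local leadership flag

	void selectorLoop() throws InterruptedException {
		while (true) {
			try {
				if (lock.isHeldByCurrentThread()) {
					// Rows 297/303: the current leader just waits on the yield signal,
					// so the destroyed lock never interrupts it here.
					yieldMonitor.lock();
					try {
						yieldSign.await(heartBeatMillis, TimeUnit.MILLISECONDS);
					}
					finally {
						yieldMonitor.unlock();
					}
				}
				else {
					// Row 313: everyone else (including a stale leader) tries to take over.
					boolean acquired = lock.tryLock(heartBeatMillis, TimeUnit.MILLISECONDS);
					if (acquired && !this.leader) {
						this.leader = true; // handleGranted() would be called here
					}
					// If 'acquired' is false but 'leader' is still true, nothing resets
					// the flag: this is the stale-leader state described above.
				}
			}
			catch (RuntimeException ex) {
				// Rows 322-338: destroyed lock -> exception handling plus a short sleep.
				Thread.sleep(busyWaitMillis);
			}
		}
	}
}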

We noticed this problem in a production environment where the locks were destroyed and the heartBeatMillis value had been overridden to 100 seconds (for no documented reason; someone set it two years before this error happened). I'm not sure whether this problem can arise with the default heartBeatMillis of 500 ms, but it can certainly happen with a larger one.
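
For reference, a hedged sketch of how such an override might be wired up. It assumes LeaderInitiator exposes heartBeatMillis through a setHeartBeatMillis(long) setter; that setter name is my assumption, so adapt it to however your configuration actually sets the value.

import com.hazelcast.core.HazelcastInstance;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.hazelcast.leader.LeaderInitiator;

@Configuration
public class LeaderElectionConfig {

	@Bean
	public LeaderInitiator leaderInitiator(HazelcastInstance hazelcastInstance) {
		LeaderInitiator initiator = new LeaderInitiator(hazelcastInstance);
		// Default is 500 ms; the production system above used 100 s,
		// which widens the window for the stale-leader race.
		initiator.setHeartBeatMillis(100_000); // assumed setter name
		return initiator;
	}
}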

To Reproduce

This is hard to reproduce, since it depends on the cluster destroying the lock.

Expected behavior

When a leader can no longer acquire the lock, it should run the revokeLeadership() method and stop running code that only the "true leader" should run.

Sample

N/A

Proposed solution

The else case at row 311 could add a check for whether the node still has the leader field set to true but can no longer acquire the lock:

else {
	// We try to acquire the lock
	boolean acquired = getLock()
			.tryLock(LeaderInitiator.this.heartBeatMillis, TimeUnit.MILLISECONDS);
	if (acquired && !this.leader) {
		// Success: we are now leader
		this.leader = true;
		handleGranted();
	}
	if (!acquired) {
		// revokeLeadership() already checks that the leader field is true
		// before changing anything and publishing the revocation events.
		revokeLeadership();
	}
}

What are your thoughts on this?
