
Possibility to have multiple leaders in Hazelcast - LeaderInitiator - LeaderSelector #9398

Closed
@c11epm

Description


In what version(s) of Spring Integration are you seeing this issue?

6.3.2.RELEASE

Describe the bug

This problem arises when a Hazelcast distributed fenced lock is destroyed in the Hazelcast cluster and a new lock is created.

The problem lies in the class org.springframework.integration.hazelcast.leader.LeaderInitiator.LeaderSelector in the call() method.

When a node tries to acquire the lock, it first checks at row 297 whether it already holds the lock. If it does, the node is the leader and ends up at row 303, "sleeping" for 500 ms. All other nodes fail the getLock().isLockedByCurrentThread() check and end up at row 313, where each node tries to acquire the lock for 500 ms.

The race condition arises when the distributed lock is destroyed. On every node that is not the leader, getLock().tryLock() fails with an exception, and the node goes through the exception handling at rows 322-338, including a 50 ms sleep. The nodes then go back into the loop and try to acquire the lock again. One of them succeeds and becomes the new leader.

However, the old leader is not interrupted by the exception following the lock destruction, since it is "sleeping" while trying to acquire the yieldSign. The next time the old leader goes through the loop, it fails the check at row 297 and goes to row 313 to try to acquire the lock; keep in mind that its leadership status, saved in the local field leader at row 288, still has the value true. This means the old leader cannot get the lock, yet still has the local leader field set to true and has never run the revokeLeadership() method. The cluster now has two nodes that think they are the leader, but only one (the new one) is the "true leader", since it is the one holding the Hazelcast lock.
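
To make the sequence above easier to follow, here is a minimal sketch of the loop as I understand it from the description, not the actual Spring Integration source: a plain ReentrantLock stands in for the Hazelcast FencedLock so the snippet compiles on its own, and handleGranted()/revokeLeadership() are reduced to comments.

// Simplified, hedged paraphrase of the loop described above -- NOT the actual
// LeaderSelector.call() source. It only illustrates where the stale-leader
// state appears after the distributed lock is destroyed.
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class LeaderLoopSketch {

	private final long heartBeatMillis = 500; // default; 100 s in the production case below
	private final long busyWaitMillis = 50;   // back-off used in the exception handling

	private final ReentrantLock lock = new ReentrantLock();          // stand-in for the FencedLock
	private final ReentrantLock yieldMonitor = new ReentrantLock();
	private final Condition yieldSign = yieldMonitor.newCondition(); // stand-in for the yieldSign
	private volatile boolean leader = false;                         // row 288: local leadership flag

	void selectorLoop() throws InterruptedException {
		while (true) {
			try {
				if (lock.isHeldByCurrentThread()) {
					// Rows 297/303: the current leader just waits on the yield signal,
					// so the destroyed lock never interrupts it here.
					yieldMonitor.lock();
					try {
						yieldSign.await(heartBeatMillis, TimeUnit.MILLISECONDS);
					}
					finally {
						yieldMonitor.unlock();
					}
				}
				else {
					// Row 313: everyone else (including a stale leader) tries to take over.
					boolean acquired = lock.tryLock(heartBeatMillis, TimeUnit.MILLISECONDS);
					if (acquired && !this.leader) {
						this.leader = true; // handleGranted() would be called here
					}
					// If 'acquired' is false but 'leader' is still true, nothing resets
					// the flag: this is the stale-leader state described above.
				}
			}
			catch (RuntimeException ex) {
				// Rows 322-338: destroyed lock -> exception handling plus a short sleep.
				Thread.sleep(busyWaitMillis);
			}
		}
	}
}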

We noticed this problem in a production environment where the locks were destroyed and the heartBeatMillis value had been overridden to 100 seconds (for no documented reason; someone set it two years before this error happened). I'm not sure whether this problem can arise with the default heartBeatMillis of 500 ms, but it can certainly happen with a larger one.
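
For reference, a hedged sketch of how such an override might be wired up. It assumes LeaderInitiator exposes heartBeatMillis through a setHeartBeatMillis(long) setter; that setter name is my assumption, so adapt it to however your configuration actually sets the value.

import com.hazelcast.core.HazelcastInstance;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.hazelcast.leader.LeaderInitiator;

@Configuration
public class LeaderElectionConfig {

	@Bean
	public LeaderInitiator leaderInitiator(HazelcastInstance hazelcastInstance) {
		LeaderInitiator initiator = new LeaderInitiator(hazelcastInstance);
		// Default is 500 ms; the production system above used 100 s,
		// which widens the window for the stale-leader race.
		initiator.setHeartBeatMillis(100_000); // assumed setter name
		return initiator;
	}
}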

To Reproduce

This is hard to reproduce, since it depends on the cluster destroying the lock.

Expected behavior

When a leader can no longer acquire the lock, it should run the revokeLeadership() method and stop running code that only the "true leader" should run.

Sample

N/A

Proposed solution

The else case at row 311 could add a check for whether the node still has the leader field set to true but can no longer acquire the lock:

else {
	// We try to acquire the lock
	boolean acquired = getLock()
			.tryLock(LeaderInitiator.this.heartBeatMillis, TimeUnit.MILLISECONDS);
	if (acquired && !this.leader) {
		// Success: we are now leader
		this.leader = true;
		handleGranted();
	}
	if (!acquired) {
		// revokeLeadership() already checks that the leader field is true
		// before changing anything and publishing the revocation events.
		revokeLeadership();
	}
}

What are your thoughts on this?
