Skip to content

Conversation

@ssalinas
Copy link
Member

@ssalinas ssalinas commented Feb 10, 2021

Realized there were a few cases in here where we could possibly still have blips of multiple leaders or no leaders at a time. In particular the following set of events:

  • Old leader gets a TERM signal
  • Old leader fails to properly wait for all scheduled pollers to exit
  • Old leader relinquishes leadership
  • New leader gains leadership and starts loading leader cache
  • Old leader pollers finish their run, possibly flush more state to zk
  • state flushed by old leader is now not in leader cache on new leader

Need to test this a bit, but this adds:

  • A specific pre-jetty-shutdown hook so that the leader cleanly moves from the leading instance to a read-only/standby before actually fully shutting down (so that we don't have weird blips in serving writes when leader changes). All pollers should stop in this step
  • poll for leadership change to make sure it has propagated
  • better/longer wait on running pollers
  • log when pollers aren't shut down cleanly

@ssalinas ssalinas added the staging Merged to staging branch label Feb 10, 2021

if (!service.awaitTermination(timeoutLeftInMillis, TimeUnit.MILLISECONDS)) {
return;
LOG.warn("Eeecutor service tasks did not exit in time");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo on executor

@rosalind210
Copy link
Contributor

🚢

@ssalinas ssalinas merged commit 98512dd into master Mar 4, 2021
@ssalinas ssalinas deleted the leader_cache_race_conditions branch March 4, 2021 18:44
@ssalinas ssalinas added this to the 1.5.0 milestone May 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

staging Merged to staging branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants