Replace bookie support downgrade to replace bookie with itself. #4013

Open
wants to merge 13 commits into
base: master
from

Conversation

horizonzy
Member

@horizonzy horizonzy commented Jul 3, 2023

Descriptions of the changes in this PR:

Fixes #4012

Behavior of this PR:
When replacing a failed bookie, we first try to pick a new bookie from the rest of the cluster. If no other bookie is available, we check whether the failed bookie is still alive. If it is, we fall back to replacing the failed bookie with itself.

Example:
There are 4 bookies [0, 1, 2, 3] in the cluster, and the ledger ensemble is [0, 1, 2].
Case 1: We want to replace 0. The current ensemble [0, 1, 2] is excluded, so bookie 3 is picked. The new ledger ensemble is [3, 1, 2].
Case 2: Bookie 3 has shut down, so only [0, 1, 2] remain in the cluster. We want to replace 0. [0, 1, 2] is excluded, so there are no other bookies to select. Bookie 0 is still in the cluster, so we pick 0. The ledger ensemble stays [0, 1, 2].
Case 3: Bookies 0 and 3 have shut down, so only [1, 2] remain in the cluster. We want to replace 0. [0, 1, 2] is excluded, so there are no other bookies to select. Bookie 0 is no longer in the cluster, so an exception is thrown.
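
The three cases can be expressed as a small, self-contained Java sketch. This is only a toy model of the selection rule described above, not the code in this PR: bookies are plain integers, and `ReplaceBookieSketch`, `replaceBookie`, and `NoBookieAvailableException` are hypothetical names rather than BookKeeper APIs. Picking the lowest-numbered spare bookie is also an assumption made to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the fallback rule; every name here is hypothetical.
final class ReplaceBookieSketch {

    // Stand-in for whatever exception the real placement code throws.
    static final class NoBookieAvailableException extends RuntimeException {
        NoBookieAvailableException(String message) {
            super(message);
        }
    }

    // Returns a new ensemble in which `failed` has been replaced.
    static List<Integer> replaceBookie(Set<Integer> cluster, List<Integer> ensemble, int failed) {
        int index = ensemble.indexOf(failed);
        if (index < 0) {
            throw new IllegalArgumentException("bookie " + failed + " is not in the ensemble");
        }

        // Normal path: exclude every member of the current ensemble.
        TreeSet<Integer> candidates = new TreeSet<>(cluster);
        candidates.removeAll(ensemble);

        int replacement;
        if (!candidates.isEmpty()) {
            replacement = candidates.first();      // Case 1: a spare bookie exists
        } else if (cluster.contains(failed)) {
            replacement = failed;                  // Case 2: fall back to the bookie itself
        } else {
            // Case 3: no spare bookie, and the failed bookie has left the cluster
            throw new NoBookieAvailableException("no bookie can replace " + failed);
        }

        List<Integer> result = new ArrayList<>(ensemble);
        result.set(index, replacement);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> ensemble = List.of(0, 1, 2);
        System.out.println(replaceBookie(Set.of(0, 1, 2, 3), ensemble, 0)); // [3, 1, 2]
        System.out.println(replaceBookie(Set.of(0, 1, 2), ensemble, 0));    // [0, 1, 2]
        try {
            replaceBookie(Set.of(1, 2), ensemble, 0);
        } catch (NoBookieAvailableException e) {
            System.out.println("case 3: " + e.getMessage());
        }
    }
}
```

In this sketch the fallback is only reached when the exclusion of the whole ensemble leaves no candidate, so a spare bookie is always preferred over the failed one.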

This can bring benefits to the following cases:

  1. Fixes the ReplicationWorker problem reported in #4012.
  2. When a write fails due to a request timeout, we replace the failed bookie, switch to a new segment, and continue writing. Before this PR, if no other bookie was available, an exception was thrown and the ledger was closed. After this PR, if the failed bookie is still alive, we switch to a new segment and keep writing to the same ledger.
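
To make the second case concrete, here is a rough Java sketch of the write path before and after the change. It is a toy model under several assumptions: `SegmentSwitchSketch` and all of its methods are hypothetical, segments are modeled as a map from the first entry id of a segment to its ensemble, and "closing the ledger" is reduced to a boolean flag. The `allowSelfReplacement` switch stands in for the behavior added by this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of a ledger handle reacting to a failed write; all names are hypothetical.
final class SegmentSwitchSketch {

    // First entry id of each segment -> ensemble used for that segment.
    private final TreeMap<Long, List<Integer>> segments = new TreeMap<>();
    private boolean closed = false;

    SegmentSwitchSketch(List<Integer> initialEnsemble) {
        segments.put(0L, List.copyOf(initialEnsemble));
    }

    // Called when writing `entryId` to bookie `failed` timed out.
    void onWriteFailure(long entryId, int failed, Set<Integer> liveBookies, boolean allowSelfReplacement) {
        List<Integer> current = currentEnsemble();
        int index = current.indexOf(failed);
        if (index < 0) {
            throw new IllegalArgumentException("bookie " + failed + " is not in the current ensemble");
        }

        // Spare bookies are live bookies outside the current ensemble.
        TreeSet<Integer> spare = new TreeSet<>(liveBookies);
        spare.removeAll(current);

        Integer replacement = null;
        if (!spare.isEmpty()) {
            replacement = spare.first();
        } else if (allowSelfReplacement && liveBookies.contains(failed)) {
            replacement = failed; // behavior described for this PR
        }

        if (replacement == null) {
            closed = true; // stands in for "throw an exception and close the ledger"
            return;
        }

        // Start a new segment at the failed entry and keep writing.
        List<Integer> next = new ArrayList<>(current);
        next.set(index, replacement);
        segments.put(entryId, List.copyOf(next));
    }

    List<Integer> currentEnsemble() {
        return segments.lastEntry().getValue();
    }

    int segmentCount() {
        return segments.size();
    }

    boolean isClosed() {
        return closed;
    }

    public static void main(String[] args) {
        Set<Integer> live = Set.of(0, 1, 2); // no spare bookie in the cluster

        SegmentSwitchSketch oldBehavior = new SegmentSwitchSketch(List.of(0, 1, 2));
        oldBehavior.onWriteFailure(10L, 0, live, false);
        System.out.println("old: closed=" + oldBehavior.isClosed());

        SegmentSwitchSketch newBehavior = new SegmentSwitchSketch(List.of(0, 1, 2));
        newBehavior.onWriteFailure(10L, 0, live, true);
        System.out.println("new: closed=" + newBehavior.isClosed()
                + ", segments=" + newBehavior.segmentCount()
                + ", ensemble=" + newBehavior.currentEnsemble());
    }
}
```

With the fallback enabled, the ledger in this model gains a second segment whose ensemble is identical to the first, which is what lets the writer continue instead of closing.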

Member

@zymap zymap left a comment


I am just thinking: this issue comes from how we exclude bookies. For example, say we have 3 bookies and E, W, Q are 3, 3, 2. If we lose a replica and then exclude every bookie in the ensemble, there are no bookies left to write to. You want to be able to choose the failed bookie itself, so the simplest way should be to not exclude it from the candidate bookies. The existing logic, including the bookie writable check, then takes care of the rest.
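
A minimal Java sketch of this alternative, assuming integer bookie ids and a precomputed set of writable bookies. `ExcludeSetSketch`, `excludeSet`, and `candidates` are hypothetical names, and the real placement policy does much more than a set difference.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of "don't exclude the failed bookie itself"; all names are hypothetical.
final class ExcludeSetSketch {

    // Exclude the ensemble members except the bookie being replaced.
    static Set<Integer> excludeSet(List<Integer> ensemble, int failed) {
        Set<Integer> exclude = new HashSet<>(ensemble);
        exclude.remove(Integer.valueOf(failed));
        return exclude;
    }

    // Candidates are writable bookies that are not excluded. The writable
    // check is modeled by only passing writable bookies in.
    static TreeSet<Integer> candidates(Set<Integer> writableBookies, List<Integer> ensemble, int failed) {
        TreeSet<Integer> result = new TreeSet<>(writableBookies);
        result.removeAll(excludeSet(ensemble, failed));
        return result;
    }

    public static void main(String[] args) {
        List<Integer> ensemble = List.of(0, 1, 2);
        System.out.println(candidates(Set.of(0, 1, 2), ensemble, 0));    // [0]
        System.out.println(candidates(Set.of(1, 2), ensemble, 0));       // []
        System.out.println(candidates(Set.of(0, 1, 2, 3), ensemble, 0)); // [0, 3]
    }
}
```

One difference from the fallback described in the PR shows up in the last call: when a spare bookie exists, the failed bookie is just another candidate, so whether bookie 3 is preferred over bookie 0 would depend on how the placement policy ranks candidates.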

@horizonzy horizonzy closed this Jul 6, 2023
@horizonzy horizonzy reopened this Jul 6, 2023
@eolivelli eolivelli modified the milestones: 4.17.0, 4.18.0 Mar 29, 2024
Development

Successfully merging this pull request may close these issues.

ReplicationWorker not work problem.
4 participants