Improve: Ensure uniform membership config during config changes #1351

Merged

Conversation

drmingdrmer
Member

@drmingdrmer drmingdrmer commented Apr 2, 2025

Changelog

Improve: Ensure uniform membership config during config changes

This commit improves the handling of membership configuration changes to
ensure that the system reliably transitions to a uniform configuration
in the second step of a joint membership change.

Problem:

During a membership config change, there are two steps:

  1. Transition to a joint configuration containing both Cold (current
    config) and Cnew (new config).
  2. Transition to a uniform configuration containing only Cnew.

Previously, the second step attempted to apply the same change on the
current membership state. If multiple membership change requests were
processed in parallel, this approach could result in the system being
left in a joint configuration. For example:

  • Initial config: {a, b, c}.

  • Task 1: AddVoterIds(x). After the first step:
    [{a, b, c}, {a, b, c, x}].

  • Task 2: RemoveVoters(x). After the first step:
    [{a, b, c, x}, {a, b, c}] (applied on the last config {a, b, c, x}).

  • Task 1 proceeds to the second step, re-applies AddVoterIds(x), and
    the result is [{a, b, c}, {a, b, c, x}].

  • The system remains in a joint configuration, which is unintuitive and
    contradicts standard Raft expectations.

Solution:

The second step now applies an empty change request
(AddVoterIds({})) to the last configuration in the current joint
config. This ensures that the system always transitions to a uniform
configuration in the second step.
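
As a rough illustration (a minimal sketch using integer node ids and plain BTreeSets, not openraft's actual implementation), modelling a config as a set of voter ids shows why flattening to the last entry always ends in a uniform config, even with the interleaving described above:

use std::collections::BTreeSet;

type Config = BTreeSet<u64>;

// First step of a change: form a joint config [last(current), new_last].
fn to_joint(current: &[Config], new_last: Config) -> Vec<Config> {
    vec![current.last().cloned().unwrap_or_default(), new_last]
}

// Second step with the new behavior: apply an empty change, i.e. keep only
// the last entry of the joint config.
fn flatten(joint: &[Config]) -> Vec<Config> {
    vec![joint.last().cloned().unwrap_or_default()]
}

fn main() {
    // {a, b, c} and x from the example above, written as 1, 2, 3 and 4.
    let initial = vec![Config::from([1, 2, 3])];

    // Task 1, step 1: AddVoterIds(4) -> [{1,2,3}, {1,2,3,4}]
    let after_t1 = to_joint(&initial, Config::from([1, 2, 3, 4]));
    // Task 2, step 1 (concurrent): RemoveVoters(4) -> [{1,2,3,4}, {1,2,3}]
    let after_t2 = to_joint(&after_t1, Config::from([1, 2, 3]));

    // Task 1, step 2: flatten instead of re-applying AddVoterIds(4);
    // the result is uniform, not joint.
    assert_eq!(flatten(&after_t2), vec![Config::from([1, 2, 3])]);
}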

Impact:

  • No behavior changes occur if only one membership change request is in
    progress.

  • If multiple requests are processed concurrently, the application must
    still verify the result, and the new behavior ensures the system
    transitions to a uniform state.

This fix prevents the system from being left in an unintended joint
configuration, improving consistency and adherence to Raft principles.

Thanks to @tvsfx for providing feedback on this issue and offering a
detailed explanation of the solution.



@tvsfx
Contributor

tvsfx commented Apr 2, 2025

That's a nice, minimal way to implement this flattening operation :) Now we just need to think of a way to make the retain parameter work better

@drmingdrmer drmingdrmer force-pushed the 195-change-membership-batch branch 3 times, most recently from 27dbd28 to 1e5aee2 Compare April 2, 2025 16:11
@tvsfx
Contributor

tvsfx commented Apr 2, 2025

Oh, I just had a thought: I don't think this implementation works for removing Nodes and Voters as part of a batch at the same time: nodes can only be removed once the corresponding voter has been removed from the config already.

@drmingdrmer drmingdrmer marked this pull request as draft April 2, 2025 16:29
@drmingdrmer
Member Author

Oh, I just had a thought: I don't think this implementation works for removing Nodes and Voters as part of a batch at the same time: nodes can only be removed once the corresponding voter has been removed from the config already.

Hmm... correct

@drmingdrmer
Member Author

@tvsfx
It looks like this PR should be closed?

@tvsfx
Contributor

tvsfx commented Jul 10, 2025

#1351 (comment)

I've had some more time to think about this PR just now, and what I said above isn't correct. Imagine that we have a batch as follows (with retain = true): RemoveVoters(n1), RemoveNode(n1). Then the following scenarios are possible (not even considering concurrent membership configuration changes for now):

  1. n1 is either not in the starting config at all (e.g. joint config [{n2}]), or is not present in the last subconfig of the joint starting config (e.g. [{n1}, {n3}]). In this case, the configuration change works out, because the node (and, if present, its corresponding voter) can safely be removed immediately and we transition to a uniform config after one step (the flatten step is not necessary).
  2. If n1 is in the last subconfig of the current config (e.g. [{n3,n1}, {n2,n1}]), then we first have to transition to a joint config, whereas the node removal will happen immediately. We get an Error because the resulting config does not satisfy ensure_voter_nodes.

So unlike what my comment above said, this seems like it would "work", in the sense that we're not left in a bad state but error out. The reason I got confused is that I forgot that changes to Learners always happen in a single step, so if an error occurs, it should already happen in step 1 (i.e. before the flatten step). That being said, I think the semantics of mixing Learner and Voter changes within a single batch end up being very confusing, since Learner changes happen immediately, whereas Voter changes do not. This can cause surprising errors, or let changes to learners overtake changes to voters. Additionally, the handling of retain is still suboptimal in the implementation, since different membership changes can end up using each other's retain setting:

Let me repeat an old example of mine:

Starting cluster membership: [{n1,n2,n3}]
Task 1: RemoveVoter(n1), retain = true
Task 2: RemoveVoter(n2), retain = false
Interleaving:

  • First Task1: [{n1,n2,n3}] -> [{n1,n2,n3}, {n2,n3}]
  • Then Task2: [{n1,n2,n3}, {n2,n3}] -> [{n2,n3}, {n3}]. At this point, n1 is removed as a learner, because for task 2, retain = false
  • Then either Task1 or Task2: [{n2,n3}, {n3}] -> [{n3}]. If Task1 runs first, then n2 is kept as a learner, whereas if Task2 is run first, then n2 is removed as a learner.

So when changing cluster membership concurrently, the retain parameters of different invocations can still "pollute" each other.

@tvsfx
Contributor

tvsfx commented Jul 11, 2025

So what to do about retain? I've thought about one proposal that should be compatible with this PR, which I'll write out below. (I know you might not love this proposal since you want to be able to support permanent joint membership, but here goes anyway 😛)

We need to solve the issue of not knowing which change causes voters to be deleted. The idea is that if retain=false for the first stage of a membership change (i.e. not for the flatten stage this PR introduces), we mark the resulting config with a flag NoRetain (or a flag Retain if retain=true) if it is a joint config. 1 When NoRetain is set on a joint config [A, B], then once a membership transition occurs (to [B, C], [B], or [A] - see 1 for this last case), the nodes in A \ B are removed from the set of learners. This boolean flag would ideally be stored with the membership in the Raft logs if we want other nodes to be able to complete membership transitions correctly, but we could skip that if we expect new leaders to fix the membership on their own. (A rough data-structure sketch of this idea follows at the end of this comment.)
Notes:

  1. The flag is only present if the current membership is a Joint membership.
  2. In principle, there could be n-1 flags for n sets of nodes within a joint membership (or one flag for all of them?), but currently openraft never allows having more than 2 sets of members, thanks to the generalized membership transitions.
  3. Note that if we were to allow Batches of membership changes to have different retain parameters for different sub-operations, the story would be much more complicated and we would have to keep track of a set of IntendedDeletions, i.e. the nodes that should be deleted upon the next config change.

Examples of how this works (see also 1):

  • If we start from [{n1, n2}, {n2, n3}] with Retain and make the change RemoveVoters({n3}) with NoRetain, then we will have the transitions [{n1, n2}, {n2, n3}] - Retain -> [{n2, n3}, {n2}] - NoRetain -> [{n2}]. During the first transition, no nodes are deleted, since Retain is set. During this first transition, we store NoRetain with the new joint membership, so that when we finally perform the second transition, we know to remove n3. During the second transition, the voter n3 and its corresponding node will be removed.
  • The same situation, but now NoRetain is set at the start. This leads to the same transitions, but after the first transition, the node for n1 is deleted.
  • Note that tracking (No)Retain does change the (imo previously incorrect) semantics of some membership changes. Starting membership: [{n1, n2, n3}, {n2}] with NoRetain, and change AddVoterIds(n1). This would have previously caused the transitions: [{n1, n2, n3}, {n2}] -> [{n2}, {n1, n2}] -> [{n1, n2}], but will now fail because the node n1 should get deleted during the first transition.
  • Tracking (No)Retain does change the (imo previously incorrect) semantics of some membership changes (take 2). Let's look again at my old example:
    Starting cluster membership: [{n1,n2,n3}]
    Task 1: RemoveVoter(n1), retain = true
    Task 2: RemoveVoter(n2), retain = false
    Interleaving:
    First Task1: [{n1,n2,n3}] -> [{n1,n2,n3}, {n2,n3}] - Retain
    Then Task2: [{n1,n2,n3}, {n2,n3}] - Retain -> [{n2,n3}, {n3}] - NoRetain. Contrary to before, n1 is not removed, because it is not part of the last membership, and Retain is set. Now we set NoRetain, because these are the nodes that will be deleted from the last config.
    Then either Task1 or Task2: [{n2,n3}, {n3}] -> [{n3}]. We remove n2 as a Node because NoRetain is set.

Footnotes

  1. If it is a uniform (non-joint) config, you might think we can just retain all voters, because a single-step config change implies that no voters needed to be removed, but there's a tricky corner case to consider, due to the following optimization: if starting config is [{n1}, {n1, n2}] then the request RemoveVoters(n2), NoRetain will take place in a single step, because the resulting config {n1} is already part of the joint config. In this case, the voter {n2} should (imo) still be removed. This creates some trickiness when a NoRetain flag is already present on a joint config: starting config is [{n3}, {n1, n2}] - NoRetain. Then the call ReplaceAllVoters({n3}) - (No)Retain which would have previously (imo wrongly) succeeded in a single step, will now fail because NoRetain requires n3 to be removed from the nodes first. OTOH, ReplaceAllVoters({n1, n2}) - (No)Retain will complete in a single step and only cause {n3} to be removed. 2 3
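
To make the above more concrete, here is a rough, non-authoritative sketch of the flag idea in code; the names (RetainFlag, FlaggedMembership, step) are hypothetical and only illustrate the bookkeeping, not an actual openraft API:

use std::collections::BTreeSet;

type NodeId = u64;
type Config = BTreeSet<NodeId>;

#[derive(Clone, Copy, PartialEq)]
enum RetainFlag {
    Retain,   // keep removed voters as learners on the next transition
    NoRetain, // drop removed voters from the learner set on the next transition
}

struct FlaggedMembership {
    // Joint config [A, B]; uniform when it has a single entry.
    configs: Vec<Config>,
    // Only meaningful while the config is joint.
    flag: Option<RetainFlag>,
    // Current learners.
    learners: Config,
}

impl FlaggedMembership {
    // Transition away from the current config. If the current config is joint
    // [A, B] and was marked NoRetain, the voters in A \ B are dropped from the
    // learner set at this point.
    fn step(&mut self, next: Vec<Config>, next_flag: Option<RetainFlag>) {
        if self.configs.len() == 2 && self.flag == Some(RetainFlag::NoRetain) {
            let removed: Vec<NodeId> =
                self.configs[0].difference(&self.configs[1]).copied().collect();
            for id in removed {
                self.learners.remove(&id);
            }
        }
        self.configs = next;
        self.flag = next_flag;
    }
}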

@drmingdrmer
Member Author

Hmm... Can I just simplify the above logic to:

Change membership from ver-1 to ver-2, by adding explicit learner ids:

// ver-1 membership

type VoterId = NodeId;

struct Membership {
    configs: Vec<BTreeSet<VoterId>>,
    nodes: BTreeMap<NodeId, Node>, 
}

ver-1 has a constraint that each VoterId present in configs must also be present in nodes.

// ver-2 membership

type VoterId = NodeId;
type LearnerId = NodeId;

struct Membership {
    configs: Vec<(BTreeSet<VoterId>, BTreeSet<LearnerId>)>, // add explicit learner ids for each step.
    nodes: BTreeMap<NodeId, Node>, 
}

ver-2 constraint: in each tuple in configs, the BTreeSet<VoterId> does not intersect with the BTreeSet<LearnerId>,
and the set of VoterIds and LearnerIds present in configs equals nodes.keys().

In short, the learner ids are part of each step's config rather than of the entire joint config.
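
For illustration, the ver-2 constraints could be checked with something like the following (a hypothetical helper, not part of this PR):

use std::collections::{BTreeMap, BTreeSet};

type NodeId = u64;
struct Node; // stand-in for the node metadata type

struct MembershipV2 {
    // (voters, learners) for each step of the joint config.
    configs: Vec<(BTreeSet<NodeId>, BTreeSet<NodeId>)>,
    nodes: BTreeMap<NodeId, Node>,
}

impl MembershipV2 {
    fn is_valid(&self) -> bool {
        let mut referenced = BTreeSet::new();
        for (voters, learners) in &self.configs {
            // Constraint 1: voters and learners are disjoint within each step.
            if !voters.is_disjoint(learners) {
                return false;
            }
            referenced.extend(voters.iter().copied());
            referenced.extend(learners.iter().copied());
        }
        // Constraint 2: the ids referenced in configs are exactly nodes.keys().
        let node_ids: BTreeSet<NodeId> = self.nodes.keys().copied().collect();
        referenced == node_ids
    }
}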

@tvsfx
Contributor

tvsfx commented Jul 22, 2025

I was trying to change the Membership structs as little as possible in my solution above (i.e. add just a single [vec of] flag[s]). However, if explicitly adding the set of learners to each sub-membership in a joint membership is a change you're willing to make, I think it's much more elegant than what I was proposing :)

At first, I didn't really see what the advantage would be of tracking Learners explicitly like this (apart from the fact that it made me happy, intuitively), as opposed to just the retain flags I proposed, but it actually makes certain changes possible that didn't use to be possible. Consider the example I gave above:

(with retain = true) : RemoveVoters(n1), RemoveNode(n1)
If n1 is in the last subconfig of the current config (e.g. [{n3,n1}, {n2,n1}]), then we first have to transition to a joint config, whereas the node removal will happen immediately. We get an Error because the resulting config does not satisfy ensure_voter_nodes.

Explicitly representing the learners in the above example, we can now have the transition: [({n3,n1}, {n3,n1}), ({n2, n1}, {n2, n1})] -> [({n2, n1}, {n2, n1}), ({n2}, {n2})] -> [({n2}, {n2})]. In other words, because we track more state, we are not forced to remove learners immediately, and don't get counterintuitive errors when trying to remove voters and learners that overlap simultaneously.
On the other hand, when we have RemoveNode(n1) for an already-removed voter, this step can happen immediately, since there is no need to go through a joint config. E.g. [({n2}, {n2, n1})] -> [({n2}, {n2})] happens immediately.
In other words, when we have changes to only nodes and/or learners, we can still do changes immediately. It's only when we change voters as well that (I think) we should always use a joint config, as my feeling is that this captures the semantics of batching better.

I would consider:

  • renaming AddNodes and RemoveNodes to AddLearners and RemoveLearners in that case.
  • (not sure about this one) the semantics of SetNodes and ReplaceAllNodes are now somewhat strange again. Imagine you call RemoveVoter({n1}), ReplaceAllNodes({n1: <ip>}) when starting from ({n1,n2}, {n1,n2}). This call will fail because not all voters are replaced by ReplaceAllNodes, which is counterintuitive to me in the same way that the previously failing example was. The problem is again that because we only have a single view on nodes, the changes need to be processed immediately (i.e. we need to replace the node for {n2}), whereas because we have multiple views on voters/learners, we can slow those changes down arbitrarily. An alternative is checking whether ReplaceAllNodes replaces all nodes in the final config only, and keeping the other nodes around until their voters/learners are removed? Not sure how useful this is compared to SetNodes.

@drmingdrmer drmingdrmer force-pushed the 195-change-membership-batch branch from 1e5aee2 to f5b0161 Compare July 23, 2025 09:44
@drmingdrmer
Member Author

You're right. The new structure provides a much more flexible and complete state representation for membership. The API changes aren't a concern - we can deprecate the old variants and introduce new ones that better reflect the purpose of each operation.

However, there's one important consideration: modifying the internal structure of Membership is a breaking change. I'll need to implement this in a backward-compatible way to avoid breaking existing user applications.

Regarding the membership structure modifications, I'd appreciate your help reviewing this change to ensure it improves the membership change API capabilities. I plan to implement the internal modifications to use the new membership layout in a separate pull request.

@drmingdrmer drmingdrmer marked this pull request as ready for review July 23, 2025 13:22
@tvsfx
Contributor

tvsfx commented Jul 23, 2025

However, there's one important consideration: modifying the internal structure of Membership is a breaking change. I'll need to implement this in a backward-compatible way to avoid breaking existing user applications.

What is your policy on these types of breaking changes? Are they to be avoided at all costs?
Deprecating and eventually deleting an API endpoint is also a breaking change, but do you allow this because you spread it over multiple major versions?
So in that case you can update your cluster one node at a time, because the updated nodes can use the old or new endpoints and the non-updated ones use the old endpoints? And then old or new -> new on the next major version?
Still, from the point of view of user applications, this requires changes when the old endpoints are dropped? Is this ok because you can update Raft clients separately without downtime?
Another concern with changes to the representation of Membership is that with such a rolling update, different nodes have different representations at the same time.

Regarding the membership structure modifications, I'd appreciate your help reviewing this change to ensure it improves the membership change API capabilities. I plan to implement the internal modifications to use the new membership layout in a separate pull request.

Sure, I'll review your PRs :) Do you want me to review this PR in its current state, or are there changes you still want to make following our discussion above?

@drmingdrmer
Member Author

What is your policy on these types of breaking changes? Are they to be avoided at all costs? Deprecating and eventually deleting an API endpoint is also a breaking change, but do you allow this because you spread it over multiple major versions?

There are two kinds of breaking changes: API changes and storage data format changes.

API changes are easily handled. We can add deprecation hints and introduce new APIs, informing users to switch to the new ones. All warnings are shown at compile time, so users need to update their code when upgrading to a new version of OpenRaft.

Data format changes are more complicated, such as the change to Membership, which already has #[derive(Serialize, Deserialize)], because there's no transparent way for new data structures to deserialize bytes that were serialized by old data structures. In this case, we have three options:

  • provide an adapter layer and document it for users;
  • make the new data structure capable of loading data serialized by old versions;
  • or abstract the data structure into a trait so OpenRaft doesn't need to handle compatibility issues and user applications can choose to switch to the new data structure or keep using the old one.

I'm going to use the third approach. First, I'll make membership a trait and add another associated type type Membership: RaftMembership to RaftTypeConfig, setting the default value to the current membership implementation.

Then I'll add a new implementation of the RaftMembership trait that follows the new design we discussed, which provides the ability to handle accurate membership config changes as you expected.

So in that case you can update your cluster one node at a time, because the updated nodes can use the old or new endpoints and the non-updated ones use the old endpoints? And then old or new -> new on the next major version? Still, from the point of view of user applications, this requires changes when the old endpoints are dropped? Is this ok because you can update Raft clients separately without downtime? Another concern with changes to the representation of Membership is that with such a rolling update, different nodes have different representations at the same time.

In this scenario, if the user wants to upgrade the Membership to a new format, they need to define a new type that can load data generated by both the current Membership and the new user-defined EnhancedMembership. When using serde, the simplest approach is an untagged enum:

#[derive(Serialize, Deserialize)]
#[serde(untagged)]
enum CompatMembership {
    // The current built-in format.
    A(openraft::Membership),
    // The new user-defined format.
    B(EnhancedMembership),
}
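
For example, a self-contained sketch of this pattern (using stand-in types for the old and enhanced membership, and assuming serde_json as the wire format) might look like:

use serde::{Deserialize, Serialize};
use std::collections::BTreeSet;

#[derive(Serialize, Deserialize)]
struct OldMembership {
    configs: Vec<BTreeSet<u64>>, // voters only
}

#[derive(Serialize, Deserialize)]
struct EnhancedMembership {
    configs: Vec<(BTreeSet<u64>, BTreeSet<u64>)>, // (voters, learners) per step
}

#[derive(Serialize, Deserialize)]
#[serde(untagged)]
enum CompatMembership {
    A(OldMembership),      // data written by the old version
    B(EnhancedMembership), // data written by the new version
}

fn main() {
    // Bytes serialized in the old layout still load through the compat type...
    let old = r#"{"configs":[[1,2,3]]}"#;
    assert!(matches!(
        serde_json::from_str::<CompatMembership>(old).unwrap(),
        CompatMembership::A(_)
    ));

    // ...and so does data written in the new layout.
    let new = r#"{"configs":[[[1,2],[3]]]}"#;
    assert!(matches!(
        serde_json::from_str::<CompatMembership>(new).unwrap(),
        CompatMembership::B(_)
    ));
}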

The upgrading process needs to be phased. In the first phase, upgrade nodes one by one to operate with code that can load both old and new versions of serialized data.

In the second phase, upgrade the nodes one by one to a version that starts using EnhancedMembership instead of the old Membership.

Regarding the membership structure modifications, I'd appreciate your help reviewing this change to ensure it improves the membership change API capabilities. I plan to implement the internal modifications to use the new membership layout in a separate pull request.

Sure, I'll review your PRs :) Do you want me to review this PR in its current state, or are there changes you still want to make following our discussion above?

This PR won't add other changes. I'll put the new format of Membership in a new issue.

@drmingdrmer
Member Author

@tvsfx
I moved the planned change to the following issue, and this PR will be merged and closed. Let's continue the discussion in:

@drmingdrmer drmingdrmer force-pushed the 195-change-membership-batch branch from f5b0161 to 37d6943 Compare July 24, 2025 06:44
@drmingdrmer drmingdrmer merged commit 37d6943 into databendlabs:main Jul 24, 2025
34 checks passed
@drmingdrmer drmingdrmer deleted the 195-change-membership-batch branch July 24, 2025 13:45
@tvsfx
Contributor

tvsfx commented Jul 24, 2025

FWIW, this PR looked good to me :) Apart from the retain shenanigans, but those should be fixed in #1395
