Description
The goal is that we can perform an operations-based recovery for all "reasonable" shard copies C:
- There is a peer recovery retention lease L corresponding with C.
- Every in-sync shard copy has a complete history of operations above the retained seqno of L.
- The retained seqno r of L is no greater than the local checkpoint of the last safe commit of C.
Reasonable shard copies comprise all the copies that are currently being tracked, as well as all the copies that "might be a recovery target": if the shard is not fully allocated then any copy that has been tracked in the last `index.soft_deletes.retention_lease.period` (i.e. 12h) might reasonably be a recovery target.
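To make the invariant concrete, here is a minimal sketch of the per-copy check in Java; the types and field names (`PeerRecoveryLease`, `ShardCopy`, and so on) are illustrative placeholders, not the actual Elasticsearch classes:

```java
import java.util.Collection;

// Illustrative placeholder types; the real Elasticsearch classes differ.
record PeerRecoveryLease(String nodeId, long retainedSeqNo) {}
record ShardCopy(String nodeId, long localCheckpointOfSafeCommit) {}

final class OpsBasedRecoveryInvariant {
    /**
     * An ops-based recovery of {@code copy} is possible when a lease L exists for it and
     * the retained seqno of L is no greater than the local checkpoint of the last safe
     * commit of the copy, so replaying the retained history can bring the copy up to date.
     */
    static boolean opsBasedRecoveryPossible(ShardCopy copy, Collection<PeerRecoveryLease> leases) {
        return leases.stream()
            .anyMatch(lease -> lease.nodeId().equals(copy.nodeId())
                && lease.retainedSeqNo() <= copy.localCheckpointOfSafeCommit());
    }
}
```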
We also require that history is eventually released: in a stable cluster, for every operation with seqno s below the MSN of a replication group, eventually there are no leases that retain s:
- Every active shard copy eventually advances the local checkpoint of its last safe commit (LCPoSC) past s.
- Every lease for an active shard copy eventually also passes s.
- Every inactive shard copy eventually either becomes active or else its lease expires.
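The last point is purely time-based. A minimal sketch of that expiry check, assuming a simple last-renewal timestamp (hypothetical helper, not the Elasticsearch implementation):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of time-based expiry for a lease whose shard copy never comes back.
final class LeaseExpiryCheck {
    /** A lease that has not been renewed within the retention period has expired, releasing its history. */
    static boolean hasExpired(Instant lastRenewed, Duration retentionPeriod, Instant now) {
        return lastRenewed.plus(retentionPeriod).isBefore(now);
    }
}
```

With the default `index.soft_deletes.retention_lease.period` of 12h this amounts to `hasExpired(lastRenewed, Duration.ofHours(12), Instant.now())`.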
Concretely, this should ensure that operations-based recoveries are possible in the following cases (subject to the copy being allocated back to the same node):
- a shard copy C is offline for a short period (< 12h)
  - even if the primary is relocated or a replica is promoted to primary while C is offline.
  - even if C was part of a closed/frozen/readonly index that was opened while C was offline
    - but not if the index was closed/frozen again before C comes back
      - TBD: maybe we are ok with this being a file-based recovery?
- a full-cluster restart
This breaks into a few conceptually-separate pieces:
- Adjust peer recovery to start by recovering the target using the local translog as far as (the local copy of) the global checkpoint (Use global checkpoint as starting seq in ops-based recovery #43463)
  - this means we can discard history that is behind every known global checkpoint
  - replicas already share with the primary the necessary information about the movement of the global checkpoint
- Create peer recovery retention leases to retain the history needed by each shard (Create peer-recovery retention leases #43190, Add missing GCP update #43632)
  - For the primary, on primary activation
  - For replicas, during peer recovery
  - Retention leases don't guarantee that history is retained by every copy
- Lazily create retention leases for tracked shards whose leases don't exist because the primary was relocated from an older version. (Create missing PRRLs after primary activation #44009)
- Advance existing peer recovery retention leases according to the history information exposed by each shard copy. (Advance PRRLs to match GCP of tracked shards #43751, Prevent invalid renewals of PRRLs #43898)
- Make peer recovery work together with retention leases (Recover peers using history from Lucene #44853); see the sketch after this list
  - Use the existence of a retention lease as the deciding factor for performing an ops-based recovery
  - Reinstate recovery from history stored in Lucene if soft deletes are enabled
- Tests should sometimes randomly set the lease expiry time very low to ensure that everything still works if leases are expiring. (Randomise retention lease expiry time #44067)
- Discard translog more enthusiastically now that we don't need to retain it any more (Ignore translog retention policy if soft-deletes enabled #45473)
- Expire leases based on more than time: if a file-based recovery would clearly be cheaper than an ops-based recovery then we may as well throw a lease away (Only retain reasonable history for peer recoveries #45208)
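As referenced in the fifth piece above, here is a sketch of how that decision combines with the global-checkpoint starting point from the first piece on the recovery source. The names are illustrative only, not the actual Elasticsearch implementation:

```java
import java.util.OptionalLong;

// Illustrative sketch; the real recovery source handler is considerably more involved.
final class RecoveryPlanner {
    /**
     * The target first replays its own translog up to its local copy of the global
     * checkpoint, so the source only needs to ship operations from the next seqno onwards.
     */
    static long requiredStartingSeqNo(long targetLocalGlobalCheckpoint) {
        return targetLocalGlobalCheckpoint + 1;
    }

    /**
     * An ops-based recovery is chosen only if a peer recovery retention lease for the
     * target exists and retains everything from the required starting seqno onwards;
     * otherwise the recovery falls back to copying files.
     */
    static boolean useOpsBasedRecovery(OptionalLong leaseRetainedSeqNo, long requiredStartingSeqNo) {
        return leaseRetainedSeqNo.isPresent() && leaseRetainedSeqNo.getAsLong() <= requiredStartingSeqNo;
    }
}
```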
Follow-up work, out of scope for the feature branches:
- Adjust translog retention
  - Should we retain translog generations according to retention leases too?
  - Trim translog files eagerly during the "verify-before-close" step for closed/frozen indices (Trim translog for closed indices #43156)
  - Properly support peer-recovery retention leases on indices that are not using soft deletes too.
- Make the `ReplicaShardAllocator` sensitive to leases, so that it prefers to select a location for each replica that only needs an ops-based recovery. (relates Replica allocation consider no-op #42518)
- Seqno-based synced flush: if a copy has LCP == MSN then it needs no recovery. (relates Replica allocation consider no-op #42518) A sketch of both ideas follows below.
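As mentioned above, here is a sketch of these two follow-up ideas: a lease-aware placement preference and the seqno-based no-op check. The helper names are hypothetical, not the `ReplicaShardAllocator` API:

```java
import java.util.Set;

// Hypothetical helpers illustrating the two follow-up ideas; not the ReplicaShardAllocator API.
final class ReplicaPlacementHints {
    /** Prefer a node that holds a peer recovery retention lease: it only needs an ops-based recovery. */
    static boolean nodePreferred(Set<String> nodesWithLease, String candidateNodeId) {
        return nodesWithLease.contains(candidateNodeId);
    }

    /** Seqno-based "synced flush": a copy whose local checkpoint equals the max seqno needs no recovery at all. */
    static boolean needsNoRecovery(long localCheckpoint, long maxSeqNo) {
        return localCheckpoint == maxSeqNo;
    }
}
```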
BWC issues: during a rolling upgrade, we may migrate a primary onto a new node without first establishing the appropriate leases. They can't be established before or during this promotion, so we must weaken the assertions so that they only apply to sufficiently-newly-created indices. We will still establish leases properly during peer recovery, and can establish them lazily on older indices, but they may not retain all the right history when first created.
Closed replicated indices issues: a closed index permits no replicated actions, but should not need any history to be retained. We cannot replay history into a closed index, so all recoveries must be file-based and there is no real need for leases; moreover any existing PRRLs will not be retaining any history. We cannot assert that every copy of a replicated closed index has a corresponding lease without performing replicated write actions to create such leases as we create new replicas, nor can we assert that there are no leases on a replicated closed index, since again this would require replicated write actions. We elect to ignore PRRLs on closed indices: they might exist or they might not, and either way is fine.
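To illustrate the chosen policy, here is a sketch of a lease-consistency assertion that simply skips closed indices (and, per the BWC note above, would also need to skip sufficiently old indices). This is a hypothetical helper, not the actual assertion in Elasticsearch:

```java
import java.util.Set;

// Hypothetical sketch: the consistency check tolerates any lease state on a closed index.
final class LeaseConsistencyAssertion {
    static boolean leasesConsistent(boolean indexIsClosed, Set<String> trackedNodeIds, Set<String> leaseHolderNodeIds) {
        if (indexIsClosed) {
            // A closed index permits no replicated actions, so leases may or may not exist; either is fine.
            return true;
        }
        // On an open index every tracked copy should have a corresponding peer recovery retention lease.
        return leaseHolderNodeIds.containsAll(trackedNodeIds);
    }
}
```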