Description
The goal is that we can perform an operations-based recovery for all "reasonable" shard copies C:
- There is a peer recovery retention lease L corresponding with C.
- Every in-sync shard copy has a complete history of operations above the retained seqno of L.
- The retained seqno r of L is no greater than the local checkpoint of the last safe commit of C.
Reasonable shard copies comprise all the copies that are currently being tracked, as well as all the copies that "might be a recovery target": if the shard is not fully allocated then any copy that has been tracked in the last `index.soft_deletes.retention_lease.period` (i.e. 12h) might reasonably be a recovery target.
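To make the invariant concrete, here is a minimal sketch of the per-copy check in Java; the types and field names (`PeerRecoveryLease`, `ShardCopy`, and so on) are illustrative placeholders, not the actual Elasticsearch classes:

```java
import java.util.Collection;

// Illustrative placeholder types; the real Elasticsearch classes differ.
record PeerRecoveryLease(String nodeId, long retainedSeqNo) {}
record ShardCopy(String nodeId, long localCheckpointOfSafeCommit) {}

final class OpsBasedRecoveryInvariant {
    /**
     * An ops-based recovery of {@code copy} is possible when a lease L exists for it and
     * the retained seqno of L is no greater than the local checkpoint of the last safe
     * commit of the copy, so replaying the retained history can bring the copy up to date.
     */
    static boolean opsBasedRecoveryPossible(ShardCopy copy, Collection<PeerRecoveryLease> leases) {
        return leases.stream()
            .anyMatch(lease -> lease.nodeId().equals(copy.nodeId())
                && lease.retainedSeqNo() <= copy.localCheckpointOfSafeCommit());
    }
}
```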
We also require that history is eventually released: in a stable cluster, for every operation with seqno s below the MSN of a replication group, eventually there are no leases that retain s:
- Every active shard copy eventually advances the local checkpoint of its last safe commit (LCPoSC) past s.
- Every lease for an active shard copy eventually also passes s.
- Every inactive shard copy eventually either becomes active or else its lease expires.
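The last point is purely time-based. A minimal sketch of that expiry check, assuming a simple last-renewal timestamp (hypothetical helper, not the Elasticsearch implementation):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of time-based expiry for a lease whose shard copy never comes back.
final class LeaseExpiryCheck {
    /** A lease that has not been renewed within the retention period has expired, releasing its history. */
    static boolean hasExpired(Instant lastRenewed, Duration retentionPeriod, Instant now) {
        return lastRenewed.plus(retentionPeriod).isBefore(now);
    }
}
```

With the default `index.soft_deletes.retention_lease.period` of 12h this amounts to `hasExpired(lastRenewed, Duration.ofHours(12), Instant.now())`.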
Concretely, this should ensure that operations-based recoveries are possible in the following cases (subject to the copy being allocated back to the same node):
- a shard copy C is offline for a short period (< 12h)
  - even if the primary is relocated or a replica is promoted to primary while C is offline.
  - even if C was part of a closed/frozen/readonly index that was opened while C was offline
    - but not if the index was closed/frozen again before C comes back
      - TBD: maybe we are ok with this being a file-based recovery?
- a full-cluster restart
This breaks into a few conceptually-separate pieces:
- Adjust peer recovery to start by recovering the target using the local translog as far as (the local copy of) the global checkpoint (Use global checkpoint as starting seq in ops-based recovery #43463)
  - this means we can discard history that is behind every known global checkpoint
  - replicas already share with the primary the necessary information about the movement of the global checkpoint
- Create peer recovery retention leases to retain the history needed by each shard (Create peer-recovery retention leases #43190, Add missing GCP update #43632)
  - For the primary, on primary activation
  - For replicas, during peer recovery
  - Retention leases don't guarantee that history is retained by every copy
- Lazily create retention leases for tracked shards whose leases don't exist because the primary was relocated from an older version. (Create missing PRRLs after primary activation #44009)
- Advance existing peer recovery retention leases according to the history information exposed by each shard copy. (Advance PRRLs to match GCP of tracked shards #43751, Prevent invalid renewals of PRRLs #43898)
- Make peer recovery work together with retention leases (Recover peers using history from Lucene #44853); see the sketch after this list
  - Use the existence of a retention lease as the deciding factor for performing an ops-based recovery
  - Reinstate recovery from history stored in Lucene if soft deletes are enabled
- Tests should sometimes randomly set the lease expiry time very low to ensure that everything still works if leases are expiring. (Randomise retention lease expiry time #44067)
- Discard translog more enthusiastically now that we don't need to retain it any more (Ignore translog retention policy if soft-deletes enabled #45473)
- Expire leases based on more than time: if a file-based recovery would clearly be cheaper than an ops-based recovery then we may as well throw a lease away (Only retain reasonable history for peer recoveries #45208)
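As referenced in the fifth piece above, here is a sketch of how that decision combines with the global-checkpoint starting point from the first piece on the recovery source. The names are illustrative only, not the actual Elasticsearch implementation:

```java
import java.util.OptionalLong;

// Illustrative sketch; the real recovery source handler is considerably more involved.
final class RecoveryPlanner {
    /**
     * The target first replays its own translog up to its local copy of the global
     * checkpoint, so the source only needs to ship operations from the next seqno onwards.
     */
    static long requiredStartingSeqNo(long targetLocalGlobalCheckpoint) {
        return targetLocalGlobalCheckpoint + 1;
    }

    /**
     * An ops-based recovery is chosen only if a peer recovery retention lease for the
     * target exists and retains everything from the required starting seqno onwards;
     * otherwise the recovery falls back to copying files.
     */
    static boolean useOpsBasedRecovery(OptionalLong leaseRetainedSeqNo, long requiredStartingSeqNo) {
        return leaseRetainedSeqNo.isPresent() && leaseRetainedSeqNo.getAsLong() <= requiredStartingSeqNo;
    }
}
```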
Follow-up work, out of scope for the feature branches:
- Adjust translog retention
  - Should we retain translog generations according to retention leases too?
  - Trim translog files eagerly during the "verify-before-close" step for closed/frozen indices (Trim translog for closed indices #43156)
  - Properly support peer-recovery retention leases on indices that are not using soft deletes too.
- Make the `ReplicaShardAllocator` sensitive to leases, so that it prefers to select a location for each replica that only needs an ops-based recovery. (relates Replica allocation consider no-op #42518)
- Seqno-based synced flush: if a copy has LCP == MSN then it needs no recovery. (relates Replica allocation consider no-op #42518) A sketch of both ideas follows below.
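As mentioned above, here is a sketch of these two follow-up ideas: a lease-aware placement preference and the seqno-based no-op check. The helper names are hypothetical, not the `ReplicaShardAllocator` API:

```java
import java.util.Set;

// Hypothetical helpers illustrating the two follow-up ideas; not the ReplicaShardAllocator API.
final class ReplicaPlacementHints {
    /** Prefer a node that holds a peer recovery retention lease: it only needs an ops-based recovery. */
    static boolean nodePreferred(Set<String> nodesWithLease, String candidateNodeId) {
        return nodesWithLease.contains(candidateNodeId);
    }

    /** Seqno-based "synced flush": a copy whose local checkpoint equals the max seqno needs no recovery at all. */
    static boolean needsNoRecovery(long localCheckpoint, long maxSeqNo) {
        return localCheckpoint == maxSeqNo;
    }
}
```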
BWC issues: during a rolling upgrade, we may migrate a primary onto a new node without first establishing the appropriate leases. They can't be established before or during this promotion, so we must weaken the assertions so that they only apply to sufficiently-newly-created indices. We will still establish leases properly during peer recovery, and can establish them lazily on older indices, but they may not retain all the right history when first created.
Closed replicated indices issues: a closed index permits no replicated actions, but should not need any history to be retained. We cannot replay history into a closed index, so all recoveries must be file-based and there is no real need for leases; moreover any existing PRRLs will not be retaining any history. We cannot assert that every copy of a replicated closed index has a corresponding lease without performing replicated write actions to create such leases as we create new replicas, nor can we assert that there are no leases on a replicated closed index, since again this would require replicated write actions. We elect to ignore PRRLs on closed indices: they might exist or they might not, and either way is fine.
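To illustrate the chosen policy, here is a sketch of a lease-consistency assertion that simply skips closed indices (and, per the BWC note above, would also need to skip sufficiently old indices). This is a hypothetical helper, not the actual assertion in Elasticsearch:

```java
import java.util.Set;

// Hypothetical sketch: the consistency check tolerates any lease state on a closed index.
final class LeaseConsistencyAssertion {
    static boolean leasesConsistent(boolean indexIsClosed, Set<String> trackedNodeIds, Set<String> leaseHolderNodeIds) {
        if (indexIsClosed) {
            // A closed index permits no replicated actions, so leases may or may not exist; either is fine.
            return true;
        }
        // On an open index every tracked copy should have a corresponding peer recovery retention lease.
        return leaseHolderNodeIds.containsAll(trackedNodeIds);
    }
}
```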