[Segment Replication & Remote Translog] Back-pressure and Recovery for lagging replica copies #4478
Description
Is your feature request related to a problem? Please describe.
Once segment-based replication is enabled for an index, the replica no longer indexes any operations; it only writes them to the translog for durability. A successful translog write is therefore taken to mean that the replica is caught up. However, since nothing is applied on the replica except the segments copied on each checkpoint refresh, a replica that is overloaded or has slow I/O may not have successfully processed a checkpoint for a while and would still be serving reads.
Currently, once the translog has been written on the replica, there is no additional mechanism to apply back-pressure on the primary when the replica is slow to process checkpoints. Remote translog would aggravate this: remote translog writes on the primary handle durability altogether, so there would be no I/O on the replica at all.
Describe the solution you'd like
We need mechanisms to apply back-pressure on the primary and, as a last resort, to fail the replica copy if it is unable to process any further checkpoints beyond a threshold.
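To make the request concrete, here is a minimal sketch of the kind of tracking the primary could do. It is not OpenSearch code: the class `ReplicaLagTracker`, its methods, and both thresholds (`max_checkpoints_behind`, `max_seconds_behind`) are hypothetical names chosen for illustration, and Python is used only for brevity. The assumption is that the primary records when each checkpoint is published and which checkpoint each replica last processed, then rejects writes when a replica falls too many checkpoints behind and fails the replica when it has been stalled for too long.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Action(Enum):
    ACCEPT = "accept"                # replica is close enough to the primary
    REJECT_WRITES = "reject_writes"  # apply back-pressure on the primary
    FAIL_REPLICA = "fail_replica"    # last resort: fail the replica copy


@dataclass
class ReplicaLagTracker:
    # Hypothetical thresholds; real values would be index or cluster settings.
    max_checkpoints_behind: int = 4
    max_seconds_behind: float = 300.0
    _publish_times: Dict[int, float] = field(default_factory=dict)
    _processed: Dict[str, int] = field(default_factory=dict)
    _latest: int = 0

    def add_replica(self, replica_id: str) -> None:
        # A new replica starts at the latest checkpoint published so far.
        self._processed[replica_id] = self._latest

    def on_checkpoint_published(self, version: int, now: float) -> None:
        # Called on the primary each time a refresh publishes a checkpoint.
        self._latest = version
        self._publish_times[version] = now

    def on_checkpoint_processed(self, replica_id: str, version: int) -> None:
        # Called when a replica reports that it finished copying segments.
        current = self._processed.get(replica_id, 0)
        self._processed[replica_id] = max(current, version)

    def evaluate(self, replica_id: str, now: float) -> Action:
        processed = self._processed[replica_id]
        pending = [v for v in self._publish_times if v > processed]
        if not pending:
            return Action.ACCEPT
        # Time since the oldest checkpoint the replica has not yet processed.
        stalled_for = now - self._publish_times[min(pending)]
        if stalled_for > self.max_seconds_behind:
            return Action.FAIL_REPLICA
        if len(pending) > self.max_checkpoints_behind:
            return Action.REJECT_WRITES
        return Action.ACCEPT
```

In this sketch the write path on the primary would call `evaluate` for each replica before accepting an indexing request. Using both a checkpoint-count limit and a time limit is a design choice of the sketch, not something the issue prescribes.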
Describe alternatives you've considered
Status: Done