Closed
Description
CCR Bootstrap from Remote
Pre-feature freeze
- Create a CcrRepository for each remote cluster
- Do we want to store in cluster state? An alternative option would be to introduce “internal repositories” that are not stored in the cluster state and are local to a node. Register CcrRepository based on settings update #36086 prototypes storing the repository in the cluster state.
- How do we handle name conflicts with installed user repositories?
- How do we ensure consistency across nodes if we go with the internal repository route?
- If we go with the internal repository route, do we want to provide any introspection capabilities for end users?
- Initialize CcrRepository instances based on settings at node start.
- Initialize empty shards through Repository#restoreShard.
- Add restore source service for leader cluster to hold state about in-progress restores. (Add CcrRestoreSourceService to track sessions #36578)
- Initialize restore session on leader when we start restoring a shard. (Add CcrRestoreSourceService to track sessions #36578)
- This restore session will need to have hooks to be cleaned up when the restore is complete.
- Implement local timeouts on leader to clean up resources used by in-progress restores. (Add local session timeouts to leader node #37438)
- Handle the failure of the leader primary shard during a restore and clean up resources on leader (Add CcrRestoreSourceService to track sessions #36578)
- Ensure CcrRepository restore works with security enabled (Allow system privilege to execute proxied actions #37508)
- Currently the repository is performing requests under the system context. Any actions that are proxy actions will be rejected using this context.
- Route PutFollowAction through the RestoreService using the CcrRepository.
- Use CcrRepository to init follower index #35719 prototypes this work
- Determine PutFollowAction semantics regarding whether it should wait for restore to complete. (Use CcrRepository to init follower index #35719)
- Determine how we want to handle pause follow actions
- No action needed
- Allow multiple concurrent restores. (SNAPSHOTS: Allow Parallel Restore Operations #36397)
- Ensure that the mapping versions are kept in sync with the leader during a restore. (Update index mappings when ccr restore complete #36879)
- The ShardFollowTasksExecutor and ShardFollowNodeTask have existing mechanisms for followers to update mapping versions as the leader mappings change.
- Implement a transport mechanism to direct a request to a specific node on the remote cluster. (Send clear session as routable remote request #36805)
- This functionality is not currently exposed on the remote client.
- It is needed both to request a file chunk and to delete the session. We do not want to route the file chunks through an intermediate node if possible.
- Restore the shards by fetching files from the leader cluster (Implement ccr file restore #37130)
- Recycle byte arrays on file transfer (Implement ccr file restore #37130)
- Potentially add rate-limiting for the restore similar to what we do in the blob store repository.
- Follower side (Implement follower rate limiting for file restore #37449)
- Leader side (Implement leader rate limiting for file restore #37677)
- Add a test that a follower can be closed and put again if it falls behind (Add test for PutFollowAction on a closed index #38236)
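The leader-side session handling described in the items above (open a session when a shard restore starts, clean it up on completion, on idle timeout, or when the leader primary shard fails) can be sketched roughly as follows. This is a minimal Python illustration of the bookkeeping only, not the CcrRestoreSourceService code; the class name, method names, and the injectable clock are all hypothetical.

```python
import itertools
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RestoreSession:
    session_id: str
    shard: str
    last_activity: float


class RestoreSessionRegistry:
    """Hypothetical leader-side registry of in-progress restore sessions."""

    def __init__(self, idle_timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._idle_timeout = idle_timeout_seconds
        self._clock = clock  # injectable so timeouts can be tested without sleeping
        self._sessions: Dict[str, RestoreSession] = {}
        self._ids = itertools.count(1)

    def open_session(self, shard: str) -> str:
        """Called when the follower starts restoring a shard."""
        session_id = f"session-{next(self._ids)}"
        self._sessions[session_id] = RestoreSession(session_id, shard, self._clock())
        return session_id

    def touch(self, session_id: str) -> None:
        """Record activity (e.g. a file chunk request); KeyError if the session is gone."""
        self._sessions[session_id].last_activity = self._clock()

    def close_session(self, session_id: str) -> bool:
        """Hook for when the restore completes; returns False if already cleaned up."""
        return self._sessions.pop(session_id, None) is not None

    def close_sessions_for_shard(self, shard: str) -> List[str]:
        """Clean up every session for a shard, e.g. when the leader primary fails."""
        closed = [sid for sid, s in self._sessions.items() if s.shard == shard]
        for sid in closed:
            del self._sessions[sid]
        return closed

    def expire_idle(self) -> List[str]:
        """Drop sessions with no activity for at least the idle timeout."""
        now = self._clock()
        expired = [sid for sid, s in self._sessions.items()
                   if now - s.last_activity >= self._idle_timeout]
        for sid in expired:
            del self._sessions[sid]
        return expired
```

In this sketch a periodic task on the leader would call expire_idle(), so a follower that disappears mid-restore cannot hold leader resources indefinitely.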
Post-feature freeze
- Set an increased master node timeout on the call to update mappings. (Set update mappings mater node timeout to 30 min #38439)
- Implement timeouts or retry policy on the follower
- Individual action timeouts (Add timeout for ccr recovery action #37840)
- Ensure that we have mixed version tests (Add rolling upgrade multi cluster test module #38277)
- Look into how we do BWC logic. Currently we only allow the restore if the follower master is on the same version as, or a higher version than, the highest-versioned leader data node.
- Add test that we disallow PutFollowAction and throw exception if leader cluster is on higher version than follower (Add rolling upgrade multi cluster test module #38277)
- Related issue: Improve error messages in ccr during rolling upgrade #39230
- Document ccr.indices.recovery.max_bytes_per_sec (Add documentation on remote recovery #39483)
- Document ccr.indices.recovery.recovery_activity_timeout (Add documentation on remote recovery #39483)
- Document ccr.indices.recovery.internal_action_timeout (Add documentation on remote recovery #39483)
- Document ccr.indices.recovery.chunk_size (Add documentation on remote recovery #39483)
- Document ccr.indices.recovery.max_concurrent_file_chunks (Add documentation on remote recovery #39483)
- Document behavior about an index falling behind (Expand following documentation in ccr overview #39936)
- Considering splitting overview docs information into a mechanics of replication page.
- Considering moving information about leader index falling behind into troubleshooting.
- Determine the ideal buffer size for file transfers.
- In the original PR, we are using 64KB, but this is something that probably should be benchmarked.
- Make Ccr recovery file chunk size configurable #38370 changes this to 1 MB and makes it configurable as we determine the best approach.
- Determine what types of connections we want to use for recovery
- 6 connections of type TransportRequestOptions.Type.REG, TransportRequestOptions.Type.PING. The recovery actions use the REG channel type. However, we do support dedicated RECOVERY channel types. We could consider adding RECOVERY channels to the remote cluster connection profile and use those.
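To make the interaction between the chunk size and the byte-rate limit discussed above more concrete, here is a rough Python sketch. It only illustrates the arithmetic: splitting a file into fixed-size chunk requests and computing how long a sender would pause to keep its average rate under a limit. The function names are hypothetical, the pause calculation is a simplified average-rate throttle rather than the rate limiter the restore code actually uses, and the 1 MB figure is the chunk size mentioned in #38370.

```python
from typing import List, Tuple


def chunk_ranges(file_length: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split a file into (offset, length) requests of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(offset, min(chunk_size, file_length - offset))
            for offset in range(0, file_length, chunk_size)]


def throttle_delay(bytes_sent: int, elapsed_seconds: float,
                   max_bytes_per_sec: float) -> float:
    """Seconds to pause so the average rate so far stays at or below the limit."""
    if max_bytes_per_sec <= 0:
        raise ValueError("max_bytes_per_sec must be positive")
    # Time the transfer *should* have taken at the limit, minus time actually spent.
    return max(0.0, bytes_sent / max_bytes_per_sec - elapsed_seconds)


ONE_MB = 1024 * 1024
ranges = chunk_ranges(2_500_000, ONE_MB)  # two full chunks plus a shorter final chunk
pause = throttle_delay(bytes_sent=ONE_MB, elapsed_seconds=0.25,
                       max_bytes_per_sec=2 * ONE_MB)  # 0.5 s budget, 0.25 s used
```

A larger chunk size means fewer round trips per file but longer individual pauses under the same limit, which is one reason the buffer size is worth benchmarking rather than guessing.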