Jira Link: [DB-310](https://yugabyte.atlassian.net/browse/DB-310)
Product-Level Description and Requirements for the First Iteration
The Use Cases Supported
For the use case where only one cluster is taking writes at a given time, we will support the following transaction semantics on the target cluster:
- Multi-shard atomic reads: A transaction spanning tablets A and B will either be fully readable or not readable at all.
- Ordered multi-shard reads within a single session: If timestamp(txn1) < timestamp(txn2) on the source universe, then the user will never be able to read txn2 before txn1 in a single session.
- Consistent cutover on disaster recovery. When the target is promoted, it will have a clean cut of the system and preserve atomic and ordered reads.
For the above, multi-shard can refer to single-table, multi-table, or table-index operations. So this encapsulates guarantees for transactional updates to a main table and its index, as well as two separate transactions to a parent and then a child foreign-key table. A minimal sketch of the visibility rule follows.
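To make the atomicity guarantee concrete, here is a minimal illustrative sketch (hypothetical names, not YugabyteDB source): the target reads at the minimum safe time across tablets, so a multi-shard transaction is either fully visible or not visible at all.

```cpp
// Minimal illustrative sketch (hypothetical names, not YugabyteDB source):
// why reading at the minimum safe time across tablets yields atomic
// multi-shard reads on the target.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

using HybridTime = uint64_t;  // stand-in for a hybrid timestamp

// Safe time reported by each replicated tablet on the target.
HybridTime ConsistentReadTime(const std::vector<HybridTime>& tablet_safe_times) {
  // The read point is bounded by the laggiest tablet, so a transaction that
  // committed at commit_ht is visible only if commit_ht <= read point on
  // every tablet it touched: it is either fully visible or not at all.
  return *std::min_element(tablet_safe_times.begin(), tablet_safe_times.end());
}

int main() {
  // Tablet A has replicated up to ht=100, tablet B only up to ht=95.
  std::vector<HybridTime> safe_times = {100, 95};
  const HybridTime read_ht = ConsistentReadTime(safe_times);  // 95

  // A transaction committed at ht=97 wrote to both A and B. Its rows on A
  // have already been applied, but reading at ht=95 hides them everywhere,
  // so a reader never observes a partially applied transaction.
  const HybridTime commit_ht = 97;
  std::cout << "read_ht=" << read_ht
            << " txn_visible=" << (commit_ht <= read_ht) << "\n";
  return 0;
}
```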
Product-level Changes and Semantics
- Reads on the target will occur at a timestamp bounded by the laggiest tablet in the system. If the lag is 1s, the read timestamp for any read will be at least 1s in the past (see the sketch after this list).
- This read time will be applied on all tables on the target, whether or not they are under replication.
- Each server will periodically get an updated version of the read time from the master leader. So if the master process is down or there is a master-tserver partition, the read time will not advance and reads will become increasingly stale.
- In the case of DR cutover, all tablets will cut over to the timestamp of the laggiest tablet. That is, if the laggiest tablet is behind by 1s, every tablet will lose 1s worth of data.
- A toggle will be introduced between reading at a consistent but stale timestamp and reading at the current time. The tradeoff is the atomic and ordered guarantees of an older timestamp versus the lower latency and reduced data loss on cutover that come with real-time reads.
- When target-side read staleness exceeds the history retention cutoff (currently 15m), the target will reject reads to avoid serving incorrect data.
- Transaction status WAL retention will need to be increased on the source, which will increase space utilization.
- The user will need to do a full backup/restore of all tables in case the transaction status replication stream can't catch up. This is because the transaction status table has no B/R mechanism, so the only way to resolve transactions on the target for user tables is to do a B/R of those tables.
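The staleness and rejection behavior above can be summarized in a small sketch (hypothetical types and names; the real read path uses hybrid timestamps rather than a monotonic clock): the read point only advances on a successful master refresh, and reads older than the history retention cutoff are rejected.

```cpp
// Minimal sketch (hypothetical names, not the actual master/tserver API):
// the read time only advances when the master refresh succeeds, and reads
// are rejected once staleness exceeds the history retention cutoff.
#include <chrono>
#include <iostream>
#include <optional>

using Clock = std::chrono::steady_clock;

struct TargetReadPoint {
  std::optional<Clock::time_point> last_refresh;  // last successful master refresh
  Clock::duration read_lag{};                     // lag of the laggiest tablet at refresh
};

// History retention cutoff; the issue cites 15 minutes as the current default.
constexpr auto kHistoryRetention = std::chrono::minutes(15);

// Returns the timestamp to read at, or nullopt if the read must be rejected.
std::optional<Clock::time_point> PickReadTime(const TargetReadPoint& p) {
  if (!p.last_refresh) return std::nullopt;  // never heard from the master
  const auto read_ht = *p.last_refresh - p.read_lag;
  // If the master is down or partitioned away, last_refresh stops advancing
  // and the effective staleness (now - read_ht) grows without bound.
  if (Clock::now() - read_ht > kHistoryRetention) {
    return std::nullopt;  // older than history retention: reject rather than
                          // risk reading compacted-away versions
  }
  return read_ht;
}

int main() {
  TargetReadPoint fresh{Clock::now(), std::chrono::seconds(1)};
  std::cout << "fresh read allowed=" << PickReadTime(fresh).has_value() << "\n";  // 1

  TargetReadPoint stale{Clock::now() - std::chrono::minutes(20),
                        std::chrono::seconds(1)};
  std::cout << "stale read allowed=" << PickReadTime(stale).has_value() << "\n";  // 0
  return 0;
}
```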
Situations not supported
- Active-active use cases (both sides taking independent writes) will not preserve transactional guarantees. For this case, we will continue to use last-writer-wins semantics.
- Ordered multi-shard reads across sessions: If timestamp(txn1) < timestamp(txn2) on the source universe, then the user may still be able to read txn2 before txn1 across multiple sessions.
- 1:N, N:1, or triangle formations
- Local transaction status tables on the source cluster
Subtasks
- [xCluster] Replicate txn status table #11199
- [xCluster] Only apply committed records to regular rocksdb if the tablet is caught up to commit_ht #11200
- [xCluster] Send safe time of tablet to target as part of GetChanges response #11201
- [xCluster] Keep min_safe_time calculation periodically refreshed on the master #11202
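Subtasks #11201 and #11202 together amount to the aggregation below, shown as a minimal sketch with hypothetical types (the actual flow is via GetChanges responses and master-side refresh): each tablet reports the safe time it has replicated up to, and the master periodically publishes the minimum as the cluster-wide read point.

```cpp
// Minimal sketch of the safe-time aggregation the subtasks describe
// (hypothetical types; not the real GetChanges/master RPC interfaces).
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <map>
#include <string>

using HybridTime = uint64_t;

class SafeTimeTracker {
 public:
  // Called when a GetChanges response delivers a tablet's latest safe time.
  void ReportTabletSafeTime(const std::string& tablet_id, HybridTime ht) {
    // Safe times are monotonic per tablet; keep the max seen so far.
    auto& cur = per_tablet_[tablet_id];
    cur = std::max(cur, ht);
  }

  // Periodically invoked refresh: the published min_safe_time only moves
  // forward, so readers never observe it going backwards.
  HybridTime RefreshMinSafeTime() {
    HybridTime min_ht = std::numeric_limits<HybridTime>::max();
    for (const auto& entry : per_tablet_) min_ht = std::min(min_ht, entry.second);
    if (!per_tablet_.empty() && min_ht > published_) published_ = min_ht;
    return published_;
  }

 private:
  std::map<std::string, HybridTime> per_tablet_;
  HybridTime published_ = 0;
};

int main() {
  SafeTimeTracker tracker;
  tracker.ReportTabletSafeTime("tablet-a", 100);
  tracker.ReportTabletSafeTime("tablet-b", 95);
  std::cout << "min_safe_time=" << tracker.RefreshMinSafeTime() << "\n";  // 95
  return 0;
}
```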
Open Problems
- What do we do when we need to bootstrap the txn status table? Since there is no rocksdb for this table and all committed txns are eventually gc'ed, we may need to bootstrap the entire cluster over again to resolve lost transactions.
- How do we ensure that two versions of ht_read on two different servers are bounded? On a single cluster, we use clock skew to bound the read time on any two servers. However, ht_read doesn't have any bound between two consecutive versions.
- Do we need to change history retention at all? Given we're reading at a timestamp in the past, if ht_read falls below history retention, we may no longer be able to serve reads.