@@ -0,0 +1,46 @@
# [Transaction Layer](https://www.cockroachlabs.com/docs/v25.4/architecture/transaction-layer)

## Writes
* Write intents are replicated via Raft
* They are a combination of a provisional value and an exclusive lock
* Unreplicated locks, associated with a transaction's writes, represent a provisional, uncommitted state
* Locks are stored in an in-memory, per-node lock table and managed by the concurrency control manager
* Transaction record is stored in the range where the first write occurs, including the transaction's current state (pending, staging, committed, aborted)
* As write intents are created, there are checks for a newer committed value
* If a newer committed value exists, the transaction may be restarted
* If write intents or locks already exist on the same keys, the situation is resolved as a transaction conflict
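
The checks above can be sketched as a single function. This is a minimal illustration, not CockroachDB's implementation: `Key`, `WriteOutcome`, and `try_write` are hypothetical names, timestamps are plain integers, and the order of the two checks is an assumption. The notes say the transaction *may* be restarted when a newer committed value exists; the sketch simply reports that case.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WriteOutcome(Enum):
    INTENT_WRITTEN = "intent_written"
    RESTART = "restart"    # a newer committed value exists
    CONFLICT = "conflict"  # another transaction holds an intent or lock


@dataclass
class Key:
    committed_ts: int = 0               # timestamp of the newest committed value
    intent_owner: Optional[str] = None  # transaction holding an intent or lock


def try_write(key: Key, txn_id: str, txn_ts: int) -> WriteOutcome:
    # An intent or lock held by a different transaction is a conflict.
    if key.intent_owner is not None and key.intent_owner != txn_id:
        return WriteOutcome.CONFLICT
    # A committed value newer than this transaction's timestamp
    # is reported as a (possible) restart.
    if key.committed_ts > txn_ts:
        return WriteOutcome.RESTART
    # Otherwise lay down the write intent (provisional value + exclusive lock).
    key.intent_owner = txn_id
    return WriteOutcome.INTENT_WRITTEN
```

For example, with `k = Key(committed_ts=5)`, a write at timestamp 3 reports `RESTART`, a write by `"a"` at timestamp 10 lays down an intent, and a later write by `"b"` reports `CONFLICT`.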

## Reads
* If a locking read encounters any existing locks, the operation is resolved as a transaction conflict
* Strongly-consistent reads by default
* Reads go through the leaseholder and see all writes performed by writers that committed before the reading transaction (serializable isolation) or statement (read committed isolation) started
* Stale reads are faster and do not need to go through the leaseholder
* They read from a local replica at a timestamp that is never higher than the closed timestamp
* Read committed isolation, by default, creates exclusive locks on a row
* Only one transaction can hold an exclusive lock on a row at a time
* Only the transaction holding the exclusive lock can write to the row
* Exclusive locks are replicated via Raft
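
The choice between the two kinds of read described above can be sketched as follows. This is a hedged illustration only: `pick_replica` and its parameters are hypothetical, and timestamps are plain integers. It encodes the one rule stated in the notes, that a stale read is served locally only at a timestamp no higher than the closed timestamp.

```python
def pick_replica(read_ts: int, closed_ts: int, stale_ok: bool) -> str:
    """Decide where a read is served (illustrative only)."""
    # A stale read can be served by a local replica as long as its
    # timestamp is never higher than the closed timestamp.
    if stale_ok and read_ts <= closed_ts:
        return "local replica"
    # Strongly-consistent reads (the default) go through the leaseholder.
    return "leaseholder"
```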

## Commits
* Checks the executing transaction to see if it has been aborted
* Otherwise, the transaction record state is set to staging
* Checks the transaction's pending write intents to see if they have been successfully replicated across the cluster
* When the transaction passes these checks, CockroachDB responds with the transaction's success to the client. The transaction has been committed.
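
The commit steps above can be sketched as a small state transition. This is a simplified model under stated assumptions: `TxnState` mirrors the four record states named in these notes, while `try_commit` and the `intents_replicated` flags are hypothetical. What happens when an intent has not yet replicated is outside the sketch.

```python
from enum import Enum
from typing import Sequence, Tuple


class TxnState(Enum):
    PENDING = "pending"
    STAGING = "staging"
    COMMITTED = "committed"
    ABORTED = "aborted"


def try_commit(state: TxnState,
               intents_replicated: Sequence[bool]) -> Tuple[TxnState, bool]:
    """Return (new record state, whether success is reported to the client)."""
    # Step 1: an aborted transaction cannot commit.
    if state is TxnState.ABORTED:
        return TxnState.ABORTED, False
    # Step 2: otherwise the record moves to staging.
    state = TxnState.STAGING
    # Step 3: success is reported once every pending intent has replicated.
    acked = all(intents_replicated)
    return state, acked
```

Note that the record is still in the staging state when the client is told the commit succeeded; moving it to committed is part of cleanup.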

## Cleanup
* The coordinating node moves the state of the transaction record from staging to committed
* Resolves the transaction's write intents to MVCC values by removing the element that points it to the transaction record
* Deletes the write intents
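
A minimal sketch of the cleanup steps above, assuming a toy in-memory model: the `WriteIntent` fields, the dictionary-based transaction record, and the `mvcc` map keyed by `(key, timestamp)` are all hypothetical stand-ins, not CockroachDB's storage format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class WriteIntent:
    key: str
    ts: int
    value: str
    txn_id: str  # the element that points the intent at its transaction record


def cleanup(record: Dict[str, str],
            intents: List[WriteIntent],
            mvcc: Dict[Tuple[str, int], str]) -> None:
    # Move the transaction record from staging to committed.
    record["state"] = "committed"
    # Resolve each intent to a plain MVCC value: keep key, timestamp, and
    # value, and drop the pointer to the transaction record.
    for intent in intents:
        mvcc[(intent.key, intent.ts)] = intent.value
    # Delete the write intents.
    intents.clear()
```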

## Time and hybrid logical clocks
* Hybrid-logical clocks are composed of a physical component (a local wall time) and a logical component (used to distinguish between events with the same physical component)
* HLC time is >= wall time
* Gateway node picks a timestamp for the transaction using HLC time
* This timestamp is used to track versions of values as well as provide transactional isolation guarantees
* When nodes send requests to other nodes, they include the timestamp generated by their local HLC
* When nodes receive requests, they inform their local HLC of the timestamp supplied by the sender
* This lets the node serve reads for data it stores by ensuring the transaction reading the data is at an HLC time greater than the timestamp of the MVCC value it is reading
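
The clock behavior described above can be illustrated with a generic hybrid logical clock. This is a textbook-style sketch, not CockroachDB's code: `Timestamp`, `HLC`, `now`, and `update` are illustrative names, and the wall clock is injected as a function so the example is deterministic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(order=True, frozen=True)
class Timestamp:
    wall: int         # physical component (local wall time)
    logical: int = 0  # logical component, breaks ties within one wall time


class HLC:
    def __init__(self, wall_clock: Callable[[], int]) -> None:
        self._wall_clock = wall_clock
        self._last = Timestamp(0, 0)

    def now(self) -> Timestamp:
        wall = self._wall_clock()
        if wall > self._last.wall:
            # Wall time moved forward: adopt it and reset the logical part.
            self._last = Timestamp(wall, 0)
        else:
            # Wall time did not advance: bump the logical part instead,
            # so HLC time stays >= wall time and strictly increases.
            self._last = Timestamp(self._last.wall, self._last.logical + 1)
        return self._last

    def update(self, remote: Timestamp) -> None:
        # On receiving a request, ratchet forward to the sender's timestamp
        # if it is ahead of anything this clock has produced or seen.
        if remote > self._last:
            self._last = remote
```

After `update` sees a timestamp ahead of the local wall time, the next `now()` call returns a value later than that timestamp, which is what keeps a receiving node's HLC time ahead of the requests it has seen.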

### Max clock offset enforcement
* CockroachDB requires moderate wall clock synchronization to preserve data consistency, so if a node detects that its clock is out of sync with at least half of the other nodes in the cluster by 80% of the maximum offset allowed, it crashes immediately
* It is therefore important to keep clocks from drifting too far apart by running NTP or other clock synchronization software
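
The rule above can be sketched as a predicate over measured clock offsets. This is an illustration under assumptions: `should_crash` is a hypothetical helper, offsets are given in milliseconds relative to each other node, and treating an offset exactly equal to 80% of the maximum as out of sync is a choice made here, not something the notes specify.

```python
from typing import Sequence


def should_crash(offsets_ms: Sequence[float], max_offset_ms: float) -> bool:
    """True if this node is out of sync with at least half of the other nodes."""
    if not offsets_ms:
        return False  # no other nodes to compare against
    threshold = 0.8 * max_offset_ms
    # Count peers whose measured offset reaches the threshold in either direction.
    out_of_sync = sum(1 for offset in offsets_ms if abs(offset) >= threshold)
    # "At least half of the other nodes."
    return out_of_sync * 2 >= len(offsets_ms)
```

With an illustrative maximum offset of 500 ms, the threshold is 400 ms: offsets of `[10, -450, 420, 5]` put two of four peers over it and the check returns `True`, while `[10, -450, 20, 5]` returns `False`.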