This document describes the current implementation in this repository. It is focused on behavior that affects correctness, crash safety, and operational tuning.
- Embedded key-value engine with predictable point-read latency.
- High write throughput via append-only WAL + asynchronous checkpoint publish.
- Snapshot isolation (SI) with MVCC visibility.
- Key/value separation: large values stored in blob files.
- Crash-safe metadata updates and idempotent startup cleanup.
- `src/store` - Owns startup/shutdown orchestration, recovery, GC thread, and public Mace APIs.
- `src/meta` - Owns persistent metadata (`btree_store`) and in-memory stat/index caches.
- `src/map` - Bucket runtime (dirty-page generations, checkpoint build, loader/cache, flow control).
- `src/index` - Bw-Tree style index, delta chains, split/merge/compact paths, iterator paths.
- `src/cc` - Concurrency control, writer groups, commit tree, WAL encode/decode, rollback.
- `src/utils/observe.rs` - Fixed-cardinality observability model.
Manifest persists metadata in BTree buckets.
Main buckets:
- `numerics` - global counters and orphan markers (`odf_*`, `obf_*`).
- `bucket_metas` - bucket name -> `BucketMeta { bucket_id }`.
- `bucket_frontier` - `BucketDurableFrontier` per bucket (per-writer-group durable LSN frontier).
- `pending_del` - buckets pending physical table cleanup.
- `page_table_{bucket_id}` - `pid -> logical_addr` map.
- `data_interval_{bucket_id}` / `blob_interval_{bucket_id}` - `addr-range -> file_id` mapping.
- `data_stat` / `blob_stat` - per-file stats + bitmap references used by GC.
- `obsolete_data` / `obsolete_blob` - files pending physical delete.
Addressing model:
- each bucket has an independent logical address space (`BucketState::next_addr`).
- page table maps `pid -> swip` (`tagged` means on-disk logical address).
- interval tables map logical address ranges to data/blob file IDs.
- relocation tables in file footers map logical address to file offset.
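The translation chain above can be sketched as follows. This is a minimal illustration only: `IntervalMap`, `Relocation`, and their field names are assumptions, not the repository's actual types, and only the lookup semantics (interval range match, then relocation offset) are modeled.

```rust
use std::collections::BTreeMap;

/// Illustrative interval table: start-of-range -> (end-exclusive, file_id).
struct IntervalMap {
    ranges: BTreeMap<u64, (u64, u32)>,
}

impl IntervalMap {
    /// Find the file covering `addr`, if any range contains it.
    fn find(&self, addr: u64) -> Option<u32> {
        self.ranges
            .range(..=addr)
            .next_back()
            .filter(|(_, (end, _))| addr < *end)
            .map(|(_, (_, file_id))| *file_id)
    }
}

/// Illustrative relocation table from a file footer: logical addr -> file offset.
struct Relocation {
    entries: BTreeMap<u64, u64>,
}

/// Resolve a tagged on-disk logical address to (file_id, file_offset),
/// mimicking the interval-table -> relocation-table chain.
fn resolve(addr: u64, ivl: &IntervalMap, reloc: &Relocation) -> Option<(u64, u64)> {
    let file_id = ivl.find(addr)? as u64;
    let off = *reloc.entries.get(&addr)?;
    Some((file_id, off))
}

fn main() {
    let ivl = IntervalMap { ranges: BTreeMap::from([(0, (100, 7))]) };
    let reloc = Relocation { entries: BTreeMap::from([(42, 4096)]) };
    assert_eq!(resolve(42, &ivl, &reloc), Some((7, 4096)));
    // An address outside every interval is the "interval lookup hole"
    // failure mode described later in this document.
    assert_eq!(resolve(200, &ivl, &reloc), None);
}
```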
Creation (`Manifest::create_bucket`):
- Take `structural_lock`.
- Reject duplicates and enforce `MAX_BUCKETS` via `nr_buckets`.
- Allocate `bucket_id`, create runtime context/state lazily, initialize frontier.
- Commit one metadata txn (`numerics`, `bucket_frontier`, `bucket_metas`).
Loading (`load_bucket_context`):
- lazy-load runtime context when first accessed.
- recover page table + data/blob intervals from metadata.
- cache context in `BucketMgr`.
Unload (`unload_bucket`):
- runtime-only operation.
- unpublish from in-memory maps, checkpoint-and-reclaim loaded context.
- persisted bucket metadata/data are kept.
Delete (`delete_bucket`):
- logical delete txn removes the `bucket_metas` entry, inserts `pending_del`, records obsolete files.
- runtime context/state are removed first.
- GC later performs physical cleanup (`pending_del`) and decrements `nr_buckets`.
BucketState (in-memory only):
- `txn_ref`, `is_deleting`, `is_drop`.
- vacuum coordination: `vacuum_inflight`, `vacuum_epoch`, wait/notify primitives.
Pool uses generation slots instead of arena/chunk allocators.
- hot generation: writable pages/retired-chain/root marks.
- sealed generation: previous hot generation retained while checkpoint is in flight.
- epoch gate (`epochs.gate`) ensures writers do not straddle a checkpoint cut.
Write-side flow:
- writer captures `WriteEpoch` (`capture_epoch`) and increments the inflight counter.
- allocation uses `BoxRef::alloc` and inserts into the hot page map.
- publish path marks dirty roots (`pid -> latest addr`) and unmap marks.
  - merge path requirement: `replace(cursor)` and `mark_unmap(child_pid)` must be done in the same `Publish`/`WriteEpoch` commit to avoid one addr being observed as both live root and junk.
- page replacement/evict paths move retired logical addresses into lineage chains (`RetiredChain`).
Checkpoint trigger conditions:
- hot bytes >= `checkpoint_size`, or
- total dirty bytes (hot + sealed) >= `pool_capacity`, or
- proactive nudge from the evictor (`checkpoint_nudge_ms`) when dirty data remains stale.
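The three trigger conditions combine as a simple disjunction. The sketch below is illustrative only; the parameter names mirror the options above, but the function and its signature are assumptions, not the actual trigger code.

```rust
/// Illustrative checkpoint-trigger predicate combining the three
/// conditions: hot-size threshold, total dirty budget, and stale nudge.
fn should_checkpoint(
    hot_bytes: u64,
    sealed_bytes: u64,
    checkpoint_size: u64,
    pool_capacity: u64,
    stale_ms: u64,
    checkpoint_nudge_ms: u64,
) -> bool {
    hot_bytes >= checkpoint_size
        || hot_bytes + sealed_bytes >= pool_capacity
        || (hot_bytes + sealed_bytes > 0 && stale_ms >= checkpoint_nudge_ms)
}

fn main() {
    assert!(should_checkpoint(64, 0, 64, 256, 0, 1000));     // hot threshold hit
    assert!(should_checkpoint(100, 200, 512, 256, 0, 1000)); // total dirty over capacity
    assert!(should_checkpoint(1, 0, 512, 256, 1500, 1000));  // stale-data nudge
    assert!(!should_checkpoint(1, 0, 512, 256, 0, 1000));    // nothing triggers
}
```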
Checkpoint cut (`Pool::checkpoint`):
- atomically swap hot generation to new empty maps.
- publish the old generation as sealed.
- rotate inflight/root slots in one critical section.
- create `CheckpointTask` and enqueue it to the checkpointer thread.
`CheckpointTask::snapshot`:
- waits for old-generation writers to leave (`EpochInflight::wait_zero`).
- walks dirty roots and reachable chains (link/sibling/remote hints).
- carries reachable `addr > snap_addr` mutations back to the new hot generation only when they are still present in dirty generations; missing addresses are treated as already persisted/reclaimed.
- computes the per-group checkpoint frontier delta.
- emits data/blob junk candidates for stat apply.
`CheckpointTask::finish`:
- clears sealed slots.
- requires sealed generations to be uniquely owned (`Arc::try_unwrap`).
- recycles maps for reuse.
- signals flow-controller completion.
This is the regression boundary introduced by prior reachable-junk lifecycle changes. `junk` produced during publish is not one class:
- structural junk: old base/delta retired by replace/evict itself.
- compaction junk: intermediate sibling/remote addresses collected during merge/compact.

Constraints:
- only structural junk may be retired from hot pages at publish time (with EBR deferral).
- compaction junk must stay discoverable in dirty generations until checkpoint durability closure.
- any address still reachable from live graph traversal (link/sibling/remote hints) must not enter the state "not in dirty memory and not yet in interval metadata".
If these constraints are violated:
- read path can hit an interval lookup hole (`ivl_map.find(addr)` miss / `NotFound`) for still-reachable old addresses.
- long-lived views can observe false missing keys during checkpoint churn.
- durability/visibility closure is broken: runtime may treat an address as persistent while metadata has no mapping for it yet.
Background issue solved by the frontier design:
- a compacted/materialized durable page can absorb updates from multiple writer groups.
- each page header still carries only one `(group, lsn)` pair.
- if recovery correctness relies on per-group WAL checkpoints only, multi-group absorbed history can be replayed incorrectly.
Correctness model:
- the durable boundary is per bucket, per writer group, not a single global/group scalar.
- this boundary is represented by `BucketDurableFrontier` in the manifest (`bucket_frontier` bucket).
- frontier metadata is committed in the same manifest txn as map/stat publish; page/data/blob formats do not need extra frontier fields.
Runtime carrier and propagation:
- hot paths keep building a lineage frontier using `SparseFrontier` carried in `RetiredChain`.
- replace/evict transfer paths fold source lineage by pointwise max per group.
- checkpoint snapshot applies the chained frontier to group checkpoint positions (`frontier.apply_to`).
- when no chain frontier is present for an item, snapshot falls back to the page-header `(group, lsn)` contribution.
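The pointwise-max fold is the core of the lineage transfer. A minimal sketch, assuming a map-backed layout (the real `SparseFrontier` representation is not shown here; only the merge semantics are):

```rust
use std::collections::HashMap;

/// Illustrative sparse per-group frontier: writer group id -> highest LSN
/// observed in this lineage.
#[derive(Default, Debug, PartialEq)]
struct SparseFrontier {
    lsns: HashMap<u32, u64>,
}

impl SparseFrontier {
    /// Record one (group, lsn) contribution, keeping the max per group.
    fn observe(&mut self, group: u32, lsn: u64) {
        let e = self.lsns.entry(group).or_insert(0);
        *e = (*e).max(lsn);
    }

    /// Fold another lineage's frontier into this one, as a replace/evict
    /// transfer path would: pointwise max per group.
    fn merge(&mut self, other: &SparseFrontier) {
        for (&g, &lsn) in &other.lsns {
            self.observe(g, lsn);
        }
    }
}

fn main() {
    let mut a = SparseFrontier::default();
    a.observe(0, 10);
    a.observe(1, 5);
    let mut b = SparseFrontier::default();
    b.observe(1, 9);
    b.observe(2, 3);
    a.merge(&b);
    assert_eq!(a.lsns[&0], 10);
    assert_eq!(a.lsns[&1], 9); // pointwise max wins per group
    assert_eq!(a.lsns[&2], 3);
}
```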
Durable publish boundary:
- checkpoint builds a bucket-level frontier delta for this publish.
- `StoreFlushObserver::publish` merges and persists it via `Manifest::merge_bucket_frontier`.
- map + frontier are durable together, so a crash after manifest commit still leaves a consistent durable boundary.
Checkpointer builds flushed files from snapshot pages.
Data/blob file layout:
- payload frames
- interval table
- relocation table
- footer (`DataFooter` / `BlobFooter`)
Protocol for each new file:
- stage orphan marker in `numerics` (`odf_*` / `obf_*`).
- stage in-memory unsynced-file marker (`data_unsync` / `blob_unsync`).
- write file bytes.
- publish metadata and clear the orphan marker in the same manifest txn.
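The crash-safety value of this ordering can be shown with an in-memory stand-in. Everything here is a sketch: `Manifest`, `Fs`, and the function names are hypothetical, and only the stage/write/publish/clean ordering matches the protocol above.

```rust
use std::collections::{HashMap, HashSet};

/// In-memory stand-in for the manifest (hypothetical type).
#[derive(Default)]
struct Manifest {
    numerics: HashSet<String>, // staged orphan markers (odf_* style)
    published: HashSet<u32>,   // file ids with durable metadata
}

/// In-memory stand-in for the filesystem.
struct Fs {
    files: HashMap<u32, Vec<u8>>,
}

fn write_file(m: &mut Manifest, fs: &mut Fs, file_id: u32, bytes: Vec<u8>, crash: bool) {
    // 1. Stage the orphan marker before any file bytes exist.
    m.numerics.insert(format!("odf_{file_id}"));
    // 2. Write file bytes.
    fs.files.insert(file_id, bytes);
    if crash {
        return; // simulated crash: marker staged, metadata never committed
    }
    // 3. Publish metadata and clear the marker in one atomic step.
    m.published.insert(file_id);
    m.numerics.remove(&format!("odf_{file_id}"));
}

/// Startup cleanup: any surviving marker identifies a stray file.
fn clean_orphans(m: &mut Manifest, fs: &mut Fs) {
    let markers: Vec<String> = m.numerics.drain().collect();
    for marker in markers {
        let file_id: u32 = marker.trim_start_matches("odf_").parse().unwrap();
        fs.files.remove(&file_id);
    }
}

fn main() {
    let mut m = Manifest::default();
    let mut fs = Fs { files: HashMap::new() };
    write_file(&mut m, &mut fs, 1, vec![1], false);
    write_file(&mut m, &mut fs, 2, vec![2], true); // crashed mid-protocol
    clean_orphans(&mut m, &mut fs);
    assert!(fs.files.contains_key(&1));  // published file survives
    assert!(!fs.files.contains_key(&2)); // orphan removed idempotently
}
```

Because the marker is staged before any bytes and cleared in the same txn as the metadata, a crash at any point leaves either a marked stray file (cleaned at startup) or a fully published one, never a referenced-but-missing file.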
`StoreFlushObserver::publish` flow:
- sync built data/blob writers (`sync_data` or `sync`, based on `sync_on_write`).
- merge bucket frontier with snapshot frontier delta.
- begin manifest txn:
  - apply junk to data/blob stats,
  - clear orphan markers for published files,
  - record intervals/stats/map/frontier/numerics in one atomic commit boundary.
- commit txn.
- clear in-memory unsynced sets.
- update WAL checkpoint records per writer group from the global frontier lower bound (scan/recycle hint).
Global frontier lower bound (`Manifest::global_frontier_lower_bound`):
- considers all active buckets.
- buckets with a pending flush can pin the frontier via their durable bucket frontier.
- clean/read-only buckets fall back to the current group checkpoint to avoid stale pinning.
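The lower-bound rule above reduces to a minimum over per-bucket contributions. A sketch for a single writer group, with assumed parameter shapes (the real signature is not reproduced):

```rust
/// Illustrative lower-bound computation for one writer group: dirty
/// buckets contribute (and can pin via) their durable frontier, while
/// clean buckets contribute the current group checkpoint instead.
fn global_frontier_lower_bound(
    buckets: &[(bool /* has pending flush */, u64 /* durable frontier lsn */)],
    group_checkpoint: u64,
) -> u64 {
    buckets
        .iter()
        .map(|&(dirty, frontier)| if dirty { frontier } else { group_checkpoint })
        .min()
        .unwrap_or(group_checkpoint)
}

fn main() {
    // A dirty bucket with a low frontier pins the bound...
    assert_eq!(global_frontier_lower_bound(&[(true, 30), (false, 0)], 100), 30);
    // ...while clean buckets fall back to the checkpoint and never pin.
    assert_eq!(global_frontier_lower_bound(&[(false, 5), (false, 7)], 100), 100);
}
```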
A FlowController exists for every bucket; enforcement is gated by `Options::enable_backpressure`.
Admission model:
- foreground write acquires `ForegroundWritePermit` before `tree.update`.
- the permit reserves bytes against the dirty-memory budget.
- if projected dirty bytes exceed the admission limit, the writer waits on a condvar.
- checkpoint progress wakes waiting writers in byte-based quanta.
- permit drop releases the reservation.
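A condvar-based admission gate of this shape can be sketched as below. This is a simplified stand-in, not the actual permit type: names are illustrative, and the byte-quanta wakeup policy is collapsed into a plain `notify_all`.

```rust
use std::sync::{Condvar, Mutex};

/// Simplified admission gate: reserve bytes against a dirty-memory
/// budget, blocking while the projection exceeds the limit.
struct AdmissionGate {
    reserved: Mutex<u64>, // currently reserved dirty bytes
    cv: Condvar,
    limit: u64,
}

impl AdmissionGate {
    fn new(limit: u64) -> Self {
        Self { reserved: Mutex::new(0), cv: Condvar::new(), limit }
    }

    /// Blocks until `bytes` fits under the limit, then reserves it
    /// (what acquiring the write permit would do).
    fn acquire(&self, bytes: u64) {
        let mut reserved = self.reserved.lock().unwrap();
        while *reserved + bytes > self.limit {
            reserved = self.cv.wait(reserved).unwrap();
        }
        *reserved += bytes;
    }

    /// Called on permit drop or checkpoint progress: release bytes and
    /// wake waiting writers.
    fn release(&self, bytes: u64) {
        let mut reserved = self.reserved.lock().unwrap();
        *reserved -= bytes;
        self.cv.notify_all();
    }
}

fn main() {
    let gate = AdmissionGate::new(100);
    gate.acquire(60);
    gate.acquire(40); // exactly at the limit: still admitted
    gate.release(100);
    assert_eq!(*gate.reserved.lock().unwrap(), 0);
}
```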
Throughput/debt model:
- checkpoint enqueue records estimated bytes.
- actual built bytes reconcile debt and feed IO EWMA.
- completion updates end-to-end EWMA.
- idle windows reset stale EWMA samples.
Current flow metrics:
- `CounterMetric::FlowFgAdmissionWait`
- `HistogramMetric::FlowFgAdmissionWaitMicros`
TxnKV:
- allocated to a writer group via inflight-aware scheduling.
- records `Begin` at start.
- first mutation logs `WalUpdate` and links via the `prev_lsn` chain.
- commit path:
  - `record_commit(start_ts)`
  - `log.sync(false)`
  - append `(start_ts, commit_ts)` to the commit tree
  - update watermark (`collect_wmk`)
Drop of uncommitted TxnKV:
- no writes: log `Abort`.
- with writes: roll back through the WAL chain (`WalReader::rollback`), logging CLR records and inverse versions.
TxnView:
- allocates from `CCPool`, holds snapshot `start_ts`, no WAL writes.
Visibility (`ConcurrencyControl::is_visible_to`):
- own uncommitted version is visible only within the same writer group.
- a future txid is invisible.
- `safe_txid` fast path from the global watermark.
- fallback to per-group `CommitTree::lcb(start_ts)`.
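The rules above can be condensed into a single predicate. This is a hedged approximation: the real `is_visible_to` signature and the commit-tree lookup are not reproduced, and the fallback here substitutes a direct commit-timestamp comparison for `CommitTree::lcb`.

```rust
/// Illustrative visibility check mirroring the rule order above:
/// own-uncommitted, watermark fast path, then snapshot comparison.
fn is_visible(
    version_txid: u64,
    version_group: u32,
    version_commit_ts: Option<u64>, // Some(ts) once committed
    reader_txid: u64,
    reader_group: u32,
    reader_start_ts: u64,
    safe_txid: u64, // global watermark
) -> bool {
    // Own uncommitted version: visible only inside the same writer group.
    if version_commit_ts.is_none() {
        return version_txid == reader_txid && version_group == reader_group;
    }
    // Fast path: committed at or below the global watermark.
    if version_txid <= safe_txid {
        return true;
    }
    // Fallback: committed before the reader's snapshot started.
    version_commit_ts.unwrap() <= reader_start_ts
}

fn main() {
    assert!(is_visible(7, 1, None, 7, 1, 100, 0));       // own uncommitted, same group
    assert!(!is_visible(7, 2, None, 7, 1, 100, 0));      // own, but different group
    assert!(is_visible(3, 0, Some(50), 9, 1, 100, 5));   // below watermark fast path
    assert!(!is_visible(8, 0, Some(120), 9, 1, 100, 5)); // committed after snapshot
}
```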
WAL payload semantics:
- `Insert`: new value image.
- `Update`: old value + new value.
- `Delete`: old value image.
- `Clr`: tombstone/value + undo pointer (`undo_id`, `undo_off`).
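The four payload shapes determine what rollback can recover. The enum below is illustrative (the on-disk encoding is not shown); it captures only which images each record carries and why `Update`/`Delete` keep the pre-image.

```rust
/// Illustrative payload shapes; field layout is an assumption.
enum WalPayload {
    Insert { new: Vec<u8> },
    Update { old: Vec<u8>, new: Vec<u8> },
    Delete { old: Vec<u8> },
    Clr { image: Vec<u8>, undo_id: u64, undo_off: u64 },
}

/// Rollback needs the pre-image: Update/Delete carry it directly,
/// while Insert's inverse is simply a delete (no image required).
fn undo_image(p: &WalPayload) -> Option<&[u8]> {
    match p {
        WalPayload::Insert { .. } => None,
        WalPayload::Update { old, .. } | WalPayload::Delete { old } => Some(old.as_slice()),
        WalPayload::Clr { image, .. } => Some(image.as_slice()),
    }
}

fn main() {
    let u = WalPayload::Update { old: vec![1], new: vec![2] };
    assert_eq!(undo_image(&u), Some(&[1u8][..]));
    let i = WalPayload::Insert { new: vec![9] };
    assert_eq!(undo_image(&i), None); // inverse of insert is a delete
}
```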
Startup sequence:
- `ManifestBuilder::load` loads metadata caches and counters.
- `clean_orphans` scans orphan markers (`odf_*`, `obf_*`), removes stray files, deletes markers.
- WAL phase 1 scans WAL files per group and finds the latest valid checkpoint as a conservative scan start.
- context is created with per-group bootstrap info.
- phase 2 runs `analyze -> redo -> undo`.
Analyze stage:
- reads WAL from the checkpoint position.
- truncates the tail on corruption if `truncate_corrupted_wal=true`.
- checks the bucket durable frontier first (`durable_frontier_lsn`):
  - if `record_lsn <= frontier[bucket][group]`, treat as already durable and skip dirty-table enrollment.
  - only records beyond the frontier continue to redo/undo candidate logic.
Boundary split (important semantic change):
- `WalCheckpoint[group]` is a scan-start optimization and WAL recycle aid.
- `BucketDurableFrontier[bucket][group]` is the correctness gate for "already durable or not".
- therefore "one WAL record before checkpoint and another after checkpoint" can still exist, but that only affects scan cost, not replay correctness.
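The correctness gate reduces to one comparison against the frontier, with a conservative default when no frontier exists for a (bucket, group) pair. A minimal sketch (function name and option shape are assumptions):

```rust
/// Illustrative analyze-stage gate: the bucket durable frontier, not
/// the WAL checkpoint position, decides whether a record needs replay.
fn needs_replay(record_lsn: u64, frontier_lsn: Option<u64>) -> bool {
    match frontier_lsn {
        // At or below the durable frontier: already durable, skip.
        Some(f) => record_lsn > f,
        // No frontier recorded for this (bucket, group): conservatively replay.
        None => true,
    }
}

fn main() {
    assert!(!needs_replay(10, Some(10))); // exactly at the frontier: durable
    assert!(needs_replay(11, Some(10))); // beyond the frontier: replay
    assert!(needs_replay(1, None));      // unknown frontier: conservative
}
```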
Redo stage:
- reapplies remaining update/insert/delete/clr records in version order.
Undo stage:
- rolls back incomplete txns using same WAL rollback path used at runtime.
Post recovery:
- advances `oracle`/`wmk_oldest`.
- evicts loaded recovery bucket contexts.
GC thread (`gc_timeout` periodic + manual trigger):
- `process_data` (victim selection + rewrite)
- `process_blob` (victim selection + rewrite)
- `process_pending_buckets` (physical cleanup for `pending_del`)
- `scavenge` (bounded in-memory page compaction)
- `delete_files` (idempotent obsolete-file unlink + metadata ack)
Rewrite crash safety:
- rewrite stages orphan marker before file build.
- metadata txn publishes new intervals/stats + delete intents + clears marker.
- old files are then physically deleted through obsolete-file pipeline.
Scavenge:
- scans loaded bucket page tables in bounded batches.
- skips deleting/dropped/vacuuming buckets.
- enforces per-tick compaction quota to cap IO impact.
Manual vacuum APIs:
- `vacuum_bucket(name)`
  - serialized per bucket using `vacuum_inflight`/`vacuum_epoch`.
  - returns `VacuumStats { scanned, compacted }`.
- `vacuum_meta()`
  - compacts the manifest btree (`btree_store::compact`).
Injection point:
- `Options::observer: Arc<dyn Observer>` (`NoopObserver` by default).
API shape:
- `counter(CounterMetric, delta)`
- `gauge(GaugeMetric, value)`
- `histogram(HistogramMetric, value)`
- `event(ObserveEvent)`
Built-in helper:
- `InMemoryObserver` with fixed arrays and bounded FIFO events.
Latency sampling:
- `LATENCY_SAMPLE_SHIFT = 6` (1/64) for hot metrics such as:
  - txn commit/rollback latency,
  - tree link lock-hold latency,
  - WAL sync latency.
Low-frequency recovery/GC metrics are reported without sampling.
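A shift of 6 implies a cheap mask test rather than a division. The sketch below shows one common way to implement 1-in-64 sampling; the counter source (`op_seq` here) is an assumption, not the repository's actual mechanism.

```rust
/// Illustrative 1-in-64 sampling using LATENCY_SAMPLE_SHIFT = 6:
/// a single AND against a mask decides whether a hot-path sample
/// is recorded, keeping the unsampled path nearly free.
const LATENCY_SAMPLE_SHIFT: u32 = 6;
const SAMPLE_MASK: u64 = (1 << LATENCY_SAMPLE_SHIFT) - 1; // 63

fn should_sample(op_seq: u64) -> bool {
    op_seq & SAMPLE_MASK == 0
}

fn main() {
    let sampled = (0u64..640).filter(|&n| should_sample(n)).count();
    assert_eq!(sampled, 10); // exactly 1/64 of 640 operations
}
```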
- Data/blob files referenced by metadata must be written before metadata commit.
- Orphan marker stage/clear is the only startup orphan cleanup source of truth.
- Checkpoint epoch cut must atomically rotate hot/sealed generation handles.
- Writers must not straddle the checkpoint cut (`epochs.gate` + inflight wait-zero).
- Sealed generations must be uniquely owned when a checkpoint finishes.
- Reachable addresses (including sibling/remote-referenced old pages) must stay readable from memory until they are either durably published or proven unreachable by lifecycle closure.
- Merge publish must apply `replace` and child `mark_unmap` in the same epoch commit, so checkpoint snapshot never classifies one addr as both dirty root and newly retired junk.
- Bucket durable frontier must advance monotonically per group.
- Map publish and bucket frontier publish must be in the same manifest txn.
- Recovery correctness must be gated by bucket frontier, not by group checkpoint alone.
- Recovery analyze must ignore updates already <= durable frontier.
- Bucket delete is two-phase (`pending_del`), and `nr_buckets` decrements only after physical cleanup.
- WAL recycling must stay below min(active-txn boundary, last checkpoint boundary).
- Foreground admission wait must happen before entering the `tree.update` critical mutation path.
Invariant-break consequences to watch for:
- `NotFound`/interval miss on addresses that should still be reachable.
- snapshot/read visibility regression under concurrent checkpoint (false key loss).
- frontier/map correctness no longer matching actual durability boundary.
- `./scripts/prod_test.sh fast|stress|chaos|all`
- `./scripts/perf_gate.sh snapshot|compare`
- tuning defaults in `scripts/perf_thresholds.env`
- CI wiring in `.github/workflows/ci.yml`