This document describes the current implementation in this repository. It is focused on behavior that affects correctness, crash safety, and operational tuning.
- Embedded key-value engine with predictable point-read latency.
- High write throughput via append-only WAL + asynchronous checkpoint publish.
- Snapshot isolation (SI) with MVCC visibility.
- Key/value separation: large values stored in blob files.
- Crash-safe metadata updates and idempotent startup cleanup.
- `src/store` - Owns startup/shutdown orchestration, recovery, GC thread, and public Mace APIs.
- `src/meta` - Owns persistent metadata (`btree_store`) and in-memory stat/index caches.
- `src/map` - Bucket runtime (dirty-page generations, checkpoint build, loader/cache, flow control).
- `src/index` - Bw-Tree style index, delta chains, split/merge/compact paths, iterator paths.
- `src/cc` - Concurrency control, writer groups, commit tree, WAL encode/decode, rollback.
- `src/utils/observe.rs` - Fixed-cardinality observability model.
Manifest persists metadata in BTree buckets.
Main buckets:
- `numerics` - global counters and orphan markers (`odf_*`, `obf_*`).
- `bucket_metas` - bucket name -> `BucketMeta { bucket_id }`.
- `bucket_frontier` - `BucketDurableFrontier` per bucket (per-writer-group durable LSN frontier).
- `pending_del` - buckets pending physical table cleanup.
- `page_table_{bucket_id}` - `pid -> logical_addr` map.
- `data_interval_{bucket_id}` / `blob_interval_{bucket_id}` - `addr-range -> file_id` mapping.
- `data_stat` / `blob_stat` - per-file stats + bitmap references used by GC.
- `obsolete_data` / `obsolete_blob` - files pending physical delete.
Addressing model:
- each bucket has an independent logical address space (`BucketState::next_addr`).
- page table maps `pid -> swip` (`tagged` means on-disk logical address).
- interval tables map logical address ranges to data/blob file IDs.
- relocation tables in file footers map logical address to file offset.
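The translation chain above can be sketched as follows. This is a minimal illustration only: `IntervalMap`, `Relocation`, and their field names are assumptions, not the repository's actual types, and only the lookup semantics (interval range match, then relocation offset) are modeled.

```rust
use std::collections::BTreeMap;

/// Illustrative interval table: start-of-range -> (end-exclusive, file_id).
struct IntervalMap {
    ranges: BTreeMap<u64, (u64, u32)>,
}

impl IntervalMap {
    /// Find the file covering `addr`, if any range contains it.
    fn find(&self, addr: u64) -> Option<u32> {
        self.ranges
            .range(..=addr)
            .next_back()
            .filter(|(_, (end, _))| addr < *end)
            .map(|(_, (_, file_id))| *file_id)
    }
}

/// Illustrative relocation table from a file footer: logical addr -> file offset.
struct Relocation {
    entries: BTreeMap<u64, u64>,
}

/// Resolve a tagged on-disk logical address to (file_id, file_offset),
/// mimicking the interval-table -> relocation-table chain.
fn resolve(addr: u64, ivl: &IntervalMap, reloc: &Relocation) -> Option<(u64, u64)> {
    let file_id = ivl.find(addr)? as u64;
    let off = *reloc.entries.get(&addr)?;
    Some((file_id, off))
}

fn main() {
    let ivl = IntervalMap { ranges: BTreeMap::from([(0, (100, 7))]) };
    let reloc = Relocation { entries: BTreeMap::from([(42, 4096)]) };
    assert_eq!(resolve(42, &ivl, &reloc), Some((7, 4096)));
    // An address outside every interval is the "interval lookup hole"
    // failure mode described later in this document.
    assert_eq!(resolve(200, &ivl, &reloc), None);
}
```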
Creation (`Manifest::create_bucket`):
- Take `structural_lock`.
- Reject duplicates and enforce `MAX_BUCKETS` via `nr_buckets`.
- Allocate `bucket_id`, create runtime context/state lazily, initialize frontier.
- Commit one metadata txn (`numerics`, `bucket_frontier`, `bucket_metas`).
Loading (`load_bucket_context`):
- lazy-load runtime context when first accessed.
- recover page table + data/blob intervals from metadata.
- cache context in `BucketMgr`.
Unload (`unload_bucket`):
- runtime-only operation.
- unpublish from in-memory maps, checkpoint-and-reclaim loaded context.
- persisted bucket metadata/data are kept.
Delete (`delete_bucket`):
- logical delete txn removes the `bucket_metas` entry, inserts `pending_del`, records obsolete files.
- runtime context/state are removed first.
- GC later performs physical cleanup (`pending_del`) and decrements `nr_buckets`.
BucketState (in-memory only):
- `txn_ref`, `is_deleting`, `is_drop`.
- vacuum coordination: `vacuum_inflight`, `vacuum_epoch`, wait/notify primitives.
Pool uses generation slots instead of arena/chunk allocators.
- hot generation: writable pages/retired-chain/root marks.
- sealed generation: previous hot generation retained while checkpoint is in flight.
- epoch gate (`epochs.gate`) ensures writers do not straddle a checkpoint cut.
Write-side flow:
- writer captures `WriteEpoch` (`capture_epoch`) and increments the inflight counter.
- allocation uses `BoxRef::alloc` and inserts into the hot page map.
- publish path marks dirty roots (`pid -> latest addr`) and unmap marks.
  - merge path requirement: `replace(cursor)` and `mark_unmap(child_pid)` must be done in the same `Publish`/`WriteEpoch` commit to avoid one addr being observed as both live root and junk.
- page replacement/evict paths move retired logical addresses into lineage chains (`RetiredChain`).
Checkpoint trigger conditions:
- hot bytes >= `checkpoint_size`, or
- total dirty bytes (hot + sealed) >= `pool_capacity`, or
- proactive nudge from the evictor (`checkpoint_nudge_ms`) when dirty data remains stale.
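The three trigger conditions combine as a simple disjunction. The sketch below is illustrative only; the parameter names mirror the options above, but the function and its signature are assumptions, not the actual trigger code.

```rust
/// Illustrative checkpoint-trigger predicate combining the three
/// conditions: hot-size threshold, total dirty budget, and stale nudge.
fn should_checkpoint(
    hot_bytes: u64,
    sealed_bytes: u64,
    checkpoint_size: u64,
    pool_capacity: u64,
    stale_ms: u64,
    checkpoint_nudge_ms: u64,
) -> bool {
    hot_bytes >= checkpoint_size
        || hot_bytes + sealed_bytes >= pool_capacity
        || (hot_bytes + sealed_bytes > 0 && stale_ms >= checkpoint_nudge_ms)
}

fn main() {
    assert!(should_checkpoint(64, 0, 64, 256, 0, 1000));     // hot threshold hit
    assert!(should_checkpoint(100, 200, 512, 256, 0, 1000)); // total dirty over capacity
    assert!(should_checkpoint(1, 0, 512, 256, 1500, 1000));  // stale-data nudge
    assert!(!should_checkpoint(1, 0, 512, 256, 0, 1000));    // nothing triggers
}
```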
Checkpoint cut (`Pool::checkpoint`):
- atomically swap hot generation to new empty maps.
- publish the old generation as sealed.
- rotate inflight/root slots in one critical section.
- create `CheckpointTask` and enqueue it to the checkpointer thread.
`CheckpointTask::snapshot`:
- waits for old-generation writers to leave (`EpochInflight::wait_zero`).
- walks dirty roots and reachable chains (link/sibling/remote hints).
- carries reachable `addr > snap_addr` mutations back to the new hot generation only when they are still present in dirty generations; missing addresses are treated as already persisted/reclaimed.
- computes the per-group checkpoint frontier delta.
- emits data/blob junk candidates for stat apply.
`CheckpointTask::finish`:
- clears sealed slots.
- requires sealed generations to be uniquely owned (`Arc::try_unwrap`).
- recycles maps for reuse.
- signals flow-controller completion.
This is the regression boundary introduced by prior reachable-junk lifecycle changes. `junk` produced during publish is not one class:
- structural junk: old base/delta retired by replace/evict itself.
- compaction junk: intermediate sibling/remote addresses collected during merge/compact.

Constraints:
- only structural junk may be retired from hot pages at publish time (with EBR deferral).
- compaction junk must stay discoverable in dirty generations until checkpoint durability closure.
- any address still reachable from live graph traversal (link/sibling/remote hints) must not enter the state "not in dirty memory and not yet in interval metadata".
If these constraints are violated:
- read path can hit an interval lookup hole (`ivl_map.find(addr)` miss / `NotFound`) for still-reachable old addresses.
- long-lived views can observe false missing keys during checkpoint churn.
- durability/visibility closure is broken: runtime may treat an address as persistent while metadata has no mapping for it yet.
Background issue solved by the frontier design:
- a compacted/materialized durable page can absorb updates from multiple writer groups.
- each page header still carries only one `(group, lsn)` pair.
- if recovery correctness relies on per-group WAL checkpoints only, multi-group absorbed history can be replayed incorrectly.
Correctness model:
- the durable boundary is per bucket, per writer group, not a single global/group scalar.
- this boundary is represented by `BucketDurableFrontier` in the manifest (`bucket_frontier` bucket).
- frontier metadata is committed in the same manifest txn as map/stat publish; page/data/blob formats do not need extra frontier fields.
Runtime carrier and propagation:
- hot paths keep building a lineage frontier using `SparseFrontier` carried in `RetiredChain`.
- replace/evict transfer paths fold source lineage by pointwise max per group.
- checkpoint snapshot applies the chained frontier to group checkpoint positions (`frontier.apply_to`).
- when no chain frontier is present for an item, snapshot falls back to the page-header `(group, lsn)` contribution.
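The pointwise-max fold is the core of the lineage transfer. A minimal sketch, assuming a map-backed layout (the real `SparseFrontier` representation is not shown here; only the merge semantics are):

```rust
use std::collections::HashMap;

/// Illustrative sparse per-group frontier: writer group id -> highest LSN
/// observed in this lineage.
#[derive(Default, Debug, PartialEq)]
struct SparseFrontier {
    lsns: HashMap<u32, u64>,
}

impl SparseFrontier {
    /// Record one (group, lsn) contribution, keeping the max per group.
    fn observe(&mut self, group: u32, lsn: u64) {
        let e = self.lsns.entry(group).or_insert(0);
        *e = (*e).max(lsn);
    }

    /// Fold another lineage's frontier into this one, as a replace/evict
    /// transfer path would: pointwise max per group.
    fn merge(&mut self, other: &SparseFrontier) {
        for (&g, &lsn) in &other.lsns {
            self.observe(g, lsn);
        }
    }
}

fn main() {
    let mut a = SparseFrontier::default();
    a.observe(0, 10);
    a.observe(1, 5);
    let mut b = SparseFrontier::default();
    b.observe(1, 9);
    b.observe(2, 3);
    a.merge(&b);
    assert_eq!(a.lsns[&0], 10);
    assert_eq!(a.lsns[&1], 9); // pointwise max wins per group
    assert_eq!(a.lsns[&2], 3);
}
```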
Durable publish boundary:
- checkpoint builds a bucket-level frontier delta for this publish.
- `StoreFlushObserver::publish` merges and persists it via `Manifest::merge_bucket_frontier`.
- map + frontier are durable together, so a crash after manifest commit still leaves a consistent durable boundary.
Checkpointer builds flushed files from snapshot pages.
Data/blob file layout:
- payload frames
- interval table
- relocation table
- footer (`DataFooter` / `BlobFooter`)
Protocol for each new file:
- stage orphan marker in `numerics` (`odf_*` / `obf_*`).
- stage in-memory unsynced-file marker (`data_unsync` / `blob_unsync`).
- write file bytes.
- publish metadata and clear the orphan marker in the same manifest txn.
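The crash-safety value of this ordering can be shown with an in-memory stand-in. Everything here is a sketch: `Manifest`, `Fs`, and the function names are hypothetical, and only the stage/write/publish/clean ordering matches the protocol above.

```rust
use std::collections::{HashMap, HashSet};

/// In-memory stand-in for the manifest (hypothetical type).
#[derive(Default)]
struct Manifest {
    numerics: HashSet<String>, // staged orphan markers (odf_* style)
    published: HashSet<u32>,   // file ids with durable metadata
}

/// In-memory stand-in for the filesystem.
struct Fs {
    files: HashMap<u32, Vec<u8>>,
}

fn write_file(m: &mut Manifest, fs: &mut Fs, file_id: u32, bytes: Vec<u8>, crash: bool) {
    // 1. Stage the orphan marker before any file bytes exist.
    m.numerics.insert(format!("odf_{file_id}"));
    // 2. Write file bytes.
    fs.files.insert(file_id, bytes);
    if crash {
        return; // simulated crash: marker staged, metadata never committed
    }
    // 3. Publish metadata and clear the marker in one atomic step.
    m.published.insert(file_id);
    m.numerics.remove(&format!("odf_{file_id}"));
}

/// Startup cleanup: any surviving marker identifies a stray file.
fn clean_orphans(m: &mut Manifest, fs: &mut Fs) {
    let markers: Vec<String> = m.numerics.drain().collect();
    for marker in markers {
        let file_id: u32 = marker.trim_start_matches("odf_").parse().unwrap();
        fs.files.remove(&file_id);
    }
}

fn main() {
    let mut m = Manifest::default();
    let mut fs = Fs { files: HashMap::new() };
    write_file(&mut m, &mut fs, 1, vec![1], false);
    write_file(&mut m, &mut fs, 2, vec![2], true); // crashed mid-protocol
    clean_orphans(&mut m, &mut fs);
    assert!(fs.files.contains_key(&1));  // published file survives
    assert!(!fs.files.contains_key(&2)); // orphan removed idempotently
}
```

Because the marker is staged before any bytes and cleared in the same txn as the metadata, a crash at any point leaves either a marked stray file (cleaned at startup) or a fully published one, never a referenced-but-missing file.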
`StoreFlushObserver::publish` flow:
- sync built data/blob writers (`sync_data` or `sync`, based on `sync_on_write`).
- merge bucket frontier with snapshot frontier delta.
- begin manifest txn:
  - apply junk to data/blob stats,
  - clear orphan markers for published files,
  - record intervals/stats/map/frontier/numerics in one atomic commit boundary.
- commit txn.
- clear in-memory unsynced sets.
- update WAL checkpoint records per writer group from the global frontier lower bound (scan/recycle hint).
Global frontier lower bound (`Manifest::global_frontier_lower_bound`):
- considers all active buckets.
- buckets with a pending flush can pin the frontier via their durable bucket frontier.
- clean/read-only buckets fall back to the current group checkpoint to avoid stale pinning.
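The lower-bound rule above reduces to a minimum over per-bucket contributions. A sketch for a single writer group, with assumed parameter shapes (the real signature is not reproduced):

```rust
/// Illustrative lower-bound computation for one writer group: dirty
/// buckets contribute (and can pin via) their durable frontier, while
/// clean buckets contribute the current group checkpoint instead.
fn global_frontier_lower_bound(
    buckets: &[(bool /* has pending flush */, u64 /* durable frontier lsn */)],
    group_checkpoint: u64,
) -> u64 {
    buckets
        .iter()
        .map(|&(dirty, frontier)| if dirty { frontier } else { group_checkpoint })
        .min()
        .unwrap_or(group_checkpoint)
}

fn main() {
    // A dirty bucket with a low frontier pins the bound...
    assert_eq!(global_frontier_lower_bound(&[(true, 30), (false, 0)], 100), 30);
    // ...while clean buckets fall back to the checkpoint and never pin.
    assert_eq!(global_frontier_lower_bound(&[(false, 5), (false, 7)], 100), 100);
}
```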
A FlowController exists for every bucket; enforcement is gated by `Options::enable_backpressure`.
Admission model:
- foreground write acquires `ForegroundWritePermit` before `tree.update`.
- the permit reserves bytes against the dirty-memory budget.
- if projected dirty bytes exceed the admission limit, the writer waits on a condvar.
- checkpoint progress wakes waiting writers in byte-based quanta.
- permit drop releases the reservation.
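A condvar-based admission gate of this shape can be sketched as below. This is a simplified stand-in, not the actual permit type: names are illustrative, and the byte-quanta wakeup policy is collapsed into a plain `notify_all`.

```rust
use std::sync::{Condvar, Mutex};

/// Simplified admission gate: reserve bytes against a dirty-memory
/// budget, blocking while the projection exceeds the limit.
struct AdmissionGate {
    reserved: Mutex<u64>, // currently reserved dirty bytes
    cv: Condvar,
    limit: u64,
}

impl AdmissionGate {
    fn new(limit: u64) -> Self {
        Self { reserved: Mutex::new(0), cv: Condvar::new(), limit }
    }

    /// Blocks until `bytes` fits under the limit, then reserves it
    /// (what acquiring the write permit would do).
    fn acquire(&self, bytes: u64) {
        let mut reserved = self.reserved.lock().unwrap();
        while *reserved + bytes > self.limit {
            reserved = self.cv.wait(reserved).unwrap();
        }
        *reserved += bytes;
    }

    /// Called on permit drop or checkpoint progress: release bytes and
    /// wake waiting writers.
    fn release(&self, bytes: u64) {
        let mut reserved = self.reserved.lock().unwrap();
        *reserved -= bytes;
        self.cv.notify_all();
    }
}

fn main() {
    let gate = AdmissionGate::new(100);
    gate.acquire(60);
    gate.acquire(40); // exactly at the limit: still admitted
    gate.release(100);
    assert_eq!(*gate.reserved.lock().unwrap(), 0);
}
```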
Throughput/debt model:
- checkpoint enqueue records estimated bytes.
- actual built bytes reconcile debt and feed IO EWMA.
- completion updates end-to-end EWMA.
- idle windows reset stale EWMA samples.
Current flow metrics:
- `CounterMetric::FlowFgAdmissionWait`
- `HistogramMetric::FlowFgAdmissionWaitMicros`
TxnKV:
- allocated to a writer group via inflight-aware scheduling.
- records `Begin` at start.
- first mutation logs `WalUpdate` and links via the `prev_lsn` chain.
- commit path:
  - `record_commit(start_ts)`
  - `log.sync(false)`
  - append `(start_ts, commit_ts)` to the commit tree
  - update watermark (`collect_wmk`)
Drop of uncommitted TxnKV:
- no writes: log `Abort`.
- with writes: roll back through the WAL chain (`WalReader::rollback`), logging CLR records and inverse versions.
TxnView:
- allocates from `CCPool`, holds snapshot `start_ts`, no WAL writes.
Visibility (`ConcurrencyControl::is_visible_to`):
- own uncommitted version is visible only within the same writer group.
- a future txid is invisible.
- `safe_txid` fast path from the global watermark.
- fallback to per-group `CommitTree::lcb(start_ts)`.
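The rules above can be condensed into a single predicate. This is a hedged approximation: the real `is_visible_to` signature and the commit-tree lookup are not reproduced, and the fallback here substitutes a direct commit-timestamp comparison for `CommitTree::lcb`.

```rust
/// Illustrative visibility check mirroring the rule order above:
/// own-uncommitted, watermark fast path, then snapshot comparison.
fn is_visible(
    version_txid: u64,
    version_group: u32,
    version_commit_ts: Option<u64>, // Some(ts) once committed
    reader_txid: u64,
    reader_group: u32,
    reader_start_ts: u64,
    safe_txid: u64, // global watermark
) -> bool {
    // Own uncommitted version: visible only inside the same writer group.
    if version_commit_ts.is_none() {
        return version_txid == reader_txid && version_group == reader_group;
    }
    // Fast path: committed at or below the global watermark.
    if version_txid <= safe_txid {
        return true;
    }
    // Fallback: committed before the reader's snapshot started.
    version_commit_ts.unwrap() <= reader_start_ts
}

fn main() {
    assert!(is_visible(7, 1, None, 7, 1, 100, 0));       // own uncommitted, same group
    assert!(!is_visible(7, 2, None, 7, 1, 100, 0));      // own, but different group
    assert!(is_visible(3, 0, Some(50), 9, 1, 100, 5));   // below watermark fast path
    assert!(!is_visible(8, 0, Some(120), 9, 1, 100, 5)); // committed after snapshot
}
```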
WAL payload semantics:
- `Insert`: new value image.
- `Update`: old value + new value.
- `Delete`: old value image.
- `Clr`: tombstone/value + undo pointer (`undo_id`, `undo_off`).
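The four payload shapes determine what rollback can recover. The enum below is illustrative (the on-disk encoding is not shown); it captures only which images each record carries and why `Update`/`Delete` keep the pre-image.

```rust
/// Illustrative payload shapes; field layout is an assumption.
enum WalPayload {
    Insert { new: Vec<u8> },
    Update { old: Vec<u8>, new: Vec<u8> },
    Delete { old: Vec<u8> },
    Clr { image: Vec<u8>, undo_id: u64, undo_off: u64 },
}

/// Rollback needs the pre-image: Update/Delete carry it directly,
/// while Insert's inverse is simply a delete (no image required).
fn undo_image(p: &WalPayload) -> Option<&[u8]> {
    match p {
        WalPayload::Insert { .. } => None,
        WalPayload::Update { old, .. } | WalPayload::Delete { old } => Some(old.as_slice()),
        WalPayload::Clr { image, .. } => Some(image.as_slice()),
    }
}

fn main() {
    let u = WalPayload::Update { old: vec![1], new: vec![2] };
    assert_eq!(undo_image(&u), Some(&[1u8][..]));
    let i = WalPayload::Insert { new: vec![9] };
    assert_eq!(undo_image(&i), None); // inverse of insert is a delete
}
```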
Startup sequence:
- `ManifestBuilder::load` loads metadata caches and counters.
- `clean_orphans` scans orphan markers (`odf_*`, `obf_*`), removes stray files, deletes markers.
- WAL phase 1 scans WAL files per group and finds the latest valid checkpoint as a conservative scan start.
- context is created with per-group bootstrap info.
- phase 2 runs `analyze -> redo -> undo`.
Analyze stage:
- reads WAL from the checkpoint position.
- truncates the tail on corruption if `truncate_corrupted_wal=true`.
- checks the bucket durable frontier first (`durable_frontier_lsn`):
  - if `record_lsn <= frontier[bucket][group]`, treat as already durable and skip dirty-table enrollment.
  - only records beyond the frontier continue to redo/undo candidate logic.
Boundary split (important semantic change):
- `WalCheckpoint[group]` is a scan-start optimization and WAL recycle aid.
- `BucketDurableFrontier[bucket][group]` is the correctness gate for "already durable or not".
- therefore "one WAL record before checkpoint and another after checkpoint" can still exist, but that only affects scan cost, not replay correctness.
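The correctness gate reduces to one comparison against the frontier, with a conservative default when no frontier exists for a (bucket, group) pair. A minimal sketch (function name and option shape are assumptions):

```rust
/// Illustrative analyze-stage gate: the bucket durable frontier, not
/// the WAL checkpoint position, decides whether a record needs replay.
fn needs_replay(record_lsn: u64, frontier_lsn: Option<u64>) -> bool {
    match frontier_lsn {
        // At or below the durable frontier: already durable, skip.
        Some(f) => record_lsn > f,
        // No frontier recorded for this (bucket, group): conservatively replay.
        None => true,
    }
}

fn main() {
    assert!(!needs_replay(10, Some(10))); // exactly at the frontier: durable
    assert!(needs_replay(11, Some(10))); // beyond the frontier: replay
    assert!(needs_replay(1, None));      // unknown frontier: conservative
}
```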
Redo stage:
- reapplies remaining update/insert/delete/clr records in version order.
Undo stage:
- rolls back incomplete txns using same WAL rollback path used at runtime.
Post recovery:
- advances `oracle`/`wmk_oldest`.
- evicts loaded recovery bucket contexts.
GC thread (`gc_timeout` periodic + manual trigger):
- `process_data` (victim selection + rewrite)
- `process_blob` (victim selection + rewrite)
- `process_pending_buckets` (physical cleanup for `pending_del`)
- `scavenge` (bounded in-memory page compaction)
- `delete_files` (idempotent obsolete-file unlink + metadata ack)
Rewrite crash safety:
- rewrite stages orphan marker before file build.
- metadata txn publishes new intervals/stats + delete intents + clears marker.
- old files are then physically deleted through obsolete-file pipeline.
Scavenge:
- scans loaded bucket page tables in bounded batches.
- skips deleting/dropped/vacuuming buckets.
- enforces per-tick compaction quota to cap IO impact.
Manual vacuum APIs:
- `vacuum_bucket(name)`
  - serialized per bucket using `vacuum_inflight`/`vacuum_epoch`.
  - returns `VacuumStats { scanned, compacted }`.
- `vacuum_meta()`
  - compacts the manifest btree (`btree_store::compact`).
Injection point:
- `Options::observer: Arc<dyn Observer>` (`NoopObserver` by default).
API shape:
- `counter(CounterMetric, delta)`
- `gauge(GaugeMetric, value)`
- `histogram(HistogramMetric, value)`
- `event(ObserveEvent)`
Built-in helper:
- `InMemoryObserver` with fixed arrays and bounded FIFO events.
Latency sampling:
- `LATENCY_SAMPLE_SHIFT = 6` (1/64) for hot metrics such as:
  - txn commit/rollback latency,
  - tree link lock-hold latency,
  - WAL sync latency.
Low-frequency recovery/GC metrics are reported without sampling.
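A shift of 6 implies a cheap mask test rather than a division. The sketch below shows one common way to implement 1-in-64 sampling; the counter source (`op_seq` here) is an assumption, not the repository's actual mechanism.

```rust
/// Illustrative 1-in-64 sampling using LATENCY_SAMPLE_SHIFT = 6:
/// a single AND against a mask decides whether a hot-path sample
/// is recorded, keeping the unsampled path nearly free.
const LATENCY_SAMPLE_SHIFT: u32 = 6;
const SAMPLE_MASK: u64 = (1 << LATENCY_SAMPLE_SHIFT) - 1; // 63

fn should_sample(op_seq: u64) -> bool {
    op_seq & SAMPLE_MASK == 0
}

fn main() {
    let sampled = (0u64..640).filter(|&n| should_sample(n)).count();
    assert_eq!(sampled, 10); // exactly 1/64 of 640 operations
}
```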
- Data/blob files referenced by metadata must be written before metadata commit.
- Orphan marker stage/clear is the only startup orphan cleanup source of truth.
- Checkpoint epoch cut must atomically rotate hot/sealed generation handles.
- Writers must not straddle the checkpoint cut (`epochs.gate` + inflight wait-zero).
- Sealed generations must be uniquely owned when a checkpoint finishes.
- Reachable addresses (including sibling/remote-referenced old pages) must stay readable from memory until they are either durably published or proven unreachable by lifecycle closure.
- Merge publish must apply `replace` and child `mark_unmap` in the same epoch commit, so checkpoint snapshot never classifies one addr as both dirty root and newly retired junk.
- Bucket durable frontier must advance monotonically per group.
- Map publish and bucket frontier publish must be in the same manifest txn.
- Recovery correctness must be gated by bucket frontier, not by group checkpoint alone.
- Recovery analyze must ignore updates already <= durable frontier.
- Bucket delete is two-phase (`pending_del`), and `nr_buckets` decrements only after physical cleanup.
- WAL recycling must stay below min(active-txn boundary, last checkpoint boundary).
- Foreground admission wait must happen before entering the `tree.update` critical mutation path.
Invariant-break consequences to watch for:
- `NotFound`/interval miss on addresses that should still be reachable.
- snapshot/read visibility regression under concurrent checkpoint (false key loss).
- frontier/map correctness no longer matching actual durability boundary.
- `./scripts/prod_test.sh fast|stress|chaos|all`
- `./scripts/perf_gate.sh snapshot|compare`
- tuning defaults in `scripts/perf_thresholds.env`
- CI wiring in `.github/workflows/ci.yml`