Skip to content
Permalink

Comparing changes

Choose two branches to see what’s changed or to start a new pull request. If you need to, you can also or learn more about diff comparisons.

Open a pull request

Create a new pull request by comparing changes across two branches. If you need to, you can also . Learn more about diff comparisons here.
base repository: ModernRelay/omnigraph
Failed to load repositories. Confirm that selected base ref is valid, then try again.
Loading
base: main@{1day}
Choose a base ref
...
head repository: ModernRelay/omnigraph
Failed to load repositories. Confirm that selected head ref is valid, then try again.
Loading
compare: main
Choose a head ref
  • 9 commits
  • 34 files changed
  • 3 contributors

Commits on Jun 12, 2026

  1. test(cli): the embedded/remote parity matrix (RFC-009 Phase 1)

    The referee before any unification moves: every forked verb runs once
    against the local graph and once against a spawned server on a twin copy
    of the same fixture, with the SAME actor (--as locally; bearer-resolved
    remotely) and the SAME Cedar bundle on both arms — like-for-like
    enforcement is part of the harness (a tokens-only server is default-deny
    by design; comparing that against a bare local arm measures
    configuration, not the fork). Declared-volatile fields (ids, wall-clock,
    transport locations) scrub to placeholders; everything else must match
    exactly, and exit codes must match for shared failures.
    
    Headline result: 11 rows green with an EMPTY divergence ledger — the
    arms agree on every verb today. The ledger (KNOWN_DIVERGENCES) exists so
    any future divergence is pinned or filed, never silently repaired;
    repairs are Phase 3's job, gated by this referee staying green.
    
    One engine observation surfaced and filed (#207): inline execution with
    a declared-but-unbound param matches ALL rows on both arms, while the
    stored-query invoke path hard-errors — a cross-path asymmetry the matrix
    pins as agreeing behavior pending a deliberate fix. Documented
    exclusions (graphs list, ingest/load-over-/ingest, storage-plane verbs)
    map to RFC-009 Phases 4-5.
    
    Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
    aaltshuler and claude committed Jun 12, 2026
    Configuration menu
    Copy the full SHA
    08c9b03 View commit details
    Browse the repository at this point in the history

Commits on Jun 13, 2026

  1. Recovery liveness, storage fault-injection matrix, and one storage im…

    …plementation over object_store (#203)
    
    * test(engine): pin the long-lived-handle heal contract for sidecar-covered drift
    
    A Phase B -> Phase C failure (commit_staged advanced Lance HEAD, manifest
    publish did not land, recovery sidecar persists) currently wedges every
    subsequent staged write on the same engine handle: the commit-time drift
    guard rejects with 'run omnigraph repair', but repair itself refuses
    while a recovery sidecar is pending, so a long-lived server can only
    recover by restart. The documented contract (writes.md 'Long-running
    servers', invariants.md invariant 5) says refresh-time roll-forward
    closes this residual without restart -- but no write path runs it.
    
    Two red tests pin the intended contract at the write entry points:
    a follow-up load (the POST /ingest shape: shared handle, no reopen)
    and a follow-up mutation must heal roll-forward-eligible sidecars
    in-process and then succeed.
    
    Currently failing with:
      table 'node:Company' has Lance HEAD version 2 ahead of manifest
      version 1; run `omnigraph repair` before writing
    
    The fix lands in the next commit.
    
    * fix(engine): heal pending recovery sidecars at the staged-write entry points
    
    Close the long-lived-process gap in the recovery protocol: a Phase B ->
    Phase C residual (per-table commit_staged landed, manifest publish did
    not, sidecar persists) previously recovered only at the next ReadWrite
    open or via an explicit refresh() that no production write path called,
    so a long-lived server wedged every subsequent write on the commit-time
    drift guard until restart.
    
    New recovery::heal_pending_sidecars_roll_forward:
    - one list_dir of __recovery/ at write entry (empty -> immediate
      return, the steady state), so the per-write cost is one storage list;
    - per sidecar, acquires the same per-(table_key, table_branch) write
      queues every sidecar writer holds from before write_sidecar until
      after delete_sidecar, then re-checks sidecar existence -- this
      serializes the heal against live writers instead of rolling an
      in-flight sidecar forward from under its writer (which would fail
      that writer's publish CAS spuriously). Lock order queues ->
      coordinator matches every writer's commit->publish path. This is the
      queue-acquisition design recovery.rs and write_queue.rs already
      documented for in-process recovery;
    - processes in RollForwardOnly mode: the common residual rolls forward
      in-process; rollback-eligible sidecars still defer to the next
      ReadWrite open (Dataset::restore is unsafe under concurrency).
    
    Wire it into load_as and mutate_as (before the inline delete path can
    advance any HEAD), and rebase Omnigraph::refresh onto the same helper
    so refresh stops racing live writers' sidecars.
    
    The maintenance entry points (apply_schema_as, branch_merge_as,
    ensure_indices) intentionally keep their strict fail-loud preconditions
    for now; wiring the same heal there is a follow-up with its own tests.
    
    Turns the previous commit's two red tests green.
    
    * fix(engine): name the right recovery path in the commit-time drift guard
    
    The drift guard's 'run omnigraph repair before writing' advice is a
    dead end when the drift is covered by a pending recovery sidecar:
    repair refuses while a sidecar is pending. With the write-entry heal in
    place, reaching this guard with sidecar-covered drift means the heal
    deferred it (rollback-eligible), and the actual recovery path is a
    read-write reopen. Distinguish the two classes on the error path only
    (one sidecar list, after the conflict is already certain); a listing
    failure falls back to the uncovered-drift wording rather than masking
    the conflict.
    
    Pinned by extending refresh_defers_rollback_eligible_sidecar_to_next_open
    with a write attempt against the deferred sidecar.
    
    * docs: write-entry in-process sidecar heal — contract and coverage
    
    Update the recovery contract docs to match the previous two commits:
    invariant 5 now states that the staged-write entry points and refresh
    run in-process roll-forward recovery (long-lived processes converge on
    the next write, not at restart); writes.md 'Long-running servers'
    describes the heal's queue-acquisition concurrency contract, the
    improved drift-guard error, and the entry points that intentionally do
    not heal yet; testing.md indexes the new failpoint tests; AGENTS.md
    capability matrix drops the claim that in-process recovery is entirely
    future work (only the rollback path remains with the background
    reconciler).
    
    * test(engine): pin the entry heal contract for schema apply and branch merge
    
    Without the write-entry heal, the two maintenance writers do worse than
    wedge on sidecar-covered drift -- they proceed and decide its fate
    implicitly:
    
    - schema apply re-plans table rewrites from the manifest pin, orphaning
      the drifted Phase-B commit (its rows silently vanish from the
      rewritten table) while the stale sidecar lingers to misclassify
      against the post-apply pins;
    - branch merge publishes over the drift, making the failed writer's
      commit visible as an unattributed side effect (no recovery audit
      row), and leaves the stale sidecar behind.
    
    Two red tests pin the intended contract: both entry points heal the
    sidecar first (attributed roll-forward), then run on the converged
    state. Currently failing on the stale-sidecar / dropped-rows
    assertions; the fix lands in the next commit.
    
    * fix(engine): heal pending recovery sidecars at the schema-apply and branch-merge entries
    
    Extend the write-entry heal to the remaining two write entry points.
    Unlike load/mutate (which wedge on the drift guard), these proceeded
    over sidecar-covered drift and decided its fate implicitly:
    
    - schema apply re-planned table rewrites from the manifest pin,
      orphaning the drifted Phase-B commit -- its rows silently vanished
      from the rewritten table -- while the stale sidecar lingered to
      misclassify against the post-apply pins;
    - branch merge published over the drift, making the failed writer's
      commit visible without a recovery audit row, and left the stale
      sidecar behind.
    
    Both now run the same queue-serialized roll-forward heal at entry,
    before their own sidecar exists, so recovery is attributed (audit row)
    and deterministic. ensure_indices stays heal-free: it runs inside the
    load / schema-apply flows after their entry heal.
    
    Turns the previous commit's two red tests green. Docs updated in the
    same change (invariant 5, writes.md, testing.md, AGENTS.md).
    
    * test(engine): pin Phase A sidecar-write failure semantics
    
    Storage fault-injection matrix, row 1: a sidecar PUT failure (S3
    PutObject / fs write) in Phase A. New failpoint recovery.sidecar_write
    at the top of write_sidecar -- the single choke point all five sidecar
    writers go through -- models the storage error backend-generically.
    
    Also adds the other three storage-fault failpoints used by the
    following commits (recovery.sidecar_delete, recovery.sidecar_list,
    recovery.record_audit); each is a no-op without the failpoints feature.
    
    Pinned contract: every writer writes its sidecar BEFORE its first
    HEAD-advancing commit, so a put failure aborts with zero drift (no
    sidecar, Lance HEAD == manifest pin, no rows) and a transient fault
    never wedges the graph -- the same handle writes/merges normally once
    it clears. Covered for load (the staging writer) and branch_merge (the
    multi-table writer, forced onto the RewriteMerged path by diverging
    both sides).
    
    * test(engine): pin Phase D delete, list, and audit-append storage-fault semantics
    
    Storage fault-injection matrix, rows 2/3/5, plus the real-backend run:
    
    - recovery.sidecar_delete: a Phase D delete failure (S3 DeleteObject)
      must NOT fail the user's write -- the manifest publish already
      landed, so the caller's data is durable. The swallowed failure
      leaves a stale sidecar; the next write's entry heal consumes it via
      the stale-sidecar audit-recovery path (RolledForward, attributed).
    
    - recovery.sidecar_list: a __recovery/ list failure (S3 ListObjectsV2)
      is loud at every consumer -- the write-entry heal fails the write
      and the open-time sweep fails the open. Silently skipping recovery
      over a pending sidecar would be consumer tolerance of drift. Once
      the fault clears, open recovers the pending sidecar normally.
    
    - recovery.record_audit: an audit write failure after the
      roll-forward's manifest publish aborts that recovery attempt and
      keeps the sidecar; re-entry detects the already-published manifest,
      records exactly ONE RolledForward audit row, and converges -- the
      retry tolerance documented on record_audit, exercised end-to-end.
    
    - s3_load_recovers_after_publisher_failure_without_reopen: the
      same-handle heal scenario on a real bucket (gated on
      OMNIGRAPH_S3_TEST_BUCKET, skips locally), exercising sidecar
      put/list/delete through S3StorageAdapter instead of the local-FS
      adapter. CI wiring lands in a follow-up commit.
    
    * test(engine): refuse corrupt recovery sidecars loudly
    
    Storage fault-injection matrix, row 4 (no failpoint needed -- the
    corrupt file is written by hand, sibling to the unknown-schema-version
    refusal test): a truncated/garbage __recovery/{ulid}.json must be
    refused loudly by both the write-entry heal (the write fails naming
    the parse error) and the open-time sweep (ReadWrite open fails naming
    the file), with the file left on disk for operator inspection.
    Read-only opens still work -- the sweep is skipped there.
    
    * test(engine): run the S3 sidecar-lifecycle coverage in CI + document the fault matrix
    
    - ci.yml rustfs_integration: new step running the bucket-gated
      failpoints tests (name filter s3_) against the RustFS container, so
      sidecar put/list/delete are exercised through S3StorageAdapter on
      every storage-affecting PR.
    - writes.md: sidecar I/O failure semantics -- Phase A put failure
      aborts with zero drift; Phase D delete failure is swallowed (write
      already durable) and healed by the next write; list failures are
      loud at heal and open; corrupt sidecars are refused with the file
      kept for inspection; audit-append failures are retried to exactly
      one audit row.
    - testing.md: index the storage-fault matrix in the failpoints.rs row
      and the new RustFS CI line.
    
    * test(engine): pin read-visibility of acknowledged local if-absent writes
    
    The cluster lib test import_missing_state_creates_state_with_graph_-
    observation flakes at ~50% under full-workspace load ('EOF while
    parsing a value' reading back the state.json its own import just
    acknowledged). Root cause is in the engine's local storage adapter:
    write_text_if_absent writes through a buffered tokio::fs::File and
    returns when write_all resolves -- which, per tokio's documented File
    semantics, means the bytes reached tokio's internal buffer, not the
    file. The actual write completes in a background blocking task after
    drop, so a caller that acknowledges success and reads the object back
    can see an empty or partial file. Under load the window widens; the
    red run fails at iteration 0 with 0 of 8192 bytes on disk.
    
    The regression test pins the contract at the adapter boundary: when
    write_text_if_absent resolves, the full contents are visible to any
    reader; a losing second claim leaves the winner's object untouched.
    
    The fix lands in the next commit.
    
    * fix(engine): publish local storage writes with atomic visibility
    
    Close the class, not the instance. The local adapter admitted three
    ways for a reader to observe a write that was acknowledged or visible
    before its bytes were complete:
    
    1. write_text_if_absent acknowledged success when the buffered
       tokio::fs::File write_all resolved -- i.e. when the bytes reached
       tokio's internal buffer, not the file. A caller reading back its own
       acknowledged write could see an empty object (the ~50% cluster
       import flake under full-workspace load; the regression test failed
       at iteration 0 with 0 of 8192 bytes visible).
    2. The same call published its CLAIM (create_new) before its CONTENT,
       so concurrent readers saw an empty claimed file in the window.
    3. write_text (plain tokio::fs::write) exposed truncated content
       mid-replace -- silently falsifying write_sidecar's 'readers either
       see the complete sidecar or none' contract on local FS (true on S3,
       where PutObject is atomic).
    
    A flush in write_text_if_absent would have fixed only (1). Instead,
    both local write paths now publish complete temp files atomically:
    rename for replace (write_text -- the idiom write_text_if_match
    already used) and hard_link for no-replace (write_text_if_absent --
    link fails AlreadyExists, so exactly one of N concurrent claimants
    wins and the winner's object is fully readable at the instant it
    becomes visible). The local adapter now honors the same object-level
    atomic-visibility contract as the S3 adapter, which is what every
    caller (recovery sidecar protocol, cluster state CAS) was written
    against. Crash-orphaned *.tmp.* files are inert: the sidecar sweep
    filters to .json, and cluster state reads address state.json by name.
    
    fsync/durability policy is unchanged (no fsync before, none now);
    this fix is about visibility ordering, not power-loss durability.
    
    Pre-existing on main (landed with the multi-graph server mode change,
    PR #119); surfaced by this branch's heal work only because one extra
    list_dir per write shifted test timing. Cluster lib suite: 12/25
    failures before, 0/25 after. Turns the previous commit's red test
    green.
    
    * refactor(engine): one storage implementation over object_store for every backend
    
    Collapse LocalStorageAdapter (hand-rolled tokio::fs) and
    S3StorageAdapter into a single ObjectStorageAdapter backed by
    Arc<dyn object_store::ObjectStore> -- LocalFileSystem for local URIs,
    the existing AmazonS3 build for s3://, plus a pub in_memory()
    constructor (full contract including TRUE conditional updates; the
    in-memory test backend testing.md asked for at the adapter level).
    
    Why: the acknowledged-before-visible bug showed the two-impl shape has
    no referee -- one prose contract, two independent answers. Upstream
    LocalFileSystem::put_opts is byte-for-byte the staged-temp+rename/
    hard_link idiom that fix converged on, and Lance's own commit protocol
    is built on the same primitives (put-if-not-exists / rename-if-not-
    exists), so the substrate-aligned move is to stop hand-rolling it.
    The per-backend residue shrinks to a UriCodec (URI <-> object path)
    and one capability flag.
    
    Semantics preserved by construction, with three deliberate deltas:
    - exists() is now object-store-semantics everywhere (head + non-empty
      prefix fallback): an EMPTY local directory no longer 'exists'. The
      only dir-shaped caller (_graph_commits.lance probes) self-heals via
      ensure_commit_graph_initialized where it previously wedged loudly.
    - A directory at an object path reads as NotFound, not as an IO error
      ('only objects exist'). The cluster unreadable-payload test used a
      same-named directory as a portable non-NotFound trigger; it now uses
      chmod 000, which still models genuine transient IO.
    - write_text_if_match keeps content-token semantics on local
      (PutMode::Update is NotImplemented upstream for LocalFileSystem in
      0.12.5 and 0.13.2); the capability flag gates the token SOURCE in
      read_text_versioned too -- an ETag token with content-compare writes
      would lose every CAS.
    
    delete_prefix keeps a local remove_dir_all branch: directories are a
    local-FS concept, and list+delete would leave empty skeletons that
    cluster graph_root_exists (raw Path::exists) reports as still present.
    
    LocalStorageAdapter remains as a delegating shim so the pinned
    contract tests gate this swap textually unchanged; the shim and the
    test parameterization over local + in-memory land next. Cargo gains
    the explicit 'fs' feature (already transitively enabled by lance).
    
    * test(engine): one executable storage contract, run against every backend
    
    Remove the LocalStorageAdapter delegation shim and migrate its
    construction sites to ObjectStorageAdapter::local(). Replace the
    per-backend duplicated tests with a single contract_suite asserting
    the trait's promises (atomic replace, exists incl. the dataset-root
    prefix probe, one-winner if_absent, versioned CAS with loud CAS-lost,
    rename, list round-trip with no sibling-prefix bleed, idempotent
    delete/delete_prefix), run against the local backend and the new
    in-memory backend -- which implements true conditional updates, so the
    strong-CAS path is exercised without a bucket. The bucket-gated S3
    variant already exists (s3_adapter_conditional_writes_contract).
    
    New local-specific pins for the deliberate semantic edges of the
    collapse: empty directories are not objects (exists=false; the Lance
    dataset-root probe shape is the non-empty case), file://-anchored and
    spaces-in-path list output round-trips byte-identically into
    read_text, dot-segment paths are lexically absolutized (the CLI's
    ./graph.omni shape), and upstream rename creating missing destination
    parents. The acknowledged-write visibility regression test stays, now
    documenting that the cross-API std::fs read-back is the point.
    
    * refactor(cluster): drop put_json's per-backend atomicity branch
    
    The local temp+rename dance predates the storage adapter guaranteeing
    atomic visibility; now that write_text publishes via a staged temp +
    rename on the filesystem (and a single atomic PUT on object stores) by
    contract, the branch duplicated upstream behavior. One call, both
    backends.
    
    * docs: storage adapter collapse — contract, in-memory backend, local CAS gap
    
    - testing.md: the 'no MemStorage backend' note is half-closed —
      ObjectStorageAdapter::in_memory() covers the text-object layer with
      the full contract (true conditional updates); Lance datasets bypass
      the adapter, so the engine substrate ask stays open.
    - invariants.md: truth-matrix Tests row updated; new Known Gap for
      local write_text_if_match (upstream PutMode::Update is unimplemented
      for LocalFileSystem; content-token emulation is safe only under the
      cluster lock protocol — close before admitting a lock-free caller).
    - writes.md: backend notes for the unified adapter (name#N staging
      residue invisible to the sweep, backend-wrapped error text with
      exists()-probing for missing-vs-error, loud permission failures).
    
    * docs: finish renaming the storage adapters in user docs and test comments
    
    storage.md's URI-scheme table and the S3 failpoint test's doc comment
    still named the deleted LocalStorageAdapter/S3StorageAdapter; both now
    describe the unified ObjectStorageAdapter over object_store, including
    the relative-path absolutization note for local URIs.
    
    * test(engine): pin branch-awareness of the drift guard's recovery advice
    
    A pending sidecar on ANOTHER branch does not cover this branch's
    drift: with a deferred feature-branch sidecar on disk and genuinely
    uncovered drift on main, the main write's error must still point at
    omnigraph repair -- a read-write reopen recovers the sidecar but
    cannot repair main's uncovered drift. Currently red: the guard
    matches sidecar pins by table_key only, so the feature sidecar flips
    main's advice to the reopen path. Fix in the next commit.
    
    Surfaced by external review of the drift-guard change.
    
    * fix(engine): branch-aware sidecar matching in the drift guard's advice
    
    The commit-time drift guard's sidecar-covered check matched pins by
    table_key alone, so a pending sidecar on another branch flipped this
    branch's uncovered-drift advice from 'run omnigraph repair' to the
    reopen path -- and a reopen recovers that sidecar but cannot repair
    this branch's drift. Compare the pin's table_branch too. Turns the
    previous commit's red test green.
    
    Surfaced by external review of the drift-guard change.
    
    * test(engine): pin heal non-interference with a live schema apply
    
    The write-entry heal's schema-staging reconcile runs before any queue
    acquisition, so a load on the same handle, overlapping a schema apply
    parked between its staging write and manifest commit, promotes the
    apply's staging files (new catalog live against the old manifest),
    classifies the LIVE apply's sidecar, and publishes its registrations
    out from under it. The resumed apply then collides with its own stolen
    commit. Currently red with:
    
      Lance("Concurrent modification: table version 3 already exists for
      node:Tag")
    
    The fix (per-sidecar reconcile under the sidecar's write-queue guards,
    plus a serialization key the schema-apply writer and the heal both
    acquire) lands in the next commit.
    
    Surfaced by external review of the write-entry heal.
    
    * fix(engine): serialize the heal's schema-staging reconcile with live schema applies
    
    The write-entry heal ran recover_schema_state_files up front, before
    acquiring any queue guards. Overlapping a live schema apply parked
    between its staging write and manifest commit, the heal promoted the
    apply's staging files (new catalog live against the old manifest),
    classified the LIVE apply's sidecar, and published its registrations —
    the resumed apply then collided with its own stolen commit.
    
    Correct by construction:
    
    - New schema-apply serialization queue key, acquired by the schema-
      apply writer (alongside its per-table keys) from before write_sidecar
      until after delete_sidecar. Per-table keys alone don't cover a
      registration-only migration, which pins no existing tables but has a
      sidecar and staging files on disk.
    - The heal reconciles schema staging lazily, PER SchemaApply sidecar,
      after acquiring that sidecar's guards (including the serialization
      key) and re-confirming the sidecar exists — a sidecar that survives
      the queue wait belongs to a dead writer, so the reconcile can no
      longer race a live apply. Recomputing per sidecar also removes the
      staleness of one up-front result across a multi-sidecar pass.
    - Omnigraph::refresh drops its up-front reconcile-and-pass-through
      (same race, and a pre-promoted result would make the heal's guarded
      reconcile see clean staging and wrongly defer the sidecar): it now
      reconciles standalone only when NO sidecar exists — which cannot
      race a live apply, whose sidecar always precedes its staging files —
      and otherwise defers entirely to the heal.
    
    The open-time sweep keeps its precomputed reconcile: open has no
    concurrent writers. Turns the previous commit's red test green.
    
    Surfaced by external review of the write-entry heal.
    
    Self-audit addendum folded in: refresh's no-sidecar gate had a TOCTOU
    (a live apply could write its sidecar + staging between the empty
    check and the reconcile) — the standalone reconcile now holds the
    serialization key across the list-then-reconcile pair. The remaining
    residual is cross-process only (in-process queues cannot serialize
    against a writer in another process; the open-time sweep has the same
    pre-existing exposure) and is now an explicit Known Gap in
    invariants.md rather than an implicit one.
    
    * test(engine): pin catalog reload after the heal recovers a schema apply
    
    When the write-entry heal rolls a crashed apply's SchemaApply sidecar
    forward on the same handle, disk and manifest move to the new schema
    (staging promoted, registrations published) but the handle's in-memory
    schema_source/catalog do not. Subsequent writes then validate against
    the stale catalog and reject rows of types the graph already has.
    Currently red with:
    
      record 1: unknown node type 'Tag'
    
    refresh() reloads after its heal; the write entry points must too.
    Fix in the next commit.
    
    Surfaced by external review of the write-entry heal.
    
    * fix(engine): reload the in-memory catalog after the heal recovers a schema apply
    
    heal_pending_recovery_sidecars refreshed the coordinator and
    invalidated the runtime cache after processing sidecars, but never
    reloaded schema_source/catalog — so a write whose entry heal rolled a
    crashed SchemaApply sidecar forward proceeded to validate against the
    OLD schema while disk and manifest were already on the new one.
    reload_schema_if_source_changed is the same post-heal step refresh()
    already runs; it no-ops on the (overwhelmingly common) non-schema heal
    because the on-disk source is unchanged. Turns the previous commit's
    red test green.
    
    Surfaced by external review of the write-entry heal.
    
    * test(engine): pin that a deleted-branch sidecar cannot wedge the graph
    
    A rollback-eligible sidecar pinned to a branch is deferred by every
    roll-forward-only pass; if the branch is then deleted, the sidecar
    survives, referencing a branch with no manifest tree. The heal (every
    write entry) and the open-time sweep (every ReadWrite open) both fail
    opening the dead branch, and repair refuses while a sidecar is pending
    -- a terminal read-only state with manual sidecar surgery as the only
    exit. Currently red with:
    
      Lance("Not found: .../__manifest/tree/feature/_versions")
    
    The branch's tree and forks are already reclaimed, so the pinned drift
    is unreachable and the sidecar is provably moot; the fix classifies it
    as an orphaned-branch terminal state (audit + discard) in both passes.
    
    Surfaced by review (P1, verified by repro).
    
    * fix(engine): classify deleted-branch sidecars as orphaned instead of wedging
    
    A deferred (rollback-eligible) sidecar pinned to a branch survives
    branch_delete; both the write-entry heal and the open-time sweep then
    failed unconditionally opening the dead branch -- every write and
    every ReadWrite open errored, and repair refuses while a sidecar
    pends. Terminal state, manual sidecar surgery the only exit.
    
    The branch's tree and per-table forks are already reclaimed at delete,
    so the drift the sidecar pins is unreachable and the sidecar is
    provably moot. Both passes now check the sidecar's branch against the
    manifest's branch list (the authority -- deliberately NOT inferred
    from a Not-found on open, which could be a transient storage error
    masking real recovery intent) and discard orphans with an
    OrphanedBranchDiscarded audit row, commit appended on main since the
    sidecar's own branch no longer has a commit graph.
    
    The open-time half is pre-existing; the write-entry heal made it hot.
    Turns the previous commit's red test green.
    
    Surfaced by review (P1, verified by repro).
    
    * chore: harden review nits — vacuous CI filter, root-runner skip, liveness note
    
    - ci.yml: the RustFS sidecar-lifecycle step now fails loudly if the
      's3_' name filter matches zero tests (cargo passes vacuously on an
      empty filter; the step exists specifically to prove S3 sidecar I/O
      coverage). The pre-existing CLI smoke step has the same shape and is
      left for a follow-up.
    - cluster unreadable-payload test: cfg(unix) + a skip-with-log when
      running as root (mode 000 is still readable to root, common in
      container dev runners), so the test degrades instead of failing.
    - refresh: document the one-pass-late convergence for legacy staging
      residue while non-SchemaApply sidecars pend, so nobody 'fixes' it by
      re-running the reconcile unserialized — the exact race the
      serialization key closes.
    
    * test(engine): pin orphan-discard idempotency across a delete fault
    
    discard_orphaned_branch_sidecar writes its audit row and main commit
    before deleting the sidecar; a Phase D delete fault leaves the sidecar
    on disk with the audit already durable, and the retry repeated the
    whole path -- a second OrphanedBranchDiscarded audit row (and commit)
    for the same operation. Currently red: 2 rows after one fault + retry.
    The retry must only finish the delete. Fix next.
    
    Also promotes the recovery-audit kinds reader into the shared test
    helpers (it was recovery.rs-local).
    
    Surfaced by external review of the orphan-discard fix.
    
    * fix(engine): orphan-discard idempotency + heal reports acted-vs-deferred
    
    Two review findings on the recovery surface:
    
    - discard_orphaned_branch_sidecar now checks the audit table for an
      existing (operation_id, OrphanedBranchDiscarded) row before appending
      the commit + audit pair, so a Phase D delete fault retries ONLY the
      delete instead of duplicating audit rows and commit-graph entries.
      Cold path: the list scan runs only when an orphaned sidecar exists.
      Turns the previous commit's red test green (exactly one audit row
      across fault + retry).
    
    - process_sidecar returns whether durable state changed; the heal sets
      processed_any only for sidecars that were actually rolled forward /
      rolled back / audit-recovered (orphan discards count). Deferred
      sidecars (rollback-eligible, invariant-violating, unpromoted
      SchemaApply) no longer trigger a per-write schema reload + full
      runtime-cache invalidation while they pend -- the cache is
      snapshot-keyed so this was waste, not corruption, but it was paid on
      every write until reopen. Acted-paths' processed=true remains pinned
      by load_after_schema_apply_phase_b_failure_uses_recovered_catalog
      (the reload depends on it).
    
    Surfaced by external review.
    
    * test(engine): pin the orphan-discard audit-append fault leg as documented tolerance
    
    The orphan discard's commit append and audit append are two writes; a
    failure between them leaves a recovery commit with no audit row, and
    the retry (keyed on the audit row, the operator-facing record) appends
    a second commit before the audit lands. This is the same
    not-atomic-pair-write tolerance record_audit documents and the
    manifest->commit-graph Known Gap covers for every publish: bounded
    commit-graph noise, audit row exactly-once under clean failures.
    Keying idempotency on commit rows instead would need an operation_id
    column on _graph_commits, and audit-before-commit would dangle the
    graph_commit_id join -- both worse than the documented residual.
    
    Make the tolerance explicit instead of implicit: docstring names the
    window, a failpoint sits inside it, and the new test pins convergence
    across the fault (sidecar consumed, exactly one audit row), completing
    the orphan-discard fault matrix alongside the delete-fault leg.
    
    Surfaced by external review of the orphan-discard idempotency.
    
    * test(engine): pin honest drift-guard advice when sidecar listing fails
    
    The guard's unwrap_or(false) conflated 'classified as uncovered' with
    'could not classify': a transient list fault on the guard's second
    list (the entry heal's first list having succeeded) confidently routed
    the operator to omnigraph repair even when the heal had just deferred
    a rollback-eligible sidecar -- and repair refuses while a sidecar is
    pending. Currently red: the error says 'run omnigraph repair' with no
    mention of the reopen path. The fix names both paths plus the failure
    cause when classification is impossible.
    
    Surfaced by external review of the drift-guard fallback.
    
    * fix(engine): admit ambiguity in the drift guard when sidecar listing fails
    
    Replace the unwrap_or(false) fallback with a tri-state: covered ->
    reopen advice; uncovered -> repair advice; listing FAILED -> say the
    drift could not be classified, name the cause, and give both paths in
    order ('run repair, or reopen read-write if repair reports a pending
    sidecar'). The old fallback confidently routed a transient list fault
    to repair, which refuses while a sidecar is pending -- a self-
    correcting but pointless detour. The conflict itself is still always
    raised; only the advice degrades honestly. Turns the previous commit's
    red test green.
    
    Surfaced by external review of the drift-guard fallback.
    ragnorc authored Jun 13, 2026
    Configuration menu
    Copy the full SHA
    446b46d View commit details
    Browse the repository at this point in the history
  2. Merge pull request #208 from ModernRelay/test/parity-matrix

    test(cli): embedded/remote parity matrix (RFC-009 Phase 1)
    aaltshuler authored Jun 13, 2026
    Configuration menu
    Copy the full SHA
    82dda29 View commit details
    Browse the repository at this point in the history
  3. refactor(api): extract omnigraph-api-types crate (RFC-009 Phase 2)

    The HTTP wire DTOs and their engine-result -> DTO mappings move from
    omnigraph-server's api module into a new omnigraph-api-types crate that
    both server and CLI can depend on (engine must not — DAG: api-types ->
    engine, never the reverse). The crate holds plain serde/utoipa types only;
    the transport-coupled error->status mapping stays in the server (lib.rs/
    handlers). The one server-runtime coupling (query_catalog_entry, which
    maps a StoredQuery — not a wire type) stays behind in api.rs, now calling
    the crate's pub param_descriptor.
    
    api.rs becomes a thin `pub use omnigraph_api_types::*` re-export, so every
    omnigraph_server::api::Foo path (handlers, the OpenApi schema list, CLI
    imports) resolves unchanged. openapi.json regenerates BYTE-IDENTICAL (the
    Phase-2 referee: 77 openapi tests green, zero diff).
    
    Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
    aaltshuler and claude committed Jun 13, 2026
    Configuration menu
    Copy the full SHA
    4821e72 View commit details
    Browse the repository at this point in the history
  4. refactor(cli): consume omnigraph-api-types directly; unify the load m…

    …apping
    
    The CLI's wire-DTO imports repoint from omnigraph_server::api to
    omnigraph-api-types (the server's other exports — queries registry,
    config types — still come from omnigraph-server). The local Load arm's
    inline LoadOutput hand-construction in main.rs is extracted into
    load_output_from_result next to load_output_from_tables in output.rs, so
    both '-> LoadOutput' mappings (engine LoadResult for local, wire
    IngestOutput for remote) live in one place.
    
    Deviation from the plan, with reason: LoadOutput stays CLI-side rather
    than moving into the wire-DTO crate — it is a rendered CLI output type,
    not an HTTP wire DTO, and its mapping consumes a CLI clap type
    (CliLoadMode). The shared crate stays strictly wire DTOs. Shapes
    unchanged: the parity matrix passes textually unchanged.
    
    Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
    aaltshuler and claude committed Jun 13, 2026
    Configuration menu
    Copy the full SHA
    adbb2a1 View commit details
    Browse the repository at this point in the history
  5. docs: omnigraph-api-types in the crate list; RFC-009 Phase 2 outcome

    Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
    aaltshuler and claude committed Jun 13, 2026
    Configuration menu
    Copy the full SHA
    3e2502c View commit details
    Browse the repository at this point in the history
  6. Merge pull request #209 from ModernRelay/refactor/api-types-crate

    refactor(api): extract omnigraph-api-types crate (RFC-009 Phase 2)
    aaltshuler authored Jun 13, 2026
    Configuration menu
    Copy the full SHA
    e4334de View commit details
    Browse the repository at this point in the history
  7. refactor(cli): GraphClient enum + read verbs (RFC-009 Phase 3a)

    The embedded-vs-remote split gets one home: a GraphClient enum
    (Embedded { uri } | Remote { http, base_url, token }) with a resolve()
    factory that absorbs the shared preamble (apply_server_flag -> token ->
    URI/remoteness) and a verb method per command. The five uniform read
    forks — branch list, commit list, commit show, schema show, snapshot —
    collapse from per-command if-graph-is-remote else to one line each
    (main.rs: -113/+47). Behavior identical per verb (local reads still open
    WITHOUT policy, as today); the Phase-1 parity matrix is the referee and
    passes textually unchanged.
    
    Enum, not the RFC trait: only two variants ever, and inherent async
    methods avoid async_trait boxing and the apply_schema closure that is not
    object-safe (3b) — same one-body-two-impls collapse, less ceremony.
    
    Scope: the uniform reads only. The query verb (policy-open + operator-
    alias early-return + param merge) joins the write verbs in 3b;
    export/streaming and graphs-list in 3c, where the now-shared
    execute_*_remote/execute_* pairs get retired.
    
    Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
    aaltshuler and claude committed Jun 13, 2026
    Configuration menu
    Copy the full SHA
    25d74d6 View commit details
    Browse the repository at this point in the history
  8. Merge pull request #210 from ModernRelay/refactor/graph-client-reads

    refactor(cli): GraphClient enum + read verbs (RFC-009 Phase 3a)
    aaltshuler authored Jun 13, 2026
    Configuration menu
    Copy the full SHA
    7bfe9c6 View commit details
    Browse the repository at this point in the history
Loading