Skip to content

ONE-1361 index durable Dreamer milestone fallback#289

Merged
olety merged 3 commits into
mainfrom
codex/one-1361-index-dreamer-milestone-fallback
Jul 4, 2026
Merged

ONE-1361 index durable Dreamer milestone fallback#289
olety merged 3 commits into
mainfrom
codex/one-1361-index-dreamer-milestone-fallback

Conversation

@olety

@olety olety commented Jul 4, 2026

Copy link
Copy Markdown
Member

Linear: ONE-1361

Summary

  • add durable vault_meta index rows for Dreamer milestone fallback candidates keyed by job_id, milestone time, learned time, and claim id
  • maintain a claim-id forward row so claim overwrites, lifecycle changes, hard deletes, and soft erases invalidate stale fallback candidates
  • backfill the index once for legacy vault rows, then use per-job indexed lookup while preserving stale/malformed/non-active fallback skips

Tests

  • rtk proxy cargo test -p oneiron --features sync dreamer_durable_milestone -- --nocapture
  • rtk proxy cargo fmt --all --check
  • rtk proxy cargo clippy --workspace --all-targets --all-features -- -D warnings
  • rtk proxy cargo nextest run --workspace --all-features --profile full
  • rtk proxy cargo test --doc --workspace --exclude oneiron-bench --all-features
  • rtk proxy env RUSTDOCFLAGS="-D warnings" cargo doc --workspace --all-features --no-deps
  • rtk proxy cargo nextest run -p oneiron --features sync --profile full

Copilot AI review requested due to automatic review settings July 4, 2026 03:27
@coderabbitai

coderabbitai Bot commented Jul 4, 2026

Copy link
Copy Markdown

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 52 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7682d2f8-e000-4def-97be-7e14a17e2197

📥 Commits

Reviewing files that changed from the base of the PR and between 57080c1 and 741df8d.

📒 Files selected for processing (3)
  • crates/oneiron/src/batch.rs
  • crates/oneiron/src/dreamer_runner.rs
  • crates/oneiron/src/vault.rs
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/one-1361-index-dreamer-milestone-fallback

Comment @coderabbitai help to get the list of available commands.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@qodo-code-review

Copy link
Copy Markdown

PR Summary by Qodo

Index durable Dreamer milestone fallback candidates in vault_meta (ONE-1361)

✨ Enhancement 🐞 Bug fix 🧪 Tests 🕐 40+ Minutes

Grey Divider

AI Description

• Add a vault_meta-backed index for active/approved Dreamer milestone fallback claims per job.
• Maintain forward claim-id rows so overwrites, lifecycle changes, and deletes invalidate stale
 candidates.
• One-time backfill legacy vault rows, then use per-job indexed lookup to avoid full scans.
Diagram

graph TD
  A["Claim write path"] --> B["index_milestone_for_put()"] --> E[("vault_meta")]
  C["Claim delete/erase"] --> D["deindex_milestone_claim()"] --> E
  F["latest_durable_milestone()"] --> G["prefix scan (job_id)"] --> E --> H["validate against entity"] --> I[("entities")]
  subgraph Legend
    direction LR
    _flow["Function/Flow"] ~~~ _db[("DB (heed)")]
  end
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Store a single 'latest milestone per job' pointer row
  • ➕ O(1) lookup instead of iterating candidates
  • ➕ Less repeated claim decoding during lookup
  • ➖ More complex correctness: must handle out-of-order writes and tie-breaking on (at, learned_at, claim_id)
  • ➖ Invalidation on lifecycle changes may require re-scan or secondary structure anyway
2. Use a dedicated heed DB/table for the milestone index
  • ➕ Clear separation of concerns and potential for typed key/value codecs
  • ➕ Easier future evolution (versioning, compaction, stats)
  • ➖ More plumbing/migration surface than leveraging existing vault_meta
  • ➖ Still requires reverse mapping for invalidation
3. Persist decoded milestone fields alongside the index row (avoid re-decoding claim bodies)
  • ➕ Lookup could avoid reading entities/decode for most candidates
  • ➕ Lower CPU for frequent fallback reads
  • ➖ Risk of divergence from canonical claim record unless carefully invalidated
  • ➖ Needs additional compatibility/versioning handling for stored fields

Recommendation: The PR’s approach (vault_meta prefix-key index + forward claim-id mapping + canonical validation) is a good balance of performance and correctness: it narrows scans to a single job while still re-checking the source-of-truth claim record to safely skip stale/malformed entries. Consider the single-pointer optimization later only if prefix iteration becomes a hotspot, as it materially increases invalidation complexity.

Files changed (3) +362 / -50

Enhancement (2) +361 / -50
batch.rsIndex/deindex Dreamer milestone claims on claim writes and deletes +8/-0

Index/deindex Dreamer milestone claims on claim writes and deletes

• Captures decoded claim bodies during apply_put and indexes eligible Dreamer milestone claims into vault_meta after writing the entity. Also deindexes milestone claims during entity deindex/delete to prevent stale fallback candidates.

crates/oneiron/src/batch.rs

dreamer_runner.rsAdd vault_meta milestone index with one-time backfill and per-job lookup +353/-50

Add vault_meta milestone index with one-time backfill and per-job lookup

• Replaces full entity scans for durable milestone fallback with a per-job vault_meta prefix index keyed by job_id + milestone time + learned time + claim id, plus a forward claim-id mapping for invalidation. Adds one-time backfill guarded by a vault_meta marker and introduces tests covering indexed lookup behavior and invalidation on lifecycle changes and soft deletes.

crates/oneiron/src/dreamer_runner.rs

Bug fix (1) +1 / -0
vault.rsDeindex Dreamer milestone candidates when soft-erasing entities +1/-0

Deindex Dreamer milestone candidates when soft-erasing entities

• Ensures the Dreamer milestone fallback index is invalidated when a claim is soft-erased (payload truncated) so fallback lookups cannot return erased claims.

crates/oneiron/src/vault.rs

@qodo-code-review

qodo-code-review Bot commented Jul 4, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 📜 Skill insights (0)

Context used

Grey Divider


Remediation recommended

1. Bad index row breaks lookup ✓ Resolved 🐞 Bug ☼ Reliability
Description
latest_indexed_dreamer_milestone propagates decode_dreamer_milestone_candidate_key errors, so a
single malformed/corrupted candidate key aborts durable milestone lookup instead of being skipped.
This can surface as failures in progress_snapshot, which calls latest_durable_milestone when the
live ephemeral row is missing.
Code

crates/oneiron/src/dreamer_runner.rs[2143]

+    let (job_id, at, learned_at, claim_id) = decode_dreamer_milestone_candidate_key(key)?;
Evidence
The indexed lookup loop calls indexed_dreamer_milestone_if_current(...)?, which calls
decode_dreamer_milestone_candidate_key(key)?; any malformed key therefore returns an error from
the whole lookup. The decoder explicitly returns Error::CorruptedIndex when the key does not match
the pinned length/prefix, and progress_snapshot depends on this lookup when no ephemeral progress
row is usable.

crates/oneiron/src/dreamer_runner.rs[2119-2172]
crates/oneiron/src/dreamer_runner.rs[2276-2281]
crates/oneiron/src/dreamer_runner.rs[1593-1609]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The per-job indexed lookup fails hard if any candidate key in `vault_meta` is malformed (wrong length/prefix), because decode errors are propagated via `?`. This makes the fallback path fragile to on-disk corruption and turns what should be best-effort selection into a hard error.

## Issue Context
Candidate keys are iterated via `vault_meta.prefix_iter`, and each key is decoded by `decode_dreamer_milestone_candidate_key`, which returns `Error::CorruptedIndex` on unexpected shapes.

## Fix Focus Areas
- crates/oneiron/src/dreamer_runner.rs[2119-2172]
- crates/oneiron/src/dreamer_runner.rs[2276-2303]

### Suggested change
- In `indexed_dreamer_milestone_if_current`, convert decode failures into `Ok(None)` (optionally emitting a `tracing::warn!`/`debug!`), e.g. `let Ok((job_id, at, learned_at, claim_id)) = decode_... else { return Ok(None) };`.
- (Optional) Consider opportunistic cleanup: on decode failure, delete the malformed key in a write txn (if you can safely route that through an async/maintenance path).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Informational

2. Backfill rescans on contention ✓ Resolved 🐞 Bug ➹ Performance
Description
latest_durable_milestone checks the backfill marker in a read txn, then unconditionally runs a
full backfill scan in a write txn if absent, without re-checking after acquiring the writer lock.
Concurrent first-lookups can therefore perform redundant O(N) rescans and rewrites even after
another caller has already backfilled.
Code

crates/oneiron/src/dreamer_runner.rs[R1575-1589]

        let rtxn = self.vault.store.env.read_txn()?;
-        let mut latest: Option<DreamerDurableMilestone> = None;
-
-        for row in self.vault.store.entities.iter(&rtxn)? {
-            let (key, raw) = row?;
-            let header =
-                EntityMetadataHeader::parse(raw).ok_or(Error::CorruptedIndex("entity header"))?;
-            if header.entity_type != ENTITY_TYPE_CLAIM || raw.len() == ENTITY_METADATA_HEADER_LEN {
-                continue;
-            }
-            let body = crate::claim::decode_claim_body(&raw[ENTITY_METADATA_HEADER_LEN..], true)?;
-            if body.predicate != DREAMER_MILESTONE_PREDICATE
-                || body.approval != ClaimApprovalStatus::Approved
-                || body.lifecycle != ClaimLifecycleStatus::Active
-                || body.stale
-            {
-                continue;
-            }
-            let (kind, at) = match decode_milestone_value_for_job(&body.value, job_id) {
-                Ok(Some(decoded)) => decoded,
-                Ok(None) | Err(_) => continue,
-            };
-
-            let key_bytes: [u8; 16] = key.try_into().map_err(|_| Error::InvalidKey)?;
-            let milestone = DreamerDurableMilestone {
-                claim_id: EntityId::from_bytes(key_bytes)?,
-                job_id,
-                kind,
-                at,
-                learned_at: header.learned_at,
-            };
-
-            if latest.as_ref().is_none_or(|current| {
-                (milestone.at, milestone.learned_at, milestone.claim_id)
-                    > (current.at, current.learned_at, current.claim_id)
-            }) {
-                latest = Some(milestone);
-            }
+        if self
+            .vault
+            .store
+            .vault_meta
+            .get(&rtxn, DREAMER_MILESTONE_INDEX_BACKFILLED_KEY)?
+            .is_some()
+        {
+            return latest_indexed_dreamer_milestone(&self.vault.store, &rtxn, job_id);
        }
+        drop(rtxn);

+        let mut wtxn = self.vault.store.env.write_txn()?;
+        let latest = backfill_dreamer_milestone_index(&self.vault.store, &mut wtxn, job_id)?;
+        wtxn.commit()?;
Evidence
The code only checks DREAMER_MILESTONE_INDEX_BACKFILLED_KEY before dropping the read txn; if it
was absent at that time it always proceeds to backfill in a write txn. There is no second check
inside the write txn to short-circuit if another thread backfilled in the meantime.

crates/oneiron/src/dreamer_runner.rs[1571-1590]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The one-time backfill path can do redundant full rescans under concurrent callers: a second thread can observe "not backfilled" in a read txn, block on the LMDB writer lock until the first backfill commits, then still rescan because there is no marker re-check inside the write txn.

## Issue Context
LMDB allows only one writer at a time, so contention is likely serialized; however, after the first writer commits, later callers can avoid repeating the expensive scan by re-checking the marker before starting backfill.

## Fix Focus Areas
- crates/oneiron/src/dreamer_runner.rs[1571-1590]

### Suggested change
- After acquiring `wtxn`, re-check `vault_meta.get(&*wtxn, DREAMER_MILESTONE_INDEX_BACKFILLED_KEY)`.
 - If present, avoid calling `backfill_dreamer_milestone_index` and return the indexed result instead.
 - Implementation options:
   - Change `latest_indexed_dreamer_milestone` to accept a generic read txn (or a `&heed::RwTxn` for reads) so you can reuse `wtxn` for the read.
   - Or drop `wtxn` and open a fresh read txn to call `latest_indexed_dreamer_milestone`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread crates/oneiron/src/dreamer_runner.rs Outdated
Comment thread crates/oneiron/src/dreamer_runner.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5f08b1cfd4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/oneiron/src/dreamer_runner.rs Outdated
Copilot AI review requested due to automatic review settings July 4, 2026 03:47

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@olety olety merged commit 06fcf3d into main Jul 4, 2026
8 of 9 checks passed
@olety olety deleted the codex/one-1361-index-dreamer-milestone-fallback branch July 4, 2026 03:52

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 741df8d34e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +2169 to +2170
let Ok(body) = crate::claim::decode_claim_body(&raw[ENTITY_METADATA_HEADER_LEN..], true) else {
return Ok(None);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail closed on corrupt indexed milestone claims

When the global backfill marker is present, latest_durable_milestone reaches this indexed path instead of the old full scan. If an indexed milestone claim row has a malformed body, this Ok(None) silently drops it, so callers such as progress_snapshot can report an older or empty durable milestone instead of surfacing the corruption as the previous decode_claim_body(...)? scan did. This only needs to ignore stale/malformed index keys; decode/header errors for the referenced claim row should remain fail-closed.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants