This guide walks through your first Memory Layer evaluation. It is written for someone who wants practical proof that memory is helping an agent, without starting from the full command reference.
An evaluation answers a simple question: does the same task work better with Memory enabled than without it? The useful evidence comes from paired runs. Run the same suite under two conditions, compare item by item, then inspect the quality, cost, and latency differences.
For the full command reference, see memory eval.
You need:
- a running Memory Layer service for non-dry-run evaluations
- project memory to search against
- an LLM provider configured when using --profile llm
- PostgreSQL with pgvector when testing semantic retrieval
If you are running from an installed package, use commands like this:
memory eval doctor --suite evals/examples/memory-smoke --text
If you are developing inside this repository, use the dev binary instead:
cargo run --bin memory -- eval doctor --suite evals/examples/memory-smoke --text
An eval suite is a directory with two files:
- suite.toml: suite name, project, profile defaults, repeat count, and label status
- items.jsonl: one evaluation item per line
The main item types are:
- retrieval_qa: checks whether the right memories are returned
- grounded_answer: checks whether an answer includes required facts and avoids forbidden claims
- resume_quality: checks whether a get-up-to-speed briefing covers the right topics
- command_task: checks whether a command succeeds
- agent_build_task: copies a fixture app or website, lets an agent modify it, then scores the finished workspace
- agent_build_sequence: runs many ordered agent-building steps against the same copied workspace and scores the accumulated app
Start with the checked-in smoke suite:
evals/examples/memory-smoke/
It is intentionally tiny. Use it to learn the workflow, not to make quality claims.
Run doctor before expensive runs:
memory eval doctor --suite evals/examples/memory-smoke --text
The doctor command checks that the suite can be parsed and that the environment
is ready for the selected work. Fix these problems before running an LLM-backed
evaluation:
- the service is unreachable
- the suite has invalid JSONL rows
- the suite is smaller than its configured min_items
- required provider or retrieval configuration is missing
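The JSONL and min_items checks listed above can be approximated by hand before invoking doctor. The following is a minimal sketch, assuming only that items.jsonl holds one JSON object per line. The function name `check_items`, its `min_items` argument, and the `id` field in the sample rows are hypothetical, and this is not how doctor itself is implemented.

```python
import json


def check_items(lines, min_items):
    """Return (valid_count, problems) for an iterable of JSONL lines.

    Hypothetical helper: it only checks that each non-blank line parses as a
    JSON object and that enough rows exist. The real doctor may check more.
    """
    problems = []
    valid = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        try:
            row = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {number}: invalid JSON ({err.msg})")
            continue
        if not isinstance(row, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        valid += 1
    if valid < min_items:
        problems.append(f"only {valid} valid items, need at least {min_items}")
    return valid, problems


# Made-up rows: one valid object, one broken line, one blank, one non-object.
sample = ['{"id": "a"}', "not json", "", "[1, 2]"]
valid, problems = check_items(sample, min_items=2)
for problem in problems:
    print(problem)
```

To try it on a real suite, pass an open file handle for items.jsonl in place of `sample`; a file object iterates line by line, so the function needs no changes.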
Use an offline dry run first:
memory eval run \
--suite evals/examples/memory-smoke \
--condition full-memory \
--profile offline \
--dry-run \
--text
This validates the suite and scoring path without spending provider tokens or executing shell tasks. A passing dry run means the harness shape is valid. It does not prove that Memory improves model behavior.
You can also try the build-simulation smoke suite. It uses a fake deterministic agent, so it proves the fixture-copying, agent-command, and scoring path without model cost:
memory eval run \
--suite evals/examples/app-build-smoke \
--condition no-memory \
--condition full-memory \
--profile offline \
--allow-shell \
--text
When you want the full token-spending version, run the Codex-backed suite:
MEMORY_EVAL_CODEX_MODEL=gpt-5.4-mini \
memory eval run \
--suite evals/suites/app-build-codex-v1 \
--condition no-memory \
--condition full-memory \
--profile llm \
--repeat 1 \
--allow-shell \
--text
That suite runs real codex exec agents against static app fixtures. The
full-memory condition receives required Memory questions and must run the
generated ./.memory-eval/query-memory helper for each question. The harness
verifies the helper's raw query JSON before accepting the item. The no-memory
condition is forbidden from using Memory and fails if Memory evidence artifacts
appear.
For a longer software-building test, use the Dockerized sequence suite:
docker compose -f evals/docker/app-build-sequence/compose.yml run --rm eval
This starts PostgreSQL with pgvector, starts the Memory service, seeds
deterministic project memories, and runs the 20-step Codex app-build sequence
under no-memory and full-memory. Each step keeps the previous workspace
state, so the run tests continuity across a realistic product build rather than
isolated prompt answers. Use docker compose -f evals/docker/app-build-sequence/compose.yml down -v
before a clean rerun if you want to reset the database volume.
When you want the strongest checked-in benchmark for whether Memory improves agent behavior, use the Memory improvement suite:
MEMORY_EVAL_REPEAT=5 \
docker compose -f evals/docker/memory-improvement/compose.yml run --rm eval
It combines retrieval, grounded answers, get-up-to-speed briefings, and a 20-step coding-continuity task. Each item is tagged as deductive, inductive, or abductive, so you can see what kind of reasoning Memory helped. The suite also uses hidden seeded memories, which means the no-memory condition cannot pass by reading the fixture files.
For useful evidence, compare a baseline against a Memory-backed condition:
memory eval run \
--suite evals/suites/research-v1 \
--condition no-memory \
--condition full-memory \
--profile llm \
--repeat 5 \
--allow-shell \
--text
The important conditions are:
- no-memory: no retrieval channel; answer and resume items use the configured LLM directly
- lexical: only lexical retrieval
- semantic: only semantic retrieval
- graph: only graph retrieval
- full-memory: lexical, semantic, graph, and relation boosts together
Use --repeat for provider-backed runs. Repeats make flaky LLM behavior visible
instead of hiding it behind one lucky or unlucky result.
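To see why repeats matter, it helps to look at per-item pass rates across repeats. The sketch below is illustrative only: the `(item_id, passed)` tuples and the item ids are made up, and the function names are hypothetical rather than part of the memory CLI.

```python
from collections import defaultdict


def pass_rates(results):
    """Map item id -> fraction of repeats that passed.

    `results` is a list of (item_id, passed) pairs, one pair per repeat.
    """
    totals = defaultdict(lambda: [0, 0])  # item id -> [passes, runs]
    for item_id, passed in results:
        totals[item_id][0] += int(passed)
        totals[item_id][1] += 1
    return {item_id: passes / runs for item_id, (passes, runs) in totals.items()}


def flaky_items(results):
    """Item ids that neither always pass nor always fail across repeats."""
    return sorted(i for i, rate in pass_rates(results).items() if 0 < rate < 1)


# Made-up outcomes for two items over five repeats each.
results = [("q1", True)] * 5 + [
    ("q2", True), ("q2", False), ("q2", True), ("q2", True), ("q2", False),
]
rates = pass_rates(results)
flaky = flaky_items(results)
print(rates, flaky)
```

With a single run, q2 would look like a clean pass or a clean fail depending on luck. Five repeats show that it passes only some of the time.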
For software-building proof, use agent_build_task. It gives both conditions
the same starter project, same prompt, same model, same timeout, and same
deterministic checker. The no-memory run is told not to use Memory and has
common Memory environment variables cleared; the full-memory run is told to use
Memory where useful. This makes the result easier to explain than a pure Q&A
test: Memory is valuable if the agent ships more of the requested app, passes
more checks, or needs fewer interventions under the same budget.
Use agent_build_sequence when the claim is about long-running development.
The sequence runner preserves one workspace across ordered steps, verifies
Memory helper calls step by step, and aggregates Codex token usage from
codex-events.jsonl. That lets you inspect whether Memory changed quality,
continuity, latency, and token cost across the whole build.
Run artifacts are written under target/memory-evals/. Keep the generated JSON
files for release notes, research notes, or regression tracking.
After the paired run, compare the baseline artifact with the candidate artifact:
memory eval compare \
--baseline target/memory-evals/no-memory.json \
--candidate target/memory-evals/full-memory.json \
--out target/memory-evals/comparison.json \
--text
Use the actual artifact paths printed by your run if they differ from the example above.
The comparison is paired by item id. That means each item is compared against itself under both conditions, which is much stronger than comparing unrelated aggregate scores.
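Pairing by item id can be sketched as follows. The snippet assumes pass/fail outcomes have already been loaded into plain dictionaries keyed by item id; the real artifact schema is not shown in this guide, so the dictionaries, the ids, and the names `pair_outcomes`, `fixed`, and `regressed` are hypothetical.

```python
def pair_outcomes(baseline, candidate):
    """Pair pass/fail outcomes by item id and count the four possible cases.

    Both arguments map item id -> bool (passed). Items missing from either
    side are skipped, because they cannot be paired.
    """
    counts = {"both_pass": 0, "both_fail": 0, "fixed": 0, "regressed": 0}
    for item_id in sorted(baseline.keys() & candidate.keys()):
        before, after = baseline[item_id], candidate[item_id]
        if before and after:
            counts["both_pass"] += 1
        elif not before and not after:
            counts["both_fail"] += 1
        elif after:
            counts["fixed"] += 1  # failed without Memory, passed with it
        else:
            counts["regressed"] += 1  # passed without Memory, failed with it
    return counts


# Made-up outcomes; "e" and "f" appear on one side only and are not paired.
baseline = {"a": True, "b": False, "c": False, "d": True, "e": False}
candidate = {"a": True, "b": True, "c": True, "d": False, "f": True}
counts = pair_outcomes(baseline, candidate)
print(counts)
```

Only the `fixed` and `regressed` counts carry information about the difference between conditions. Items that pass or fail under both conditions say nothing about whether Memory helped.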
For repeated runs, compare globs instead of one file at a time:
memory eval compare \
--baseline 'target/memory-evals/*no-memory*.json' \
--candidate 'target/memory-evals/*full-memory*.json' \
--out target/memory-evals/comparison.json \
--text
Create a readable report:
memory eval report \
--comparison target/memory-evals/comparison.json \
--markdown \
--out target/memory-evals/report.md
Begin with these fields:
- success-rate delta: whether the candidate condition passed more items
- McNemar p-value: whether pass/fail changes look meaningful for paired items
- confidence interval: uncertainty around numeric metric deltas
- recall metrics: whether expected memories were retrieved
- tag/file recall: whether retrieval found the intended tags and source files
- forbidden hits: whether answers included claims they should avoid
- token delta: extra or saved provider tokens
- latency delta: extra or saved time
- grouped deltas: whether Memory helped retrieval, resume, coding continuity, and each reasoning mode separately
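For intuition about the McNemar p-value in the list above, here is a minimal sketch of the exact two-sided McNemar test. It uses only the discordant pairs: items that failed under the baseline and passed under the candidate (`fixed`), and the reverse (`regressed`). This guide does not say which McNemar variant the tool computes, so treat this as an illustration of the statistic rather than a reproduction of the tool's output; the function and argument names are hypothetical.

```python
from math import comb


def mcnemar_exact(fixed, regressed):
    """Exact two-sided McNemar p-value from discordant pair counts.

    Under the null hypothesis, each discordant pair is equally likely to be
    a fix or a regression, so the smaller count follows Binomial(n, 0.5).
    """
    n = fixed + regressed
    if n == 0:
        return 1.0  # no item changed outcome, so there is no evidence
    k = min(fixed, regressed)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(mcnemar_exact(8, 1))  # 0.0390625
print(mcnemar_exact(3, 3))  # 1.0
```

Eight fixes against one regression gives a p-value just under 0.04, while three against three gives 1.0, which is no evidence of a difference. Small suites rarely produce enough discordant pairs to reach a low p-value.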
A good result is not just "full-memory won once". Prefer a result where the candidate improves quality, the confidence interval is not obviously weak, and the token/latency cost is acceptable for the use case.
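One way to judge whether a confidence interval is weak is to compute one yourself from per-item deltas. The sketch below uses a percentile bootstrap over paired deltas (candidate metric minus baseline metric for each item). The bootstrap is a technique chosen here for illustration; this guide does not state how the tool derives its intervals, and the delta values and function name are made up.

```python
import random


def bootstrap_ci(deltas, level=0.95, resamples=2000, seed=0):
    """Percentile bootstrap interval for the mean of per-item deltas."""
    if not deltas:
        raise ValueError("need at least one paired delta")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(deltas)
    # Resample items with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(resamples))
    lower_index = int((1 - level) / 2 * resamples)
    upper_index = int((1 + level) / 2 * resamples) - 1
    return means[lower_index], means[upper_index]


# Made-up per-item recall deltas (full-memory minus no-memory).
deltas = [0.1, 0.3, 0.2, 0.4, 0.0, 0.25]
low, high = bootstrap_ci(deltas)
print(f"95% interval for the mean delta: [{low:.3f}, {high:.3f}]")
```

An interval that sits clearly above zero supports an improvement claim. An interval that straddles zero, or is very wide, suggests the suite needs more items or more repeats.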
Use a gate policy when an evaluation becomes part of release discipline:
memory eval gate \
--comparison target/memory-evals/comparison.json \
--policy evals/gates/research-v1.toml \
--text
The gate encodes the minimum acceptable evidence for a release or experiment. If it fails, inspect the comparison instead of weakening the gate immediately.
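Conceptually, a gate is a set of thresholds applied to a comparison. The sketch below shows that idea in isolation. Every key name and threshold in it is a hypothetical stand-in: the real policy schema in evals/gates/research-v1.toml and the real comparison JSON are not documented in this guide.

```python
def check_gate(comparison, policy):
    """Return a list of failure messages; an empty list means the gate passes.

    All keys are hypothetical and only illustrate threshold-style checks.
    """
    failures = []
    if comparison["success_rate_delta"] < policy["min_success_rate_delta"]:
        failures.append("success-rate delta below minimum")
    if comparison["p_value"] > policy["max_p_value"]:
        failures.append("p-value above maximum")
    if comparison["token_delta"] > policy["max_token_delta"]:
        failures.append("token delta above budget")
    return failures


# Made-up numbers: quality improves, but the token cost exceeds the budget.
comparison = {"success_rate_delta": 0.12, "p_value": 0.04, "token_delta": 1800}
policy = {"min_success_rate_delta": 0.05, "max_p_value": 0.05, "max_token_delta": 1500}
failures = check_gate(comparison, policy)
print(failures or "gate passed")
```

Returning every failed check, rather than stopping at the first, makes it easier to inspect the comparison before deciding whether the policy or the candidate needs to change.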
Scaffold a starter suite from recent project memories:
memory eval scaffold --project memory --out evals/suites/my-first-suite --text
Review every generated item before trusting it:
- expected memory ids should point to memories that really answer the question
- required assertions should be specific and observable
- forbidden assertions should catch plausible wrong answers
- command tasks should be deterministic and safe
Keep label_status = "draft" while tuning labels. Only mark a suite reviewed
when the labels have been manually checked and the suite is large enough for the
claim you want to make.
Common mistakes to avoid:
- Treating the smoke suite as proof. It is only a workflow example.
- Running only full-memory. You need a baseline such as no-memory to measure improvement.
- Using --profile offline as evidence. Offline mode is for CI-safe validation, not provider-backed behavior.
- Ignoring token and latency deltas. Better answers may still be too expensive for a workflow.
- Marking labels reviewed too early. Bad labels produce bad evidence.
- Changing retrieval or prompts while also changing the suite. Keep experiments controlled so the result explains one change at a time.
For internal development, a small paired suite can catch regressions and show
direction. For external claims, use a held-out reviewed suite with enough items
to support statistics. The checked-in research-v1 suite is a reviewed-seed
location for repeatable work, but it still needs to grow before it can support
publication-grade claims.