All AtomicMemory quality scores use the BEAM (Benchmark for Episodic AI Memory) protocol on the LoCoMo-10 dataset (100K token context window slice). Dataset licensed CC BY-NC 4.0 from snap-research/LoCoMo.
Six abilities scored: temporal reasoning (TR), event ordering (EO), contradiction resolution (CR), knowledge update (KU), preference following (PF), multi-session reasoning (MS). Composite = unweighted mean.
These are our primary published numbers. All systems use the same backbone LLM, same judge, same dataset slice, and same scoring protocol — this is an apples-to-apples comparison.
| System | Composite | Configuration |
|---|---|---|
| AtomicMemory (research config) | 0.572 | rerank + TBC + contradiction-safe versioning |
| Mem0 (algorithm, local) | 0.550 | mem0ai open-source, local Haiku 4.5 backend |
| Hindsight (HTTP local) | 0.541 | self-hosted, Haiku 4.5 backend |
| Truncation baseline | 0.350 | no memory system, context window only |
Backbone: Claude Haiku 4.5 · Judge: Claude Haiku 4.5 · Dataset: LoCoMo-10 BEAM-100K
Sampling: n=180 questions drawn from LoCoMo-10 with fixed seed BEAM-S2-042 across a stratified 4-conversation slice. All systems evaluated on the same slice. Full-dataset (400q) numbers are in the evaluation run history below.
Note on shipped vs. research configurations: The 0.572 number above comes from an AtomicMemory research evaluation configuration. The shipped AtomicMemory v1.0 stack in this repo measures 0.411 under a stricter judge. The gap reflects two compounding factors: judge strictness, and configuration changes between the research run (which used "contradiction-safe versioning") and the shipped v1.0 stack (which adds "timeline + observed_at" for harder contradiction handling). The individual contributions have not been isolated. See "Why Scores Differ" below for the judge-variance data.
Judge choice accounts for ≥0.10 composite variance. The same AtomicMemory stack scores differently depending on which model acts as judge:
| Judge | AM Composite | Notes |
|---|---|---|
| Haiku 4.5 | 0.411 | AtomicMemory v1.0 (strict judge, shipped) |
| Sonnet 4.6 | ~0.313 | Stricter rubric application |
| Gemini 2.5 Pro | TBD | planned (Gemini-judge baseline) |
Published competitor numbers use different judges: Hindsight (0.734) uses a Gemini judge; Mem0 (0.641) uses a GPT-5 judge. Direct cross-vendor comparisons require controlling for judge — our head-to-head (0.572) does this.
| Date | Configuration | Composite | Judge | Notes |
|---|---|---|---|---|
| 2026-05-07 | Research config — multirun (n=180) | 0.572 | Haiku 4.5 (loose) | Primary headline |
| 2026-05-09 | Research config — Opus engine | 0.650 | Haiku 4.5 (loose) | Best single-engine result (research only) |
| 2026-05-11 | AtomicMemory v1.0 (shipped) | 0.411 | Haiku 4.5 (strict) | Shipped stack: rerank+TBC+timeline+observed_at |
| 2026-05-11 | AtomicMemory v1.0 + candidate profile | 0.421 | Haiku 4.5 (strict) | Under validation |
Measured on BEAM-100K with Haiku 4.5 backbone:
- $0.005/query (ingest + retrieve combined)
- <1s p95 retrieve latency
- Compare: Hindsight ~$0.075/query (~15× more expensive)
The published 0.572 head-to-head was measured against a frozen evaluation config — judge model, embedding provider, conversation slice (4 of 10 from LoCoMo-10), and seed BEAM-S2-042. Reproducing the exact published scores requires both the frozen config artifact and a pinned commit of atomicmemory-core.
git clone https://github.com/atomicstrata/atomicmemory-benchmarks
cd atomicmemory-benchmarks
pnpm install
cp atomicbench/.env.example atomicbench/.env.local
# Fill in OPENAI_API_KEY and ATOMICMEMORY_API_URL=http://localhost:3050
pnpm --filter atomicbench smokeRequires a running AtomicMemory server (see atomicmemory-core).
The existing pnpm --filter atomicbench locomo10 target will not reproduce the published 0.572 number out of the box — any drift in defaults (judge, embedding model, slice IDs, seed) produces a different absolute score. The runner is intentionally gated: it requires a populated configs/canonical-v{N}.json committed at a canonical-config-v* git tag, which pins the exact evaluation configuration.
A dedicated repro target with the frozen config (slice IDs, seed BEAM-S2-042, judge model pin, embedding pin) is published alongside this release in atomicmemory-benchmarks. Until that target ships, treat absolute composite numbers from locomo10 as indicative-not-authoritative — use smoke only to verify local wiring.
- Gemini-judge baseline: re-score AtomicMemory under Gemini judge for direct Hindsight comparison
- BEAM-1M / BEAM-10M: longer-context evaluation planned
- Contradiction-resolution: CR (0.17) is our weakest ability — active area of improvement