## Goal
Build an open, reproducible benchmark suite that measures memory retrieval quality for the Basic Memory plugin against OpenClaw's builtin memory search and the QMD backend.
This is not about marketing claims. This is about using real evals to systematically improve Basic Memory's retrieval quality over time. Every PR can run benchmarks and show whether changes improve or regress recall accuracy.
## Why
- No existing memory benchmark uses realistic agent memory workloads
- Supermemory and Mem0 publish self-serving benchmarks with no reproducible methodology
- We want to build in the open — publish methodology, corpus, and results
- Evals are the feedback loop: benchmark → identify weakness → fix → benchmark again
## What we're measuring
### Retrieval Quality (primary)
- Recall@K: Does the correct memory appear in top K results?
- Precision@K: Of top K results, how many are relevant?
- MRR (Mean Reciprocal Rank): Where does the first correct answer appear?
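A minimal sketch of how these three metrics could be computed in the harness, assuming each query carries a set of ground-truth memory IDs; the `relevantIds` name and helper signatures below are illustrative, not an existing API:

```typescript
// Illustrative metric helpers for the eval harness (names are assumptions).
// `retrievedIds` is the ranked list of memory IDs a provider returned;
// `relevantIds` is the ground-truth set annotated in benchmark/queries.json.

export function recallAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  const hits = retrievedIds.slice(0, k).filter((id) => relevantIds.has(id)).length;
  return relevantIds.size === 0 ? 0 : hits / relevantIds.size;
}

export function precisionAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  const topK = retrievedIds.slice(0, k);
  const hits = topK.filter((id) => relevantIds.has(id)).length;
  return topK.length === 0 ? 0 : hits / topK.length;
}

// Reciprocal rank of the first relevant result; MRR is the mean over all queries.
export function reciprocalRank(retrievedIds: string[], relevantIds: Set<string>): number {
  const rank = retrievedIds.findIndex((id) => relevantIds.has(id));
  return rank === -1 ? 0 : 1 / (rank + 1);
}
```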
### Context Efficiency (our differentiator)
- Signal-to-noise ratio: Of tokens returned, what % is useful for answering?
- BM returns structured observations/relations; builtin returns raw text chunks
- Same token budget should yield more signal with BM
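One way to make signal-to-noise concrete, assuming the corpus annotates which spans of a provider's response actually help answer the query; the whitespace token count here is a stand-in for whatever tokenizer the harness adopts:

```typescript
// Naive signal-to-noise estimate: the share of returned tokens that fall inside
// spans judged useful for answering the query. Whitespace splitting is only a
// placeholder for a real tokenizer.

function tokenCount(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

export function signalToNoise(returned: string, usefulSpans: string[]): number {
  const total = tokenCount(returned);
  const useful = usefulSpans.reduce((sum, span) => sum + tokenCount(span), 0);
  return total === 0 ? 0 : Math.min(1, useful / total);
}
```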
### Query Categories
| Category | Example | What it tests |
|---|---|---|
| Exact fact | "What is the beta pricing?" | Keyword precision |
| Semantic | "How do we compare to competitors?" | Vector similarity |
| Temporal | "What happened on Feb 14?" | Date-aware retrieval |
| Relational | "What's connected to the plugin?" | Graph traversal (BM advantage) |
| Cross-note | "Summarize marketing decisions" | Multi-doc recall |
| Needle-in-haystack | "What's the project ID?" | Exact token retrieval |
| Task recall | "What are our active tasks?" | Composite search |
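A possible shape for an entry in `benchmark/queries.json` covering these categories; the field names and example values are a sketch, not a settled schema:

```typescript
// Sketch of a ground-truth annotation for one benchmark query.
// Field names are illustrative; the real schema lives in benchmark/queries.json.

type QueryCategory =
  | "exact-fact"
  | "semantic"
  | "temporal"
  | "relational"
  | "cross-note"
  | "needle"
  | "task-recall";

interface BenchmarkQuery {
  id: string;
  category: QueryCategory;
  question: string;
  relevantIds: string[];   // memory/note IDs that count as correct
  answerSpans?: string[];  // text spans used for signal-to-noise scoring
}

const example: BenchmarkQuery = {
  id: "q-014",
  category: "temporal",
  question: "What happened on Feb 14?",
  relevantIds: ["notes/2025-02-14-standup.md"],
  answerSpans: ["Placeholder answer text from the corpus note."],
};
```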
## Providers to compare
- Basic Memory (this plugin) — `bm search` via CLI
- OpenClaw builtin (`memory-core`) — SQLite + vector hybrid search
- QMD (experimental) — BM25 + vectors + reranking sidecar
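To keep the harness provider-agnostic, each backend could sit behind a tiny common interface. The sketch below wraps the Basic Memory CLI; the exact `bm search` arguments and output format are assumptions that need checking against the real tool:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Every backend is wrapped behind the same shape so run.ts can score them uniformly.
export interface MemoryProvider {
  name: string;
  search(query: string, limit: number): Promise<string[]>; // ranked memory/note IDs
}

// Basic Memory via its CLI. Treating the output as one result ID per line is an
// assumption for illustration only; confirm the real flags and output format.
export const basicMemoryProvider: MemoryProvider = {
  name: "basic-memory",
  async search(query, limit) {
    const { stdout } = await run("bm", ["search", query]);
    return stdout.split("\n").filter(Boolean).slice(0, limit);
  },
};
```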
## Implementation
- `benchmark/` directory in this repo
- `benchmark/corpus/` — realistic anonymized agent memory files
- `benchmark/queries.json` — questions with ground truth annotations
- `benchmark/run.ts` — eval harness
- `benchmark/results/` — output comparisons
- `just benchmark` to run everything
- Real providers, no mocks
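A rough outline of the eval loop in `benchmark/run.ts`, reusing the metric helpers and provider interface sketched above and reporting per-category averages rather than one aggregate number; file paths and names are placeholders:

```typescript
// Sketch of the eval loop: run every query against every provider and
// average Recall@K and MRR per category. Import paths point at the
// sketches above and are placeholders, not real modules yet.
import { readFile, writeFile } from "node:fs/promises";
import { recallAtK, reciprocalRank } from "./metrics";
import { basicMemoryProvider, type MemoryProvider } from "./providers";
import type { BenchmarkQuery } from "./queries";

async function main() {
  const queries: BenchmarkQuery[] = JSON.parse(await readFile("benchmark/queries.json", "utf8"));
  const providers: MemoryProvider[] = [basicMemoryProvider /* builtin and QMD wrappers go here too */];
  const k = 5;

  // results[provider][category] -> running sums, averaged at the end
  const results: Record<string, Record<string, { recall: number; mrr: number; n: number }>> = {};

  for (const provider of providers) {
    for (const q of queries) {
      const retrieved = await provider.search(q.question, k);
      const relevant = new Set(q.relevantIds);
      const bucket = ((results[provider.name] ??= {})[q.category] ??= { recall: 0, mrr: 0, n: 0 });
      bucket.recall += recallAtK(retrieved, relevant, k);
      bucket.mrr += reciprocalRank(retrieved, relevant);
      bucket.n += 1;
    }
  }

  // Average per category so a regression in one query type stays visible.
  for (const byCategory of Object.values(results)) {
    for (const bucket of Object.values(byCategory)) {
      bucket.recall /= bucket.n;
      bucket.mrr /= bucket.n;
    }
  }

  await writeFile("benchmark/results/latest.json", JSON.stringify(results, null, 2));
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```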
## Success criteria
- Reproducible: `just benchmark` runs end-to-end
- Category breakdown (not one aggregate number)
- Shows failures honestly
- Can run in CI on every PR