Benchmark framework: evaluate memory retrieval quality #10

Goal

Build an open, reproducible benchmark suite that measures memory retrieval quality for the Basic Memory plugin against OpenClaw's builtin memory search and the QMD backend.

This is not about marketing claims. This is about using real evals to systematically improve Basic Memory's retrieval quality over time. Every PR can run benchmarks and show whether changes improve or regress recall accuracy.

Why

  • No existing memory benchmark uses realistic agent memory workloads
  • Supermemory and Mem0 publish self-serving benchmarks with no reproducible methodology
  • We want to build in the open — publish methodology, corpus, and results
  • Evals are the feedback loop: benchmark → identify weakness → fix → benchmark again

What we're measuring

Retrieval Quality (primary)

  • Recall@K: Does the correct memory appear in top K results?
  • Precision@K: Of top K results, how many are relevant?
  • MRR (Mean Reciprocal Rank): Where does the first correct answer appear?
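
A minimal sketch of how these per-query metrics could be computed, assuming each provider returns a ranked list of memory IDs and the ground truth is a set of relevant IDs (the names and shapes here are illustrative, not the final harness API):

```typescript
// Illustrative metric helpers: `retrieved` is a ranked list of memory IDs,
// `relevant` is the ground-truth set for one query.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const topK = retrieved.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  return topK.length === 0 ? 0 : hits / topK.length;
}

// Reciprocal rank for a single query; MRR is the mean of this over all queries.
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const idx = retrieved.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```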

Context Efficiency (our differentiator)

  • Signal-to-noise ratio: Of tokens returned, what % is useful for answering?
  • BM returns structured observations/relations; builtin returns raw text chunks
  • Same token budget should yield more signal with BM
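
One way to approximate signal-to-noise, assuming per-result relevance labels come from the ground-truth annotations and using a naive token count (both are assumptions, not a settled design):

```typescript
// Rough signal-to-noise sketch: share of returned tokens that come from
// results judged relevant for the query. Token counting here is a naive
// whitespace split; a real harness would use the model's tokenizer.
interface RetrievedChunk {
  text: string;
  relevant: boolean; // judged against ground-truth annotations
}

function countTokens(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

function signalToNoise(chunks: RetrievedChunk[]): number {
  const total = chunks.reduce((sum, c) => sum + countTokens(c.text), 0);
  const signal = chunks
    .filter((c) => c.relevant)
    .reduce((sum, c) => sum + countTokens(c.text), 0);
  return total === 0 ? 0 : signal / total;
}
```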

Query Categories

| Category | Example | What it tests |
| --- | --- | --- |
| Exact fact | "What is the beta pricing?" | Keyword precision |
| Semantic | "How do we compare to competitors?" | Vector similarity |
| Temporal | "What happened on Feb 14?" | Date-aware retrieval |
| Relational | "What's connected to the plugin?" | Graph traversal (BM advantage) |
| Cross-note | "Summarize marketing decisions" | Multi-doc recall |
| Needle-in-haystack | "What's the project ID?" | Exact token retrieval |
| Task recall | "What are our active tasks?" | Composite search |
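
A possible shape for entries in benchmark/queries.json, with the category and ground-truth annotations attached to each question (field names are a proposal, not settled):

```typescript
// Proposed shape for a queries.json entry; field names are illustrative.
interface BenchmarkQuery {
  id: string;
  category:
    | "exact-fact"
    | "semantic"
    | "temporal"
    | "relational"
    | "cross-note"
    | "needle-in-haystack"
    | "task-recall";
  question: string;
  // IDs/paths of corpus memories that count as correct answers.
  groundTruth: string[];
  // Optional exact string the retrieved text must contain (needle-in-haystack).
  expectedAnswer?: string;
}

const example: BenchmarkQuery = {
  id: "q-042",
  category: "exact-fact",
  question: "What is the beta pricing?",
  groundTruth: ["corpus/marketing/pricing-notes.md"],
};
```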

Providers to compare

  1. Basic Memory (this plugin) — bm search via CLI
  2. OpenClaw builtin (memory-core) — SQLite + vector hybrid search
  3. QMD (experimental) — BM25 + vectors + reranking sidecar
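
To keep the comparison fair, all three backends could sit behind one small adapter interface so the harness treats them identically. This is a sketch of that abstraction, not the actual plugin or memory-core API:

```typescript
// Hypothetical provider abstraction so the harness runs every backend the same way.
interface SearchResult {
  id: string;      // stable identifier of the memory/note returned
  text: string;    // content that would be handed to the model
  score?: number;  // provider-native relevance score, if exposed
}

interface MemoryProvider {
  name: "basic-memory" | "openclaw-builtin" | "qmd";
  // Runs a query with a result cap; each adapter shells out to the real
  // backend (e.g. the bm CLI) rather than mocking it.
  search(query: string, topK: number): Promise<SearchResult[]>;
}
```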

Implementation

  • benchmark/ directory in this repo
  • benchmark/corpus/ — realistic anonymized agent memory files
  • benchmark/queries.json — questions with ground truth annotations
  • benchmark/run.ts — eval harness
  • benchmark/results/ — output comparisons
  • just benchmark to run everything
  • Real providers, no mocks
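
A rough outline of what the eval loop in benchmark/run.ts could look like, reusing the provider and metric sketches above (all names are assumptions):

```typescript
// Rough harness loop: run every query against every provider and
// aggregate metrics per category. Builds on the sketches above.
async function runBenchmark(
  providers: MemoryProvider[],
  queries: BenchmarkQuery[],
  k = 5,
) {
  const results: Record<string, Record<string, number[]>> = {};

  for (const provider of providers) {
    results[provider.name] = {};
    for (const query of queries) {
      const retrieved = await provider.search(query.question, k);
      const ids = retrieved.map((r) => r.id);
      const relevant = new Set(query.groundTruth);

      const bucket = (results[provider.name][query.category] ??= []);
      bucket.push(recallAtK(ids, relevant, k));
    }
  }

  // Emit a per-category breakdown rather than one aggregate number.
  for (const [name, categories] of Object.entries(results)) {
    for (const [category, scores] of Object.entries(categories)) {
      const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
      console.log(`${name} ${category} recall@${k}=${mean.toFixed(2)}`);
    }
  }
}
```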

Success criteria

  • Reproducible: just benchmark runs end-to-end
  • Category breakdown (not one aggregate number)
  • Shows failures honestly
  • Can run in CI on every PR
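
For the CI requirement, one option is to compare the fresh per-category scores against a checked-in baseline and fail the job on regressions beyond a tolerance. The baseline path and threshold below are assumptions, not decided values:

```typescript
// Hypothetical CI gate: fail the run if any category regresses past a
// tolerance relative to a checked-in baseline file.
import { readFileSync } from "node:fs";

function checkRegressions(
  current: Record<string, number>,   // category -> recall@K from this run
  baselinePath = "benchmark/results/baseline.json", // assumed location
  tolerance = 0.02,
): void {
  const baseline: Record<string, number> = JSON.parse(
    readFileSync(baselinePath, "utf8"),
  );
  const regressions = Object.entries(baseline).filter(
    ([category, score]) => (current[category] ?? 0) < score - tolerance,
  );
  if (regressions.length > 0) {
    console.error("Regressed categories:", regressions.map(([c]) => c).join(", "));
    process.exit(1); // non-zero exit makes the PR check fail
  }
}
```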
