Benchmark framework: evaluate memory retrieval quality #10

Goal

Build an open, reproducible benchmark suite that measures memory retrieval quality for the Basic Memory plugin against OpenClaw's builtin memory search and the QMD backend.

This is not about marketing claims. This is about using real evals to systematically improve Basic Memory's retrieval quality over time. Every PR can run benchmarks and show whether changes improve or regress recall accuracy.

Why

  • No existing memory benchmark uses realistic agent memory workloads
  • Supermemory and Mem0 publish self-serving benchmarks with no reproducible methodology
  • We want to build in the open — publish methodology, corpus, and results
  • Evals are the feedback loop: benchmark → identify weakness → fix → benchmark again

What we're measuring

Retrieval Quality (primary)

  • Recall@K: Does the correct memory appear in top K results?
  • Precision@K: Of top K results, how many are relevant?
  • MRR (Mean Reciprocal Rank): Where does the first correct answer appear?
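
A minimal sketch of how these per-query metrics could be computed, assuming each provider returns a ranked list of memory IDs and the ground truth is a set of relevant IDs (the names and shapes here are illustrative, not the final harness API):

```typescript
// Illustrative metric helpers: `retrieved` is a ranked list of memory IDs,
// `relevant` is the ground-truth set for one query.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const topK = retrieved.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  return topK.length === 0 ? 0 : hits / topK.length;
}

// Reciprocal rank for a single query; MRR is the mean of this over all queries.
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const idx = retrieved.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```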

Context Efficiency (our differentiator)

  • Signal-to-noise ratio: Of tokens returned, what % is useful for answering?
  • BM returns structured observations/relations; builtin returns raw text chunks
  • Same token budget should yield more signal with BM
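
One way to approximate signal-to-noise, assuming per-result relevance labels come from the ground-truth annotations and using a naive token count (both are assumptions, not a settled design):

```typescript
// Rough signal-to-noise sketch: share of returned tokens that come from
// results judged relevant for the query. Token counting here is a naive
// whitespace split; a real harness would use the model's tokenizer.
interface RetrievedChunk {
  text: string;
  relevant: boolean; // judged against ground-truth annotations
}

function countTokens(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

function signalToNoise(chunks: RetrievedChunk[]): number {
  const total = chunks.reduce((sum, c) => sum + countTokens(c.text), 0);
  const signal = chunks
    .filter((c) => c.relevant)
    .reduce((sum, c) => sum + countTokens(c.text), 0);
  return total === 0 ? 0 : signal / total;
}
```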

Query Categories

| Category | Example | What it tests |
| --- | --- | --- |
| Exact fact | "What is the beta pricing?" | Keyword precision |
| Semantic | "How do we compare to competitors?" | Vector similarity |
| Temporal | "What happened on Feb 14?" | Date-aware retrieval |
| Relational | "What's connected to the plugin?" | Graph traversal (BM advantage) |
| Cross-note | "Summarize marketing decisions" | Multi-doc recall |
| Needle-in-haystack | "What's the project ID?" | Exact token retrieval |
| Task recall | "What are our active tasks?" | Composite search |
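
A possible shape for entries in benchmark/queries.json, with the category and ground-truth annotations attached to each question (field names are a proposal, not settled):

```typescript
// Proposed shape for a queries.json entry; field names are illustrative.
interface BenchmarkQuery {
  id: string;
  category:
    | "exact-fact"
    | "semantic"
    | "temporal"
    | "relational"
    | "cross-note"
    | "needle-in-haystack"
    | "task-recall";
  question: string;
  // IDs/paths of corpus memories that count as correct answers.
  groundTruth: string[];
  // Optional exact string the retrieved text must contain (needle-in-haystack).
  expectedAnswer?: string;
}

const example: BenchmarkQuery = {
  id: "q-042",
  category: "exact-fact",
  question: "What is the beta pricing?",
  groundTruth: ["corpus/marketing/pricing-notes.md"],
};
```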

Providers to compare

  1. Basic Memory (this plugin) — bm search via CLI
  2. OpenClaw builtin (memory-core) — SQLite + vector hybrid search
  3. QMD (experimental) — BM25 + vectors + reranking sidecar
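
To keep the comparison fair, all three backends could sit behind one small adapter interface so the harness treats them identically. This is a sketch of that abstraction, not the actual plugin or memory-core API:

```typescript
// Hypothetical provider abstraction so the harness runs every backend the same way.
interface SearchResult {
  id: string;      // stable identifier of the memory/note returned
  text: string;    // content that would be handed to the model
  score?: number;  // provider-native relevance score, if exposed
}

interface MemoryProvider {
  name: "basic-memory" | "openclaw-builtin" | "qmd";
  // Runs a query with a result cap; each adapter shells out to the real
  // backend (e.g. the bm CLI) rather than mocking it.
  search(query: string, topK: number): Promise<SearchResult[]>;
}
```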

Implementation

  • benchmark/ directory in this repo
  • benchmark/corpus/ — realistic anonymized agent memory files
  • benchmark/queries.json — questions with ground truth annotations
  • benchmark/run.ts — eval harness
  • benchmark/results/ — output comparisons
  • just benchmark to run everything
  • Real providers, no mocks
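
A rough outline of what the eval loop in benchmark/run.ts could look like, reusing the provider and metric sketches above (all names are assumptions):

```typescript
// Rough harness loop: run every query against every provider and
// aggregate metrics per category. Builds on the sketches above.
async function runBenchmark(
  providers: MemoryProvider[],
  queries: BenchmarkQuery[],
  k = 5,
) {
  const results: Record<string, Record<string, number[]>> = {};

  for (const provider of providers) {
    results[provider.name] = {};
    for (const query of queries) {
      const retrieved = await provider.search(query.question, k);
      const ids = retrieved.map((r) => r.id);
      const relevant = new Set(query.groundTruth);

      const bucket = (results[provider.name][query.category] ??= []);
      bucket.push(recallAtK(ids, relevant, k));
    }
  }

  // Emit a per-category breakdown rather than one aggregate number.
  for (const [name, categories] of Object.entries(results)) {
    for (const [category, scores] of Object.entries(categories)) {
      const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
      console.log(`${name} ${category} recall@${k}=${mean.toFixed(2)}`);
    }
  }
}
```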

Success criteria

  • Reproducible: just benchmark runs end-to-end
  • Category breakdown (not one aggregate number)
  • Shows failures honestly
  • Can run in CI on every PR
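
For the CI requirement, one option is to compare the fresh per-category scores against a checked-in baseline and fail the job on regressions beyond a tolerance. The baseline path and threshold below are assumptions, not decided values:

```typescript
// Hypothetical CI gate: fail the run if any category regresses past a
// tolerance relative to a checked-in baseline file.
import { readFileSync } from "node:fs";

function checkRegressions(
  current: Record<string, number>,   // category -> recall@K from this run
  baselinePath = "benchmark/results/baseline.json", // assumed location
  tolerance = 0.02,
): void {
  const baseline: Record<string, number> = JSON.parse(
    readFileSync(baselinePath, "utf8"),
  );
  const regressions = Object.entries(baseline).filter(
    ([category, score]) => (current[category] ?? 0) < score - tolerance,
  );
  if (regressions.length > 0) {
    console.error("Regressed categories:", regressions.map(([c]) => c).join(", "));
    process.exit(1); // non-zero exit makes the PR check fail
  }
}
```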
