Benchmark: eval harness for Basic Memory search #12

@bm-clawd

Part of #10

Build the eval harness that runs queries against real Basic Memory and scores results.

Requirements

  • TypeScript, runs via bun
  • Uses real bm search CLI (not mocked)
  • Indexes the test corpus into a BM project
  • Runs all queries from queries.json
  • Scores: Recall@5, Recall@10, MRR, Precision@5
  • Groups results by query category
  • Outputs results as JSON + human-readable table
  • Adds a just benchmark recipe to the justfile
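
A minimal sketch of the scoring step, assuming each query carries a category, a ranked list of result ids, and a list of ground-truth ids. The `ScoredQuery` and `Metrics` shapes and all function names are hypothetical; this issue does not define the `queries.json` schema. Precision@5 divides by k = 5 even when fewer results come back, which is one common convention and may not be the one the harness ends up using.

```typescript
// Hypothetical shapes: the issue does not define the queries.json schema.
interface ScoredQuery {
  category: string;
  ranked: string[];   // result ids in the order search returned them
  relevant: string[]; // ground-truth ids for the query
}

interface Metrics {
  recallAt5: number;
  recallAt10: number;
  mrr: number;
  precisionAt5: number;
}

// Drop repeated ids so a duplicate hit is not counted twice.
function dedupe(ids: string[]): string[] {
  return Array.from(new Set(ids));
}

function hitsInTopK(ranked: string[], relevant: Set<string>, k: number): number {
  return dedupe(ranked).slice(0, k).filter((id) => relevant.has(id)).length;
}

// Fraction of the ground-truth ids that appear in the top k results.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return relevant.size === 0 ? 0 : hitsInTopK(ranked, relevant, k) / relevant.size;
}

// Fraction of the top k slots filled by ground-truth ids (assumption: divide by k).
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return k <= 0 ? 0 : hitsInTopK(ranked, relevant, k) / k;
}

// 1 / rank of the first relevant result, or 0 if none was returned.
function reciprocalRank(ranked: string[], relevant: Set<string>): number {
  const idx = dedupe(ranked).findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

function scoreQuery(q: ScoredQuery): Metrics {
  const relevant = new Set(q.relevant);
  return {
    recallAt5: recallAtK(q.ranked, relevant, 5),
    recallAt10: recallAtK(q.ranked, relevant, 10),
    mrr: reciprocalRank(q.ranked, relevant),
    precisionAt5: precisionAtK(q.ranked, relevant, 5),
  };
}

function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Average every metric over the queries in each category.
function summarizeByCategory(queries: ScoredQuery[]): Map<string, Metrics> {
  const buckets = new Map<string, Metrics[]>();
  for (const q of queries) {
    const list = buckets.get(q.category) ?? [];
    list.push(scoreQuery(q));
    buckets.set(q.category, list);
  }
  const summary = new Map<string, Metrics>();
  buckets.forEach((scores, category) => {
    summary.set(category, {
      recallAt5: mean(scores.map((s) => s.recallAt5)),
      recallAt10: mean(scores.map((s) => s.recallAt10)),
      mrr: mean(scores.map((s) => s.mrr)),
      precisionAt5: mean(scores.map((s) => s.precisionAt5)),
    });
  });
  return summary;
}
```

The per-category map can be serialized for the JSON output (for example via `Object.fromEntries(summary)`) and iterated to print the human-readable table.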

Implementation notes

  • Set up a temporary BM project for the corpus
  • Run bm search via CLI subprocess for each query
  • Parse JSON output, compare against ground truth
  • Also test bm context for relational queries (unique to BM)
  • Measure latency per query (secondary metric)
  • Clean up temp project after run
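
The run loop described above could be sketched as follows. This is an illustration under stated assumptions, not the harness itself: `runCommand`, `timedJsonQuery`, and `withTempProject` are hypothetical helpers, and the exact `bm` arguments for creating a project, removing it, and requesting JSON output are not given in this issue, so the caller supplies the argv and the setup and teardown callbacks. The subprocess call uses `spawnSync` from `node:child_process`, which bun also provides.

```typescript
import { spawnSync } from "node:child_process";

// A runner takes an argv array and returns the command's stdout.
type Runner = (argv: string[]) => string;

// Run a command synchronously; throw if it cannot start or exits non-zero.
const runCommand: Runner = (argv) => {
  if (argv.length === 0) throw new Error("empty argv");
  const res = spawnSync(argv[0], argv.slice(1), { encoding: "utf8" });
  if (res.error) throw res.error;
  if (res.status !== 0) {
    throw new Error(`${argv[0]} exited with ${res.status}: ${res.stderr}`);
  }
  return res.stdout;
};

interface TimedResult<T> {
  parsed: T;         // JSON-decoded stdout
  latencyMs: number; // wall-clock time of the subprocess call
}

// Time one CLI invocation and parse its stdout as JSON.
// The runner is injectable so the parsing logic can be exercised without the CLI.
function timedJsonQuery<T>(argv: string[], run: Runner = runCommand): TimedResult<T> {
  const start = performance.now();
  const stdout = run(argv);
  const latencyMs = performance.now() - start;
  return { parsed: JSON.parse(stdout) as T, latencyMs };
}

// Create a temporary project, run the body, and always clean up,
// even when the body throws partway through the benchmark.
function withTempProject<P, R>(
  setup: () => P,
  teardown: (project: P) => void,
  body: (project: P) => R,
): R {
  const project = setup();
  try {
    return body(project);
  } finally {
    teardown(project);
  }
}
```

Wrapping the whole run in `withTempProject` keeps the cleanup step from being skipped when a query fails. Latency measured this way includes process startup, which is worth keeping in mind when reading it as a secondary metric.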

Stretch

  • Compare composited memory_search (MEMORY.md grep + BM search + task scan) vs BM search alone

Labels: enhancement (New feature or request)