On-disk index cache for the Grid benchmark harness #612

Open
tlwillke wants to merge 3 commits into main from on-disk-index-cache

tlwillke (Collaborator) commented Feb 5, 2026

This PR adds a deterministic on-disk index cache for the Grid benchmark harness and wires it in end-to-end so repeated runs can reuse previously-built graph indexes to save time.

Key changes

  1. Introduced OnDiskGraphIndexCache (flat directory cache, one file per index) keyed by a stable signature derived from:
  • dataset base name
  • feature set (per-index; not dependent on the list of feature sets)
  • build params (M, efConstruction, neighborOverflow, addHierarchy, refineFinalGraph)
  • build compressor identity
  2. Cache filenames are derived from the signature (sanitized for filesystem safety). This allows multiple cached indexes to coexist in a single flat cache directory without collisions.
  3. Updated Grid.runOneGraph to treat caching per-index (per feature set):
  • load cached indexes when present
  • build only the missing ones and merge results
  • keep non-cached builds writing into the temp work directory using the original graphN naming
  4. Refactored buildOnDisk so it can write either:
  • the original graph0..graphN temp files (cache disabled), or
  • signature-named files in the cache directory (cache enabled), while preserving existing build behavior and minimizing churn.
  5. Updated Bench, BenchYAML dataset files, and HelloVectorWorld:
  • the index cache is disabled by default
  • the cache can be enabled when desired (useSavedIndexIfExists or enableIndexCache)
  6. Improved logging so it’s obvious when a cached index is used vs built from scratch, and added a cache-enabled startup message pointing to the cache directory and how to reclaim disk.
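
For reviewers who want a concrete picture of the signature-to-filename mapping, here is a minimal Java sketch. It is not the code in this PR: the class name IndexSignatureSketch, the "_" separator, the field prefixes, and the sanitization rule are all assumptions chosen for illustration, and the feature-set and compressor strings are just example values.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these names or formats are taken from the PR.
final class IndexSignatureSketch {

    /** Joins the build-defining inputs in a fixed order so equal inputs always give the same signature. */
    static String signature(String datasetBaseName,
                            List<String> featureSet,
                            int m,
                            int efConstruction,
                            float neighborOverflow,
                            boolean addHierarchy,
                            boolean refineFinalGraph,
                            String compressorId) {
        // Sort the feature names so the signature does not depend on listing order.
        String features = featureSet.stream().sorted().collect(Collectors.joining("+"));
        return String.join("_",
                datasetBaseName,
                features,
                "M" + m,
                "ef" + efConstruction,
                "of" + neighborOverflow,
                "hier" + addHierarchy,
                "refine" + refineFinalGraph,
                compressorId);
    }

    /** Replaces every character outside [A-Za-z0-9._+-] with '_' (assumed rule). */
    static String sanitize(String signature) {
        return signature.replaceAll("[^A-Za-z0-9._+-]", "_");
    }

    /** Resolves the cache file for a signature inside a flat cache directory. */
    static Path cacheFile(Path cacheDir, String signature) {
        return cacheDir.resolve(sanitize(signature));
    }
}
```

A replacement-only rule like this one can map two different signatures to the same filename (for example, "a(b" and "a)b"), so a real implementation would likely need something extra, such as appending a short hash of the raw signature.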

Behavior

  • Cache enabled: reuse cached indexes when signatures match; build only missing feature sets; artifacts persist in the cache directory.
  • Cache disabled: no cache reads/writes; build into temp work directory and clean up as before. NOTE: disabling the cache does not remove previously cached files; they persist until explicitly deleted.
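
The two modes above can be pictured as a per-index load-or-build loop. The following Java sketch is only an illustration under stated assumptions: IndexCacheFlowSketch, resolve, and fakeBuild are hypothetical names, the signatures are assumed to be already sanitized, and a placeholder text file stands in for a real graph index.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch of the per-index cache flow; the names are illustrative, not the PR's API.
final class IndexCacheFlowSketch {

    /** Signatures that were freshly built during the most recent resolve call. */
    final List<String> built = new ArrayList<>();

    /**
     * Returns one index file per signature.
     * Cache enabled: reuse files already in cacheDir and build only the missing ones there.
     * Cache disabled: ignore cacheDir and build everything into workDir as graph0..graphN.
     */
    Map<String, Path> resolve(List<String> signatures, Path cacheDir, Path workDir,
                              boolean cacheEnabled, BiConsumer<String, Path> builder) {
        built.clear();
        Map<String, Path> result = new LinkedHashMap<>();
        for (int i = 0; i < signatures.size(); i++) {
            String sig = signatures.get(i);
            Path target = cacheEnabled ? cacheDir.resolve(sig) : workDir.resolve("graph" + i);
            if (cacheEnabled && Files.exists(target)) {
                result.put(sig, target);   // cache hit: reuse the file, no build
                continue;
            }
            builder.accept(sig, target);   // cache miss or cache disabled: build
            built.add(sig);
            result.put(sig, target);
        }
        return result;
    }

    /** Stand-in builder that writes a placeholder file instead of a real graph index. */
    static void fakeBuild(String signature, Path target) {
        try {
            Files.writeString(target, "index for " + signature);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, a second run with the cache enabled builds only the signatures that have no file yet, and a run with the cache disabled never touches cacheDir, which is why files cached earlier stay on disk.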

Notes

  • Signatures are per-index (per feature set) and include all build-defining params to prevent accidental reuse across incompatible builds.
  • Filenames are sanitized to avoid filesystem issues from compressor IDs or other signature components.
  • No impact on graph construction or other benchmarking performance.
  • Cleaned up a few minor bugs in the existing code.
  • Does not address disk use metrics (disk space or file count); they currently report 0 for cached indexes (a TODO).
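
As a possible starting point for that TODO, the sketch below measures bytes and file count for a flat directory. It is not part of this PR: CacheDiskUsageSketch, Usage, and measure are hypothetical names, and how the harness would actually wire such numbers into its metrics is left open.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: one way disk metrics for cached index files might be gathered.
final class CacheDiskUsageSketch {

    /** Immutable holder for the two metrics mentioned in the note above. */
    static final class Usage {
        final long bytes;
        final long fileCount;

        Usage(long bytes, long fileCount) {
            this.bytes = bytes;
            this.fileCount = fileCount;
        }
    }

    /** Sums the sizes of the regular files directly inside dir (non-recursive, matching a flat cache). */
    static Usage measure(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            List<Path> files = entries.filter(Files::isRegularFile).collect(Collectors.toList());
            long bytes = 0;
            for (Path file : files) {
                bytes += Files.size(file);
            }
            return new Usage(bytes, files.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```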

github-actions bot (Contributor) commented Feb 5, 2026

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

tlwillke requested a review from ashkrisk on February 5, 2026 at 03:58