[eval][cli][doc] feat: add resumable eval cache with cache management commands #78

Merged
JoyboyBrian merged 23 commits into main from brian/eval on Feb 23, 2026

@JoyboyBrian JoyboyBrian commented Feb 23, 2026

What

  • Implement the feature scope proposed in #77.
  • Add a file-based eval cache system for osmosis eval, including deterministic task IDs/config hashes, dataset/module/eval-function fingerprinting, atomic JSON writes, lock-based concurrency protection, and corrupt-cache backup handling.
  • Introduce EvalOrchestrator to separate orchestration from CLI/runner concerns:
    • lock lifecycle management
    • resume from partial progress
    • --retry-failed behavior (re-queue failed runs only)
    • graceful interruption handling (SIGINT/SIGTERM)
    • periodic dataset integrity checks and periodic cache flush
    • optional sample message logging (.jsonl)
  • Expand osmosis eval CLI:
    • new flags: --fresh, --retry-failed, --log-samples, --output-path
    • new cache subcommands: osmosis eval cache dir|ls|rm with filtering and confirmation controls
    • structured output writing while preserving legacy --output compatibility
    • resume and dataset-fingerprint warning messages for better UX
  • Update eval execution plumbing:
    • add messages and row_index to EvalRunResult
    • add EvalRunner.run_batch() for orchestrator-driven batch execution
    • expose ExternalLLMClient.api_key / api_base via properties (remove private attribute access)
  • Migrate rubric cache path from ~/.cache/osmosis/eval_result to ~/.cache/osmosis/rubric with one-time safe migration logic.
  • Add required dependencies for caching (filelock, xxhash) and update docs (cli, eval-mode, configuration, troubleshooting) plus comprehensive unit/integration tests.
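The "atomic JSON writes" item above can be illustrated with a minimal sketch. This is not the PR's code: the helper name `atomic_write_json` is hypothetical, and the cross-process locking that the PR handles with `filelock` is left out so the example stays standard-library only. It shows the usual pattern of writing to a temporary file in the same directory and renaming it over the target.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, payload: dict) -> None:
    """Write payload to path so readers never observe a half-written file.

    Hypothetical helper for illustration; locking is intentionally omitted.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, ensure_ascii=False)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # replaces the target in a single step
    except BaseException:
        # Best-effort cleanup of the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

If the process is interrupted mid-write, the previous cache file is still intact, which is what makes resuming from it safe.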

Why

osmosis eval workloads are often long-running and expensive to restart from scratch after interruptions.
This PR implements resumable eval execution and cache lifecycle management proposed in issue #77, including resume, fresh rerun, failed-run retry, cache inspection/cleanup, and safer file handling.
Closes #77.

  • Long-running eval jobs need durable progress persistence and reliable resume after interruption.
  • Cache safety requires deterministic invalidation when code/data/config changes, plus lock protection against concurrent runs with the same config.
  • Users need first-class cache operations (inspect/remove/retry-failed/fresh restart) and better debugging artifacts (--log-samples, structured output).
  • Separating rubric cache from eval cache clarifies ownership and avoids path confusion from legacy cache naming.
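As a rough illustration of "deterministic invalidation when code/data/config changes", the sketch below derives a stable hash from a config mapping. It is an assumption-laden example: the function name `config_hash`, the config keys, and the 16-character truncation are hypothetical, and `hashlib.sha256` stands in for the `xxhash` dependency the PR adds.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Return a stable short digest for a JSON-serialisable config mapping.

    Hypothetical helper: sha256 is a stand-in for the hash used in the PR.
    """
    # Sorting keys and fixing separators makes the serialisation independent
    # of dict insertion order and whitespace.
    canonical = json.dumps(
        config, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# The same settings in a different key order map to the same cache entry,
# while any changed value yields a new one.
same = config_hash({"model": "m", "n_runs": 2}) == config_hash({"n_runs": 2, "model": "m"})
changed = config_hash({"model": "m", "n_runs": 2}) != config_hash({"model": "m", "n_runs": 3})
```

Including dataset and module fingerprints in the hashed mapping would extend the same idea to data and code changes.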

How to Test

  • Manual smoke test:
    • run osmosis eval ..., interrupt, then rerun same command to verify resume
    • run with --fresh to verify backup + restart behavior
    • run with --retry-failed to verify failed-only re-execution
    • verify osmosis eval cache ls/rm filters and deletion flow
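When checking the resume and `--retry-failed` steps above, it may help to have the expected selection logic written down. The sketch below is a simplified model, not the orchestrator's code: `CachedRun`, `plan_work`, `run_index`, and `success` are hypothetical names (only `row_index` appears in the PR description).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedRun:
    row_index: int   # dataset row the run belongs to
    run_index: int   # hypothetical: which repetition of that row
    success: bool    # hypothetical: whether the run completed without error


def plan_work(total_rows: int, runs_per_row: int,
              cached: list[CachedRun], retry_failed: bool):
    """Return (kept_runs, pending) for a resumed evaluation.

    Plain resume keeps every cached run; with retry_failed only successful
    runs are kept and the failed ones are queued again.
    """
    kept = [run for run in cached if run.success or not retry_failed]
    done = {(run.row_index, run.run_index) for run in kept}
    pending = [
        (row, rep)
        for row in range(total_rows)
        for rep in range(runs_per_row)
        if (row, rep) not in done
    ]
    return kept, pending
```

Under this model, a cache holding one successful and one failed run out of three rows resumes with one pending item, and with two pending items once `retry_failed` is set.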

Checklist

  • PR title follows [module] type: description format
  • Appropriate labels added (e.g. enhancement, bug, breaking)
  • ruff check . and ruff format --check . pass
  • pyright osmosis_ai/ passes
  • pytest passes (new tests added if applicable)
  • Public API changes are documented
  • No secrets or credentials included

Summary by cubic

Adds a durable, resumable cache for osmosis eval with CLI tools to inspect/remove cached runs, plus improved logging/error handling and a small fix to reduce local auth server port conflicts. Improves reliability for long evaluations and makes resuming, retrying, and debugging easier.

  • New Features

    • File-based eval cache with deterministic task IDs and fingerprints across code/data/config; dataset fingerprint checks with warnings on change.
    • Atomic JSON writes with file locks, auto-resume, corrupt-cache backup, and clearer logging/warnings for read/write/timeout issues.
    • EvalOrchestrator: lock lifecycle, resume, retry failed-only (keeps successful), SIGINT/SIGTERM handling, periodic integrity checks/flush, progress includes prior completed runs, optional sample logs.
    • CLI: --fresh, --retry-failed, --log-samples, --output-path; cache commands osmosis eval cache dir|ls|rm with model/dataset/status filters; structured output remains --output compatible.
    • Runner: EvalRunResult adds messages and row_index; new run_batch for batched execution; LLM client exposes api_key and api_base via properties.
    • Supports MCP module paths in fingerprinting; docs and troubleshooting updated; added filelock and xxhash; extensive unit/integration tests.
    • Local auth server: enable address reuse (SO_REUSEADDR) to reduce port conflicts.
  • Migration

    • Rubric cache moved from ~/.cache/osmosis/eval_result to ~/.cache/osmosis/rubric with one-time safe migration.
    • New env vars: OSMOSIS_CACHE_DIR to override cache root; OSMOSIS_EVAL_LOCK_TIMEOUT to control lock wait time.
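A minimal sketch of how the two environment variables above might be resolved is shown here. The function names and the 30-second default are assumptions made for the example. The `~/.cache/osmosis` root is taken from the rubric cache path quoted above, and falling back with a logged warning on an invalid timeout follows the commit notes.

```python
import logging
import os
from pathlib import Path

logger = logging.getLogger(__name__)

DEFAULT_CACHE_ROOT = Path.home() / ".cache" / "osmosis"
DEFAULT_LOCK_TIMEOUT = 30.0  # assumed default; the real value is not stated in this PR


def resolve_cache_root(env=None) -> Path:
    """Return the cache root, honouring OSMOSIS_CACHE_DIR when it is set."""
    env = os.environ if env is None else env
    override = env.get("OSMOSIS_CACHE_DIR")
    return Path(override).expanduser() if override else DEFAULT_CACHE_ROOT


def resolve_lock_timeout(env=None) -> float:
    """Return the lock wait time in seconds from OSMOSIS_EVAL_LOCK_TIMEOUT."""
    env = os.environ if env is None else env
    raw = env.get("OSMOSIS_EVAL_LOCK_TIMEOUT")
    if raw is None:
        return DEFAULT_LOCK_TIMEOUT
    try:
        return float(raw)
    except ValueError:
        # Fall back instead of crashing, but tell the user the value was ignored.
        logger.warning(
            "Invalid OSMOSIS_EVAL_LOCK_TIMEOUT=%r; using %.1fs",
            raw,
            DEFAULT_LOCK_TIMEOUT,
        )
        return DEFAULT_LOCK_TIMEOUT
```

Passing a plain dict as `env` keeps the helpers easy to exercise in tests without mutating the process environment.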

Written for commit af05cf5. Summary will update on new commits.

…roject.toml and uv.lock

This update introduces 'filelock' and 'xxhash' as dependencies in both the pyproject.toml and uv.lock files, ensuring compatibility with the latest features. Additionally, the version of 'virtualenv' has been updated to 20.38.0.
…pabilities and additional result attributes. Introduced 'messages' and 'row_index' to EvalRunResult, and implemented 'run_batch' method for concurrent execution of work items, improving efficiency in evaluation runs.
…nt execution and added 'messages' and 'row_index' attributes to EvalRunResult, improving evaluation efficiency.
…e cache management commands in CLI

This update introduces a migration function to transition the rubric cache from the old path to a new one, ensuring backward compatibility. Additionally, new subcommands for cache management have been added to the CLI, allowing users to print the cache root directory and manage cached results more effectively. The evaluation command has been updated to integrate these changes, improving overall cache handling and user experience.
…management commands and optimize cache migration process. This update refines the migration function for the rubric cache, enhancing reliability and user feedback during cache operations.
…ns in EvalOrchestrator, improve cache management in CLI, and refine error handling in cache operations. This update also optimizes the handling of cache data and improves the overall robustness of the evaluation process.
… prevent cache usage with modified datasets, improve cache data handling in EvalCommand, and refine orchestration results to include warnings for dataset changes. This update strengthens the integrity of evaluation results and enhances user feedback during the evaluation process.
…gging samples. This update improves user guidance by suggesting the use of the --fresh flag for fresh evaluations and the --log-samples flag for saving full conversation logs, enhancing the overall user experience during evaluations.
…t for MCP module paths in compute_module_fingerprint, update EvalCommand to handle MCP specifications, and ensure prior completed runs are included in progress callbacks within EvalOrchestrator. This improves the accuracy of module identification and user feedback during evaluations.
…ed runs to retain only successful ones when retrying failed evaluations. Enhance cache management by ensuring only successful runs are written back, improving the robustness of the evaluation process. Add tests to validate the retry behavior and ensure proper re-queuing of failed runs.
…ey and base URL in ExternalLLMClient for better encapsulation. Update JSON writing to ensure proper character encoding. Include dataset fingerprint warning in OrchestratorResult for improved user feedback on dataset changes. Refactor EvalRunner to utilize new properties for LLM client configuration, enhancing code clarity and maintainability.
… with logger warnings for cache write failures, invalid timeout values, and backup issues. This enhances logging consistency and improves user feedback during evaluation processes. Additionally, update the summary calculation to accurately reflect the number of runs.
… internal functions to public, update error handling to catch broader exceptions, and enhance logging for cache write failures. This change improves code readability and maintains functionality across the evaluation framework.
…ments with logger warnings for better error handling in rubric cache migration and directory checks. Update exception handling in cache evaluation to catch broader OSError, enhancing robustness and consistency in logging across the evaluation framework.
…de new caching features for `osmosis eval`, such as automatic result caching, resuming interrupted evaluations, and additional command options. Introduce environment variables for cache settings and improve error handling for cache-related issues. This update strengthens user guidance and clarifies the evaluation process.
…removing cached evaluations in `osmosis eval`. Update documentation to reflect these changes, including filtering options for model, dataset, and status. Improve error handling in cache operations and add tests for new CLI functionalities, ensuring robust cache management and user guidance.
@JoyboyBrian JoyboyBrian removed the request for review from BaiqingL February 23, 2026 19:55
@github-actions github-actions bot added the cli, documentation, enhancement, and eval labels Feb 23, 2026
@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 17 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="osmosis_ai/rollout/eval/evaluation/cache.py">

<violation number="1" location="osmosis_ai/rollout/eval/evaluation/cache.py:520">
P2: Silent data loss: prior session runs are discarded without logging when disk read fails during flush. If the cache file becomes temporarily unreadable (transient I/O error, corruption), `old_runs` silently falls back to `[]` and the subsequent write permanently drops all previously persisted runs. At minimum, log a warning so users know prior progress was lost.</violation>
</file>
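One way to address the issue above at the "log a warning" level is sketched here. It is an illustration only: `load_prior_runs` and the `"runs"` key are hypothetical, and the actual layout of the cache file is not shown in this PR conversation.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_prior_runs(cache_path: Path) -> list:
    """Read previously persisted runs, warning loudly if they cannot be read.

    Hypothetical helper: assumes the cache is a JSON object with a "runs" list.
    """
    if not cache_path.exists():
        return []  # nothing persisted yet, so nothing can be lost
    try:
        data = json.loads(cache_path.read_text(encoding="utf-8"))
        return list(data.get("runs", []))
    except (OSError, json.JSONDecodeError, AttributeError, TypeError) as exc:
        # The fallback is still an empty list, but it is no longer silent.
        logger.warning(
            "Could not read prior runs from %s (%s); the next flush will not "
            "include them",
            cache_path,
            exc,
        )
        return []
```

A stricter variant could re-raise instead of returning an empty list, so that a transient read error aborts the flush rather than overwriting earlier progress.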

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

…cache flushing by adding detailed warnings for cache read failures. Update CLI and runner error handling to include comments for clarity on best-effort cleanup. Remove unused logging imports in orchestrator for cleaner code. These changes aim to strengthen robustness and maintainability across the evaluation framework.
…low address reuse in `find_available_port` function, improving port availability handling.
@JoyboyBrian (Contributor, Author)

@cubic-dev-ai

cubic-dev-ai bot commented Feb 23, 2026

@JoyboyBrian I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 18 files


<file name="osmosis_ai/rollout/eval/evaluation/cache.py">

<violation number="1" location="osmosis_ai/rollout/eval/evaluation/cache.py:233">
P2: Inconsistent exception handling: `compute_eval_fns_fingerprint` catches only `ImportError` while the analogous `compute_module_fingerprint` catches `Exception`. If an eval function module has a syntax error or raises a different exception during import, this will propagate as an unhandled crash instead of gracefully returning `None`. Broaden the catch to `Exception` for consistency.</violation>
</file>

<file name="osmosis_ai/rollout/eval/evaluation/cli.py">

<violation number="1" location="osmosis_ai/rollout/eval/evaluation/cli.py:487">
P2: `--all` flag doesn't override filters in `rm` subcommand: `osmosis eval cache rm --all --model foo` deletes only entries matching "foo" rather than all cached evaluations, contradicting the `--all` help text ("Delete all cached evaluations."). Either make `--all` mutually exclusive with filter flags, or skip filtering when `--all` is set.</violation>
</file>
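The first option in the `--all` issue above (rejecting `--all` together with filter flags) could look roughly like the `argparse` sketch below. It assumes the filters are spelled `--model`, `--dataset`, and `--status`. Only `--all` and `--model` appear verbatim in this conversation, and the real parser setup in `cli.py` may differ.

```python
import argparse


def build_rm_parser() -> argparse.ArgumentParser:
    """Build a stand-in parser for `osmosis eval cache rm` (flag names partly assumed)."""
    parser = argparse.ArgumentParser(prog="osmosis eval cache rm")
    parser.add_argument("--all", action="store_true",
                        help="Delete all cached evaluations.")
    parser.add_argument("--model")
    parser.add_argument("--dataset")  # assumed flag name
    parser.add_argument("--status")   # assumed flag name
    return parser


def parse_rm_args(argv: list[str]) -> argparse.Namespace:
    """Parse arguments and reject --all combined with any filter flag."""
    parser = build_rm_parser()
    args = parser.parse_args(argv)
    used_filters = [
        name for name in ("model", "dataset", "status")
        if getattr(args, name) is not None
    ]
    if args.all and used_filters:
        # parser.error prints usage and exits with status 2.
        parser.error(
            "--all cannot be combined with "
            + ", ".join(f"--{name}" for name in used_filters)
        )
    return args
```

A plain mutually exclusive group is avoided here because it would also forbid combining the filters with each other.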

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 2 files (changes from recent commits).


<file name="osmosis_ai/rollout/eval/evaluation/cache.py">

<violation number="1" location="osmosis_ai/rollout/eval/evaluation/cache.py:233">
P2: Catching all Exception here masks non-import failures (e.g., syntax errors or side-effect crashes) and makes them look like “module not found,” which contradicts the function contract and hides real bugs. Restrict this to ImportError/ModuleNotFoundError so genuine import-time errors surface.</violation>
</file>
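The narrower handling suggested in this last review can be sketched as follows. `fingerprint_module` is a hypothetical stand-in for `compute_eval_fns_fingerprint`, whose real signature is not shown here, and `sha256` again replaces whatever hash the cache module actually uses.

```python
import hashlib
import importlib
from pathlib import Path
from typing import Optional


def fingerprint_module(module_name: str) -> Optional[str]:
    """Hash a module's file, returning None only when it cannot be found.

    Hypothetical sketch: errors raised while the module executes (for example
    SyntaxError) are deliberately not caught, so real bugs stay visible.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so it is covered too.
        return None
    module_file = getattr(module, "__file__", None)
    if module_file is None:
        return None  # built-in modules have no file to fingerprint
    return hashlib.sha256(Path(module_file).read_bytes()).hexdigest()
```

With this shape, a missing module yields `None`, while a module that exists but fails to import raises its original exception instead of being reported as "module not found".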


@JoyboyBrian JoyboyBrian merged commit 48ab013 into main Feb 23, 2026
11 checks passed

Labels

cli, documentation, enhancement, eval

Development

Successfully merging this pull request may close these issues.

Resumable osmosis eval with Cache Management
