[eval][cli][doc] feat: add resumable eval cache with cache management commands by JoyboyBrian · Pull Request #78 · Osmosis-AI/osmosis-sdk-python

JoyboyBrian · 2026-02-23T19:55:37Z

What

Implements the feature scope proposed in #77.
Add a file-based eval cache system for osmosis eval, including deterministic task IDs/config hashes, dataset/module/eval-function fingerprinting, atomic JSON writes, lock-based concurrency protection, and corrupt-cache backup handling.
Introduce EvalOrchestrator to separate orchestration from CLI/runner concerns:
- lock lifecycle management
- resume from partial progress
- --retry-failed behavior (re-queue failed runs only)
- graceful interruption handling (SIGINT/SIGTERM)
- periodic dataset integrity checks and periodic cache flush
- optional sample message logging (.jsonl)
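For reviewers unfamiliar with the interruption path, here is a minimal sketch of how SIGINT/SIGTERM handling can cooperate with periodic flushing. `StopFlag`, `run_batches`, and the `flush` callback are hypothetical names for illustration and are not taken from `EvalOrchestrator`; the actual implementation may differ.

```python
import signal
import threading


class StopFlag:
    """Hypothetical helper: remembers that a shutdown signal arrived."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def install(self) -> None:
        # Must be called from the main thread; covers Ctrl-C and termination.
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame) -> None:
        # Only record the request; the work loop decides when to stop.
        self._event.set()

    @property
    def stop_requested(self) -> bool:
        return self._event.is_set()


def run_batches(batches, flag, flush):
    """Process batches until finished or until a stop has been requested."""
    completed = []
    for batch in batches:
        if flag.stop_requested:
            break  # leave remaining work for a later resume
        completed.extend(batch)  # stand-in for real eval work
        flush(completed)  # persist progress after every batch
    return completed
```

The handler only sets a flag, so the in-flight batch finishes and is flushed before the loop exits, which is what makes a later resume possible.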
Expand osmosis eval CLI:
- new flags: --fresh, --retry-failed, --log-samples, --output-path
- new cache subcommands: osmosis eval cache dir|ls|rm with filtering and confirmation controls
- structured output writing while preserving legacy --output compatibility
- resume and dataset-fingerprint warning messages for better UX
Update eval execution plumbing:
- add messages and row_index to EvalRunResult
- add EvalRunner.run_batch() for orchestrator-driven batch execution
- expose ExternalLLMClient.api_key / api_base via properties (remove private attribute access)
Migrate rubric cache path from ~/.cache/osmosis/eval_result to ~/.cache/osmosis/rubric with one-time safe migration logic.
Add required dependencies for caching (filelock, xxhash) and update docs (cli, eval-mode, configuration, troubleshooting) plus comprehensive unit/integration tests.
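As an illustration of the atomic-write and corrupt-cache-backup behavior described above, here is a stdlib-only sketch. The function names and the `.corrupt` suffix are assumptions for this example, and locking is omitted (the PR uses `filelock` for that); this is not the code in `cache.py`.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def atomic_write_json(path: Path, data: dict) -> None:
    """Write to a temp file in the target directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh, ensure_ascii=False)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Best-effort cleanup of the temp file, then re-raise.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def load_or_backup(path: Path) -> Optional[dict]:
    """Return cached data, or move an unparseable file aside and return None."""
    if not path.exists():
        return None
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        backup = path.with_name(path.name + ".corrupt")  # assumed suffix
        os.replace(path, backup)
        return None
```

Readers never observe a half-written file because the rename either happens completely or not at all.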

Why

osmosis eval workloads are often long-running and expensive to restart from scratch after interruptions.
This PR implements resumable eval execution and cache lifecycle management proposed in issue #77, including resume, fresh rerun, failed-run retry, cache inspection/cleanup, and safer file handling.
Closes #77.

Long-running eval jobs need durable progress persistence and reliable resume after interruption.
Cache safety requires deterministic invalidation when code/data/config changes, plus lock protection against concurrent runs with the same config.
Users need first-class cache operations (inspect/remove/retry-failed/fresh restart) and better debugging artifacts (--log-samples, structured output).
Separating rubric cache from eval cache clarifies ownership and avoids path confusion from legacy cache naming.
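To make "deterministic invalidation" concrete, here is a rough sketch of a config hash that is stable across key order. It uses `hashlib.sha256` from the standard library rather than `xxhash` (which this PR adds) so that the example has no third-party dependency; `config_hash`, `task_id`, and the field names are hypothetical.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a config so that key order and whitespace do not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def task_id(model: str, dataset_fingerprint: str, config: dict) -> str:
    """Derive a cache key that changes when model, data, or config changes."""
    return config_hash(
        {"model": model, "dataset": dataset_fingerprint, "config": config}
    )
```

Any change to the dataset fingerprint or config produces a different key, so a stale cache entry is simply never looked up.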

How to Test

Manual smoke test:
- run osmosis eval ..., interrupt, then rerun same command to verify resume
- run with --fresh to verify backup + restart behavior
- run with --retry-failed to verify failed-only re-execution
- verify osmosis eval cache ls/rm filters and deletion flow
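When checking the `--retry-failed` step, it may help to keep the intended semantics in mind: successful runs are kept, and everything else is queued again. A minimal sketch, assuming each cached run is a dict with `key` and `success` fields (an assumed schema, not the actual cache format):

```python
def plan_retry(cached_runs, all_keys):
    """Split cached runs into those to keep and the keys that still need to run."""
    kept = [run for run in cached_runs if run["success"]]
    done = {run["key"] for run in kept}
    # Failed runs and never-started runs both end up in the pending list.
    pending = [key for key in all_keys if key not in done]
    return kept, pending
```

For example, with one successful and one failed cached run out of three expected runs, only the successful one is kept and the other two keys are returned as pending.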

Checklist

PR title follows [module] type: description format
Appropriate labels added (e.g. enhancement, bug, breaking)
ruff check . and ruff format --check . pass
pyright osmosis_ai/ passes
pytest passes (new tests added if applicable)
Public API changes are documented
No secrets or credentials included

Summary by cubic

Adds a durable, resumable cache for osmosis eval with CLI tools to inspect/remove cached runs, plus improved logging/error handling and a small fix to reduce local auth server port conflicts. Improves reliability for long evaluations and makes resuming, retrying, and debugging easier.

New Features
- File-based eval cache with deterministic task IDs and fingerprints across code/data/config; dataset fingerprint checks with warnings on change.
- Atomic JSON writes with file locks, auto-resume, corrupt-cache backup, and clearer logging/warnings for read/write/timeout issues.
- EvalOrchestrator: lock lifecycle, resume, retry failed-only (keeps successful), SIGINT/SIGTERM handling, periodic integrity checks/flush, progress includes prior completed runs, optional sample logs.
- CLI: --fresh, --retry-failed, --log-samples, --output-path; cache commands osmosis eval cache dir|ls|rm with model/dataset/status filters; structured output remains --output compatible.
- Runner: EvalRunResult adds messages and row_index; new run_batch for batched execution; LLM client exposes api_key and api_base via properties.
- Supports MCP module paths in fingerprinting; docs and troubleshooting updated; added filelock and xxhash; extensive unit/integration tests.
- Local auth server: enable address reuse (SO_REUSEADDR) to reduce port conflicts.
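For context on the last item, this is roughly what enabling address reuse on a port-probing socket looks like. `find_free_port` is a hypothetical stand-in and not the SDK's `find_available_port`, whose details may differ.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port, with SO_REUSEADDR set on the probe socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Allows rebinding an address that is still in TIME_WAIT.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]
```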
Migration
- Rubric cache moved from ~/.cache/osmosis/eval_result to ~/.cache/osmosis/rubric with one-time safe migration.
- New env vars: OSMOSIS_CACHE_DIR to override cache root; OSMOSIS_EVAL_LOCK_TIMEOUT to control lock wait time.
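A sketch of how the two environment variables might be resolved. The default root of `~/.cache/osmosis` is inferred from the rubric cache path above, and the default timeout of 30 seconds is purely an assumption for illustration; the function names are hypothetical.

```python
import logging
import os
from pathlib import Path

logger = logging.getLogger("eval_cache_env_sketch")

DEFAULT_LOCK_TIMEOUT = 30.0  # assumed value, not taken from the PR


def cache_root(env=os.environ) -> Path:
    """Return OSMOSIS_CACHE_DIR if set, otherwise the assumed default root."""
    override = env.get("OSMOSIS_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "osmosis"


def lock_timeout(env=os.environ) -> float:
    """Parse OSMOSIS_EVAL_LOCK_TIMEOUT, falling back with a warning if invalid."""
    raw = env.get("OSMOSIS_EVAL_LOCK_TIMEOUT")
    if raw is None:
        return DEFAULT_LOCK_TIMEOUT
    try:
        value = float(raw)
    except ValueError:
        value = -1.0  # treat unparseable input like a negative value
    if not value >= 0:  # also rejects NaN
        logger.warning("Invalid OSMOSIS_EVAL_LOCK_TIMEOUT=%r; using default", raw)
        return DEFAULT_LOCK_TIMEOUT
    return value
```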

Written for commit af05cf5. Summary will update on new commits.

…roject.toml and uv.lock This update introduces 'filelock' and 'xxhash' as dependencies in both the pyproject.toml and uv.lock files, ensuring compatibility with the latest features. Additionally, the version of 'virtualenv' has been updated to 20.38.0.

…pabilities and additional result attributes. Introduced 'messages' and 'row_index' to EvalRunResult, and implemented 'run_batch' method for concurrent execution of work items, improving efficiency in evaluation runs.

…nt execution and added 'messages' and 'row_index' attributes to EvalRunResult, improving evaluation efficiency.

…e cache management commands in CLI This update introduces a migration function to transition the rubric cache from the old path to a new one, ensuring backward compatibility. Additionally, new subcommands for cache management have been added to the CLI, allowing users to print the cache root directory and manage cached results more effectively. The evaluation command has been updated to integrate these changes, improving overall cache handling and user experience.

…management commands and optimize cache migration process. This update refines the migration function for the rubric cache, enhancing reliability and user feedback during cache operations.

…ns in EvalOrchestrator, improve cache management in CLI, and refine error handling in cache operations. This update also optimizes the handling of cache data and improves the overall robustness of the evaluation process.

… prevent cache usage with modified datasets, improve cache data handling in EvalCommand, and refine orchestration results to include warnings for dataset changes. This update strengthens the integrity of evaluation results and enhances user feedback during the evaluation process.

…gging samples. This update improves user guidance by suggesting the use of the --fresh flag for fresh evaluations and the --log-samples flag for saving full conversation logs, enhancing the overall user experience during evaluations.

…t for MCP module paths in compute_module_fingerprint, update EvalCommand to handle MCP specifications, and ensure prior completed runs are included in progress callbacks within EvalOrchestrator. This improves the accuracy of module identification and user feedback during evaluations.

…ed runs to retain only successful ones when retrying failed evaluations. Enhance cache management by ensuring only successful runs are written back, improving the robustness of the evaluation process. Add tests to validate the retry behavior and ensure proper re-queuing of failed runs.

…ey and base URL in ExternalLLMClient for better encapsulation. Update JSON writing to ensure proper character encoding. Include dataset fingerprint warning in OrchestratorResult for improved user feedback on dataset changes. Refactor EvalRunner to utilize new properties for LLM client configuration, enhancing code clarity and maintainability.

… with logger warnings for cache write failures, invalid timeout values, and backup issues. This enhances logging consistency and improves user feedback during evaluation processes. Additionally, update the summary calculation to accurately reflect the number of runs.

… internal functions to public, update error handling to catch broader exceptions, and enhance logging for cache write failures. This change improves code readability and maintains functionality across the evaluation framework.

…ments with logger warnings for better error handling in rubric cache migration and directory checks. Update exception handling in cache evaluation to catch broader OSError, enhancing robustness and consistency in logging across the evaluation framework.

…de new caching features for `osmosis eval`, such as automatic result caching, resuming interrupted evaluations, and additional command options. Introduce environment variables for cache settings and improve error handling for cache-related issues. This update strengthens user guidance and clarifies the evaluation process.

…removing cached evaluations in `osmosis eval`. Update documentation to reflect these changes, including filtering options for model, dataset, and status. Improve error handling in cache operations and add tests for new CLI functionalities, ensuring robust cache management and user guidance.

tests/unit/rollout/eval/evaluation/test_orchestrator.py

osmosis_ai/rollout/eval/evaluation/orchestrator.py

osmosis_ai/cli_services/session.py

osmosis_ai/rollout/eval/evaluation/cli.py

osmosis_ai/rollout/eval/evaluation/runner.py

osmosis_ai/rollout/eval/evaluation/cache.py

cubic-dev-ai

1 issue found across 17 files

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="osmosis_ai/rollout/eval/evaluation/cache.py">

<violation number="1" location="osmosis_ai/rollout/eval/evaluation/cache.py:520">
P2: Silent data loss: prior session runs are discarded without logging when disk read fails during flush. If the cache file becomes temporarily unreadable (transient I/O error, corruption), `old_runs` silently falls back to `[]` and the subsequent write permanently drops all previously persisted runs. At minimum, log a warning so users know prior progress was lost.</violation>
</file>
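One possible shape for the minimal fix suggested here, as a sketch only: distinguish "no cache file yet" from "cache file unreadable" and log a warning in the second case. `read_prior_runs` and the `runs` key are hypothetical and do not reflect the actual code at `cache.py:520`.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("eval_cache_flush_sketch")


def read_prior_runs(path: Path) -> list:
    """Load previously persisted runs; warn instead of failing silently."""
    if not path.exists():
        return []  # first flush of a new cache, nothing to lose
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # ValueError covers json.JSONDecodeError and UnicodeDecodeError.
        logger.warning(
            "Could not read %s during flush (%s); "
            "previously persisted runs will not be carried over.",
            path,
            exc,
        )
        return []
    return data.get("runs", [])
```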

osmosis_ai/rollout/eval/evaluation/cache.py

osmosis_ai/rollout/eval/evaluation/orchestrator.py

osmosis_ai/rollout/eval/evaluation/cache.py

…cache flushing by adding detailed warnings for cache read failures. Update CLI and runner error handling to include comments for clarity on best-effort cleanup. Remove unused logging imports in orchestrator for cleaner code. These changes aim to strengthen robustness and maintainability across the evaluation framework.

osmosis_ai/rollout/eval/evaluation/cache.py

…low address reuse in `find_available_port` function, improving port availability handling.

JoyboyBrian · 2026-02-23T21:36:58Z

@cubic-dev-ai

cubic-dev-ai · 2026-02-23T21:37:19Z

@cubic-dev-ai

@JoyboyBrian I have started the AI code review. It will take a few minutes to complete.

cubic-dev-ai

2 issues found across 18 files

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="osmosis_ai/rollout/eval/evaluation/cache.py">

<violation number="1" location="osmosis_ai/rollout/eval/evaluation/cache.py:233">
P2: Inconsistent exception handling: `compute_eval_fns_fingerprint` catches only `ImportError` while the analogous `compute_module_fingerprint` catches `Exception`. If an eval function module has a syntax error or raises a different exception during import, this will propagate as an unhandled crash instead of gracefully returning `None`. Broaden the catch to `Exception` for consistency.</violation>
</file>

<file name="osmosis_ai/rollout/eval/evaluation/cli.py">

<violation number="1" location="osmosis_ai/rollout/eval/evaluation/cli.py:487">
P2: `--all` flag doesn't override filters in `rm` subcommand: `osmosis eval cache rm --all --model foo` deletes only entries matching "foo" rather than all cached evaluations, contradicting the `--all` help text ("Delete all cached evaluations."). Either make `--all` mutually exclusive with filter flags, or skip filtering when `--all` is set.</violation>
</file>
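A sketch of the first option (rejecting `--all` together with filters) using `argparse`. The parser below is a standalone stand-in for the `rm` subcommand; `--dataset` and `--status` are assumed flag names based on the filter list in the PR description, and the real CLI wiring may differ.

```python
import argparse


def build_rm_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="osmosis eval cache rm")
    parser.add_argument("--all", action="store_true", dest="delete_all",
                        help="Delete all cached evaluations.")
    parser.add_argument("--model")
    parser.add_argument("--dataset")  # assumed flag name
    parser.add_argument("--status")  # assumed flag name
    return parser


def parse_rm_args(argv):
    """Parse arguments and reject --all combined with any filter flag."""
    parser = build_rm_parser()
    args = parser.parse_args(argv)
    filters = (args.model, args.dataset, args.status)
    if args.delete_all and any(value is not None for value in filters):
        # parser.error prints usage and exits with status 2.
        parser.error("--all cannot be combined with --model, --dataset, or --status")
    return args
```

Failing loudly avoids the surprising case where `--all --model foo` deletes only a subset.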

osmosis_ai/rollout/eval/evaluation/cache.py

osmosis_ai/rollout/eval/evaluation/cli.py

cubic-dev-ai

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="osmosis_ai/rollout/eval/evaluation/cache.py">

<violation number="1" location="osmosis_ai/rollout/eval/evaluation/cache.py:233">
P2: Catching all Exception here masks non-import failures (e.g., syntax errors or side-effect crashes) and makes them look like “module not found,” which contradicts the function contract and hides real bugs. Restrict this to ImportError/ModuleNotFoundError so genuine import-time errors surface.</violation>
</file>
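To illustrate the narrower catch this comment recommends, here is a small sketch: only a failed import maps to `None`, while a `SyntaxError` or a crash during import propagates. `module_source_fingerprint` is a hypothetical stand-in for `compute_eval_fns_fingerprint`, and hashing the module source with `hashlib` is an assumption about what the fingerprint covers.

```python
import hashlib
import importlib
import inspect
from typing import Optional


def module_source_fingerprint(module_name: str) -> Optional[str]:
    """Return a hash of the module's source, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None
    try:
        source = inspect.getsource(module)
    except (OSError, TypeError):
        # Source is unavailable, e.g. for built-in or compiled modules.
        return None
    return hashlib.sha256(source.encode("utf-8")).hexdigest()
```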

osmosis_ai/rollout/eval/evaluation/cache.py

codecov · 2026-02-23T22:55:15Z

Codecov Report

❌ Patch coverage is 69.58042% with 348 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| osmosis_ai/rollout/eval/evaluation/cli.py | 39.87% | 179 missing, 17 partials |
| osmosis_ai/rollout/eval/evaluation/cache.py | 82.26% | 56 missing, 27 partials |
| osmosis_ai/rollout/eval/evaluation/orchestrator.py | 86.36% | 25 missing, 8 partials |
| osmosis_ai/rollout/eval/evaluation/runner.py | 68.75% | 23 missing, 2 partials |
| osmosis_ai/cli_services/session.py | 57.14% | 8 missing, 1 partial |
| osmosis_ai/rollout/eval/common/llm_client.py | 66.66% | 2 missing |

JoyboyBrian added 17 commits February 20, 2026 21:03

eval enhancement phase 3: Implemented 'run_batch' method for concurre…

533098d

…nt execution and added 'messages' and 'row_index' attributes to EvalRunResult, improving evaluation efficiency.

eval enhancement phase 5: Introduce improved error handling in cache …

d3b05f3

…management commands and optimize cache migration process. This update refines the migration function for the rubric cache, enhancing reliability and user feedback during cache operations.

format

6886c9a

JoyboyBrian requested a review from BaiqingL as a code owner February 23, 2026 19:55

JoyboyBrian removed the request for review from BaiqingL February 23, 2026 19:55

github-actions bot added the cli, documentation, enhancement, and eval labels Feb 23, 2026

github-actions bot approved these changes Feb 23, 2026

github-code-quality bot found potential problems Feb 23, 2026

cubic-dev-ai bot reviewed Feb 23, 2026

osmosis_ai/rollout/eval/evaluation/cache.py (outdated, resolved)

fix pyright type checking issue

7ecc181

github-code-quality bot found potential problems Feb 23, 2026

address pyright type checking issues

cf3d3f1

github-code-quality bot found potential problems Feb 23, 2026

osmosis_ai/rollout/eval/evaluation/cache.py: 2 alerts, both fixed

github-code-quality bot found potential problems Feb 23, 2026

Enable address reuse for local server socket: Add socket option to al…

b66af47

…low address reuse in `find_available_port` function, improving port availability handling.

cubic-dev-ai bot reviewed Feb 23, 2026

osmosis_ai/rollout/eval/evaluation/cache.py (resolved)

osmosis_ai/rollout/eval/evaluation/cli.py (resolved)

address cubic review

2a9b8e4

cubic-dev-ai bot reviewed Feb 23, 2026

osmosis_ai/rollout/eval/evaluation/cache.py (outdated, resolved)

fix

af05cf5

JoyboyBrian merged commit 48ab013 into main Feb 23, 2026
11 checks passed
