[eval][cli][doc] feat: add resumable eval cache with cache management commands #78
Merged
JoyboyBrian merged 23 commits into main on Feb 23, 2026
…roject.toml and uv.lock This update introduces 'filelock' and 'xxhash' as dependencies in both the pyproject.toml and uv.lock files, ensuring compatibility with the latest features. Additionally, the version of 'virtualenv' has been updated to 20.38.0.
…pabilities and additional result attributes. Introduced 'messages' and 'row_index' to EvalRunResult, and implemented 'run_batch' method for concurrent execution of work items, improving efficiency in evaluation runs.
…nt execution and added 'messages' and 'row_index' attributes to EvalRunResult, improving evaluation efficiency.
…e cache management commands in CLI This update introduces a migration function to transition the rubric cache from the old path to a new one, ensuring backward compatibility. Additionally, new subcommands for cache management have been added to the CLI, allowing users to print the cache root directory and manage cached results more effectively. The evaluation command has been updated to integrate these changes, improving overall cache handling and user experience.
…management commands and optimize cache migration process. This update refines the migration function for the rubric cache, enhancing reliability and user feedback during cache operations.
…ns in EvalOrchestrator, improve cache management in CLI, and refine error handling in cache operations. This update also optimizes the handling of cache data and improves the overall robustness of the evaluation process.
… prevent cache usage with modified datasets, improve cache data handling in EvalCommand, and refine orchestration results to include warnings for dataset changes. This update strengthens the integrity of evaluation results and enhances user feedback during the evaluation process.
…gging samples. This update improves user guidance by suggesting the use of the --fresh flag for fresh evaluations and the --log-samples flag for saving full conversation logs, enhancing the overall user experience during evaluations.
…t for MCP module paths in compute_module_fingerprint, update EvalCommand to handle MCP specifications, and ensure prior completed runs are included in progress callbacks within EvalOrchestrator. This improves the accuracy of module identification and user feedback during evaluations.
…ed runs to retain only successful ones when retrying failed evaluations. Enhance cache management by ensuring only successful runs are written back, improving the robustness of the evaluation process. Add tests to validate the retry behavior and ensure proper re-queuing of failed runs.
…ey and base URL in ExternalLLMClient for better encapsulation. Update JSON writing to ensure proper character encoding. Include dataset fingerprint warning in OrchestratorResult for improved user feedback on dataset changes. Refactor EvalRunner to utilize new properties for LLM client configuration, enhancing code clarity and maintainability.
… with logger warnings for cache write failures, invalid timeout values, and backup issues. This enhances logging consistency and improves user feedback during evaluation processes. Additionally, update the summary calculation to accurately reflect the number of runs.
… internal functions to public, update error handling to catch broader exceptions, and enhance logging for cache write failures. This change improves code readability and maintains functionality across the evaluation framework.
…ments with logger warnings for better error handling in rubric cache migration and directory checks. Update exception handling in cache evaluation to catch broader OSError, enhancing robustness and consistency in logging across the evaluation framework.
…de new caching features for `osmosis eval`, such as automatic result caching, resuming interrupted evaluations, and additional command options. Introduce environment variables for cache settings and improve error handling for cache-related issues. This update strengthens user guidance and clarifies the evaluation process.
…removing cached evaluations in `osmosis eval`. Update documentation to reflect these changes, including filtering options for model, dataset, and status. Improve error handling in cache operations and add tests for new CLI functionalities, ensuring robust cache management and user guidance.
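The retry behavior described in the commits above (keep only successful runs, re-queue the failed ones) can be sketched roughly as follows. `CachedRun` and `split_for_retry` are hypothetical names used for illustration, and the field layout is an assumption rather than the PR's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedRun:
    """Hypothetical cached run record; field names are assumptions."""

    row_index: int
    run_index: int
    success: bool


def split_for_retry(
    runs: list[CachedRun],
) -> tuple[list[CachedRun], list[tuple[int, int]]]:
    """Keep successful runs and return the (row, run) keys to re-queue."""
    kept = [r for r in runs if r.success]
    requeue = [(r.row_index, r.run_index) for r in runs if not r.success]
    return kept, requeue


runs = [CachedRun(0, 0, True), CachedRun(0, 1, False), CachedRun(1, 0, True)]
kept, requeue = split_for_retry(runs)
```

Only `kept` would be written back to the cache, so a failed run never blocks a later retry.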
1 issue found across 17 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="osmosis_ai/rollout/eval/evaluation/cache.py">
<violation number="1" location="osmosis_ai/rollout/eval/evaluation/cache.py:520">
P2: Silent data loss: prior session runs are discarded without logging when disk read fails during flush. If the cache file becomes temporarily unreadable (transient I/O error, corruption), `old_runs` silently falls back to `[]` and the subsequent write permanently drops all previously persisted runs. At minimum, log a warning so users know prior progress was lost.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
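One way to address the flush concern raised in the review above is to warn whenever prior runs cannot be read, instead of silently falling back to an empty list. This is a minimal sketch: `load_prior_runs` and the `"runs"` key are assumptions for illustration, not the actual code in `cache.py`.

```python
import json
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def load_prior_runs(cache_path: Path) -> list[dict]:
    """Read previously persisted runs, warning (not failing silently) on errors."""
    if not cache_path.exists():
        return []  # Nothing persisted yet, so an empty list is expected.
    try:
        data = json.loads(cache_path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # ValueError covers json.JSONDecodeError and UnicodeDecodeError.
        logger.warning("Could not read prior runs from %s: %s", cache_path, exc)
        return []
    if not isinstance(data, dict):
        logger.warning("Unexpected cache layout in %s; ignoring it", cache_path)
        return []
    return data.get("runs", [])


# Demo: a corrupt cache file produces a warning and an empty result.
with tempfile.TemporaryDirectory() as tmp:
    broken = Path(tmp) / "cache.json"
    broken.write_text("{not json", encoding="utf-8")
    recovered = load_prior_runs(broken)
```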
…cache flushing by adding detailed warnings for cache read failures. Update CLI and runner error handling to include comments for clarity on best-effort cleanup. Remove unused logging imports in orchestrator for cleaner code. These changes aim to strengthen robustness and maintainability across the evaluation framework.
…low address reuse in `find_available_port` function, improving port availability handling.
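A port probe that allows address reuse might look like the sketch below. Only the name `find_available_port` comes from the commit message; the signature, defaults, and scan strategy are assumptions. `SO_REUSEADDR` is a standard socket option, and on POSIX systems it lets the probe bind a port that is still in `TIME_WAIT`.

```python
import socket


def find_available_port(
    host: str = "127.0.0.1", start: int = 8000, attempts: int = 50
) -> int:
    """Return the first port in [start, start + attempts) that can be bound."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            # Allow binding even if a previous socket on this port is in TIME_WAIT.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                sock.bind((host, port))
            except OSError:
                continue  # Port is in use; try the next one.
            return port
    raise RuntimeError(f"No free port in range {start}-{start + attempts - 1}")


port = find_available_port(start=20000)
```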
@JoyboyBrian I have started the AI code review. It will take a few minutes to complete.
2 issues found across 18 files
<file name="osmosis_ai/rollout/eval/evaluation/cache.py">
<violation number="1" location="osmosis_ai/rollout/eval/evaluation/cache.py:233">
P2: Inconsistent exception handling: `compute_eval_fns_fingerprint` catches only `ImportError` while the analogous `compute_module_fingerprint` catches `Exception`. If an eval function module has a syntax error or raises a different exception during import, this will propagate as an unhandled crash instead of gracefully returning `None`. Broaden the catch to `Exception` for consistency.</violation>
</file>
<file name="osmosis_ai/rollout/eval/evaluation/cli.py">
<violation number="1" location="osmosis_ai/rollout/eval/evaluation/cli.py:487">
P2: `--all` flag doesn't override filters in `rm` subcommand: `osmosis eval cache rm --all --model foo` deletes only entries matching "foo" rather than all cached evaluations, contradicting the `--all` help text ("Delete all cached evaluations."). Either make `--all` mutually exclusive with filter flags, or skip filtering when `--all` is set.</violation>
</file>
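The first option suggested for the `--all` issue (making it mutually exclusive with filter flags) could be sketched like this. The sketch assumes an `argparse`-based CLI, which the PR does not confirm, and the `--dataset` and `--status` spellings are assumed from the documented filter names. A plain mutually exclusive group is not used because it would also forbid combining the filters with each other.

```python
import argparse


def parse_rm_args(argv: list[str]) -> argparse.Namespace:
    """Parse `cache rm` arguments, rejecting --all combined with filters."""
    parser = argparse.ArgumentParser(prog="osmosis eval cache rm")
    parser.add_argument(
        "--all", action="store_true", help="Delete all cached evaluations."
    )
    # Filter flags; the --dataset and --status spellings are assumptions.
    parser.add_argument("--model")
    parser.add_argument("--dataset")
    parser.add_argument("--status")
    args = parser.parse_args(argv)

    filters_used = [
        name for name in ("model", "dataset", "status")
        if getattr(args, name) is not None
    ]
    if args.all and filters_used:
        # parser.error prints usage to stderr and raises SystemExit(2).
        flags = ", ".join(f"--{name}" for name in filters_used)
        parser.error(f"--all cannot be combined with: {flags}")
    return args


ok = parse_rm_args(["--model", "foo"])
try:
    parse_rm_args(["--all", "--model", "foo"])
    rejected = False
except SystemExit:
    rejected = True
```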
1 issue found across 2 files (changes from recent commits).
<file name="osmosis_ai/rollout/eval/evaluation/cache.py">
<violation number="1" location="osmosis_ai/rollout/eval/evaluation/cache.py:233">
P2: Catching all Exception here masks non-import failures (e.g., syntax errors or side-effect crashes) and makes them look like “module not found,” which contradicts the function contract and hides real bugs. Restrict this to ImportError/ModuleNotFoundError so genuine import-time errors surface.</violation>
</file>
What
- Add a resumable eval cache for `osmosis eval`, including deterministic task IDs/config hashes, dataset/module/eval-function fingerprinting, atomic JSON writes, lock-based concurrency protection, and corrupt-cache backup handling.
- … `EvalOrchestrator` to separate orchestration from CLI/runner concerns:
  - … `--retry-failed` behavior (re-queue failed runs only)
  - … (`.jsonl`)
- … `osmosis eval` CLI:
  - `--fresh`, `--retry-failed`, `--log-samples`, `--output-path`
  - `osmosis eval cache dir|ls|rm` with filtering and confirmation controls
  - … `--output` compatibility
- Add `messages` and `row_index` to `EvalRunResult`
- Add `EvalRunner.run_batch()` for orchestrator-driven batch execution
- Expose `ExternalLLMClient.api_key`/`api_base` via properties (remove private attribute access)
- Migrate the rubric cache from `~/.cache/osmosis/eval_result` to `~/.cache/osmosis/rubric` with one-time safe migration logic.
- Add dependencies (`filelock`, `xxhash`) and update docs (`cli`, `eval-mode`, `configuration`, `troubleshooting`) plus comprehensive unit/integration tests.
Why
- `osmosis eval` workloads are often long-running and expensive to restart from scratch after interruptions.
This PR implements resumable eval execution and cache lifecycle management proposed in issue #77, including resume, fresh rerun, failed-run retry, cache inspection/cleanup, and safer file handling.
Closes #77.
- … `--log-samples`, structured output).
How to Test
- … `osmosis eval ...`, interrupt, then rerun the same command to verify resume
- … `--fresh` to verify backup + restart behavior
- … `--retry-failed` to verify failed-only re-execution
- … `osmosis eval cache ls/rm` filters and deletion flow
Checklist
- … `[module] type: description` format
- … (`enhancement`, `bug`, `breaking`)
- `ruff check .` and `ruff format --check .` pass
- `pyright osmosis_ai/` passes
- `pytest` passes (new tests added if applicable)
Summary by cubic
Adds a durable, resumable cache for osmosis eval with CLI tools to inspect/remove cached runs, plus improved logging/error handling and a small fix to reduce local auth server port conflicts. Improves reliability for long evaluations and makes resuming, retrying, and debugging easier.
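Resuming depends on the same configuration always mapping to the same cache key. A minimal sketch of a deterministic config hash is shown below. It uses `hashlib` from the standard library as a stand-in (the PR adds `xxhash` as a dependency, but its exact usage is not shown here), and `config_hash` plus the example keys are hypothetical.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a config deterministically, independent of key insertion order."""
    # Canonical form: sorted keys and fixed separators give stable bytes.
    canonical = json.dumps(
        config, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


a = config_hash({"model": "m1", "dataset": "d.jsonl", "n_runs": 3})
b = config_hash({"n_runs": 3, "dataset": "d.jsonl", "model": "m1"})
c = config_hash({"model": "m2", "dataset": "d.jsonl", "n_runs": 3})
```

Here `a` and `b` match because only key order differs, while `c` differs because the model changed.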
New Features
Migration
Written for commit af05cf5. Summary will update on new commits.
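The atomic JSON writes mentioned in the description are commonly implemented by writing to a temporary file in the same directory and then renaming it over the target. The sketch below shows that general technique using only the standard library. `atomic_write_json` is a hypothetical helper, and the lock-based protection the PR pairs with it (via `filelock`) is omitted here.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, payload: dict) -> None:
    """Write JSON so readers never observe a partially written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, ensure_ascii=False)
            fh.flush()
            os.fsync(fh.fileno())  # Push data to disk before the rename.
        os.replace(tmp_name, path)
    except BaseException:
        # Best-effort cleanup of the temp file, then re-raise the original error.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise


# Demo: write, then read back.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "cache" / "result.json"
    atomic_write_json(target, {"runs": [{"row_index": 0, "success": True}]})
    loaded = json.loads(target.read_text(encoding="utf-8"))
    leftovers = [p.name for p in target.parent.iterdir() if p.suffix == ".tmp"]
```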