Labels
enhancement (New feature or request), eval (Eval/Rubric evaluation)
Description
Problem or Use Case
osmosis eval runs are often long (large datasets, multiple runs per row, external model calls) and can be interrupted by network issues, process restarts, machine shutdowns, or quota/auth failures.
When that happens, users may need to restart from scratch, which causes:
- Repeated API/computation cost for already-completed runs
- Longer experiment iteration cycles
- Higher risk of conflicting state when duplicate evals run concurrently
- No clear built-in workflow to inspect, filter, and clean old cache entries
Proposed Solution
Add first-class resumable execution and cache lifecycle management for osmosis eval:
- Persist eval progress/results to disk with a stable task ID derived from config + source/data fingerprints
- Auto-resume when re-running the same command after interruption
- Add --fresh to force a clean rerun and --retry-failed to rerun only failed runs
- Add osmosis eval cache subcommands for cache inspection and cleanup (dir, ls, rm)
- Use file locking + atomic writes to ensure consistency and prevent concurrent corruption
- Detect dataset changes during/after runs and warn or fail with actionable guidance
- Support --log-samples and structured output directories for better debugging/auditing
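To make the proposal concrete, here is a minimal sketch of how a fingerprint-derived task ID, atomic state writes, and resume logic could fit together. It is an illustration only, not the osmosis implementation: the names task_id, atomic_write_json, load_state, run_eval, and the state.json layout are all hypothetical. It also assumes that a plain resume skips previously failed runs and that retry_failed reruns them; the proposal does not spell out either behavior. File locking is omitted for brevity.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def task_id(config: dict, dataset_path: Path) -> str:
    """Derive a stable ID from the eval config and the dataset's bytes."""
    h = hashlib.sha256()
    # sort_keys makes the digest independent of dict insertion order.
    h.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    h.update(b"\0")
    h.update(hashlib.sha256(dataset_path.read_bytes()).digest())
    return h.hexdigest()[:16]


def atomic_write_json(path: Path, payload: dict) -> None:
    """Write to a temp file in the target directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace swaps the file in one step, so readers never see a partial write.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_state(path: Path) -> dict:
    """Load saved progress, or start empty if nothing has been persisted yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"completed": {}, "failed": {}}


def run_eval(cache_dir, config, dataset_path, rows, evaluate,
             fresh=False, retry_failed=False):
    """Evaluate (key, row) pairs, skipping work recorded by an earlier run.

    Keys must be strings because the state is stored as JSON.
    """
    state_path = Path(cache_dir) / task_id(config, Path(dataset_path)) / "state.json"
    state = {"completed": {}, "failed": {}} if fresh else load_state(state_path)
    for key, row in rows:
        if key in state["completed"]:
            continue  # already done in a previous run
        if key in state["failed"] and not retry_failed:
            continue  # assumption: failed runs are retried only on request
        try:
            state["completed"][key] = evaluate(row)
            state["failed"].pop(key, None)
        except Exception as exc:
            state["failed"][key] = repr(exc)
        # Persist after every row so an interruption loses at most one result.
        atomic_write_json(state_path, state)
    return state
```

Because the task ID covers both the config and the dataset contents, editing the dataset yields a different ID and therefore a separate cache entry. A real implementation would also need the proposed warning or failure when the dataset changes mid-run, plus a lock around the state file to guard against concurrent evals.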
Alternatives Considered
No response
SDK Component
None
Additional Context
No response