Add comprehensive unified baseline adapters supporting Claude, GPT, and Gemini models across multiple evaluation tracks:

Provider Abstraction (models/providers/):
- BaseAPIProvider ABC with common interface for all providers
- AnthropicProvider: Base64 PNG encoding, Messages API
- OpenAIProvider: Data URL format, Chat Completions API
- GoogleProvider: Native PIL Image support, GenerateContent API
- Factory functions: get_provider(), resolve_model_alias()
- Error hierarchy: ProviderError, AuthenticationError, RateLimitError

Baseline Module (baselines/):
- TrackType enum: TRACK_A (coords), TRACK_B (ReAct), TRACK_C (SoM)
- TrackConfig dataclass with factory methods for each track
- BaselineConfig with model alias resolution and registry
- PromptBuilder for track-specific system prompts and user content
- UnifiedResponseParser supporting JSON, function-call, PyAutoGUI formats
- ElementRegistry for element_id to coordinate conversion

Benchmark Integration:
- UnifiedBaselineAgent wrapping UnifiedBaselineAdapter for benchmarks
- Converts BenchmarkObservation -> adapter format -> BenchmarkAction
- Support for all three tracks via --track flag

CLI Commands (baselines/cli.py):
- run: Single model prediction with track selection
- compare: Multi-model comparison on same task
- list-models: Show available models and providers

All 92 tests pass. Ready for model comparison experiments.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
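As a rough illustration of the provider abstraction described in this commit, the sketch below shows how one ABC could hide the two image encodings named above: a base64 content block for the Anthropic Messages API and a data URL for OpenAI Chat Completions. The method name `encode_image`, the `_PROVIDERS` registry, and the single-argument `get_provider()` signature are assumptions made for this example, not the actual code in models/providers/.

```python
import base64
from abc import ABC, abstractmethod


class ProviderError(Exception):
    """Base class for provider failures."""


class AuthenticationError(ProviderError):
    """Raised when an API key is missing or rejected."""


class RateLimitError(ProviderError):
    """Raised when the provider reports throttling."""


class BaseAPIProvider(ABC):
    """Common interface; this sketch only covers image encoding."""

    @abstractmethod
    def encode_image(self, png_bytes: bytes) -> dict:
        """Return a provider-specific content block for one PNG image."""


class AnthropicProvider(BaseAPIProvider):
    def encode_image(self, png_bytes: bytes) -> dict:
        # Messages API image block: raw base64 plus an explicit media type.
        data = base64.b64encode(png_bytes).decode("ascii")
        return {
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": data},
        }


class OpenAIProvider(BaseAPIProvider):
    def encode_image(self, png_bytes: bytes) -> dict:
        # Chat Completions image part: the same base64 wrapped in a data URL.
        data = base64.b64encode(png_bytes).decode("ascii")
        return {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{data}"},
        }


# Hypothetical registry; the real factory may resolve model aliases first.
_PROVIDERS = {"anthropic": AnthropicProvider, "openai": OpenAIProvider}


def get_provider(name: str) -> BaseAPIProvider:
    """Look up a provider by name, raising ProviderError for unknown names."""
    try:
        return _PROVIDERS[name]()
    except KeyError:
        raise ProviderError(f"unknown provider: {name}") from None
```

Callers would then build message content with `get_provider("openai").encode_image(png_bytes)` without branching on the provider name.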
…atibility

All dependencies (torch, transformers, pillow, peft, etc.) support Python 3.10+. The 3.12 requirement was unnecessarily restrictive and broke `pip install openadapt[all]` on Python 3.10 and 3.11.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add CI workflow that runs on pull requests and main branch pushes:
- Tests on Python 3.10 and 3.11
- Runs on Ubuntu and macOS
- Uses uv for dependency management
- Runs ruff linter and formatter
- Runs pytest suite

Matches pattern used by openadapt-viewer and follows OpenAdapt ecosystem conventions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- cluster_id: default=0
- cluster_centroid_distance: default=0.0
- internal_similarity: default=1.0

Fixes 1/14 test failures in test_segmentation.py
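A minimal sketch of what these defaults buy, assuming the fields live on a plain dataclass. The class name `SegmentSketch` and the `description` field are hypothetical, and the real schema may use a different base class. With the defaults in place, code that constructs the object without clustering metadata no longer fails on missing arguments.

```python
from dataclasses import dataclass


@dataclass
class SegmentSketch:  # hypothetical name, for illustration only
    description: str  # hypothetical required field
    # Defaults listed in the commit message above.
    cluster_id: int = 0
    cluster_centroid_distance: float = 0.0
    internal_similarity: float = 1.0


# Construction succeeds without passing any clustering fields.
segment = SegmentSketch(description="open settings")
```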
- Fix unused imports in baselines, benchmarks, and ingest modules
- Fix ambiguous variable names (renamed 'l' to 'loss'/'line')
- Add missing time import in benchmarks/cli.py
- Move warnings import to top of file in benchmarks/cli.py
- Add noqa comments for intentional code patterns
- Fix bare except clause in lambda_labs.py
- Add Episode to TYPE_CHECKING imports in grounding.py
- Rename conflicting local variable in config.py
- Fix undefined _build_nav_links in viewer.py
- Run ruff format to ensure consistent code style

All ruff checks now pass successfully.
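Three of the lint fixes listed above follow common patterns, sketched here under assumptions: the module path `episodes` and the functions `count_steps`, `total_loss`, and `parse_port` are invented for illustration and do not come from the repository.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only for type checkers, so the name resolves in annotations
    # without a runtime import. The module path is hypothetical.
    from episodes import Episode


def count_steps(episode: "Episode") -> int:
    """Quoted annotation keeps the runtime free of the Episode import."""
    return len(episode.steps)


def total_loss(losses: list[float]) -> float:
    """Sum losses using a descriptive loop variable instead of 'l'."""
    total = 0.0
    for loss in losses:
        total += loss
    return total


def parse_port(raw: str, default: int = 22) -> int:
    """Catch the specific exception rather than using a bare 'except:'."""
    try:
        return int(raw)
    except ValueError:
        return default
```

A bare `except:` would also swallow `KeyboardInterrupt` and `SystemExit`, which is why ruff flags it.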
- Change 'goal' to 'instruction' in column assertions
- Change 'image_path' to 'screenshot_path' to match schema
- Update badge URL to use filename-based path (from PR #3)
- Add qualifiers to claims about accuracy and performance (from PR #4)
- Clarify that results are from synthetic benchmarks, not production UIs
- Add disclaimers about extrapolating synthetic results to real-world performance
- Update section titles to indicate synthetic nature of benchmarks

This consolidates the documentation improvements from PRs #3 and #4.
This was referenced Jan 17, 2026
Summary
Adds GitHub Actions CI workflow for automated testing on pull requests and main branch pushes.
Changes
`.github/workflows/test.yml` with the following features:
- `uv` for dependency management (matching openadapt-viewer)

Pattern Consistency
This workflow follows the same pattern as:
- `openadapt-viewer/.github/workflows/test.yml`

Test Plan
Generated with Claude Code