feat: implement unified baseline adapters for VLM comparison #2
Closed
Conversation
Add comprehensive unified baseline adapters supporting Claude, GPT, and Gemini models across multiple evaluation tracks.

Provider Abstraction (models/providers/):
- BaseAPIProvider ABC with a common interface for all providers
- AnthropicProvider: base64 PNG encoding, Messages API
- OpenAIProvider: data URL format, Chat Completions API
- GoogleProvider: native PIL Image support, GenerateContent API
- Factory functions: get_provider(), resolve_model_alias()
- Error hierarchy: ProviderError, AuthenticationError, RateLimitError

Baseline Module (baselines/):
- TrackType enum: TRACK_A (coords), TRACK_B (ReAct), TRACK_C (SoM)
- TrackConfig dataclass with factory methods for each track
- BaselineConfig with model alias resolution and registry
- PromptBuilder for track-specific system prompts and user content
- UnifiedResponseParser supporting JSON, function-call, and PyAutoGUI formats
- ElementRegistry for element_id-to-coordinate conversion

Benchmark Integration:
- UnifiedBaselineAgent wrapping UnifiedBaselineAdapter for benchmarks
- Converts BenchmarkObservation -> adapter format -> BenchmarkAction
- Supports all three tracks via the --track flag

CLI Commands (baselines/cli.py):
- run: single-model prediction with track selection
- compare: multi-model comparison on the same task
- list-models: show available models and providers

All 92 tests pass. Ready for model comparison experiments.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
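The provider abstraction described above could look roughly like the following. This is a minimal sketch, not the actual implementation: the class and function names (`BaseAPIProvider`, `get_provider`, `resolve_model_alias`, the error hierarchy) come from the PR description, but every signature and the alias table are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


# Error hierarchy mirroring the one listed in the PR.
class ProviderError(Exception):
    """Base class for provider failures."""

class AuthenticationError(ProviderError):
    """Raised on invalid or missing API credentials."""

class RateLimitError(ProviderError):
    """Raised when the provider returns a rate-limit response."""


class BaseAPIProvider(ABC):
    """Common interface every vendor-specific provider implements."""

    @abstractmethod
    def generate(self, system_prompt: str, user_content: list) -> str:
        """Send one multimodal request and return the model's text reply."""


class AnthropicProvider(BaseAPIProvider):
    def generate(self, system_prompt, user_content):
        # Real implementation would base64-encode PNGs and call the Messages API.
        raise NotImplementedError

class OpenAIProvider(BaseAPIProvider):
    def generate(self, system_prompt, user_content):
        # Real implementation would build data URLs and call Chat Completions.
        raise NotImplementedError

class GoogleProvider(BaseAPIProvider):
    def generate(self, system_prompt, user_content):
        # Real implementation would pass PIL Images to GenerateContent.
        raise NotImplementedError


# Hypothetical alias table; the PR keeps the real registry in BaselineConfig.
_MODEL_ALIASES = {
    "claude": "claude-opus-4.5",
    "gpt": "gpt-5.2",
    "gemini": "gemini-3-pro",
}

def resolve_model_alias(name: str) -> str:
    """Expand a short alias to a full model name, passing unknowns through."""
    return _MODEL_ALIASES.get(name, name)

def get_provider(model: str) -> BaseAPIProvider:
    """Route a model name to the provider that serves it."""
    model = resolve_model_alias(model)
    if model.startswith("claude"):
        return AnthropicProvider()
    if model.startswith("gpt"):
        return OpenAIProvider()
    if model.startswith("gemini"):
        return GoogleProvider()
    raise ProviderError(f"No provider registered for model {model!r}")
```

Routing on a model-name prefix keeps the factory trivially extensible: adding a fourth vendor is one subclass plus one branch, with callers depending only on the ABC.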
- Add ecosystem package table showing all OpenAdapt packages
- Add architecture diagram showing how openadapt-ml fits in
- Add key innovation section on demo-conditioned prompting
- Add meta-package installation instructions
- Add section on relationship to openadapt-evals
- Add related projects table
- Link to ecosystem roadmap

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This PR has merge conflicts that need to be resolved before it can be merged.

The unified baseline adapters implementation looks complete with:
- 3,071 additions across provider abstractions
- All 92 tests passing (as of the initial commit)
- CLI commands for model comparison

Next steps:
1. Resolve merge conflicts with the main branch
2. Verify all tests still pass
3. Consider whether this should be merged before or after PR #7, as they may have overlapping changes

Note: This PR appears to be a subset of the work in PR #7. Consider whether they should be consolidated or merged in sequence.
Summary

Introduces a provider abstraction layer in openadapt_ml/models/providers/ with Anthropic, OpenAI, and Google implementations.

Key Features

- Model alias resolution (claude-opus-4.5, gpt-5.2, gemini-3-pro)

Test plan

- Run the provider and baseline tests (uv run pytest tests/test_providers.py tests/test_baselines.py -v)

Generated with Claude Code