feat: implement unified baseline adapters for VLM comparison #2
Closed
Conversation
Add comprehensive unified baseline adapters supporting Claude, GPT, and Gemini models across multiple evaluation tracks.

Provider Abstraction (models/providers/):
- BaseAPIProvider ABC with a common interface for all providers
- AnthropicProvider: base64 PNG encoding, Messages API
- OpenAIProvider: data URL format, Chat Completions API
- GoogleProvider: native PIL Image support, GenerateContent API
- Factory functions: get_provider(), resolve_model_alias()
- Error hierarchy: ProviderError, AuthenticationError, RateLimitError

Baseline Module (baselines/):
- TrackType enum: TRACK_A (coords), TRACK_B (ReAct), TRACK_C (SoM)
- TrackConfig dataclass with factory methods for each track
- BaselineConfig with model alias resolution and registry
- PromptBuilder for track-specific system prompts and user content
- UnifiedResponseParser supporting JSON, function-call, and PyAutoGUI formats
- ElementRegistry for element_id-to-coordinate conversion

Benchmark Integration:
- UnifiedBaselineAgent wrapping UnifiedBaselineAdapter for benchmarks
- Converts BenchmarkObservation -> adapter format -> BenchmarkAction
- Supports all three tracks via the --track flag

CLI Commands (baselines/cli.py):
- run: single-model prediction with track selection
- compare: multi-model comparison on the same task
- list-models: show available models and providers

All 92 tests pass. Ready for model comparison experiments.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
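The provider abstraction described above could look roughly like the following. This is a minimal sketch, not the actual implementation: the class and function names (`BaseAPIProvider`, `get_provider`, `resolve_model_alias`, the error hierarchy) come from the PR description, but every signature and the alias table are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


# Error hierarchy mirroring the one listed in the PR.
class ProviderError(Exception):
    """Base class for provider failures."""

class AuthenticationError(ProviderError):
    """Raised on invalid or missing API credentials."""

class RateLimitError(ProviderError):
    """Raised when the provider returns a rate-limit response."""


class BaseAPIProvider(ABC):
    """Common interface every vendor-specific provider implements."""

    @abstractmethod
    def generate(self, system_prompt: str, user_content: list) -> str:
        """Send one multimodal request and return the model's text reply."""


class AnthropicProvider(BaseAPIProvider):
    def generate(self, system_prompt, user_content):
        # Real implementation would base64-encode PNGs and call the Messages API.
        raise NotImplementedError

class OpenAIProvider(BaseAPIProvider):
    def generate(self, system_prompt, user_content):
        # Real implementation would build data URLs and call Chat Completions.
        raise NotImplementedError

class GoogleProvider(BaseAPIProvider):
    def generate(self, system_prompt, user_content):
        # Real implementation would pass PIL Images to GenerateContent.
        raise NotImplementedError


# Hypothetical alias table; the PR keeps the real registry in BaselineConfig.
_MODEL_ALIASES = {
    "claude": "claude-opus-4.5",
    "gpt": "gpt-5.2",
    "gemini": "gemini-3-pro",
}

def resolve_model_alias(name: str) -> str:
    """Expand a short alias to a full model name, passing unknowns through."""
    return _MODEL_ALIASES.get(name, name)

def get_provider(model: str) -> BaseAPIProvider:
    """Route a model name to the provider that serves it."""
    model = resolve_model_alias(model)
    if model.startswith("claude"):
        return AnthropicProvider()
    if model.startswith("gpt"):
        return OpenAIProvider()
    if model.startswith("gemini"):
        return GoogleProvider()
    raise ProviderError(f"No provider registered for model {model!r}")
```

Routing on a model-name prefix keeps the factory trivially extensible: adding a fourth vendor is one subclass plus one branch, with callers depending only on the ABC.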
- Add ecosystem package table showing all OpenAdapt packages
- Add architecture diagram showing how openadapt-ml fits in
- Add key innovation section on demo-conditioned prompting
- Add meta-package installation instructions
- Add section on relationship to openadapt-evals
- Add related projects table
- Link to ecosystem roadmap

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This PR has merge conflicts that need to be resolved before it can be merged.

The unified baseline adapters implementation looks complete with:
- 3,071 additions across provider abstractions
- All 92 tests passing (as of the initial commit)
- CLI commands for model comparison

Next steps:
1. Resolve merge conflicts with the main branch
2. Verify all tests still pass
3. Consider whether this should be merged before or after PR #7, as they may have overlapping changes

Note: This PR appears to be a subset of the work in PR #7. Consider whether they should be consolidated or merged in sequence.
Summary

Introduces a provider abstraction layer in openadapt_ml/models/providers/ with Anthropic, OpenAI, and Google implementations.

Key Features

- Model alias resolution (claude-opus-4.5, gpt-5.2, gemini-3-pro)

Test plan

- Run the provider and baseline tests (uv run pytest tests/test_providers.py tests/test_baselines.py -v)

Generated with Claude Code