feat: implement unified baseline adapters for VLM comparison #2

Closed
abrichr wants to merge 2 commits into main from feature/unified-baseline-adapters

Conversation

abrichr (Member) commented Jan 17, 2026

Summary

  • Implement unified baseline adapters for comparing VLM providers (Claude, GPT, Gemini) across multiple evaluation tracks
  • Add provider abstraction layer in openadapt_ml/models/providers/ with Anthropic, OpenAI, and Google implementations
  • Create comprehensive response parsing system supporting JSON, PyAutoGUI, and function-call formats
  • Include CLI commands for model listing, single predictions, and multi-model comparison

Key Features

  • Track A: Direct coordinate prediction (CLICK(x, y))
  • Track B: ReAct-style reasoning with coordinates
  • Track C: Set-of-Mark element selection (CLICK([id]))
  • Model Registry: 9 models across 3 providers with aliases (e.g., claude-opus-4.5, gpt-5.2, gemini-3-pro)
  • Robust Parser: Handles multiple response formats with fallback strategies
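To make the track formats and the fallback idea concrete, here is a minimal sketch of a parser that tries JSON first, then the Track C `CLICK([id])` form, then the `CLICK(x, y)` and PyAutoGUI forms. The function name `parse_click`, the regexes, the JSON keys, and the returned dict shape are assumptions for illustration; they are not taken from the `UnifiedResponseParser` in this PR.

```python
import json
import re
from typing import Optional

# Hypothetical patterns: the PR only names the CLICK(x, y) and CLICK([id]) forms.
_NUM = r"(-?\d+(?:\.\d+)?)"
_COORD_CLICK = re.compile(r"CLICK\(\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\)")
_ELEMENT_CLICK = re.compile(r"CLICK\(\s*\[(\d+)\]\s*\)")
_PYAUTOGUI_CLICK = re.compile(
    r"pyautogui\.click\(\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\)"
)


def parse_click(response: str) -> Optional[dict]:
    """Try each known format in turn; return the first match, or None."""
    # 1. Whole response is a JSON object with x/y keys (assumed schema).
    try:
        data = json.loads(response)
        if isinstance(data, dict) and "x" in data and "y" in data:
            return {"type": "click", "x": float(data["x"]), "y": float(data["y"])}
    except (json.JSONDecodeError, TypeError, ValueError):
        pass  # not JSON, fall through to the text formats

    # 2. Track C: Set-of-Mark element id, e.g. CLICK([7]).
    match = _ELEMENT_CLICK.search(response)
    if match:
        return {"type": "click", "element_id": int(match.group(1))}

    # 3. Track A/B coordinates, then a PyAutoGUI-style call.
    for pattern in (_COORD_CLICK, _PYAUTOGUI_CLICK):
        match = pattern.search(response)
        if match:
            return {
                "type": "click",
                "x": float(match.group(1)),
                "y": float(match.group(2)),
            }
    return None
```

Because the text patterns use `search` rather than `match`, a Track B response with reasoning before the action line is handled by the same code path.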

Test plan

  • All 92 tests pass (uv run pytest tests/test_providers.py tests/test_baselines.py -v)
  • CLI list-models command works
  • Provider implementations handle image encoding correctly
  • Manual verification with real API calls (requires API keys)

Generated with Claude Code

abrichr and others added 2 commits January 16, 2026 23:44
Add comprehensive unified baseline adapters supporting Claude, GPT, and Gemini
models across multiple evaluation tracks:

Provider Abstraction (models/providers/):
- BaseAPIProvider ABC with common interface for all providers
- AnthropicProvider: Base64 PNG encoding, Messages API
- OpenAIProvider: Data URL format, Chat Completions API
- GoogleProvider: Native PIL Image support, GenerateContent API
- Factory functions: get_provider(), resolve_model_alias()
- Error hierarchy: ProviderError, AuthenticationError, RateLimitError
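A rough sketch of how the provider abstraction could look, focusing on the image encoding difference between the Anthropic and OpenAI providers. The class names and `get_provider` come from the list above; the `encode_image` method, the `get_provider` signature, and the `_PROVIDERS` table are assumptions for illustration. `GoogleProvider` is left out because it is described as passing PIL images natively.

```python
import base64
from abc import ABC, abstractmethod


class ProviderError(Exception):
    """Base error for provider failures (name taken from the PR)."""


class BaseAPIProvider(ABC):
    """Common interface; the method name here is an illustrative assumption."""

    @abstractmethod
    def encode_image(self, png_bytes: bytes) -> dict:
        """Return the provider-specific content block for one PNG image."""


class AnthropicProvider(BaseAPIProvider):
    def encode_image(self, png_bytes: bytes) -> dict:
        # Messages API: base64 payload with an explicit media type.
        data = base64.b64encode(png_bytes).decode("ascii")
        return {
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": data},
        }


class OpenAIProvider(BaseAPIProvider):
    def encode_image(self, png_bytes: bytes) -> dict:
        # Chat Completions API: the image travels as a data URL.
        data = base64.b64encode(png_bytes).decode("ascii")
        return {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{data}"},
        }


# Hypothetical lookup table; the real factory may resolve providers differently.
_PROVIDERS = {"anthropic": AnthropicProvider, "openai": OpenAIProvider}


def get_provider(name: str) -> BaseAPIProvider:
    """Instantiate a provider by name, raising ProviderError if unknown."""
    try:
        return _PROVIDERS[name]()
    except KeyError:
        raise ProviderError(f"unknown provider: {name!r}") from None
```

Keeping the encoding behind one method means the calling code can build a request without branching on the provider.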

Baseline Module (baselines/):
- TrackType enum: TRACK_A (coords), TRACK_B (ReAct), TRACK_C (SoM)
- TrackConfig dataclass with factory methods for each track
- BaselineConfig with model alias resolution and registry
- PromptBuilder for track-specific system prompts and user content
- UnifiedResponseParser supporting JSON, function-call, PyAutoGUI formats
- ElementRegistry for element_id to coordinate conversion
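For the element_id to coordinate conversion, a minimal sketch follows. It assumes each element carries a bounding box and that a click targets the box center; the `Element` dataclass, the `center` method, and the bbox layout are hypothetical and not taken from the PR's `ElementRegistry`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Element:
    element_id: int
    # (x1, y1, x2, y2); assumed to be normalized to the 0..1 range.
    bbox: Tuple[float, float, float, float]


class ElementRegistry:
    """Maps Set-of-Mark element ids to click coordinates (sketch only)."""

    def __init__(self, elements: Iterable[Element]) -> None:
        self._by_id: Dict[int, Element] = {e.element_id: e for e in elements}

    def center(self, element_id: int) -> Tuple[float, float]:
        """Return the center of the element's box; KeyError if the id is unknown."""
        x1, y1, x2, y2 = self._by_id[element_id].bbox
        return ((x1 + x2) / 2, (y1 + y2) / 2)
```

With this shape, a Track C answer such as `CLICK([7])` can be turned into the same coordinate action that Tracks A and B produce.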

Benchmark Integration:
- UnifiedBaselineAgent wrapping UnifiedBaselineAdapter for benchmarks
- Converts BenchmarkObservation -> adapter format -> BenchmarkAction
- Support for all three tracks via --track flag

CLI Commands (baselines/cli.py):
- run: Single model prediction with track selection
- compare: Multi-model comparison on same task
- list-models: Show available models and providers
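The three subcommands could be wired up roughly as below. This is a sketch using `argparse`; the PR does not say which CLI library `baselines/cli.py` uses, and the `--model` and `--models` options, the `A`/`B`/`C` values for `--track`, and the default track are assumptions.

```python
import argparse

TRACKS = ("A", "B", "C")  # assumed values for the --track flag


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with run, compare, and list-models subcommands."""
    parser = argparse.ArgumentParser(prog="baselines")
    sub = parser.add_subparsers(dest="command", required=True)

    # run: single model prediction with track selection
    run = sub.add_parser("run", help="single model prediction")
    run.add_argument("--model", required=True)  # hypothetical option
    run.add_argument("--track", choices=TRACKS, default="A")

    # compare: several models on the same task
    compare = sub.add_parser("compare", help="multi-model comparison")
    compare.add_argument("--models", nargs="+", required=True)  # hypothetical
    compare.add_argument("--track", choices=TRACKS, default="A")

    # list-models: takes no options in this sketch
    sub.add_parser("list-models", help="show available models and providers")
    return parser
```

For example, `build_parser().parse_args(["run", "--model", "claude-opus-4.5", "--track", "C"])` yields a namespace with `command="run"` and `track="C"`.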

All 92 tests pass. Ready for model comparison experiments.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add ecosystem package table showing all OpenAdapt packages
- Add architecture diagram showing how openadapt-ml fits in
- Add key innovation section on demo-conditioned prompting
- Add meta-package installation instructions
- Add section on relationship to openadapt-evals
- Add related projects table
- Link to ecosystem roadmap

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

abrichr (Member, Author) commented Jan 18, 2026

This PR has merge conflicts that need to be resolved before it can be merged.

The unified baseline adapters implementation looks complete with:
- 3,071 additions across provider abstractions
- All 92 tests passing (as of initial commit)
- CLI commands for model comparison

Next steps:
1. Resolve merge conflicts with main branch
2. Verify all tests still pass
3. Consider whether this should be merged before or after PR #7, as they may have overlapping changes

Note: This PR appears to be a subset of the work in PR #7. Consider whether they should be consolidated or merged in sequence.

abrichr (Member, Author) commented Jan 19, 2026

Closing as redundant. This work was already merged to main via PR #6 (commit aeed4bf) on January 17, 2026. The unified baseline adapters, provider abstraction, and related changes are now in the main branch.

@abrichr abrichr closed this Jan 19, 2026