
Add GitHub Actions CI workflow #6

Merged
abrichr merged 7 commits into main from feature/add-github-actions-ci
Jan 17, 2026

Conversation


abrichr (Member) commented Jan 17, 2026

Summary

Adds GitHub Actions CI workflow for automated testing on pull requests and main branch pushes.

Changes

  • Created .github/workflows/test.yml with the following features:
    • Tests on Python 3.10 and 3.11
    • Runs on both Ubuntu and macOS
    • Uses uv for dependency management (matching openadapt-viewer)
    • Runs ruff linter and formatter checks
    • Runs pytest suite
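
Based on the feature list above, the workflow file could look roughly like this. This is a hedged sketch, not the merged file: the job name, pinned action versions, and the exact `uv` invocations are assumptions.

```yaml
name: test

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest]
        python-version: ["3.10", "3.11"]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      # Install uv and the requested Python version in one step.
      - uses: astral-sh/setup-uv@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: uv sync --all-extras
      # Lint and formatting checks must pass before tests run.
      - run: uv run ruff check .
      - run: uv run ruff format --check .
      - run: uv run pytest
```

The 2×2 matrix yields four jobs per push, covering both OS platforms and both Python versions listed in the test plan.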

Pattern Consistency

This workflow follows the same pattern as:

  • openadapt-viewer/.github/workflows/test.yml
  • OpenAdapt ecosystem conventions

Test Plan

  • Push to this branch triggers workflow
  • Tests pass on both OS platforms
  • Tests pass on both Python versions
  • Ruff checks pass

Generated with Claude Code

abrichr and others added 7 commits January 16, 2026 23:44
Add comprehensive unified baseline adapters supporting Claude, GPT, and Gemini
models across multiple evaluation tracks:

Provider Abstraction (models/providers/):
- BaseAPIProvider ABC with common interface for all providers
- AnthropicProvider: Base64 PNG encoding, Messages API
- OpenAIProvider: Data URL format, Chat Completions API
- GoogleProvider: Native PIL Image support, GenerateContent API
- Factory functions: get_provider(), resolve_model_alias()
- Error hierarchy: ProviderError, AuthenticationError, RateLimitError
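
The provider abstraction described above could be sketched as follows. Class and function names follow the commit message, but every signature is an assumption, and the `AnthropicProvider.predict` body is stubbed rather than calling the real Messages API.

```python
from abc import ABC, abstractmethod


class ProviderError(Exception):
    """Base class for all provider failures."""


class AuthenticationError(ProviderError):
    """Missing or invalid API credentials."""


class RateLimitError(ProviderError):
    """Provider rejected the request due to rate limiting."""


class BaseAPIProvider(ABC):
    """Common interface shared by the Anthropic, OpenAI, and Google providers."""

    @abstractmethod
    def predict(self, prompt: str, image_png: bytes) -> str:
        """Send a multimodal request and return the raw model text."""


class AnthropicProvider(BaseAPIProvider):
    def predict(self, prompt: str, image_png: bytes) -> str:
        # A real implementation would base64-encode the PNG and call the
        # Messages API; stubbed here for illustration.
        return f"[anthropic reply to: {prompt}]"


# Hypothetical registry and alias table backing the factory functions.
_REGISTRY = {"anthropic": AnthropicProvider}
_ALIASES = {"claude": "anthropic"}


def resolve_model_alias(name: str) -> str:
    """Map a user-facing alias onto a canonical provider name."""
    return _ALIASES.get(name, name)


def get_provider(name: str) -> BaseAPIProvider:
    """Factory: instantiate the provider registered under `name`."""
    key = resolve_model_alias(name)
    try:
        return _REGISTRY[key]()
    except KeyError:
        raise ProviderError(f"unknown provider: {name}") from None
```

Keeping the error hierarchy rooted at `ProviderError` lets callers catch one exception type regardless of which backend raised it.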

Baseline Module (baselines/):
- TrackType enum: TRACK_A (coords), TRACK_B (ReAct), TRACK_C (SoM)
- TrackConfig dataclass with factory methods for each track
- BaselineConfig with model alias resolution and registry
- PromptBuilder for track-specific system prompts and user content
- UnifiedResponseParser supporting JSON, function-call, PyAutoGUI formats
- ElementRegistry for element_id to coordinate conversion
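
Two of the pieces above, the track enum and the element registry, can be sketched briefly. The names match the commit message; the enum values, field types, and method signatures are assumptions for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class TrackType(Enum):
    TRACK_A = "coords"  # model outputs raw screen coordinates
    TRACK_B = "react"   # ReAct-style reasoning plus action
    TRACK_C = "som"     # Set-of-Marks: model picks a labeled element


@dataclass
class ElementRegistry:
    """Maps element_id labels (Track C) back to screen coordinates."""

    elements: dict[str, tuple[int, int]] = field(default_factory=dict)

    def register(self, element_id: str, center: tuple[int, int]) -> None:
        self.elements[element_id] = center

    def resolve(self, element_id: str) -> tuple[int, int]:
        try:
            return self.elements[element_id]
        except KeyError:
            raise KeyError(f"unknown element_id: {element_id}") from None
```

In a Set-of-Marks setup, the registry is populated while annotating the screenshot, then used to turn the model's chosen label back into a clickable point.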

Benchmark Integration:
- UnifiedBaselineAgent wrapping UnifiedBaselineAdapter for benchmarks
- Converts BenchmarkObservation -> adapter format -> BenchmarkAction
- Support for all three tracks via --track flag

CLI Commands (baselines/cli.py):
- run: Single model prediction with track selection
- compare: Multi-model comparison on same task
- list-models: Show available models and providers

All 92 tests pass. Ready for model comparison experiments.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…atibility

All dependencies (torch, transformers, pillow, peft, etc.) support Python 3.10+.
The 3.12 requirement was unnecessarily restrictive and broke `pip install openadapt[all]`
on Python 3.10 and 3.11.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add CI workflow that runs on pull requests and main branch pushes:
- Tests on Python 3.10 and 3.11
- Runs on Ubuntu and macOS
- Uses uv for dependency management
- Runs ruff linter and formatter
- Runs pytest suite

Matches pattern used by openadapt-viewer and follows OpenAdapt ecosystem conventions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- cluster_id: default=0
- cluster_centroid_distance: default=0.0
- internal_similarity: default=1.0
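
Giving those three fields defaults presumably makes them optional on an existing dataclass, which is what unblocks the failing tests. A minimal sketch; the class name `Segment` and the `label` field are hypothetical, only the three defaulted fields come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Segment:  # hypothetical name; the real class lives in the segmentation module
    label: str
    cluster_id: int = 0
    cluster_centroid_distance: float = 0.0
    internal_similarity: float = 1.0
```

With defaults in place, existing call sites that never set clustering metadata keep constructing instances without change.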

Fixes 1/14 test failures in test_segmentation.py
- Fix unused imports in baselines, benchmarks, and ingest modules
- Fix ambiguous variable names (renamed 'l' to 'loss'/'line')
- Add missing time import in benchmarks/cli.py
- Move warnings import to top of file in benchmarks/cli.py
- Add noqa comments for intentional code patterns
- Fix bare except clause in lambda_labs.py
- Add Episode to TYPE_CHECKING imports in grounding.py
- Rename conflicting local variable in config.py
- Fix undefined _build_nav_links in viewer.py
- Run ruff format to ensure consistent code style

All ruff checks now pass successfully.
- Change 'goal' to 'instruction' in column assertions
- Change 'image_path' to 'screenshot_path' to match schema
- Update badge URL to use filename-based path (from PR #3)
- Add qualifiers to claims about accuracy and performance (from PR #4)
- Clarify that results are from synthetic benchmarks, not production UIs
- Add disclaimers about extrapolating synthetic results to real-world performance
- Update section titles to indicate synthetic nature of benchmarks

This consolidates the documentation improvements from PRs #3 and #4.