
Add GitHub Actions CI workflow #6

Merged
abrichr merged 7 commits into main from feature/add-github-actions-ci
Jan 17, 2026

Conversation


abrichr (Member) commented Jan 17, 2026

Summary

Adds GitHub Actions CI workflow for automated testing on pull requests and main branch pushes.

Changes

  • Created .github/workflows/test.yml with the following features:
    • Tests on Python 3.10 and 3.11
    • Runs on both Ubuntu and macOS
    • Uses uv for dependency management (matching openadapt-viewer)
    • Runs ruff linter and formatter checks
    • Runs pytest suite
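
Based on the feature list above, the workflow file could look roughly like this. This is a hedged sketch, not the merged file: the job name, pinned action versions, and the exact `uv` invocations are assumptions.

```yaml
name: test

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest]
        python-version: ["3.10", "3.11"]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      # Install uv and the requested Python version in one step.
      - uses: astral-sh/setup-uv@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: uv sync --all-extras
      # Lint and formatting checks must pass before tests run.
      - run: uv run ruff check .
      - run: uv run ruff format --check .
      - run: uv run pytest
```

The 2×2 matrix yields four jobs per push, covering both OS platforms and both Python versions listed in the test plan.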

Pattern Consistency

This workflow follows the same pattern as:

  • openadapt-viewer/.github/workflows/test.yml
  • OpenAdapt ecosystem conventions

Test Plan

  • Push to this branch triggers workflow
  • Tests pass on both OS platforms
  • Tests pass on both Python versions
  • Ruff checks pass

Generated with Claude Code

abrichr and others added 7 commits January 16, 2026 23:44
Add comprehensive unified baseline adapters supporting Claude, GPT, and Gemini
models across multiple evaluation tracks:

Provider Abstraction (models/providers/):
- BaseAPIProvider ABC with common interface for all providers
- AnthropicProvider: Base64 PNG encoding, Messages API
- OpenAIProvider: Data URL format, Chat Completions API
- GoogleProvider: Native PIL Image support, GenerateContent API
- Factory functions: get_provider(), resolve_model_alias()
- Error hierarchy: ProviderError, AuthenticationError, RateLimitError
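
The provider abstraction described above could be sketched as follows. Class and function names follow the commit message, but every signature is an assumption, and the `AnthropicProvider.predict` body is stubbed rather than calling the real Messages API.

```python
from abc import ABC, abstractmethod


class ProviderError(Exception):
    """Base class for all provider failures."""


class AuthenticationError(ProviderError):
    """Missing or invalid API credentials."""


class RateLimitError(ProviderError):
    """Provider rejected the request due to rate limiting."""


class BaseAPIProvider(ABC):
    """Common interface shared by the Anthropic, OpenAI, and Google providers."""

    @abstractmethod
    def predict(self, prompt: str, image_png: bytes) -> str:
        """Send a multimodal request and return the raw model text."""


class AnthropicProvider(BaseAPIProvider):
    def predict(self, prompt: str, image_png: bytes) -> str:
        # A real implementation would base64-encode the PNG and call the
        # Messages API; stubbed here for illustration.
        return f"[anthropic reply to: {prompt}]"


# Hypothetical registry and alias table backing the factory functions.
_REGISTRY = {"anthropic": AnthropicProvider}
_ALIASES = {"claude": "anthropic"}


def resolve_model_alias(name: str) -> str:
    """Map a user-facing alias onto a canonical provider name."""
    return _ALIASES.get(name, name)


def get_provider(name: str) -> BaseAPIProvider:
    """Factory: instantiate the provider registered under `name`."""
    key = resolve_model_alias(name)
    try:
        return _REGISTRY[key]()
    except KeyError:
        raise ProviderError(f"unknown provider: {name}") from None
```

Keeping the error hierarchy rooted at `ProviderError` lets callers catch one exception type regardless of which backend raised it.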

Baseline Module (baselines/):
- TrackType enum: TRACK_A (coords), TRACK_B (ReAct), TRACK_C (SoM)
- TrackConfig dataclass with factory methods for each track
- BaselineConfig with model alias resolution and registry
- PromptBuilder for track-specific system prompts and user content
- UnifiedResponseParser supporting JSON, function-call, PyAutoGUI formats
- ElementRegistry for element_id to coordinate conversion
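
Two of the pieces above, the track enum and the element registry, can be sketched briefly. The names match the commit message; the enum values, field types, and method signatures are assumptions for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class TrackType(Enum):
    TRACK_A = "coords"  # model outputs raw screen coordinates
    TRACK_B = "react"   # ReAct-style reasoning plus action
    TRACK_C = "som"     # Set-of-Marks: model picks a labeled element


@dataclass
class ElementRegistry:
    """Maps element_id labels (Track C) back to screen coordinates."""

    elements: dict[str, tuple[int, int]] = field(default_factory=dict)

    def register(self, element_id: str, center: tuple[int, int]) -> None:
        self.elements[element_id] = center

    def resolve(self, element_id: str) -> tuple[int, int]:
        try:
            return self.elements[element_id]
        except KeyError:
            raise KeyError(f"unknown element_id: {element_id}") from None
```

In a Set-of-Marks setup, the registry is populated while annotating the screenshot, then used to turn the model's chosen label back into a clickable point.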

Benchmark Integration:
- UnifiedBaselineAgent wrapping UnifiedBaselineAdapter for benchmarks
- Converts BenchmarkObservation -> adapter format -> BenchmarkAction
- Support for all three tracks via --track flag

CLI Commands (baselines/cli.py):
- run: Single model prediction with track selection
- compare: Multi-model comparison on same task
- list-models: Show available models and providers

All 92 tests pass. Ready for model comparison experiments.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…atibility

All dependencies (torch, transformers, pillow, peft, etc.) support Python 3.10+.
The 3.12 requirement was unnecessarily restrictive and broke `pip install openadapt[all]`
on Python 3.10 and 3.11.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add CI workflow that runs on pull requests and main branch pushes:
- Tests on Python 3.10 and 3.11
- Runs on Ubuntu and macOS
- Uses uv for dependency management
- Runs ruff linter and formatter
- Runs pytest suite

Matches pattern used by openadapt-viewer and follows OpenAdapt ecosystem conventions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- cluster_id: default=0
- cluster_centroid_distance: default=0.0
- internal_similarity: default=1.0
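
Giving those three fields defaults presumably makes them optional on an existing dataclass, which is what unblocks the failing tests. A minimal sketch; the class name `Segment` and the `label` field are hypothetical, only the three defaulted fields come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Segment:  # hypothetical name; the real class lives in the segmentation module
    label: str
    cluster_id: int = 0
    cluster_centroid_distance: float = 0.0
    internal_similarity: float = 1.0
```

With defaults in place, existing call sites that never set clustering metadata keep constructing instances without change.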

Fixes 1/14 test failures in test_segmentation.py
- Fix unused imports in baselines, benchmarks, and ingest modules
- Fix ambiguous variable names (renamed 'l' to 'loss'/'line')
- Add missing time import in benchmarks/cli.py
- Move warnings import to top of file in benchmarks/cli.py
- Add noqa comments for intentional code patterns
- Fix bare except clause in lambda_labs.py
- Add Episode to TYPE_CHECKING imports in grounding.py
- Rename conflicting local variable in config.py
- Fix undefined _build_nav_links in viewer.py
- Run ruff format to ensure consistent code style

All ruff checks now pass successfully.
- Change 'goal' to 'instruction' in column assertions
- Change 'image_path' to 'screenshot_path' to match schema
- Update badge URL to use filename-based path (from PR #3)
- Add qualifiers to claims about accuracy and performance (from PR #4)
- Clarify that results are from synthetic benchmarks, not production UIs
- Add disclaimers about extrapolating synthetic results to real-world performance
- Update section titles to indicate synthetic nature of benchmarks

This consolidates the documentation improvements from PRs #3 and #4.