Literature eval enhancements #28

valerie-autumn-skye · 2025-09-04T18:05:30Z

Descriptions of earlier commits for this branch are documented in PR #16.

Commit 22:

Fixes #27 (Support grouping for cases in metacoder config).
Test cases updated to include new group key in config.
Eval results YAML will contain case_group.
The default group is Default.

Commit 23:

Updated test_runner.py to include Default case_group in EvalResults to address validation errors in test suite.

Commit 24:

Updated Anthropic fallback model from claude-3-5-sonnet-20240620 to claude-sonnet-4-20250514.

Commit 25:

Fixed test cases broken by previous commit.
Test is now more generic.

Commit 26:

Removed unnecessary duplicate path element in work directory. Readability improvement to support the fix for Issue #29 (runner.py fails to use temporary Goose config on Windows). Added as an individual commit in case it needs to be rolled back.

Commit 27:

Fixes #30 (goose keyring errors on macOS and Linux). Disabled use of the system keyring in Goose.

Commit 28:

Partially addresses Issue #29 (runner.py fails to use temporary Goose config on Windows) for Windows compatibility. Uses os.getcwd() instead of the Unix-specific "." to specify the current working directory.

Commit 29:

Uses the safer XDG_CONFIG_HOME instead of changing the HOME environment variable, to avoid interfering with the Unix environment (shell history, etc.). Separate commit in case this needs to be rolled back.
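
For reviewers, a minimal sketch of the idea behind Commit 29, assuming Goose is launched as a subprocess with a copied environment. The helper name `build_isolated_env` and the directory layout are hypothetical and are not taken from this PR.

```python
import os
import tempfile
from pathlib import Path


def build_isolated_env(config_root: Path) -> dict[str, str]:
    """Return a copy of the current environment with XDG config lookups redirected.

    HOME is left untouched, so shell history and other per-user state
    keep resolving to the real home directory.
    """
    env = os.environ.copy()
    env["XDG_CONFIG_HOME"] = str(config_root)
    return env


with tempfile.TemporaryDirectory() as tmp:
    env = build_isolated_env(Path(tmp) / "config")
    # The override lives only in the copied mapping, not in os.environ.
    print(env["XDG_CONFIG_HOME"])
```

The copied mapping would then be passed as `env=` to `subprocess.run`, which keeps the parent process environment unchanged.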

Commit 30:

Changed informational log message to make it clear that a directory path is not being referenced, but rather a server combination.

Commit 31:

The Goose executable is now detected in a cross-platform way.
The full path information is propagated into the logs for easier debugging of the environment.
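
One common way to do this kind of detection is `shutil.which` from the standard library. The sketch below is an assumption about the approach, not a copy of the code in this PR, and `find_executable` is a hypothetical helper name.

```python
import shutil
from typing import Optional


def find_executable(name: str) -> Optional[str]:
    """Return the path of an executable found on PATH, or None if it is missing.

    shutil.which honours PATHEXT on Windows, so "goose" can resolve to
    "goose.exe" without platform-specific branching.
    """
    return shutil.which(name)


# Log the resolved path so the environment is easy to debug.
goose_path = find_executable("goose")
print(f"goose executable: {goose_path or 'not found on PATH'}")
```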

Commit 32:

Moved hard-coded values into variables in preparation for cross-platform support.
Adjusted log level.
Cleaned up comment for readability.

Commit 33:

Fixes #29 (runner.py fails to use temporary Goose config on Windows).
Added OS-specific Goose config folder structures.
Replaced hard-coded paths with function call to generate correct path.
Added OS-specific home directory environment variables.
Added logging of Goose config path for confirmation with base coder.

Commit 34:

Fixes path inconsistencies on macOS.
Added unit tests.

…verwrite. Enforced UTF-8 encoding, switched to safe_dump and added document delimiter between records. Also simplified document generation. Fixes issue #24. Added second test case to literature_mcp_encoding_test.yaml for testing.

…r than overwrite. Enforced UTF-8 encoding, switched to safe_dump and added document delimiter between records. Also simplified document generation. Fixes issue #24. Added second test case to literature_mcp_encoding_test.yaml for testing." This reverts commit c436e7f.

cmungall

Looks good - let's separate into separate PRs in future!

…g the system keyring for secrets (GOOSE_DISABLE_KEYRING).

…r configs consistently.

claude · 2025-09-18T21:11:11Z

Claude finished @ct-parker's task

PR Review: Literature eval enhancements ✅

I've completed a comprehensive review of this PR. Here's my detailed analysis:

Todo List:

Read and analyze key changed files
Review test case grouping implementation (addresses Support grouping for cases in metacoder config #27)
Evaluate cross-platform compatibility changes (addresses runner.py fails to use temporary Goose config on Windows #29)
Check evaluation framework fallback mechanisms
Assess code quality and best practices
Identify potential bugs and security issues
Review test coverage
Provide comprehensive feedback and recommendations

Overall Assessment ✅

This PR implements excellent enhancements to the evaluation framework with robust fallback mechanisms and comprehensive cross-platform support. The key improvements include:

Test case grouping (addresses Support grouping for cases in metacoder config #27) - Clean implementation with sensible defaults
Robust evaluation fallback - OpenAI → Claude → DummyMetric degradation chain
Cross-platform compatibility (addresses runner.py fails to use temporary Goose config on Windows #29) - Excellent Windows/Unix/macOS support for Goose
UTF-8 encoding fixes for better Windows compatibility
Comprehensive test coverage for new features

Code Quality & Best Practices ✅

Strengths:

Clean Architecture: ClaudeJudge properly implements DeepEvalBaseLLM interface (judges.py:13-94)
Type Safety: Excellent use of Pydantic models and type hints throughout
Error Handling: Robust fallback mechanisms with comprehensive logging
Configuration: Clean separation with new group field (eval_model.py:24-26)
Cross-platform Support: Excellent OS detection and path handling (goose.py:31-96)

Code Highlights:

1. Test Case Grouping (eval_model.py:24-26) ✅

group: Optional[str] = Field(
    default="Default", description="Test category for result grouping."
)

Sensible default value
Well-documented field
Properly integrated into results (runner.py:390)
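
To illustrate how a consumer of the results might use the new field, here is a minimal sketch that buckets results by `case_group`. `ResultRow` is a hypothetical stand-in written with a standard-library dataclass rather than the project's Pydantic models, and the case names are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ResultRow:
    """Hypothetical stand-in for one eval result record."""

    case_name: str
    passed: bool
    case_group: str = "Default"  # mirrors the default described above


def pass_rate_by_group(rows: list[ResultRow]) -> dict[str, float]:
    """Return the fraction of passing cases for each case_group."""
    buckets: dict[str, list[bool]] = defaultdict(list)
    for row in rows:
        buckets[row.case_group].append(row.passed)
    return {group: sum(flags) / len(flags) for group, flags in buckets.items()}


rows = [
    ResultRow("doi_lookup", True, case_group="Metadata"),
    ResultRow("pmid_lookup", False, case_group="Metadata"),
    ResultRow("ungrouped_case", True),  # no group given, so it lands in "Default"
]
print(pass_rate_by_group(rows))
```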

2. Cross-Platform Config Detection (goose.py:31-96) ✅

Excellent OS-specific logic for Windows vs Unix-like systems
Proper XDG Base Directory spec compliance
Clean separation of concerns with dedicated helper functions

3. Robust Fallback Chain (runner.py:302-341) ✅

OpenAI quota check with a fast timeout, with the result cached via @functools.lru_cache(maxsize=1)
Graceful degradation to Claude when OpenAI unavailable
Final fallback to DummyMetric if all APIs fail
Comprehensive logging at each step
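
The degradation chain can be sketched roughly as follows. This is a simplified illustration under assumptions, not the code in runner.py: `choose_judge` and both probe functions are hypothetical, and the probes simulate outcomes instead of calling any API.

```python
import functools
import logging
from typing import Callable, Sequence

logger = logging.getLogger(__name__)


def choose_judge(
    candidates: Sequence[tuple[str, Callable[[], bool]]], last_resort: str
) -> str:
    """Return the first candidate whose availability probe succeeds.

    A probe returns True when the provider is usable. Any exception it
    raises is treated as "unavailable" and triggers the next fallback.
    """
    for name, probe in candidates:
        try:
            if probe():
                logger.info("Using %s as judge", name)
                return name
        except Exception as exc:  # broad on purpose: any failure means downgrade
            logger.warning("%s unavailable (%s); trying next option", name, exc)
            continue
        logger.warning("%s reported no quota; trying next option", name)
    logger.warning("All providers unavailable; falling back to %s", last_resort)
    return last_resort


@functools.lru_cache(maxsize=1)
def openai_has_quota() -> bool:
    """Hypothetical probe that simulates a timed-out quota check."""
    raise TimeoutError("simulated quota check timeout")


@functools.lru_cache(maxsize=1)
def claude_has_quota() -> bool:
    """Hypothetical probe that reports Claude as available."""
    return True


print(choose_judge([("openai", openai_has_quota), ("claude", claude_has_quota)], "DummyMetric"))
```

Note that `functools.lru_cache` does not cache raised exceptions, so a probe that fails by raising is retried on its next call, while a probe that returns a value is only evaluated once.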

Potential Issues & Recommendations ⚠️

Minor Issues:

1. Hard-coded Model Name (runner.py:312)

claude_model = "claude-sonnet-4-20250514"  # Consider making configurable

Recommendation: Extract to configuration constant

2. Error Message Parsing (judges.py:86-91)

if "credit balance is too low" in msg or "400" in msg:

Recommendation: Make error detection more robust to handle message variations
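
One possible direction, sketched under assumptions: prefer a structured status code when the exception carries one, and fall back to pattern matching on the message only afterwards. The `status_code` attribute, the status values, and the second pattern are illustrative guesses, and `FakeApiError` exists only for the demonstration. A bare `"400" in msg` check also matches unrelated text such as "4000", which this version avoids.

```python
import re
from typing import Optional

# Illustrative patterns; real provider messages may differ.
_QUOTA_PATTERNS = (
    re.compile(r"credit balance is too low", re.IGNORECASE),
    re.compile(r"insufficient[_ ]quota", re.IGNORECASE),
)


def looks_like_quota_error(exc: Exception) -> bool:
    """Classify an exception as a quota or billing failure."""
    status: Optional[int] = getattr(exc, "status_code", None)
    if status in (402, 429):
        return True
    message = str(exc)
    return any(pattern.search(message) for pattern in _QUOTA_PATTERNS)


class FakeApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, message: str, status_code: Optional[int] = None) -> None:
        super().__init__(message)
        self.status_code = status_code


print(looks_like_quota_error(FakeApiError("Your credit balance is too low", 400)))
print(looks_like_quota_error(FakeApiError("max_tokens must be below 4000", 400)))
```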

3. Metric Instance Modification (runner.py:302-341)
The fallback logic modifies metric instances in-place, which could cause issues if metrics are reused.
Recommendation: Create new metric instances for fallbacks
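
A minimal sketch of that recommendation, assuming the metric object can be shallow-copied safely. `Metric` here is a hypothetical stand-in, not the DeepEval class, and the model names are placeholders.

```python
import copy
from dataclasses import dataclass


@dataclass
class Metric:
    """Hypothetical stand-in for a metric object that holds a judge model."""

    name: str
    model: str


def with_fallback_model(metric: Metric, fallback_model: str) -> Metric:
    """Return a shallow copy configured with the fallback model.

    The original instance is not modified, so callers that reuse it
    keep the configuration they started with.
    """
    clone = copy.copy(metric)
    clone.model = fallback_model
    return clone


original = Metric(name="correctness", model="primary-judge")
downgraded = with_fallback_model(original, "fallback-judge")
print(original.model, downgraded.model)
```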

4. Test Code Cleanup (test_claude_judge.py:90-104)
Large commented blocks should be cleaned up or converted to proper test cases.

Edge Cases to Consider:

1. Timeout Configuration

8-second timeout in quota checks could be made configurable for different environments
Consider exponential backoff for high-volume scenarios
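
Both suggestions could look roughly like the sketch below. The environment variable name `QUOTA_CHECK_TIMEOUT` and the helper `retry_with_backoff` are hypothetical, and the default of 8 seconds simply mirrors the value quoted above.

```python
import os
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical environment variable for overriding the quota check timeout.
QUOTA_TIMEOUT_SECONDS = float(os.environ.get("QUOTA_CHECK_TIMEOUT", "8"))


def retry_with_backoff(
    call: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` until it succeeds, doubling the wait after each failure.

    The last failure is re-raised. A random jitter of up to half the
    delay is added so that parallel runs do not retry in lockstep.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2**attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Passing `sleep` as a parameter keeps the helper easy to unit test, since a test can record the delays instead of waiting.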

2. Path Handling

Excellent cross-platform support, but consider edge cases with very long paths on Windows

Security Assessment ✅

No security vulnerabilities identified. The code follows excellent security practices:

✅ Environment-based API key management
✅ Proper exception handling without credential exposure
✅ Appropriate timeout configurations
✅ Input validation through Pydantic models
✅ UTF-8 encoding specified explicitly (runner.py:529)

Performance Considerations ✅

Optimizations Present:

Smart Caching: @functools.lru_cache(maxsize=1) on quota checks prevents repeated API calls
Timeout Management: 8-second timeout prevents hanging during quota checks
Lazy Instantiation: Metrics created only when needed

Suggestions:

Quota checks could use HEAD requests if APIs support them for lower cost
Consider implementing exponential backoff for quota checks in high-volume scenarios

Test Coverage Assessment 🎯

Excellent Coverage:

test_claude_judge.py: Comprehensive fallback testing with proper environment manipulation
test_goose_paths.py: Thorough cross-platform path testing with parametrized test cases
Integration tests: Good coverage of the OpenAI → Claude → DummyMetric fallback chain

Test Quality:

✅ Good use of monkeypatch for environment variable manipulation
✅ Proper verification of logging output
✅ Parametrized tests for cross-platform scenarios
⚠️ Some tests acknowledge running full coder pipeline when they should only test eval step (test_claude_judge.py:12)

Cross-Platform Compatibility ✅

The Goose integration shows excellent cross-platform awareness:

Windows:

Uses %APPDATA%\Block\goose\config\ path structure
Proper environment variable handling

Unix-like (Linux/macOS):

XDG Base Directory spec compliance
Fallback to $HOME/.config/goose/ when XDG not set
Proper environment variable precedence

Path Resolution:

Clean helper functions with comprehensive test coverage
Proper handling of relative vs absolute paths
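
The resolution rules listed above can be sketched as a pure function. This is an illustration built only from the paths named in this review, not the helper in goose.py: the function name and signature are hypothetical, and taking the platform and environment as arguments is a choice made here for testability.

```python
from pathlib import PurePosixPath, PureWindowsPath
from typing import Mapping


def goose_config_dir(platform: str, env: Mapping[str, str]) -> str:
    """Return the Goose config directory for the given platform and environment.

    Windows:   %APPDATA%\\Block\\goose\\config
    Unix-like: $XDG_CONFIG_HOME/goose, else $HOME/.config/goose
    """
    if platform.startswith("win"):
        return str(PureWindowsPath(env["APPDATA"]) / "Block" / "goose" / "config")
    xdg = env.get("XDG_CONFIG_HOME")
    base = PurePosixPath(xdg) if xdg else PurePosixPath(env["HOME"]) / ".config"
    return str(base / "goose")


print(goose_config_dir("win32", {"APPDATA": r"C:\Users\me\AppData\Roaming"}))
print(goose_config_dir("linux", {"HOME": "/home/me"}))
print(goose_config_dir("darwin", {"HOME": "/Users/me", "XDG_CONFIG_HOME": "/tmp/cfg"}))
```

In real use the arguments would be `sys.platform` and `os.environ`. Using the pure path classes lets the Windows branch be exercised on any operating system.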

Specific Technical Issues

`runner.py:365` - Technical Debt

# TODO: find a more principled way to extract the metric result

This TODO indicates known technical debt in metric result extraction. The current approach tries multiple attribute access patterns, which works but is fragile.
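
One way to contain the fragility is to keep the attribute probing in a single helper with an explicit, ordered list of candidates. The attribute names below are assumptions chosen for illustration, not the ones runner.py actually checks.

```python
from types import SimpleNamespace
from typing import Any, Optional

# Hypothetical attribute names, checked in priority order.
_SCORE_ATTRIBUTES = ("score", "value", "result")


def extract_score(result: Any) -> Optional[float]:
    """Return the first numeric score found on `result`, or None.

    Booleans are skipped even though bool is a subclass of int, so a
    success flag is never mistaken for a score.
    """
    for attr in _SCORE_ATTRIBUTES:
        value = getattr(result, attr, None)
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return float(value)
    return None


print(extract_score(SimpleNamespace(score=0.8)))
print(extract_score(SimpleNamespace(value=1)))
print(extract_score(SimpleNamespace(success=True)))
```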

`judges.py:19-25` - Documentation

The extensive comment about model listing could be condensed or moved to documentation for better code readability.

Summary & Recommendations

This is a well-engineered PR that demonstrates:

✅ Robust error handling and fallback mechanisms
✅ Excellent cross-platform compatibility
✅ Comprehensive test coverage
✅ Clean architectural patterns
✅ Good security practices

Recommendations for Future Improvements:

Configuration Enhancement: Extract hard-coded model names and timeouts to configuration
Error Handling: Make error message parsing more flexible and robust
Metric Architecture: Consider immutable metric instances to prevent side effects
Test Cleanup: Remove commented code blocks and improve test isolation
Documentation: Add detailed docstrings for quota checking and fallback logic

Final Assessment: ✅ APPROVE

Ready to merge after optional cleanup of commented test code. This PR significantly improves the evaluation framework's robustness and cross-platform compatibility while maintaining excellent code quality standards.

The fallback mechanisms are particularly well-designed and will provide excellent resilience in production environments where API quotas or availability may vary.

Charles Parker added 22 commits August 29, 2025 14:58

Enforce UTF-8 for Goose session files.

cd95918

Fixes issue #15. Prevents divide by zero errors and cleans up summaries by using consistent printing methods.

a791ce5

Cleaned up output by using consistent printing methods.

49891a3

Fixes Issue #18 by implementing metric downgrades to Claude if OpenAPI calls fail, and to DummyMetric if Claude fails.

46ad344

Satisfied ruff's bizarre rules.

fc7ba41

Added extra logging and test for goose UTF-8 handling.

54dd3d3

Added metacoder configuration test cases for claude downgrade and no server combinations to support Issues #18, #19, and #20.

72f586c

Added unit test for claude downgrade to support Issue #18. Cleaned up logging in runner.py. Added test configuration to support log capture for assertions that downgrade was successful.

d7beb19

Added unit test for claude downgrade to support Issue #18. Cleaned up logging in runner.py. Added test configuration to support log capture for assertions that downgrade was successful. Addressed ruff warnings.

d88ca90

Added assertion to confirm that ClaudeJudge completed scoring the metric after the downgrade.

e7bba40

Added assertion to force test to fail on Exception. Increased logging verbosity temporarily to debug Claude judge unit test on build server. Adjusted logic to work when multiple coders are specified. Improved log messages.

d27277b

Fixed runtime issues related to metric downgrade from CorrectnessMetric to DummyMetric.

3f22fc6

Added test coverage of new evaluation judge functionality. Added test for the quota exhaustion fallback logic.

d6e1e44

Reduced logging verbosity. Added Anthropic quota check. Added automatic downgrade to DummyMetric on quota check failure. Added notes on potential improvements to unit tests.

882a3d9

Fixed issue #23. Forced processes to be launched with UTF-8 encoding to avoid default encoding errors.

c98c9d7

Addressed ruff formatting issue.

4761d19

Added output file check to fail if the output file already exists. Otherwise, create an empty file as UTF-8. Partially addresses Issue #24.

6b64a79

Updated ClaudeJudge model to claude-sonnet-4-20250514.

b0b1c8b

Added UTF-8 encoding to prevent character mangling during YAML export on Windows (where the default codepage is cp1252).

7e143da

Added support for grouping test case eval results with 'group' key in config. Fixes Issue #27.

37cbb2f