-
Notifications
You must be signed in to change notification settings - Fork 1
Literature eval enhancements #28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…es by using consistent printing methods.
…I calls fail, and to DummyMetric if Claude fails.
… logging in runner.py. Added test configuration to support log capture for assertions that downgrade was successful.
… logging in runner.py. Added test configuration to support log capture for assertions that downgrade was successful. Addressed ruff warnings.
…ric after the downgrade.
… verbosity temporarily to debug Claude judge unit test on build server. Adjusted logic to work when multiple coders are specified. Improved log messages.
…ic to DummyMetric.
… for the quota exhaustion fallback logic.
…ic downgrade to DummyMetric on quota check failure. Added notes on potential improvements to unit tests.
…to avoid default encoding errors.
…herwise, create an empty file as UTF-8. Partially addresses Issue #24.
…verwrite. Enforced UTF-8 encoding, switched to safe_dump and added document delimiter between records. Also simplified document generation. Fixes issue #24. Added second test case to literature_mcp_encoding_test.yaml for testing.
… on Windows (where the default codepage is cp1252).
… config. Fixes Issue #27.
This comment was marked as outdated.
This comment was marked as outdated.
…o address validation errors in test suite.
This comment was marked as outdated.
This comment was marked as outdated.
…aude-sonnet-4-20250514.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
cmungall
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - let's separate into separate PRs in future!
…lity improvement to support fix for Issue #29. Adding as individual commit in case it needs to be rolled back.
This comment was marked as outdated.
This comment was marked as outdated.
…g the system keyring for secrets (GOOSE_DISABLE_KEYRING).
This comment was marked as outdated.
This comment was marked as outdated.
…stead of unix-specific "." to specify current working directory.
This comment was marked as outdated.
This comment was marked as outdated.
…ble to avoid interfering with unix environment (shell history, etc.). Separate commit in case this needs to be rolled back.
This comment was marked as outdated.
This comment was marked as outdated.
…ath is not being referenced, but rather a server combination.
This comment was marked as outdated.
This comment was marked as outdated.
… full path information is propagated into the logs for easier debugging of the environment.
This comment was marked as resolved.
This comment was marked as resolved.
…orm support. Adjusted log level and cleaned up comment for readability.
This comment was marked as outdated.
This comment was marked as outdated.
… paths with function call to generate correct path. Added OS-specific home directory environment variables. Added logging of Goose config path for confirmation with base coder.
This comment was marked as outdated.
This comment was marked as outdated.
…r configs consistently.
|
Claude finished @ct-parker's task —— View job PR Review: Literature eval enhancements ✅I've completed a comprehensive review of this PR. Here's my detailed analysis: Todo List:
Overall Assessment ✅This PR implements excellent enhancements to the evaluation framework with robust fallback mechanisms and comprehensive cross-platform support. The key improvements include:
Code Quality & Best Practices ✅Strengths:
Code Highlights:1. Test Case Grouping ( group: Optional[str] = Field(
default="Default", description="Test category for result grouping."
)
2. Cross-Platform Config Detection (
3. Robust Fallback Chain (
Potential Issues & Recommendations
|
Description of earlier commits for this branch are documented in PR #16.
Commit 22:
groupkey in config.case_group.Default.Commit 23:
test_runner.pyto includeDefaultcase_groupinEvalResultsto address validation errors in test suite.Commit 24:
claude-3-5-sonnet-20240620toclaude-sonnet-4-20250514.Commit 25:
Commit 26:
Commit 27:
Commit 28:
os.cwd()instead of unix-specific"."to specify current working directory.Commit 29:
XDG_CONFIG_HOMEinstead of changingHOMEenvironment variable to avoid interfering with unix environment (shell history, etc.). Separate commit in case this needs to be rolled back.Commit 30:
Commit 31:
-The Goose executable is now detected in a cross-platform way.
Commit 32:
Commit 33:
Commit 34: