Conversation

valerie-autumn-skye (Collaborator) commented Aug 29, 2025

Commit 1:

Commit 2:

Commit 3:

  • Fixed inconsistent printing of command runtimes among the coders.

Commit 4:

Commit 5:

  • Addressed violations of ruff rules.

Commit 6:

  • Added test case.
  • Added extra logging.

Commit 7:

  • Added metacoder config test case cited below in Test Results.

Commit 8:

Commit 9:

  • Addressed ruff warnings.

Commit 10:

  • Added assertion to confirm that ClaudeJudge successfully scores the Correctness metric after a downgrade.

Commit 11:

  • Added assertion to force test to fail on Exception.
  • Increased logging verbosity temporarily to debug Claude judge unit test on build server.
  • Adjusted logic to work when multiple coders are specified.
  • Improved log messages for clarity.

Commit 12:

  • Fixed runtime issues related to metric downgrade from CorrectnessMetric to DummyMetric.
  • Changed RuntimeError to Exception when Anthropic API key is missing.

Commit 13:

  • Added test coverage of new evaluation judge functionality.
  • Added test for the quota exhaustion fallback logic.

Commit 14:

  • Reduced logging verbosity.
  • Added Anthropic quota check.
  • Added automatic downgrade to DummyMetric on quota check failure.
  • Added notes on potential improvements to unit tests.

Commit 15:

Commit 16:

  • Addressed ruff formatting issue.

Commit 17:

Commit 18: (reverted by Commit 20)

  • Fixed #24 (eval results are repeatedly overwritten when running a test suite). Modified save_results to append to the existing output file rather than overwrite it.
  • Enforced UTF-8 encoding.
  • Switched to safe_dump and added document delimiter between records.
  • Simplified document generation.
  • Added second case to literature_mcp_encoding_test.yaml for testing.

Commit 19:

  • Updated ClaudeJudge to claude-4 model.

Commit 20:

  • Reverted Commit 18 (see above) at the request of @cmungall.
  • Note: UTF-8 character encoding may need to be re-introduced to open() for Windows support.

Commit 21:

  • Added UTF-8 encoding to prevent character mangling during YAML export on Windows (where the default codepage is cp1252).
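
For reference, a minimal sketch of the append-mode, UTF-8 YAML export described in Commit 18 (since reverted) together with the encoding fix retained in Commit 21. Function and argument choices here are illustrative, not the actual metacoder save_results:

import yaml

def save_results(results: list[dict], path: str) -> None:
    # Hypothetical sketch, not the metacoder implementation.
    # Mode "a" appends rather than overwrites, so repeated runs accumulate
    # records, and encoding="utf-8" avoids character mangling on Windows
    # (where the default codepage is cp1252).
    with open(path, "a", encoding="utf-8") as f:
        # safe_dump_all with explicit_start=True emits a "---" delimiter
        # before each record, producing one multi-document YAML stream.
        yaml.safe_dump_all(results, f, explicit_start=True, allow_unicode=True)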

Test Results (successfully downgrades from OpenAI to Claude, in contrast to the same example from Issue #18)

(metacoder) PS C:\Users\CTParker\PycharmProjects\metacoder> pytest -v tests\test_evals\test_claude_judge.py
========================================================================================== test session starts ===========================================================================================
platform win32 -- Python 3.10.16, pytest-8.4.1, pluggy-1.6.0 -- C:\Users\CTParker\PycharmProjects\metacoder\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: C:\Users\CTParker\PycharmProjects\metacoder
configfile: pytest.ini
plugins: anyio-4.9.0, deepeval-3.3.4, langsmith-0.4.10, asyncio-1.1.0, repeat-0.9.4, rerunfailures-12.0, xdist-3.8.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item                                                                                                                                                                                          

tests/test_evals/test_claude_judge.py::test_claude_judge_downgrade_success 
2025-08-30 14:18:51,405 [WARNING] metacoder.evals.runner: OpenAI API Authentication Error: Error code: 401 - {'error': {'message': 'Incorrect API key provided: fake-api************ting. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

2025-08-30 14:18:51,405 [WARNING] metacoder.evals.runner: OpenAI API quota exhausted or server unavailable; disabling OpenAI for DeepEval.

2025-08-30 14:18:51,410 [WARNING] metacoder.evals.runner: Downgrading CorrectnessMetric model from gpt-4.1 to claude-3-5-sonnet-20240620.

2025-08-30 14:18:51,433 [WARNING] metacoder.evals.runner: Successfully downgraded CorrectnessMetric model to claude-3-5-sonnet-20240620.

2025-08-30 14:18:51,434 [WARNING] metacoder.evals.runner: Actual metric used: Correctness.

2025-08-30 14:18:53,651 [WARNING] metacoder.evals.runner: Downgrading CorrectnessMetric model from gpt-4.1 to claude-3-5-sonnet-20240620.

2025-08-30 14:18:53,664 [WARNING] metacoder.evals.runner: Successfully downgraded CorrectnessMetric model to claude-3-5-sonnet-20240620.

2025-08-30 14:18:53,665 [WARNING] metacoder.evals.runner: Actual metric used: Correctness.
PASSED                                                                                                                   [100%]Running teardown with pytest sessionfinish...


=========================================================================================== 1 passed in 23.25s =========================================================================================== 

2025-08-30 14:18:55,710 [DEBUG] httpcore.connection: close.started

2025-08-30 14:18:55,710 [DEBUG] httpcore.connection: close.complete

2025-08-30 14:18:55,710 [DEBUG] httpcore.connection: close.started

2025-08-30 14:18:55,710 [DEBUG] httpcore.connection: close.complete

2025-08-30 14:18:55,710 [DEBUG] httpcore.connection: close.started

2025-08-30 14:18:55,710 [DEBUG] httpcore.connection: close.complete

valerie-autumn-skye self-assigned this Aug 29, 2025
valerie-autumn-skye added the bug label Aug 29, 2025

valerie-autumn-skye changed the title from "Enforce UTF-8 for Goose session files." to "Fixes issues when running Goose on Windows." Aug 29, 2025
justaddcoffee (Contributor) left a comment
One test is failing, seemingly because it can't call goose correctly, but this might be a problem with my goose install

In an inline comment on the new test file, justaddcoffee (Contributor) wrote:
This test is failing for me, but maybe that's a me problem

(metacoder) ~/PythonProject/metacoder literature-eval-issues $ uv run pytest -v tests/test_evals/test_claude_judge.py
====================================================================================== test session starts ======================================================================================
platform darwin -- Python 3.10.17, pytest-8.4.1, pluggy-1.6.0 -- /Users/jtr4v/PythonProject/metacoder/.venv/bin/python
cachedir: .pytest_cache
rootdir: /Users/jtr4v/PythonProject/metacoder
configfile: pytest.ini
plugins: deepeval-3.3.4, repeat-0.9.4, anyio-4.9.0, rerunfailures-12.0, asyncio-1.1.0, xdist-3.8.0, langsmith-0.4.10
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items                                                                                                                                                                               

tests/test_evals/test_claude_judge.py::test_claude_judge_downgrade_success FAILED                                                                                                         [ 50%]
tests/test_evals/test_claude_judge.py::test_correctnessmetric_downgrade_success PASSED                                                                                                    [100%]Running teardown with pytest sessionfinish...


=========================================================================================== FAILURES ============================================================================================
______________________________________________________________________________ test_claude_judge_downgrade_success ______________________________________________________________________________

tmp_path = PosixPath('/private/var/folders/vc/lfqgrrhn56d9yj5fbxbw6qr00000gt/T/pytest-of-jtr4v/pytest-2/test_claude_judge_downgrade_su0')
caplog = <_pytest.logging.LogCaptureFixture object at 0x10b835210>, monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x10b835450>

    def test_claude_judge_downgrade_success(tmp_path, caplog, monkeypatch):
        """Test that ClaudeJudge is used when OpenAI is disabled."""
        # TODO: This test should avoid running the coder and only perform the eval step.
        # Otherwise, it is impossible to get to the eval step if no valid API key is present or no quota is available (testing the wrong part of the process).
    
        runner = EvalRunner()
    
        try:
            dataset = runner.load_dataset(
                Path("tests/input/goose_eval_claude_downgrade_test.yaml")
            )
    
            # Unfortunately, there is nothing available in the eval results that indicate which model DeepEval used.
            # One enhancement might be to introduce metric_model=claude-3-5-sonnet-20240620 to each result at eval time.
            # Instead, resort to capturing the WARNING logs for assertions related to the downgrade.
            with caplog.at_level(logging.WARNING):
                # Temporarily set an invalid OPENAI_API_KEY in order to force OpenAI calls to fail.
                # (no need to reset, `monkeypatch` automatically reverts after the test)
                monkeypatch.setenv("OPENAI_API_KEY", "fake-api-key-for-testing")
    
>               results = runner.run_all_evals(
                    dataset, workdir=tmp_path, coders=["goose", "dummy"]
                )

tests/test_evals/test_claude_judge.py:30: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/metacoder/evals/runner.py:502: in run_all_evals
    results = self.run_single_eval(
src/metacoder/evals/runner.py:279: in run_single_eval
    output: CoderOutput = coder.run(case.input)
src/metacoder/coders/goose.py:156: in run
    result = self.run_process(command, env)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = GooseCoder(workdir='/private/var/folders/vc/lfqgrrhn56d9yj5fbxbw6qr00000gt/T/pytest-of-jtr4v/pytest-2/test_claude_judg...'mcp-simple-pubmed'], 'envs': {'PUBMED_EMAIL': 'ctparker@lbl.gov'}, 'env_keys': ['PUBMED_EMAIL'], 'bundled': None}}})])
command = ['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']
env = {'BUN_INSTALL': '/Users/jtr4v/.bun', 'CLAUDE_CODE_SSE_PORT': '49371', 'COMMAND_MODE': 'unix2003', 'CONFIDENT_AI_RUN_TEST_NAME': 'test_claude_judge_downgrade_success', ...}

    def run_process(
        self, command: list[str], env: dict[str, str] | None = None
    ) -> CoderOutput:
        """Run a process and return the output.
    
        Args:
            command: Command to run
            env: Environment variables to use
    
        Returns:
            Tuple of stdout and stderr
    
        Example:
            >>> from metacoder.coders.dummy import DummyCoder
            >>> coder = DummyCoder(workdir="tmp")
            >>> output = coder.run("hello")
            >>> output.stdout
            'you said: hello'
    
    
        """
        if env is None:
            env = self.expand_env(self.env)
        process = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            env=env,
            bufsize=1,
            universal_newlines=True,
        )
    
        stdout_lines: list[str] = []
        stderr_lines: list[str] = []
    
        # check verbosity level
        quiet_mode = logger.getEffectiveLevel() <= logging.INFO
    
        def stream_output(pipe, output_lines, stream):
            for line in iter(pipe.readline, ""):
                if not quiet_mode:
                    print(line.rstrip(), file=stream)
                output_lines.append(line)
            pipe.close()
    
        # Start threads for both stdout and stderr
        stdout_thread = threading.Thread(
            target=stream_output, args=(process.stdout, stdout_lines, sys.stdout)
        )
        stderr_thread = threading.Thread(
            target=stream_output, args=(process.stderr, stderr_lines, sys.stderr)
        )
    
        stdout_thread.start()
        stderr_thread.start()
    
        # Wait for process and threads to complete
        return_code = process.wait()
        stdout_thread.join()
        stderr_thread.join()
    
        stdout_text = "\n".join(stdout_lines)
        stderr_text = "\n".join(stderr_lines)
    
        if return_code != 0:
            error = subprocess.CalledProcessError(return_code, command)
            error.stdout = stdout_text
            error.stderr = stderr_text
>           raise error
E           subprocess.CalledProcessError: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.

src/metacoder/coders/base_coder.py:222: CalledProcessError

During handling of the above exception, another exception occurred:

tmp_path = PosixPath('/private/var/folders/vc/lfqgrrhn56d9yj5fbxbw6qr00000gt/T/pytest-of-jtr4v/pytest-2/test_claude_judge_downgrade_su0')
caplog = <_pytest.logging.LogCaptureFixture object at 0x10b835210>, monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x10b835450>

    def test_claude_judge_downgrade_success(tmp_path, caplog, monkeypatch):
        """Test that ClaudeJudge is used when OpenAI is disabled."""
        # TODO: This test should avoid running the coder and only perform the eval step.
        # Otherwise, it is impossible to get to the eval step if no valid API key is present or no quota is available (testing the wrong part of the process).
    
        runner = EvalRunner()
    
        try:
            dataset = runner.load_dataset(
                Path("tests/input/goose_eval_claude_downgrade_test.yaml")
            )
    
            # Unfortunately, there is nothing available in the eval results that indicate which model DeepEval used.
            # One enhancement might be to introduce metric_model=claude-3-5-sonnet-20240620 to each result at eval time.
            # Instead, resort to capturing the WARNING logs for assertions related to the downgrade.
            with caplog.at_level(logging.WARNING):
                # Temporarily set an invalid OPENAI_API_KEY in order to force OpenAI calls to fail.
                # (no need to reset, `monkeypatch` automatically reverts after the test)
                monkeypatch.setenv("OPENAI_API_KEY", "fake-api-key-for-testing")
    
                results = runner.run_all_evals(
                    dataset, workdir=tmp_path, coders=["goose", "dummy"]
                )
    
                # Test that the quota exhaustion fallback logic worked as expected.
                assert (
                    "OpenAI API quota exhausted or server unavailable; disabling OpenAI for DeepEval."
                    in caplog.text
                )
    
                # Test that the new evaluation judge was correctly selected for the metric model downgrade.
                assert (
                    "Downgrading CorrectnessMetric model from gpt-4.1 to claude-3-5-sonnet-20240620."
                    in caplog.text
                )
    
                # Test that the eval completed by checking for a non-zero score.
                assert results[0].score > 0, (
                    f"Expected a {results[0].metric_name} score for {results[0].case_name}."
                )
    
        except Exception as e:
            # Test that fallback logic does not result in an Exception.
            logger.error(f"An error occurred: {e}")
            logging.error(traceback.format_exc())
>           assert False  # This assertion will fail if an Exception is caught here.
E           assert False

tests/test_evals/test_claude_judge.py:55: AssertionError
------------------------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------------------------

2025-09-02 14:01:41,552 [ERROR] test_claude_judge: An error occurred: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.

2025-09-02 14:01:41,553 [ERROR] root: Traceback (most recent call last):
  File "/Users/jtr4v/PythonProject/metacoder/tests/test_evals/test_claude_judge.py", line 30, in test_claude_judge_downgrade_success
    results = runner.run_all_evals(
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/evals/runner.py", line 502, in run_all_evals
    results = self.run_single_eval(
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/evals/runner.py", line 279, in run_single_eval
    output: CoderOutput = coder.run(case.input)
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/coders/goose.py", line 156, in run
    result = self.run_process(command, env)
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/coders/base_coder.py", line 222, in run_process
    raise error
subprocess.CalledProcessError: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.

------------------------------------------------------------------------------------- Captured stderr call --------------------------------------------------------------------------------------

thread 'main' panicked at /Users/runner/work/goose/goose/crates/goose-cli/src/session/builder.rs:53:61:
called `Result::unwrap()` on an `Err` value: Failed to access keyring: Platform secure storage failure: A default keychain could not be found.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
--------------------------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------------------------
ERROR    test_claude_judge:test_claude_judge.py:53 An error occurred: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.
ERROR    root:test_claude_judge.py:54 Traceback (most recent call last):
  File "/Users/jtr4v/PythonProject/metacoder/tests/test_evals/test_claude_judge.py", line 30, in test_claude_judge_downgrade_success
    results = runner.run_all_evals(
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/evals/runner.py", line 502, in run_all_evals
    results = self.run_single_eval(
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/evals/runner.py", line 279, in run_single_eval
    output: CoderOutput = coder.run(case.input)
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/coders/goose.py", line 156, in run
    result = self.run_process(command, env)
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/coders/base_coder.py", line 222, in run_process
    raise error
subprocess.CalledProcessError: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.
==================================================================================== short test summary info ====================================================================================
FAILED tests/test_evals/test_claude_judge.py::test_claude_judge_downgrade_success - assert False
================================================================================== 1 failed, 1 passed in 0.70s ==================================================================================
(metacoder) ~/PythonProject/metacoder literature-eval-issues $ goose run -t 'What is the first sentence of section 2 in PMID: 28027860?'
starting session | provider: openai model: gpt-4o
    logging to /Users/jtr4v/.local/share/goose/sessions/20250902_140206.jsonl
    working directory: /Users/jtr4v/PythonProject/metacoder
To find the first sentence of section 2 in the document with PubMed ID (PMID) 28027860, you would need access to the journal article in question. PubMed itself often provides abstracts and metadata, but accessing the full text is typically subject to journal access, which may require a subscription or institutional access.                                                              

For full content, you might try:
- Accessing through a library or educational institution that provides access to the journal.
- Checking if the journal offers open access articles.
- Searching for the article on the publisher's website or databases like ResearchGate where authors sometimes share full texts.

If you have access to the document through any of these means or have a PDF version, you could locate section 2 and find the first sentence directly. If not, accessing the specific content might require subscription access.                                                                                                                                                                   

valerie-autumn-skye (Collaborator, Author) replied:

This is the source of your error:

thread 'main' panicked at /Users/runner/work/goose/goose/crates/goose-cli/src/session/builder.rs:53:61:
called `Result::unwrap()` on an `Err` value: Failed to access keyring: Platform secure storage failure: A default keychain could not be found.

It is definitely related to your goose configuration. I had this same keyring issue, but only when running under WSL2 Ubuntu. The fix was to install and start gnome-keyring in the background (see blog post). However, on Mac, this probably has a similar cause but a different solution. I suspect your goose is configured to store your API keys in a way that requires some part of your login shell environment to be available.

claude bot (Contributor) commented Sep 4, 2025


Todo List:

  • Read repository guidelines (CLAUDE.md)
  • Read all changed files to understand modifications
  • Analyze code quality and best practices
  • Check for potential bugs or issues
  • Evaluate performance considerations
  • Assess security concerns
  • Review test coverage adequacy
  • Address the test failure concern raised by reviewer
  • Provide comprehensive review feedback

Overall Assessment: Strong Approval - Excellent PR

This PR successfully addresses multiple Windows compatibility issues and introduces valuable evaluation infrastructure improvements. The changes demonstrate excellent adherence to best practices and defensive programming principles.

Major Changes Reviewed:

1. UTF-8 Encoding Fix (Issue #14) 🎯

File: src/metacoder/coders/goose.py:168

-                with open(session_file, "r") as f:
+                with open(session_file, "r", encoding="utf-8") as f:

✅ Excellent Fix:

  • Prevents Unicode errors: Fixes Windows compatibility issues with non-ASCII characters in session files
  • Follows Python best practices: Explicit encoding specification per PEP 597
  • Cross-platform reliability: Consistent behavior across different OS default encodings
  • Zero risk: No breaking changes, only improved robustness
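
A quick illustration of the failure mode this guards against (a self-contained sketch, not metacoder code):

# Session files written as UTF-8 can contain non-ASCII characters.
with open("session.txt", "w", encoding="utf-8") as f:
    f.write("naïve café 訳")

# On Windows, open() without an explicit encoding falls back to the ANSI
# codepage (typically cp1252), which decodes the UTF-8 bytes as mojibake:
with open("session.txt", "r", encoding="cp1252") as f:
    print(f.read())  # naÃ¯ve cafÃ© è¨³

# The fix: state the encoding explicitly, as the PR does.
with open("session.txt", "r", encoding="utf-8") as f:
    print(f.read())  # naïve café 訳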

2. Consistent Logging Precision 📊

All coder files: Standardized timing logs to use .2f format:

  • claude.py:249: f"🤖 Command took {end_time - start_time:.2f} seconds"
  • codex.py:118: f"🤖 Command took {end_time - start_time:.2f} seconds"
  • gemini.py:159: f"💎 Command took {end_time - start_time:.2f} seconds"
  • goose.py:159: f"🦆 Command took {end_time - start_time:.2f} seconds"
  • qwen.py:93: f"🤖 Command took {end_time - start_time:.2f} seconds"

✅ Code Quality Improvements:

  • Professional consistency: All timing logs now show uniform 2-decimal precision
  • Better user experience: Consistent output formatting across all coders
  • Attention to detail: These minor improvements matter for polished software

3. New Evaluation Infrastructure 🧪

New file: src/metacoder/evals/judges.py

  • Claude Judge Implementation: Well-structured DeepEvalBaseLLM wrapper for Anthropic's Claude
  • Proper error handling: Environment variable validation with clear error messages (judges.py:26-28)
  • Type safety: Excellent use of Anthropic SDK's typed interfaces (judges.py:39-46)
  • Clean async support: Both sync and async methods properly implemented
  • Quota checking: Smart has_available_quota() method for preflight validation (judges.py:61-85)
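
A minimal sketch of what such a custom DeepEval judge can look like. This is illustrative only: the class internals, default model string, and the quota "ping" are assumptions, not the actual judges.py.

import os

import anthropic
from deepeval.models import DeepEvalBaseLLM


class ClaudeJudge(DeepEvalBaseLLM):
    """Illustrative DeepEvalBaseLLM wrapper around Anthropic's Claude."""

    def __init__(self, model_name: str = "claude-3-5-sonnet-20240620"):
        if "ANTHROPIC_API_KEY" not in os.environ:
            # Per Commit 12, the real code raises Exception here.
            raise Exception("ANTHROPIC_API_KEY environment variable is not set.")
        self.model_name = model_name
        self.client = anthropic.Anthropic()

    def load_model(self):
        return self.client

    def get_model_name(self) -> str:
        return self.model_name

    def generate(self, prompt: str) -> str:
        response = self.client.messages.create(
            model=self.model_name,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        # Extract text from the first content block.
        return response.content[0].text

    async def a_generate(self, prompt: str) -> str:
        # The real class has a proper async path; delegating keeps this short.
        return self.generate(prompt)

    def has_available_quota(self) -> bool:
        # Preflight "ping": a one-token request either succeeds or raises.
        try:
            self.client.messages.create(
                model=self.model_name,
                max_tokens=1,
                messages=[{"role": "user", "content": "ping"}],
            )
            return True
        except anthropic.APIError:
            return False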

Enhanced: src/metacoder/evals/runner.py

  • Intelligent fallback system: OpenAI quota exhaustion → Claude → DummyMetric graceful degradation (runner.py:303-339)
  • Robust error handling: Better handling of API failures and quota issues
  • Smart caching: OpenAI quota check with LRU cache to avoid repeated failed calls (runner.py:202-235)
  • Improved logging: Better progress tracking and debug information
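
Roughly, the preflight-plus-fallback pattern reads like the hedged sketch below. The helper names (openai_has_quota, pick_metric, the metric factories) and the models.list() probe are assumptions; the real logic lives at the runner.py lines cited above.

from functools import lru_cache
import logging

import openai

logger = logging.getLogger(__name__)


@lru_cache(maxsize=1)
def openai_has_quota() -> bool:
    # Cached preflight check: a dead or unauthorized endpoint is probed only
    # once per run, with a short timeout and no retries.
    client = openai.OpenAI(timeout=8.0, max_retries=0)
    try:
        client.models.list()  # cheap authenticated call
        return True
    except openai.OpenAIError:
        logger.warning(
            "OpenAI API quota exhausted or server unavailable; "
            "disabling OpenAI for DeepEval."
        )
        return False


def pick_metric(make_openai_metric, make_claude_metric, make_dummy_metric):
    # Graceful degradation: OpenAI -> Claude -> DummyMetric.
    if openai_has_quota():
        return make_openai_metric()
    try:
        return make_claude_metric()
    except Exception:
        logger.warning("Claude judge unavailable; downgrading to DummyMetric.")
        return make_dummy_metric()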

4. CLI Output Improvements 💻

File: src/metacoder/metacoder.py:595-604

  • Better percentage formatting: Proper zero-division protection
  • Defensive programming: Safe fraction calculations with fallback values
  • Improved readability: Cleaner summary output formatting
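
For instance, the zero-division guard amounts to something like this (illustrative, not the exact metacoder.py code):

def pct(passed: int, total: int) -> float:
    # An empty run reports 0.0 instead of raising ZeroDivisionError.
    return (passed / total) * 100.0 if total else 0.0

print(f"Passed: {pct(7, 10):.1f}%")  # Passed: 70.0%
print(f"Passed: {pct(0, 0):.1f}%")   # Passed: 0.0%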

Technical Analysis by Category:

✅ Code Quality & Best Practices

Strengths:

  • Explicit encoding: UTF-8 fix follows Python PEP 597 recommendations
  • Comprehensive error handling: Extensive exception handling in evaluation code with graceful fallbacks
  • Type safety: Excellent use of type hints and Pydantic models throughout
  • Consistent formatting: Standardized logging precision across all coders
  • Defensive programming: Zero-division protection in summary calculations

Follows repository standards:

  • Consistent with existing logging patterns using the established logger
  • Proper use of existing error handling conventions
  • Maintains existing API contracts and interfaces
  • Follows CLAUDE.md guidelines for testing and development

✅ Bug Prevention & Fixes

Issues Fixed:

  • UTF-8 encoding errors: Prevents UnicodeDecodeError on Windows and non-UTF-8 systems
  • Divide by zero: Safe percentage calculations in summary output (metacoder.py:595-604)
  • API quota exhaustion: Graceful fallback when OpenAI quotas are exceeded (runner.py:303-308)
  • Inconsistent output: Standardized timing precision across all coders

Defensive Features Added:

  • Smart quota detection with preflight checks (runner.py:202-235)
  • Exception handling for API failures with multiple fallback layers
  • Environment variable validation with clear error messages (judges.py:26-28)

✅ Performance Considerations

Performance Improvements:

  • LRU caching: OpenAI quota check cached to avoid repeated network calls (runner.py:202)
  • Fast preflight check: Quick OpenAI validation with reduced timeout (8s) and no retries (runner.py:216)
  • Minimal overhead: Explicit encoding adds negligible performance cost

No performance regressions: All changes maintain or improve existing performance characteristics.

✅ Security Assessment

Security Strengths:

  • Environment variable validation: Proper checking for required API keys (judges.py:26-28)
  • No hardcoded secrets: All credentials properly sourced from environment variables
  • Input validation: Proper handling of external API responses
  • No injection risks: Safe handling of file operations and command construction

No security concerns identified.

✅ Test Coverage Assessment

Excellent Test Coverage:

  • UTF-8 specific test: Added test case in tests/test_coders/test_coders_basic.py:171-180 that validates the UTF-8 fix with actual Unicode content
  • Comprehensive evaluation tests: New test suite tests/test_evals/test_claude_judge.py with two test methods:
    • test_claude_judge_downgrade_success: Tests OpenAI → Claude downgrade
    • test_correctnessmetric_downgrade_success: Tests Claude → DummyMetric fallback
  • Proper mocking: Tests use monkeypatch to simulate API key failures
  • Log validation: Tests verify expected warning messages for downgrade scenarios

Test Quality:

  • Uses proper pytest fixtures and logging capture
  • Tests cover the critical fallback scenarios
  • Good error handling in tests with informative assertions

🔍 Addressing Test Failure Concern

Regarding @justaddcoffee's test failure:
The failure is not a code issue but an environment configuration problem:

thread 'main' panicked at /Users/runner/work/goose/goose/crates/goose-cli/src/session/builder.rs:53:61:
called `Result::unwrap()` on an `Err` value: Failed to access keyring: Platform secure storage failure: A default keychain could not be found.

Analysis:

  • This is a goose CLI keyring configuration issue, not related to this PR's changes
  • @ct-parker correctly identified this as a goose configuration problem
  • The test failure occurs before reaching the evaluation logic this PR improves
  • The test is trying to run the actual goose coder, which requires proper keyring setup

Recommendation:
The test should ideally mock the goose coder execution to focus on testing the evaluation fallback logic rather than requiring a fully configured goose environment.
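
Concretely, that could look like the following pytest sketch. Import paths and the CoderOutput constructor fields are assumptions inferred from the traceback above, not verified against the repository.

import logging
from pathlib import Path

from metacoder.coders.base_coder import CoderOutput  # import path assumed
from metacoder.coders.goose import GooseCoder
from metacoder.evals.runner import EvalRunner


def test_claude_judge_downgrade_success_mocked(tmp_path, caplog, monkeypatch):
    """Variant of the failing test that never shells out to the goose CLI."""

    def fake_run(self, input_text: str) -> CoderOutput:
        # Canned coder output; field names assumed to match CoderOutput.
        return CoderOutput(stdout="canned answer", stderr="")

    # No subprocess, no keyring: only the evaluation fallback path is exercised.
    monkeypatch.setattr(GooseCoder, "run", fake_run)
    monkeypatch.setenv("OPENAI_API_KEY", "fake-api-key-for-testing")

    runner = EvalRunner()
    dataset = runner.load_dataset(
        Path("tests/input/goose_eval_claude_downgrade_test.yaml")
    )
    with caplog.at_level(logging.WARNING):
        results = runner.run_all_evals(dataset, workdir=tmp_path, coders=["goose"])

    assert "Downgrading CorrectnessMetric model" in caplog.text
    assert results[0].score > 0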


Specific Code Review Comments:

judges.py - New Evaluation Judge

  • Line 26: Excellent explicit API key validation with clear error message
  • Lines 39-46: Proper use of typed Anthropic SDK interfaces with TextBlockParam and MessageParam
  • Lines 48-52: Correct text extraction from response content blocks with type checking
  • Lines 61-85: Smart quota checking implementation with minimal "ping" request

runner.py - Enhanced Evaluation Logic

  • Lines 303-339: Excellent intelligent fallback system implementation with proper error handling
  • Lines 202-235: Smart OpenAI quota preflight check with appropriate timeout and caching
  • Line 141: Good use of instance variable to track OpenAI availability state

UTF-8 Encoding Fix

  • goose.py:168: Perfect - minimal, targeted fix for the specific issue
  • No side effects: Change is isolated and doesn't affect other functionality
  • Future-proof: Ensures consistent behavior across all platforms

Consistent Logging

  • All timing logs now use .2f format for professional, consistent output
  • Maintains the unique emoji prefixes for each coder while standardizing precision
  • Shows attention to user experience details

Final Recommendation: Strong Approval

This PR represents excellent software engineering practices:

Why This Excels:

  1. Problem-focused: Directly addresses #14 (Coders should not rely on default encoding from operating system), #15 (Divide by zero error in metacoder eval summary), and #18 (Model eval gets caught in a loop if the endpoint returns HTTP errors) with targeted solutions
  2. Best practices: UTF-8 encoding, defensive programming, and intelligent fallbacks throughout
  3. Value-added features: New evaluation infrastructure significantly enhances the project
  4. Risk management: Intelligent fallbacks prevent service disruption
  5. Code quality: Consistent formatting and improved error handling
  6. Zero breaking changes: All modifications maintain backward compatibility

Highlights:

  • Production-ready: Robust error handling and fallback mechanisms
  • Maintainable: Clean, well-structured code that follows existing patterns
  • User-friendly: Better error messages and output formatting
  • Cross-platform: Fixes real Windows compatibility issues
  • Test coverage: Includes proper test cases for UTF-8 functionality and evaluation fallbacks

This is exactly the kind of thoughtful, high-quality improvement that makes software more reliable and maintainable. 🎉

The test failure mentioned by @justaddcoffee is an environment-specific goose configuration issue, not a problem with this PR's code changes.

