Conversation

valerie-autumn-skye (Collaborator) commented Aug 29, 2025

Commit 1:

Commit 2:

Commit 3:

  • Fixed inconsistent printing of command runtimes among the coders.

Commit 4:

Commit 5:

  • Addressed violations of ruff rules.

Commit 6:

  • Added test case.
  • Added extra logging.

Commit 7:

  • Added metacoder config test case cited below in Test Results.

Commit 8:

Commit 9:

  • Addressed ruff warnings.

Commit 10:

  • Added assertion to confirm that ClaudeJudge successfully scores the Correctness metric after a downgrade.

Commit 11:

  • Added assertion to force test to fail on Exception.
  • Increased logging verbosity temporarily to debug Claude judge unit test on build server.
  • Adjusted logic to work when multiple coders are specified.
  • Improved log messages for clarity.

Commit 12:

  • Fixed runtime issues related to metric downgrade from CorrectnessMetric to DummyMetric.
  • Changed RuntimeError to Exception when Anthropic API key is missing.

Commit 13:

  • Added test coverage of new evaluation judge functionality.
  • Added test for the quota exhaustion fallback logic.

Commit 14:

  • Reduced logging verbosity.
  • Added Anthropic quota check.
  • Added automatic downgrade to DummyMetric on quota check failure.
  • Added notes on potential improvements to unit tests.

Commit 15:

Commit 16:

  • Addressed ruff formatting issue.

Commit 17:

Commit 18: (reverted by Commit 20)

  • Fixed #24 (eval results are repeatedly overwritten when running a test suite). Modified save_results to append to the existing output file rather than overwrite it.
  • Enforced UTF-8 encoding.
  • Switched to safe_dump and added document delimiter between records.
  • Simplified document generation.
  • Added second case to literature_mcp_encoding_test.yaml for testing.

Commit 19:

  • Updated ClaudeJudge to claude-4 model.

Commit 20:

  • Reverted Commit 18 (see above) at the request of @cmungall.
  • Note: UTF-8 character encoding may need to be re-introduced to open() for Windows support.

Commit 21:

  • Added UTF-8 encoding to prevent character mangling during YAML export on Windows (where the default codepage is cp1252).
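
For reference, a minimal sketch of the append-mode, UTF-8 YAML export described in Commit 18 (since reverted) together with the encoding fix retained in Commit 21. Function and argument choices here are illustrative, not the actual metacoder save_results:

import yaml

def save_results(results: list[dict], path: str) -> None:
    # Hypothetical sketch, not the metacoder implementation.
    # Mode "a" appends rather than overwrites, so repeated runs accumulate
    # records, and encoding="utf-8" avoids character mangling on Windows
    # (where the default codepage is cp1252).
    with open(path, "a", encoding="utf-8") as f:
        # safe_dump_all with explicit_start=True emits a "---" delimiter
        # before each record, producing one multi-document YAML stream.
        yaml.safe_dump_all(results, f, explicit_start=True, allow_unicode=True)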

Test Results (successfully downgrades from OpenAI to Claude, in contrast to the same example from Issue #18)

(metacoder) PS C:\Users\CTParker\PycharmProjects\metacoder> pytest -v tests\test_evals\test_claude_judge.py
========================================================================================== test session starts ===========================================================================================
platform win32 -- Python 3.10.16, pytest-8.4.1, pluggy-1.6.0 -- C:\Users\CTParker\PycharmProjects\metacoder\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: C:\Users\CTParker\PycharmProjects\metacoder
configfile: pytest.ini
plugins: anyio-4.9.0, deepeval-3.3.4, langsmith-0.4.10, asyncio-1.1.0, repeat-0.9.4, rerunfailures-12.0, xdist-3.8.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item                                                                                                                                                                                          

tests/test_evals/test_claude_judge.py::test_claude_judge_downgrade_success 
2025-08-30 14:18:51,405 [WARNING] metacoder.evals.runner: OpenAI API Authentication Error: Error code: 401 - {'error': {'message': 'Incorrect API key provided: fake-api************ting. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

2025-08-30 14:18:51,405 [WARNING] metacoder.evals.runner: OpenAI API quota exhausted or server unavailable; disabling OpenAI for DeepEval.

2025-08-30 14:18:51,410 [WARNING] metacoder.evals.runner: Downgrading CorrectnessMetric model from gpt-4.1 to claude-3-5-sonnet-20240620.

2025-08-30 14:18:51,433 [WARNING] metacoder.evals.runner: Successfully downgraded CorrectnessMetric model to claude-3-5-sonnet-20240620.

2025-08-30 14:18:51,434 [WARNING] metacoder.evals.runner: Actual metric used: Correctness.

2025-08-30 14:18:53,651 [WARNING] metacoder.evals.runner: Downgrading CorrectnessMetric model from gpt-4.1 to claude-3-5-sonnet-20240620.

2025-08-30 14:18:53,664 [WARNING] metacoder.evals.runner: Successfully downgraded CorrectnessMetric model to claude-3-5-sonnet-20240620.

2025-08-30 14:18:53,665 [WARNING] metacoder.evals.runner: Actual metric used: Correctness.
PASSED                                                                                                                   [100%]Running teardown with pytest sessionfinish...


=========================================================================================== 1 passed in 23.25s =========================================================================================== 

2025-08-30 14:18:55,710 [DEBUG] httpcore.connection: close.started

2025-08-30 14:18:55,710 [DEBUG] httpcore.connection: close.complete

2025-08-30 14:18:55,710 [DEBUG] httpcore.connection: close.started

2025-08-30 14:18:55,710 [DEBUG] httpcore.connection: close.complete

2025-08-30 14:18:55,710 [DEBUG] httpcore.connection: close.started

2025-08-30 14:18:55,710 [DEBUG] httpcore.connection: close.complete

valerie-autumn-skye self-assigned this Aug 29, 2025
valerie-autumn-skye added the bug label Aug 29, 2025

valerie-autumn-skye changed the title from "Enforce UTF-8 for Goose session files." to "Fixes issues when running Goose on Windows." Aug 29, 2025
justaddcoffee (Contributor) left a comment
One test is failing, seemingly because it can't call goose correctly, but this might be a problem with my goose install

In an inline comment on the new test file, justaddcoffee (Contributor) wrote:
This test is failing for me, but maybe that's a me problem

(metacoder) ~/PythonProject/metacoder literature-eval-issues $ uv run pytest -v tests/test_evals/test_claude_judge.py
====================================================================================== test session starts ======================================================================================
platform darwin -- Python 3.10.17, pytest-8.4.1, pluggy-1.6.0 -- /Users/jtr4v/PythonProject/metacoder/.venv/bin/python
cachedir: .pytest_cache
rootdir: /Users/jtr4v/PythonProject/metacoder
configfile: pytest.ini
plugins: deepeval-3.3.4, repeat-0.9.4, anyio-4.9.0, rerunfailures-12.0, asyncio-1.1.0, xdist-3.8.0, langsmith-0.4.10
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items                                                                                                                                                                               

tests/test_evals/test_claude_judge.py::test_claude_judge_downgrade_success FAILED                                                                                                         [ 50%]
tests/test_evals/test_claude_judge.py::test_correctnessmetric_downgrade_success PASSED                                                                                                    [100%]Running teardown with pytest sessionfinish...


=========================================================================================== FAILURES ============================================================================================
______________________________________________________________________________ test_claude_judge_downgrade_success ______________________________________________________________________________

tmp_path = PosixPath('/private/var/folders/vc/lfqgrrhn56d9yj5fbxbw6qr00000gt/T/pytest-of-jtr4v/pytest-2/test_claude_judge_downgrade_su0')
caplog = <_pytest.logging.LogCaptureFixture object at 0x10b835210>, monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x10b835450>

    def test_claude_judge_downgrade_success(tmp_path, caplog, monkeypatch):
        """Test that ClaudeJudge is used when OpenAI is disabled."""
        # TODO: This test should avoid running the coder and only perform the eval step.
        # Otherwise, it is impossible to get to the eval step if no valid API key is present or no quota is available (testing the wrong part of the process).
    
        runner = EvalRunner()
    
        try:
            dataset = runner.load_dataset(
                Path("tests/input/goose_eval_claude_downgrade_test.yaml")
            )
    
            # Unfortunately, there is nothing available in the eval results that indicate which model DeepEval used.
            # One enhancement might be to introduce metric_model=claude-3-5-sonnet-20240620 to each result at eval time.
            # Instead, resort to capturing the WARNING logs for assertions related to the downgrade.
            with caplog.at_level(logging.WARNING):
                # Temporarily set an invalid OPENAI_API_KEY in order to force OpenAI calls to fail.
                # (no need to reset, `monkeypatch` automatically reverts after the test)
                monkeypatch.setenv("OPENAI_API_KEY", "fake-api-key-for-testing")
    
>               results = runner.run_all_evals(
                    dataset, workdir=tmp_path, coders=["goose", "dummy"]
                )

tests/test_evals/test_claude_judge.py:30: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/metacoder/evals/runner.py:502: in run_all_evals
    results = self.run_single_eval(
src/metacoder/evals/runner.py:279: in run_single_eval
    output: CoderOutput = coder.run(case.input)
src/metacoder/coders/goose.py:156: in run
    result = self.run_process(command, env)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = GooseCoder(workdir='/private/var/folders/vc/lfqgrrhn56d9yj5fbxbw6qr00000gt/T/pytest-of-jtr4v/pytest-2/test_claude_judg...'mcp-simple-pubmed'], 'envs': {'PUBMED_EMAIL': 'ctparker@lbl.gov'}, 'env_keys': ['PUBMED_EMAIL'], 'bundled': None}}})])
command = ['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']
env = {'BUN_INSTALL': '/Users/jtr4v/.bun', 'CLAUDE_CODE_SSE_PORT': '49371', 'COMMAND_MODE': 'unix2003', 'CONFIDENT_AI_RUN_TEST_NAME': 'test_claude_judge_downgrade_success', ...}

    def run_process(
        self, command: list[str], env: dict[str, str] | None = None
    ) -> CoderOutput:
        """Run a process and return the output.
    
        Args:
            command: Command to run
            env: Environment variables to use
    
        Returns:
            Tuple of stdout and stderr
    
        Example:
            >>> from metacoder.coders.dummy import DummyCoder
            >>> coder = DummyCoder(workdir="tmp")
            >>> output = coder.run("hello")
            >>> output.stdout
            'you said: hello'
    
    
        """
        if env is None:
            env = self.expand_env(self.env)
        process = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            env=env,
            bufsize=1,
            universal_newlines=True,
        )
    
        stdout_lines: list[str] = []
        stderr_lines: list[str] = []
    
        # check verbosity level
        quiet_mode = logger.getEffectiveLevel() <= logging.INFO
    
        def stream_output(pipe, output_lines, stream):
            for line in iter(pipe.readline, ""):
                if not quiet_mode:
                    print(line.rstrip(), file=stream)
                output_lines.append(line)
            pipe.close()
    
        # Start threads for both stdout and stderr
        stdout_thread = threading.Thread(
            target=stream_output, args=(process.stdout, stdout_lines, sys.stdout)
        )
        stderr_thread = threading.Thread(
            target=stream_output, args=(process.stderr, stderr_lines, sys.stderr)
        )
    
        stdout_thread.start()
        stderr_thread.start()
    
        # Wait for process and threads to complete
        return_code = process.wait()
        stdout_thread.join()
        stderr_thread.join()
    
        stdout_text = "\n".join(stdout_lines)
        stderr_text = "\n".join(stderr_lines)
    
        if return_code != 0:
            error = subprocess.CalledProcessError(return_code, command)
            error.stdout = stdout_text
            error.stderr = stderr_text
>           raise error
E           subprocess.CalledProcessError: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.

src/metacoder/coders/base_coder.py:222: CalledProcessError

During handling of the above exception, another exception occurred:

tmp_path = PosixPath('/private/var/folders/vc/lfqgrrhn56d9yj5fbxbw6qr00000gt/T/pytest-of-jtr4v/pytest-2/test_claude_judge_downgrade_su0')
caplog = <_pytest.logging.LogCaptureFixture object at 0x10b835210>, monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x10b835450>

    def test_claude_judge_downgrade_success(tmp_path, caplog, monkeypatch):
        """Test that ClaudeJudge is used when OpenAI is disabled."""
        # TODO: This test should avoid running the coder and only perform the eval step.
        # Otherwise, it is impossible to get to the eval step if no valid API key is present or no quota is available (testing the wrong part of the process).
    
        runner = EvalRunner()
    
        try:
            dataset = runner.load_dataset(
                Path("tests/input/goose_eval_claude_downgrade_test.yaml")
            )
    
            # Unfortunately, there is nothing available in the eval results that indicate which model DeepEval used.
            # One enhancement might be to introduce metric_model=claude-3-5-sonnet-20240620 to each result at eval time.
            # Instead, resort to capturing the WARNING logs for assertions related to the downgrade.
            with caplog.at_level(logging.WARNING):
                # Temporarily set an invalid OPENAI_API_KEY in order to force OpenAI calls to fail.
                # (no need to reset, `monkeypatch` automatically reverts after the test)
                monkeypatch.setenv("OPENAI_API_KEY", "fake-api-key-for-testing")
    
                results = runner.run_all_evals(
                    dataset, workdir=tmp_path, coders=["goose", "dummy"]
                )
    
                # Test that the quota exhaustion fallback logic worked as expected.
                assert (
                    "OpenAI API quota exhausted or server unavailable; disabling OpenAI for DeepEval."
                    in caplog.text
                )
    
                # Test that the new evaluation judge was correctly selected for the metric model downgrade.
                assert (
                    "Downgrading CorrectnessMetric model from gpt-4.1 to claude-3-5-sonnet-20240620."
                    in caplog.text
                )
    
                # Test that the eval completed by checking for a non-zero score.
                assert results[0].score > 0, (
                    f"Expected a {results[0].metric_name} score for {results[0].case_name}."
                )
    
        except Exception as e:
            # Test that fallback logic does not result in an Exception.
            logger.error(f"An error occurred: {e}")
            logging.error(traceback.format_exc())
>           assert False  # This assertion will fail if an Exception is caught here.
E           assert False

tests/test_evals/test_claude_judge.py:55: AssertionError
------------------------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------------------------

2025-09-02 14:01:41,552 [ERROR] test_claude_judge: An error occurred: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.

2025-09-02 14:01:41,553 [ERROR] root: Traceback (most recent call last):
  File "/Users/jtr4v/PythonProject/metacoder/tests/test_evals/test_claude_judge.py", line 30, in test_claude_judge_downgrade_success
    results = runner.run_all_evals(
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/evals/runner.py", line 502, in run_all_evals
    results = self.run_single_eval(
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/evals/runner.py", line 279, in run_single_eval
    output: CoderOutput = coder.run(case.input)
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/coders/goose.py", line 156, in run
    result = self.run_process(command, env)
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/coders/base_coder.py", line 222, in run_process
    raise error
subprocess.CalledProcessError: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.

------------------------------------------------------------------------------------- Captured stderr call --------------------------------------------------------------------------------------

thread 'main' panicked at /Users/runner/work/goose/goose/crates/goose-cli/src/session/builder.rs:53:61:
called `Result::unwrap()` on an `Err` value: Failed to access keyring: Platform secure storage failure: A default keychain could not be found.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
--------------------------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------------------------
ERROR    test_claude_judge:test_claude_judge.py:53 An error occurred: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.
ERROR    root:test_claude_judge.py:54 Traceback (most recent call last):
  File "/Users/jtr4v/PythonProject/metacoder/tests/test_evals/test_claude_judge.py", line 30, in test_claude_judge_downgrade_success
    results = runner.run_all_evals(
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/evals/runner.py", line 502, in run_all_evals
    results = self.run_single_eval(
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/evals/runner.py", line 279, in run_single_eval
    output: CoderOutput = coder.run(case.input)
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/coders/goose.py", line 156, in run
    result = self.run_process(command, env)
  File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/coders/base_coder.py", line 222, in run_process
    raise error
subprocess.CalledProcessError: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.
==================================================================================== short test summary info ====================================================================================
FAILED tests/test_evals/test_claude_judge.py::test_claude_judge_downgrade_success - assert False
================================================================================== 1 failed, 1 passed in 0.70s ==================================================================================
(metacoder) ~/PythonProject/metacoder literature-eval-issues $ goose run -t 'What is the first sentence of section 2 in PMID: 28027860?'
starting session | provider: openai model: gpt-4o
    logging to /Users/jtr4v/.local/share/goose/sessions/20250902_140206.jsonl
    working directory: /Users/jtr4v/PythonProject/metacoder
To find the first sentence of section 2 in the document with PubMed ID (PMID) 28027860, you would need access to the journal article in question. PubMed itself often provides abstracts and metadata, but accessing the full text is typically subject to journal access, which may require a subscription or institutional access.                                                              

For full content, you might try:
- Accessing through a library or educational institution that provides access to the journal.
- Checking if the journal offers open access articles.
- Searching for the article on the publisher's website or databases like ResearchGate where authors sometimes share full texts.

If you have access to the document through any of these means or have a PDF version, you could locate section 2 and find the first sentence directly. If not, accessing the specific content might require subscription access.                                                                                                                                                                   

valerie-autumn-skye (Collaborator, Author) replied:

This is the source of your error:

thread 'main' panicked at /Users/runner/work/goose/goose/crates/goose-cli/src/session/builder.rs:53:61:
called `Result::unwrap()` on an `Err` value: Failed to access keyring: Platform secure storage failure: A default keychain could not be found.

It is definitely related to your goose configuration. I had this same keyring issue, but only when running under WSL2 Ubuntu. The fix was to install and start gnome-keyring in the background (see blog post). However, on Mac, this probably has a similar cause but a different solution. I suspect your goose is configured to store your API keys in a way that requires some part of your login shell environment to be available.

claude bot (Contributor) commented Sep 4, 2025


Todo List:

  • Read repository guidelines (CLAUDE.md)
  • Read all changed files to understand modifications
  • Analyze code quality and best practices
  • Check for potential bugs or issues
  • Evaluate performance considerations
  • Assess security concerns
  • Review test coverage adequacy
  • Address the test failure concern raised by reviewer
  • Provide comprehensive review feedback

Overall Assessment: Strong Approval - Excellent PR

This PR successfully addresses multiple Windows compatibility issues and introduces valuable evaluation infrastructure improvements. The changes demonstrate excellent adherence to best practices and defensive programming principles.

Major Changes Reviewed:

1. UTF-8 Encoding Fix (Issue #14) 🎯

File: src/metacoder/coders/goose.py:168

-                with open(session_file, "r") as f:
+                with open(session_file, "r", encoding="utf-8") as f:

✅ Excellent Fix:

  • Prevents Unicode errors: Fixes Windows compatibility issues with non-ASCII characters in session files
  • Follows Python best practices: Explicit encoding specification per PEP 597
  • Cross-platform reliability: Consistent behavior across different OS default encodings
  • Zero risk: No breaking changes, only improved robustness
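
A quick illustration of the failure mode this guards against (a self-contained sketch, not metacoder code):

# Session files written as UTF-8 can contain non-ASCII characters.
with open("session.txt", "w", encoding="utf-8") as f:
    f.write("naïve café 訳")

# On Windows, open() without an explicit encoding falls back to the ANSI
# codepage (typically cp1252), which decodes the UTF-8 bytes as mojibake:
with open("session.txt", "r", encoding="cp1252") as f:
    print(f.read())  # naÃ¯ve cafÃ© è¨³

# The fix: state the encoding explicitly, as the PR does.
with open("session.txt", "r", encoding="utf-8") as f:
    print(f.read())  # naïve café 訳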

2. Consistent Logging Precision 📊

All coder files: Standardized timing logs to use .2f format:

  • claude.py:249: f"🤖 Command took {end_time - start_time:.2f} seconds"
  • codex.py:118: f"🤖 Command took {end_time - start_time:.2f} seconds"
  • gemini.py:159: f"💎 Command took {end_time - start_time:.2f} seconds"
  • goose.py:159: f"🦆 Command took {end_time - start_time:.2f} seconds"
  • qwen.py:93: f"🤖 Command took {end_time - start_time:.2f} seconds"

✅ Code Quality Improvements:

  • Professional consistency: All timing logs now show uniform 2-decimal precision
  • Better user experience: Consistent output formatting across all coders
  • Attention to detail: These minor improvements matter for polished software

3. New Evaluation Infrastructure 🧪

New file: src/metacoder/evals/judges.py

  • Claude Judge Implementation: Well-structured DeepEvalBaseLLM wrapper for Anthropic's Claude
  • Proper error handling: Environment variable validation with clear error messages (judges.py:26-28)
  • Type safety: Excellent use of Anthropic SDK's typed interfaces (judges.py:39-46)
  • Clean async support: Both sync and async methods properly implemented
  • Quota checking: Smart has_available_quota() method for preflight validation (judges.py:61-85)
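
A minimal sketch of what such a custom DeepEval judge can look like. This is illustrative only: the class internals, default model string, and the quota "ping" are assumptions, not the actual judges.py.

import os

import anthropic
from deepeval.models import DeepEvalBaseLLM


class ClaudeJudge(DeepEvalBaseLLM):
    """Illustrative DeepEvalBaseLLM wrapper around Anthropic's Claude."""

    def __init__(self, model_name: str = "claude-3-5-sonnet-20240620"):
        if "ANTHROPIC_API_KEY" not in os.environ:
            # Per Commit 12, the real code raises Exception here.
            raise Exception("ANTHROPIC_API_KEY environment variable is not set.")
        self.model_name = model_name
        self.client = anthropic.Anthropic()

    def load_model(self):
        return self.client

    def get_model_name(self) -> str:
        return self.model_name

    def generate(self, prompt: str) -> str:
        response = self.client.messages.create(
            model=self.model_name,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        # Extract text from the first content block.
        return response.content[0].text

    async def a_generate(self, prompt: str) -> str:
        # The real class has a proper async path; delegating keeps this short.
        return self.generate(prompt)

    def has_available_quota(self) -> bool:
        # Preflight "ping": a one-token request either succeeds or raises.
        try:
            self.client.messages.create(
                model=self.model_name,
                max_tokens=1,
                messages=[{"role": "user", "content": "ping"}],
            )
            return True
        except anthropic.APIError:
            return False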

Enhanced: src/metacoder/evals/runner.py

  • Intelligent fallback system: OpenAI quota exhaustion → Claude → DummyMetric graceful degradation (runner.py:303-339)
  • Robust error handling: Better handling of API failures and quota issues
  • Smart caching: OpenAI quota check with LRU cache to avoid repeated failed calls (runner.py:202-235)
  • Improved logging: Better progress tracking and debug information
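
Roughly, the preflight-plus-fallback pattern reads like the hedged sketch below. The helper names (openai_has_quota, pick_metric, the metric factories) and the models.list() probe are assumptions; the real logic lives at the runner.py lines cited above.

from functools import lru_cache
import logging

import openai

logger = logging.getLogger(__name__)


@lru_cache(maxsize=1)
def openai_has_quota() -> bool:
    # Cached preflight check: a dead or unauthorized endpoint is probed only
    # once per run, with a short timeout and no retries.
    client = openai.OpenAI(timeout=8.0, max_retries=0)
    try:
        client.models.list()  # cheap authenticated call
        return True
    except openai.OpenAIError:
        logger.warning(
            "OpenAI API quota exhausted or server unavailable; "
            "disabling OpenAI for DeepEval."
        )
        return False


def pick_metric(make_openai_metric, make_claude_metric, make_dummy_metric):
    # Graceful degradation: OpenAI -> Claude -> DummyMetric.
    if openai_has_quota():
        return make_openai_metric()
    try:
        return make_claude_metric()
    except Exception:
        logger.warning("Claude judge unavailable; downgrading to DummyMetric.")
        return make_dummy_metric()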

4. CLI Output Improvements 💻

File: src/metacoder/metacoder.py:595-604

  • Better percentage formatting: Proper zero-division protection
  • Defensive programming: Safe fraction calculations with fallback values
  • Improved readability: Cleaner summary output formatting
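
For instance, the zero-division guard amounts to something like this (illustrative, not the exact metacoder.py code):

def pct(passed: int, total: int) -> float:
    # An empty run reports 0.0 instead of raising ZeroDivisionError.
    return (passed / total) * 100.0 if total else 0.0

print(f"Passed: {pct(7, 10):.1f}%")  # Passed: 70.0%
print(f"Passed: {pct(0, 0):.1f}%")   # Passed: 0.0%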

Technical Analysis by Category:

✅ Code Quality & Best Practices

Strengths:

  • Explicit encoding: UTF-8 fix follows Python PEP 597 recommendations
  • Comprehensive error handling: Extensive exception handling in evaluation code with graceful fallbacks
  • Type safety: Excellent use of type hints and Pydantic models throughout
  • Consistent formatting: Standardized logging precision across all coders
  • Defensive programming: Zero-division protection in summary calculations

Follows repository standards:

  • Consistent with existing logging patterns using the established logger
  • Proper use of existing error handling conventions
  • Maintains existing API contracts and interfaces
  • Follows CLAUDE.md guidelines for testing and development

✅ Bug Prevention & Fixes

Issues Fixed:

  • UTF-8 encoding errors: Prevents UnicodeDecodeError on Windows and non-UTF-8 systems
  • Divide by zero: Safe percentage calculations in summary output (metacoder.py:595-604)
  • API quota exhaustion: Graceful fallback when OpenAI quotas are exceeded (runner.py:303-308)
  • Inconsistent output: Standardized timing precision across all coders

Defensive Features Added:

  • Smart quota detection with preflight checks (runner.py:202-235)
  • Exception handling for API failures with multiple fallback layers
  • Environment variable validation with clear error messages (judges.py:26-28)

✅ Performance Considerations

Performance Improvements:

  • LRU caching: OpenAI quota check cached to avoid repeated network calls (runner.py:202)
  • Fast preflight check: Quick OpenAI validation with reduced timeout (8s) and no retries (runner.py:216)
  • Minimal overhead: Explicit encoding adds negligible performance cost

No performance regressions: All changes maintain or improve existing performance characteristics.

✅ Security Assessment

Security Strengths:

  • Environment variable validation: Proper checking for required API keys (judges.py:26-28)
  • No hardcoded secrets: All credentials properly sourced from environment variables
  • Input validation: Proper handling of external API responses
  • No injection risks: Safe handling of file operations and command construction

No security concerns identified.

✅ Test Coverage Assessment

Excellent Test Coverage:

  • UTF-8 specific test: Added test case in tests/test_coders/test_coders_basic.py:171-180 that validates the UTF-8 fix with actual Unicode content
  • Comprehensive evaluation tests: New test suite tests/test_evals/test_claude_judge.py with two test methods:
    • test_claude_judge_downgrade_success: Tests OpenAI → Claude downgrade
    • test_correctnessmetric_downgrade_success: Tests Claude → DummyMetric fallback
  • Proper mocking: Tests use monkeypatch to simulate API key failures
  • Log validation: Tests verify expected warning messages for downgrade scenarios

Test Quality:

  • Uses proper pytest fixtures and logging capture
  • Tests cover the critical fallback scenarios
  • Good error handling in tests with informative assertions

🔍 Addressing Test Failure Concern

Regarding @justaddcoffee's test failure:
The failure is not a code issue but an environment configuration problem:

thread 'main' panicked at /Users/runner/work/goose/goose/crates/goose-cli/src/session/builder.rs:53:61:
called `Result::unwrap()` on an `Err` value: Failed to access keyring: Platform secure storage failure: A default keychain could not be found.

Analysis:

  • This is a goose CLI keyring configuration issue, not related to this PR's changes
  • @ct-parker correctly identified this as a goose configuration problem
  • The test failure occurs before reaching the evaluation logic this PR improves
  • The test is trying to run the actual goose coder, which requires proper keyring setup

Recommendation:
The test should ideally mock the goose coder execution to focus on testing the evaluation fallback logic rather than requiring a fully configured goose environment.
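
Concretely, that could look like the following pytest sketch. Import paths and the CoderOutput constructor fields are assumptions inferred from the traceback above, not verified against the repository.

import logging
from pathlib import Path

from metacoder.coders.base_coder import CoderOutput  # import path assumed
from metacoder.coders.goose import GooseCoder
from metacoder.evals.runner import EvalRunner


def test_claude_judge_downgrade_success_mocked(tmp_path, caplog, monkeypatch):
    """Variant of the failing test that never shells out to the goose CLI."""

    def fake_run(self, input_text: str) -> CoderOutput:
        # Canned coder output; field names assumed to match CoderOutput.
        return CoderOutput(stdout="canned answer", stderr="")

    # No subprocess, no keyring: only the evaluation fallback path is exercised.
    monkeypatch.setattr(GooseCoder, "run", fake_run)
    monkeypatch.setenv("OPENAI_API_KEY", "fake-api-key-for-testing")

    runner = EvalRunner()
    dataset = runner.load_dataset(
        Path("tests/input/goose_eval_claude_downgrade_test.yaml")
    )
    with caplog.at_level(logging.WARNING):
        results = runner.run_all_evals(dataset, workdir=tmp_path, coders=["goose"])

    assert "Downgrading CorrectnessMetric model" in caplog.text
    assert results[0].score > 0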


Specific Code Review Comments:

judges.py - New Evaluation Judge

  • Line 26: Excellent explicit API key validation with clear error message
  • Lines 39-46: Proper use of typed Anthropic SDK interfaces with TextBlockParam and MessageParam
  • Lines 48-52: Correct text extraction from response content blocks with type checking
  • Lines 61-85: Smart quota checking implementation with minimal "ping" request

runner.py - Enhanced Evaluation Logic

  • Lines 303-339: Excellent intelligent fallback system implementation with proper error handling
  • Lines 202-235: Smart OpenAI quota preflight check with appropriate timeout and caching
  • Line 141: Good use of instance variable to track OpenAI availability state

UTF-8 Encoding Fix

  • goose.py:168: Perfect - minimal, targeted fix for the specific issue
  • No side effects: Change is isolated and doesn't affect other functionality
  • Future-proof: Ensures consistent behavior across all platforms

Consistent Logging

  • All timing logs now use .2f format for professional, consistent output
  • Maintains the unique emoji prefixes for each coder while standardizing precision
  • Shows attention to user experience details

Final Recommendation: Strong Approval

This PR represents excellent software engineering practices:

Why This Excels:

  1. Problem-focused: Directly addresses #14 (Coders should not rely on default encoding from operating system), #15 (Divide by zero error in metacoder eval summary), and #18 (Model eval gets caught in a loop if the endpoint returns HTTP errors) with targeted solutions
  2. Best practices: UTF-8 encoding, defensive programming, and intelligent fallbacks throughout
  3. Value-added features: New evaluation infrastructure significantly enhances the project
  4. Risk management: Intelligent fallbacks prevent service disruption
  5. Code quality: Consistent formatting and improved error handling
  6. Zero breaking changes: All modifications maintain backward compatibility

Highlights:

  • Production-ready: Robust error handling and fallback mechanisms
  • Maintainable: Clean, well-structured code that follows existing patterns
  • User-friendly: Better error messages and output formatting
  • Cross-platform: Fixes real Windows compatibility issues
  • Test coverage: Includes proper test cases for UTF-8 functionality and evaluation fallbacks

This is exactly the kind of thoughtful, high-quality improvement that makes software more reliable and maintainable. 🎉

The test failure mentioned by @justaddcoffee is an environment-specific goose configuration issue, not a problem with this PR's code changes.

