Fix: Handle criteria returned as JSON string in JudgeAgent by rajdeepmahal24 · Pull Request #196 · langwatch/scenario

rajdeepmahal24 · 2025-12-13T17:46:29Z

Summary

Fixes #161

This PR addresses an intermittent bug where JudgeAgent fails with AttributeError: 'str' object has no attribute 'values' when the LLM returns the criteria field as a JSON string instead of a dictionary object.

Problem

The error occurs at lines 439 and 444 in judge_agent.py when the code calls criteria.values() without verifying that criteria is actually a dict:

passed_criteria = [
    self.criteria[idx]
    for idx, criterion in enumerate(criteria.values())  # ❌ Fails if criteria is a string
    if criterion == "true"
]

Root Cause

When the LLM is uncertain about the schema format (particularly with complex dynamic schemas using sanitized criterion text as property names), it sometimes serializes the nested criteria object as a JSON string rather than a proper dict.

Example of problematic LLM response:

{
  "verdict": "success",
  "reasoning": "...",
  "criteria": "{\"criterion_1\": \"true\", \"criterion_2\": \"false\"}"  // ❌ String instead of object
}

Expected format:

{
  "verdict": "success",
  "reasoning": "...",
  "criteria": {"criterion_1": "true", "criterion_2": "false"}  // ✅ Object
}
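The failure can be reproduced outside JudgeAgent with only the standard library. The following is a minimal sketch, assuming the tool call arguments arrive as a JSON string that is decoded once with json.loads(); the variable names (raw_arguments, args, error_message) are illustrative and are not taken from judge_agent.py.

```python
import json

# Hypothetical raw tool-call arguments mirroring the problematic response:
# the nested "criteria" object has itself been serialized as a JSON string.
raw_arguments = json.dumps(
    {
        "verdict": "success",
        "reasoning": "...",
        "criteria": json.dumps({"criterion_1": "true", "criterion_2": "false"}),
    }
)

args = json.loads(raw_arguments)
criteria = args.get("criteria", {})

# Only the outer layer was decoded, so criteria is still a str here.
print(type(criteria).__name__)

error_message = None
try:
    criteria.values()
except AttributeError as exc:
    # This is the error reported in #161.
    error_message = str(exc)
    print(error_message)
```

Decoding the outer payload once is not enough when the model double-encodes a nested field, which is why the fix applies a second json.loads() to criteria itself.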

Solution

This PR adds defensive parsing after extracting criteria from tool call arguments:

1. Check if criteria is a string
   - If yes, attempt to parse it with json.loads()
   - If parsing fails, log a warning and use an empty dict as fallback
2. Verify criteria is a dict before calling .values()
   - Additional safety check to prevent the AttributeError
   - Log a warning and use an empty dict as fallback if it is not a dict

# Handle case where LLM returns criteria as a JSON string instead of dict
if isinstance(criteria, str):
    try:
        criteria = json.loads(criteria)
        logger.debug("JudgeAgent: Parsed criteria from JSON string to dict")
    except json.JSONDecodeError:
        logger.warning(
            f"JudgeAgent: Failed to parse criteria string as JSON: {criteria}. "
            "Using empty dict as fallback."
        )
        criteria = {}

# Ensure criteria is a dict before calling .values()
if not isinstance(criteria, dict):
    logger.warning(
        f"JudgeAgent: criteria is {type(criteria).__name__}, expected dict. "
        "Using empty dict as fallback."
    )
    criteria = {}

Benefits

✅ Graceful handling: Both dict and JSON string formats are now supported
✅ Detailed logging: Debug and warning messages help diagnose issues
✅ Safe fallback: An empty dict prevents the AttributeError crash when criteria cannot be recovered
✅ Low risk: Only adds defensive parsing, no changes to normal execution path
✅ Fixes intermittent failures: Addresses the crash reported in #161

Testing

✅ Verified Python syntax with python -m py_compile
✅ Code handles both formats correctly:
- Direct dict: {"criterion_1": "true", "criterion_2": "false"}
- JSON string: "{\"criterion_1\": \"true\", \"criterion_2\": \"false\"}"
✅ Fallback behavior tested for malformed JSON
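The cases listed above can be checked with a small standalone script. The sketch below lifts the defensive parsing into a hypothetical helper, normalize_criteria, which is not part of this PR or of judge_agent.py, and the logger name is likewise made up. It also exercises one further case, a string that is valid JSON but not an object, which is the situation the second isinstance check is there to catch.

```python
import json
import logging
from typing import Any, Dict

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("judge_agent_sketch")


def normalize_criteria(criteria: Any) -> Dict[str, Any]:
    """Return criteria as a dict, falling back to {} when it cannot be recovered.

    Hypothetical helper that mirrors the defensive parsing added in this PR.
    """
    if isinstance(criteria, str):
        try:
            criteria = json.loads(criteria)
        except json.JSONDecodeError:
            logger.warning("Failed to parse criteria string as JSON: %s", criteria)
            return {}
    if not isinstance(criteria, dict):
        logger.warning("criteria is %s, expected dict", type(criteria).__name__)
        return {}
    return criteria


expected = {"criterion_1": "true", "criterion_2": "false"}

# Direct dict: passed through unchanged.
assert normalize_criteria(expected) == expected
# JSON string: decoded into the same dict.
assert normalize_criteria(json.dumps(expected)) == expected
# Malformed JSON: falls back to an empty dict.
assert normalize_criteria("{not valid json") == {}
# Valid JSON that is not an object (here a list): also falls back.
assert normalize_criteria('["true", "false"]') == {}
```

The fallback trades a crash for an empty result: criteria.values() on an empty dict yields nothing, so the passed_criteria comprehension shown earlier produces an empty list in the two fallback cases.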

Changes

File modified: python/scenario/judge_agent.py
Lines added: 24 lines of defensive parsing (after line 435)
Impact: Low risk, backward compatible

Related Issues

Closes #161

Ready for review! This fix prevents the intermittent AttributeError while maintaining backward compatibility with existing tests.


rogeriochaves force-pushed the main branch 2 times, most recently from 77a92af to 9fdb87c · December 16, 2025 15:54

Add evidence to judge criteria results

3e8e1ed

rajdeepmahal24 wants to merge 2 commits into langwatch:main from rajdeepmahal24:fix/judge-agent-criteria-string-parsing
