
Agent eval: add tool call comparison #24

Merged

tisnik merged 6 commits into lightspeed-core:main from asamal4:tool-eval on Aug 13, 2025

Conversation

asamal4 (Collaborator) commented Aug 11, 2025

Added tool call comparison capability.

  • Expanded the input data model to handle multiple eval types, so tool_call eval can run alongside other evaluations.
  • Switched to the streaming endpoint as the default (to enable tool call parsing).
  • Included tool call eval for the query endpoint as well (to keep it compatible with OLS for now).

Will create follow-up stories to standardise lightspeed-stack responses (less logic on the eval side).
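For orientation, a single eval entry after this change looks roughly like the following (shown as a Python dict mirroring the YAML fields documented in the README; the concrete values, tool names, and verify-script path are made up):

# Hypothetical sketch of one eval entry after this PR; field names follow the
# README/sample updates (eval_types, expected_*, eval_verify_script, expected_tool_calls).
eval_entry = {
    "eval_id": "conv1_eval1",                      # unique within a conversation_group
    "eval_query": "is there an openshift-lightspeed namespace?",
    "eval_types": [                                # several evaluations run for one query
        "response_eval:sub-string",
        "response_eval:accuracy",
        "action_eval",
        "tool_eval",
    ],
    "expected_keywords": ["yes", "openshift-lightspeed"],   # for response_eval:sub-string
    "expected_response": "Yes, the namespace exists.",      # for response_eval:accuracy
    "eval_verify_script": "scripts/verify_namespace.sh",    # for action_eval (placeholder path)
    "expected_tool_calls": [                                # for tool_eval: list of sequences
        [{"tool_name": "oc_get", "arguments": {"kind": "namespace"}}]
    ],
}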

Summary by CodeRabbit

  • New Features

    • Multi-type evaluations per query via eval_types (action_eval, response_eval:sub-string, response_eval:accuracy, tool_eval); expected_tool_calls and eval_verify_script support; results include per-type items and captured tool_calls.
    • Selectable endpoint_type (streaming or query) with streaming parsing and agent streaming support.
    • LiteLLM integration documented; agent_auth_token_file/AGENT_API_TOKEN guidance added.
  • Documentation

    • README, YAML examples, CLI samples, and usage docs updated for new fields, token handling, and endpoint_type.
  • Tests

    • Expanded tests for streaming parser, tool-call comparison, multi-type flows, and CLI endpoint handling.

coderabbitai bot commented Aug 11, 2025

Walkthrough

Replaces single eval_type with eval_types (List[str]), adds action_eval, response_eval:* and tool_eval (with expected_tool_calls and eval_verify_script), introduces endpoint_type (streaming|query), streaming parser and client paths, multi-result evaluation flow, and updates README, samples, CLI, models, utils, and tests.

Changes

Cohort / File(s) — Summary
Spec & Docs
lsc_agent_eval/README.md, lsc_agent_eval/sample_data/agent_goal_eval_example.yaml
Replace eval_type with eval_types (List[str]); document new eval kinds (action_eval, response_eval:sub-string, response_eval:accuracy, tool_eval), expected_tool_calls, eval_verify_script, endpoint_type, token handling, LiteLLM note, examples and updated sample YAMLs.
CLI
lsc_agent_eval/src/lsc_agent_eval/agent_eval.py
Add --endpoint_type CLI arg (values: streaming | query; default: streaming).
Agent Goal Eval Flow
.../core/agent_goal_eval/agent_goal_eval.py, .../core/agent_goal_eval/evaluator.py, .../core/agent_goal_eval/eval_data.py
Move to multi-result semantics: run_evaluation(..., endpoint_type) aggregates per-type results from eval_types; dispatch per eval type; propagate endpoint_type; consolidate printing and setup-script failure handling to emit multiple results.
Models & Eval Utilities
.../core/agent_goal_eval/models.py, .../core/agent_goal_eval/utils.py, .../core/agent_goal_eval/tool_call_eval.py
Add eval_types: List[str] (remove scalar eval_type), introduce EVAL_TYPE_REQUIREMENTS; add expected_tool_calls on data config and tool_calls on results; replace single-result helpers with create_evaluation_results(...) (returns list); add compare_tool_calls(expected, actual) -> bool with strict matching.
HTTP Client & Streaming
.../core/utils/api_client.py, .../core/utils/streaming_parser.py, .../core/utils/__init__.py
query_agent now returns dict {response, conversation_id, tool_calls}; add streaming_query_agent using SSE-like stream and parse_streaming_response; re-export parse_streaming_response.
Tests
lsc_agent_eval/tests/... (agent_goal_eval/, utils/, test_agent_eval.py)
Update tests to use eval_types, cover multi-type aggregation, tool_eval, endpoint routing, streaming parser/client behavior; add tests for compare_tool_calls and parse_streaming_response; update CLI tests for --endpoint_type.
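Purely as an illustration of the strict matching summarized in the Models & Eval Utilities row above, here is a minimal self-contained sketch (key names follow the README's tool_name/arguments wording; the actual compare_tool_calls in tool_call_eval.py is structured differently, with per-level helpers and debug logging):

from typing import Any

def compare_tool_calls_sketch(
    expected: list[list[dict[str, Any]]],
    actual: list[list[dict[str, Any]]],
) -> bool:
    """Strict comparison: same number of sequences, same order, and within each
    sequence the same tool names and identical arguments."""
    if len(expected) != len(actual):
        return False
    for exp_seq, act_seq in zip(expected, actual):
        if len(exp_seq) != len(act_seq):
            return False
        for exp_call, act_call in zip(exp_seq, act_seq):
            # The real code may key tool names as "name" instead of "tool_name".
            if exp_call.get("tool_name") != act_call.get("tool_name"):
                return False
            if exp_call.get("arguments", {}) != act_call.get("arguments", {}):
                return False
    return True

# Example: a single-sequence exact match
assert compare_tool_calls_sketch(
    [[{"tool_name": "list_versions", "arguments": {}}]],
    [[{"tool_name": "list_versions", "arguments": {}}]],
)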

Sequence Diagram(s)

sequenceDiagram
  participant CLI as CLI (agent_eval)
  participant Runner as EvaluationRunner
  participant API as AgentHttpClient
  participant Parser as Streaming Parser
  participant Dispatcher as Eval Dispatcher

  CLI->>Runner: run_evaluation(data_config, provider, model, endpoint_type)
  alt endpoint_type == "streaming"
    Runner->>API: streaming_query_agent(api_input)
    API->>Parser: parse_streaming_response(response)
    Parser-->>API: {response, conversation_id, tool_calls}
    API-->>Runner: {response, conversation_id, tool_calls}
  else endpoint_type == "query"
    Runner->>API: query_agent(api_input)
    API-->>Runner: {response, conversation_id, tool_calls}
  end
  Runner->>Runner: for each eval_type in data_config.eval_types
  Runner->>Dispatcher: evaluate per-type (action_eval / response_eval:* / tool_eval)
  Dispatcher-->>Runner: per-type result items
  Runner->>Dispatcher: create_evaluation_results(...)
  Dispatcher-->>CLI: list[EvaluationResult]

sequenceDiagram
  participant API as AgentHttpClient.streaming_query_agent
  participant HTTP as Agent API (stream endpoint)
  participant Parser as parse_streaming_response

  API->>HTTP: POST /v1/streaming_query (JSON)
  HTTP-->>API: stream lines (data: {...})
  API->>Parser: parse_streaming_response(response)
  Parser-->>API: {response, conversation_id, tool_calls}
  API-->>Caller: structured dict
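For readers unfamiliar with the stream format, here is a minimal stand-alone sketch of the two documented parsing steps (the "data: " prefix handling and the "Tool:<name> arguments:<json>" token format). It is not the PR's parse_streaming_response, which additionally extracts the final response and conversation_id and raises ValueError when they are missing:

import json
from typing import Any, Optional

DATA_PREFIX = "data: "
TOOL_PREFIX = "Tool:"
ARGS_SEPARATOR = " arguments:"

def parse_tool_token(token: str) -> Optional[dict[str, Any]]:
    """Parse a token like 'Tool:create_namespace arguments:{"name": "lightspeed"}'.
    Returns None for unrecognized formats and falls back to {} on bad JSON,
    mirroring the behaviour the review's suggested tests expect."""
    if not token.startswith(TOOL_PREFIX) or ARGS_SEPARATOR not in token:
        return None
    name_part, args_part = token.split(ARGS_SEPARATOR, 1)
    name = name_part[len(TOOL_PREFIX):]
    try:
        arguments = json.loads(args_part)
    except json.JSONDecodeError:
        arguments = {}
    return {"name": name, "arguments": arguments}

def parse_data_lines(lines: list[str]) -> list[dict[str, Any]]:
    """Keep only SSE-style 'data: {...}' lines and JSON-decode their payloads,
    skipping anything malformed."""
    events = []
    for line in lines:
        line = line.strip()
        if not line.startswith(DATA_PREFIX):
            continue
        try:
            events.append(json.loads(line[len(DATA_PREFIX):]))
        except json.JSONDecodeError:
            continue
    return events

# Example usage of the tool-token helper
print(parse_tool_token('Tool:create_namespace arguments:{"name": "lightspeed"}'))
# -> {'name': 'create_namespace', 'arguments': {'name': 'lightspeed'}}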

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Suggested reviewers

  • tisnik
  • VladimirKadlec

Poem

I thump my paw — new evals take flight,
Multiple types, both day and night.
Streams of tokens, tools checked with care,
Sub-strings found, judges answer fair.
Hop, hop — results align, carrots shared! 🥕🐇


coderabbitai bot left a comment


Actionable comments posted: 6

🔭 Outside diff range comments (2)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (1)

27-35: TLS verification is disabled by default (verify=False); make it configurable and secure by default

Disabling TLS verification is risky. At minimum, gate it via an env var and default to verification enabled.

Proposed change:

-            # enable verify, currently for eval it is set to False
-            self.client = httpx.Client(base_url=self.endpoint, verify=False)
+            # Allow toggling TLS verification; default to secure True
+            verify_env = os.getenv("AGENT_API_VERIFY", "true").lower()
+            verify_tls = verify_env in ("1", "true", "yes")
+            self.client = httpx.Client(base_url=self.endpoint, verify=verify_tls)

If the eval environment requires self-signed certs, set AGENT_API_VERIFY=false explicitly.

lsc_agent_eval/README.md (1)

221-229: Update evaluation flow wording for multi-type and tool_eval

The step still references a single eval_type. Adjust to eval_types and include tool_eval to align with implementation.

-     - Execute evaluation based on eval_type (either sub-string, judge-llm or script)
+     - Execute evaluations based on eval_types (any combination of sub-string, judge-llm, script, and tool-call comparison)
🧹 Nitpick comments (27)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (2)

31-45: Enhance debug logging with more context

The generic log message "Item %d does not match" at line 43 lacks context about what type of item failed. Consider including the comparison type for better debugging.

 def _compare_lists(
     expected: list,
     actual: list,
     compare_func: Callable[[Any, Any], bool],
     mismatch_message: str,
 ) -> bool:
     """Compare two lists using a comparison function."""
     if len(expected) != len(actual):
         logger.debug(mismatch_message, len(expected), len(actual))
         return False
     for i, (expected_item, actual_item) in enumerate(zip(expected, actual)):
         if not compare_func(expected_item, actual_item):
-            logger.debug("Item %d does not match", i)
+            logger.debug("Item %d does not match in %s", i, compare_func.__name__)
             return False
     return True

67-82: Consider removing redundant length check

The length comparison at lines 76-82 is redundant since the subsequent iteration over expected.items() will inherently catch any missing keys in actual. The check at line 86-88 already handles this case.

Either keep the length check for early exit optimization or remove it for simpler code:

 def _compare_tool_arguments(expected: dict, actual: dict) -> bool:
     """Compare tool arguments."""
     if not isinstance(expected, dict) or not isinstance(actual, dict):
         logger.debug(
             "Argument type mismatch: expected dict, got %s and %s",
             type(expected),
             type(actual),
         )
         return False
-    if len(expected) != len(actual):
-        logger.debug(
-            "Argument count mismatch: expected %d args, got %d args",
-            len(expected),
-            len(actual),
-        )
-        return False

     # Direct comparison is not done to have better debugging ability
     for key, expected_value in expected.items():
         if key not in actual:
             logger.debug("Missing argument key: '%s'", key)
             return False
         actual_value = actual[key]
         if expected_value != actual_value:
             logger.debug(
                 "Argument value mismatch for '%s': expected %s, got %s",
                 key,
                 expected_value,
                 actual_value,
             )
             return False
+    
+    # Check for extra keys in actual
+    extra_keys = set(actual.keys()) - set(expected.keys())
+    if extra_keys:
+        logger.debug("Unexpected argument keys: %s", extra_keys)
+        return False

     return True
lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1)

23-23: Use constants for string prefixes instead of hardcoded indices

Using hardcoded indices (line 23: [6:] and line 81: [5:]) for removing prefixes is fragile and could break if prefix strings change.

+# Add at module level after imports
+DATA_PREFIX = "data: "
+TOOL_PREFIX = "Tool:"
+ARGUMENTS_SEPARATOR = " arguments:"
+
 def parse_streaming_response(response: httpx.Response) -> dict[str, Any]:
     """Parse streaming response and extract data."""
     conversation_id = ""
     final_response = ""
     tool_calls = []
 
     for line in response.iter_lines():
         line = line.strip()
-        if not line or not line.startswith("data: "):
+        if not line or not line.startswith(DATA_PREFIX):
             continue
 
-        json_data = line[6:]  # Remove "data: " prefix
+        json_data = line[len(DATA_PREFIX):]  # Remove data prefix
         parsed_data = _parse_streaming_line(json_data)

And similarly for line 81:

-        tool_name = parts[0][5:]  # Remove "Tool:" prefix
+        tool_name = parts[0][len(TOOL_PREFIX):]  # Remove Tool prefix

Also applies to: 81-81

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)

163-190: Consider extracting validation logic for better maintainability

The validation logic for each eval type could be extracted into a mapping for better maintainability as new types are added.

# At module level
EVAL_TYPE_REQUIREMENTS = {
    "response_eval:accuracy": ("expected_response", "requires non-empty 'expected_response'"),
    "response_eval:sub-string": ("expected_keywords", "requires non-empty 'expected_keywords'"),
    "action_eval": ("eval_verify_script", "requires non-empty 'eval_verify_script'"),
    "tool_eval": ("expected_tool_calls", "requires non-empty 'expected_tool_calls'"),
}

# In validate_eval_requirements method
for eval_type in self.eval_types:
    if eval_type in EVAL_TYPE_REQUIREMENTS:
        field_name, error_msg = EVAL_TYPE_REQUIREMENTS[eval_type]
        field_value = getattr(self, field_name)
        if not field_value or (isinstance(field_value, list) and len(field_value) == 0):
            raise ValueError(
                f"eval_types containing '{eval_type}' {error_msg}"
            )
lsc_agent_eval/tests/test_agent_eval.py (1)

55-57: Endpoint type override tested; add a negative test for invalid values

The explicit override to "query" is validated. Consider adding a test for an invalid value (e.g., "--endpoint_type invalid") to assert argparse exits with SystemExit, catching regressions in choices.

Example:

def test_args_parser_invalid_endpoint_type():
    with pytest.raises(SystemExit):
        _args_parser(["--eval_data_yaml","test.yaml","--agent_provider","x","--agent_model","y","--endpoint_type","invalid"])

Also applies to: 70-71

lsc_agent_eval/tests/core/agent_goal_eval/test_results.py (1)

135-135: Extend stats coverage to tool_eval (optional)

Current assertions cover response_eval and action_eval. Consider adding a tool_eval result into sample_results and verifying its inclusion in by_eval_type to exercise the new path.

Example extension:

sample_results.append(EvaluationResult(
    eval_id="test_004", query="Call tool", response="...",
    eval_type="tool_eval", result="PASS",
    conversation_group="conv2", conversation_id="conv-id-456",
))
# Then assert "tool_eval" in stats.by_eval_type and passed count increments.

Also applies to: 156-159, 222-225

lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1)

1-109: Comprehensive tool call comparison tests; add sequence-order and superset-args cases

Great coverage. Two optional additions:

  • If order between sequences is intended to matter, add a negative test where sequences are swapped; if order should not matter, add a positive test and update impl accordingly.
  • Add a case where actual has extra arguments beyond expected to define whether superset matches are allowed.
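# Hypothetical additions to test_tool_call_eval.py. The import path mirrors the
# existing tests; the expected outcomes assume strict, order-sensitive matching with
# no superset tolerance, which is what the current implementation appears to do.
# Key names ("tool_name"/"arguments") follow the README; adjust if fixtures use "name".
from lsc_agent_eval.core.agent_goal_eval.tool_call_eval import compare_tool_calls


def test_sequence_order_matters():
    expected = [
        [{"tool_name": "create_namespace", "arguments": {"name": "lightspeed"}}],
        [{"tool_name": "list_versions", "arguments": {}}],
    ]
    actual = list(reversed(expected))
    # If order is intended to matter, swapped sequences must not match
    assert compare_tool_calls(expected, actual) is False


def test_superset_arguments_rejected():
    expected = [[{"tool_name": "create_namespace", "arguments": {"name": "lightspeed"}}]]
    actual = [[{"tool_name": "create_namespace",
                "arguments": {"name": "lightspeed", "dry_run": True}}]]
    # Strict matching should reject extra arguments beyond the expected set
    assert compare_tool_calls(expected, actual) is False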
lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (1)

192-195: Assert endpoint_type is forwarded to runner (optional)

To prevent regressions, assert that run_evaluation is called with endpoint_type from args.

Example:

mock_args.endpoint_type = "query"
# ... after evaluator.run_evaluation()
calls = mock_evaluation_runner.return_value.run_evaluation.call_args_list
for call in calls:
    # Support both kwargs and positional
    if "endpoint_type" in call.kwargs:
        assert call.kwargs["endpoint_type"] == "query"
    else:
        # If positional, ensure the last arg equals "query"
        assert call.args[-1] == "query"
lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py (1)

30-34: Broad eval_types migration looks correct; refine messages and add tool_eval coverage

  • Good migration to eval_types lists across fixtures and validations.
  • Minor nit: tests and/or model error messages still say “eval_type must be one of...”. Consider updating to “eval_types contains invalid value...” for clarity.
  • Add a test for tool_eval requiring expected_tool_calls and its positive/negative cases.

Example tool_eval negative:

invalid = [{
  "conversation_group": "cg",
  "conversation": [{
    "eval_id": "t1", "eval_query": "q",
    "eval_types": ["tool_eval"]
  }]
}]
with (...patch opens...):
    with pytest.raises(EvaluationDataError, match="expected_tool_calls"):
        AgentGoalEvalDataManager("test.yaml")

Also applies to: 42-43, 56-58, 62-64, 68-70, 158-160, 201-203, 226-227, 277-279, 305-306, 330-331, 355-356, 380-381, 391-392, 516-517

lsc_agent_eval/tests/core/utils/test_api_client.py (3)

265-295: Streaming error handling tests are good; add a timeout case (optional)

You’ve covered parser ValueError and HTTP errors. Consider adding a TimeoutException scenario to assert the specific timeout message.

Example:

def test_streaming_query_agent_timeout():
    mock_client = Mock()
    mock_client.stream.side_effect = httpx.TimeoutException("Timeout")
    with patch("httpx.Client", return_value=mock_client):
        client = AgentHttpClient("http://localhost:8080")
        with pytest.raises(AgentAPIError, match="streaming query timeout"):
            client.streaming_query_agent({"query":"q","provider":"p","model":"m"})

Also applies to: 296-314


55-64: Tighten env token test to assert the exact environment key used

Currently patching os.getenv to return a token for any key can mask regressions. Assert the key being queried is the expected one (e.g., AGENT_API_TOKEN).

Example:

with patch("os.getenv") as getenv:
    getenv.side_effect = lambda k, default=None: "env-token-456" if k == "AGENT_API_TOKEN" else default
    AgentHttpClient("http://localhost:8080")
    getenv.assert_called()  # Optionally assert it was called with "AGENT_API_TOKEN"

3-4: TLS verify=False: make this configurable for production use

AgentHttpClient creates httpx.Client with verify=False. For production-grade systems, consider a configuration (env var/CLI/arg) to enable TLS verification and optionally supply CA bundle. Tests can override to False; default should be secure.

lsc_agent_eval/tests/core/utils/test_streaming_parser.py (2)

30-46: Add coverage for multiple tool_call events and sequence formatting

Current test validates a single tool_call producing a single-item sequence. Add a case with multiple tool_call lines to assert:

  • len(result["tool_calls"]) == number of tool_call events
  • Each entry is a single-item sequence (or per your grouping rules)

I can propose an additional test snippet if helpful.


66-80: Test JSON-decode fallback for tool-call arguments

_streaming_parser._parse_tool_call returns {} when arguments JSON is invalid. Add a test asserting this behavior to lock in robustness.

Example addition:

@@ def test_tool_call_parsing(self):
-        # Invalid format
-        assert _parse_tool_call("invalid format") is None
+        # Invalid format
+        assert _parse_tool_call("invalid format") is None
+
+        # Invalid JSON should fallback to empty arguments
+        result = _parse_tool_call('Tool:bad arguments:{not-json}')
+        assert result["name"] == "bad"
+        assert result["arguments"] == {}
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (1)

69-80: Tool-calls reformatting looks correct; consider stricter typing on input

Mapping name -> tool_name and args -> arguments aligns with downstream expectations. Strengthen type hints to signal expected shape.

-    def _format_query_endpoint_tool_calls(
-        self, tool_calls: list
-    ) -> list[list[dict[str, Any]]]:
+    def _format_query_endpoint_tool_calls(
+        self, tool_calls: list[dict[str, Any]]
+    ) -> list[list[dict[str, Any]]]:

Optionally, guard non-list inputs with a quick isinstance check returning [].

Also applies to: 91-108

lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (2)

221-255: Consider adding a test for eval_types deduplication and order preservation

Models deduplicate eval_types while preserving order. Add a case like ["action_eval", "action_eval", "response_eval:accuracy"] and assert order and unique results.

I can draft the test if you want it included here.
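A possible shape for that test, assuming EvaluationDataConfig is importable from core.agent_goal_eval.models and accepts these keyword fields (the query, script path, and expected response are placeholders):

# Hypothetical test for deduplication with order preservation.
from lsc_agent_eval.core.agent_goal_eval.models import EvaluationDataConfig


def test_eval_types_deduplicated_in_order():
    config = EvaluationDataConfig(
        eval_id="dedup_1",
        eval_query="What is the latest OpenShift version?",
        eval_types=["action_eval", "action_eval", "response_eval:accuracy"],
        eval_verify_script="scripts/verify.sh",        # required by action_eval
        expected_response="The latest version is X.",  # required by response_eval:accuracy
    )
    # Duplicates removed, original order preserved
    assert config.eval_types == ["action_eval", "response_eval:accuracy"]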


498-548: EvaluationStats test is solid; add an ERROR scenario for completeness

Current stats cover PASS/FAIL across conversations and types. Add an EvaluationResult(result="ERROR") to assert errored count increments and success_rate adjusts accordingly.

Happy to provide a concrete snippet.

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1)

190-205: Minor: unreachable None-check and clearer failure reason for no tool calls

Because run_evaluation passes tool_calls or [] to _evaluate_tools, actual_tool_calls is never None. Consider removing the None branch and failing explicitly on empty actual calls to improve diagnostics.

-        if actual_tool_calls is None:
-            logger.error("No tool calls provided for evaluation")
-            return False
+        if not actual_tool_calls:
+            logger.error("No tool calls captured from agent for tool_eval")
+            return False
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (5)

132-139: Solid judge-LLM success path; minor assertion enhancement

The test validates list result semantics, fields, and exact streaming payload with conversation_id. Consider also asserting that query_agent was not called to lock in routing behavior when endpoint_type is not provided (defaults to streaming).

+        # Ensure non-streaming path was not used
+        mock_agent_client.query_agent.assert_not_called()

Also applies to: 141-147, 151-158


423-454: Add assertion that tool_calls propagate into tool_eval results

Since EvaluationResult.tool_calls is populated only for tool_eval, assert it in success case to guard the data path end-to-end.

         results = runner.run_evaluation(config, "openai", "gpt-4")
 
         assert len(results) == 1
         assert results[0].result == "PASS"
         assert results[0].eval_type == "tool_eval"
+        # Tool calls must be attached for tool_eval results
+        assert results[0].tool_calls == [[{"name": "list_versions", "arguments": {}}]]

456-486: Mirror the tool_calls assertion in the failure case

Even when the comparison fails, tool_calls should still be present for diagnostic fidelity.

         results = runner.run_evaluation(config, "openai", "gpt-4")
 
         assert len(results) == 1
         assert results[0].result == "FAIL"
         assert results[0].eval_type == "tool_eval"
+        # Tool calls should be present for debugging even on FAIL
+        assert results[0].tool_calls == [[{"name": "wrong_tool", "arguments": {}}]]

517-590: Great all-pass multi-type test; add checks for tool_calls presence only on tool_eval

Strengthens the contract that tool_calls is set for tool_eval and omitted for others.

         assert all(r.conversation_id == "conv-123" for r in results)
+        # Only tool_eval should carry tool_calls; others should be None
+        by_type = {r.eval_type: r for r in results}
+        assert by_type["tool_eval"].tool_calls == [
+            [{"name": "create_namespace", "arguments": {"name": "lightspeed"}}]
+        ]
+        assert by_type["action_eval"].tool_calls is None
+        assert by_type["response_eval:sub-string"].tool_calls is None
+        assert by_type["response_eval:accuracy"].tool_calls is None

703-739: Endpoint routing parity test covers both modes

Ensures both streaming and query paths yield equivalent results and that each endpoint is invoked as expected. Consider a follow-up test variant for tool_eval on query endpoint to also validate tool_calls formatting on non-streaming responses.

+    # Optional: add a tool_eval config here and assert tool_calls present/equivalent
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (2)

172-178: Use a safe default for endpoint_type to avoid AttributeError

If eval_args is missing endpoint_type (older configs), getattr with a default maintains backward compatibility.

-                eval_results = self.evaluation_runner.run_evaluation(
+                eval_results = self.evaluation_runner.run_evaluation(
                     eval_data,
                     self.eval_args.agent_provider,
                     self.eval_args.agent_model,
                     conversation_id,
-                    self.eval_args.endpoint_type,
+                    getattr(self.eval_args, "endpoint_type", "streaming"),
                 )

193-231: Print tool_calls only when relevant

Currently prints Tool Calls unconditionally when overall_result != PASS. For non-tool_eval batches this is None and can be noisy. Print tool_calls only if present or if any eval_type == "tool_eval".

-            pbar.write(f"   Tool Calls: {results[0].tool_calls}")
+            # Print tool calls only if present (tool_eval) to reduce noise
+            if any(r.tool_calls for r in results):
+                pbar.write(f"   Tool Calls: {results[0].tool_calls}")
lsc_agent_eval/README.md (2)

115-135: YAML example is good; quote hyphenated keyword for consistency

YAML allows unquoted hyphenated scalars, but quoting improves readability and avoids linter warnings in flow-style lists.

-      expected_keywords: ["yes", openshift-lightspeed]
+      expected_keywords: ["yes", "openshift-lightspeed"]

168-179: Usage section reflects endpoint_type and token handling

CLI demonstrates streaming default. Consider punctuating the note for clarity.

-Pass token text file or set `AGENT_API_TOKEN` env var
+Pass token text file or set `AGENT_API_TOKEN` env var.

Also applies to: 180-181

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ffad71b and e9940eb.

📒 Files selected for processing (21)
  • lsc_agent_eval/README.md (8 hunks)
  • lsc_agent_eval/sample_data/agent_goal_eval_example.yaml (4 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/agent_eval.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (3 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (3 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (5 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/__init__.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (4 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (4 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py (13 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (14 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (21 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_results.py (10 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1 hunks)
  • lsc_agent_eval/tests/core/utils/test_api_client.py (4 hunks)
  • lsc_agent_eval/tests/core/utils/test_streaming_parser.py (1 hunks)
  • lsc_agent_eval/tests/test_agent_eval.py (3 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.

Applied to files:

  • lsc_agent_eval/tests/core/agent_goal_eval/test_results.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
  • lsc_agent_eval/sample_data/agent_goal_eval_example.yaml
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
  • lsc_agent_eval/README.md
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • lsc_agent_eval/tests/core/agent_goal_eval/test_results.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
  • lsc_agent_eval/sample_data/agent_goal_eval_example.yaml
  • lsc_agent_eval/README.md
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • lsc_agent_eval/tests/core/agent_goal_eval/test_results.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • lsc_agent_eval/tests/core/agent_goal_eval/test_results.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
  • lsc_agent_eval/README.md
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).

Applied to files:

  • lsc_agent_eval/tests/core/agent_goal_eval/test_results.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
  • lsc_agent_eval/README.md
📚 Learning: 2025-07-16T10:13:42.924Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:112-112
Timestamp: 2025-07-16T10:13:42.924Z
Learning: In the lsc_agent_eval judge LLM evaluation system, the _extract_numeric_result method should only accept numeric values 1 or 0 from judge responses. Boolean keywords like "true", "false", "yes", "no" are not acceptable values for judge LLM responses.

Applied to files:

  • lsc_agent_eval/tests/core/agent_goal_eval/test_results.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
🧬 Code Graph Analysis (11)
lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1)
  • compare_tool_calls (9-18)
lsc_agent_eval/src/lsc_agent_eval/core/utils/__init__.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1)
  • parse_streaming_response (12-54)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (4)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1)
  • compare_tool_calls (9-18)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
  • create_evaluation_results (8-56)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
  • streaming_query_agent (109-137)
  • query_agent (50-89)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (3)
  • ScriptExecutionError (16-17)
  • JudgeModelError (20-21)
  • AgentAPIError (12-13)
lsc_agent_eval/tests/core/utils/test_streaming_parser.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (2)
  • _parse_tool_call (69-92)
  • parse_streaming_response (12-54)
lsc_agent_eval/tests/core/utils/test_api_client.py (2)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (3)
  • query_agent (50-89)
  • AgentHttpClient (15-142)
  • streaming_query_agent (109-137)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1)
  • AgentAPIError (12-13)
lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (4)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)
  • EvaluationDataConfig (89-192)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1)
  • run_evaluation (34-106)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (1)
  • run_evaluation (59-94)
lsc_agent_eval/tests/core/agent_goal_eval/test_results.py (1)
  • sample_results (20-50)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)
  • EvaluationDataConfig (89-192)
  • EvaluationResult (261-288)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1)
  • AgentAPIError (12-13)
lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1)
  • parse_streaming_response (12-54)
lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)
  • EvaluationResult (261-288)
  • EvaluationDataConfig (89-192)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (2)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
  • create_evaluation_results (8-56)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1)
  • run_evaluation (34-106)
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (7)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
  • streaming_query_agent (109-137)
  • query_agent (50-89)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (2)
  • run_evaluation (34-106)
  • EvaluationRunner (20-208)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (1)
  • run_evaluation (59-94)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)
  • EvaluationResult (261-288)
  • EvaluationDataConfig (89-192)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (2)
  • AgentAPIError (12-13)
  • ScriptExecutionError (16-17)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py (1)
  • run_script (28-76)
lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1)
  • evaluate_response (85-126)
🔇 Additional comments (33)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)

119-138: LGTM! Well-structured validation logic

The validation for eval_types is well-implemented:

  • Properly handles duplicates while preserving order
  • Validates each type individually
  • Ensures non-empty result
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py (1)

131-132: LGTM! Correctly updated for multi-type evaluation

The loop properly iterates over the new eval_types list structure to collect statistics for each evaluation type.

lsc_agent_eval/src/lsc_agent_eval/agent_eval.py (1)

38-44: ✅ Endpoint_type propagation verified—no action needed

  • The --endpoint_type argument is defined in agent_eval.py (lines 38–44).
  • It’s passed into AgentGoalEval and forwarded in core/agent_goal_eval/agent_goal_eval.py (lines 172–178).
  • The evaluator in core/agent_goal_eval/evaluator.py (lines 52–55) uses endpoint_type to switch between streaming and query calls.

Optional: for improved maintainability, you could centralize the allowed choices in a constant and log the selected endpoint at startup.

lsc_agent_eval/tests/test_agent_eval.py (1)

32-33: Good coverage for default endpoint_type

Asserting the default "streaming" improves confidence in CLI compatibility.

lsc_agent_eval/src/lsc_agent_eval/core/utils/__init__.py (1)

1-7: Re-export of parse_streaming_response via utils is clean

Public API ergonomics improved and `__all__` accurately reflects exports.

lsc_agent_eval/tests/core/agent_goal_eval/test_results.py (1)

27-31: Updated eval_type labels align with the new multi-type schema

Switching to response_eval:accuracy, action_eval, and response_eval:sub-string is consistent with models and evaluator changes.

Also applies to: 36-40, 45-49

lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (1)

44-46: eval_types transition is correct

Using eval_types lists in fixtures and updated eval_type strings in results matches the multi-type refactor.

Also applies to: 50-52, 64-65, 73-74

lsc_agent_eval/tests/core/utils/test_api_client.py (3)

70-99: Non-streaming response shape and tool_calls formatting validated

The test correctly asserts dict return and formatting into sequences.


106-135: Empty tool_calls handling covered

Good assertion that an empty list returns [] after formatting.


217-265: Streaming path and parser integration coverage is solid

Validates endpoint, timeout, and parser invocation. Nice use of context manager mocking for stream.

lsc_agent_eval/tests/core/utils/test_streaming_parser.py (3)

16-29: Basic streaming parsing test is solid

Covers conversation_id/response extraction and empty tool_calls. Good baseline.


47-65: Error-path expectations are correct

Confirms ValueError messages for missing final response and missing conversation_id. Matches parser behavior.


81-96: Malformed input handling test is valuable

Validates ignoring malformed/empty/non-data lines while still producing final results. Good resilience coverage.

lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (1)

109-137: Streaming query path: LGTM

Uses httpx streaming and parse_streaming_response with appropriate error mapping. Fits the new streaming-first flow.

lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (2)

147-164: Adoption of eval_types across model tests is correct

Validates per-type requirements (expected_response for accuracy, etc.). Matches model validators.


256-330: Tool-eval config tests look good, including complex sequences and missing/empty validations

Structure aligns with downstream tool-call comparison.

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1)

75-91: Per-type error handling aligns with FAIL vs ERROR semantics

Exceptions during a single evaluation become ERROR; boolean false becomes FAIL. Matches documented behavior.

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)

35-56: Per-evaluation result construction: LGTM

Correctly includes tool_calls only for tool_eval and raises on missing inputs.

lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (9)

29-36: Good agent client mock shape and endpoint parity

Mock returns response, conversation_id passthrough/generation, and tool_calls defaulting to empty list; both streaming and query endpoints wired via side_effect. This aligns with the API client contract and keeps tests endpoint-agnostic.

Also applies to: 38-39


62-63: Migration to eval_types looks correct across fixtures

Fixtures consistently use eval_types (list) per the new multi-type model. This prevents accidental single-type assumptions in tests.

Also applies to: 86-87, 96-97


261-272: Substring success test OK; covers case-insensitivity via separate test

Response contains both keywords with different cases; pass condition follows the stricter “all keywords” rule. Looks good.

Also applies to: 275-286


294-304: Substring failure path verified

Negative case correctly checks FAIL with no error. Test isolates absence of required substrings.

Also applies to: 307-318


325-336: Agent API error handling verified

ERROR result with preserved eval_type and error message validated. This matches the contract of converting API exceptions to per-type ERRORs.

Also applies to: 338-343


355-357: Comprehensive substring evaluation semantics covered

All/one/none keywords and case-insensitive scenarios are exercised and match the expected AND semantics from prior learnings.

Also applies to: 360-371, 373-373, 376-387, 389-389, 392-403, 405-405, 408-419, 421-421


495-505: Conversation ID propagation coverage is on point

Test ensures provided conversation_id flows to API and result objects. Good guard for multi-type expansion.

Also applies to: 508-515


591-649: Mixed-results multi-type test is well-constructed

Validates each type independently with expected PASS/FAIL states and shared conversation_id. Nicely isolates tool mismatch and script/judge failures.


650-685: Error propagation across multiple eval types looks correct

Action eval raises ScriptExecutionError resulting in ERROR; sub-string and accuracy behave independently. This mirrors per-type error handling.

Also applies to: 687-702

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (4)

17-17: Importing create_evaluation_results aligns with multi-result flow

Brings in the new utility to standardize result creation across success/error cases.


116-122: Setup failure now generates per-type ERROR results

Extending results with create_evaluation_results for each eval_data is the right behavior under multi-type semantics. Message prefix “Setup script failed:” provides helpful context.

Also applies to: 121-123


181-184: Conversation ID handoff from first eval results is correct

Grabbing conversation_id from the first returned EvaluationResult ensures subsequent evals share context without requiring direct access to the API response.


219-224: Readable, informative multi-result summary

Concise roll-up per eval_id with per-type markers and error messages when applicable. Nicely balances signal vs verbosity.

Also applies to: 245-254

lsc_agent_eval/README.md (2)

9-13: Feature list correctly introduces multi-type and tool_eval

Clear enumeration of available evaluation types and their semantics, including the “ALL keywords” rule.


270-271: By Evaluation Type metrics include tool_eval

This aligns with the new evaluation type set and output expectations.

VladimirKadlec (Contributor) left a comment


Some nit comments; it's a little too long for a review.

asamal4 (Collaborator, Author) commented Aug 12, 2025

@coderabbitai review

coderabbitai bot commented Aug 12, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (2)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (1)

50-81: Harden parsing: validate types before calling .strip() and normalize inputs

If the agent returns unexpected types, current code will raise generic exceptions. Fail fast with targeted messages and normalize conversation_id and tool_calls types.

Apply this diff:

-            response_data = response.json()
-            if "response" not in response_data:
-                raise AgentAPIError("Agent response missing 'response' field")
-
-            agent_response = response_data["response"].strip()
-            conversation_id = response_data.get("conversation_id", "").strip()
-            tool_calls = response_data.get("tool_calls", [])
+            response_data = response.json()
+            if "response" not in response_data:
+                raise AgentAPIError("Agent response missing 'response' field")
+
+            resp_value = response_data["response"]
+            if not isinstance(resp_value, str):
+                raise AgentAPIError("Agent response field 'response' must be a string")
+            agent_response = resp_value.strip()
+
+            # Normalize conversation_id to string
+            raw_conv_id = response_data.get("conversation_id", "")
+            conversation_id = str(raw_conv_id).strip() if raw_conv_id is not None else ""
+
+            # Normalize tool_calls to a list
+            tool_calls = response_data.get("tool_calls") or []
+            if not isinstance(tool_calls, list):
+                tool_calls = []
 
             # Format tool calls to match expected structure (list of sequences)
             formatted_tool_calls = self._format_query_endpoint_tool_calls(tool_calls)
 
             return {
                 "response": agent_response,
                 "conversation_id": conversation_id,
                 "tool_calls": formatted_tool_calls,
             }
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (1)

712-748: Handle unsupported endpoint_type gracefully and add missing test

The current run_evaluation implementation will raise an uncaught ValueError for invalid endpoint_type, causing a crash instead of returning a structured ERROR result. Let’s update the exception handling and add coverage:

• In lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py, extend the outer except to include ValueError so that an unsupported endpoint yields an ERROR result:

@@ def run_evaluation(…, endpoint_type: str = "streaming") -> list["EvaluationResult"]:
-        else:
-            raise ValueError(f"Unsupported endpoint_type: {endpoint_type}")
+        else:
+            raise ValueError(f"Unsupported endpoint_type: {endpoint_type}")
-except (AgentAPIError, ScriptExecutionError, JudgeModelError) as e:
+except (AgentAPIError, ScriptExecutionError, JudgeModelError, ValueError) as e:
     logger.error("Evaluation failed for %s: %s", data_config.eval_id, e)
     return create_evaluation_results(
         data_config, error_message=str(e), conversation_id=conversation_id
     )

• In lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py, add:

def test_run_evaluation_invalid_endpoint_type_returns_error(
    mock_agent_client, mock_script_runner, mock_judge_manager, sample_config_judge_llm
):
    runner = EvaluationRunner(mock_agent_client, mock_script_runner, mock_judge_manager)
    results = runner.run_evaluation(
        sample_config_judge_llm,
        "watsonx",
        "ibm/granite-3-3-8b-instruct",
        "conv-id-123",
        endpoint_type="invalid-mode",
    )
    assert len(results) == 1
    # The eval_type should still be the configured response_eval type
    assert results[0].eval_type == sample_config_judge_llm.eval_types[0]
    assert results[0].result == "ERROR"
    assert "Unsupported endpoint_type" in (results[0].error or "")

This ensures invalid modes no longer crash and conform to FAIL vs ERROR semantics.

♻️ Duplicate comments (2)
lsc_agent_eval/README.md (1)

66-71: Great: tool_eval and expected_tool_calls are now documented

This addresses the previous doc gap and matches the models and examples.

lsc_agent_eval/sample_data/agent_goal_eval_example.yaml (1)

106-111: Update eval_types in sample_data/agent_goal_eval_example.yaml

The eval_types list must use tool_eval (not tool_call) to pass validation.

• File: lsc_agent_eval/sample_data/agent_goal_eval_example.yaml, around line 106

Apply this diff:

       eval_types:
-        - tool_call
+        - tool_eval
         - action_eval
         - response_eval:sub-string
         - response_eval:accuracy
🧹 Nitpick comments (11)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1)

21-28: Tighten type annotations for clarity and static checks

These helpers accept dicts with arbitrary structure; being explicit helps both readers and linters.

Apply this diff to improve the annotations:

-def _compare_tool_call_sequence(expected: list[dict], actual: list[dict]) -> bool:
+def _compare_tool_call_sequence(
+    expected: list[dict[str, Any]], actual: list[dict[str, Any]]
+) -> bool:
@@
-def _compare_single_tool_call(expected: dict, actual: dict) -> bool:
+def _compare_single_tool_call(expected: dict[str, Any], actual: dict[str, Any]) -> bool:
@@
-def _compare_tool_arguments(expected: dict, actual: dict) -> bool:
+def _compare_tool_arguments(expected: dict[str, Any], actual: dict[str, Any]) -> bool:

Also applies to: 48-49, 67-69

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)

23-36: Improve error message readability for invalid eval_type

Using dict_keys([...]) in error messages is noisy. Present a clean, comma-separated list.

Apply this diff:

-    if eval_type not in EVAL_TYPE_REQUIREMENTS:
-        raise ValueError(
-            f"eval_type must be one of {EVAL_TYPE_REQUIREMENTS.keys()}, got '{eval_type}'"
-        )
+    if eval_type not in EVAL_TYPE_REQUIREMENTS:
+        allowed = ", ".join(sorted(EVAL_TYPE_REQUIREMENTS.keys()))
+        raise ValueError(
+            f"eval_type must be one of [{allowed}], got '{eval_type}'"
+        )

100-107: Include tool_eval in the field description

The eval_types field description misses tool_eval, which is otherwise supported across the codebase and README.

Apply this diff:

-        description=(
-            "List of evaluation types."
-            " -> action_eval, response_eval:sub-string, response_eval:accuracy"
-        ),
+        description=(
+            "List of evaluation types."
+            " -> action_eval, response_eval:sub-string, response_eval:accuracy, tool_eval"
+        ),
lsc_agent_eval/README.md (2)

116-136: Consider clarifying tool_calls shape across endpoints

To reduce confusion, add a short note that:

  • query endpoint: each tool call is wrapped as its own single-item sequence
  • streaming endpoint: tool calls are grouped per turn in a single sequence, in order

Apply this diff to add a clarifying note just after expected_tool_calls:

 - `expected_tool_calls`: Expected tool call sequences (list of lists) with tool_name and arguments (for tool_eval)
+  Note: On the non-streaming "query" endpoint, each tool call is wrapped as its own single-item sequence.
+  On the "streaming" endpoint, multiple tool calls within a turn are grouped in a single sequence in order of occurrence.
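If useful, the shape difference can also be shown with a tiny example (hypothetical tool calls; keys follow the README's tool_name/arguments wording):

# The same two tool calls, shown in both shapes.

# query endpoint: each call becomes its own single-item sequence
query_tool_calls = [
    [{"tool_name": "create_namespace", "arguments": {"name": "lightspeed"}}],
    [{"tool_name": "list_versions", "arguments": {}}],
]

# streaming endpoint: calls from one turn grouped into a single sequence, in order
streaming_tool_calls = [
    [
        {"tool_name": "create_namespace", "arguments": {"name": "lightspeed"}},
        {"tool_name": "list_versions", "arguments": {}},
    ],
]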

145-147: Minor grammar fix in example

Use “an” before “openshift-lightspeed”.

Apply this diff:

-      eval_query: is there a openshift-lightspeed namespace ?
+      eval_query: is there an openshift-lightspeed namespace?
lsc_agent_eval/tests/core/utils/test_api_client.py (1)

217-313: Optional: add a streaming timeout test for parity

For completeness, add a timeout test to mirror the non-streaming timeout coverage.

Apply this addition to the test file:

+    def test_streaming_query_agent_timeout(self):
+        """Test streaming agent query with timeout."""
+        mock_client = Mock()
+        mock_client.stream.side_effect = httpx.TimeoutException("Request timeout")
+
+        with patch("httpx.Client", return_value=mock_client):
+            client = AgentHttpClient("http://localhost:8080")
+            api_input = {"query": "Test query", "provider": "openai", "model": "gpt-4"}
+            with pytest.raises(AgentAPIError, match="Agent streaming query timeout"):
+                client.streaming_query_agent(api_input)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (1)

91-108: Be defensive in tool_calls formatting (skip invalid entries)

Guard against non-list inputs and skip malformed tool call objects instead of assuming shape.

Apply this diff:

     def _format_query_endpoint_tool_calls(
         self, tool_calls: list
     ) -> list[list[dict[str, Any]]]:
         """Format tool calls from query endpoint to match expected structure."""
-        if not tool_calls:
+        if not tool_calls:
             return []
 
-        formatted_sequences = []
-        for tool_call in tool_calls:
-            # OLS dependency
-            formatted_tool = {
-                "tool_name": tool_call.get("name", ""),
-                "arguments": tool_call.get("args", {}),
-            }
-            formatted_sequences.append([formatted_tool])
+        if not isinstance(tool_calls, list):
+            return []
+
+        formatted_sequences: list[list[dict[str, Any]]] = []
+        for tool_call in tool_calls:
+            if not isinstance(tool_call, dict):
+                continue
+            # OLS dependency
+            formatted_tool = {
+                "tool_name": tool_call.get("name", ""),
+                "arguments": tool_call.get("args", {}),
+            }
+            formatted_sequences.append([formatted_tool])
 
         return formatted_sequences
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (2)

59-62: Make agent_response parsing resilient to missing keys

If the streaming parser ever omits tool_calls, this will KeyError and crash. Safer to use .get with defaults (especially for tool_calls) so we keep producing FAIL/ERROR results instead of aborting.

Apply this diff:

-            response = agent_response["response"]
-            conversation_id = agent_response["conversation_id"]
-            tool_calls = agent_response["tool_calls"]
+            response = agent_response.get("response", "")  # required by models
+            conversation_id = agent_response.get("conversation_id")
+            tool_calls = agent_response.get("tool_calls", [])

125-129: Clarify TODO or remove it

“TODO(future): We should do tool_eval always ??” is ambiguous. Either capture the intended behavior in an issue or remove the comment.

Proposed rewording:

  • “Consider always running tool_eval if tool_calls are present in the response, even when not requested; evaluate trade-offs.”
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)

8-16: Strengthen typing for evaluation_results to avoid key typos

evaluation_results is typed as list[dict], which is easy to misuse. Define a TypedDict for the items and use it in the signature for better static checking.

Apply this diff:

-from typing import Optional
+from typing import Optional, TypedDict
@@
-def create_evaluation_results(  # pylint: disable=too-many-arguments,too-many-positional-arguments
+class EvalResultItem(TypedDict):
+    eval_type: str
+    result: str
+    error: Optional[str]
+
+def create_evaluation_results(  # pylint: disable=too-many-arguments,too-many-positional-arguments
     eval_config: EvaluationDataConfig,
     response: str = "",
-    evaluation_results: Optional[list[dict]] = None,
+    evaluation_results: Optional[list[EvalResultItem]] = None,
     error_message: Optional[str] = None,
     conversation_id: Optional[str] = None,
     tool_calls: Optional[list[list[dict]]] = None,
 ) -> list[EvaluationResult]:
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (1)

185-191: Optional: validate endpoint_type early at CLI layer

You’re passing getattr(self.eval_args, "endpoint_type", "streaming") through. Consider validating it against {"streaming", "query"} here to fail fast at the orchestrator layer (still keeping runner-side validation). This improves error messaging near the user input surface.

Example:

endpoint = getattr(self.eval_args, "endpoint_type", "streaming")
if endpoint not in {"streaming", "query"}:
    raise AgentEvaluationError(f"Invalid endpoint_type: {endpoint}")
eval_results = self.evaluation_runner.run_evaluation(..., endpoint_type=endpoint)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e9940eb and 640f624.

📒 Files selected for processing (14)
  • lsc_agent_eval/README.md (9 hunks)
  • lsc_agent_eval/sample_data/agent_goal_eval_example.yaml (4 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (3 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (3 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (5 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (4 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (14 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1 hunks)
  • lsc_agent_eval/tests/core/utils/test_api_client.py (4 hunks)
  • lsc_agent_eval/tests/core/utils/test_streaming_parser.py (1 hunks)
  • lsc_agent_eval/tests/test_agent_eval.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • lsc_agent_eval/tests/core/utils/test_streaming_parser.py
  • lsc_agent_eval/tests/test_agent_eval.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).

Applied to files:

  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
  • lsc_agent_eval/README.md
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
  • lsc_agent_eval/README.md
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.

Applied to files:

  • lsc_agent_eval/README.md
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
  • lsc_agent_eval/sample_data/agent_goal_eval_example.yaml
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • lsc_agent_eval/README.md
  • lsc_agent_eval/sample_data/agent_goal_eval_example.yaml
🧬 Code Graph Analysis (6)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1)
  • AgentAPIError (12-13)
lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1)
  • parse_streaming_response (14-56)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)
  • EvaluationDataConfig (95-176)
  • EvaluationResult (245-272)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (3)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
  • create_evaluation_results (8-56)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (3)
  • ConversationDataConfig (179-242)
  • EvaluationDataConfig (95-176)
  • EvaluationResult (245-272)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1)
  • run_evaluation (34-108)
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (7)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
  • streaming_query_agent (109-137)
  • query_agent (50-89)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (2)
  • run_evaluation (34-108)
  • EvaluationRunner (20-209)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (1)
  • run_evaluation (72-107)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)
  • EvaluationResult (245-272)
  • EvaluationDataConfig (95-176)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (2)
  • AgentAPIError (12-13)
  • ScriptExecutionError (16-17)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py (1)
  • run_script (28-76)
lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1)
  • evaluate_response (85-126)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (5)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1)
  • compare_tool_calls (9-18)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
  • create_evaluation_results (8-56)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (1)
  • run_evaluation (72-107)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
  • streaming_query_agent (109-137)
  • query_agent (50-89)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (3)
  • ScriptExecutionError (16-17)
  • JudgeModelError (20-21)
  • AgentAPIError (12-13)
lsc_agent_eval/tests/core/utils/test_api_client.py (2)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (3)
  • query_agent (50-89)
  • AgentHttpClient (15-142)
  • streaming_query_agent (109-137)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1)
  • AgentAPIError (12-13)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: mypy
🔇 Additional comments (17)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (3)

9-18: Solid, minimal public API for tool-call comparison

The top-level compare_tool_calls wrapper is clear and composes well with the internal helpers. Early length check + short-circuit is appropriate.


31-45: Helpful debug logging on mismatch

Index + function name in the debug message makes triage easier when sequences are long. Good attention to diagnosability.


67-98: Strict arguments equality is intentional and well-implemented

The key-set and value equality checks (with explicit logging for missing/extra keys) align with precise tool-call validation needs.
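
For readers following the thread, this is roughly what the strict comparison amounts to — a minimal sketch using the tool_name/arguments convention from this PR, not the actual module code:

def compare_tool_calls_sketch(
    expected: list[list[dict]], actual: list[list[dict]]
) -> bool:
    """Strict match: same number of sequences, same order, exact tool_name and arguments."""
    if len(expected) != len(actual):
        return False
    for expected_seq, actual_seq in zip(expected, actual):
        if len(expected_seq) != len(actual_seq):
            return False
        for expected_call, actual_call in zip(expected_seq, actual_seq):
            if expected_call.get("tool_name") != actual_call.get("tool_name"):
                return False
            if expected_call.get("arguments", {}) != actual_call.get("arguments", {}):
                return False
    return True

expected = [[{"tool_name": "oc_get", "arguments": {"oc_get_args": ["namespaces"]}}]]
assert compare_tool_calls_sketch(expected, expected)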

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (4)

9-20: Clear, centralized eval-type requirements

The mapping cleanly encodes per-type field requirements and keeps validation logic declarative.


125-145: Eval-types validator is robust and order-preserving

Deduplication while preserving order, plus per-item validation, is exactly what we want here.


165-176: Per-type requirements check is correct

Iterates selected eval_types and enforces required fields succinctly. Good separation of concerns with the global requirements map.


251-253: Result model now captures tool_calls

Adding tool_calls here aligns the model with the new evaluation type and downstream reporting needs.
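
As an illustration of the pattern, here is a sketch of such a requirements map and the per-type check — field names are taken from this PR's docs, but the real validator in models.py may differ in detail:

# Illustrative mapping; mirrors the field names documented in this PR.
EVAL_TYPE_REQUIREMENTS_SKETCH = {
    "action_eval": ["eval_verify_script"],
    "response_eval:sub-string": ["expected_keywords"],
    "response_eval:accuracy": ["expected_response"],
    "tool_eval": ["expected_tool_calls"],
}

def check_required_fields(config: dict, eval_types: list[str]) -> None:
    """Raise if any selected eval type is missing one of its required fields."""
    for eval_type in eval_types:
        for field in EVAL_TYPE_REQUIREMENTS_SKETCH.get(eval_type, []):
            if not config.get(field):
                raise ValueError(f"'{field}' is required for '{eval_type}'")

check_required_fields(
    {"expected_keywords": ["lightspeed"]}, ["response_eval:sub-string"]
)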

lsc_agent_eval/README.md (2)

9-14: Docs correctly enumerate multi-type evaluations (AND semantics for substring)

The list includes tool_eval and explicitly states ALL keywords must be present for substring evaluation, which matches the team's prior decision.

Using team learning: This aligns with the stricter AND semantics captured in our previous learning about substring evaluation.


209-210: Key argument docs look good (endpoint_type)

Defaulting to streaming is documented and consistent with the implementation and tests.

lsc_agent_eval/tests/core/utils/test_api_client.py (5)

66-105: Updated non-streaming query test reflects new return shape and tool_calls formatting

Asserting the dict payload and the nested tool-calls sequence matches the implementation. Good coverage.


106-135: Good addition: empty tool_calls path

Covers the empty-list normalization and ensures a stable shape for consumers.


217-264: Streaming success path is well-tested

Context manager, endpoint invocation, and parser handoff are verified cleanly.


265-295: Parser error propagates with helpful message

Validates ValueError -> AgentAPIError path with an actionable message.


296-313: HTTP error handling test for streaming is appropriate

Mirrors non-streaming behavior and asserts consistent error semantics.

lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)

109-138: Streaming endpoint integration and error mapping look correct

Using stream with raise_for_status() and delegating parsing to parse_streaming_response with precise error mapping is solid.


50-81: No tuple-style unpacking of query_agent detected; no downstream updates required

A search for destructuring assignments of the form

a, b = client.query_agent(...)

returned no matches. All call sites (in tests and in evaluator.py) assign the returned dict to a single variable and then access its keys explicitly.
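
In other words, consumers follow the dict-access pattern (illustrative only; client and api_input as in the surrounding tests and evaluator):

result = client.query_agent(api_input)  # dict with response, conversation_id, tool_calls
response = result["response"]
conversation_id = result["conversation_id"]
tool_calls = result["tool_calls"]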

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1)

52-58: Good fix: explicit validation of endpoint_type

Switching on "streaming" vs "query" and raising on unsupported values prevents silent misconfiguration. This matches the framework’s ERROR vs FAIL semantics.

@asamal4 asamal4 force-pushed the tool-eval branch 3 times, most recently from 7081a00 to fd493ec on August 12, 2025 at 11:44
@asamal4 asamal4 marked this pull request as ready for review August 13, 2025 12:53
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🔭 Outside diff range comments (3)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)

234-241: Avoid O(n^2) duplicate detection for eval_id(s) in a conversation

Current duplicate collection uses count() inside a list comprehension, which is quadratic and also repeats duplicates in the error message. Use a set-based approach for linear-time detection and clean messaging.

Apply this diff:

-        eval_ids = [eval_config.eval_id for eval_config in self.conversation]
-        duplicates = [eval_id for eval_id in eval_ids if eval_ids.count(eval_id) > 1]
-        if duplicates:
+        eval_ids = [eval_config.eval_id for eval_config in self.conversation]
+        seen: set[str] = set()
+        duplicates_set: set[str] = set()
+        for eid in eval_ids:
+            if eid in seen:
+                duplicates_set.add(eid)
+            else:
+                seen.add(eid)
+        if duplicates_set:
             raise ValueError(
-                f"Duplicate eval_id(s) in conversation '{self.conversation_group}': {duplicates}"
+                f"Duplicate eval_id(s) in conversation '{self.conversation_group}': {sorted(duplicates_set)}"
             )
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (2)

131-137: Return False vs raise for missing verify script is inconsistent with ERROR semantics

Missing eval_verify_script is a configuration error. To surface it as ERROR (not FAIL), raise ValueError instead of returning False. The top-level per-type handler already converts ValueError into an ERROR result.

     def _evaluate_script(self, data_config: "EvaluationDataConfig") -> bool:
         """Evaluate using script execution."""
         if not data_config.eval_verify_script:
-            logger.error("No verify script provided for script evaluation")
-            return False
+            raise ValueError("No verify script provided for script evaluation")
 
         return self.script_runner.run_script(data_config.eval_verify_script)

171-178: Missing judge_manager should be ERROR, not FAIL

Per framework semantics, absence of judge_manager is an execution/config error. Raise ValueError to surface an ERROR result.

         if not self.judge_manager:
-            logger.error("Judge model manager not available for judge-llm evaluation")
-            return False
+            raise ValueError("Judge model manager not available for judge-llm evaluation")
 
         if not data_config.expected_response:
-            logger.error("Expected response not provided for judge-llm evaluation")
-            return False
+            raise ValueError("Expected response not provided for judge-llm evaluation")
♻️ Duplicate comments (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (1)

236-238: LGTM! Tool calls printing fixed as previously addressed.

The code correctly finds the first result with actual tool_calls data rather than potentially printing empty tool_calls from the first result, addressing the previous review feedback.

🧹 Nitpick comments (11)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1)

33-47: Optimize list comparison with early exit on length mismatch.

The current implementation processes all items even when lengths differ. Consider moving the length check to the beginning for better performance.

Apply this diff for improved efficiency:

 def _compare_lists(
     expected: list,
     actual: list,
     compare_func: Callable[[Any, Any], bool],
     mismatch_message: str,
 ) -> bool:
     """Compare two lists using a comparison function."""
     if len(expected) != len(actual):
         logger.debug(mismatch_message, len(expected), len(actual))
         return False
+    
     for i, (expected_item, actual_item) in enumerate(zip(expected, actual)):
         if not compare_func(expected_item, actual_item):
             logger.debug("Item %d does not match in %s", i, compare_func.__name__)
             return False
     return True
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)

114-116: Clarify expected_tool_calls shape or provide a typed schema

The field accepts list[list[dict]] which is flexible but easy to misuse. Consider documenting or enforcing the expected keys (e.g., name/tool_name and arguments) with a typed model to catch shape errors early.

Would you like me to introduce a lightweight Pydantic model (e.g., ToolCall with name: str, arguments: dict[str, Any]) and validators to enforce the nested list-of-sequences structure?
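
A minimal sketch of what that could look like, assuming Pydantic and the tool_name key this PR standardizes on (names below are illustrative, not existing code):

from typing import Any

from pydantic import BaseModel, Field

class ToolCall(BaseModel):
    """One expected tool call; key name follows the tool_name convention used in this PR."""

    tool_name: str
    arguments: dict[str, Any] = Field(default_factory=dict)

# expected_tool_calls would then be declared as list[list[ToolCall]] on the config model,
# letting Pydantic reject malformed entries instead of deferring errors to comparison time.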

lsc_agent_eval/README.md (2)

9-13: Document that substring matching is case-insensitive

The implementation lower-cases both response and keywords. Add “(case-insensitive)” to make behavior explicit for users.

-  - `response_eval:sub-string`: Simple substring matching evaluation (ALL keywords must be present in response)
+  - `response_eval:sub-string`: Simple substring matching evaluation (ALL keywords must be present in response; case-insensitive)
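
For reference, the documented behavior amounts to a check along these lines (a sketch, not the evaluator's actual code):

def all_keywords_present(response: str, expected_keywords: list[str]) -> bool:
    """ALL keywords must appear in the response; comparison is case-insensitive."""
    response_lower = response.lower()
    return all(keyword.lower() in response_lower for keyword in expected_keywords)

assert all_keywords_present("Lightspeed is enabled", ["lightspeed", "enabled"])
assert not all_keywords_present("Lightspeed is enabled", ["lightspeed", "disabled"])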

254-272: Consider documenting tool_calls presence in outputs

CSV/JSON sections don’t mention tool_calls. Since EvaluationResult carries tool_calls for tool_eval, note whether:

  • CSV excludes tool_calls (OK), and
  • JSON includes tool_calls for tool_eval results.

If desired, I can scan the CSV/JSON writers to confirm what’s emitted and propose doc updates accordingly.

lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py (1)

516-519: Integration test reflects eval_types usage

Single-eval integration test now uses eval_types. Consider adding a small case with multiple eval_types in one eval to catch YAML parsing/validation nuances.

I can add a follow-up integration test with combined eval_types (e.g., [response_eval:sub-string, response_eval:accuracy]) if helpful.

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (3)

143-151: Propagate config errors in substring eval as ERROR

Given model-level validation, this branch shouldn’t hit, but if it does, it should be treated as a configuration error (ERROR). Raise ValueError rather than returning False.

         if not data_config.expected_keywords:
-            return False
+            raise ValueError("expected_keywords not provided for sub-string evaluation")

59-62: Preserve None when tool_calls is absent to distinguish “missing” vs “empty”

Currently, agent_response.get("tool_calls", []) always yields a list. This makes it impossible for _evaluate_tools to detect a truly missing tool_calls (and raise a config/execution ERROR) vs an empty but present list (agent made no calls → FAIL). Dropping the default keeps this distinction.

-            tool_calls = agent_response.get("tool_calls", [])
+            tool_calls = agent_response.get("tool_calls")

If query_agent/streaming_query_agent always include tool_calls, this change won’t affect behavior. Otherwise it will surface absent tool_calls as ERROR (via _evaluate_tools).


191-204: Evaluate-tools error signaling is consistent; consider harmonizing empty vs None semantics

Raising on missing expected_tool_calls and missing actual_tool_calls matches framework ERROR semantics. If you adopt preserving None upstream, this will also properly signal absent tool_calls as ERROR; keep empty list as FAIL to reflect “agent made no calls.”

Optionally add a short comment documenting the distinction:

  • None: field absent → ERROR
  • []: field present but empty → FAIL
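
A minimal sketch of the proposed distinction — evaluate_tools_sketch is a hypothetical helper standing in for _evaluate_tools:

from lsc_agent_eval.core.agent_goal_eval.tool_call_eval import compare_tool_calls

def evaluate_tools_sketch(expected_tool_calls, actual_tool_calls) -> bool:
    """None: field absent -> raise (ERROR). []: present but empty -> FAIL. Else compare."""
    if expected_tool_calls is None:
        raise ValueError("expected_tool_calls not provided for tool_eval")  # surfaces as ERROR
    if actual_tool_calls is None:
        raise ValueError("tool_calls missing from agent response")  # surfaces as ERROR
    if not actual_tool_calls:
        return False  # agent made no tool calls -> FAIL
    return compare_tool_calls(expected_tool_calls, actual_tool_calls)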
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (3)

186-189: Add assertions for tool_calls and judge invocation on failure path

Even on FAIL, tool_calls should be None (since tool_eval isn’t requested). Also assert judge was invoked.

Apply this diff:

         result = results[0]
         assert result.result == "FAIL"
         assert result.error is None
+        assert result.tool_calls is None
+        mock_judge_manager.evaluate_response.assert_called_once()

309-319: Verify conversation_id propagation when not provided

In this test you don’t pass a conversation_id; the agent response provides one. Assert it’s propagated into the result to cover this branch.

Apply this diff:

         result = results[0]
         assert result.eval_id == "test_003"
         assert result.result == "FAIL"
         assert result.eval_type == "response_eval:sub-string"
         assert result.error is None
+        assert result.conversation_id == "test-conversation-id"

586-599: Assert tool_calls are attached to all results when tool_eval is included

Given the current design, tool_calls are propagated to results when tool_eval is among eval_types. Make this explicit to catch regressions.

Apply this diff:

         assert all(r.eval_id == "multi_pass_test" for r in results)
         assert all(r.query == "create openshift-lightspeed namespace" for r in results)
         assert all(r.conversation_id == "conv-123" for r in results)
+        expected_calls = [
+            [{"tool_name": "create_namespace", "arguments": {"name": "lightspeed"}}]
+        ]
+        assert all(r.tool_calls == expected_calls for r in results)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 640f624 and fd493ec.

📒 Files selected for processing (16)
  • lsc_agent_eval/README.md (9 hunks)
  • lsc_agent_eval/sample_data/agent_goal_eval_example.yaml (4 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (3 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (3 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (5 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (4 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py (12 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (14 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (21 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1 hunks)
  • lsc_agent_eval/tests/core/utils/test_api_client.py (4 hunks)
  • lsc_agent_eval/tests/core/utils/test_streaming_parser.py (1 hunks)
  • lsc_agent_eval/tests/test_agent_eval.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • lsc_agent_eval/tests/core/utils/test_streaming_parser.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py
  • lsc_agent_eval/sample_data/agent_goal_eval_example.yaml
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.

Applied to files:

  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
  • lsc_agent_eval/README.md
  • lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
  • lsc_agent_eval/README.md
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).

Applied to files:

  • lsc_agent_eval/README.md
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • lsc_agent_eval/README.md
  • lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
🧬 Code Graph Analysis (7)
lsc_agent_eval/tests/test_agent_eval.py (1)
lsc_agent_eval/src/lsc_agent_eval/agent_eval.py (1)
  • _args_parser (14-78)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)
  • EvaluationDataConfig (95-176)
  • EvaluationResult (245-272)
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (6)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
  • streaming_query_agent (109-137)
  • query_agent (50-89)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1)
  • run_evaluation (34-108)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)
  • EvaluationResult (245-272)
  • EvaluationDataConfig (95-176)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (2)
  • AgentAPIError (12-13)
  • ScriptExecutionError (16-17)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py (1)
  • run_script (28-76)
lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1)
  • evaluate_response (85-126)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (3)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
  • create_evaluation_results (16-64)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (3)
  • ConversationDataConfig (179-242)
  • EvaluationDataConfig (95-176)
  • EvaluationResult (245-272)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1)
  • run_evaluation (34-108)
lsc_agent_eval/tests/core/utils/test_api_client.py (2)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (3)
  • query_agent (50-89)
  • AgentHttpClient (15-142)
  • streaming_query_agent (109-137)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1)
  • AgentAPIError (12-13)
lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)
  • EvaluationResult (245-272)
  • EvaluationDataConfig (95-176)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (4)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1)
  • compare_tool_calls (9-18)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
  • create_evaluation_results (16-64)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
  • streaming_query_agent (109-137)
  • query_agent (50-89)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (3)
  • ScriptExecutionError (16-17)
  • JudgeModelError (20-21)
  • AgentAPIError (12-13)
🔇 Additional comments (37)
lsc_agent_eval/tests/test_agent_eval.py (3)

32-32: LGTM! Test assertion for default endpoint_type is correct.

The test correctly verifies that the endpoint_type defaults to "streaming", which aligns with the argument parser configuration in lsc_agent_eval/src/lsc_agent_eval/agent_eval.py line 27.


55-56: LGTM! Test coverage for query endpoint_type is comprehensive.

The test correctly verifies that when --endpoint_type "query" is provided, it's properly parsed and available in the namespace. This ensures the CLI argument parsing works for both endpoint types.

Also applies to: 70-70


111-125: LGTM! Invalid endpoint_type validation test is well-implemented.

The test correctly verifies that invalid endpoint_type values trigger a SystemExit exception, ensuring proper validation of the choices constraint defined in the argument parser.

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (3)

9-18: LGTM! Clean public API with appropriate type hints.

The main function signature is well-defined with clear type hints for nested tool call structures. The delegation to internal helpers maintains good separation of concerns.


50-66: LGTM! Well-structured tool call comparison logic.

The function correctly compares both tool names and arguments, with appropriate debug logging for mismatches. The separation of name and argument validation is clean.


69-100: LGTM! Comprehensive argument comparison with detailed logging.

The function handles all edge cases well:

  • Type validation for both dictionaries
  • Missing key detection
  • Value mismatch reporting
  • Extra key detection

The detailed debug logging will be valuable for troubleshooting evaluation failures.

lsc_agent_eval/tests/core/utils/test_api_client.py (6)

70-76: LGTM! Updated test data reflects real-world tool call structure.

The test data now includes realistic tool call structure with name and args fields that match the OLS API format, providing better test coverage.


91-99: LGTM! Proper validation of tool call formatting.

The test correctly validates that tool calls are formatted into the expected nested sequence structure, ensuring the _format_query_endpoint_tool_calls method works as intended.


106-135: LGTM! Good test coverage for empty tool_calls scenario.

This test ensures that when the API returns empty tool_calls, the client properly handles it and returns an empty list, covering an important edge case.


218-264: LGTM! Comprehensive streaming query test.

The test properly mocks the streaming response and validates that:

  • The correct endpoint /v1/streaming_query is called
  • The streaming parser is invoked
  • The expected result structure is returned

265-295: LGTM! Good error handling test for streaming parser.

The test correctly verifies that ValueError from the streaming parser is properly caught and re-raised as AgentAPIError with appropriate error context.


296-331: LGTM! Complete error scenario coverage for streaming.

The timeout and HTTP error tests ensure robust error handling for the streaming endpoint, maintaining consistency with the non-streaming error handling patterns.

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (3)

8-14: LGTM! Well-defined TypedDict for evaluation results.

The EvalResultItem structure provides clear typing for individual evaluation results with appropriate optional error field.


40-40: LGTM! Consistent tool_calls handling addressed.

The implementation correctly attaches tool_calls only to tool_eval results in both global error and per-evaluation paths, maintaining consistency as requested in previous reviews.

Also applies to: 57-58


61-63: LGTM! Proper input validation with clear error message.

The mutual exclusivity check ensures the function is called correctly and provides a helpful error message when neither required parameter is provided.
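
The described contract boils down to a guard like this (a sketch of the check, not the exact code in utils.py):

def _require_results_or_error(evaluation_results, error_message) -> None:
    """create_evaluation_results needs per-type results or a global error message."""
    if evaluation_results is None and error_message is None:
        raise ValueError(
            "Either evaluation_results or error_message must be provided"
        )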

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (7)

25-36: LGTM! Clean utility function for result markers.

The helper function provides consistent visual markers for different evaluation outcomes, improving the user experience with clear status indicators.
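
Such a helper is typically just a small mapping — a sketch, with illustrative symbols that may differ from the actual ones:

def result_marker(result: str) -> str:
    """Map an evaluation outcome to a display marker (symbols are illustrative)."""
    return {"PASS": "✅", "FAIL": "❌", "ERROR": "⚠️"}.get(result, "?")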


129-134: LGTM! Proper multi-result error handling for setup failures.

The code correctly uses create_evaluation_results to generate error results for all evaluation types when setup script fails, and properly extends the results list.


185-191: LGTM! Endpoint-type parameter integration.

The evaluation runner correctly receives the endpoint_type parameter with a sensible default fallback, enabling proper routing between streaming and query endpoints.


194-196: LGTM! Conversation ID handling for multi-result context.

The logic correctly extracts the conversation_id from the first result in the list, maintaining conversation continuity across multiple evaluation types.


206-230: LGTM! Enhanced multi-result printing logic.

The refactored printing function properly:

  • Computes overall status across multiple evaluation types
  • Displays all eval types in the summary line
  • Uses consistent result markers
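
The overall status can be derived with a simple precedence rule — a sketch; the printer's exact precedence may differ:

def overall_status(results) -> str:
    """ERROR dominates, then FAIL; PASS only when every eval type passed."""
    statuses = {r.result for r in results}
    if "ERROR" in statuses:
        return "ERROR"
    if "FAIL" in statuses:
        return "FAIL"
    return "PASS"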

247-249: LGTM! Expected tool calls debugging support added.

The addition of expected_tool_calls in the debug output provides valuable context for troubleshooting tool evaluation failures.


253-257: LGTM! Individual result breakdown enhances debugging.

The detailed per-result output with individual markers and error messages provides clear visibility into which specific evaluation types failed and why.

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (4)

9-20: Good centralization of per-type requirements

EVAL_TYPE_REQUIREMENTS clearly encodes allowed eval types and their required fields. This is a solid foundation for consistent validation across the model. No action needed.


23-36: Tight eval_type validation is correct and user-friendly

The validator handles type, emptiness, and allowed values with a helpful error message listing allowed types. Looks good.


125-145: Order-preserving dedupe on eval_types is great

Validating each type and de-duplicating while preserving order prevents subtle UX issues. Nicely done.


251-256: Tool calls captured in results aligns with tool_eval

Including tool_calls only for tool_eval via downstream utils is appropriate. No issues.

lsc_agent_eval/README.md (1)

66-71: No changes needed: “tool_name” is already used consistently

I verified that every tool‐call definition uses the tool_name key—there are no top‐level name keys in code, samples, or docs:

• lsc_agent_eval/README.md (all examples use tool_name)
• lsc_agent_eval/sample_data/agent_goal_eval_example.yaml (uses tool_name)
• streaming_parser.py and tool_call_eval.py expect and emit tool_name

Since there’s no discrepancy to resolve, you can ignore the suggestion to switch to name.

Likely an incorrect or invalid review comment.

lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py (3)

291-294: Assertion matches new validator message

The expectation for “eval_type must be one of …” aligns with models._validate_eval_type. Good coverage for invalid types.


24-45: Nice coverage for both sub-string and accuracy eval_types in basic YAML

This validates multi-type migration in data manager. Looks good.


150-160: Model-level requirement assertions are consistent with EVAL_TYPE_REQUIREMENTS

These tests ensure missing required fields per eval_type are surfaced early. Solid.

Also applies to: 200-206, 224-231, 268-278, 296-307, 321-331, 346-356

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (2)

52-58: Validating endpoint_type prevents silent misconfig

Explicitly handling “streaming” vs “query” and raising on anything else is the right call. This ensures misconfigurations are surfaced as structured ERRORs upstream.


63-102: Per-type result aggregation and tool_calls forwarding look correct

The loop cleanly records PASS/FAIL/ERROR per eval_type and only forwards tool_calls when tool_eval is requested. Matches the new result model/flow.
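
Conceptually, the per-type loop looks something like this (a rough sketch of the described flow, not the run_evaluation implementation):

def run_per_type_sketch(eval_types, evaluate_one):
    """One result item per requested eval type (tool_calls are attached downstream for tool_eval)."""
    items = []
    for eval_type in eval_types:
        try:
            passed = evaluate_one(eval_type)
            items.append(
                {"eval_type": eval_type, "result": "PASS" if passed else "FAIL", "error": None}
            )
        except ValueError as exc:  # config/execution problems surface as ERROR
            items.append({"eval_type": eval_type, "result": "ERROR", "error": str(exc)})
    return items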

lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (3)

94-142: Great multi-type result scenario coverage

This test validates that multiple EvaluationResult entries can co-exist for the same eval_id with different eval_types and outcomes. Good signal for downstream stats.


341-373: Model requirement tests are spot-on

These assert the EVAL_TYPE_REQUIREMENTS mapping is enforced. Nicely done.


495-548: Stats by eval_type updated and validated

Ensures the new eval_type names populate stats correctly. LGTM.

lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (2)

28-39: Solid agent client mock setup and consistent response shape

Good use of a single mock function for both streaming and query paths, and returning a stable payload including conversation_id and tool_calls. This keeps tests deterministic across endpoint types.


353-375: Nice coverage of strict AND semantics for substring evaluation

These cases exercise all/one/none/CI paths and align with the stricter ALL-keywords requirement noted in previous changes.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (16)
lsc_agent_eval/tests/core/utils/test_api_client.py (2)

70-76: Good coverage of tool_calls mapping from query endpoint

  • Validates transforming backend shape {"name","args"} into the internal {"tool_name","arguments"} format and wrapping into sequences. This aligns with the framework’s expected_tool_calls schema.
  • Consider adding a test for when the backend omits the tool_calls key entirely (not just an empty list) to ensure we default to [] and maintain a stable return shape.

Example additional test to add:

def test_query_agent_success_missing_tool_calls_key():
    mock_response = Mock()
    response_text = "There are 80 namespaces."
    mock_response.json.return_value = {
        "response": response_text,
        "conversation_id": "conv-id-123",
        # tool_calls key intentionally omitted
    }
    mock_response.raise_for_status.return_value = None

    mock_client = Mock()
    mock_client.post.return_value = mock_response

    with patch("httpx.Client", return_value=mock_client):
        client = AgentHttpClient("http://localhost:8080")
        api_input = {
            "query": "How many namespaces are there?",
            "provider": "watsonx",
            "model": "ibm/granite-3-3-8b-instruct",
        }
        result = client.query_agent(api_input)
        assert result["response"] == response_text
        assert result["conversation_id"] == "conv-id-123"
        assert result["tool_calls"] == []  # default shape

Also applies to: 91-99


217-265: Streaming path wiring and parser invocation look right

  • Verifies POST /v1/streaming_query with the input payload and timeout.
  • Properly delegates to parse_streaming_response and returns its result.
  • Minor enhancement: add a (separate) test that initializes the client with an auth token (env or file) and verifies Authorization header is propagated for the streaming path as well.

Example test addition:

@patch("builtins.open", mock_open(read_data="stream-token-xyz"))
def test_streaming_query_agent_with_token_file():
    mock_response = Mock()
    mock_response.raise_for_status.return_value = None

    cm = MagicMock()
    cm.__enter__.return_value = mock_response
    cm.__exit__.return_value = None

    mock_client = Mock()
    mock_client.stream.return_value = cm

    expected_result = {"response": "ok", "conversation_id": "c1", "tool_calls": []}

    with (
        patch("httpx.Client", return_value=mock_client) as client_ctor,
        patch("lsc_agent_eval.core.utils.api_client.parse_streaming_response") as parser,
    ):
        parser.return_value = expected_result
        client = AgentHttpClient("http://localhost:8080", "token.txt")
        api_input = {"query": "q", "provider": "p", "model": "m"}
        result = client.streaming_query_agent(api_input)
        assert result == expected_result
        # Verify auth header update happened on the client used for streaming
        client_ctor.return_value.headers.update.assert_called_once_with(
            {"Authorization": "Bearer stream-token-xyz"}
        )
lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (6)

20-37: Eval type changes reflected correctly in EvaluationResult tests

Updated eval_type literals to the new canonical forms and kept result validation intact. Consider adding a small test that constructs an EvaluationResult with tool_calls set (to exercise the new field on the result model).

Example:

def test_evaluation_result_with_tool_calls():
    tool_calls = [[{"tool_name": "oc_get", "arguments": {"oc_get_args": ["pods"]}}]]
    result = EvaluationResult(
        eval_id="tool_res_001",
        query="list pods",
        response="Found 3 pods",
        tool_calls=tool_calls,
        eval_type="tool_eval",
        result="PASS",
    )
    assert result.tool_calls == tool_calls

Also applies to: 38-55, 56-67, 68-80


94-142: Nice addition validating multiple results for the same core query

Covers coexistence of action_eval, response_eval:sub-string, and response_eval:accuracy for the same eval_id/conversation. Consider adding a tool_eval case with a non-empty tool_calls to ensure mixed-type handling extends to tool eval results too.

Example extension:

results.append(
    EvaluationResult(
        eval_id="multi_001",
        query="create openshift-lightspeed namespace",
        response="tools used",
        tool_calls=[[{"tool_name": "oc_get", "arguments": {"oc_get_args": ["ns"]}}]],
        eval_type="tool_eval",
        result="PASS",
        conversation_group="test_conv",
        conversation_id="conv-123",
    )
)
assert "tool_eval" in [r.eval_type for r in results]

147-216: EvaluationDataConfig tests updated to eval_types are sound

Covers accuracy, action_eval, and sub-string variants with the new list-based eval_types. To harden behavior, add a test that supplies duplicate eval types and asserts they are de-duplicated in order (as per validate_eval_types).

Example:

def test_evaluation_data_config_eval_types_deduplication():
    cfg = EvaluationDataConfig(
        eval_id="dedup_1",
        eval_query="Q",
        eval_types=["action_eval", "action_eval", "response_eval:sub-string", "response_eval:sub-string"],
        expected_keywords=["k"],
        eval_verify_script=None,
    )
    assert cfg.eval_types == ["action_eval", "response_eval:sub-string"]

221-255: Good multi-type config test with script+substring+accuracy

Validates multiple requirements simultaneously (keywords, response, script). One more edge case worth testing: expected_keywords input containing whitespace/empty strings to confirm trimming and filtering behavior, and that an all-empty list raises a validation error (per validate_keywords).

Example:

def test_expected_keywords_trimming_and_empty_filtering():
    cfg = EvaluationDataConfig(
        eval_id="kw_1",
        eval_query="Q",
        eval_types=["response_eval:sub-string"],
        expected_keywords=["  alpha  ", "", "  " , "beta"],
    )
    assert cfg.expected_keywords == ["alpha", "beta"]

def test_expected_keywords_all_empty_raises():
    with pytest.raises(ValidationError):
        EvaluationDataConfig(
            eval_id="kw_2",
            eval_query="Q",
            eval_types=["response_eval:sub-string"],
            expected_keywords=["", "   "],
        )

331-470: Conversation-level validation looks comprehensive

  • Covers minimal config, scripts path validation, group name empty, non-existent script, and duplicate eval_ids within a conversation.
  • Consider adding a positive test demonstrating the same eval_id used across different conversation_group values is allowed (and maybe logs a warning), aligning with design notes.

Example:

def test_same_eval_id_across_conversation_groups_allowed():
    conv1 = ConversationDataConfig(
        conversation_group="g1",
        conversation=[
            EvaluationDataConfig(
                eval_id="shared",
                eval_query="Q1",
                eval_types=["response_eval:accuracy"],
                expected_response="R1",
            )
        ],
    )
    conv2 = ConversationDataConfig(
        conversation_group="g2",
        conversation=[
            EvaluationDataConfig(
                eval_id="shared",
                eval_query="Q2",
                eval_types=["response_eval:accuracy"],
                expected_response="R2",
            )
        ],
    )
    assert conv1.conversation[0].eval_id == conv2.conversation[0].eval_id == "shared"

498-548: Stats aggregation test looks good

Covers totals, by-conversation, and by-eval-type for three types. Optional: add a tool_eval result to assert inclusion in by_eval_type and ensure overall counts stay consistent.

lsc_agent_eval/README.md (8)

66-71: Clarify expected_tool_calls schema and naming (“tool_name”, “arguments”)

To reduce configuration mistakes, strengthen the wording and include a compact example under the field list.

Apply this refinement:

-- `expected_tool_calls`: Expected tool call sequences (list of lists) with tool_name and arguments (for tool_eval)
+- `expected_tool_calls`: Expected tool call sequences (list of lists) for tool_eval.
+  Each tool call entry must use:
+    - `tool_name`: string
+    - `arguments`: object/dict of named parameters
+
+  Example:
+    expected_tool_calls:
+      - - tool_name: oc_get
+          arguments:
+            oc_get_args: [namespaces]

73-75: Tighten note on duplicate eval_id wording

Minor clarity/grammar tweak.

-Note: `eval_id` can't contain duplicate values within a conversation group. But it is okay for cross conversation group (A warning is logged anyway for awareness)
+Note: `eval_id` cannot be duplicated within the same conversation group. Reusing the same `eval_id` across different conversation groups is allowed (a warning is logged for awareness).

116-136: Tool call examples match implementation (tool_name + arguments)

Examples align with the comparator/formatter expectations. Consider adding an additional example that demonstrates multiple tool calls within a single sequence for completeness.

Example:

expected_tool_calls:
  - - tool_name: create_pod
      arguments: {name: demo, image: nginx}
    - tool_name: apply_service
      arguments: {pod_name: demo, port: 80}

144-147: Consistent quoting in YAML example

Optional consistency: quote lightspeed in the expected_keywords list to match “yes”.

-      expected_keywords: ["yes", lightspeed]
+      expected_keywords: ["yes", "lightspeed"]

181-182: Token guidance is helpful; consider noting precedence

If both AGENT_API_TOKEN and agent_auth_token_file are provided, note which one wins to avoid ambiguity.

Proposed addition immediately after the sentence:
“If both are provided, agent_auth_token_file takes precedence.”


246-248: Clarify FAIL vs ERROR semantics

Aligns with framework behavior but could be clearer for readers at a glance.

-- **Evaluation Failure**: Individual evaluation marked as ERROR or FAIL
+- **Evaluation Failure**:
+  - `FAIL`: The evaluation ran successfully but did not meet the criteria (e.g., incorrect response, script verification failed).
+  - `ERROR`: An error occurred during evaluation execution (e.g., setup failed, agent API error).

254-265: Consider documenting tool_calls in CSV output (optional)

If CSV includes tool_calls or a truncated version for tool_eval, add to the column list. If not, consider adding it to aid debugging of tool evaluations.

Proposed addition:

 - `response`: The agent's response
+- `tool_calls`: Serialized tool calls (if present; for tool_eval)
 - `error`: Error message (if any)

26-28: Prerequisites note: streaming response expectations (optional)

Since streaming is now the default, consider noting that the Agent API’s streaming endpoint must emit a compatible stream (e.g., chunked JSON tokens) that parse_streaming_response can handle.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd493ec and c75e64a.

📒 Files selected for processing (10)
  • lsc_agent_eval/README.md (9 hunks)
  • lsc_agent_eval/sample_data/agent_goal_eval_example.yaml (4 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (4 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (14 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (21 hunks)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1 hunks)
  • lsc_agent_eval/tests/core/utils/test_api_client.py (4 hunks)
  • lsc_agent_eval/tests/core/utils/test_streaming_parser.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py
  • lsc_agent_eval/tests/core/utils/test_streaming_parser.py
  • lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py
  • lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
  • lsc_agent_eval/sample_data/agent_goal_eval_example.yaml
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.

Applied to files:

  • lsc_agent_eval/README.md
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
📚 Learning: 2025-08-13T14:07:44.146Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.146Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.

Applied to files:

  • lsc_agent_eval/README.md
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
📚 Learning: 2025-08-13T14:07:44.147Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.147Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.

Applied to files:

  • lsc_agent_eval/README.md
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).

Applied to files:

  • lsc_agent_eval/README.md
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • lsc_agent_eval/README.md
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • lsc_agent_eval/README.md
  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
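A minimal sketch of that composite-key check (hypothetical helper; the real validation in eval_data.py and its exact messages may differ):

```python
import logging

logger = logging.getLogger(__name__)

def validate_eval_ids(configs: list[dict]) -> None:
    """Error on duplicate eval_id within a conversation group; only warn across groups."""
    seen_keys: set[tuple[str, str]] = set()
    seen_ids: set[str] = set()
    for config in configs:
        key = (config["conversation_group"], config["eval_id"])
        if key in seen_keys:
            raise ValueError(f"Duplicate eval_id {key[1]!r} in conversation group {key[0]!r}")
        if config["eval_id"] in seen_ids:
            logger.warning("eval_id %r reused in another conversation group", config["eval_id"])
        seen_keys.add(key)
        seen_ids.add(config["eval_id"])
```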
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
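Expressed with plain dicts (rather than the real EvaluationResult model, whose exact fields are not shown here), the distinction looks like this:

```python
# FAIL: the evaluation ran but the criteria were not met -> no error message
failed = {"eval_id": "eval1", "result": "FAIL", "error": None}

# ERROR: the evaluation itself could not run (e.g. setup script or API failure)
errored = {"eval_id": "eval1", "result": "ERROR", "error": "setup script exited with code 1"}

def is_execution_error(result: dict) -> bool:
    """ERROR results carry an error message; FAIL results leave it as None."""
    return result["result"] == "ERROR" and result["error"] is not None

assert not is_execution_error(failed)
assert is_execution_error(errored)
```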
🧬 Code Graph Analysis (1)
lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)
  • EvaluationResult (245-272)
  • EvaluationDataConfig (95-176)
🔇 Additional comments (5)
lsc_agent_eval/tests/core/utils/test_api_client.py (4)

106-135: Empty tool_calls handling is correct

Confirms we return an empty list when the backend provides tool_calls: [] and keep the dict-based API shape. Looks good.
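For reference, the dict shape being asserted looks roughly like this (values are illustrative):

```python
# Shape returned by query_agent when the backend reports no tool calls
result = {
    "response": "There are 5 pods in the namespace.",
    "conversation_id": "1234-abcd",
    "tool_calls": [],  # backend sent tool_calls: [] -> keep an explicit empty list
}

assert result["tool_calls"] == []
assert set(result) == {"response", "conversation_id", "tool_calls"}
```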


266-295: Parser error mapping is robust

Catches ValueError from the streaming parser and raises AgentAPIError with a clear message. Good error surface for callers.


296-312: Timeout behavior for streaming is correctly captured

Maps httpx.TimeoutException to a dedicated AgentAPIError. Good.


313-330: HTTP error propagation on streaming is correct

Asserts proper wrapping of HTTPStatusError with status code included. Looks solid.
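Taken together, these three tests pin down an error-mapping pattern along these lines (a simplified sketch assuming httpx and a locally defined AgentAPIError; the real streaming_query_agent also parses the SSE events and builds the result dict):

```python
import httpx

class AgentAPIError(Exception):
    """Raised for any failure while talking to the agent API."""

def parse_stream(raw: str) -> dict:
    """Stand-in for the real streaming parser; raises ValueError on malformed data."""
    if not raw.startswith("data:"):
        raise ValueError("unexpected streaming payload")
    return {"response": raw, "conversation_id": "", "tool_calls": []}

def streaming_query_sketch(client: httpx.Client, url: str, payload: dict) -> dict:
    """Map parser, timeout, and HTTP errors to AgentAPIError, as the tests assert."""
    try:
        with client.stream("POST", url, json=payload) as response:
            response.raise_for_status()
            return parse_stream("".join(response.iter_text()))
    except ValueError as exc:
        raise AgentAPIError(f"Failed to parse streaming response: {exc}") from exc
    except httpx.TimeoutException as exc:
        raise AgentAPIError(f"Streaming request timed out: {exc}") from exc
    except httpx.HTTPStatusError as exc:
        raise AgentAPIError(
            f"Agent API returned status {exc.response.status_code}"
        ) from exc
```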

lsc_agent_eval/README.md (1)

9-14: Clear summary of multi-type evaluation including tool_eval

The bullets accurately reflect the new eval_types surface and the tool_eval addition. No changes required.
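For context, a single eval entry can request several of these evaluation kinds at once; expressed as a Python dict mirroring the YAML sample data (values and argument keys are illustrative):

```python
eval_entry = {
    "eval_id": "eval1",
    "eval_query": "How many pods are running in the cluster?",
    "eval_types": ["response_eval:sub-string", "tool_eval"],
    "expected_keywords": ["pods"],
    "expected_tool_calls": [{"tool_name": "oc_get", "arguments": {"oc_get_args": ["pods"]}}],
}
```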

@asamal4
Collaborator Author

asamal4 commented Aug 13, 2025

@tisnik There are a few nitpicks and one good review comment; I will address them in a follow-up PR.
The comments don't impact the overall functionality.

Please merge this if possible.

Contributor

@tisnik tisnik left a comment


LGTM

@tisnik tisnik merged commit d39ef22 into lightspeed-core:main Aug 13, 2025
14 checks passed