Agent eval: add tool call comparison #24
Conversation
Walkthrough
Replaces the single eval_type with eval_types (List[str]); adds action_eval, response_eval:* and tool_eval (with expected_tool_calls and eval_verify_script); introduces endpoint_type (streaming|query) plus streaming parser and client paths and a multi-result evaluation flow; and updates the README, samples, CLI, models, utils, and tests.
Sequence Diagram(s)
sequenceDiagram
participant CLI as CLI (agent_eval)
participant Runner as EvaluationRunner
participant API as AgentHttpClient
participant Parser as Streaming Parser
participant Dispatcher as Eval Dispatcher
CLI->>Runner: run_evaluation(data_config, provider, model, endpoint_type)
alt endpoint_type == "streaming"
Runner->>API: streaming_query_agent(api_input)
API->>Parser: parse_streaming_response(response)
Parser-->>API: {response, conversation_id, tool_calls}
API-->>Runner: {response, conversation_id, tool_calls}
else endpoint_type == "query"
Runner->>API: query_agent(api_input)
API-->>Runner: {response, conversation_id, tool_calls}
end
Runner->>Runner: for each eval_type in data_config.eval_types
Runner->>Dispatcher: evaluate per-type (action_eval / response_eval:* / tool_eval)
Dispatcher-->>Runner: per-type result items
Runner->>Dispatcher: create_evaluation_results(...)
Dispatcher-->>CLI: list[EvaluationResult]
sequenceDiagram
participant API as AgentHttpClient.streaming_query_agent
participant HTTP as Agent API (stream endpoint)
participant Parser as parse_streaming_response
API->>HTTP: POST /v1/streaming_query (JSON)
HTTP-->>API: stream lines (data: {...})
API->>Parser: parse_streaming_response(response)
Parser-->>API: {response, conversation_id, tool_calls}
API-->>Caller: structured dict
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
Actionable comments posted: 6
🔭 Outside diff range comments (2)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (1)
27-35: TLS verification is disabled by default (verify=False); make it configurable and secure by default
Disabling TLS verification is risky. At minimum, gate it via an env var and default to verification enabled.
Proposed change:
-        # enable verify, currently for eval it is set to False
-        self.client = httpx.Client(base_url=self.endpoint, verify=False)
+        # Allow toggling TLS verification; default to secure True
+        verify_env = os.getenv("AGENT_API_VERIFY", "true").lower()
+        verify_tls = verify_env in ("1", "true", "yes")
+        self.client = httpx.Client(base_url=self.endpoint, verify=verify_tls)

If the eval environment requires self-signed certs, set AGENT_API_VERIFY=false explicitly.
lsc_agent_eval/README.md (1)
221-229: Update evaluation flow wording for multi-type and tool_eval
The step still references a single eval_type. Adjust to eval_types and include tool_eval to align with implementation.
- - Execute evaluation based on eval_type (either sub-string, judge-llm or script)
+ - Execute evaluations based on eval_types (any combination of sub-string, judge-llm, script, and tool-call comparison)
🧹 Nitpick comments (27)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (2)
31-45: Enhance debug logging with more context
The generic log message "Item %d does not match" at line 43 lacks context about what type of item failed. Consider including the comparison type for better debugging.
 def _compare_lists(
     expected: list,
     actual: list,
     compare_func: Callable[[Any, Any], bool],
     mismatch_message: str,
 ) -> bool:
     """Compare two lists using a comparison function."""
     if len(expected) != len(actual):
         logger.debug(mismatch_message, len(expected), len(actual))
         return False
     for i, (expected_item, actual_item) in enumerate(zip(expected, actual)):
         if not compare_func(expected_item, actual_item):
-            logger.debug("Item %d does not match", i)
+            logger.debug("Item %d does not match in %s", i, compare_func.__name__)
             return False
     return True
67-82: Consider removing redundant length check
The length comparison at lines 76-82 is redundant since the subsequent iteration over expected.items() will inherently catch any missing keys in actual. The check at line 86-88 already handles this case.

Either keep the length check for early exit optimization or remove it for simpler code:
 def _compare_tool_arguments(expected: dict, actual: dict) -> bool:
     """Compare tool arguments."""
     if not isinstance(expected, dict) or not isinstance(actual, dict):
         logger.debug(
             "Argument type mismatch: expected dict, got %s and %s",
             type(expected),
             type(actual),
         )
         return False
-    if len(expected) != len(actual):
-        logger.debug(
-            "Argument count mismatch: expected %d args, got %d args",
-            len(expected),
-            len(actual),
-        )
-        return False
     # Direct comparison is not done to have better debugging ability
     for key, expected_value in expected.items():
         if key not in actual:
             logger.debug("Missing argument key: '%s'", key)
             return False
         actual_value = actual[key]
         if expected_value != actual_value:
             logger.debug(
                 "Argument value mismatch for '%s': expected %s, got %s",
                 key,
                 expected_value,
                 actual_value,
             )
             return False
+
+    # Check for extra keys in actual
+    extra_keys = set(actual.keys()) - set(expected.keys())
+    if extra_keys:
+        logger.debug("Unexpected argument keys: %s", extra_keys)
+        return False
     return True

lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1)
23-23: Use constants for string prefixes instead of hardcoded indices
Using hardcoded indices (line 23: [6:] and line 81: [5:]) for removing prefixes is fragile and could break if prefix strings change.

+# Add at module level after imports
+DATA_PREFIX = "data: "
+TOOL_PREFIX = "Tool:"
+ARGUMENTS_SEPARATOR = " arguments:"
+
 def parse_streaming_response(response: httpx.Response) -> dict[str, Any]:
     """Parse streaming response and extract data."""
     conversation_id = ""
     final_response = ""
     tool_calls = []
     for line in response.iter_lines():
         line = line.strip()
-        if not line or not line.startswith("data: "):
+        if not line or not line.startswith(DATA_PREFIX):
             continue
-        json_data = line[6:]  # Remove "data: " prefix
+        json_data = line[len(DATA_PREFIX):]  # Remove data prefix
         parsed_data = _parse_streaming_line(json_data)

And similarly for line 81:

-    tool_name = parts[0][5:]  # Remove "Tool:" prefix
+    tool_name = parts[0][len(TOOL_PREFIX):]  # Remove Tool prefix

Also applies to: 81-81
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)
163-190: Consider extracting validation logic for better maintainability
The validation logic for each eval type could be extracted into a mapping for better maintainability as new types are added.
# At module level
EVAL_TYPE_REQUIREMENTS = {
    "response_eval:accuracy": ("expected_response", "requires non-empty 'expected_response'"),
    "response_eval:sub-string": ("expected_keywords", "requires non-empty 'expected_keywords'"),
    "action_eval": ("eval_verify_script", "requires non-empty 'eval_verify_script'"),
    "tool_eval": ("expected_tool_calls", "requires non-empty 'expected_tool_calls'"),
}

# In validate_eval_requirements method
for eval_type in self.eval_types:
    if eval_type in EVAL_TYPE_REQUIREMENTS:
        field_name, error_msg = EVAL_TYPE_REQUIREMENTS[eval_type]
        field_value = getattr(self, field_name)
        if not field_value or (isinstance(field_value, list) and len(field_value) == 0):
            raise ValueError(
                f"eval_types containing '{eval_type}' {error_msg}"
            )

lsc_agent_eval/tests/test_agent_eval.py (1)
55-57: Endpoint type override tested; add a negative test for invalid values
The explicit override to "query" is validated. Consider adding a test for an invalid value (e.g., "--endpoint_type invalid") to assert argparse exits with SystemExit, catching regressions in choices.
Example:
def test_args_parser_invalid_endpoint_type():
    with pytest.raises(SystemExit):
        _args_parser(["--eval_data_yaml", "test.yaml", "--agent_provider", "x", "--agent_model", "y", "--endpoint_type", "invalid"])

Also applies to: 70-71
lsc_agent_eval/tests/core/agent_goal_eval/test_results.py (1)
135-135: Extend stats coverage to tool_eval (optional)
Current assertions cover response_eval and action_eval. Consider adding a tool_eval result into sample_results and verifying its inclusion in by_eval_type to exercise the new path.
Example extension:
sample_results.append(EvaluationResult(
    eval_id="test_004",
    query="Call tool",
    response="...",
    eval_type="tool_eval",
    result="PASS",
    conversation_group="conv2",
    conversation_id="conv-id-456",
))
# Then assert "tool_eval" in stats.by_eval_type and passed count increments.

Also applies to: 156-159, 222-225
lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1)
1-109: Comprehensive tool call comparison tests; add sequence-order and superset-args cases
Great coverage. Two optional additions (a sketch follows after this list):
- If order between sequences is intended to matter, add a negative test where sequences are swapped; if order should not matter, add a positive test and update impl accordingly.
- Add a case where actual has extra arguments beyond expected to define whether superset matches are allowed.
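A minimal sketch of both cases, assuming compare_tool_calls(expected, actual) takes nested lists of {"tool_name", "arguments"} dicts, returns a bool, and treats sequence order as significant (the import path and test names below are illustrative, not the repo's exact code):

```python
# Hypothetical test additions for test_tool_call_eval.py
from lsc_agent_eval.core.agent_goal_eval.tool_call_eval import compare_tool_calls


def test_sequence_order_mismatch_fails():
    # Swapping the two sequences should fail if ordering is intended to matter
    expected = [
        [{"tool_name": "create_namespace", "arguments": {"name": "lightspeed"}}],
        [{"tool_name": "list_versions", "arguments": {}}],
    ]
    actual = list(reversed(expected))
    assert compare_tool_calls(expected, actual) is False


def test_superset_arguments_fail():
    # The actual call carries an extra argument beyond the expected set
    expected = [[{"tool_name": "create_namespace", "arguments": {"name": "lightspeed"}}]]
    actual = [[{"tool_name": "create_namespace", "arguments": {"name": "lightspeed", "dry_run": True}}]]
    assert compare_tool_calls(expected, actual) is False
```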
lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (1)
192-195: Assert endpoint_type is forwarded to runner (optional)
To prevent regressions, assert that run_evaluation is called with endpoint_type from args.
Example:
mock_args.endpoint_type = "query"
# ... after evaluator.run_evaluation()
calls = mock_evaluation_runner.return_value.run_evaluation.call_args_list
for call in calls:
    # Support both kwargs and positional
    if "endpoint_type" in call.kwargs:
        assert call.kwargs["endpoint_type"] == "query"
    else:
        # If positional, ensure the last arg equals "query"
        assert call.args[-1] == "query"

lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py (1)
30-34: Broad eval_types migration looks correct; refine messages and add tool_eval coverage
- Good migration to eval_types lists across fixtures and validations.
- Minor nit: tests and/or model error messages still say “eval_type must be one of...”. Consider updating to “eval_types contains invalid value...” for clarity.
- Add a test for tool_eval requiring expected_tool_calls and its positive/negative cases.
Example tool_eval negative:
invalid = [{
    "conversation_group": "cg",
    "conversation": [{
        "eval_id": "t1",
        "eval_query": "q",
        "eval_types": ["tool_eval"]
    }]
}]
with (...patch opens...):
    with pytest.raises(EvaluationDataError, match="expected_tool_calls"):
        AgentGoalEvalDataManager("test.yaml")

Also applies to: 42-43, 56-58, 62-64, 68-70, 158-160, 201-203, 226-227, 277-279, 305-306, 330-331, 355-356, 380-381, 391-392, 516-517
lsc_agent_eval/tests/core/utils/test_api_client.py (3)
265-295: Streaming error handling tests are good; add a timeout case (optional)
You've covered parser ValueError and HTTP errors. Consider adding a TimeoutException scenario to assert the specific timeout message.
Example:
def test_streaming_query_agent_timeout():
    mock_client = Mock()
    mock_client.stream.side_effect = httpx.TimeoutException("Timeout")
    with patch("httpx.Client", return_value=mock_client):
        client = AgentHttpClient("http://localhost:8080")
        with pytest.raises(AgentAPIError, match="streaming query timeout"):
            client.streaming_query_agent({"query": "q", "provider": "p", "model": "m"})

Also applies to: 296-314
55-64: Tighten env token test to assert the exact environment key used
Currently patching os.getenv to return a token for any key can mask regressions. Assert the key being queried is the expected one (e.g., AGENT_API_TOKEN).
Example:
with patch("os.getenv") as getenv:
    getenv.side_effect = lambda k, default=None: "env-token-456" if k == "AGENT_API_TOKEN" else default
    AgentHttpClient("http://localhost:8080")
    getenv.assert_called()
    # Optionally assert it was called with "AGENT_API_TOKEN"
3-4: TLS verify=False: make this configurable for production use
AgentHttpClient creates httpx.Client with verify=False. For production-grade systems, consider a configuration (env var/CLI/arg) to enable TLS verification and optionally supply CA bundle. Tests can override to False; default should be secure.
lsc_agent_eval/tests/core/utils/test_streaming_parser.py (2)
30-46: Add coverage for multiple tool_call events and sequence formatting
Current test validates a single tool_call producing a single-item sequence. Add a case with multiple tool_call lines to assert:
- len(result["tool_calls"]) == number of tool_call events
- Each entry is a single-item sequence (or per your grouping rules)
I can propose an additional test snippet if helpful.
66-80: Test JSON-decode fallback for tool-call arguments
streaming_parser._parse_tool_call returns {} when arguments JSON is invalid. Add a test asserting this behavior to lock in robustness.
Example addition:
@@ def test_tool_call_parsing(self):
-        # Invalid format
-        assert _parse_tool_call("invalid format") is None
+        # Invalid format
+        assert _parse_tool_call("invalid format") is None
+
+        # Invalid JSON should fallback to empty arguments
+        result = _parse_tool_call('Tool:bad arguments:{not-json}')
+        assert result["name"] == "bad"
+        assert result["arguments"] == {}

lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (1)
69-80: Tool-calls reformatting looks correct; consider stricter typing on input
Mapping name -> name and args -> arguments aligns with downstream expectations. Strengthen type hints to signal expected shape.
-    def _format_query_endpoint_tool_calls(
-        self, tool_calls: list
-    ) -> list[list[dict[str, Any]]]:
+    def _format_query_endpoint_tool_calls(
+        self, tool_calls: list[dict[str, Any]]
+    ) -> list[list[dict[str, Any]]]:

Optionally, guard non-list inputs with a quick isinstance check returning [].
Also applies to: 91-108
lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (2)
221-255: Consider adding a test for eval_types deduplication and order preservation
Models deduplicate eval_types while preserving order. Add a case like ["action_eval", "action_eval", "response_eval:accuracy"] and assert order and unique results.
I can draft the test if you want it included here.
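A possible shape for that test, assuming the EvaluationDataConfig field names used throughout this PR (the constructor values and module path below are illustrative assumptions, not the repo's exact fixtures):

```python
# Hypothetical test for test_models.py
from lsc_agent_eval.core.agent_goal_eval.models import EvaluationDataConfig


def test_eval_types_deduplicated_preserving_order():
    config = EvaluationDataConfig(
        eval_id="dedup_case",
        eval_query="What is OpenShift?",
        eval_types=["action_eval", "action_eval", "response_eval:accuracy"],
        eval_verify_script="scripts/verify.sh",  # required by action_eval
        expected_response="OpenShift is a Kubernetes platform.",  # required by response_eval:accuracy
    )
    # Duplicates collapse while first-seen order is kept
    assert config.eval_types == ["action_eval", "response_eval:accuracy"]
```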
498-548: EvaluationStats test is solid; add an ERROR scenario for completeness
Current stats cover PASS/FAIL across conversations and types. Add an EvaluationResult(result="ERROR") to assert errored count increments and success_rate adjusts accordingly.
Happy to provide a concrete snippet.
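For instance, the existing sample_results fixture could gain an ERROR entry in the style of the snippet shown earlier; the exact stats attribute names to assert on are left as a comment since they are not shown in this review:

```python
# Hypothetical ERROR-case entry appended to the sample_results fixture
sample_results.append(EvaluationResult(
    eval_id="test_005",
    query="Create a namespace",
    response="",
    eval_type="action_eval",
    result="ERROR",
    error="Setup script failed: exit code 1",
    conversation_group="conv2",
    conversation_id="conv-id-456",
))
# Then assert the errored count increments and success_rate drops accordingly.
```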
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1)
190-205: Minor: unreachable None-check and clearer failure reason for no tool calls
Because run_evaluation passes tool_calls or [] to _evaluate_tools, actual_tool_calls is never None. Consider removing the None branch and failing explicitly on empty actual calls to improve diagnostics.
-    if actual_tool_calls is None:
-        logger.error("No tool calls provided for evaluation")
-        return False
+    if not actual_tool_calls:
+        logger.error("No tool calls captured from agent for tool_eval")
+        return False

lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (5)
132-139: Solid judge-LLM success path; minor assertion enhancement
The test validates list result semantics, fields, and exact streaming payload with conversation_id. Consider also asserting that query_agent was not called to lock in routing behavior when endpoint_type is not provided (defaults to streaming).

+    # Ensure non-streaming path was not used
+    mock_agent_client.query_agent.assert_not_called()

Also applies to: 141-147, 151-158
423-454: Add assertion that tool_calls propagate into tool_eval results
Since EvaluationResult.tool_calls is populated only for tool_eval, assert it in the success case to guard the data path end-to-end.

 results = runner.run_evaluation(config, "openai", "gpt-4")
 assert len(results) == 1
 assert results[0].result == "PASS"
 assert results[0].eval_type == "tool_eval"
+# Tool calls must be attached for tool_eval results
+assert results[0].tool_calls == [[{"name": "list_versions", "arguments": {}}]]
456-486: Mirror the tool_calls assertion in the failure case
Even when the comparison fails, tool_calls should still be present for diagnostic fidelity.

 results = runner.run_evaluation(config, "openai", "gpt-4")
 assert len(results) == 1
 assert results[0].result == "FAIL"
 assert results[0].eval_type == "tool_eval"
+# Tool calls should be present for debugging even on FAIL
+assert results[0].tool_calls == [[{"name": "wrong_tool", "arguments": {}}]]
517-590: Great all-pass multi-type test; add checks for tool_calls presence only on tool_eval
Strengthens the contract that tool_calls is set for tool_eval and omitted for others.

 assert all(r.conversation_id == "conv-123" for r in results)
+# Only tool_eval should carry tool_calls; others should be None
+by_type = {r.eval_type: r for r in results}
+assert by_type["tool_eval"].tool_calls == [
+    [{"name": "create_namespace", "arguments": {"name": "lightspeed"}}]
+]
+assert by_type["action_eval"].tool_calls is None
+assert by_type["response_eval:sub-string"].tool_calls is None
+assert by_type["response_eval:accuracy"].tool_calls is None
703-739: Endpoint routing parity test covers both modes
Ensures both streaming and query paths yield equivalent results and that each endpoint is invoked as expected. Consider a follow-up test variant for tool_eval on the query endpoint to also validate tool_calls formatting on non-streaming responses.

+# Optional: add a tool_eval config here and assert tool_calls present/equivalent

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (2)
172-178: Use a safe default for endpoint_type to avoid AttributeError
If eval_args is missing endpoint_type (older configs), getattr with a default maintains backward compatibility.

     eval_results = self.evaluation_runner.run_evaluation(
         eval_data,
         self.eval_args.agent_provider,
         self.eval_args.agent_model,
         conversation_id,
-        self.eval_args.endpoint_type,
+        getattr(self.eval_args, "endpoint_type", "streaming"),
     )
193-231: Print tool_calls only when relevant
Currently prints Tool Calls unconditionally when overall_result != PASS. For non-tool_eval batches this is None and can be noisy. Print tool_calls only if present or if any eval_type == "tool_eval".

-    pbar.write(f"  Tool Calls: {results[0].tool_calls}")
+    # Print tool calls only if present (tool_eval) to reduce noise
+    if any(r.tool_calls for r in results):
+        pbar.write(f"  Tool Calls: {results[0].tool_calls}")

lsc_agent_eval/README.md (2)
115-135: YAML example is good; quote hyphenated keyword for consistency
YAML allows unquoted hyphenated scalars, but quoting improves readability and avoids linter warnings in flow-style lists.

-    expected_keywords: ["yes", openshift-lightspeed]
+    expected_keywords: ["yes", "openshift-lightspeed"]
168-179: Usage section reflects endpoint_type and token handling
CLI demonstrates streaming default. Consider punctuating the note for clarity.

-Pass token text file or set `AGENT_API_TOKEN` env var
+Pass token text file or set `AGENT_API_TOKEN` env var.

Also applies to: 180-181
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (21)
- lsc_agent_eval/README.md (8 hunks)
- lsc_agent_eval/sample_data/agent_goal_eval_example.yaml (4 hunks)
- lsc_agent_eval/src/lsc_agent_eval/agent_eval.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (3 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (3 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (5 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/__init__.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (4 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (4 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py (13 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (14 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (21 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_results.py (10 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1 hunks)
- lsc_agent_eval/tests/core/utils/test_api_client.py (4 hunks)
- lsc_agent_eval/tests/core/utils/test_streaming_parser.py (1 hunks)
- lsc_agent_eval/tests/test_agent_eval.py (3 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.
Applied to files:
lsc_agent_eval/tests/core/agent_goal_eval/test_results.pylsc_agent_eval/tests/core/agent_goal_eval/test_models.pylsc_agent_eval/sample_data/agent_goal_eval_example.yamllsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.pylsc_agent_eval/README.md
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Applied to files:
lsc_agent_eval/tests/core/agent_goal_eval/test_results.pylsc_agent_eval/tests/core/agent_goal_eval/test_models.pylsc_agent_eval/sample_data/agent_goal_eval_example.yamllsc_agent_eval/README.md
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
lsc_agent_eval/tests/core/agent_goal_eval/test_results.pylsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.pylsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.pylsc_agent_eval/tests/core/agent_goal_eval/test_models.pylsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.pylsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
lsc_agent_eval/tests/core/agent_goal_eval/test_results.pylsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.pylsc_agent_eval/tests/core/agent_goal_eval/test_models.pylsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.pylsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.pylsc_agent_eval/README.md
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).
Applied to files:
lsc_agent_eval/tests/core/agent_goal_eval/test_results.pylsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.pylsc_agent_eval/tests/core/agent_goal_eval/test_models.pylsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.pylsc_agent_eval/README.md
📚 Learning: 2025-07-16T10:13:42.924Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:112-112
Timestamp: 2025-07-16T10:13:42.924Z
Learning: In the lsc_agent_eval judge LLM evaluation system, the _extract_numeric_result method should only accept numeric values 1 or 0 from judge responses. Boolean keywords like "true", "false", "yes", "no" are not acceptable values for judge LLM responses.
Applied to files:
lsc_agent_eval/tests/core/agent_goal_eval/test_results.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.pylsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
🧬 Code Graph Analysis (11)
lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1)
compare_tool_calls(9-18)
lsc_agent_eval/src/lsc_agent_eval/core/utils/__init__.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1)
parse_streaming_response(12-54)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (4)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1)
compare_tool_calls(9-18)lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
create_evaluation_results(8-56)lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
streaming_query_agent(109-137)query_agent(50-89)lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (3)
ScriptExecutionError(16-17)JudgeModelError(20-21)AgentAPIError(12-13)
lsc_agent_eval/tests/core/utils/test_streaming_parser.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (2)
_parse_tool_call(69-92)parse_streaming_response(12-54)
lsc_agent_eval/tests/core/utils/test_api_client.py (2)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (3)
query_agent(50-89)AgentHttpClient(15-142)streaming_query_agent(109-137)lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1)
AgentAPIError(12-13)
lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (4)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)
EvaluationDataConfig(89-192)lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1)
run_evaluation(34-106)lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (1)
run_evaluation(59-94)lsc_agent_eval/tests/core/agent_goal_eval/test_results.py (1)
sample_results(20-50)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)
EvaluationDataConfig(89-192)EvaluationResult(261-288)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1)
AgentAPIError(12-13)lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1)
parse_streaming_response(12-54)
lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)
EvaluationResult(261-288)EvaluationDataConfig(89-192)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (2)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
create_evaluation_results(8-56)lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1)
run_evaluation(34-106)
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (7)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
streaming_query_agent(109-137)query_agent(50-89)lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (2)
run_evaluation(34-106)EvaluationRunner(20-208)lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (1)
run_evaluation(59-94)lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)
EvaluationResult(261-288)EvaluationDataConfig(89-192)lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (2)
AgentAPIError(12-13)ScriptExecutionError(16-17)lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py (1)
run_script(28-76)lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1)
evaluate_response(85-126)
🔇 Additional comments (33)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)
119-138: LGTM! Well-structured validation logic
The validation for eval_types is well-implemented:
- Properly handles duplicates while preserving order
- Validates each type individually
- Ensures non-empty result
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py (1)
131-132: LGTM! Correctly updated for multi-type evaluation
The loop properly iterates over the new eval_types list structure to collect statistics for each evaluation type.

lsc_agent_eval/src/lsc_agent_eval/agent_eval.py (1)
38-44: ✅ Endpoint_type propagation verified—no action needed
- The --endpoint_type argument is defined in agent_eval.py (lines 38–44).
- It's passed into AgentGoalEval and forwarded in core/agent_goal_eval/agent_goal_eval.py (lines 172–178).
- The evaluator in core/agent_goal_eval/evaluator.py (lines 52–55) uses endpoint_type to switch between streaming and query calls.

Optional: for improved maintainability, you could centralize the allowed choices in a constant and log the selected endpoint at startup (a sketch follows below).
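A minimal sketch of that optional suggestion; the parser setup and logger names are assumptions for illustration, not the repo's CLI code:

```python
import argparse
import logging

logger = logging.getLogger(__name__)

# Hypothetical: keep the allowed endpoint types in one constant so argparse
# choices, validation, and logging all reference the same source of truth.
ENDPOINT_TYPES = ("streaming", "query")

parser = argparse.ArgumentParser("agent_eval")
parser.add_argument(
    "--endpoint_type",
    choices=ENDPOINT_TYPES,
    default="streaming",
    help="Agent API endpoint to use (streaming or query)",
)
args = parser.parse_args(["--endpoint_type", "query"])
logger.info("Selected endpoint_type=%s", args.endpoint_type)
```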
lsc_agent_eval/tests/test_agent_eval.py (1)
32-33: Good coverage for default endpoint_type
Asserting the default "streaming" improves confidence in CLI compatibility.
lsc_agent_eval/src/lsc_agent_eval/core/utils/__init__.py (1)
1-7: Re-export of parse_streaming_response via utils is clean
Public API ergonomics improved and __all__ accurately reflects exports.
lsc_agent_eval/tests/core/agent_goal_eval/test_results.py (1)
27-31: Updated eval_type labels align with the new multi-type schema
Switching to response_eval:accuracy, action_eval, and response_eval:sub-string is consistent with models and evaluator changes.
Also applies to: 36-40, 45-49
lsc_agent_eval/tests/core/agent_goal_eval/test_agent_goal_eval.py (1)
44-46: eval_types transition is correct
Using eval_types lists in fixtures and updated eval_type strings in results matches the multi-type refactor.
Also applies to: 50-52, 64-65, 73-74
lsc_agent_eval/tests/core/utils/test_api_client.py (3)
70-99: Non-streaming response shape and tool_calls formatting validated
The test correctly asserts dict return and formatting into sequences.

106-135: Empty tool_calls handling covered
Good assertion that an empty list returns [] after formatting.

217-265: Streaming path and parser integration coverage is solid
Validates endpoint, timeout, and parser invocation. Nice use of context manager mocking for stream.
lsc_agent_eval/tests/core/utils/test_streaming_parser.py (3)
16-29: Basic streaming parsing test is solid
Covers conversation_id/response extraction and empty tool_calls. Good baseline.

47-65: Error-path expectations are correct
Confirms ValueError messages for missing final response and missing conversation_id. Matches parser behavior.

81-96: Malformed input handling test is valuable
Validates ignoring malformed/empty/non-data lines while still producing final results. Good resilience coverage.
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (1)
109-137: Streaming query path: LGTM
Uses httpx streaming and parse_streaming_response with appropriate error mapping. Fits the new streaming-first flow.
lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (2)
147-164: Adoption of eval_types across model tests is correct
Validates per-type requirements (expected_response for accuracy, etc.). Matches model validators.

256-330: Tool-eval config tests look good, including complex sequences and missing/empty validations
Structure aligns with downstream tool-call comparison.
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1)
75-91: Per-type error handling aligns with FAIL vs ERROR semantics
Exceptions during a single evaluation become ERROR; boolean false becomes FAIL. Matches documented behavior.
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
35-56: Per-evaluation result construction: LGTM
Correctly includes tool_calls only for tool_eval and raises on missing inputs.
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (9)
29-36: Good agent client mock shape and endpoint parity
Mock returns response, conversation_id passthrough/generation, and tool_calls defaulting to empty list; both streaming and query endpoints wired via side_effect. This aligns with the API client contract and keeps tests endpoint-agnostic.
Also applies to: 38-39
62-63: Migration to eval_types looks correct across fixtures
Fixtures consistently use eval_types (list) per the new multi-type model. This prevents accidental single-type assumptions in tests.
Also applies to: 86-87, 96-97
261-272: Substring success test OK; covers case-insensitivity via separate test
Response contains both keywords with different cases; pass condition follows the stricter "all keywords" rule. Looks good.
Also applies to: 275-286
294-304: Substring failure path verified
Negative case correctly checks FAIL with no error. Test isolates absence of required substrings.
Also applies to: 307-318
325-336: Agent API error handling verified
ERROR result with preserved eval_type and error message validated. This matches the contract of converting API exceptions to per-type ERRORs.
Also applies to: 338-343
355-357: Comprehensive substring evaluation semantics covered
All/one/none keywords and case-insensitive scenarios are exercised and match the expected AND semantics from prior learnings.
Also applies to: 360-371, 373-373, 376-387, 389-389, 392-403, 405-405, 408-419, 421-421
495-505: Conversation ID propagation coverage is on point
Test ensures provided conversation_id flows to API and result objects. Good guard for multi-type expansion.
Also applies to: 508-515
591-649: Mixed-results multi-type test is well-constructed
Validates each type independently with expected PASS/FAIL states and shared conversation_id. Nicely isolates tool mismatch and script/judge failures.

650-685: Error propagation across multiple eval types looks correct
Action eval raises ScriptExecutionError resulting in ERROR; sub-string and accuracy behave independently. This mirrors per-type error handling.
Also applies to: 687-702
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (4)
17-17: Importing create_evaluation_results aligns with multi-result flow
Brings in the new utility to standardize result creation across success/error cases.

116-122: Setup failure now generates per-type ERROR results
Extending results with create_evaluation_results for each eval_data is the right behavior under multi-type semantics. Message prefix "Setup script failed:" provides helpful context.
Also applies to: 121-123
181-184: Conversation ID handoff from first eval results is correct
Grabbing conversation_id from the first returned EvaluationResult ensures subsequent evals share context without requiring direct access to the API response.

219-224: Readable, informative multi-result summary
Concise roll-up per eval_id with per-type markers and error messages when applicable. Nicely balances signal vs verbosity.
Also applies to: 245-254
lsc_agent_eval/README.md (2)
9-13: Feature list correctly introduces multi-type and tool_eval
Clear enumeration of available evaluation types and their semantics, including the "ALL keywords" rule.

270-271: By Evaluation Type metrics include tool_eval
This aligns with the new evaluation type set and output expectations.
Resolved review thread (outdated): lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py
VladimirKadlec left a comment:
Some nit comments, little bit too long for a review.
Resolved review threads (outdated):
- lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (×2)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 3
🔭 Outside diff range comments (2)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (1)
50-81: Harden parsing: validate types before calling .strip() and normalize inputs
If the agent returns unexpected types, current code will raise generic exceptions. Fail fast with targeted messages and normalize conversation_id and tool_calls types.

Apply this diff:
-        response_data = response.json()
-        if "response" not in response_data:
-            raise AgentAPIError("Agent response missing 'response' field")
-
-        agent_response = response_data["response"].strip()
-        conversation_id = response_data.get("conversation_id", "").strip()
-        tool_calls = response_data.get("tool_calls", [])
+        response_data = response.json()
+        if "response" not in response_data:
+            raise AgentAPIError("Agent response missing 'response' field")
+
+        resp_value = response_data["response"]
+        if not isinstance(resp_value, str):
+            raise AgentAPIError("Agent response field 'response' must be a string")
+        agent_response = resp_value.strip()
+
+        # Normalize conversation_id to string
+        raw_conv_id = response_data.get("conversation_id", "")
+        conversation_id = str(raw_conv_id).strip() if raw_conv_id is not None else ""
+
+        # Normalize tool_calls to a list
+        tool_calls = response_data.get("tool_calls") or []
+        if not isinstance(tool_calls, list):
+            tool_calls = []

         # Format tool calls to match expected structure (list of sequences)
         formatted_tool_calls = self._format_query_endpoint_tool_calls(tool_calls)

         return {
             "response": agent_response,
             "conversation_id": conversation_id,
             "tool_calls": formatted_tool_calls,
         }

lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (1)
712-748: Handle unsupported endpoint_type gracefully and add missing test
The current run_evaluation implementation will raise an uncaught ValueError for an invalid endpoint_type, causing a crash instead of returning a structured ERROR result. Let's update the exception handling and add coverage:

• In lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py, extend the outer except to include ValueError so that an unsupported endpoint yields an ERROR result:

@@ def run_evaluation(…, endpoint_type: str = "streaming") -> list["EvaluationResult"]:
         else:
             raise ValueError(f"Unsupported endpoint_type: {endpoint_type}")
         …
-    except (AgentAPIError, ScriptExecutionError, JudgeModelError) as e:
+    except (AgentAPIError, ScriptExecutionError, JudgeModelError, ValueError) as e:
         logger.error("Evaluation failed for %s: %s", data_config.eval_id, e)
         return create_evaluation_results(
             data_config, error_message=str(e), conversation_id=conversation_id
         )

• In lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py, add:

def test_run_evaluation_invalid_endpoint_type_returns_error(
    mock_agent_client, mock_script_runner, mock_judge_manager, sample_config_judge_llm
):
    runner = EvaluationRunner(mock_agent_client, mock_script_runner, mock_judge_manager)
    results = runner.run_evaluation(
        sample_config_judge_llm,
        "watsonx",
        "ibm/granite-3-3-8b-instruct",
        "conv-id-123",
        endpoint_type="invalid-mode",
    )
    assert len(results) == 1
    # The eval_type should still be the configured response_eval type
    assert results[0].eval_type == sample_config_judge_llm.eval_types[0]
    assert results[0].result == "ERROR"
    assert "Unsupported endpoint_type" in (results[0].error or "")

This ensures invalid modes no longer crash and conform to FAIL vs ERROR semantics.
♻️ Duplicate comments (2)
lsc_agent_eval/README.md (1)
66-71: Great: tool_eval and expected_tool_calls are now documented
This addresses the previous doc gap and matches the models and examples.
lsc_agent_eval/sample_data/agent_goal_eval_example.yaml (1)
106-111: Update eval_types in sample_data/agent_goal_eval_example.yaml
The eval_types list must use tool_eval (not tool_call) to pass validation.

• File: lsc_agent_eval/sample_data/agent_goal_eval_example.yaml, around line 106
Apply this diff:
 eval_types:
-  - tool_call
+  - tool_eval
   - action_eval
   - response_eval:sub-string
   - response_eval:accuracy
🧹 Nitpick comments (11)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1)
21-28: Tighten type annotations for clarity and static checks
These helpers accept dicts with arbitrary structure; being explicit helps both readers and linters.
Apply this diff to improve the annotations:
-def _compare_tool_call_sequence(expected: list[dict], actual: list[dict]) -> bool:
+def _compare_tool_call_sequence(
+    expected: list[dict[str, Any]], actual: list[dict[str, Any]]
+) -> bool:
@@
-def _compare_single_tool_call(expected: dict, actual: dict) -> bool:
+def _compare_single_tool_call(expected: dict[str, Any], actual: dict[str, Any]) -> bool:
@@
-def _compare_tool_arguments(expected: dict, actual: dict) -> bool:
+def _compare_tool_arguments(expected: dict[str, Any], actual: dict[str, Any]) -> bool:

Also applies to: 48-49, 67-69
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)
23-36: Improve error message readability for invalid eval_type
Using dict_keys([...]) in error messages is noisy. Present a clean, comma-separated list.

Apply this diff:

-    if eval_type not in EVAL_TYPE_REQUIREMENTS:
-        raise ValueError(
-            f"eval_type must be one of {EVAL_TYPE_REQUIREMENTS.keys()}, got '{eval_type}'"
-        )
+    if eval_type not in EVAL_TYPE_REQUIREMENTS:
+        allowed = ", ".join(sorted(EVAL_TYPE_REQUIREMENTS.keys()))
+        raise ValueError(
+            f"eval_type must be one of [{allowed}], got '{eval_type}'"
+        )
100-107: Include tool_eval in the field description
The eval_types field description misses tool_eval, which is otherwise supported across the codebase and README.

Apply this diff:

-    description=(
-        "List of evaluation types."
-        " -> action_eval, response_eval:sub-string, response_eval:accuracy"
-    ),
+    description=(
+        "List of evaluation types."
+        " -> action_eval, response_eval:sub-string, response_eval:accuracy, tool_eval"
+    ),

lsc_agent_eval/README.md (2)
116-136: Consider clarifying tool_calls shape across endpoints
To reduce confusion, add a short note that:
- query endpoint: each tool call is wrapped as its own single-item sequence
- streaming endpoint: tool calls are grouped per turn in a single sequence, in order
Apply this diff to add a clarifying note just after expected_tool_calls:
 - `expected_tool_calls`: Expected tool call sequences (list of lists) with tool_name and arguments (for tool_eval)
+  Note: On the non-streaming "query" endpoint, each tool call is wrapped as its own single-item sequence.
+  On the "streaming" endpoint, multiple tool calls within a turn are grouped in a single sequence in order of occurrence.
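For illustration (the tool names and arguments below are made up), the same two tool calls would be normalized like this on each endpoint:

```python
# Query endpoint: each tool call becomes its own single-item sequence.
query_endpoint_tool_calls = [
    [{"tool_name": "create_namespace", "arguments": {"name": "lightspeed"}}],
    [{"tool_name": "list_versions", "arguments": {}}],
]

# Streaming endpoint: tool calls from one turn are grouped in a single ordered sequence.
streaming_endpoint_tool_calls = [
    [
        {"tool_name": "create_namespace", "arguments": {"name": "lightspeed"}},
        {"tool_name": "list_versions", "arguments": {}},
    ],
]
```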
145-147: Minor grammar fix in example
Use "an" before "openshift-lightspeed".
Apply this diff:
-    eval_query: is there a openshift-lightspeed namespace ?
+    eval_query: is there an openshift-lightspeed namespace?

lsc_agent_eval/tests/core/utils/test_api_client.py (1)
217-313: Optional: add a streaming timeout test for parity
For completeness, add a timeout test to mirror the non-streaming timeout coverage.
Apply this addition to the test file:
+    def test_streaming_query_agent_timeout(self):
+        """Test streaming agent query with timeout."""
+        mock_client = Mock()
+        mock_client.stream.side_effect = httpx.TimeoutException("Request timeout")
+
+        with patch("httpx.Client", return_value=mock_client):
+            client = AgentHttpClient("http://localhost:8080")
+            api_input = {"query": "Test query", "provider": "openai", "model": "gpt-4"}
+            with pytest.raises(AgentAPIError, match="Agent streaming query timeout"):
+                client.streaming_query_agent(api_input)

lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (1)
91-108: Be defensive in tool_calls formatting (skip invalid entries)
Guard against non-list inputs and skip malformed tool call objects instead of assuming shape.
Apply this diff:
     def _format_query_endpoint_tool_calls(
         self, tool_calls: list
     ) -> list[list[dict[str, Any]]]:
         """Format tool calls from query endpoint to match expected structure."""
         if not tool_calls:
             return []
-        formatted_sequences = []
-        for tool_call in tool_calls:
-            # OLS dependency
-            formatted_tool = {
-                "tool_name": tool_call.get("name", ""),
-                "arguments": tool_call.get("args", {}),
-            }
-            formatted_sequences.append([formatted_tool])
+        if not isinstance(tool_calls, list):
+            return []
+
+        formatted_sequences: list[list[dict[str, Any]]] = []
+        for tool_call in tool_calls:
+            if not isinstance(tool_call, dict):
+                continue
+            # OLS dependency
+            formatted_tool = {
+                "tool_name": tool_call.get("name", ""),
+                "arguments": tool_call.get("args", {}),
+            }
+            formatted_sequences.append([formatted_tool])
         return formatted_sequences

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (2)
59-62: Make agent_response parsing resilient to missing keys
If the streaming parser ever omits tool_calls, this will KeyError and crash. Safer to use .get with defaults (especially for tool_calls) so we keep producing FAIL/ERROR results instead of aborting.
Apply this diff:
-    response = agent_response["response"]
-    conversation_id = agent_response["conversation_id"]
-    tool_calls = agent_response["tool_calls"]
+    response = agent_response.get("response", "")  # required by models
+    conversation_id = agent_response.get("conversation_id")
+    tool_calls = agent_response.get("tool_calls", [])
125-129: Clarify TODO or remove it
"TODO(future): We should do tool_eval always ??" is ambiguous. Either capture the intended behavior in an issue or remove the comment.
Proposed rewording:
- “Consider always running tool_eval if tool_calls are present in the response, even when not requested; evaluate trade-offs.”
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
8-16: Strengthen typing for evaluation_results to avoid key typos
evaluation_results is typed as list[dict], which is easy to misuse. Define a TypedDict for the items and use it in the signature for better static checking.
Apply this diff:
-from typing import Optional
+from typing import Optional, TypedDict
@@
+class EvalResultItem(TypedDict):
+    eval_type: str
+    result: str
+    error: Optional[str]
+
 def create_evaluation_results(  # pylint: disable=too-many-arguments,too-many-positional-arguments
     eval_config: EvaluationDataConfig,
     response: str = "",
-    evaluation_results: Optional[list[dict]] = None,
+    evaluation_results: Optional[list[EvalResultItem]] = None,
     error_message: Optional[str] = None,
     conversation_id: Optional[str] = None,
     tool_calls: Optional[list[list[dict]]] = None,
 ) -> list[EvaluationResult]:

lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (1)
185-191: Optional: validate endpoint_type early at CLI layer
You're passing getattr(self.eval_args, "endpoint_type", "streaming") through. Consider validating it against {"streaming", "query"} here to fail fast at the orchestrator layer (still keeping runner-side validation). This improves error messaging near the user input surface.
Example:
endpoint = getattr(self.eval_args, "endpoint_type", "streaming")
if endpoint not in {"streaming", "query"}:
    raise AgentEvaluationError(f"Invalid endpoint_type: {endpoint}")
eval_results = self.evaluation_runner.run_evaluation(..., endpoint_type=endpoint)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- lsc_agent_eval/README.md (9 hunks)
- lsc_agent_eval/sample_data/agent_goal_eval_example.yaml (4 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (3 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (3 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (5 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (4 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (14 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1 hunks)
- lsc_agent_eval/tests/core/utils/test_api_client.py (4 hunks)
- lsc_agent_eval/tests/core/utils/test_streaming_parser.py (1 hunks)
- lsc_agent_eval/tests/test_agent_eval.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- lsc_agent_eval/tests/core/utils/test_streaming_parser.py
- lsc_agent_eval/tests/test_agent_eval.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py
- lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.pylsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.pylsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.pylsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).
Applied to files:
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.pylsc_agent_eval/README.mdlsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.pylsc_agent_eval/README.mdlsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.pylsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.pylsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.
Applied to files:
lsc_agent_eval/README.mdlsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.pylsc_agent_eval/sample_data/agent_goal_eval_example.yaml
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Applied to files:
lsc_agent_eval/README.mdlsc_agent_eval/sample_data/agent_goal_eval_example.yaml
🧬 Code Graph Analysis (6)
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1): AgentAPIError (12-13)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1): parse_streaming_response (14-56)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2): EvaluationDataConfig (95-176), EvaluationResult (245-272)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (3)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1): create_evaluation_results (8-56)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (3): ConversationDataConfig (179-242), EvaluationDataConfig (95-176), EvaluationResult (245-272)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1): run_evaluation (34-108)
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (7)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2): streaming_query_agent (109-137), query_agent (50-89)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (2): run_evaluation (34-108), EvaluationRunner (20-209)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (1): run_evaluation (72-107)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2): EvaluationResult (245-272), EvaluationDataConfig (95-176)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (2): AgentAPIError (12-13), ScriptExecutionError (16-17)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py (1): run_script (28-76)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1): evaluate_response (85-126)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (5)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1): compare_tool_calls (9-18)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1): create_evaluation_results (8-56)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (1): run_evaluation (72-107)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2): streaming_query_agent (109-137), query_agent (50-89)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (3): ScriptExecutionError (16-17), JudgeModelError (20-21), AgentAPIError (12-13)
lsc_agent_eval/tests/core/utils/test_api_client.py (2)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (3): query_agent (50-89), AgentHttpClient (15-142), streaming_query_agent (109-137)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1): AgentAPIError (12-13)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.11)
- GitHub Check: mypy
🔇 Additional comments (17)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (3)
9-18: Solid, minimal public API for tool-call comparison
The top-level compare_tool_calls wrapper is clear and composes well with the internal helpers. Early length check + short-circuit is appropriate.
31-45: Helpful debug logging on mismatch
Index + function name in the debug message makes triage easier when sequences are long. Good attention to diagnosability.
67-98: Strict arguments equality is intentional and well-implemented
The key-set and value equality checks (with explicit logging for missing/extra keys) align with precise tool-call validation needs.
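To make the strict semantics above concrete for readers outside the diff, a minimal, self-contained sketch of the sequence/tool-name/arguments comparison follows. It is illustrative only: the real helpers live in tool_call_eval.py, and this standalone function is hypothetical.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def compare_tool_calls_sketch(
    expected: list[list[dict[str, Any]]], actual: list[list[dict[str, Any]]]
) -> bool:
    """Pass only when every sequence, tool name, and argument set matches exactly."""
    if len(expected) != len(actual):
        logger.debug("Sequence count mismatch: %d != %d", len(expected), len(actual))
        return False
    for seq_idx, (exp_seq, act_seq) in enumerate(zip(expected, actual)):
        if len(exp_seq) != len(act_seq):
            logger.debug("Sequence %d length mismatch", seq_idx)
            return False
        for call_idx, (exp_call, act_call) in enumerate(zip(exp_seq, act_seq)):
            if exp_call.get("tool_name") != act_call.get("tool_name"):
                logger.debug("Tool name mismatch at %d/%d", seq_idx, call_idx)
                return False
            if exp_call.get("arguments", {}) != act_call.get("arguments", {}):
                logger.debug("Arguments mismatch at %d/%d", seq_idx, call_idx)
                return False
    return True


if __name__ == "__main__":
    call = [{"tool_name": "oc_get", "arguments": {"oc_get_args": ["namespaces"]}}]
    print(compare_tool_calls_sketch([call], [call]))  # True
```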
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (4)
9-20: Clear, centralized eval-type requirements
The mapping cleanly encodes per-type field requirements and keeps validation logic declarative.
125-145: Eval-types validator is robust and order-preserving
Deduplication while preserving order, plus per-item validation, is exactly what we want here.
165-176: Per-type requirements check is correct
Iterates selected eval_types and enforces required fields succinctly. Good separation of concerns with the global requirements map.
251-253: Result model now captures tool_calls
Adding tool_calls here aligns the model with the new evaluation type and downstream reporting needs.
lsc_agent_eval/README.md (2)
9-14: Docs correctly enumerate multi-type evaluations (AND semantics for substring)
The list includes tool_eval and explicitly states ALL keywords must be present for substring evaluation, which matches the team's prior decision.
Using team learning: This aligns with the stricter AND semantics captured in our previous learning about substring evaluation.
209-210: Key argument docs look good (endpoint_type)
Defaulting to streaming is documented and consistent with the implementation and tests.
lsc_agent_eval/tests/core/utils/test_api_client.py (5)
66-105: Updated non-streaming query test reflects new return shape and tool_calls formatting
Asserting the dict payload and the nested tool-calls sequence matches the implementation. Good coverage.
106-135: Good addition: empty tool_calls path
Covers the empty-list normalization and ensures a stable shape for consumers.
217-264: Streaming success path is well-tested
Context manager, endpoint invocation, and parser handoff are verified cleanly.
265-295: Parser error propagates with helpful message
Validates ValueError -> AgentAPIError path with an actionable message.
296-313: HTTP error handling test for streaming is appropriate
Mirrors non-streaming behavior and asserts consistent error semantics.
lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2)
109-138: Streaming endpoint integration and error mapping look correct
Using stream with raise_for_status() and delegating parsing to parse_streaming_response with precise error mapping is solid.
50-81: No tuple-style unpacking of query_agent detected; no downstream updates required
A search for destructuring assignments of the form a, b = client.query_agent(...) returned no matches. All call sites (in tests and in evaluator.py) assign the returned dict to a single variable and then access its keys explicitly.
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1)
52-58: Good fix: explicit validation of endpoint_type
Switching on "streaming" vs "query" and raising on unsupported values prevents silent misconfiguration. This matches the framework's ERROR vs FAIL semantics.
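A quick illustration of the endpoint_type dispatch this comment praises. The streaming_query_agent/query_agent method names match this PR's client, but the Protocol and helper below are an illustrative sketch, not the evaluator's actual code.

```python
from typing import Any, Protocol


class AgentClient(Protocol):
    """Subset of the HTTP client surface assumed by this sketch."""

    def streaming_query_agent(self, api_input: dict[str, Any]) -> dict[str, Any]: ...
    def query_agent(self, api_input: dict[str, Any]) -> dict[str, Any]: ...


def query_by_endpoint_type(
    client: AgentClient, api_input: dict[str, Any], endpoint_type: str
) -> dict[str, Any]:
    """Route to the streaming or query endpoint; reject anything else loudly."""
    if endpoint_type == "streaming":
        return client.streaming_query_agent(api_input)
    if endpoint_type == "query":
        return client.query_agent(api_input)
    # An unknown value should surface as a structured ERROR upstream, not silently fail.
    raise ValueError(f"Unsupported endpoint_type: {endpoint_type!r}")
```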
Resolved review threads: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py, lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (outdated)
Force-pushed from 7081a00 to fd493ec
Actionable comments posted: 7
🔭 Outside diff range comments (3)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)
234-241: Avoid O(n^2) duplicate detection for eval_id(s) in a conversation
Current duplicate collection uses count() inside a list comprehension, which is quadratic and also repeats duplicates in the error message. Use a set-based approach for linear-time detection and clean messaging.
Apply this diff:
```diff
-        eval_ids = [eval_config.eval_id for eval_config in self.conversation]
-        duplicates = [eval_id for eval_id in eval_ids if eval_ids.count(eval_id) > 1]
-        if duplicates:
+        eval_ids = [eval_config.eval_id for eval_config in self.conversation]
+        seen: set[str] = set()
+        duplicates_set: set[str] = set()
+        for eid in eval_ids:
+            if eid in seen:
+                duplicates_set.add(eid)
+            else:
+                seen.add(eid)
+        if duplicates_set:
             raise ValueError(
-                f"Duplicate eval_id(s) in conversation '{self.conversation_group}': {duplicates}"
+                f"Duplicate eval_id(s) in conversation '{self.conversation_group}': {sorted(duplicates_set)}"
             )
```
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (2)
131-137: Return False vs raise for missing verify script is inconsistent with ERROR semantics
Missing eval_verify_script is a configuration error. To surface it as ERROR (not FAIL), raise ValueError instead of returning False. The top-level per-type handler already converts ValueError into an ERROR result.
```diff
     def _evaluate_script(self, data_config: "EvaluationDataConfig") -> bool:
         """Evaluate using script execution."""
         if not data_config.eval_verify_script:
-            logger.error("No verify script provided for script evaluation")
-            return False
+            raise ValueError("No verify script provided for script evaluation")
         return self.script_runner.run_script(data_config.eval_verify_script)
```
171-178: Missing judge_manager should be ERROR, not FAIL
Per framework semantics, absence of judge_manager is an execution/config error. Raise ValueError to surface an ERROR result.
```diff
         if not self.judge_manager:
-            logger.error("Judge model manager not available for judge-llm evaluation")
-            return False
+            raise ValueError("Judge model manager not available for judge-llm evaluation")
         if not data_config.expected_response:
-            logger.error("Expected response not provided for judge-llm evaluation")
-            return False
+            raise ValueError("Expected response not provided for judge-llm evaluation")
```
♻️ Duplicate comments (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (1)
236-238: LGTM! Tool calls printing fixed as previously addressed.
The code correctly finds the first result with actual tool_calls data rather than potentially printing empty tool_calls from the first result, addressing the previous review feedback.
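A compact sketch of the selection logic being referenced, assuming results is a list of EvaluationResult objects with an optional tool_calls attribute (the helper name is illustrative). next() with a generator stops at the first result whose tool_calls is truthy, so None and empty lists are skipped.

```python
def first_tool_calls(results):
    """Return tool_calls from the first result that actually carries any, else None."""
    return next((r.tool_calls for r in results if r.tool_calls), None)
```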
🧹 Nitpick comments (11)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1)
33-47: Optimize list comparison with early exit on length mismatch.
The current implementation processes all items even when lengths differ. Consider moving the length check to the beginning for better performance.
Apply this diff for improved efficiency:
```diff
 def _compare_lists(
     expected: list,
     actual: list,
     compare_func: Callable[[Any, Any], bool],
     mismatch_message: str,
 ) -> bool:
     """Compare two lists using a comparison function."""
     if len(expected) != len(actual):
         logger.debug(mismatch_message, len(expected), len(actual))
         return False
+
     for i, (expected_item, actual_item) in enumerate(zip(expected, actual)):
         if not compare_func(expected_item, actual_item):
             logger.debug("Item %d does not match in %s", i, compare_func.__name__)
             return False
     return True
```
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)
114-116: Clarify expected_tool_calls shape or provide a typed schema
The field accepts list[list[dict]] which is flexible but easy to misuse. Consider documenting or enforcing the expected keys (e.g., name/tool_name and arguments) with a typed model to catch shape errors early.
Would you like me to introduce a lightweight Pydantic model (e.g., ToolCall with name: str, arguments: dict[str, Any]) and validators to enforce the nested list-of-sequences structure?
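If the typed-schema route is taken, a minimal sketch could look like the following. It assumes Pydantic (which the models module already uses) and the tool_name/arguments keys used throughout this PR; the class and alias names are hypothetical.

```python
from typing import Any

from pydantic import BaseModel, Field


class ToolCall(BaseModel):
    """One expected tool invocation."""

    tool_name: str = Field(min_length=1)
    arguments: dict[str, Any] = Field(default_factory=dict)


# expected_tool_calls would become list[list[ToolCall]] instead of list[list[dict]],
# so shape errors are caught at config-load time rather than during comparison.
ExpectedToolCalls = list[list[ToolCall]]
```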
lsc_agent_eval/README.md (2)
9-13: Document that substring matching is case-insensitive
The implementation lower-cases both response and keywords. Add "(case-insensitive)" to make behavior explicit for users.
```diff
-- `response_eval:sub-string`: Simple substring matching evaluation (ALL keywords must be present in response)
+- `response_eval:sub-string`: Simple substring matching evaluation (ALL keywords must be present in response; case-insensitive)
```
254-272: Consider documenting tool_calls presence in outputs
CSV/JSON sections don't mention tool_calls. Since EvaluationResult carries tool_calls for tool_eval, note whether:
- CSV excludes tool_calls (OK), and
- JSON includes tool_calls for tool_eval results.
If desired, I can scan the CSV/JSON writers to confirm what’s emitted and propose doc updates accordingly.
lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py (1)
516-519: Integration test reflects eval_types usage
Single-eval integration test now uses eval_types. Consider adding a small case with multiple eval_types in one eval to catch YAML parsing/validation nuances.
I can add a follow-up integration test with combined eval_types (e.g., [response_eval:sub-string, response_eval:accuracy]) if helpful.
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (3)
143-151: Propagate config errors in substring eval as ERROR
Given model-level validation, this branch shouldn't hit, but if it does, it should be treated as a configuration error (ERROR). Raise ValueError rather than returning False.
```diff
         if not data_config.expected_keywords:
-            return False
+            raise ValueError("expected_keywords not provided for sub-string evaluation")
```
59-62: Preserve None when tool_calls is absent to distinguish "missing" vs "empty"
Currently, agent_response.get("tool_calls", []) always yields a list. This makes it impossible for _evaluate_tools to detect a truly missing tool_calls (and raise a config/execution ERROR) vs an empty but present list (agent made no calls → FAIL). Dropping the default keeps this distinction.
```diff
-        tool_calls = agent_response.get("tool_calls", [])
+        tool_calls = agent_response.get("tool_calls")
```
If query_agent/streaming_query_agent always include tool_calls, this change won't affect behavior. Otherwise it will surface absent tool_calls as ERROR (via _evaluate_tools).
191-204: Evaluate-tools error signaling is consistent; consider harmonizing empty vs None semantics
Raising on missing expected_tool_calls and missing actual_tool_calls matches framework ERROR semantics. If you adopt preserving None upstream, this will also properly signal absent tool_calls as ERROR; keep empty list as FAIL to reflect "agent made no calls" (see the sketch after the list below).
Optionally add a short comment documenting the distinction:
- None: field absent → ERROR
- []: field present but empty → FAIL
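A small sketch of how that distinction could read in code; the function and parameter names are illustrative, and plain equality stands in for the real compare_tool_calls helper.

```python
from typing import Any, Optional


def evaluate_tools_sketch(
    expected_tool_calls: Optional[list[list[dict[str, Any]]]],
    actual_tool_calls: Optional[list[list[dict[str, Any]]]],
) -> bool:
    """Raise for configuration/execution problems; return False for a plain FAIL."""
    if expected_tool_calls is None:
        # Config error: tool_eval requested without expected_tool_calls -> ERROR upstream
        raise ValueError("expected_tool_calls not provided for tool_eval")
    if actual_tool_calls is None:
        # Execution error: the response carried no tool_calls field at all -> ERROR upstream
        raise ValueError("tool_calls missing from agent response")
    if not actual_tool_calls:
        # Field present but empty: the agent made no tool calls -> FAIL
        return False
    return expected_tool_calls == actual_tool_calls
```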
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (3)
186-189: Add assertions for tool_calls and judge invocation on failure path
Even on FAIL, tool_calls should be None (since tool_eval isn't requested). Also assert judge was invoked.
Apply this diff:
```diff
         result = results[0]
         assert result.result == "FAIL"
         assert result.error is None
+        assert result.tool_calls is None
+        mock_judge_manager.evaluate_response.assert_called_once()
```
309-319: Verify conversation_id propagation when not provided
In this test you don't pass a conversation_id; the agent response provides one. Assert it's propagated into the result to cover this branch.
Apply this diff:
```diff
         result = results[0]
         assert result.eval_id == "test_003"
         assert result.result == "FAIL"
         assert result.eval_type == "response_eval:sub-string"
         assert result.error is None
+        assert result.conversation_id == "test-conversation-id"
```
586-599: Assert tool_calls are attached to all results when tool_eval is included
Given the current design, tool_calls are propagated to results when tool_eval is among eval_types. Make this explicit to catch regressions.
Apply this diff:
```diff
         assert all(r.eval_id == "multi_pass_test" for r in results)
         assert all(r.query == "create openshift-lightspeed namespace" for r in results)
         assert all(r.conversation_id == "conv-123" for r in results)
+        expected_calls = [
+            [{"tool_name": "create_namespace", "arguments": {"name": "lightspeed"}}]
+        ]
+        assert all(r.tool_calls == expected_calls for r in results)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- lsc_agent_eval/README.md (9 hunks)
- lsc_agent_eval/sample_data/agent_goal_eval_example.yaml (4 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (3 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (3 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (5 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (4 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py (12 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (14 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (21 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1 hunks)
- lsc_agent_eval/tests/core/utils/test_api_client.py (4 hunks)
- lsc_agent_eval/tests/core/utils/test_streaming_parser.py (1 hunks)
- lsc_agent_eval/tests/test_agent_eval.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- lsc_agent_eval/tests/core/utils/test_streaming_parser.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py
- lsc_agent_eval/sample_data/agent_goal_eval_example.yaml
- lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
- lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
- lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.
Applied to files:
- lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
- lsc_agent_eval/README.md
- lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
- lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
- lsc_agent_eval/README.md
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).
Applied to files:
- lsc_agent_eval/README.md
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Applied to files:
- lsc_agent_eval/README.md
- lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
🧬 Code Graph Analysis (7)
lsc_agent_eval/tests/test_agent_eval.py (1)
- lsc_agent_eval/src/lsc_agent_eval/agent_eval.py (1): _args_parser (14-78)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2): EvaluationDataConfig (95-176), EvaluationResult (245-272)
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (6)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2): streaming_query_agent (109-137), query_agent (50-89)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1): run_evaluation (34-108)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2): EvaluationResult (245-272), EvaluationDataConfig (95-176)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (2): AgentAPIError (12-13), ScriptExecutionError (16-17)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/script_runner.py (1): run_script (28-76)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1): evaluate_response (85-126)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (3)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1): create_evaluation_results (16-64)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (3): ConversationDataConfig (179-242), EvaluationDataConfig (95-176), EvaluationResult (245-272)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (1): run_evaluation (34-108)
lsc_agent_eval/tests/core/utils/test_api_client.py (2)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (3): query_agent (50-89), AgentHttpClient (15-142), streaming_query_agent (109-137)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (1): AgentAPIError (12-13)
lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (1)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2): EvaluationResult (245-272), EvaluationDataConfig (95-176)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (4)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1): compare_tool_calls (9-18)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (1): create_evaluation_results (16-64)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (2): streaming_query_agent (109-137), query_agent (50-89)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/exceptions.py (3): ScriptExecutionError (16-17), JudgeModelError (20-21), AgentAPIError (12-13)
🔇 Additional comments (37)
lsc_agent_eval/tests/test_agent_eval.py (3)
32-32: LGTM! Test assertion for default endpoint_type is correct.
The test correctly verifies that endpoint_type defaults to "streaming", which aligns with the argument parser configuration in lsc_agent_eval/src/lsc_agent_eval/agent_eval.py line 27.
55-56: LGTM! Test coverage for query endpoint_type is comprehensive.
The test correctly verifies that when --endpoint_type "query" is provided, it's properly parsed and available in the namespace. This ensures the CLI argument parsing works for both endpoint types.
Also applies to: 70-70
111-125: LGTM! Invalid endpoint_type validation test is well-implemented.
The test correctly verifies that invalid endpoint_type values trigger a SystemExit exception, ensuring proper validation of the choices constraint defined in the argument parser.
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (3)
9-18: LGTM! Clean public API with appropriate type hints.
The main function signature is well-defined with clear type hints for nested tool call structures. The delegation to internal helpers maintains good separation of concerns.
50-66: LGTM! Well-structured tool call comparison logic.
The function correctly compares both tool names and arguments, with appropriate debug logging for mismatches. The separation of name and argument validation is clean.
69-100: LGTM! Comprehensive argument comparison with detailed logging.
The function handles all edge cases well:
- Type validation for both dictionaries
- Missing key detection
- Value mismatch reporting
- Extra key detection
The detailed debug logging will be valuable for troubleshooting evaluation failures.
lsc_agent_eval/tests/core/utils/test_api_client.py (6)
70-76: LGTM! Updated test data reflects real-world tool call structure.
The test data now includes realistic tool call structure with name and args fields that match the OLS API format, providing better test coverage.
91-99: LGTM! Proper validation of tool call formatting.
The test correctly validates that tool calls are formatted into the expected nested sequence structure, ensuring the _format_query_endpoint_tool_calls method works as intended.
106-135: LGTM! Good test coverage for empty tool_calls scenario.
This test ensures that when the API returns empty tool_calls, the client properly handles it and returns an empty list, covering an important edge case.
218-264: LGTM! Comprehensive streaming query test.
The test properly mocks the streaming response and validates that:
- The correct endpoint /v1/streaming_query is called
- The streaming parser is invoked
- The expected result structure is returned
265-295: LGTM! Good error handling test for streaming parser.
The test correctly verifies that ValueError from the streaming parser is properly caught and re-raised as AgentAPIError with appropriate error context.
296-331: LGTM! Complete error scenario coverage for streaming.
The timeout and HTTP error tests ensure robust error handling for the streaming endpoint, maintaining consistency with the non-streaming error handling patterns.
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/utils.py (3)
8-14: LGTM! Well-defined TypedDict for evaluation results.
The EvalResultItem structure provides clear typing for individual evaluation results with appropriate optional error field.
40-40: LGTM! Consistent tool_calls handling addressed.
The implementation correctly attaches tool_calls only to tool_eval results in both global error and per-evaluation paths, maintaining consistency as requested in previous reviews.
Also applies to: 57-58
61-63: LGTM! Proper input validation with clear error message.
The mutual exclusivity check ensures the function is called correctly and provides a helpful error message when neither required parameter is provided.
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/agent_goal_eval.py (7)
25-36: LGTM! Clean utility function for result markers.
The helper function provides consistent visual markers for different evaluation outcomes, improving the user experience with clear status indicators.
129-134: LGTM! Proper multi-result error handling for setup failures.
The code correctly uses create_evaluation_results to generate error results for all evaluation types when setup script fails, and properly extends the results list.
185-191: LGTM! Endpoint-type parameter integration.
The evaluation runner correctly receives the endpoint_type parameter with a sensible default fallback, enabling proper routing between streaming and query endpoints.
194-196: LGTM! Conversation ID handling for multi-result context.
The logic correctly extracts the conversation_id from the first result in the list, maintaining conversation continuity across multiple evaluation types.
206-230: LGTM! Enhanced multi-result printing logic.
The refactored printing function properly:
- Computes overall status across multiple evaluation types
- Displays all eval types in the summary line
- Uses consistent result markers
247-249: LGTM! Expected tool calls debugging support added.
The addition of expected_tool_calls in the debug output provides valuable context for troubleshooting tool evaluation failures.
253-257: LGTM! Individual result breakdown enhances debugging.
The detailed per-result output with individual markers and error messages provides clear visibility into which specific evaluation types failed and why.
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (4)
9-20: Good centralization of per-type requirements
EVAL_TYPE_REQUIREMENTS clearly encodes allowed eval types and their required fields. This is a solid foundation for consistent validation across the model. No action needed.
23-36: Tight eval_type validation is correct and user-friendly
The validator handles type, emptiness, and allowed values with a helpful error message listing allowed types. Looks good.
125-145: Order-preserving dedupe on eval_types is great
Validating each type and de-duplicating while preserving order prevents subtle UX issues. Nicely done.
251-256: Tool calls captured in results aligns with tool_eval
Including tool_calls only for tool_eval via downstream utils is appropriate. No issues.
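For readers skimming the review, a hedged sketch of what a requirements mapping and its check might look like; the exact constant and required fields in models.py may differ.

```python
# Illustrative mapping of each allowed eval type to the config fields it requires.
EVAL_TYPE_REQUIREMENTS_SKETCH: dict[str, list[str]] = {
    "action_eval": ["eval_verify_script"],
    "response_eval:sub-string": ["expected_keywords"],
    "response_eval:accuracy": ["expected_response"],
    "tool_eval": ["expected_tool_calls"],
}


def missing_fields(eval_types: list[str], config: dict[str, object]) -> list[str]:
    """Return 'eval_type: field' entries for required fields absent from the config."""
    missing = []
    for eval_type in eval_types:
        for field in EVAL_TYPE_REQUIREMENTS_SKETCH.get(eval_type, []):
            if not config.get(field):
                missing.append(f"{eval_type}: {field}")
    return missing
```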
lsc_agent_eval/README.md (1)
66-71: No changes needed: "tool_name" is already used consistently
I verified that every tool-call definition uses the tool_name key; there are no top-level name keys in code, samples, or docs:
• lsc_agent_eval/README.md (all examples use tool_name)
• lsc_agent_eval/sample_data/agent_goal_eval_example.yaml (uses tool_name)
• streaming_parser.py and tool_call_eval.py expect and emit tool_name
Since there's no discrepancy to resolve, you can ignore the suggestion to switch to name.
Likely an incorrect or invalid review comment.
lsc_agent_eval/tests/core/agent_goal_eval/test_eval_data.py (3)
291-294: Assertion matches new validator message
The expectation for "eval_type must be one of …" aligns with models._validate_eval_type. Good coverage for invalid types.
24-45: Nice coverage for both sub-string and accuracy eval_types in basic YAML
This validates multi-type migration in data manager. Looks good.
150-160: Model-level requirement assertions are consistent with EVAL_TYPE_REQUIREMENTS
These tests ensure missing required fields per eval_type are surfaced early. Solid.
Also applies to: 200-206, 224-231, 268-278, 296-307, 321-331, 346-356
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py (2)
52-58: Validating endpoint_type prevents silent misconfig
Explicitly handling "streaming" vs "query" and raising on anything else is the right call. This ensures misconfigurations are surfaced as structured ERRORs upstream.
63-102: Per-type result aggregation and tool_calls forwarding look correct
The loop cleanly records PASS/FAIL/ERROR per eval_type and only forwards tool_calls when tool_eval is requested. Matches the new result model/flow.
lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (3)
94-142: Great multi-type result scenario coverage
This test validates that multiple EvaluationResult entries can co-exist for the same eval_id with different eval_types and outcomes. Good signal for downstream stats.
341-373: Model requirement tests are spot-on
These assert the EVAL_TYPE_REQUIREMENTS mapping is enforced. Nicely done.
495-548: Stats by eval_type updated and validated
Ensures the new eval_type names populate stats correctly. LGTM.
lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (2)
28-39: Solid agent client mock setup and consistent response shape
Good use of a single mock function for both streaming and query paths, and returning a stable payload including conversation_id and tool_calls. This keeps tests deterministic across endpoint types.
353-375: Nice coverage of strict AND semantics for substring evaluation
These cases exercise all/one/none/CI paths and align with the stricter ALL-keywords requirement noted in previous changes.
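For context, a compact sketch of the ALL-keywords (logical AND), case-insensitive check these tests exercise; the real logic lives in _evaluate_substring, so this standalone function is illustrative.

```python
def substring_eval_sketch(response: str, expected_keywords: list[str]) -> bool:
    """PASS only if every expected keyword appears in the response (case-insensitive)."""
    haystack = response.lower()
    return all(keyword.lower() in haystack for keyword in expected_keywords)


assert substring_eval_sketch("Yes, Lightspeed is available", ["yes", "lightspeed"])
assert not substring_eval_sketch("Lightspeed is available", ["yes", "lightspeed"])
```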
Actionable comments posted: 1
🧹 Nitpick comments (16)
lsc_agent_eval/tests/core/utils/test_api_client.py (2)
70-76: Good coverage of tool_calls mapping from query endpoint
- Validates transforming backend shape {"name","args"} into the internal {"tool_name","arguments"} format and wrapping into sequences. This aligns with the framework’s expected_tool_calls schema.
- Consider adding a test for when the backend omits the tool_calls key entirely (not just an empty list) to ensure we default to [] and maintain a stable return shape.
Example additional test to add:
```python
def test_query_agent_success_missing_tool_calls_key():
    mock_response = Mock()
    response_text = "There are 80 namespaces."
    mock_response.json.return_value = {
        "response": response_text,
        "conversation_id": "conv-id-123",
        # tool_calls key intentionally omitted
    }
    mock_response.raise_for_status.return_value = None

    mock_client = Mock()
    mock_client.post.return_value = mock_response

    with patch("httpx.Client", return_value=mock_client):
        client = AgentHttpClient("http://localhost:8080")
        api_input = {
            "query": "How many namespaces are there?",
            "provider": "watsonx",
            "model": "ibm/granite-3-3-8b-instruct",
        }
        result = client.query_agent(api_input)

    assert result["response"] == response_text
    assert result["conversation_id"] == "conv-id-123"
    assert result["tool_calls"] == []  # default shape
```
217-265: Streaming path wiring and parser invocation look right
- Verifies POST /v1/streaming_query with the input payload and timeout.
- Properly delegates to parse_streaming_response and returns its result.
- Minor enhancement: add a (separate) test that initializes the client with an auth token (env or file) and verifies Authorization header is propagated for the streaming path as well.
Example test addition:
@patch("builtins.open", mock_open(read_data="stream-token-xyz")) def test_streaming_query_agent_with_token_file(): mock_response = Mock() mock_response.raise_for_status.return_value = None cm = MagicMock() cm.__enter__.return_value = mock_response cm.__exit__.return_value = None mock_client = Mock() mock_client.stream.return_value = cm expected_result = {"response": "ok", "conversation_id": "c1", "tool_calls": []} with ( patch("httpx.Client", return_value=mock_client) as client_ctor, patch("lsc_agent_eval.core.utils.api_client.parse_streaming_response") as parser, ): parser.return_value = expected_result client = AgentHttpClient("http://localhost:8080", "token.txt") api_input = {"query": "q", "provider": "p", "model": "m"} result = client.streaming_query_agent(api_input) assert result == expected_result # Verify auth header update happened on the client used for streaming client_ctor.return_value.headers.update.assert_called_once_with( {"Authorization": "Bearer stream-token-xyz"} )lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (6)
20-37: Eval type changes reflected correctly in EvaluationResult testsUpdated eval_type literals to the new canonical forms and kept result validation intact. Consider adding a small test that constructs an EvaluationResult with tool_calls set (to exercise the new field on the result model).
Example:
def test_evaluation_result_with_tool_calls(): tool_calls = [[{"tool_name": "oc_get", "arguments": {"oc_get_args": ["pods"]}}]] result = EvaluationResult( eval_id="tool_res_001", query="list pods", response="Found 3 pods", tool_calls=tool_calls, eval_type="tool_eval", result="PASS", ) assert result.tool_calls == tool_callsAlso applies to: 38-55, 56-67, 68-80
94-142: Nice addition validating multiple results for the same core queryCovers coexistence of action_eval, response_eval:sub-string, and response_eval:accuracy for the same eval_id/conversation. Consider adding a tool_eval case with a non-empty tool_calls to ensure mixed-type handling extends to tool eval results too.
Example extension:
results.append( EvaluationResult( eval_id="multi_001", query="create openshift-lightspeed namespace", response="tools used", tool_calls=[[{"tool_name": "oc_get", "arguments": {"oc_get_args": ["ns"]}}]], eval_type="tool_eval", result="PASS", conversation_group="test_conv", conversation_id="conv-123", ) ) assert "tool_eval" in [r.eval_type for r in results]
147-216: EvaluationDataConfig tests updated to eval_types are soundCovers accuracy, action_eval, and sub-string variants with the new list-based eval_types. To harden behavior, add a test that supplies duplicate eval types and asserts they are de-duplicated in order (as per validate_eval_types).
Example:
def test_evaluation_data_config_eval_types_deduplication(): cfg = EvaluationDataConfig( eval_id="dedup_1", eval_query="Q", eval_types=["action_eval", "action_eval", "response_eval:sub-string", "response_eval:sub-string"], expected_keywords=["k"], eval_verify_script=None, ) assert cfg.eval_types == ["action_eval", "response_eval:sub-string"]
221-255: Good multi-type config test with script+substring+accuracyValidates multiple requirements simultaneously (keywords, response, script). One more edge case worth testing: expected_keywords input containing whitespace/empty strings to confirm trimming and filtering behavior, and that an all-empty list raises a validation error (per validate_keywords).
Example:
def test_expected_keywords_trimming_and_empty_filtering(): cfg = EvaluationDataConfig( eval_id="kw_1", eval_query="Q", eval_types=["response_eval:sub-string"], expected_keywords=[" alpha ", "", " " , "beta"], ) assert cfg.expected_keywords == ["alpha", "beta"] def test_expected_keywords_all_empty_raises(): with pytest.raises(ValidationError): EvaluationDataConfig( eval_id="kw_2", eval_query="Q", eval_types=["response_eval:sub-string"], expected_keywords=["", " "], )
331-470: Conversation-level validation looks comprehensive
- Covers minimal config, scripts path validation, group name empty, non-existent script, and duplicate eval_ids within a conversation.
- Consider adding a positive test demonstrating the same eval_id used across different conversation_group values is allowed (and maybe logs a warning), aligning with design notes.
Example:
def test_same_eval_id_across_conversation_groups_allowed(): conv1 = ConversationDataConfig( conversation_group="g1", conversation=[ EvaluationDataConfig( eval_id="shared", eval_query="Q1", eval_types=["response_eval:accuracy"], expected_response="R1", ) ], ) conv2 = ConversationDataConfig( conversation_group="g2", conversation=[ EvaluationDataConfig( eval_id="shared", eval_query="Q2", eval_types=["response_eval:accuracy"], expected_response="R2", ) ], ) assert conv1.conversation[0].eval_id == conv2.conversation[0].eval_id == "shared"
498-548: Stats aggregation test looks goodCovers totals, by-conversation, and by-eval-type for three types. Optional: add a tool_eval result to assert inclusion in by_eval_type and ensure overall counts stay consistent.
lsc_agent_eval/README.md (8)
66-71: Clarify expected_tool_calls schema and naming ("tool_name", "arguments")
To reduce configuration mistakes, strengthen the wording and include a compact example under the field list.
Apply this refinement:
```diff
-- `expected_tool_calls`: Expected tool call sequences (list of lists) with tool_name and arguments (for tool_eval)
+- `expected_tool_calls`: Expected tool call sequences (list of lists) for tool_eval.
+  Each tool call entry must use:
+  - `tool_name`: string
+  - `arguments`: object/dict of named parameters
+
+  Example:
+  expected_tool_calls:
+    - - tool_name: oc_get
+        arguments:
+          oc_get_args: [namespaces]
```
73-75: Tighten note on duplicate eval_id wording
Minor clarity/grammar tweak.
```diff
-Note: `eval_id` can't contain duplicate values within a conversation group. But it is okay for cross conversation group (A warning is logged anyway for awareness)
+Note: `eval_id` cannot be duplicated within the same conversation group. Reusing the same `eval_id` across different conversation groups is allowed (a warning is logged for awareness).
```
116-136: Tool call examples match implementation (tool_name + arguments)
Examples align with the comparator/formatter expectations. Consider adding an additional example that demonstrates multiple tool calls within a single sequence for completeness.
Example:
```yaml
expected_tool_calls:
  - - tool_name: create_pod
      arguments: {name: demo, image: nginx}
    - tool_name: apply_service
      arguments: {pod_name: demo, port: 80}
```
144-147: Consistent quoting in YAML example
Optional consistency: quote lightspeed in the expected_keywords list to match "yes".
```diff
-    expected_keywords: ["yes", lightspeed]
+    expected_keywords: ["yes", "lightspeed"]
```
181-182: Token guidance is helpful; consider noting precedence
If both AGENT_API_TOKEN and agent_auth_token_file are provided, note which one wins to avoid ambiguity.
Proposed addition immediately after the sentence:
"If both are provided, agent_auth_token_file takes precedence."
246-248: Clarify FAIL vs ERROR semantics
Aligns with framework behavior but could be clearer for readers at a glance.
```diff
-- **Evaluation Failure**: Individual evaluation marked as ERROR or FAIL
+- **Evaluation Failure**:
+  - `FAIL`: The evaluation ran successfully but did not meet the criteria (e.g., incorrect response, script verification failed).
+  - `ERROR`: An error occurred during evaluation execution (e.g., setup failed, agent API error).
```
254-265: Consider documenting tool_calls in CSV output (optional)
If CSV includes tool_calls or a truncated version for tool_eval, add to the column list. If not, consider adding it to aid debugging of tool evaluations.
Proposed addition:
```diff
 - `response`: The agent's response
+- `tool_calls`: Serialized tool calls (if present; for tool_eval)
 - `error`: Error message (if any)
```
26-28: Prerequisites note: streaming response expectations (optional)
Since streaming is now the default, consider noting that the Agent API's streaming endpoint must emit a compatible stream (e.g., chunked JSON tokens) that parse_streaming_response can handle.
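To illustrate what "a compatible stream" might mean here, a hedged sketch of accumulating a response from SSE-style `data: {...}` lines follows. The event keys (token, conversation_id, tool_calls) are assumptions made for illustration; the authoritative contract is whatever parse_streaming_response in streaming_parser.py actually implements.

```python
import json
from typing import Any, Iterable


def parse_stream_sketch(lines: Iterable[str]) -> dict[str, Any]:
    """Build {response, conversation_id, tool_calls} from 'data: {...}' lines."""
    response_parts: list[str] = []
    conversation_id = None
    tool_calls: list[list[dict[str, Any]]] = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # ignore comments, keep-alives, or blank separators
        event = json.loads(line[len("data:"):])
        if "conversation_id" in event:
            conversation_id = event["conversation_id"]
        if "token" in event:
            response_parts.append(event["token"])
        if "tool_calls" in event:
            tool_calls.append(event["tool_calls"])
    return {
        "response": "".join(response_parts),
        "conversation_id": conversation_id,
        "tool_calls": tool_calls,
    }
```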
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- lsc_agent_eval/README.md (9 hunks)
- lsc_agent_eval/sample_data/agent_goal_eval_example.yaml (4 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py (4 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py (1 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py (14 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (21 hunks)
- lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1 hunks)
- lsc_agent_eval/tests/core/utils/test_api_client.py (4 hunks)
- lsc_agent_eval/tests/core/utils/test_streaming_parser.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py
- lsc_agent_eval/tests/core/utils/test_streaming_parser.py
- lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py
- lsc_agent_eval/src/lsc_agent_eval/core/utils/streaming_parser.py
- lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py
- lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py
- lsc_agent_eval/sample_data/agent_goal_eval_example.yaml
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.
Applied to files:
- lsc_agent_eval/README.md
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
📚 Learning: 2025-08-13T14:07:44.146Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.146Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Applied to files:
- lsc_agent_eval/README.md
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
📚 Learning: 2025-08-13T14:07:44.147Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.147Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Applied to files:
- lsc_agent_eval/README.md
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).
Applied to files:
- lsc_agent_eval/README.md
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
- lsc_agent_eval/README.md
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Applied to files:
- lsc_agent_eval/README.md
- lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
lsc_agent_eval/tests/core/agent_goal_eval/test_models.py
🧬 Code Graph Analysis (1)
lsc_agent_eval/tests/core/agent_goal_eval/test_models.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (2)
EvaluationResult (245-272), EvaluationDataConfig (95-176)
🔇 Additional comments (5)
lsc_agent_eval/tests/core/utils/test_api_client.py (4)
106-135: Empty tool_calls handling is correct
Confirms we return an empty list when the backend provides tool_calls: [] and keep the dict-based API shape. Looks good.
266-295: Parser error mapping is robust
Catches ValueError from the streaming parser and raises AgentAPIError with a clear message. Good error surface for callers.
296-312: Timeout behavior for streaming is correctly captured
Maps httpx.TimeoutException to a dedicated AgentAPIError. Good.
313-330: HTTP error propagation on streaming is correct
Asserts proper wrapping of HTTPStatusError with status code included. Looks solid.
lsc_agent_eval/README.md (1)
9-14: Clear summary of multi-type evaluation including tool_eval
The bullets accurately reflect the new eval_types surface and the tool_eval addition. No changes required.
@tisnik There are a few nitpicks and one good review comment. I will address them in a follow-up PR. Please merge this if possible.
Added tool call comparison capability.
Will create follow-up stories to standardise responses of lightspeed-stack (less logic on the eval side).
Summary by CodeRabbit
New Features
Documentation
Tests