Conversation
Co-authored-by: Itzik Ezra <iezra@redhat.com>
Walkthrough

Adds a new per-turn custom metric "intent_eval" for intent alignment. Introduces an INTENT_EVALUATION_PROMPT, handler logic, validation requirements, config metadata, and TurnData.expected_intent. Updates README to document the metric, examples, and usage.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Runner as Evaluator
    participant CM as CustomMetrics
    participant Handler as _evaluate_intent
    participant LLM as LLM
    participant Parser as Score Parser
    Runner->>CM: evaluate("custom:intent_eval", turn)
    CM->>Handler: _evaluate_intent(turn)
    alt Missing fields
        Handler-->>Runner: NotApplicable / Validation error
    else Valid turn
        Handler->>LLM: Prompt(query, response, expected_intent)
        LLM-->>Handler: Text output
        Handler->>Parser: Parse score (0/1) and reason
        Parser-->>Handler: score, reason or parse error
        Handler-->>Runner: Result (score, reason) or parse error message
    end
    note over Handler,LLM: Uses INTENT_EVALUATION_PROMPT
```
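The "Parse score (0/1) and reason" step in the diagram maps to the `_parse_score_response` helper referenced in the nitpick diff further down. Its actual body is not shown in this review; a minimal sketch, assuming the prompt asks the LLM to reply with `Score:` and `Reason:` lines, could look like this:

```python
import re
from typing import Optional, Tuple


def parse_score_response(llm_response: str) -> Tuple[Optional[int], str]:
    """Extract a binary score and a free-text reason from an LLM reply.

    Assumes the prompt requested output of the form:
        Score: <0 or 1>
        Reason: <short explanation>
    Returns (None, <raw text>) when no score is found, mirroring the
    "could not parse score" branch in the handler.
    """
    score_match = re.search(r"score\s*[:=]\s*([01])", llm_response, re.IGNORECASE)
    reason_match = re.search(r"reason\s*[:=]\s*(.+)", llm_response, re.IGNORECASE | re.DOTALL)

    score = int(score_match.group(1)) if score_match else None
    reason = reason_match.group(1).strip() if reason_match else llm_response.strip()
    return score, reason
```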
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/metrics/custom/custom.py (1)
203-243: Handle None response safely and consider consistent return format. The implementation follows the correct pattern for turn-level evaluation with proper validation and error handling. However, there are two considerations:

Line 226: The code uses `response=response`, which could be None. For consistency with `_evaluate_answer_correctness` (line 154), consider using `response=response or ""` to handle None values safely.

Line 241: Returns just `reason`, while `_evaluate_answer_correctness` returns a formatted string like `f"Custom answer correctness: {score:.2f} - {reason}"`. For consistency, consider returning a formatted string, though the simpler format may be intentional for binary evaluation.

Apply this diff if you want to align with the answer_correctness pattern:
```diff
         prompt = INTENT_EVALUATION_PROMPT.format(
             query=query,
-            response=response,
+            response=response or "",
             expected_intent=expected_intent,
         )

         # Make LLM call and parse response
         try:
             llm_response = self._call_llm(prompt)
             score, reason = self._parse_score_response(llm_response)
             if score is None:
                 return (
                     None,
                     f"Could not parse score from LLM response: {llm_response[:100]}...",
                 )
-            return score, reason
+            return score, f"Intent evaluation: {'match' if score == 1 else 'no match'} - {reason}"
         except LLMError as e:
             return None, f"Intent evaluation failed: {str(e)}"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- README.md (5 hunks)
- config/system.yaml (1 hunks)
- src/lightspeed_evaluation/core/metrics/custom/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/custom/custom.py (3 hunks)
- src/lightspeed_evaluation/core/metrics/custom/prompts.py (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (1 hunks)
- src/lightspeed_evaluation/core/system/validator.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
- README.md
- config/system.yaml
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/metrics/custom/__init__.py (1)
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)
evaluate_tool_calls(10-34)
src/lightspeed_evaluation/core/metrics/custom/custom.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
TurnData (35-135)
src/lightspeed_evaluation/core/system/exceptions.py (1)
LLMError(24-25)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.11)
- GitHub Check: mypy
🔇 Additional comments (10)
src/lightspeed_evaluation/core/models/data.py (1)
62-64: LGTM! The `expected_intent` field is properly defined with appropriate validation (min_length=1) and follows the same pattern as other optional fields like `expected_response`. The field description clearly indicates its purpose for intent evaluation.
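For reference, a minimal sketch of the field shape the review describes, assuming the same optional-string pattern as `expected_response` (the model is trimmed to the fields discussed here, and the description text is illustrative rather than the exact wording in data.py):

```python
from typing import Optional

from pydantic import BaseModel, Field


class TurnData(BaseModel):
    """Per-turn evaluation data (trimmed to the fields relevant here)."""

    query: str
    response: Optional[str] = None
    expected_response: Optional[str] = None
    # New in this PR: the intent the response is expected to express.
    expected_intent: Optional[str] = Field(
        default=None,
        min_length=1,
        description="Expected intent of the response, used by custom:intent_eval",
    )
```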
config/system.yaml (1)

73-75: LGTM! The metric configuration is consistent with existing custom metrics. The binary threshold (1) correctly aligns with the intent evaluation's binary scoring model (0 or 1), and the description clearly communicates the metric's purpose.
src/lightspeed_evaluation/core/metrics/custom/prompts.py (2)
3-4: LGTM! The pylint directive is appropriate given the nature of evaluation prompts, which require long strings for clear instructions.

24-46: LGTM! The prompt is well-structured with clear evaluation criteria, helpful examples, and explicit binary scoring instructions. The format requirements ensure consistent LLM responses for parsing.
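The actual prompt text lives in prompts.py; purely as an illustration of the structure the review describes (criteria, binary score, reason, fixed output format), and using the placeholders the handler passes in (`query`, `response`, `expected_intent`), a sketch might look like:

```python
# Illustrative only - the real INTENT_EVALUATION_PROMPT in prompts.py has its own wording.
INTENT_EVALUATION_PROMPT = """You are evaluating whether a response matches an expected intent.

Question: {query}
Response: {response}
Expected intent: {expected_intent}

Score 1 if the response addresses the expected intent, otherwise score 0.

Respond in exactly this format:
Score: <0 or 1>
Reason: <one short sentence>
"""
```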
src/lightspeed_evaluation/core/metrics/custom/custom.py (2)
8-11: LGTM! The import is properly structured and follows the existing pattern for importing prompts.

32-32: LGTM! The new metric is correctly registered in the supported_metrics dictionary following the established pattern.
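The registration line itself is not shown in this review; assuming supported_metrics maps bare metric names to bound handler methods (with the "custom:" prefix stripped before dispatch), the pattern is presumably along these lines:

```python
# Sketch of the dispatch pattern the review refers to; names other than
# _evaluate_intent and supported_metrics are assumptions for illustration.
class CustomMetrics:
    def __init__(self) -> None:
        self.supported_metrics = {
            "answer_correctness": self._evaluate_answer_correctness,
            "intent_eval": self._evaluate_intent,  # new entry added by this PR
        }

    def evaluate(self, metric_name, turn_data):
        # metric_name is assumed to arrive without the "custom:" prefix here
        handler = self.supported_metrics.get(metric_name)
        if handler is None:
            raise ValueError(f"Unsupported custom metric: {metric_name}")
        return handler(turn_data)

    def _evaluate_answer_correctness(self, turn_data): ...

    def _evaluate_intent(self, turn_data): ...
```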
src/lightspeed_evaluation/core/system/validator.py (1)
47-50: LGTM! The validation requirement is properly configured with the correct required fields ("response" and "expected_intent") and follows the established pattern for metric requirements. This ensures proper validation before evaluation is attempted.
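Presumably the validator keeps a per-metric map of required TurnData fields; a minimal sketch of that check, with the mapping shape and the answer_correctness entry assumed for illustration, is:

```python
from typing import Dict, List

# Required TurnData fields per metric. Only the intent_eval entry is taken from
# this review; the dict name and the answer_correctness entry are assumptions.
METRIC_REQUIREMENTS: Dict[str, List[str]] = {
    "custom:answer_correctness": ["response", "expected_response"],
    "custom:intent_eval": ["response", "expected_intent"],
}


def missing_fields_for_metric(metric: str, turn: dict) -> List[str]:
    """Return the required fields that are absent or empty for the given metric."""
    required = METRIC_REQUIREMENTS.get(metric, [])
    return [field for field in required if not turn.get(field)]
```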
README.md (2)
88-88: LGTM! The metric is properly documented with a clear description and correct reference to the implementation file.

140-142: LGTM! The documentation comprehensively covers the new metric across multiple sections:
- Configuration examples show proper threshold and description
- Data structure examples demonstrate expected_intent usage
- Table entries accurately describe the field requirements
- Usage examples clarify when the field is required
The documentation is clear, complete, and consistent with existing patterns.
Also applies to: 224-225, 230-230, 287-288, 298-299
src/lightspeed_evaluation/core/metrics/custom/__init__.py (1)
4-7: LGTM! The new prompt constant is properly exposed in the public API following the established pattern. The import structure and `__all__` list are correctly updated with clear organization.

Also applies to: 13-15
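In other words, the package surface presumably ends up roughly like the sketch below; INTENT_EVALUATION_PROMPT is the newly exposed constant, while evaluate_tool_calls is shown only as an example of an existing export and the full import list is not reproduced here:

```python
# Sketch of src/lightspeed_evaluation/core/metrics/custom/__init__.py (abridged).
from .prompts import INTENT_EVALUATION_PROMPT
from .tool_eval import evaluate_tool_calls

__all__ = [
    "INTENT_EVALUATION_PROMPT",
    "evaluate_tool_calls",
]
```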
@VladimirKadlec @tisnik PTAL
VladimirKadlec
left a comment
Nice 💪, let's add a link to the Jira issue to the description.
LGTM
Adding intent eval (feature parity with lsc-agent-eval)
Originally added with PR #46
LCORE-734
Summary by CodeRabbit
New Features
Documentation