
add intent eval#77

Merged
tisnik merged 1 commit into lightspeed-core:main from asamal4:intent-eval
Oct 10, 2025

Conversation

@asamal4
Collaborator

@asamal4 asamal4 commented Oct 10, 2025

Adds intent evaluation (feature parity with lsc-agent-eval).
Originally added with PR #46

LCORE-734

Summary by CodeRabbit

  • New Features

    • Added a new per-turn metric for intent evaluation (custom:intent_eval) to assess whether responses align with the expected intent.
    • Supports binary scoring (threshold = 1) and appears alongside existing custom metrics.
    • Validation now requires expected_intent when using this metric.
  • Documentation

    • Updated usage guides and examples to include expected_intent and intent_eval in per-turn data and metrics.
    • Clarified when expected_intent is required and provided example turn_metrics entries reflecting the new metric.

Co-authored-by: Itzik Ezra <iezra@redhat.com>
@coderabbitai
Contributor

coderabbitai bot commented Oct 10, 2025

Walkthrough

Adds a new per-turn custom metric “intent_eval” for intent alignment. Introduces an INTENT_EVALUATION_PROMPT, handler logic, validation requirements, config metadata, and TurnData.expected_intent. Updates README to document the metric, examples, and usage.
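
To make the data shape concrete, here is a hedged sketch of a per-turn entry using the new field, written as a Python dict (the project's evaluation data files are YAML; the expected_intent, turn_metrics, and custom:intent_eval names come from this PR, everything else is illustrative).

# Hypothetical per-turn entry; only expected_intent, turn_metrics and
# custom:intent_eval are taken from this PR, the rest is illustrative.
turn = {
    "query": "How do I scale a deployment to three replicas?",
    "response": "Use `oc scale deployment/<name> --replicas=3`.",
    "expected_intent": "Explain how to scale a deployment",
    "turn_metrics": ["custom:intent_eval"],  # binary metric: 1 = intent matched, 0 = not
}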

Changes

  • Docs (README.md): Documents the new custom metric intent_eval; adds expected_intent field usage, examples, and turn-level metrics guidance.
  • Config: metrics metadata (config/system.yaml): Registers custom:intent_eval in metrics_metadata.turn_level with threshold 1 and a description.
  • Prompts and exports (src/lightspeed_evaluation/core/metrics/custom/prompts.py, src/lightspeed_evaluation/core/metrics/custom/__init__.py): Adds the INTENT_EVALUATION_PROMPT template and exports it (and ANSWER_CORRECTNESS_PROMPT) via __all__.
  • Custom metric logic (src/lightspeed_evaluation/core/metrics/custom/custom.py): Adds intent_eval metric routing and the _evaluate_intent handler, which validates expected_intent, builds the prompt, calls the LLM, parses the binary score/reason, and handles errors.
  • Data model (src/lightspeed_evaluation/core/models/data.py): Adds optional expected_intent: Optional[str] to TurnData.
  • Validation rules (src/lightspeed_evaluation/core/system/validator.py): Extends METRIC_REQUIREMENTS with custom:intent_eval, requiring response and expected_intent.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Runner as Evaluator
  participant CM as CustomMetrics
  participant Handler as _evaluate_intent
  participant LLM as LLM
  participant Parser as Score Parser

  Runner->>CM: evaluate("custom:intent_eval", turn)
  CM->>Handler: _evaluate_intent(turn)
  alt Missing fields
    Handler-->>Runner: NotApplicable / Validation error
  else Valid turn
    Handler->>LLM: Prompt(query, response, expected_intent)
    LLM-->>Handler: Text output
    Handler->>Parser: Parse score (0/1) and reason
    Parser-->>Handler: score, reason or parse error
    Handler-->>Runner: Result (score, reason) or parse error message
  end

  note over Handler,LLM: Uses INTENT_EVALUATION_PROMPT
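
In code, the flow above amounts to: build a prompt from the turn fields, call the judge LLM, and parse a binary score. A minimal sketch under assumed names; the template wording, the call_llm stub, and the Score:/Reason: output labels are illustrative, not the project's actual implementation.

from typing import Optional, Tuple

# Minimal illustrative template; the real INTENT_EVALUATION_PROMPT is richer.
PROMPT_TEMPLATE = (
    "Question: {query}\n"
    "Response: {response}\n"
    "Expected intent: {expected_intent}\n"
    "Reply as 'Score: <0 or 1>' on one line and 'Reason: <text>' on the next."
)

def call_llm(prompt: str) -> str:
    """Stand-in for the judge-LLM call; the real evaluator goes through its LLM layer."""
    return "Score: 1\nReason: The response addresses the expected intent."

def evaluate_intent(
    query: str, response: Optional[str], expected_intent: Optional[str]
) -> Tuple[Optional[int], str]:
    """Return (score, reason); score is None when evaluation cannot be performed."""
    if not expected_intent:
        return None, "expected_intent is required for intent evaluation"

    prompt = PROMPT_TEMPLATE.format(
        query=query, response=response or "", expected_intent=expected_intent
    )
    output = call_llm(prompt)

    score: Optional[int] = None
    reason = ""
    for line in output.splitlines():
        lowered = line.lower()
        if lowered.startswith("score:"):
            token = line.split(":", 1)[1].strip()
            if token in {"0", "1"}:
                score = int(token)
        elif lowered.startswith("reason:"):
            reason = line.split(":", 1)[1].strip()

    if score is None:
        return None, f"Could not parse score from LLM response: {output[:100]}..."
    return score, reason

# Example: returns (1, 'The response addresses the expected intent.') with the stub above.
evaluate_intent(
    "How do I restart a pod?",
    "Delete it and let the controller recreate it.",
    "Explain how to restart a pod",
)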

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • VladimirKadlec
  • tisnik

Poem

A rabbit taps its metrics drum,
Intent aligned? The verdict: 1.
New prompt carrots, crisp and bright,
Turn by turn we gauge the light.
If parsing fails, we nibble slow—
But when it clicks, off we go! 🥕✨

Pre-merge checks

✅ Passed checks (3 passed)

  • Title Check ✅ Passed: The title “add intent eval” clearly summarizes the primary change of introducing intent evaluation into the system; it is concise, directly related to the main functionality added in this pull request, and avoids unnecessary detail while remaining understandable to teammates reviewing the commit history.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Description Check ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/metrics/custom/custom.py (1)

203-243: Handle None response safely and consider consistent return format.

The implementation follows the correct pattern for turn-level evaluation with proper validation and error handling. However, there are two considerations:

  1. Line 226: The code uses response=response which could be None. For consistency with _evaluate_answer_correctness (line 154), consider using response=response or "" to handle None values safely.

  2. Line 241: Returns just reason while _evaluate_answer_correctness returns a formatted string like f"Custom answer correctness: {score:.2f} - {reason}". For consistency, consider returning a formatted string, though the simpler format may be intentional for binary evaluation.

Apply this diff if you want to align with the answer_correctness pattern:

         prompt = INTENT_EVALUATION_PROMPT.format(
             query=query,
-            response=response,
+            response=response or "",
             expected_intent=expected_intent,
         )
 
         # Make LLM call and parse response
         try:
             llm_response = self._call_llm(prompt)
             score, reason = self._parse_score_response(llm_response)
 
             if score is None:
                 return (
                     None,
                     f"Could not parse score from LLM response: {llm_response[:100]}...",
                 )
 
-            return score, reason
+            return score, f"Intent evaluation: {'match' if score == 1 else 'no match'} - {reason}"
         except LLMError as e:
             return None, f"Intent evaluation failed: {str(e)}"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d4ab3dc and efa31e6.

📒 Files selected for processing (7)
  • README.md (5 hunks)
  • config/system.yaml (1 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/custom.py (3 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/prompts.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (1 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • README.md
  • config/system.yaml
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/metrics/custom/__init__.py (1)
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)
  • evaluate_tool_calls (10-34)
src/lightspeed_evaluation/core/metrics/custom/custom.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • TurnData (35-135)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • LLMError (24-25)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: mypy
🔇 Additional comments (10)
src/lightspeed_evaluation/core/models/data.py (1)

62-64: LGTM!

The expected_intent field is properly defined with appropriate validation (min_length=1) and follows the same pattern as other optional fields like expected_response. The field description clearly indicates its purpose for intent evaluation.
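
As a rough sketch of the field described here, assuming a Pydantic model (which the min_length and description validation suggests); the real TurnData carries many more fields.

from typing import Annotated, Optional

from pydantic import BaseModel, Field, ValidationError

class TurnData(BaseModel):
    """Reduced stand-in for the real TurnData; only the fields relevant here are shown."""

    query: str
    response: Optional[str] = None
    # Optional, but must be non-empty when provided (mirrors the min_length=1 check noted above).
    expected_intent: Optional[Annotated[str, Field(min_length=1)]] = None

TurnData(query="How do I restart a pod?", expected_intent="Explain how to restart a pod")  # ok
try:
    TurnData(query="How do I restart a pod?", expected_intent="")  # empty string is rejected
except ValidationError:
    print("empty expected_intent rejected")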

config/system.yaml (1)

73-75: LGTM!

The metric configuration is consistent with existing custom metrics. The binary threshold (1) correctly aligns with the intent evaluation's binary scoring model (0 or 1), and the description clearly communicates the metric's purpose.
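
For context on how a threshold like this is typically applied, a hedged sketch combining the binary scoring described here with the no-threshold default recorded in the learning noted earlier; the comparison used for explicit thresholds is an assumption.

from typing import Optional

def metric_passes(score: float, threshold: Optional[float]) -> bool:
    """Hedged pass/fail rule: with no threshold, any positive score passes (as described
    for custom:tool_eval in the recorded learning); with an explicit threshold the score
    must reach it (comparison assumed)."""
    return score > 0 if threshold is None else score >= threshold

metric_passes(1, 1)       # True  - intent matched, threshold 1
metric_passes(0, 1)       # False - intent not matched
metric_passes(0.5, None)  # True  - no threshold: any positive score passes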

src/lightspeed_evaluation/core/metrics/custom/prompts.py (2)

3-4: LGTM!

The pylint directive is appropriate given the nature of evaluation prompts, which require long strings for clear instructions.


24-46: LGTM!

The prompt is well-structured with clear evaluation criteria, helpful examples, and explicit binary scoring instructions. The format requirements ensure consistent LLM responses for parsing.
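
For readers without the file open, an illustrative template with the same ingredients (criteria, an example, and a strict output format); the placeholder names match the diff above, but the wording is not the project's actual prompt.

# Illustrative only; the real INTENT_EVALUATION_PROMPT wording differs.
INTENT_EVALUATION_PROMPT = """You are judging whether a response fulfils the user's expected intent.

Criteria:
- Score 1 if the response addresses the expected intent, even if phrased differently.
- Score 0 if the response ignores, contradicts, or only partially addresses the intent.

Example:
Expected intent: Explain how to restart a deployment
Response: "Run `oc rollout restart deployment/<name>`."
Score: 1

Now evaluate:
Question: {query}
Response: {response}
Expected intent: {expected_intent}

Respond exactly as:
Score: <0 or 1>
Reason: <one short sentence>
"""

prompt = INTENT_EVALUATION_PROMPT.format(
    query="How do I restart a deployment?",
    response="Use `oc rollout restart deployment/<name>`.",
    expected_intent="Explain how to restart a deployment",
)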

src/lightspeed_evaluation/core/metrics/custom/custom.py (2)

8-11: LGTM!

The import is properly structured and follows the existing pattern for importing prompts.


32-32: LGTM!

The new metric is correctly registered in the supported_metrics dictionary following the established pattern.

src/lightspeed_evaluation/core/system/validator.py (1)

47-50: LGTM!

The validation requirement is properly configured with the correct required fields ("response" and "expected_intent") and follows the established pattern for metric requirements. This ensures proper validation before evaluation is attempted.
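
A sketch of what this mapping amounts to; the structure and helper name are assumptions, and only the metric id and the two required field names come from the change itself.

# Hypothetical shape of the requirements table; only the entry for the new metric is shown.
METRIC_REQUIREMENTS = {
    "custom:intent_eval": ("response", "expected_intent"),
}

def missing_fields(metric: str, turn: dict) -> list[str]:
    """Return the required fields that are absent or empty for the given metric."""
    return [field for field in METRIC_REQUIREMENTS.get(metric, ()) if not turn.get(field)]

missing_fields("custom:intent_eval", {"response": "ok"})  # -> ["expected_intent"]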

README.md (2)

88-88: LGTM!

The metric is properly documented with a clear description and correct reference to the implementation file.


140-142: LGTM!

The documentation comprehensively covers the new metric across multiple sections:

  • Configuration examples show proper threshold and description
  • Data structure examples demonstrate expected_intent usage
  • Table entries accurately describe the field requirements
  • Usage examples clarify when the field is required

The documentation is clear, complete, and consistent with existing patterns.

Also applies to: 224-225, 230-230, 287-288, 298-299

src/lightspeed_evaluation/core/metrics/custom/__init__.py (1)

4-7: LGTM!

The new prompt constant is properly exposed in the public API following the established pattern. The import structure and __all__ list are correctly updated with clear organization.

Also applies to: 13-15

@asamal4
Collaborator Author

asamal4 commented Oct 10, 2025

@VladimirKadlec @tisnik PTAL

Contributor

@VladimirKadlec VladimirKadlec left a comment


Nice 💪, let's add a link to the Jira issue to the description.
LGTM

Contributor

@tisnik tisnik left a comment


LGTM

@tisnik tisnik merged commit f92850a into lightspeed-core:main Oct 10, 2025
15 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Nov 6, 2025