Conversation
Co-authored-by: Itzik Ezra <iezra@redhat.com>
Walkthrough

Adds a new per-turn custom metric "intent_eval" for intent alignment. Introduces an INTENT_EVALUATION_PROMPT, handler logic, validation requirements, config metadata, and TurnData.expected_intent. Updates README to document the metric, examples, and usage.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Runner as Evaluator
    participant CM as CustomMetrics
    participant Handler as _evaluate_intent
    participant LLM as LLM
    participant Parser as Score Parser
    Runner->>CM: evaluate("custom:intent_eval", turn)
    CM->>Handler: _evaluate_intent(turn)
    alt Missing fields
        Handler-->>Runner: NotApplicable / Validation error
    else Valid turn
        Handler->>LLM: Prompt(query, response, expected_intent)
        LLM-->>Handler: Text output
        Handler->>Parser: Parse score (0/1) and reason
        Parser-->>Handler: score, reason or parse error
        Handler-->>Runner: Result (score, reason) or parse error message
    end
    note over Handler,LLM: Uses INTENT_EVALUATION_PROMPT
```
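The "Parse score (0/1) and reason" step in the diagram maps to the `_parse_score_response` helper referenced in the nitpick diff further down. Its actual body is not shown in this review; a minimal sketch, assuming the prompt asks the LLM to reply with `Score:` and `Reason:` lines, could look like this:

```python
import re
from typing import Optional, Tuple


def parse_score_response(llm_response: str) -> Tuple[Optional[int], str]:
    """Extract a binary score and a free-text reason from an LLM reply.

    Assumes the prompt requested output of the form:
        Score: <0 or 1>
        Reason: <short explanation>
    Returns (None, <raw text>) when no score is found, mirroring the
    "could not parse score" branch in the handler.
    """
    score_match = re.search(r"score\s*[:=]\s*([01])", llm_response, re.IGNORECASE)
    reason_match = re.search(r"reason\s*[:=]\s*(.+)", llm_response, re.IGNORECASE | re.DOTALL)

    score = int(score_match.group(1)) if score_match else None
    reason = reason_match.group(1).strip() if reason_match else llm_response.strip()
    return score, reason
```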
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/metrics/custom/custom.py (1)
203-243: Handle None response safely and consider consistent return format. The implementation follows the correct pattern for turn-level evaluation with proper validation and error handling. However, there are two considerations:

Line 226: The code uses `response=response`, which could be None. For consistency with `_evaluate_answer_correctness` (line 154), consider using `response=response or ""` to handle None values safely.

Line 241: Returns just `reason`, while `_evaluate_answer_correctness` returns a formatted string like `f"Custom answer correctness: {score:.2f} - {reason}"`. For consistency, consider returning a formatted string, though the simpler format may be intentional for binary evaluation.

Apply this diff if you want to align with the answer_correctness pattern:
```diff
         prompt = INTENT_EVALUATION_PROMPT.format(
             query=query,
-            response=response,
+            response=response or "",
             expected_intent=expected_intent,
         )

         # Make LLM call and parse response
         try:
             llm_response = self._call_llm(prompt)
             score, reason = self._parse_score_response(llm_response)
             if score is None:
                 return (
                     None,
                     f"Could not parse score from LLM response: {llm_response[:100]}...",
                 )
-            return score, reason
+            return score, f"Intent evaluation: {'match' if score == 1 else 'no match'} - {reason}"
         except LLMError as e:
             return None, f"Intent evaluation failed: {str(e)}"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- README.md (5 hunks)
- config/system.yaml (1 hunks)
- src/lightspeed_evaluation/core/metrics/custom/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/custom/custom.py (3 hunks)
- src/lightspeed_evaluation/core/metrics/custom/prompts.py (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (1 hunks)
- src/lightspeed_evaluation/core/system/validator.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
- README.md
- config/system.yaml
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/metrics/custom/__init__.py (1)
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)
evaluate_tool_calls(10-34)
src/lightspeed_evaluation/core/metrics/custom/custom.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
TurnData (35-135)
src/lightspeed_evaluation/core/system/exceptions.py (1)
LLMError(24-25)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.11)
- GitHub Check: mypy
🔇 Additional comments (10)
src/lightspeed_evaluation/core/models/data.py (1)
62-64: LGTM! The `expected_intent` field is properly defined with appropriate validation (min_length=1) and follows the same pattern as other optional fields like `expected_response`. The field description clearly indicates its purpose for intent evaluation.
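For reference, a minimal sketch of the field shape the review describes, assuming the same optional-string pattern as `expected_response` (the model is trimmed to the fields discussed here, and the description text is illustrative rather than the exact wording in data.py):

```python
from typing import Optional

from pydantic import BaseModel, Field


class TurnData(BaseModel):
    """Per-turn evaluation data (trimmed to the fields relevant here)."""

    query: str
    response: Optional[str] = None
    expected_response: Optional[str] = None
    # New in this PR: the intent the response is expected to express.
    expected_intent: Optional[str] = Field(
        default=None,
        min_length=1,
        description="Expected intent of the response, used by custom:intent_eval",
    )
```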
config/system.yaml (1)

73-75: LGTM! The metric configuration is consistent with existing custom metrics. The binary threshold (1) correctly aligns with the intent evaluation's binary scoring model (0 or 1), and the description clearly communicates the metric's purpose.
src/lightspeed_evaluation/core/metrics/custom/prompts.py (2)
3-4: LGTM! The pylint directive is appropriate given the nature of evaluation prompts, which require long strings for clear instructions.

24-46: LGTM! The prompt is well-structured with clear evaluation criteria, helpful examples, and explicit binary scoring instructions. The format requirements ensure consistent LLM responses for parsing.
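The actual prompt text lives in prompts.py; purely as an illustration of the structure the review describes (criteria, binary score, reason, fixed output format), and using the placeholders the handler passes in (`query`, `response`, `expected_intent`), a sketch might look like:

```python
# Illustrative only - the real INTENT_EVALUATION_PROMPT in prompts.py has its own wording.
INTENT_EVALUATION_PROMPT = """You are evaluating whether a response matches an expected intent.

Question: {query}
Response: {response}
Expected intent: {expected_intent}

Score 1 if the response addresses the expected intent, otherwise score 0.

Respond in exactly this format:
Score: <0 or 1>
Reason: <one short sentence>
"""
```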
src/lightspeed_evaluation/core/metrics/custom/custom.py (2)
8-11: LGTM! The import is properly structured and follows the existing pattern for importing prompts.

32-32: LGTM! The new metric is correctly registered in the supported_metrics dictionary following the established pattern.
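The registration line itself is not shown in this review; assuming supported_metrics maps bare metric names to bound handler methods (with the "custom:" prefix stripped before dispatch), the pattern is presumably along these lines:

```python
# Sketch of the dispatch pattern the review refers to; names other than
# _evaluate_intent and supported_metrics are assumptions for illustration.
class CustomMetrics:
    def __init__(self) -> None:
        self.supported_metrics = {
            "answer_correctness": self._evaluate_answer_correctness,
            "intent_eval": self._evaluate_intent,  # new entry added by this PR
        }

    def evaluate(self, metric_name, turn_data):
        # metric_name is assumed to arrive without the "custom:" prefix here
        handler = self.supported_metrics.get(metric_name)
        if handler is None:
            raise ValueError(f"Unsupported custom metric: {metric_name}")
        return handler(turn_data)

    def _evaluate_answer_correctness(self, turn_data): ...

    def _evaluate_intent(self, turn_data): ...
```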
src/lightspeed_evaluation/core/system/validator.py (1)
47-50: LGTM! The validation requirement is properly configured with the correct required fields ("response" and "expected_intent") and follows the established pattern for metric requirements. This ensures proper validation before evaluation is attempted.
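Presumably the validator keeps a per-metric map of required TurnData fields; a minimal sketch of that check, with the mapping shape and the answer_correctness entry assumed for illustration, is:

```python
from typing import Dict, List

# Required TurnData fields per metric. Only the intent_eval entry is taken from
# this review; the dict name and the answer_correctness entry are assumptions.
METRIC_REQUIREMENTS: Dict[str, List[str]] = {
    "custom:answer_correctness": ["response", "expected_response"],
    "custom:intent_eval": ["response", "expected_intent"],
}


def missing_fields_for_metric(metric: str, turn: dict) -> List[str]:
    """Return the required fields that are absent or empty for the given metric."""
    required = METRIC_REQUIREMENTS.get(metric, [])
    return [field for field in required if not turn.get(field)]
```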
README.md (2)
88-88: LGTM! The metric is properly documented with a clear description and correct reference to the implementation file.

140-142: LGTM! The documentation comprehensively covers the new metric across multiple sections:
- Configuration examples show proper threshold and description
- Data structure examples demonstrate expected_intent usage
- Table entries accurately describe the field requirements
- Usage examples clarify when the field is required
The documentation is clear, complete, and consistent with existing patterns.
Also applies to: 224-225, 230-230, 287-288, 298-299
src/lightspeed_evaluation/core/metrics/custom/__init__.py (1)
4-7: LGTM! The new prompt constant is properly exposed in the public API following the established pattern. The import structure and `__all__` list are correctly updated with clear organization.

Also applies to: 13-15
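In other words, the package surface presumably ends up roughly like the sketch below; INTENT_EVALUATION_PROMPT is the newly exposed constant, while evaluate_tool_calls is shown only as an example of an existing export and the full import list is not reproduced here:

```python
# Sketch of src/lightspeed_evaluation/core/metrics/custom/__init__.py (abridged).
from .prompts import INTENT_EVALUATION_PROMPT
from .tool_eval import evaluate_tool_calls

__all__ = [
    "INTENT_EVALUATION_PROMPT",
    "evaluate_tool_calls",
]
```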
@VladimirKadlec @tisnik PTAL
VladimirKadlec
left a comment
Nice 💪, let's add a link to the Jira issue to the description.
LGTM
Adding intent eval (feature parity with lsc-agent-eval)
Originally added with PR #46
LCORE-734
Summary by CodeRabbit
New Features
Documentation