
Add ability to evaluate the tool results#151

Merged
asamal4 merged 1 commit into lightspeed-core:main from bparees:eval_tool_results · Feb 3, 2026

Conversation


bparees (Contributor) commented on Jan 30, 2026

Description

Adds support for checking tool call results, in addition to tool call names and arguments.

This is needed for aladdin evaluation in particular, because some of the MCP tools are non-deterministic in nature.
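
For context, here is a minimal sketch of what an expected tool call with result checking might look like once the evaluation YAML is loaded. The `tool_name`, `arguments`, and optional `result` field names come from this change; the tool names, argument values, and surrounding structure are purely illustrative.

```python
# Hypothetical shape of an expected tool call after the evaluation config is parsed.
# Only the tool_name/arguments/result keys reflect this PR; everything else is an example.
expected_tool_calls = [
    {
        "tool_name": "get_cluster_status",      # hypothetical MCP tool name
        "arguments": {"namespace": "default"},  # compared via regex, as before
        "result": r'.*"healthy":\s*true.*',     # NEW: optional result, exact string or regex
    },
    {
        "tool_name": "list_pods",
        "arguments": {},
        # no "result" key: result validation is skipped for this call
    },
]
```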

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Generated by: Cursor

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Evaluation can now optionally validate tool call results (exact or regex) and includes a new example evaluation group demonstrating result-based checks.
    • Evaluation configs support optional setup and cleanup scripts for pre/post evaluation steps.
  • Documentation

    • Guides and README expanded with a "Result Validation" section, examples, and updated usage notes.
  • Tests

    • Increased test coverage for result validation, regex behavior, edge cases, and related exports.


coderabbitai bot commented Jan 30, 2026

Walkthrough

Adds optional result-field support for tool-call validation: configs and docs are updated; the parser and client capture tool_call results; the metric logic compares results via regex; the validator description is updated; and tests are extended for parsing, result comparison, and exports.
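
As a rough illustration of the parsing side described above, the sketch below shows how an optional result could be captured from a tool-call token and logged when present. The helper name, token structure, and logging details are assumptions, not the actual streaming_parser.py code.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def extract_tool_call(token: dict[str, Any]) -> dict[str, Any]:
    """Illustrative extraction of a tool call from a streamed token.

    The real parsing lives in core/api/streaming_parser.py; everything here
    other than the tool_name/arguments/result field names is an assumption.
    """
    tool_call: dict[str, Any] = {
        "tool_name": token.get("tool_name"),
        "arguments": token.get("arguments", {}),
    }
    result: Optional[Any] = token.get("result")
    if result is not None:
        # Per this PR: capture the optional result and log it when present.
        tool_call["result"] = result
        logger.debug("Captured tool call result: %s", result)
    return tool_call
```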

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Configuration**<br>`config/evaluation_data.yaml` | Added a `tool_eval_with_result` conversation group demonstrating explicit and regex-based result validation; inserted before `script_eval_example`; added surrounding `setup_script`/`cleanup_script` for `script_eval_example`. |
| **Documentation**<br>`docs/EVALUATION_GUIDE.md`, `README.md` | Documented optional result validation for `tool_eval`, added YAML examples, updated metric descriptions and usage notes. |
| **API / Parsing**<br>`src/lightspeed_evaluation/core/api/streaming_parser.py`, `src/lightspeed_evaluation/core/api/client.py` | Parser now extracts an optional result from tool-call tokens and logs it; client includes the result in the formatted tool_call dict when present; docstrings/logging updated. |
| **Metric Logic**<br>`src/lightspeed_evaluation/core/metrics/custom/tool_eval.py` | Added `_compare_tool_result(expected, actual)` and integrated regex-based result comparison into `_compare_single_tool_call`; logging for mismatches and invalid patterns (see the sketch after this table). |
| **Validator Text**<br>`src/lightspeed_evaluation/core/system/validator.py` | Updated the `custom:tool_eval` metric description to mention the optional result field. |
| **Tests**<br>`tests/unit/core/api/test_streaming_parser.py`, `tests/unit/core/metrics/custom/test_tool_eval.py` | Added parser tests for result presence/absence/None; extensive tests for `_compare_tool_result` (regex, edge cases) and end-to-end tool-eval scenarios; exported `_compare_tool_result`. |
| **Other**<br>`README.md` | Minor README updates and examples reflecting result validation usage. |
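
As referenced in the Metric Logic row above, here is a minimal sketch of what a regex-based `_compare_tool_result(expected, actual)` could look like, including logging for mismatches and invalid patterns. The function name and signature come from this PR; the body below is an assumption, not the merged implementation.

```python
import logging
import re

logger = logging.getLogger(__name__)


def _compare_tool_result(expected: str, actual: str) -> bool:
    """Sketch only: treat the expected value as a regex against the actual result.

    Mirrors the described behavior (regex comparison, logging for mismatches
    and invalid patterns); the merged tool_eval.py logic may differ in detail.
    """
    try:
        if re.fullmatch(expected, str(actual), flags=re.DOTALL):
            return True
        logger.debug("Tool result mismatch: expected %r, actual %r", expected, actual)
        return False
    except re.error as exc:
        # Invalid regex pattern: log it and fall back to an exact string match.
        logger.warning("Invalid result pattern %r (%s); using exact comparison", expected, exc)
        return expected == str(actual)
```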

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Parser as Streaming Parser
    participant Client as Client API
    participant Evaluator as ToolEval Metric

    Parser->>Client: parsed tool_call {tool_name, arguments, result?}
    Client->>Evaluator: formatted tool_call (includes result if present)
    Evaluator->>Evaluator: compare tool_name
    Evaluator->>Evaluator: compare arguments (regex)
    Evaluator->>Evaluator: if expected result -> compare result (regex)
    Evaluator->>Client: return match status / messages
```
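
To make the Client step in the diagram concrete, here is a small sketch of conditionally including the result in the formatted tool_call dict. The function name and exact dict layout are assumptions; only the "include result when present" behavior is taken from the PR.

```python
from typing import Any, Optional


def format_tool_call(
    tool_name: str, arguments: dict[str, Any], result: Optional[Any] = None
) -> dict[str, Any]:
    """Sketch of client-side formatting: include `result` only when it is present.

    The real formatting happens in core/api/client.py; this only illustrates
    the conditional-inclusion behavior described in the diagram above.
    """
    formatted: dict[str, Any] = {"tool_name": tool_name, "arguments": arguments}
    if result is not None:
        formatted["result"] = result
    return formatted
```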

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • VladimirKadlec
  • asamal4
  • tisnik
Pre-merge checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately summarizes the main change: adding support for evaluating tool results in addition to tool names and arguments. |
| Docstring coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%). |



asamal4 (Collaborator) left a comment


Thank you! Could you please add a brief note to the main README about this capability?

FYI: with more enhancements to tool_eval, the reason field in the final CSV report has become very generic (for eval failure scenarios). I will create a story for the team to work on this.


asamal4 (Collaborator) commented on Feb 3, 2026

Sorry, bad timing. Yesterday we merged a PR enforcing lint checks on test cases (aligning with the lightspeed-stack repo). Because of that, the mypy check is failing now.


bparees (Author) commented on Feb 3, 2026

> Sorry, bad timing. Yesterday we merged a PR enforcing lint checks on test cases (aligning with the lightspeed-stack repo). Because of that, the mypy check is failing now.

Fixed.

asamal4 merged commit 138e6e9 into lightspeed-core:main on Feb 3, 2026
15 checks passed
