
Add ability to evaluate the tool results#151

Merged
asamal4 merged 1 commit into lightspeed-core:main from bparees:eval_tool_results · Feb 3, 2026

Conversation


bparees (Contributor) commented on Jan 30, 2026

Description

Adds support for checking tool call results, in addition to tool call names and arguments.

This is needed for aladdin evaluation in particular, because some of the MCP tools are non-deterministic in nature.
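
For context, here is a minimal sketch of what an expected tool call with result checking might look like once the evaluation YAML is loaded. The `tool_name`, `arguments`, and optional `result` field names come from this change; the tool names, argument values, and surrounding structure are purely illustrative.

```python
# Hypothetical shape of an expected tool call after the evaluation config is parsed.
# Only the tool_name/arguments/result keys reflect this PR; everything else is an example.
expected_tool_calls = [
    {
        "tool_name": "get_cluster_status",      # hypothetical MCP tool name
        "arguments": {"namespace": "default"},  # compared via regex, as before
        "result": r'.*"healthy":\s*true.*',     # NEW: optional result, exact string or regex
    },
    {
        "tool_name": "list_pods",
        "arguments": {},
        # no "result" key: result validation is skipped for this call
    },
]
```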

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Generated by: Cursor

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Evaluation can now optionally validate tool call results (exact or regex) and includes a new example evaluation group demonstrating result-based checks.
    • Evaluation configs support optional setup and cleanup scripts for pre/post evaluation steps.
  • Documentation

    • Guides and README expanded with a "Result Validation" section, examples, and updated usage notes.
  • Tests

    • Increased test coverage for result validation, regex behavior, edge cases, and related exports.


coderabbitai bot commented Jan 30, 2026

Walkthrough

Adds optional result-field support for tool-call validation: configs and docs are updated; the parser and client capture tool_call results; the metric logic compares results via regex; the validator description is updated; and tests are extended for parsing, result comparison, and exports.
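
As a rough illustration of the parsing side described above, the sketch below shows how an optional result could be captured from a tool-call token and logged when present. The helper name, token structure, and logging details are assumptions, not the actual streaming_parser.py code.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def extract_tool_call(token: dict[str, Any]) -> dict[str, Any]:
    """Illustrative extraction of a tool call from a streamed token.

    The real parsing lives in core/api/streaming_parser.py; everything here
    other than the tool_name/arguments/result field names is an assumption.
    """
    tool_call: dict[str, Any] = {
        "tool_name": token.get("tool_name"),
        "arguments": token.get("arguments", {}),
    }
    result: Optional[Any] = token.get("result")
    if result is not None:
        # Per this PR: capture the optional result and log it when present.
        tool_call["result"] = result
        logger.debug("Captured tool call result: %s", result)
    return tool_call
```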

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Configuration**<br>`config/evaluation_data.yaml` | Added a `tool_eval_with_result` conversation group demonstrating explicit and regex-based result validation; inserted before `script_eval_example`; added surrounding `setup_script`/`cleanup_script` for `script_eval_example`. |
| **Documentation**<br>`docs/EVALUATION_GUIDE.md`, `README.md` | Documented optional result validation for `tool_eval`, added YAML examples, updated metric descriptions and usage notes. |
| **API / Parsing**<br>`src/lightspeed_evaluation/core/api/streaming_parser.py`, `src/lightspeed_evaluation/core/api/client.py` | Parser now extracts an optional result from tool-call tokens and logs it; client includes the result in the formatted tool_call dict when present; docstrings/logging updated. |
| **Metric Logic**<br>`src/lightspeed_evaluation/core/metrics/custom/tool_eval.py` | Added `_compare_tool_result(expected, actual)` and integrated regex-based result comparison into `_compare_single_tool_call`; logging for mismatches and invalid patterns (see the sketch after this table). |
| **Validator Text**<br>`src/lightspeed_evaluation/core/system/validator.py` | Updated the `custom:tool_eval` metric description to mention the optional result field. |
| **Tests**<br>`tests/unit/core/api/test_streaming_parser.py`, `tests/unit/core/metrics/custom/test_tool_eval.py` | Added parser tests for result presence/absence/None; extensive tests for `_compare_tool_result` (regex, edge cases) and end-to-end tool-eval scenarios; exported `_compare_tool_result`. |
| **Other**<br>`README.md` | Minor README updates and examples reflecting result validation usage. |
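
As referenced in the Metric Logic row above, here is a minimal sketch of what a regex-based `_compare_tool_result(expected, actual)` could look like, including logging for mismatches and invalid patterns. The function name and signature come from this PR; the body below is an assumption, not the merged implementation.

```python
import logging
import re

logger = logging.getLogger(__name__)


def _compare_tool_result(expected: str, actual: str) -> bool:
    """Sketch only: treat the expected value as a regex against the actual result.

    Mirrors the described behavior (regex comparison, logging for mismatches
    and invalid patterns); the merged tool_eval.py logic may differ in detail.
    """
    try:
        if re.fullmatch(expected, str(actual), flags=re.DOTALL):
            return True
        logger.debug("Tool result mismatch: expected %r, actual %r", expected, actual)
        return False
    except re.error as exc:
        # Invalid regex pattern: log it and fall back to an exact string match.
        logger.warning("Invalid result pattern %r (%s); using exact comparison", expected, exc)
        return expected == str(actual)
```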

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Parser as Streaming Parser
    participant Client as Client API
    participant Evaluator as ToolEval Metric

    Parser->>Client: parsed tool_call {tool_name, arguments, result?}
    Client->>Evaluator: formatted tool_call (includes result if present)
    Evaluator->>Evaluator: compare tool_name
    Evaluator->>Evaluator: compare arguments (regex)
    Evaluator->>Evaluator: if expected result -> compare result (regex)
    Evaluator->>Client: return match status / messages
```
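
To make the Client step in the diagram concrete, here is a small sketch of conditionally including the result in the formatted tool_call dict. The function name and exact dict layout are assumptions; only the "include result when present" behavior is taken from the PR.

```python
from typing import Any, Optional


def format_tool_call(
    tool_name: str, arguments: dict[str, Any], result: Optional[Any] = None
) -> dict[str, Any]:
    """Sketch of client-side formatting: include `result` only when it is present.

    The real formatting happens in core/api/client.py; this only illustrates
    the conditional-inclusion behavior described in the diagram above.
    """
    formatted: dict[str, Any] = {"tool_name": tool_name, "arguments": arguments}
    if result is not None:
        formatted["result"] = result
    return formatted
```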

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • VladimirKadlec
  • asamal4
  • tisnik
Pre-merge checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately summarizes the main change: adding support for evaluating tool results in addition to tool names and arguments. |
| Docstring coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%). |



asamal4 (Collaborator) left a comment


Thank you! Could you please add a brief note to the main README about this capability?

FYI: with more enhancements to tool_eval, the reason field in the final CSV report has become very generic (for eval failure scenarios). I will create a story for the team to work on this.


asamal4 (Collaborator) commented on Feb 3, 2026

Sorry, bad timing. Yesterday we merged a PR enforcing lint checks on test cases (aligning with the lightspeed-stack repo). Because of that, the mypy check is failing now.


bparees (Author) commented on Feb 3, 2026

> Sorry, bad timing. Yesterday we merged a PR enforcing lint checks on test cases (aligning with the lightspeed-stack repo). Because of that, the mypy check is failing now.

Fixed.

asamal4 merged commit 138e6e9 into lightspeed-core:main on Feb 3, 2026
15 checks passed
