Add ability to evaluate the tool results #151
Conversation
Walkthrough
Adds optional result-field support for tool-call validation: configs and docs are updated, the parser and client capture tool_call results, the metric logic compares results via regex, the validator description is updated, and tests are extended to cover parsing, result comparison, and exports.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Parser as Streaming Parser
    participant Client as Client API
    participant Evaluator as ToolEval Metric
    Parser->>Client: parsed tool_call {tool_name, arguments, result?}
    Client->>Evaluator: formatted tool_call (includes result if present)
    Evaluator->>Evaluator: compare tool_name
    Evaluator->>Evaluator: compare arguments (regex)
    Evaluator->>Evaluator: if expected result -> compare result (regex)
    Evaluator->>Client: return match status / messages
```
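A minimal sketch of the comparison step shown in the diagram, assuming a regex-based check over the tool name, arguments, and optional result; the function and field names here are illustrative, not the actual metric code:

```python
import re
from typing import Any


def tool_call_matches(expected: dict[str, Any], actual: dict[str, Any]) -> bool:
    """Sketch: does the actual tool call satisfy the expected spec?

    Tool names must match exactly; argument values and the optional
    `result` field in the expected spec are treated as regex patterns.
    """
    if expected.get("tool_name") != actual.get("tool_name"):
        return False
    for arg, pattern in expected.get("arguments", {}).items():
        if not re.search(pattern, str(actual.get("arguments", {}).get(arg, ""))):
            return False
    # The result is only compared when the expected spec provides one, so
    # existing configs (name + arguments only) keep working unchanged.
    if "result" in expected:
        return bool(re.search(str(expected["result"]), str(actual.get("result", ""))))
    return True
```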
Thank you!!
Could you please add a brief note to the main README about this capability?
FYI: with more enhancements to tool_eval, the reason field in the final CSV report has become very generic for eval failure scenarios. I will create a story for the team to work on this.
Force-pushed from 597ab09 to 3891fe9
Sorry, bad timing. Yesterday we merged a PR enforcing lint checks on test cases (aligning with the lightspeed-stack repo), so the mypy check is failing now.
Force-pushed from 3891fe9 to 06e9542
Fixed.
Description
Adds support for checking tool call results, in addition to tool call names and arguments.
This is needed for aladdin evaluation in particular, because some of the MCP tools are non-deterministic, so their results can only be validated against a pattern rather than an exact value.
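For illustration only, an expected tool call in the eval data could carry an optional result pattern alongside the name and arguments; the field names below are assumptions rather than the exact config schema:

```python
# Hypothetical expected tool-call spec. The `result` field is optional and,
# when present, is matched as a regex against the actual tool output, which
# is what makes non-deterministic MCP tool responses checkable.
expected_tool_call = {
    "tool_name": "get_cluster_status",
    "arguments": {"namespace": "openshift-.*"},  # argument values as regex patterns
    "result": r"\"status\":\s*\"(healthy|degraded)\"",  # new optional result regex
}
```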
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
- New Features: optional result-field checking for tool calls, compared via regex.
- Documentation: configs and docs updated to describe the optional result field.
- Tests: extended to cover result parsing and result comparison.