fix: Do subset matching for full_match=false#145
fix: Do subset matching for full_match=false#145asamal4 merged 2 commits intolightspeed-core:mainfrom
Conversation
WalkthroughPartial matching semantics tightened: when Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)
121-137: Empty expected sequences can incorrectly pass partial/subset matches.
Withexpected == [],_compare_partialreturns(0, 0)and the current success logic returnsTrueeven whenactualcontains tool calls, which can mask real mismatches and emit the “No tool calls made” success message. Consider gating empty-expected success onactualbeing empty.🔧 Proposed fix
if resolved_mode == "partial": matched, total = _compare_partial(expected_normalized, actual_normalized) - # Partial match succeeds if at least one tool matched - success = matched > 0 or total == 0 + # Partial match succeeds if at least one tool matched. + # If nothing is expected, only succeed when nothing was actually called. + if total == 0: + success = len(actual_normalized) == 0 + else: + success = matched > 0 return { "success": success, "stats": {"matched": matched, "total": total, "unmatched": total - matched}, } if resolved_mode == "subset": matched, total = _compare_partial(expected_normalized, actual_normalized) # Subset mode requires ALL expected tools to be found - success = matched == total + if total == 0: + success = len(actual_normalized) == 0 + else: + success = matched == total return { "success": success, "stats": {"matched": matched, "total": total, "unmatched": total - matched}, }Based on learnings, empty alternatives represent skip scenarios and should not match non-empty actual tool calls.
🧹 Nitpick comments (2)
docs/EVALUATION_GUIDE.md (1)
785-812: Use “subset” consistently in example labels.
The comment “Contains match” could confuse readers now that the official term is “subset”.✏️ Suggested wording tweak
-# Contains match - all expected tools must be present +# Subset match - all expected tools must be presenttests/unit/core/metrics/custom/test_tool_eval.py (1)
539-824: Add a negative case for empty expected with non-empty actual.
To prevent skip-alternative false positives (especially under subset/partial), consider a test whereexpected = [[]](or empty alternative in a list) andactualcontains tool calls, and assert failure. This will lock in the intended behavior and protect the fix incompare_tool_calls.Based on learnings, empty alternatives represent skip scenarios and shouldn’t match non-empty actual tool calls.
|
I think this is a bug in recently implemented full_match=false logic.. My expectation was that everything given in expected tool calls should match with actual tool calls (with extra allowed). Certainly missed during review..
@saswatamcode WDYT ? Should we simply fix the partial logic to check everything matches in expected tool calls, rather than adding both partial & subset ? |
|
Right, so this did occur to me, but I didn't want to change the full_match behaviour outright, in case something was depending on it @asamal4 🙂 Thinking about it, the only way in which partial might differ from the subset case is, for something like a "one of" case, where an agent has multiple tools available that can give it the same info, and we want to verify it uses at least one of those. A real-life example would be an agent with Kube tools and Prometheus tools, which can get a count of the number of pods in an ns from both of those tools. Subset would mean matching all, whereas partial would help us match "one of". (Well, you could still maybe work around this by just having multiple cases) However, no strong opinions on this, I can modify this PR to change full_match false behavior to match subset if you want! |
|
@saswatamcode Thanks. I think we are very much aligned. |
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
|
@asamal4 have updated the PR |
Description
This commit fixes an issue where setting full_match=false led to partial matching. Instead full_match=false should lead to subset matching
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
Bug Fixes
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.