Skip to content

fix: Do subset matching for full_match=false#145

Merged
asamal4 merged 2 commits intolightspeed-core:mainfrom
saswatamcode:subset
Jan 27, 2026
Merged

fix: Do subset matching for full_match=false#145
asamal4 merged 2 commits intolightspeed-core:mainfrom
saswatamcode:subset

Conversation

@saswatamcode
Copy link
Contributor

@saswatamcode saswatamcode commented Jan 25, 2026

Description

This commit fixes an issue where setting full_match=false led to partial matching. Instead full_match=false should lead to subset matching

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude Sonnet 4.5
  • Generated by: Claude Code

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Bug Fixes

    • Partial-match behavior changed: when partial matching is enabled, all expected tools must now be present (extras allowed); previous greedy/any-match behavior removed.
  • Documentation

    • Clarified configuration comment and evaluation guidance to reflect updated partial/subset matching semantics.
  • Tests

    • Unit tests updated to require all expected tools be found under partial matching and to simplify failure assertions.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 25, 2026

Walkthrough

Partial matching semantics tightened: when full_match=False, partial mode now requires every expected tool call to be present within actual calls (extras allowed). Docstrings and tests updated to reflect this stricter subset behavior.

Changes

Cohort / File(s) Summary
Documentation
docs/EVALUATION_GUIDE.md
Clarified inline comment for full_match to state subset semantics: all expected must be present (extras allowed) when false.
Core implementation
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
Docstring updates and logic change in _compare_partial so partial mode requires matched == total (complete coverage of expected) instead of matched > 0.
Tests
tests/unit/core/metrics/custom/test_tool_eval.py
Updated/renamed partial-match tests to require all expected tools be found as a subset of actual (allowing extras); adjusted assertions accordingly.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • asamal4
  • VladimirKadlec
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the primary change: modifying the full_match=false behavior to enforce subset matching instead of partial matching.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)

121-137: Empty expected sequences can incorrectly pass partial/subset matches.
With expected == [], _compare_partial returns (0, 0) and the current success logic returns True even when actual contains tool calls, which can mask real mismatches and emit the “No tool calls made” success message. Consider gating empty-expected success on actual being empty.

🔧 Proposed fix
     if resolved_mode == "partial":
         matched, total = _compare_partial(expected_normalized, actual_normalized)
-        # Partial match succeeds if at least one tool matched
-        success = matched > 0 or total == 0
+        # Partial match succeeds if at least one tool matched.
+        # If nothing is expected, only succeed when nothing was actually called.
+        if total == 0:
+            success = len(actual_normalized) == 0
+        else:
+            success = matched > 0
         return {
             "success": success,
             "stats": {"matched": matched, "total": total, "unmatched": total - matched},
         }

     if resolved_mode == "subset":
         matched, total = _compare_partial(expected_normalized, actual_normalized)
         # Subset mode requires ALL expected tools to be found
-        success = matched == total
+        if total == 0:
+            success = len(actual_normalized) == 0
+        else:
+            success = matched == total
         return {
             "success": success,
             "stats": {"matched": matched, "total": total, "unmatched": total - matched},
         }

Based on learnings, empty alternatives represent skip scenarios and should not match non-empty actual tool calls.

🧹 Nitpick comments (2)
docs/EVALUATION_GUIDE.md (1)

785-812: Use “subset” consistently in example labels.
The comment “Contains match” could confuse readers now that the official term is “subset”.

✏️ Suggested wording tweak
-# Contains match - all expected tools must be present
+# Subset match - all expected tools must be present
tests/unit/core/metrics/custom/test_tool_eval.py (1)

539-824: Add a negative case for empty expected with non-empty actual.
To prevent skip-alternative false positives (especially under subset/partial), consider a test where expected = [[]] (or empty alternative in a list) and actual contains tool calls, and assert failure. This will lock in the intended behavior and protect the fix in compare_tool_calls.

Based on learnings, empty alternatives represent skip scenarios and shouldn’t match non-empty actual tool calls.

@asamal4
Copy link
Collaborator

asamal4 commented Jan 26, 2026

I think this is a bug in recently implemented full_match=false logic.. My expectation was that everything given in expected tool calls should match with actual tool calls (with extra allowed). Certainly missed during review..
This PR is fixing that. Thank you..
But IMO keeping both partial & subset don't add any value, for below reasons.

  • If we only care about 1 tool call then we can mention just that tool in expected tool calls.. (same as partial & subset)
  • Not sure about a scenario, where we set multiple different tool calls as expected but okay if only one of those is called. The Agent can take different paths to achieve the task, but if we are okay with any one out of several tools, to me it sounds like redundancy in tool definition.
  • Even if someone wants to do above, currently we have ability to provide multiple alternative paths (alternative tool calls)

@saswatamcode WDYT ? Should we simply fix the partial logic to check everything matches in expected tool calls, rather than adding both partial & subset ?

@saswatamcode
Copy link
Contributor Author

Right, so this did occur to me, but I didn't want to change the full_match behaviour outright, in case something was depending on it @asamal4 🙂

Thinking about it, the only way in which partial might differ from the subset case is, for something like a "one of" case, where an agent has multiple tools available that can give it the same info, and we want to verify it uses at least one of those.

A real-life example would be an agent with Kube tools and Prometheus tools, which can get a count of the number of pods in an ns from both of those tools. Subset would mean matching all, whereas partial would help us match "one of". (Well, you could still maybe work around this by just having multiple cases)

However, no strong opinions on this, I can modify this PR to change full_match false behavior to match subset if you want!

@asamal4
Copy link
Collaborator

asamal4 commented Jan 26, 2026

@saswatamcode Thanks. I think we are very much aligned.
I would request to fix the full_match=false to check matched count == expected tool calls count. This is what we were supposed to do originally.
As far as impact is concerned, we have recently added this feature (with a bug) and not yet released officially.

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
@saswatamcode saswatamcode changed the title feat: Add match_mode with subset matching fix: Do subset matching of full_match=false Jan 26, 2026
@saswatamcode saswatamcode changed the title fix: Do subset matching of full_match=false fix: Do subset matching for full_match=false Jan 26, 2026
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
@saswatamcode
Copy link
Contributor Author

@asamal4 have updated the PR

Copy link
Contributor

@VladimirKadlec VladimirKadlec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you.

Copy link
Collaborator

@asamal4 asamal4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you !! LGTM..

@asamal4 asamal4 merged commit debf62f into lightspeed-core:main Jan 27, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants