fix: Do subset matching for full_match=false by saswatamcode · Pull Request #145 · lightspeed-core/lightspeed-evaluation

saswatamcode · 2026-01-25T22:56:00Z

Description

This commit fixes an issue where setting full_match=false led to partial matching. Instead full_match=false should lead to subset matching

Type of change

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

Assisted-by: Claude Sonnet 4.5
Generated by: Claude Code

Related Tickets & Documents

Related Issue #
Closes #

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

Bug Fixes
- Partial-match behavior changed: when partial matching is enabled, all expected tools must now be present (extras allowed); previous greedy/any-match behavior removed.
Documentation
- Clarified configuration comment and evaluation guidance to reflect updated partial/subset matching semantics.
Tests
- Unit tests updated to require all expected tools be found under partial matching and to simplify failure assertions.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2026-01-25T22:56:17Z

Walkthrough

Partial matching semantics tightened: when full_match=False, partial mode now requires every expected tool call to be present within actual calls (extras allowed). Docstrings and tests updated to reflect this stricter subset behavior.

Changes

Cohort / File(s)	Summary
Documentation `docs/EVALUATION_GUIDE.md`	Clarified inline comment for `full_match` to state subset semantics: all expected must be present (extras allowed) when false.
Core implementation `src/lightspeed_evaluation/core/metrics/custom/tool_eval.py`	Docstring updates and logic change in `_compare_partial` so partial mode requires `matched == total` (complete coverage of expected) instead of `matched > 0`.
Tests `tests/unit/core/metrics/custom/test_tool_eval.py`	Updated/renamed partial-match tests to require all expected tools be found as a subset of actual (allowing extras); adjusted assertions accordingly.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Ability to set alternate tool calls for eval #90 — Modifies tool_eval.py matching logic; intersects with partial/full matching semantics.
[LEADS-153] Adding the ordered matching logic in tool eval #136 — Related changes touching evaluate_tool_calls/compare_tool_calls flows and ordered/unordered matching concerns.
[LEADS-153] Implement match logic (full/partial) #137 — Directly related; adjusts _compare_partial/matching semantics to require full coverage of expected items.

Suggested reviewers

asamal4
VladimirKadlec

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely describes the primary change: modifying the full_match=false behavior to enforce subset matching instead of partial matching.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)

121-137: Empty expected sequences can incorrectly pass partial/subset matches.
With expected == [], _compare_partial returns (0, 0) and the current success logic returns True even when actual contains tool calls, which can mask real mismatches and emit the “No tool calls made” success message. Consider gating empty-expected success on actual being empty.

🔧 Proposed fix

     if resolved_mode == "partial":
         matched, total = _compare_partial(expected_normalized, actual_normalized)
-        # Partial match succeeds if at least one tool matched
-        success = matched > 0 or total == 0
+        # Partial match succeeds if at least one tool matched.
+        # If nothing is expected, only succeed when nothing was actually called.
+        if total == 0:
+            success = len(actual_normalized) == 0
+        else:
+            success = matched > 0
         return {
             "success": success,
             "stats": {"matched": matched, "total": total, "unmatched": total - matched},
         }

     if resolved_mode == "subset":
         matched, total = _compare_partial(expected_normalized, actual_normalized)
         # Subset mode requires ALL expected tools to be found
-        success = matched == total
+        if total == 0:
+            success = len(actual_normalized) == 0
+        else:
+            success = matched == total
         return {
             "success": success,
             "stats": {"matched": matched, "total": total, "unmatched": total - matched},
         }

Based on learnings, empty alternatives represent skip scenarios and should not match non-empty actual tool calls.

🧹 Nitpick comments (2)

docs/EVALUATION_GUIDE.md (1)
785-812: Use “subset” consistently in example labels.
The comment “Contains match” could confuse readers now that the official term is “subset”.
✏️ Suggested wording tweak
-# Contains match - all expected tools must be present
+# Subset match - all expected tools must be present
tests/unit/core/metrics/custom/test_tool_eval.py (1)

539-824: Add a negative case for empty expected with non-empty actual.
To prevent skip-alternative false positives (especially under subset/partial), consider a test where expected = [[]] (or empty alternative in a list) and actual contains tool calls, and assert failure. This will lock in the intended behavior and protect the fix in compare_tool_calls.

Based on learnings, empty alternatives represent skip scenarios and shouldn’t match non-empty actual tool calls.

asamal4 · 2026-01-26T01:38:26Z

I think this is a bug in recently implemented full_match=false logic.. My expectation was that everything given in expected tool calls should match with actual tool calls (with extra allowed). Certainly missed during review..
This PR is fixing that. Thank you..
But IMO keeping both partial & subset don't add any value, for below reasons.

If we only care about 1 tool call then we can mention just that tool in expected tool calls.. (same as partial & subset)
Not sure about a scenario, where we set multiple different tool calls as expected but okay if only one of those is called. The Agent can take different paths to achieve the task, but if we are okay with any one out of several tools, to me it sounds like redundancy in tool definition.
Even if someone wants to do above, currently we have ability to provide multiple alternative paths (alternative tool calls)

@saswatamcode WDYT ? Should we simply fix the partial logic to check everything matches in expected tool calls, rather than adding both partial & subset ?

saswatamcode · 2026-01-26T08:42:55Z

Right, so this did occur to me, but I didn't want to change the full_match behaviour outright, in case something was depending on it @asamal4 🙂

Thinking about it, the only way in which partial might differ from the subset case is, for something like a "one of" case, where an agent has multiple tools available that can give it the same info, and we want to verify it uses at least one of those.

A real-life example would be an agent with Kube tools and Prometheus tools, which can get a count of the number of pods in an ns from both of those tools. Subset would mean matching all, whereas partial would help us match "one of". (Well, you could still maybe work around this by just having multiple cases)

However, no strong opinions on this, I can modify this PR to change full_match false behavior to match subset if you want!

asamal4 · 2026-01-26T13:56:43Z

@saswatamcode Thanks. I think we are very much aligned.
I would request to fix the full_match=false to check matched count == expected tool calls count. This is what we were supposed to do originally.
As far as impact is concerned, we have recently added this feature (with a bug) and not yet released officially.

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

saswatamcode · 2026-01-26T15:07:28Z

@asamal4 have updated the PR

VladimirKadlec

LGTM, thank you.

asamal4

Thank you !! LGTM..

saswatamcode mentioned this pull request Jan 25, 2026

Add more tools and descriptions, alongwith evals rhobs/obs-mcp#15

Merged

coderabbitai bot reviewed Jan 25, 2026

View reviewed changes

fix: Do subset matching on full_match=false

c46deef

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

saswatamcode force-pushed the subset branch from b751825 to c46deef Compare January 26, 2026 15:01

saswatamcode changed the title ~~feat: Add match_mode with subset matching~~ fix: Do subset matching of full_match=false Jan 26, 2026

saswatamcode changed the title ~~fix: Do subset matching of full_match=false~~ fix: Do subset matching for full_match=false Jan 26, 2026

Add success testcase

15cdd1a

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

VladimirKadlec approved these changes Jan 27, 2026

View reviewed changes

asamal4 approved these changes Jan 27, 2026

View reviewed changes

asamal4 merged commit debf62f into lightspeed-core:main Jan 27, 2026
15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Do subset matching for full_match=false#145

fix: Do subset matching for full_match=false#145
asamal4 merged 2 commits intolightspeed-core:mainfrom
saswatamcode:subset

saswatamcode commented Jan 25, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Jan 25, 2026 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

asamal4 commented Jan 26, 2026 •

edited

Loading

Uh oh!

saswatamcode commented Jan 26, 2026

Uh oh!

asamal4 commented Jan 26, 2026

Uh oh!

saswatamcode commented Jan 26, 2026

Uh oh!

VladimirKadlec left a comment

Uh oh!

asamal4 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

saswatamcode commented Jan 25, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Tools used to create PR

Related Tickets & Documents

Checklist before requesting a review

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Jan 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

asamal4 commented Jan 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

saswatamcode commented Jan 26, 2026

Uh oh!

asamal4 commented Jan 26, 2026

Uh oh!

saswatamcode commented Jan 26, 2026

Uh oh!

VladimirKadlec left a comment

Choose a reason for hiding this comment

Uh oh!

asamal4 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

saswatamcode commented Jan 25, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 25, 2026 •

edited

Loading

asamal4 commented Jan 26, 2026 •

edited

Loading