
chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 #341


Open
wants to merge 47 commits into main

Conversation

@himanshusinghs himanshusinghs commented Jul 7, 2025

Motivation and Goal

  • Motivation: Consolidate MCP server tools to accommodate client soft limits, while mitigating the risk of confusing LLMs with potentially ambiguous tool schemas and descriptions.
  • Goal: Benchmark current tool understanding by LLMs to establish a baseline and prevent regression during consolidation.

Design Brief

  • Method: Provide LLMs (Gemini, Claude, ChatGPT, etc.) with prompts and current MCP server tool schemas.
  • Evaluation: Record actual tool calls made by LLMs and compare them against expected calls and parameters.
  • Reporting: Generate a readable summary of test runs highlighting the prompts, models, and achieved accuracy.
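The record-and-compare step could be sketched roughly as follows (a minimal illustration under assumed names; `ToolCall` and `scoreToolCalls` are hypothetical, and the PR's actual scorer is more nuanced):

```typescript
// Hypothetical sketch of tool-call accuracy scoring: compare the tool
// calls an LLM actually made against the expected calls for a prompt.
// The real scorer in this PR lives under tests/accuracy.
type ToolCall = {
  toolName: string;
  parameters: Record<string, unknown>;
};

// Returns a score in [0, 1]: full credit when an expected call was made
// with matching parameters, half credit (an assumed rule, for
// illustration) for a correct tool name with mismatched parameters.
function scoreToolCalls(expected: ToolCall[], actual: ToolCall[]): number {
  if (expected.length === 0) return actual.length === 0 ? 1 : 0;
  let score = 0;
  for (const exp of expected) {
    const match = actual.find((a) => a.toolName === exp.toolName);
    if (!match) continue; // expected tool was never called
    const paramsMatch =
      JSON.stringify(match.parameters) === JSON.stringify(exp.parameters);
    score += paramsMatch ? 1 : 0.5;
  }
  return score / expected.length;
}
```

A run's summary then simply aggregates these per-prompt scores per model.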

Detailed Design

Refer to the doc titled "MCP Tools Accuracy Testing".

Current State

  • Framework implemented and integrated for testing the MCP server and MongoDB tools.
  • Accuracy tests for core MongoDB tool calls are written.
  • The scoring algorithm is implemented and unit-tested.
  • Supports multiple LLM providers and models.
  • Snapshots are stored on disk by default, with the option to store them in a MongoDB deployment as well.
  • On each successful test run, a summary is generated highlighting the prompt, model, and accuracy of tool calls.
  • A GitHub workflow was added to trigger the test runs on a label and via manual dispatch. It also attaches the summary when triggered for a PR.
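The disk-versus-MongoDB snapshot storage mentioned above can be pictured behind a small interface (an illustrative sketch only; `SnapshotStorage`, `MemorySnapshotStorage`, and the field names are assumptions, not the PR's actual SDK):

```typescript
// Hypothetical storage abstraction for accuracy snapshots: disk by
// default, MongoDB as an alternative backend. All names are assumed
// for illustration; real backends would likely be async.
interface AccuracySnapshot {
  runId: string;
  prompt: string;
  model: string;
  accuracy: number;
}

interface SnapshotStorage {
  save(snapshot: AccuracySnapshot): void;
  list(runId: string): AccuracySnapshot[];
}

// In-memory stand-in, enough to show the contract a disk or MongoDB
// backend would fulfill.
class MemorySnapshotStorage implements SnapshotStorage {
  private snapshots: AccuracySnapshot[] = [];
  save(snapshot: AccuracySnapshot): void {
    this.snapshots.push(snapshot);
  }
  list(runId: string): AccuracySnapshot[] {
    return this.snapshots.filter((s) => s.runId === runId);
  }
}
```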

For reviewers

  • Please start by reviewing the test cases / prompts themselves.
  • Once done with the prompts, move on to the tool-calling accuracy scorer. I have added some docs and tests to help explain how it works.
  • Later you can review the rest of the accuracy SDK in the folder tests/accuracy/sdk. Start with describe-accuracy-test.ts, as this is where all the different parts come together, and dive into the specific implementation of each part afterwards.

Apologies for the big chunk to be reviewed here, but I did not see a way around it.

coveralls commented Jul 7, 2025

Pull Request Test Coverage Report for Build 16220767966

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 33 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.2%) to 75.27%

Files with Coverage Reduction      New Missed Lines    %
src/tools/mongodb/mongodbTool.ts   8                   73.58%
src/tools/tool.ts                  12                  76.32%
src/server.ts                      13                  74.74%

Totals Coverage Status
Change from base Build 16199582904: 0.2%
Covered Lines: 886
Relevant Lines: 1087

💛 - Coveralls

@himanshusinghs himanshusinghs force-pushed the chore/issue-307-proposal-2 branch 4 times, most recently from 58bc8a5 to b557e02 Compare July 10, 2025 08:53
@himanshusinghs himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from 7791e20 to 79cd26e Compare July 10, 2025 11:37
@himanshusinghs himanshusinghs changed the title chore(tests): accuracy tests for MongoDB tools exposed by MCP server chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 Jul 10, 2025
@himanshusinghs himanshusinghs marked this pull request as ready for review July 10, 2025 11:39
@himanshusinghs himanshusinghs requested a review from a team as a code owner July 10, 2025 11:39
LangChain's ToolCalling agent was not providing a structured tool call
response, and different model providers were producing entirely different
tool calls for the same tool definition, which was too turbulent for us
to establish any accuracy baseline at all.

Vercel's AI SDK moves us forward on that problem: the tool call
responses so far have always been well structured.

This commit replaces the LangChain-based implementation with one based
on Vercel's AI SDK.
When writing test cases, I realized that writing and maintaining mocks is too much duplicated effort. So instead of having only a mocked MCP client, this commit introduces a real MCP client that talks to our MCP server and is still mockable.

We now set up a real MCP client with test data in a MongoDB database spun up for the test suites. Mocking is still an option, but we will likely never need it.
Introduces the following required environment variables:
- MDB_ACCURACY_RUN_ID: The accuracy run id
- MDB_ACCURACY_MDB_URL: The connection string to mongodb instance where the snapshots will be stored
- MDB_ACCURACY_MDB_DB: The database for snapshots
- MDB_ACCURACY_MDB_COLLECTION: The collection for snapshots
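Reading that configuration might look like this (a sketch only; the defaults and the `readAccuracyConfig` helper are assumptions, and only the variable names come from the commit message):

```typescript
// Sketch: resolve the snapshot-storage configuration from the
// environment variables introduced by this commit. The fallbacks
// (generated run id, defaulting to disk storage when
// MDB_ACCURACY_MDB_URL is unset) are assumptions for illustration.
interface AccuracyStorageConfig {
  runId: string;
  mongodbUrl?: string;
  database: string;
  collection: string;
}

function readAccuracyConfig(
  env: Record<string, string | undefined>
): AccuracyStorageConfig {
  return {
    // Fall back to a generated id so local runs work without setup.
    runId: env.MDB_ACCURACY_RUN_ID ?? `local-${Date.now()}`,
    mongodbUrl: env.MDB_ACCURACY_MDB_URL, // undefined => store on disk
    database: env.MDB_ACCURACY_MDB_DB ?? "accuracy",
    collection: env.MDB_ACCURACY_MDB_COLLECTION ?? "snapshots",
  };
}
```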
The new field `accuracyRunStatus` guards against cases where jest fails
partway through, perhaps due to LLM rate-limit errors or something else,
leaving a partially saved accuracy run. With this field we can safely
look for the last runs where `accuracyRunStatus` is done and know we
have the complete state of the accuracy snapshot.
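When reading snapshots back, the guard amounts to something like this (a sketch; field and status names beyond `accuracyRunStatus`/done are assumptions):

```typescript
// Sketch: only treat an accuracy run as a usable baseline when its
// accuracyRunStatus is "done", so partially written runs (e.g. jest
// aborted by LLM rate-limit errors) are never compared against.
type AccuracyRun = {
  runId: string;
  createdAt: number; // epoch millis; field name is an assumption
  accuracyRunStatus: "in-progress" | "done" | "failed";
};

function findLatestCompleteRun(runs: AccuracyRun[]): AccuracyRun | undefined {
  return runs
    .filter((run) => run.accuracyRunStatus === "done")
    .sort((a, b) => b.createdAt - a.createdAt)[0];
}
```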
…nts.

1. Removes unnecessary suite description from tests
2. Removes the test suite name from the storage as well
3. Centralizes the constants used everywhere in the SDK
4. Adds clarifying comments and docs wherever necessary
5. Writes tests for the accuracy scorer
@himanshusinghs himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from 79cd26e to 6ccaa11 Compare July 10, 2025 15:43
name: Accuracy Tests

on:
workflow_dispatch:
Collaborator

Guessing we want to eventually also run them on merges to main, right?

Collaborator Author

Yea that's right - I will update this as well.

path: .accuracy/tests-summary.html
- name: Comment summary on PR
if: github.event_name == 'pull_request' && github.event.label.name == 'accuracy-tests'
uses: marocchino/sticky-pull-request-comment@v2
Collaborator

This needs to be a commit SHA.

"@eslint/js": "^9.30.1",
"@himanshusinghs/google": "^1.2.11",
Collaborator

Hm... do we want to move this under @mongodb-js?

Collaborator Author

I totally forgot about this. There's a good chance that they released a new version with my fixes. I will check and update.

Comment on lines +47 to +49
const total = tokensUsage.totalTokens || 0;
const prompt = tokensUsage.promptTokens || 0;
const completion = tokensUsage.completionTokens || 0;
Collaborator

[nit] Should we use N/A instead of 0 here? I guess we can assume there's no world in which we actually get a genuine 0; N/A is just a bit more expressive.
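One way to express that suggestion (an illustrative sketch only; `formatTokenCount` is a hypothetical helper, not code from the PR):

```typescript
// Hypothetical helper for the summary template: render "N/A" when the
// provider did not report a token count, instead of a misleading 0.
// Using nullish checks keeps a reported count of 0 distinguishable
// from a missing value, unlike the `|| 0` fallback.
function formatTokenCount(tokens: number | undefined | null): string {
  return tokens == null ? "N/A" : String(tokens);
}
```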

let comparisonClass = "accuracy-comparison";
let comparisonIcon = "";

if (snapshot.baseline.comparisonResult) {
Collaborator

[nit] Should we provide a default in case we don't have a comparison? E.g. ? for the icon?

Collaborator

Actually, this may not need to be optional, in which case we could remove the if-check.

Comment on lines +11 to +12
baselineAccuracy?: number;
comparisonResult?: "improved" | "regressed" | "same";
Collaborator

Why are these optional?

Collaborator Author

Answering for this and the comment above. There are two use-cases that this tackles:

  1. You may want to generate a test summary without bothering about the comparison against historic runs. Common if you're running the tests locally.
  2. The branch that you're targeting does not have any snapshots yet. (This will only happen once, for the first merge, but it is still a case.)

Collaborator

Right, but in that case, wouldn't the baseline field itself be undefined (line 16)? My expectation is that if baseline is set to something, the contents of the object should be required.

Collaborator Author

Ahh yea true. I must have forgotten to remove the optionals when I nested them under baseline. Will update this.
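The shape agreed on in this thread would look something like this (a sketch of the corrected type; the surrounding field names and the icon helper are assumptions):

```typescript
// Sketch: baseline itself stays optional (no historic snapshots yet,
// or a purely local run), but once present its fields are required.
type ComparisonResult = "improved" | "regressed" | "same";

interface SnapshotSummary {
  accuracy: number;
  baseline?: {
    baselineAccuracy: number; // required once baseline exists
    comparisonResult: ComparisonResult;
  };
}

// Also illustrates the "?" default icon suggested earlier in the review.
function comparisonIcon(summary: SnapshotSummary): string {
  if (!summary.baseline) return "?"; // no baseline to compare against
  const icons: Record<ComparisonResult, string> = {
    improved: "▲",
    regressed: "▼",
    same: "=",
  };
  return icons[summary.baseline.comparisonResult];
}
```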

himanshusinghs and others added 4 commits July 11, 2025 15:05
Co-authored-by: Nikola Irinchev <irinchev@me.com>
Co-authored-by: Nikola Irinchev <irinchev@me.com>
Co-authored-by: Nikola Irinchev <irinchev@me.com>
Co-authored-by: Nikola Irinchev <irinchev@me.com>
3 participants