
chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 #341


Open
wants to merge 47 commits into main

Conversation

@himanshusinghs himanshusinghs commented Jul 7, 2025

Motivation and Goal

  • Motivation: Consolidate MCP server tools to accommodate client soft limits, while mitigating the risk of confusing LLMs with potentially ambiguous tool schemas and descriptions.
  • Goal: Benchmark current tool understanding by LLMs to establish a baseline and prevent regression during consolidation.

Design Brief

  • Method: Provide LLMs (Gemini, Claude, ChatGPT, etc.) with prompts and current MCP server tool schemas.
  • Evaluation: Record actual tool calls made by LLMs and compare them against expected calls and parameters.
  • Reporting: Generate a readable summary of test runs highlighting the prompts, models, and achieved accuracy.
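The record-and-compare step could be sketched roughly as follows (a minimal illustration under assumed names; `ToolCall` and `scoreToolCalls` are hypothetical, and the PR's actual scorer is more nuanced):

```typescript
// Hypothetical sketch of tool-call accuracy scoring: compare the tool
// calls an LLM actually made against the expected calls for a prompt.
// The real scorer in this PR lives under tests/accuracy.
type ToolCall = {
  toolName: string;
  parameters: Record<string, unknown>;
};

// Returns a score in [0, 1]: full credit when an expected call was made
// with matching parameters, half credit (an assumed rule, for
// illustration) for a correct tool name with mismatched parameters.
function scoreToolCalls(expected: ToolCall[], actual: ToolCall[]): number {
  if (expected.length === 0) return actual.length === 0 ? 1 : 0;
  let score = 0;
  for (const exp of expected) {
    const match = actual.find((a) => a.toolName === exp.toolName);
    if (!match) continue; // expected tool was never called
    const paramsMatch =
      JSON.stringify(match.parameters) === JSON.stringify(exp.parameters);
    score += paramsMatch ? 1 : 0.5;
  }
  return score / expected.length;
}
```

A run's summary then simply aggregates these per-prompt scores per model.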

Detailed Design

Refer to the doc titled "MCP Tools Accuracy Testing".

Current State

  • Framework implemented and integrated for testing the MCP server and MongoDB tools.
  • Accuracy tests for core MongoDB tool calls are written.
  • The scoring algorithm is implemented and unit-tested.
  • Supports multiple LLM providers and models.
  • Snapshots are stored on disk by default, with the option to store them in a MongoDB deployment as well.
  • On each successful test run, a summary is generated highlighting the prompt, model, and accuracy of tool calls.
  • A GitHub workflow was added to trigger the test runs on a label and via manual dispatch. It also attaches the summary when triggered for a PR.
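The disk-versus-MongoDB snapshot storage mentioned above can be pictured behind a small interface (an illustrative sketch only; `SnapshotStorage`, `MemorySnapshotStorage`, and the field names are assumptions, not the PR's actual SDK):

```typescript
// Hypothetical storage abstraction for accuracy snapshots: disk by
// default, MongoDB as an alternative backend. All names are assumed
// for illustration; real backends would likely be async.
interface AccuracySnapshot {
  runId: string;
  prompt: string;
  model: string;
  accuracy: number;
}

interface SnapshotStorage {
  save(snapshot: AccuracySnapshot): void;
  list(runId: string): AccuracySnapshot[];
}

// In-memory stand-in, enough to show the contract a disk or MongoDB
// backend would fulfill.
class MemorySnapshotStorage implements SnapshotStorage {
  private snapshots: AccuracySnapshot[] = [];
  save(snapshot: AccuracySnapshot): void {
    this.snapshots.push(snapshot);
  }
  list(runId: string): AccuracySnapshot[] {
    return this.snapshots.filter((s) => s.runId === runId);
  }
}
```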

For reviewers

  • Please start by reviewing the test cases / prompts themselves.
  • Once done with the prompts, move on to the tool-calling accuracy scorer. I have added some docs and tests to help explain how it works.
  • Later you can review the rest of the accuracy SDK in the folder tests/accuracy/sdk. Start with describe-accuracy-test.ts, as this is where all the different parts come together, and dive into the specific implementation of each part afterwards.

Apologies for the big chunk to be reviewed here, but I did not see a way around it.

coveralls commented Jul 7, 2025

Pull Request Test Coverage Report for Build 16220767966

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 33 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.2%) to 75.27%

Files with Coverage Reduction      New Missed Lines    %
src/tools/mongodb/mongodbTool.ts   8                   73.58%
src/tools/tool.ts                  12                  76.32%
src/server.ts                      13                  74.74%

Totals Coverage Status
Change from base Build 16199582904: 0.2%
Covered Lines: 886
Relevant Lines: 1087

💛 - Coveralls

@himanshusinghs himanshusinghs force-pushed the chore/issue-307-proposal-2 branch 4 times, most recently from 58bc8a5 to b557e02 Compare July 10, 2025 08:53
@himanshusinghs himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from 7791e20 to 79cd26e Compare July 10, 2025 11:37
@himanshusinghs himanshusinghs changed the title chore(tests): accuracy tests for MongoDB tools exposed by MCP server chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 Jul 10, 2025
@himanshusinghs himanshusinghs marked this pull request as ready for review July 10, 2025 11:39
@himanshusinghs himanshusinghs requested a review from a team as a code owner July 10, 2025 11:39
LangChain's ToolCalling agent was not providing a structured tool call
response, and different model providers were producing entirely different
tool calls for the same tool definition, which was too turbulent for us
to establish any accuracy baseline at all.

Vercel's AI SDK moves us forward on that problem: the tool call
responses so far have always been well structured.

This commit replaces the LangChain-based implementation with one based
on Vercel's AI SDK.
When writing test cases, I realized that writing and maintaining mocks is too much duplicated effort. So instead of having only a mocked MCP client, this commit introduces a real MCP client that talks to our MCP server and is still mockable.

We now set up a real MCP client with test data in a MongoDB database spun up for the test suites. Mocking is still an option, but we will likely never need it.
Introduces the following required environment variables:
- MDB_ACCURACY_RUN_ID: The accuracy run id
- MDB_ACCURACY_MDB_URL: The connection string to mongodb instance where the snapshots will be stored
- MDB_ACCURACY_MDB_DB: The database for snapshots
- MDB_ACCURACY_MDB_COLLECTION: The collection for snapshots
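Reading that configuration might look like this (a sketch only; the defaults and the `readAccuracyConfig` helper are assumptions, and only the variable names come from the commit message):

```typescript
// Sketch: resolve the snapshot-storage configuration from the
// environment variables introduced by this commit. The fallbacks
// (generated run id, defaulting to disk storage when
// MDB_ACCURACY_MDB_URL is unset) are assumptions for illustration.
interface AccuracyStorageConfig {
  runId: string;
  mongodbUrl?: string;
  database: string;
  collection: string;
}

function readAccuracyConfig(
  env: Record<string, string | undefined>
): AccuracyStorageConfig {
  return {
    // Fall back to a generated id so local runs work without setup.
    runId: env.MDB_ACCURACY_RUN_ID ?? `local-${Date.now()}`,
    mongodbUrl: env.MDB_ACCURACY_MDB_URL, // undefined => store on disk
    database: env.MDB_ACCURACY_MDB_DB ?? "accuracy",
    collection: env.MDB_ACCURACY_MDB_COLLECTION ?? "snapshots",
  };
}
```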
The new field `accuracyRunStatus` guards against cases where jest fails
partway through, perhaps due to LLM rate-limit errors or something else,
leaving a partially saved accuracy run. With this field we can safely
look for the last runs where `accuracyRunStatus` is done and know we
have the complete state of the accuracy snapshot.
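When reading snapshots back, the guard amounts to something like this (a sketch; field and status names beyond `accuracyRunStatus`/done are assumptions):

```typescript
// Sketch: only treat an accuracy run as a usable baseline when its
// accuracyRunStatus is "done", so partially written runs (e.g. jest
// aborted by LLM rate-limit errors) are never compared against.
type AccuracyRun = {
  runId: string;
  createdAt: number; // epoch millis; field name is an assumption
  accuracyRunStatus: "in-progress" | "done" | "failed";
};

function findLatestCompleteRun(runs: AccuracyRun[]): AccuracyRun | undefined {
  return runs
    .filter((run) => run.accuracyRunStatus === "done")
    .sort((a, b) => b.createdAt - a.createdAt)[0];
}
```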
…nts.

1. Removes unnecessary suite description from tests
2. Removes the test suite name from the storage as well
3. Centralizes the constants used everywhere in the SDK
4. Adds clarifying comments and docs wherever necessary
5. Writes tests for the accuracy scorer
@himanshusinghs himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from 79cd26e to 6ccaa11 Compare July 10, 2025 15:43
name: Accuracy Tests

on:
workflow_dispatch:
Collaborator

Guessing we want to eventually also run them on merges to main, right?

Collaborator Author

Yea that's right - I will update this as well.

path: .accuracy/tests-summary.html
- name: Comment summary on PR
if: github.event_name == 'pull_request' && github.event.label.name == 'accuracy-tests'
uses: marocchino/sticky-pull-request-comment@v2
Collaborator

This needs to be a commit SHA.

"@eslint/js": "^9.30.1",
"@himanshusinghs/google": "^1.2.11",
Collaborator

Hm... do we want to move this under @mongodb-js?

Collaborator Author

I totally forgot about this. There's a good chance that they released a new version with my fixes. I will check and update.

Comment on lines +47 to +49
const total = tokensUsage.totalTokens || 0;
const prompt = tokensUsage.promptTokens || 0;
const completion = tokensUsage.completionTokens || 0;
Collaborator

[nit] Should we use N/A instead of 0 here? I guess we can assume there's no world in which we actually get a genuine 0; N/A is just a bit more expressive.
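One way to express that suggestion (an illustrative sketch only; `formatTokenCount` is a hypothetical helper, not code from the PR):

```typescript
// Hypothetical helper for the summary template: render "N/A" when the
// provider did not report a token count, instead of a misleading 0.
// Using nullish checks keeps a reported count of 0 distinguishable
// from a missing value, unlike the `|| 0` fallback.
function formatTokenCount(tokens: number | undefined | null): string {
  return tokens == null ? "N/A" : String(tokens);
}
```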

let comparisonClass = "accuracy-comparison";
let comparisonIcon = "";

if (snapshot.baseline.comparisonResult) {
Collaborator

[nit] Should we provide a default in case we don't have a comparison? E.g. ? for the icon?

Collaborator

Actually, this may not need to be optional, in which case we could remove the if-check.

Comment on lines +11 to +12
baselineAccuracy?: number;
comparisonResult?: "improved" | "regressed" | "same";
Collaborator

Why are these optional?

Collaborator Author

Answering for this and the comment above. There are two use-cases that this tackles:

  1. You may want to generate a test summary without bothering about the comparison against historic runs. Common if you're running the tests locally.
  2. The branch that you're targeting does not have any snapshots yet. (This will only happen once, for the first merge, but it is still a case.)

Collaborator

Right, but in that case, wouldn't the baseline field itself be undefined (line 16)? My expectation is that if baseline is set to something, the contents of the object should be required.

Collaborator Author

Ahh yea true. I must have forgotten to remove the optionals when I nested them under baseline. Will update this.
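The shape agreed on in this thread would look something like this (a sketch of the corrected type; the surrounding field names and the icon helper are assumptions):

```typescript
// Sketch: baseline itself stays optional (no historic snapshots yet,
// or a purely local run), but once present its fields are required.
type ComparisonResult = "improved" | "regressed" | "same";

interface SnapshotSummary {
  accuracy: number;
  baseline?: {
    baselineAccuracy: number; // required once baseline exists
    comparisonResult: ComparisonResult;
  };
}

// Also illustrates the "?" default icon suggested earlier in the review.
function comparisonIcon(summary: SnapshotSummary): string {
  if (!summary.baseline) return "?"; // no baseline to compare against
  const icons: Record<ComparisonResult, string> = {
    improved: "▲",
    regressed: "▼",
    same: "=",
  };
  return icons[summary.baseline.comparisonResult];
}
```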

himanshusinghs and others added 4 commits July 11, 2025 15:05
Co-authored-by: Nikola Irinchev <irinchev@me.com>
Co-authored-by: Nikola Irinchev <irinchev@me.com>
Co-authored-by: Nikola Irinchev <irinchev@me.com>
Co-authored-by: Nikola Irinchev <irinchev@me.com>
3 participants