-
Notifications
You must be signed in to change notification settings - Fork 72
chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 #341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Pull Request Test Coverage Report for Build 16220767966Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
58bc8a5
to
b557e02
Compare
7791e20
to
79cd26e
Compare
LangChain's ToolCalling agent was not providing a structured tool call response and different model providers were providing entirely different tool calls for the same tool definition which was too turbulent for us to have any accuracy baseline at all. Vercel's AI SDK pushes us forward on that problem and the tool call responses so far have always been well structured. This commit replaces LangChain based implementation with Vercel's AI SDK based implementation.
When writing test cases, I realized that it is too much duplicated effort to write and maintain mocks. So instead of having only a mocked mcp client, this commit introduces a real mcp client that talks to our mcp server and is still mockable. We are now setting up real MCP client with test data in mongodb database spun up for test suites. Mocking is still an option but we likely never feel the need for that.
introduces the following necessary env variables: - MDB_ACCURACY_RUN_ID: The accuracy run id - MDB_ACCURACY_MDB_URL: The connection string to mongodb instance where the snapshots will be stored - MDB_ACCURACY_MDB_DB: The database for snapshots - MDB_ACCURACY_MDB_COLLECTION: The collection for snapshots
The new field `accuracyRunStatus` is supposed to help guard against cases where jest might fail in between, maybe due to LLM rate limit errors or something else, and we then have a partially saved state of an accuracy run. With the new field `accuracyRunStatus` we should be able to safely look for last runs where `accuracyRunStatus` is done and have complete state of accuracy snapshot.
…accuracyRunStatus
…racy-status script
…nts. 1. Removes unnecessary suite description from tests 2. Removes the test suite name from the storage as well 3. Centralize the constants used everywhere in the SDK 4. Adds clarifying comments and docs wherever necessary 5. Write tests for accuracy-scorer
79cd26e
to
6ccaa11
Compare
name: Accuracy Tests | ||
|
||
on: | ||
workflow_dispatch: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guessing we want to eventually also run them on merges to main, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that's right - I will update this as well.
path: .accuracy/tests-summary.html | ||
- name: Comment summary on PR | ||
if: github.event_name == 'pull_request' && github.event.label.name == 'accuracy-tests' | ||
uses: marocchino/sticky-pull-request-comment@v2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be a commit sha
"@eslint/js": "^9.30.1", | ||
"@himanshusinghs/google": "^1.2.11", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm... do we want to move this under @mongodb-js
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally forgot about this. There's a good chance that they released a new version with my fixes. I will check and update.
const total = tokensUsage.totalTokens || 0; | ||
const prompt = tokensUsage.promptTokens || 0; | ||
const completion = tokensUsage.completionTokens || 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] Should we use N/A instead of 0 here? I guess we can assume that there's no world in which we actually use 0, just probably a bit more expressive.
let comparisonClass = "accuracy-comparison"; | ||
let comparisonIcon = ""; | ||
|
||
if (snapshot.baseline.comparisonResult) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] should we provide a default in case we don't have a comparison? E.g. ?
for the icon?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this may not need to be optional, in which case we could remove the if-check.
baselineAccuracy?: number; | ||
comparisonResult?: "improved" | "regressed" | "same"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Answering for this and the comment above:
There are two use-cases that this tackles:
- You may want to generate test summary and don't really bother about the comparison from historic runs. Common if you're running the tests locally.
- The branch that you're targeting does not have any snapshots just yet. (this is only gonna happen once for the first merge but still a case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but in that case, wouldn't the baseline field itself be undefined (line 16). My expectation is that if baseline
is set to something, the contents of the object should be required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh yea true. I must have forgotten to remove the optionals when I nested them under baseline. Will update this.
Co-authored-by: Nikola Irinchev <irinchev@me.com>
Co-authored-by: Nikola Irinchev <irinchev@me.com>
Co-authored-by: Nikola Irinchev <irinchev@me.com>
Co-authored-by: Nikola Irinchev <irinchev@me.com>
Motivation and Goal
Design Brief
Detailed Design
Refer to the doc titled -
MCP Tools Accuracy Testing
Current State
For reviewers
tests/accuracy/sdk
. Start withdescribe-accuracy-test.ts
as this is where all the different parts come together and dive further into specific implementation of each parts afterwards.Apologies for the big chunk to be reviewed here but I did not see a way around it.