feat: add schema tools #10
Add 2 MCP tools for querying schema versions and blame.
- get_schema_versions()
- get_schema_blame()
|
As with #7, my main worry here is that every new tool consumes additional tokens on every request. The more tools we have, the more likely it is that the LLM gets confused or doesn't call our other tools when it should. So I'd like to think about what we can do to reduce the number of tools while keeping our responses simple.
How big are the responses of getSchemaVersionList in number of tokens? I'm wondering if it might make sense to just bundle that into the standard get_entity.
The other, secondary worry is around real-world testing. What MCP clients / LLMs have you tried this with so far? Do you have some (non-sensitive) screenshots of the request -> response? I particularly want to understand if the tool descriptions are sufficiently clear. |
|
@hsheth2 Thank you for your feedback and I totally agree with your concern. I’ll address your questions in three separate comments. |
Measured with the OpenAI Tokenizer (gpt-4o), the GraphQL getSchemaVersionList response is 233 tokens at minimum and 324 tokens with one additional version.
Request
curl --location 'http://datahub.classting.net:8080/api/graphql' \
--header 'Content-Type: application/json' \
--header 'Authorization: ••••••' \
--data '{
"operationName": "getSchemaVersionList",
"variables": {
"input": {
"datasetUrn": "urn:li:dataset:(urn:li:dataPlatform:athena,salesmap.deals,PROD)"
}
},
"query": "query getSchemaVersionList($input: GetSchemaVersionListInput!) {\n getSchemaVersionList(input: $input) {\n latestVersion {\n semanticVersion\n semanticVersionTimestamp\n versionStamp\n __typename\n }\n semanticVersionList {\n semanticVersion\n semanticVersionTimestamp\n versionStamp\n __typename\n }\n __typename\n }\n}\n"
}'
Minimum (only one version): 233 tokens
{
"data": {
"getSchemaVersionList": {
"latestVersion": {
"semanticVersion": "0.0.0",
"semanticVersionTimestamp": 1731464492834,
"versionStamp": "browsePathsV2:0;container:0;dataPlatformInstance:0;datasetKey:0;datasetProperties:1;schemaMetadata:1;status:0;subTypes:0;upstreamLineage:1",
"__typename": "SemanticVersionStruct"
},
"semanticVersionList": [
{
"semanticVersion": "0.0.0",
"semanticVersionTimestamp": 1731464492834,
"versionStamp": "browsePathsV2:0;container:0;dataPlatformInstance:0;datasetKey:0;datasetProperties:1;schemaMetadata:1;status:0;subTypes:0;upstreamLineage:1",
"__typename": "SemanticVersionStruct"
}
],
"__typename": "GetSchemaVersionListResult"
}
},
"extensions": {}
}
One version added: 324 tokens
{
"data": {
"getSchemaVersionList": {
"latestVersion": {
"semanticVersion": "0.0.0",
"semanticVersionTimestamp": 1731464492834,
"versionStamp": "browsePathsV2:0;container:0;dataPlatformInstance:0;datasetKey:0;datasetProperties:1;schemaMetadata:1;status:0;subTypes:0;upstreamLineage:1",
"__typename": "SemanticVersionStruct"
},
"semanticVersionList": [
{
"semanticVersion": "0.1.0",
"semanticVersionTimestamp": 1734662005844,
"versionStamp": "browsePathsV2:0;container:0;dataPlatformInstance:0;datasetKey:0;datasetProperties:0;schemaMetadata:2;status:0;subTypes:0;upstreamLineage:2",
"__typename": "SemanticVersionStruct"
},
{
"semanticVersion": "0.0.0",
"semanticVersionTimestamp": 1731464492834,
"versionStamp": "browsePathsV2:0;container:0;dataPlatformInstance:0;datasetKey:0;datasetProperties:1;schemaMetadata:1;status:0;subTypes:0;upstreamLineage:1",
"__typename": "SemanticVersionStruct"
}
],
"__typename": "GetSchemaVersionListResult"
}
},
"extensions": {}
} |
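Much of each response above is taken up by the repeated versionStamp strings and the GraphQL __typename markers. The sketch below shows one way to see how much smaller the payload becomes without them. It is only an illustration: the helper names (`strip_keys`, `compact_size`, `DROP_KEYS`) are hypothetical, the choice of which fields to drop is an assumption, and the character count is a rough stand-in for tokens, not the tokenizer measurement quoted above.

```python
import json

# Keys assumed to be unnecessary for an LLM client (an assumption, not project policy).
DROP_KEYS = {"__typename", "versionStamp"}

STAMP = ("browsePathsV2:0;container:0;dataPlatformInstance:0;datasetKey:0;"
         "datasetProperties:1;schemaMetadata:1;status:0;subTypes:0;upstreamLineage:1")

# The single-version response shown above.
response = {
    "data": {
        "getSchemaVersionList": {
            "latestVersion": {
                "semanticVersion": "0.0.0",
                "semanticVersionTimestamp": 1731464492834,
                "versionStamp": STAMP,
                "__typename": "SemanticVersionStruct",
            },
            "semanticVersionList": [
                {
                    "semanticVersion": "0.0.0",
                    "semanticVersionTimestamp": 1731464492834,
                    "versionStamp": STAMP,
                    "__typename": "SemanticVersionStruct",
                }
            ],
            "__typename": "GetSchemaVersionListResult",
        }
    },
    "extensions": {},
}


def strip_keys(value, drop=DROP_KEYS):
    """Recursively remove the given keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: strip_keys(v, drop) for k, v in value.items() if k not in drop}
    if isinstance(value, list):
        return [strip_keys(v, drop) for v in value]
    return value


def compact_size(value):
    """Length of the compact JSON encoding (a rough proxy for token count)."""
    return len(json.dumps(value, separators=(",", ":")))


trimmed = strip_keys(response)
print(compact_size(response), "->", compact_size(trimmed))
```

Running the real tokenizer on the trimmed output would be needed to get an actual token figure.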
Yes, that makes sense. Since the schema version is part of the entity, we can include it in the get_entity response. I also believe it’s sufficient to return only the semanticVersions field, like the following. What do you think? |
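A minimal sketch of how bundling the versions into get_entity might look, under stated assumptions: the function name `attach_semantic_versions`, the shape of the entity dict, and the exact fields kept under `semanticVersions` are all hypothetical and are not taken from this PR's code.

```python
def attach_semantic_versions(entity, version_list_result):
    """Return a copy of `entity` with a compact `semanticVersions` list added.

    `version_list_result` is assumed to be the `getSchemaVersionList` object
    from the GraphQL response; only the version and its timestamp are kept.
    """
    versions = [
        {
            "semanticVersion": v["semanticVersion"],
            "semanticVersionTimestamp": v["semanticVersionTimestamp"],
        }
        for v in version_list_result.get("semanticVersionList", [])
    ]
    # Build a new dict so the caller's entity is not mutated.
    return {**entity, "semanticVersions": versions}


# Hypothetical entity payload; a real get_entity result has many more fields.
entity = {"urn": "urn:li:dataset:(urn:li:dataPlatform:athena,salesmap.deals,PROD)"}
version_list_result = {
    "semanticVersionList": [
        {"semanticVersion": "0.1.0", "semanticVersionTimestamp": 1734662005844,
         "versionStamp": "...", "__typename": "SemanticVersionStruct"},
        {"semanticVersion": "0.0.0", "semanticVersionTimestamp": 1731464492834,
         "versionStamp": "...", "__typename": "SemanticVersionStruct"},
    ]
}

enriched = attach_semantic_versions(entity, version_list_result)
print(enriched["semanticVersions"])
```

Keeping only two small fields per version means the extra cost per entity grows slowly with the number of versions, which is the trade-off the review comment is weighing against adding a separate tool.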
I've tested with
Due to its sensitivity, I can’t share the test data I used. However, I’ll generate sample metadata and share the test results. It will take about three to four days to prepare. I’ll run the tests using the model and host I mentioned earlier. If there’s an LLM or MCP host you’d like me to test, please let me know. |
|
I'll reopen this PR when I'm ready. 🙏 |
Short Description & Changes
Add 2 MCP tools for querying schema versions and blame.
Motivation
Fixing queries whenever schemas change is a hassle.
These two MCP tools let you detect changed columns and auto-fix outdated queries using an LLM.
Use Case
Comparing the schema between two different versions
[Question Example]
Fixing outdated queries
[Question Example]
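To make the first use case concrete, here is a small sketch of comparing the columns of two schema versions. It assumes each version has already been reduced to a plain `{column: type}` mapping; the function name `diff_schema_fields` and the sample columns are hypothetical and do not reflect the output format of get_schema_blame().

```python
def diff_schema_fields(old_fields, new_fields):
    """Compare two {column: type} mappings and report what changed."""
    old_cols, new_cols = set(old_fields), set(new_fields)
    return {
        "added": sorted(new_cols - old_cols),
        "removed": sorted(old_cols - new_cols),
        # Columns present in both versions whose type differs.
        "retyped": sorted(
            col for col in old_cols & new_cols
            if old_fields[col] != new_fields[col]
        ),
    }


# Hypothetical columns for versions 0.0.0 and 0.1.0 of a dataset.
v0 = {"deal_id": "string", "amount": "int", "owner": "string"}
v1 = {"deal_id": "string", "amount": "double", "owner_id": "string"}

changes = diff_schema_fields(v0, v1)
print(changes)
```

A summary like this is the kind of input an LLM would need to rewrite a query that still references a removed or retyped column.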