@eagle-25

@eagle-25 eagle-25 commented Jun 2, 2025

Short Description & Changes

Add two MCP tools for querying schema versions and schema blame.

  • get_schema_versions()
  • get_schema_blame()

Motivation

Fixing queries whenever schemas change is a hassle.

These two MCP tools let you detect changed columns and automatically fix outdated queries using an LLM.

Use Case

  1. Comparing the schema between two different versions

    [Question Example]

    What are the schema differences between the latest and the previous version of the Athena table called 'shop_log'?
    
  2. Fixing outdated queries

    [Question Example]

    Could you fix the outdated column name in the given query? 
    
    The query was written for version v0.1.0, but as of now the latest version is v0.2.0. 
    
    Here's my query:
    '''
    select col1, col2 
    from table
    '''
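
For illustration only, here is a minimal sketch of the column comparison behind both use cases. It is not the code in this PR: the `diff_columns` helper and the column lists for v0.1.0 and v0.2.0 are made up for the example, and a real diff would come from the schema metadata returned by the tools.

```python
def diff_columns(old_columns, new_columns):
    """Return the columns removed and added between two schema versions."""
    old_set, new_set = set(old_columns), set(new_columns)
    return {
        "removed": sorted(old_set - new_set),
        "added": sorted(new_set - old_set),
    }


# Hypothetical column lists for two versions of the same table.
columns_v010 = ["col1", "col2", "created_at"]
columns_v020 = ["col1", "col2_renamed", "created_at"]

changes = diff_columns(columns_v010, columns_v020)
print(changes)
```

A removed column paired with an added one hints at a rename, which is the kind of signal an LLM could use to rewrite an outdated query.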
    

@eagle-25 eagle-25 force-pushed the feat/ass-schema-history-tools branch from 29fb762 to 4c07f10 Compare June 2, 2025 09:42
@eagle-25 eagle-25 marked this pull request as ready for review June 2, 2025 09:58
@eagle-25 eagle-25 changed the title feat: add schema version tools feat: add schema tools Jun 2, 2025
@eagle-25 eagle-25 force-pushed the feat/ass-schema-history-tools branch from 4c07f10 to fa24775 Compare June 2, 2025 10:07
@hsheth2
Contributor

hsheth2 commented Jun 4, 2025

As with #7 - my main worry here is that every new tool consumes additional tokens on every request. The more tools we have, the more likely it is that the LLM gets confused / doesn't call our other tools when it should. So I'd like to think about what we can do to reduce the number of tools while keeping our responses simple.

How big are the responses of getSchemaVersionList in number of tokens? I'm wondering if it might make sense to just bundle that into the standard get_entity.

The other secondary worry is around real-world testing. What MCP clients / LLMs have you tried this with so far? Do you have some (non-sensitive) screenshots of the request -> response. I particularly want to understand if the tool descriptions are sufficiently clear.

@eagle-25
Author

eagle-25 commented Jun 7, 2025

@hsheth2 Thank you for your feedback. I totally agree with your concern.

I’ll address your questions in three separate comments.

@eagle-25
Author

eagle-25 commented Jun 7, 2025

How big are the responses of getSchemaVersionList in number of tokens?

Measured with the OpenAI Tokenizer (gpt-4o), the GraphQL getSchemaVersionList response is 233 tokens with a single version and 324 tokens with two versions, so one additional version adds about 91 tokens.

Request

curl --location 'http://datahub.classting.net:8080/api/graphql' \
--header 'Content-Type: application/json' \
--header 'Authorization: ••••••' \
--data '{
    "operationName": "getSchemaVersionList",
    "variables": {
        "input": {
            "datasetUrn": "urn:li:dataset:(urn:li:dataPlatform:athena,salesmap.deals,PROD)"
        }
    },
    "query": "query getSchemaVersionList($input: GetSchemaVersionListInput!) {\n  getSchemaVersionList(input: $input) {\n    latestVersion {\n      semanticVersion\n      semanticVersionTimestamp\n      versionStamp\n      __typename\n    }\n    semanticVersionList {\n      semanticVersion\n      semanticVersionTimestamp\n      versionStamp\n      __typename\n    }\n    __typename\n  }\n}\n"
}'

Min (only one version): 233 tokens

{
    "data": {
        "getSchemaVersionList": {
            "latestVersion": {
                "semanticVersion": "0.0.0",
                "semanticVersionTimestamp": 1731464492834,
                "versionStamp": "browsePathsV2:0;container:0;dataPlatformInstance:0;datasetKey:0;datasetProperties:1;schemaMetadata:1;status:0;subTypes:0;upstreamLineage:1",
                "__typename": "SemanticVersionStruct"
            },
            "semanticVersionList": [
                {
                    "semanticVersion": "0.0.0",
                    "semanticVersionTimestamp": 1731464492834,
                    "versionStamp": "browsePathsV2:0;container:0;dataPlatformInstance:0;datasetKey:0;datasetProperties:1;schemaMetadata:1;status:0;subTypes:0;upstreamLineage:1",
                    "__typename": "SemanticVersionStruct"
                }
            ],
            "__typename": "GetSchemaVersionListResult"
        }
    },
    "extensions": {}
}

One version added (two versions in total): 324 tokens

{
    "data": {
        "getSchemaVersionList": {
            "latestVersion": {
                "semanticVersion": "0.0.0",
                "semanticVersionTimestamp": 1731464492834,
                "versionStamp": "browsePathsV2:0;container:0;dataPlatformInstance:0;datasetKey:0;datasetProperties:1;schemaMetadata:1;status:0;subTypes:0;upstreamLineage:1",
                "__typename": "SemanticVersionStruct"
            },
            "semanticVersionList": [
                {
                    "semanticVersion": "0.1.0",
                    "semanticVersionTimestamp": 1734662005844,
                    "versionStamp": "browsePathsV2:0;container:0;dataPlatformInstance:0;datasetKey:0;datasetProperties:0;schemaMetadata:2;status:0;subTypes:0;upstreamLineage:2",
                    "__typename": "SemanticVersionStruct"
                },
                {
                    "semanticVersion": "0.0.0",
                    "semanticVersionTimestamp": 1731464492834,
                    "versionStamp": "browsePathsV2:0;container:0;dataPlatformInstance:0;datasetKey:0;datasetProperties:1;schemaMetadata:1;status:0;subTypes:0;upstreamLineage:1",
                    "__typename": "SemanticVersionStruct"
                }
            ],
            "__typename": "GetSchemaVersionListResult"
        }
    },
    "extensions": {}
}
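
As a rough way to extrapolate from these two measurements, the sketch below assumes the response grows linearly with the number of versions. That is only an assumption: `versionStamp` strings can differ in length, so real counts will drift. The function name `estimate_response_tokens` is made up for this example.

```python
MIN_TOKENS = 233           # measured above: response with one version
TWO_VERSION_TOKENS = 324   # measured above: response with two versions
TOKENS_PER_EXTRA_VERSION = TWO_VERSION_TOKENS - MIN_TOKENS  # 91


def estimate_response_tokens(version_count: int) -> int:
    """Estimate response size, assuming each extra version costs the same."""
    if version_count < 1:
        raise ValueError("version_count must be at least 1")
    return MIN_TOKENS + TOKENS_PER_EXTRA_VERSION * (version_count - 1)


for count in (1, 2, 10):
    print(count, estimate_response_tokens(count))
```

Under this assumption, a dataset with 10 schema versions would come to roughly 1,052 tokens.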

@eagle-25
Author

eagle-25 commented Jun 7, 2025

I'm wondering if it might make sense to just bundle that into the standard get_entity.

Yes, that makes sense. Since the schema version is part of the entity, we can include it in the get_entity response.

I also believe it’s sufficient to return only the semantic version strings, as in the following. What do you think?

getEntityResponse {
    ...,
    schemaSemanticVersions: [
        "0.0.0",
        "0.1.0"
    ]
}
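
A minimal sketch of how that compact list could be derived from the getSchemaVersionList response shown earlier. The helper name `extract_semantic_versions` is hypothetical and not part of this PR, and sorting oldest-first by `semanticVersionTimestamp` is my assumption about the desired order.

```python
def extract_semantic_versions(response: dict) -> list[str]:
    """Reduce a getSchemaVersionList response to its version strings."""
    entries = response["data"]["getSchemaVersionList"]["semanticVersionList"]
    # Sort oldest first so the list reads in release order.
    ordered = sorted(entries, key=lambda e: e["semanticVersionTimestamp"])
    return [entry["semanticVersion"] for entry in ordered]


# Trimmed copy of the two-version response from the earlier comment.
sample_response = {
    "data": {
        "getSchemaVersionList": {
            "semanticVersionList": [
                {"semanticVersion": "0.1.0", "semanticVersionTimestamp": 1734662005844},
                {"semanticVersion": "0.0.0", "semanticVersionTimestamp": 1731464492834},
            ]
        }
    }
}

schema_semantic_versions = extract_semantic_versions(sample_response)
print(schema_semantic_versions)
```

Dropping `versionStamp` and `__typename` removes most of the per-version token cost, since the version string is the only part the LLM needs to pick a version.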

@eagle-25
Author

eagle-25 commented Jun 7, 2025

The other secondary worry is around real-world testing. What MCP clients / LLMs have you tried this with so far?

I've tested with o4-mini via the OpenAI Python SDK and with Sonnet 4 via Claude Desktop.

Do you have some (non-sensitive) screenshots of the request -> response. I particularly want to understand if the tool descriptions are sufficiently clear.

Because the test data I used is sensitive, I can’t share it.

However, I’ll generate sample metadata and share the test results. It will take about three to four days to prepare. I’ll run the tests using the models and hosts I mentioned above. If there’s another LLM or MCP host you’d like me to test, please let me know.

@eagle-25
Author

I'll reopen this PR when I'm ready. 🙏

@eagle-25 eagle-25 closed this Jun 11, 2025