Skip to content

Conversation

hsheth2
Copy link
Contributor

@hsheth2 hsheth2 commented May 5, 2025

Adds a couple new features to our SQL parser

  • Best-effort column logic extraction. In case a column goes through multiple layers of transformation, we only take the final one.
    • For each column, we also annotate whether it was a direct copy or if it underwent any transformation - see the is_direct_copy bool.
  • Join extraction. For each join clause, we try to figure out (1) the on clause used and (2) the full list of tables/columns involved in the join. The latter is more reliable when there's CTEs.

The column logic extraction is also now integrated into the SqlParsingAggregator, so our main warehouse integrations should benefit from it. The join logic extraction is not yet used anywhere.

All tests are updated, and I also added a couple new ones for tricky edge cases (e.g. self-joins).

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label May 5, 2025
Copy link

codecov bot commented May 5, 2025

Codecov Report

Attention: Patch coverage is 93.75000% with 6 lines in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...gestion/src/datahub/sql_parsing/sqlglot_lineage.py 93.33% 6 Missing ⚠️

📢 Thoughts on this report? Let us know!

@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label May 5, 2025
Copy link
Contributor

@sgomezvillamor sgomezvillamor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome!

beyond self-join, what's the coverage for other join variants such as inner/outer/cross joins? they could be explicitely tested

should user-facing doc be updated with the new features?

@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels May 6, 2025
@hsheth2
Copy link
Contributor Author

hsheth2 commented May 6, 2025

@sgomezvillamor this should work for all join types, including ones that use subqueries or CTEs

There's no user-facing portion of this yet, so I don't want to add any user-facing docs around this until it shows up in the product somewhere

),
confidenceScore=queries_map[query_id].confidence_score,
confidenceScore=query.confidence_score,
transformOperation=(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we format as a string and not convert it into a JSON object to make it parsable by machine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is shown in the UI as-is

"downstreams": [
"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:mssql,demodata.foo.age_dist,PROD),age)"
],
"transformOperation": "COPY: [persons].[age] AS [age]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct with the square brackets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes - mssql uses square brackets for quotes

@hsheth2 hsheth2 merged commit 7c791db into master May 13, 2025
61 checks passed
@hsheth2 hsheth2 deleted the sqlglot-joins branch May 13, 2025 00:19
kartikey-visa pushed a commit to kartikey-visa/datahub that referenced this pull request Jul 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata pending-submitter-merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants