
feat(ingest/lineage): generate static json lineage file #13906

Merged
merged 10 commits into from
Jul 1, 2025

Conversation

anshbansal
Collaborator

@anshbansal anshbansal commented Jun 30, 2025

Scope of the PR

Expose which aspects of an entity are considered lineage in a static JSON file, lineage.json.

What has been changed

  • CI and pre-commit hooks changed so that the lineage.json file stays up to date
  • Code changed in modeldocgen.py so that it generates this lineage file
  • Tests added to increase coverage
  • Gradle task added to generate the lineage JSON file

Why not a server endpoint for the lineage registry

An endpoint was recently added in #13865. An obvious question is why that is not enough. There are a few reasons:

  • It does not return sufficiently granular details, i.e. there is no information on which field path, when its value is present, indicates lineage
    image
    vs
    image

  • I am unable to correlate the values returned by the endpoint with our PDL files. For example, upstreamEdges is returned by the endpoint locally, but nothing with that name exists in the PDL files. So where does it come from? We don't have any docs for this
    image

  • Offline analysis has no access to a server. For example, someone doing offline analytics on graph exports needs to know which field of which aspect is lineage. This information currently has to be hard-coded. With this static JSON file they can download it from GitHub and do the analysis offline.

  • If we want to remove the responsibility for deriving this relationship information from modeldocgen, this gives us a gradual path: we can remove that responsibility there and depend on this JSON instead. That would speed up and simplify the modeldocgen code
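
As a rough illustration of the offline-analysis case above, the sketch below loads a lineage file and looks up which field paths of an exported aspect carry lineage. The JSON layout (`entities` → entity type → aspect name → `fields` with a `path`), the sample contents, and the helper names are assumptions made for this example; they are not the actual schema of lineage.json or the API of lineage_helper.py.

```python
import json

# Hypothetical lineage.json contents; the real file's schema may differ.
SAMPLE_LINEAGE_JSON = """
{
  "entities": {
    "dataset": {
      "upstreamLineage": {
        "fields": [
          {"path": "upstreams.dataset", "relationship": "DownstreamOf"}
        ]
      }
    }
  }
}
"""


def load_lineage_fields(raw: str) -> dict:
    """Map (entity_type, aspect_name) to the field paths flagged as lineage."""
    data = json.loads(raw)
    index = {}
    for entity_type, aspects in data.get("entities", {}).items():
        for aspect_name, info in aspects.items():
            index[(entity_type, aspect_name)] = [
                field["path"] for field in info.get("fields", [])
            ]
    return index


def is_lineage_aspect(index: dict, entity_type: str, aspect_name: str) -> bool:
    """Return True if the aspect has at least one lineage field path."""
    return bool(index.get((entity_type, aspect_name)))


# In a real offline analysis the string would come from the downloaded file,
# e.g. open("lineage.json").read(); the embedded sample keeps this runnable.
lineage_index = load_lineage_fields(SAMPLE_LINEAGE_JSON)
```

With an index like this, a graph-export analysis could filter rows by `is_lineage_aspect(...)` instead of hard-coding aspect names.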

Future work planned

  • Marking how much lineage was produced in the source report using this information. The changes will mainly be made in the _populate_aspect_metrics method of metadata-ingestion/src/datahub/ingestion/api/source.py. This is to avoid hard-coding what is considered lineage for every different aspect
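
A minimal sketch of that idea, assuming the set of lineage aspect names has already been derived from the lineage file. The function name, the input shape, and the sample aspect set are hypothetical; this is not the existing _populate_aspect_metrics implementation.

```python
from collections import Counter

# Hypothetical set of lineage aspect names, e.g. derived from lineage.json.
LINEAGE_ASPECTS = {"upstreamLineage", "dataJobInputOutput"}


def count_lineage_aspects(emitted):
    """Count, per entity type, how many emitted aspects are lineage aspects.

    `emitted` is an iterable of (entity_type, aspect_name) pairs.
    """
    counts = Counter()
    for entity_type, aspect_name in emitted:
        if aspect_name in LINEAGE_ASPECTS:
            counts[entity_type] += 1
    return counts


# Sample emitted aspects for illustration only.
emitted = [
    ("dataset", "upstreamLineage"),
    ("dataset", "ownership"),
    ("dataJob", "dataJobInputOutput"),
    ("dataset", "upstreamLineage"),
]
lineage_counts = count_lineage_aspects(emitted)
```

Because the lineage check is a set lookup driven by data, adding a new lineage aspect would only require regenerating the file rather than touching the counting code.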

Notes

  • metadata-ingestion/src/datahub/ingestion/autogenerated/lineage_helper.py has been added. It is considered experimental for now, and each method is clearly marked as such. It lives in this folder mainly to show that it only works with this folder's code. It was also mostly generated with AI. It has been added mainly as a stop-gap measure to test some things, and will most likely be tweaked when the changes in source.py are made in a follow-up PR

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jun 30, 2025

codecov bot commented Jun 30, 2025

Codecov Report

Attention: Patch coverage is 98.03922% with 1 line in your changes missing coverage. Please review.

❌ Unsupported file format

Upload processing failed due to unsupported file format. Please review the parser error message:
Error parsing JUnit XML in /home/runner/work/datahub/datahub/metadata-io/build/test-results/test/TEST-com.linkedin.metadata.graph.search.elasticsearch.SearchGraphServiceElasticSearchTest.xml at 117:1058

Caused by:
RuntimeError: Error converting computed name to ValidatedString

Caused by:
    string is too long

For more help, visit our troubleshooting guide.

Files with missing lines Patch % Lines
.../datahub/ingestion/autogenerated/lineage_helper.py 98.03% 1 Missing ⚠️

@anshbansal anshbansal marked this pull request as ready for review July 1, 2025 10:42
@anshbansal anshbansal changed the title from "wip to add lineage helper" to "feat(ingest/lineage): generate static json lineage file" Jul 1, 2025
@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Jul 1, 2025
@datahub-cyborg datahub-cyborg bot removed the pending-submitter-response Issue/request has been reviewed but requires a response from the submitter label Jul 1, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Jul 1, 2025
Collaborator

@pedro93 pedro93 left a comment

LGTM assuming green CI

@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Jul 1, 2025
@anshbansal anshbansal merged commit 92784ec into master Jul 1, 2025
65 checks passed
@anshbansal anshbansal deleted the ab-2025-jun-30-add-lineage-helper branch July 1, 2025 15:21
kartikey-visa pushed a commit to kartikey-visa/datahub that referenced this pull request Jul 23, 2025
Labels
ingestion PR or Issue related to the ingestion of metadata pending-submitter-merge
2 participants