feat(ingest/lineage): generate static json lineage file #13906
LGTM assuming green CI
Scope of the PR
To expose, in a static JSON file (`lineage.json`), which aspects of an entity are considered lineage.
What has been changed
- The `lineage.json` file remains up-to-date
- `modeldocgen.py` changed so it generates this lineage file
Why not server endpoint for lineage registry
An endpoint was recently added in #13865. An obvious question is why that is not enough. There are a few reasons:
- It does not return sufficiently granular details, i.e. there is no information on which field path's value being present means the aspect is lineage.


vs
- I am unable to correlate the values returned by the endpoint with our PDL files. For example, `upstreamEdges` is returned by the endpoint locally, but there is nothing with that name in the PDL files. So where is this coming from? We don't have any docs here.
- For offline analysis where we don't have access to a server, e.g. if someone is doing offline analytics on graph exports, they need to know which field of which aspect is lineage. This information currently needs to be hard-coded. With this static JSON file present, they can do offline analysis by downloading the file from GitHub.
- In case we want to remove the responsibility of getting this relationship information from modeldocgen, this gives us a gradual path: we can remove the responsibility from there and depend on this JSON instead. That speeds up and simplifies the modeldocgen code.
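To illustrate the offline-analysis case above, here is a minimal sketch. The layout of `lineage.json` assumed here (an `entities` map from entity type to aspect name to a list of `fields`, each with a dotted `path`) is an assumption made for this example and not the confirmed format of the generated file. `values_at` and `has_lineage` are hypothetical helper names.

```python
import json

# Hypothetical lineage.json content. The real generated file may be laid out
# differently; this shape exists only to make the example runnable.
SAMPLE_LINEAGE_JSON = """
{
  "entities": {
    "dataset": {
      "upstreamLineage": {
        "fields": [{"path": "upstreams.dataset"}]
      }
    }
  }
}
"""


def values_at(obj, path):
    """Collect every value found at a dotted path, descending into lists."""
    current = [obj]
    for key in path.split("."):
        found = []
        for item in current:
            children = item if isinstance(item, list) else [item]
            for child in children:
                if isinstance(child, dict) and key in child:
                    found.append(child[key])
        current = found
    return current


def has_lineage(registry, entity_type, aspect_name, aspect_value):
    """Return True if any lineage field path of the aspect holds a non-empty value."""
    fields = (
        registry["entities"]
        .get(entity_type, {})
        .get(aspect_name, {})
        .get("fields", [])
    )
    return any(v for f in fields for v in values_at(aspect_value, f["path"]))


registry = json.loads(SAMPLE_LINEAGE_JSON)
exported_aspect = {"upstreams": [{"dataset": "urn:li:dataset:example"}]}
print(has_lineage(registry, "dataset", "upstreamLineage", exported_aspect))
```

With a file like this checked into the repository, a script working on a graph export would only need to download the JSON, with no running server involved.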
Future work planned
- `_populate_aspect_metrics` method of `metadata-ingestion/src/datahub/ingestion/api/source.py`. This is to avoid hard-coding what is considered lineage for every different aspect.
Notes
- `metadata-ingestion/src/datahub/ingestion/autogenerated/lineage_helper.py` has been added. It is considered experimental for now, and each method has been clearly marked as such. It is in this folder mainly to show that it only works with this folder's code. It has also mostly been auto-generated using AI. It has mainly been added as a stop-gap measure to test some things, and will mostly be tweaked when the changes in `source.py` are done in a follow-up PR.
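As a rough sketch of the idea behind the planned `_populate_aspect_metrics` change, the snippet below counts lineage aspects per entity type using a lookup table instead of hard-coded per-aspect checks. It is not the actual code in `source.py` or `lineage_helper.py`: `LINEAGE_ASPECTS`, `count_lineage_aspects`, and the table contents are hypothetical and stand in for whatever would be derived from `lineage.json`.

```python
from collections import Counter

# Hypothetical lookup: entity type -> aspect names treated as lineage.
# In practice this would be derived from lineage.json rather than written by hand.
LINEAGE_ASPECTS = {
    "dataset": {"upstreamLineage"},
    "dataJob": {"dataJobInputOutput"},
}


def count_lineage_aspects(records, lineage_aspects=LINEAGE_ASPECTS):
    """Count, per entity type, how many emitted aspects are lineage aspects.

    `records` is an iterable of (entity_type, aspect_name) pairs.
    """
    counts = Counter()
    for entity_type, aspect_name in records:
        if aspect_name in lineage_aspects.get(entity_type, set()):
            counts[entity_type] += 1
    return counts


emitted = [
    ("dataset", "upstreamLineage"),
    ("dataset", "status"),
    ("dataJob", "dataJobInputOutput"),
    ("dataset", "upstreamLineage"),
]
print(count_lineage_aspects(emitted))
```

The point of the design is that supporting a new lineage aspect means regenerating the lookup data, not editing the metrics code.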