Description
A colleague of mine opened an issue in the OpenLineage repo and has received no response so far, so perhaps this is the right place to post issues :)
The issue we are facing is that Marquez seems to break lineage if the same logical job produces different datasets on different runs. Our reality (and I believe others' as well) is that our processes are dynamic in their output. I do not think this is an edge case.
The use case is this:
- We have a logical ETL job which is scheduled to run a few times during the day.
- The job pushes data into tables based on the contents of the input files (which are in S3).
Example
The example below is super simplified but I believe it paints the right picture.
Job name: users_etl
Job input: The last modified file(s) found in the path template s3:///users/{yyyy}/{mm}/{dd}
Run no. 1
The input file contains nested user info (first_name, last_name, email, address: {city, state}), so the job will update the users table (which has the first_name, last_name and email columns) and the users_address table (which has the city and state columns).
Output:
- users table
- users_address table
Run no. 2
The input file contains flat user info (first_name, last_name, email), so the job will update the users table (which has the first_name, last_name and email columns).
Output:
- users table
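To make the two runs concrete, here is a rough sketch of what the run events could look like, written as plain Python dicts in the general shape of an OpenLineage run event (eventType, run, job, inputs, outputs). The namespaces (example_jobs, example_s3, example_db) and the input file names are made up for illustration, and fields such as producer and schemaURL are left out for brevity. This is not what our job actually emits, only an approximation of it.

```python
import uuid
from datetime import datetime, timezone


def complete_event(job_name, inputs, outputs):
    """Build a minimal OpenLineage-style COMPLETE run event as a plain dict."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        # Namespaces below are hypothetical placeholders.
        "job": {"namespace": "example_jobs", "name": job_name},
        "inputs": [{"namespace": "example_s3", "name": n} for n in inputs],
        "outputs": [{"namespace": "example_db", "name": n} for n in outputs],
    }


def output_names(event):
    """Return the sorted output dataset names of a run event."""
    return sorted(d["name"] for d in event["outputs"])


# Run no. 1: nested input, two output tables (file name is hypothetical).
run_1 = complete_event("users_etl", ["users/nested.json"], ["users", "users_address"])
# Run no. 2: flat input, one output table (file name is hypothetical).
run_2 = complete_event("users_etl", ["users/flat.json"], ["users"])

print(output_names(run_1))
print(output_names(run_2))
```

Both events share the same job name and differ only in their run ID, input and output list, which is the situation described above.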
The Problem
In Marquez I can only see the users table in the lineage of the users_etl job. The users_address dataset gets orphaned.
The state after Run no. 1
Everything is as expected.
The state after Run no. 2
Only the latest output is displayed, and the previous output is now completely detached from the lineage graph!
The Expectation
I expected to still see the users_address table in the lineage graph. Without it, all I'm getting is last-run lineage, and while that is useful in some cases, it presents a confusing picture that does not reflect the real relationships between jobs and datasets. I mean, what am I supposed to understand about the users_address table, that it simply popped into existence?
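The difference between what I see and what I expected can be sketched as two ways of deriving a job's outputs from its run history. This is only a toy model of the two views, with run records reduced to a list of output names. It says nothing about how Marquez actually stores or resolves lineage, and the function names are my own.

```python
def last_run_outputs(runs):
    """Outputs of the most recent run only (the view I am seeing)."""
    return set(runs[-1]["outputs"]) if runs else set()


def cumulative_outputs(runs):
    """Union of outputs over every run of the job (the view I expected)."""
    seen = set()
    for run in runs:
        seen |= set(run["outputs"])
    return seen


# Simplified history of the users_etl job from the example above.
runs = [
    {"run": 1, "outputs": ["users", "users_address"]},
    {"run": 2, "outputs": ["users"]},
]

# Datasets that were written by the job at some point but vanish
# from a last-run-only view.
orphaned = cumulative_outputs(runs) - last_run_outputs(runs)
print(sorted(orphaned))  # ['users_address']
```

With a cumulative view, users_address stays attached to users_etl even after a run that did not write to it.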