Dataset missing from lineage graph #2543

Open
@yonivy

A colleague of mine opened an issue in the OpenLineage repo and has received no response so far, so perhaps this is the right place to post it :)

The issue we are facing is that Marquez seems to break lineage if the same logical job produces different datasets on different runs. Our reality (and, I believe, that of others as well) is that our processes are dynamic in their output. I do not think this is an edge case.

The use case is this:

  • We have a logical ETL job which is scheduled to run a few times during the day.
  • The job pushes data into tables based on the contents of the input files (which are in S3).

Example

The example below is super simplified but I believe it paints the right picture.

Job name: users_etl
Job input: The last modified file(s) found in the path template s3:///users/{yyyy}/{mm}/{dd}

Run no. 1

The input file contains nested user info (first_name, last_name, email, address: {city, state}), so the job will update the users table (which has the first_name, last_name and email columns) and the users_address table (which has the city and state columns).

Output:

  • users table
  • users_address table

Run no. 2

The input file contains flat user info (first_name, last_name, email), so the job will update only the users table (which has the first_name, last_name and email columns).

Output:

  • users table
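To make the two runs concrete, here is a rough sketch of the kind of run events the job might emit, written as plain Python dicts. The field names follow the general shape of an OpenLineage run event, but the dicts are simplified and not validated against the spec. The namespace, the producer URL and the input dataset name are placeholders I made up for illustration.

```python
import uuid
from datetime import datetime, timezone


def make_complete_event(job_name, input_names, output_names,
                        namespace="example_namespace"):
    """Build a simplified, OpenLineage-style COMPLETE event as a dict.

    The namespace and producer values are hypothetical placeholders.
    """
    def dataset(name):
        return {"namespace": namespace, "name": name}

    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},  # a new run id for every run
        "job": {"namespace": namespace, "name": job_name},  # same logical job
        "inputs": [dataset(n) for n in input_names],
        "outputs": [dataset(n) for n in output_names],
        "producer": "https://example.com/hypothetical-producer",
    }


def outputs_of(event):
    """Return the output dataset names of an event, in order."""
    return [d["name"] for d in event["outputs"]]


# Run no. 1: nested input, two output tables.
run_1 = make_complete_event(
    "users_etl", ["users_input_file"], ["users", "users_address"])

# Run no. 2: flat input, one output table.
run_2 = make_complete_event("users_etl", ["users_input_file"], ["users"])
```

Both events carry the same job identity and differ only in their run id and their list of outputs, which is exactly the situation described above.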

The Problem

In Marquez I can only see the users table in the lineage of the users_etl job. The users_address dataset gets orphaned.

The state after Run no. 1

Everything is as expected.

The state after Run no. 2

Only the latest output is displayed, and the previous output is now completely detached from the lineage graph!

The Expectation

I expected to continue seeing the users_address table in the lineage graph. Without it, all I'm getting is last-run lineage, and while that is useful in some cases, it presents a confusing picture that does not reflect the real relationships between jobs and datasets. I mean, what am I supposed to understand about the users_address table, that it simply popped into existence?
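The difference between what I see and what I expected can be sketched with plain sets. This is only a model of the observed behaviour, not Marquez's actual implementation; the function names and the run history structure are made up for illustration.

```python
# Hypothetical run history for users_etl: (run label, output datasets).
runs = [
    ("run_1", {"users", "users_address"}),
    ("run_2", {"users"}),
]


def last_run_outputs(history):
    """Outputs of the most recent run only (what the graph shows now)."""
    return set(history[-1][1]) if history else set()


def cumulative_outputs(history):
    """Union of outputs across all runs (what I expected to see)."""
    result = set()
    for _label, outputs in history:
        result |= outputs
    return result


def orphaned_outputs(history):
    """Datasets the job wrote in the past that the last run did not touch."""
    return cumulative_outputs(history) - last_run_outputs(history)


print(sorted(last_run_outputs(runs)))
print(sorted(cumulative_outputs(runs)))
print(sorted(orphaned_outputs(runs)))
```

With the two runs from the example, the last-run view contains only users, the cumulative view contains both tables, and users_address is the dataset that ends up orphaned.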
