Dataset missing from lineage graph #2543

A colleague of mine opened an [issue in the OpenLineage repo](https://github.com/OpenLineage/OpenLineage/issues/1965) and has received no response so far, so perhaps this is the right place to post it :)

The issue we are facing is that Marquez seems to break lineage if the same logical job produces different datasets on different runs. Our reality (and, I believe, that of others as well) is that our processes are dynamic in their output, so I do not think this is an edge case.

The use case is this:
- We have a logical ETL job which is scheduled to run a few times during the day.
- The job pushes data into tables based on the contents of the input files (which are in S3).


## Example
The example below is super simplified but I believe it paints the right picture.

**Job name**: `users_etl`
**Job input**: The last modified file(s) found in the path template `s3:///users/{yyyy}/{mm}/{dd}`

#### Run no. 1
The input file contains nested user info (first_name, last_name, email, address: {city, state}), so the job will update the `users` table (which has the first_name, last_name and email columns) and the `users_address` table (which has the city and state columns).

Output:
- `users` table
- `users_address` table

#### Run no. 2
The input file contains flat user info (first_name, last_name, email), so the job will update only the `users` table (which has the first_name, last_name and email columns).

Output:
- `users` table
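
To make the two runs concrete, here is a rough sketch of the run events I would expect each run to report. It uses plain Python dicts rather than a client library, the field names follow my understanding of the OpenLineage run event format, and the namespaces and producer URL are made-up placeholders. Inputs and required fields such as `schemaURL` are left out for brevity.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical placeholder values, not taken from our real deployment.
JOB_NAMESPACE = "example_namespace"
DATASET_NAMESPACE = "example_db"
PRODUCER = "https://example.com/users_etl"


def complete_event(output_tables):
    """Build a trimmed-down COMPLETE event for users_etl with the given outputs."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "run": {"runId": str(uuid.uuid4())},  # a new run ID for every run
        "job": {"namespace": JOB_NAMESPACE, "name": "users_etl"},  # same job each time
        "outputs": [
            {"namespace": DATASET_NAMESPACE, "name": table}
            for table in output_tables
        ],
    }


run_1 = complete_event(["users", "users_address"])  # Run no. 1: nested input
run_2 = complete_event(["users"])                   # Run no. 2: flat input

print(json.dumps(run_2, indent=2))
```

The job identity is the same in both events; only the `outputs` list differs.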

### The Problem
In Marquez I can only see the `users` table in the lineage of the `users_etl` job. The `users_address` dataset gets orphaned.

#### The state after Run no. 1
Everything is as expected.

![image](https://github.com/MarquezProject/marquez/assets/5017039/dedba1e5-7521-4573-8dcc-f10be8b1eb9d)

#### The state after Run no. 2
Only the latest output is displayed.

![image](https://github.com/MarquezProject/marquez/assets/5017039/8a216594-f227-4588-b7b8-95f80b88db75)

The previous output is now completely detached from the lineage graph!

![image](https://github.com/MarquezProject/marquez/assets/5017039/ea682b16-9ff2-49ac-88b2-948ce3582ce4)

### The Expectation
I expected to continue seeing the `users_address` table in the lineage graph. Without it, all I get is last-run lineage. That is useful in some cases, but it paints a confusing picture that does not reflect the real relationships between jobs and datasets. What am I supposed to conclude about the `users_address` table, that it simply popped into existence?
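
Here is a minimal sketch of the difference between what I observe and what I expected. It is plain Python for illustration only, not Marquez code, and the function names are made up.

```python
# Outputs reported by each run of users_etl, oldest first (from the example).
runs = [
    {"users", "users_address"},  # Run no. 1
    {"users"},                   # Run no. 2
]


def last_run_outputs(runs):
    """What I observe: only the most recent run's outputs stay attached."""
    return set(runs[-1]) if runs else set()


def all_time_outputs(runs):
    """What I expected: every dataset the job has ever written stays attached."""
    return set().union(*runs)


print(sorted(last_run_outputs(runs)))  # ['users']
print(sorted(all_time_outputs(runs)))  # ['users', 'users_address']
```

With the first approach `users_address` drops out of the job's lineage as soon as Run no. 2 completes; with the second it stays linked to `users_etl`.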

 

