Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(datasets): persist dataset extra dict to dataset event #36075

Closed

Conversation

error418
Copy link
Contributor

@error418 error418 commented Dec 5, 2023


Persists Dataset extras to the according DatasetEvent to be used in Tasks invoked by Data aware scheduling. Please see linked issue for more information

closes: #35297

@error418 error418 force-pushed the 35297-dataset-extras-are-not-persisted branch from ee6d441 to 056893c Compare December 6, 2023 05:30
@uranusjr
Copy link
Member

uranusjr commented Dec 6, 2023

Do we need to do the merge at all? From what I can tell, the extra parameter on register_dataset_change is currently not used anywhere, and I suspect the original intention was to use that field to forward extra from the dataset directly to dataset event instead. Maybe we should just remove extra=None from register_dataset_change and just do the forwarding.

cc @blag for some history on this

@error418
Copy link
Contributor Author

error418 commented Dec 6, 2023

@uranusjr I also thought the implementation was intended this way. I removed the merge with the latest commit to this PR for the review.

In case the discussion comes to the result that a merge of the dictionaries is required we can quickly roll back.

@error418 error418 force-pushed the 35297-dataset-extras-are-not-persisted branch from 5e6009f to a607a31 Compare December 6, 2023 12:56
@blag
Copy link
Contributor

blag commented Dec 6, 2023

Please refer to the PR implementing this, and the discussion about an extra parameter here.

IIRC, the intent here was not to break database normalization and copy DatasetModel.extra for every DatasetEvent, it was to enable third party DatasetManagers (originally named DatasetEventManager in my PR linked above) to record additional data from external sources (including non-Airflow sources) with a minimum of fuss. The extra fields of DatasetModel and DatasetEvent are similarly named, but used for different and distinct purposes, and I did not intend for DatasetEvent.extra to be copied from DatasetModel.extra.

Given that, I would not expect DatasetEvent.extra to be used within Airflow, as if we wanted to store more information in the DatasetEvent table, we would just update the database schema. However, just because it is not used internally within Airflow does not mean may not be useful.

IMO #35297 is invalid, because tasks should be able to query the Airflow DB directly for the dataset extra field using the keys in triggering_dataset_events, and should absolutely not expect the dataset extra field to be copied to all dataset event extra fields.

@blag
Copy link
Contributor

blag commented Dec 6, 2023

The docs are not optimally worded and definitely conflate datasets and dataset events, possibly leading to this confusion. See here.

Fetching information from a Triggering Dataset Event

A triggered DAG can fetch information from the Dataset that triggered it using the triggering_dataset_events template or parameter.

Do datasets trigger DAGs, or do dataset events trigger DAGs? The title of this section and its text seem to disagree. The answer is: dataset events trigger DAGs.

A one word change would vastly improve the precision of this section:

A triggered DAG can fetch information from the Dataset event that triggered it using the triggering_dataset_events template or parameter.

Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 23, 2024
@error418 error418 closed this Jan 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Stale PRs per the .github/workflows/stale.yml policy file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataset extra fields set in Task outlets are not accessible
3 participants