Grouping Dataset Events to Trigger DAGs #42015

Open
1 of 2 tasks
dirrao opened this issue Sep 5, 2024 · 7 comments
Labels
area:datasets (Issues related to the datasets feature), kind:feature (Feature Requests)

Comments

@dirrao
Collaborator

dirrao commented Sep 5, 2024

Description

No response

Use case/motivation

To handle multiple dataset updates efficiently and avoid triggering a DAG for every small dataset update (such as a tiny partition), Airflow could provide a "batching" mechanism in which the DAG waits for a group of dataset events before triggering. This avoids redundant DAG runs and ensures the DAG executes only once enough meaningful updates have accumulated.
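
To make the intended semantics concrete, here is a minimal sketch of count-based batching. `EventBatcher`, `batch_size`, and `on_batch` are hypothetical names used only for illustration; nothing here is an existing Airflow API, and a real implementation would probably also need a time-based cutoff.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EventBatcher:
    """Collects dataset events and fires a callback once per full batch.

    Hypothetical helper for illustration; not part of Airflow.
    """

    batch_size: int
    on_batch: Callable[[List[dict]], None]
    _pending: List[dict] = field(default_factory=list)

    def add(self, event: dict) -> bool:
        """Buffer one event; return True if this event completed a batch."""
        self._pending.append(event)
        if len(self._pending) < self.batch_size:
            return False
        # Hand the full batch to the callback and start a fresh buffer.
        batch, self._pending = self._pending, []
        self.on_batch(batch)
        return True


# Each entry appended to `triggered` stands in for one DAG run.
triggered = []
batcher = EventBatcher(batch_size=3, on_batch=triggered.append)
for day in range(1, 8):
    batcher.add({"partition": f"2024-09-{day:02d}"})
# Seven partition updates produce two triggers; the seventh stays buffered.
```

With a batch size of 3, the downstream DAG would run twice instead of seven times, and the leftover event waits for the next two updates.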

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@dirrao dirrao added the kind:feature, needs-triage, and area:datasets labels Sep 5, 2024
@dirrao
Collaborator Author

dirrao commented Sep 10, 2024

Hi @uranusjr,
could you provide your feedback on this feature request when you have a moment?

@uranusjr
Member

Isn’t this basically the idea behind AIP-76?

@dirrao
Collaborator Author

dirrao commented Sep 17, 2024

Possibly related, but I'm not sure. Does that include batching the events?

@dirrao
Collaborator Author

dirrao commented Oct 2, 2024

@dirrao dirrao removed the needs-triage label Oct 2, 2024
@uranusjr
Member

uranusjr commented Oct 2, 2024

The only mention of batching I can find in AIP-82 is under the Out of Scope section.

AIP-76 does not do batching, but works on a different level, separating individual events from triggering the actual downstream run. I do not know if it fits your use case; only you can decide.

@dirrao
Collaborator Author

dirrao commented Oct 2, 2024

Currently, we're using a pull-based mechanism and triggering dataset creation events via the REST API. However, if we want to trigger events only once enough of them have accumulated, we have to maintain external state, group the events ourselves, and then send a single dataset creation request. It would be great to have this built in as part of the existing dataset functionality.
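
For reference, the workaround described above looks roughly like the sketch below. Everything specific in it is an assumption: the base URL, the JSON state file, the threshold, and the helper names are placeholders, and the `POST /datasets/events` path with a `dataset_uri` plus `extra` payload is how I understand the stable REST API in recent Airflow 2.x releases, so please verify it against your version. The request is only built here, not sent.

```python
import json
from pathlib import Path
from urllib import request

# Assumptions: base URL, state file location, and threshold are placeholders.
AIRFLOW_API = "http://localhost:8080/api/v1"
STATE_FILE = Path("pending_partitions.json")
THRESHOLD = 5


def build_event_request(dataset_uri: str, partitions: list) -> request.Request:
    """Build (but do not send) one dataset-event POST covering a whole batch."""
    body = json.dumps({"dataset_uri": dataset_uri, "extra": {"partitions": partitions}})
    return request.Request(
        f"{AIRFLOW_API}/datasets/events",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def record_partition(dataset_uri, partition, state_file=STATE_FILE, threshold=THRESHOLD):
    """Persist a newly seen partition; return a request once enough have accumulated."""
    pending = json.loads(state_file.read_text()) if state_file.exists() else []
    pending.append(partition)
    if len(pending) < threshold:
        # Not enough updates yet: save the external state and do nothing else.
        state_file.write_text(json.dumps(pending))
        return None
    # Threshold reached: reset the state and emit one event for the whole batch.
    state_file.write_text("[]")
    return build_event_request(dataset_uri, pending)
```

The polling job would call `record_partition(...)` for each new partition it discovers and, when a request comes back, send it with `urllib.request.urlopen` plus whatever authentication the deployment requires. The state file and the grouping logic are exactly the parts we would like Airflow to own.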

@dirrao
Collaborator Author

dirrao commented Oct 2, 2024

It is challenging to map data warehouse partitions to the dataset partitions mentioned in AIP-76. As a result, batching events for triggering DAG runs is not feasible in this context.
