Grouping Dataset Events to Trigger DAGs #42015
Comments
Hi @uranusjr,
Isn’t this basically the idea behind AIP-76?
Possibly related, but I'm not sure. Does that include batching the events?
AIP-82 contains this functionality.
The only mention of batching I can find in AIP-82 is under the Out of Scope section. AIP-76 does not do batching, but works on a different level, separating individual events from triggering the actual downstream run. I do not know if it fits your use case; only you can decide.
Currently, we're using a pull-based mechanism and triggering dataset creation events via the REST API. However, if we want to trigger events only when enough of them have accumulated, we have to maintain external state, group the events, and then send the dataset creation request. It would be great to have this feature built in as part of the existing functionality.
It is challenging to map data warehouse partitions to the dataset partitions mentioned in AIP-76. As a result, batching events for triggering DAG runs is not feasible in this context.
Description
No response
Use case/motivation
To handle multiple dataset updates efficiently and avoid triggering a DAG for every small dataset update (such as a tiny partition), you can implement a "batching" mechanism where the DAG waits for a group of dataset events before triggering. This avoids redundant DAG runs and ensures the DAG executes only once enough meaningful updates have accumulated.
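As a rough illustration of the batching idea, here is a minimal sketch of the kind of external state that currently has to be kept outside Airflow. It is not an Airflow API: `DatasetEventBatcher`, `threshold`, and the `flush` callback are hypothetical names chosen for this example, and the callback stands in for whatever actually sends the dataset event (for instance, a REST API call).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DatasetEventBatcher:
    """Hypothetical helper: collect events per dataset URI and flush in groups."""

    threshold: int  # number of events to accumulate before flushing
    flush: Callable[[str, List[dict]], None]  # assumed sender, e.g. a REST call
    _pending: Dict[str, List[dict]] = field(default_factory=dict, init=False)

    def add(self, dataset_uri: str, extra: dict) -> bool:
        """Record one update; return True if this update caused a flush."""
        batch = self._pending.setdefault(dataset_uri, [])
        batch.append(extra)
        if len(batch) < self.threshold:
            return False
        # Enough updates accumulated: hand the whole group to the sender once.
        self.flush(dataset_uri, batch)
        del self._pending[dataset_uri]
        return True

    def pending_count(self, dataset_uri: str) -> int:
        """Number of updates still waiting for this dataset URI."""
        return len(self._pending.get(dataset_uri, []))


# Usage: four small partition updates, flushed in one group of three.
sent = []
batcher = DatasetEventBatcher(
    threshold=3,
    flush=lambda uri, batch: sent.append((uri, list(batch))),
)
for partition in ("p=1", "p=2", "p=3", "p=4"):
    batcher.add("s3://warehouse/orders", {"partition": partition})
```

With this sketch, the downstream DAG would be triggered once for the first three partitions, while the fourth stays pending until two more updates arrive. A count threshold is only one possible policy; a time window or a size limit could be swapped in at the same place.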
Related issues
No response
Are you willing to submit a PR?
Code of Conduct