Document streaming usecase (like UNBOUNDED tables) #9016
Comments
@metesynnada @mustafasrepo @ozankabak or @edmondop @mwylde do you know of any existing documentation / examples we could adapt? |
I think the major initial push in this area came in #4694 |
I can help with it. Assign to me. @alamb |
Would one need to write custom unbounded sources? |
I don't think so @edmondop -- I was thinking anything that gives others examples / help starting would be great. Maybe we can start with some SQL reference / API docs and high level commentary in one PR and then add an example as another PR. Right now there is nothing documented, so it will be very easy to improve the status quo! |
I've realized that |
The example on #9070 is great for understanding how to track event time with |
The way most of the code in DataFusion works is that it uses the next distinct value in the data to trigger emission (that is, you have to see an event with the next time value). I don't know of any way to send a synthetic signal that says "will never see any more values in this time interval". |
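To make the "next distinct value triggers emission" behavior concrete, here is a minimal std-only sketch. It is not DataFusion code: `OrderedSum`, `push`, and `finish` are hypothetical names chosen for illustration, and the input is assumed to arrive already sorted by key (for example, an event-time bucket).

```rust
/// Sums `value` per `key` over input that is assumed to be sorted by `key`.
/// A group is emitted only when a row with a different key arrives, so the
/// most recent group stays pending for as long as the input remains open.
/// (Hypothetical illustration, not a DataFusion API.)
struct OrderedSum {
    /// (key, running sum) of the group that is still open, if any.
    current: Option<(i64, i64)>,
}

impl OrderedSum {
    fn new() -> Self {
        Self { current: None }
    }

    /// Feed one row. Returns a finished group if this row closed one.
    fn push(&mut self, key: i64, value: i64) -> Option<(i64, i64)> {
        match self.current.take() {
            // Same key: keep accumulating, nothing can be emitted yet.
            Some((k, sum)) if k == key => {
                self.current = Some((k, sum + value));
                None
            }
            // Different key: the previous group is complete, emit it.
            Some(done) => {
                self.current = Some((key, value));
                Some(done)
            }
            // First row ever seen.
            None => {
                self.current = Some((key, value));
                None
            }
        }
    }

    /// Flush the open group. Only meaningful when the input is bounded,
    /// since an unbounded input never reaches its end.
    fn finish(&mut self) -> Option<(i64, i64)> {
        self.current.take()
    }
}
```

Pushing rows with keys 1, 1, 2 returns nothing for the first two rows and returns the completed group for key 1 only when the row with key 2 arrives. The group for key 2 then stays pending until another key shows up or `finish` is called, which is the gap a synthetic "no more values in this interval" signal would have to fill.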
Yes, IMO this seems to fall out of scope for "upstream" DataFusion (we also have these challenges, but for this reason we choose to solve them downstream). |
A bit late to this party :), few questions regarding this topic:
thanks a lot |
I think the execution is run as a separate tokio task -- so whenever someone wants to see if the next result is available they would do |
Regarding the point above: if we have a running unbounded pipeline, then when a shutdown/cancel event comes, sources should stop producing, probably commit offsets and so on, and give the rest of the pipeline a chance to finish everything in flight before shutdown. So, to reformulate my question 2: "who will inform the source that it should wrap up for this execution and shut down? :)" Or am I missing something obvious here? |
I think this needs to be handled by the containing system (nothing in DataFusion will do it directly). It would have to hold on to some reference to the source that can be signaled to stop. In terms of shutdown, when the stream is |
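As a rough illustration of the containing system holding a reference to the source that can be signaled to stop, here is a std-only sketch. Everything in it is an assumption made for the example: `CounterSource`, `StopHandle`, and `counter_source` are hypothetical names, not DataFusion APIs, and a plain `Iterator` stands in for a real record batch stream.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical unbounded source: yields 0, 1, 2, ... until asked to stop.
struct CounterSource {
    next_value: u64,
    stop: Arc<AtomicBool>,
}

/// Handle kept by the containing system (not by the query engine).
#[derive(Clone)]
struct StopHandle(Arc<AtomicBool>);

impl StopHandle {
    /// Ask the source to wrap up; it takes effect on the next poll.
    fn stop(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
}

/// Build a source together with the handle that can stop it.
fn counter_source() -> (CounterSource, StopHandle) {
    let flag = Arc::new(AtomicBool::new(false));
    let source = CounterSource {
        next_value: 0,
        stop: Arc::clone(&flag),
    };
    (source, StopHandle(flag))
}

impl Iterator for CounterSource {
    type Item = u64;

    fn next(&mut self) -> Option<u64> {
        if self.stop.load(Ordering::SeqCst) {
            // A real source would commit offsets here before ending.
            return None;
        }
        let value = self.next_value;
        self.next_value += 1;
        Some(value)
    }
}
```

Once `StopHandle::stop` is called, the source ends its output on the next poll, so downstream consumers see a normal end of input and can drain whatever is still in flight.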
Is your feature request related to a problem or challenge?
Someone asked in discord:
I am pretty sure this is exactly the case for using UNBOUNDED tables with explicitly defined ORDER BY from Synnada, Arroyo, and others. However, when I went to look for the documentation, I couldn't find any mention of this usecase or documentation of unbounded tables.
Describe the solution you'd like
I would like to help make it easier for people to use DataFusion for streaming usecases by:
UNBOUNDED keyword in the CREATE EXTERNAL TABLE documentation
Describe alternatives you've considered
No response
Additional context
No response