-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-14287] isStreaming method for Dataset #12080
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #54585 has finished for PR 12080 at commit
|
Test build #54589 has finished for PR 12080 at commit
|
LGTM |
@@ -449,6 +450,17 @@ class Dataset[T] private[sql]( | |||
def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation] | |||
|
|||
/** | |||
* Returns true if the underlying query will be executed continuously as new data comes in. | |||
* Methods that return bounded values, e.g. [[count()]], [[collect()]] will throw | |||
* an exception if a Dataset is streaming. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about?
Returns true if this [[Dataset]] contains one or more sources that continuously return data as it arrives.
A [[Dataset]] that reads data from a streaming source must be executed as a [[ContinuousQuery]] using
the `startStream()` method in [[DataFrameWriter]]. Methods that return a single answer, (e.g., `count()` or
`collect()`) will throw an [[AnalysisException]] when there is a streaming source present.
Implementation LGTM. |
@marmbrus Addressed your comment |
Test build #54875 has finished for PR 12080 at commit
|
Test build #54896 has finished for PR 12080 at commit
|
test this please |
@tdas here is another failure |
Test build #54907 has finished for PR 12080 at commit
|
Thanks, merging to master! |
With the addition of StreamExecution (ContinuousQuery) to Datasets, data will become unbounded. With unbounded data, the execution of some methods and operations will not make sense, e.g.
Dataset.count()
.A simple API is required to check whether the data in a Dataset is bounded or unbounded. This will allow users to check whether their Dataset is in streaming mode or not. ML algorithms may check if the data is unbounded and throw an exception for example.
The implementation of this method is simple, however naming it is the challenge. Some possible names for this method are:
I've gone with
isStreaming
for now. We can change it before Spark 2.0 if we decide to come up with a different name. For that reason I've marked it as@Experimental