[SPARK-19690][SS] Join a streaming DataFrame with a batch DataFrame which has an aggregation may not work#17052
[SPARK-19690][SS] Join a streaming DataFrame with a batch DataFrame which has an aggregation may not work#17052uncleGen wants to merge 7 commits intoapache:masterfrom
Conversation
|
Test build #73407 has started for PR 17052 at commit |
|
Test build #73403 has finished for PR 17052 at commit
|
|
Test build #73412 has finished for PR 17052 at commit
|
|
cc @zsxwing |
There was a problem hiding this comment.
stateful indicates if the aggregate is base on streaming or batch, resolved by ResolveStatefulAggregate rule
There was a problem hiding this comment.
resolve one aggregate, determine statefule or not.
|
Thanks for doing this. I'm wondering if you can fix |
|
@zsxwing got it |
9c15fcb to
38e3a14
Compare
|
retest this please. |
|
Test build #73507 has finished for PR 17052 at commit
|
|
working on unit test failure |
c87651a to
59f4272
Compare
|
retest this please. |
|
Test build #73559 has finished for PR 17052 at commit
|
|
Test build #73558 has finished for PR 17052 at commit
|
|
Test build #73571 has started for PR 17052 at commit |
| SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY.key -> "1" | ||
| SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY.key -> "1", | ||
| SQLConf.UNSUPPORTED_OPERATION_CHECK_ENABLED.key -> "false" | ||
| ) { |
There was a problem hiding this comment.
Close the "UNSUPPORTED_OPERATION_CHECK_ENABLED", as Source.getBatch returns DF whose isStreaming is true.
| } else { | ||
| LocalRelation(projectList.map(_.toAttribute), data.map(projection)) | ||
| } | ||
| } |
There was a problem hiding this comment.
In a streaming query, we will transfrom stream source to a batch LocalRelation whose isStreaming is true, so we should keep new LocalRelation's isStreaming is true in this rule.
| case agg @ PhysicalAggregation( | ||
| namedGroupingExpressions, aggregateExpressions, rewrittenResultExpressions, child) | ||
| if agg.isStreaming => | ||
|
|
There was a problem hiding this comment.
Apply this strategy only if the logical plan is streaming.
|
retest this please. |
|
Test build #73576 has finished for PR 17052 at commit
|
|
\cc @zsxwing |
|
ping @zsxwing |
|
|
||
| private var _analyzed: Boolean = false | ||
|
|
||
| private var _incremental: Boolean = false |
There was a problem hiding this comment.
Adding it here will break sameResult, equals and other methods. Could you add a new parameter to the constructor of LogicalRelation and LogicalRDD instead?
c87651a to
67847e5
Compare
| case localRelation @ LocalRelation(_, _, false) => | ||
| localRelation.dataFromStreaming = true | ||
| localRelation | ||
| } |
There was a problem hiding this comment.
add a new parameter dataFromStreaming to the constructor of LogicalRelation, LogicalRDD and LocalRelation. dataFromStreaming indicate if this relation comes from a streaming source. In a streaming query, stream relation will be cut into a series of batch relations.
|
Test build #74099 has finished for PR 17052 at commit
|
|
Test build #74100 has finished for PR 17052 at commit
|
|
Test build #74109 has finished for PR 17052 at commit
|
|
Test build #74206 has finished for PR 17052 at commit
|
|
Test build #74844 has finished for PR 17052 at commit
|
|
retest this please. |
|
Test build #74852 has finished for PR 17052 at commit
|
|
retest this please. |
|
Test build #74968 has finished for PR 17052 at commit
|
|
This is marked as "Critical" for 2.1.1, but I'm not clear it's a regression or that urgent? |
|
@uncleGen is this still active? |
|
@HyukjinKwon Sorry! Busy for this period of time. Let me resolve this conflict. |
|
Yea, I just wanted to check if it is in progress in any way. Thanks for your input. |
|
Test build #78895 has finished for PR 17052 at commit
|
## What changes were proposed in this pull request? This PR proposes to close stale PRs, mostly the same instances with apache#18017 Closes apache#14085 - [SPARK-16408][SQL] SparkSQL Added file get Exception: is a directory … Closes apache#14239 - [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism to accelerate shuffle stage. Closes apache#14567 - [SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation Closes apache#14579 - [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() should return Python context managers Closes apache#14601 - [SPARK-13979][Core] Killed executor is re spawned without AWS key… Closes apache#14830 - [SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on Pyspark examples Closes apache#14963 - [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in lint-python Closes apache#15227 - [SPARK-17655][SQL]Remove unused variables declarations and definations in a WholeStageCodeGened stage Closes apache#15240 - [SPARK-17556] [CORE] [SQL] Executor side broadcast for broadcast joins Closes apache#15405 - [SPARK-15917][CORE] Added support for number of executors in Standalone [WIP] Closes apache#16099 - [SPARK-18665][SQL] set statement state to "ERROR" after user cancel job Closes apache#16445 - [SPARK-19043][SQL]Make SparkSQLSessionManager more configurable Closes apache#16618 - [SPARK-14409][ML][WIP] Add RankingEvaluator Closes apache#16766 - [SPARK-19426][SQL] Custom coalesce for Dataset Closes apache#16832 - [SPARK-19490][SQL] ignore case sensitivity when filtering hive partition columns Closes apache#17052 - [SPARK-19690][SS] Join a streaming DataFrame with a batch DataFrame which has an aggregation may not work Closes apache#17267 - [SPARK-19926][PYSPARK] Make pyspark exception more user-friendly Closes apache#17371 - [SPARK-19903][PYSPARK][SS] window operator miss the `watermark` metadata of time column Closes apache#17401 - [SPARK-18364][YARN] Expose metrics for YarnShuffleService Closes apache#17519 - [SPARK-15352][Doc] follow-up: add configuration docs for topology-aware block replication Closes apache#17530 - [SPARK-5158] Access kerberized HDFS from Spark standalone Closes apache#17854 - [SPARK-20564][Deploy] Reduce massive executor failures when executor count is large (>2000) Closes apache#17979 - [SPARK-19320][MESOS][WIP]allow specifying a hard limit on number of gpus required in each spark executor when running on mesos Closes apache#18127 - [SPARK-6628][SQL][Branch-2.1] Fix ClassCastException when executing sql statement 'insert into' on hbase table Closes apache#18236 - [SPARK-21015] Check field name is not null and empty in GenericRowWit… Closes apache#18269 - [SPARK-21056][SQL] Use at most one spark job to list files in InMemoryFileIndex Closes apache#18328 - [SPARK-21121][SQL] Support changing storage level via the spark.sql.inMemoryColumnarStorage.level variable Closes apache#18354 - [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting Closes apache#18383 - [SPARK-21167][SS] Set kafka clientId while fetch messages Closes apache#18414 - [SPARK-21169] [core] Make sure to update application status to RUNNING if executors are accepted and RUNNING after recovery Closes apache#18432 - resolve com.esotericsoftware.kryo.KryoException Closes apache#18490 - [SPARK-21269][Core][WIP] Fix FetchFailedException when enable maxReqSizeShuffleToMem and KryoSerializer Closes apache#18585 - SPARK-21359 Closes apache#18609 - Spark SQL merge small files to big files Update InsertIntoHiveTable.scala Added: Closes apache#18308 - [SPARK-21099][Spark Core] INFO Log Message Using Incorrect Executor I… Closes apache#18599 - [SPARK-21372] spark writes one log file even I set the number of spark_rotate_log to 0 Closes apache#18619 - [SPARK-21397][BUILD]Maven shade plugin adding dependency-reduced-pom.xml to … Closes apache#18667 - Fix the simpleString used in error messages Closes apache#18782 - Branch 2.1 Added: Closes apache#17694 - [SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcasts when using multiple threads Added: Closes apache#16456 - [SPARK-18994] clean up the local directories for application in future by annother thread Closes apache#18683 - [SPARK-21474][CORE] Make number of parallel fetches from a reducer configurable Closes apache#18690 - [SPARK-21334][CORE] Add metrics reporting service to External Shuffle Server Added: Closes apache#18827 - Merge pull request 1 from apache/master ## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@gmail.com> Closes apache#18780 from HyukjinKwon/close-prs.
What changes were proposed in this pull request?
StatefulAggregationStrategyshould check logicplan is streaming or notTest code:
before pr:
after pr:
How was this patch tested?
add new unit test.