[SPARK-44287][SQL] Use PartitionEvaluator API in RowToColumnarExec & ColumnarToRowExec SQL operators. #41839

vinodkc · 2023-07-04T00:54:32Z

What changes were proposed in this pull request?

SQL operators RowToColumnarExec & ColumnarToRowExec are updated to use the PartitionEvaluator API to do execution.

Why are the changes needed?

To avoid the use of lambdas during distributed execution.
See SPARK-43061 for more details.
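
For readers unfamiliar with the pattern, the following is a minimal sketch in plain Scala with no Spark dependency. The traits `PartitionEvaluator` and `PartitionEvaluatorFactory` are simplified stand-ins modeled on the Spark interfaces of the same names, and `TaggingEvaluatorFactory` and `runPartition` are hypothetical names chosen for illustration. It is not the code from this PR; it only shows how a serializable factory can replace a closure passed to a partition-mapping call.

```scala
object EvaluatorSketch {
  // Simplified stand-ins for Spark's interfaces (assumption: shapes mirror the real ones).
  trait PartitionEvaluator[T, U] {
    def eval(partitionIndex: Int, inputs: Iterator[T]*): Iterator[U]
  }

  trait PartitionEvaluatorFactory[T, U] extends Serializable {
    def createEvaluator(): PartitionEvaluator[T, U]
  }

  // Hypothetical factory: it carries only the state it needs (a prefix string)
  // rather than a lambda that may capture the enclosing operator.
  class TaggingEvaluatorFactory(prefix: String)
      extends PartitionEvaluatorFactory[Int, String] {
    override def createEvaluator(): PartitionEvaluator[Int, String] =
      new PartitionEvaluator[Int, String] {
        override def eval(partitionIndex: Int, inputs: Iterator[Int]*): Iterator[String] =
          inputs.head.map(v => s"$prefix-$partitionIndex-$v")
      }
  }

  // Hypothetical driver standing in for what a task does for one partition.
  def runPartition[T, U](
      factory: PartitionEvaluatorFactory[T, U],
      index: Int,
      data: Iterator[T]): List[U] =
    factory.createEvaluator().eval(index, data).toList
}
```

In this sketch the only thing shipped to the "task" is the factory and its explicit fields, which is the property the migration is after.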

Does this PR introduce any user-facing change?

No

How was this patch tested?

Updated 2 test cases. Once all the SQL operators are migrated, the flag spark.sql.execution.useTaskEvaluator will be enabled by default, so the tests will no longer need to run both with and without the task evaluator.

vinodkc · 2023-07-04T00:57:50Z

CC @cloud-fan @viirya @dongjoon-hyun @yaooqinn @beliefer

sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala

sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala

sql/core/src/test/scala/org/apache/spark/sql/execution/SparkPlanSuite.scala

viirya · 2023-07-05T05:15:43Z

sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala

+    } else {
+      child.executeColumnar().mapPartitionsInternal { batches =>
+        val evaluator = evaluatorFactory.createEvaluator()
+        evaluator.eval(0, batches)


Don't we need to pass the real partition index?

vinodkc · Jul 5, 2023

In the original code, the index was not used, since mapPartitionsInternal was called.

cloud-fan · Jul 27, 2023

This is not right. Even if it's not used for now, we should still set it correctly to be future-proof.

cloud-fan · Jul 27, 2023

I'm fixing it at #42185

vinodkc · Jul 27, 2023 (edited)

@cloud-fan, thanks for the fix.
I can fix a similar issue in another merged PR:
https://github.com/apache/spark/pull/42024/files
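
The concern raised in this thread can be illustrated without Spark. The sketch below is an assumption-laden model, not the actual fix in #42185: partitions are represented as a `Vector[List[Int]]`, and `eval`, `hardCodedIndex`, and `realIndex` are hypothetical names. It contrasts evaluating every partition as index 0 with threading the real index through, which is what index-aware calls such as `RDD.mapPartitionsWithIndex` make possible.

```scala
object PartitionIndexSketch {
  // Hypothetical evaluator body: labels each value with the partition index it was given.
  def eval(partitionIndex: Int, input: Iterator[Int]): Iterator[(Int, Int)] =
    input.map(v => (partitionIndex, v))

  // Models the reviewed code: every partition is evaluated as if it were index 0.
  def hardCodedIndex(partitions: Vector[List[Int]]): List[(Int, Int)] =
    partitions.flatMap(p => eval(0, p.iterator)).toList

  // Models the follow-up: the real partition index is passed to the evaluator.
  def realIndex(partitions: Vector[List[Int]]): List[(Int, Int)] =
    partitions.zipWithIndex.flatMap { case (p, i) => eval(i, p.iterator) }.toList
}
```

As long as the evaluator ignores the index, both variants behave the same, which is why the hard-coded 0 went unnoticed; any evaluator that starts reading the index would silently get wrong values in the first variant.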

sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarEvaluatorFactory.scala

cloud-fan · 2023-07-07T19:58:05Z

thanks, merging to master!

### What changes were proposed in this pull request?

This is a followup of #41839, to set the partition index correctly even if it's not used for now. It also contains a few code cleanups.

### Why are the changes needed?

future-proof

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #42185 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit bf1bbc5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

### What changes were proposed in this pull request?

This is a followup of #41839, to fix an unintentional change. That PR added an optimization to return an empty iterator directly if the input iterator is empty. However, checking `inputIterator.hasNext` may trigger query execution, which differs from the previous behavior. It should be completely lazy and wait for the root operator's iterator to trigger the execution.

### Why are the changes needed?

fix unintentional change

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #42226 from cloud-fan/fo.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 0f9cca5)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
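
The laziness issue described in this follow-up can be reproduced with plain Scala iterators. The sketch below is an illustration under stated assumptions, not the code changed in #42226: `CountingSource`, `eagerWrap`, and `lazyWrap` are hypothetical names, and the counter stands in for "upstream work was triggered".

```scala
object LazinessSketch {
  // Hypothetical upstream source that records how often it is asked for data.
  final class CountingSource(data: List[Int]) extends Iterator[Int] {
    private val underlying = data.iterator
    var hasNextCalls = 0
    override def hasNext: Boolean = { hasNextCalls += 1; underlying.hasNext }
    override def next(): Int = underlying.next()
  }

  // Eager variant: peeks at the input while the result iterator is being built.
  def eagerWrap(input: Iterator[Int]): Iterator[Int] =
    if (input.hasNext) input.map(_ * 2) else Iterator.empty

  // Lazy variant: touches the input only when the result is consumed.
  def lazyWrap(input: Iterator[Int]): Iterator[Int] =
    input.map(_ * 2)
}
```

Both wrappers produce the same elements once consumed; the difference is only when the upstream is first touched. In the eager variant the source has already been queried before anyone reads the result, which corresponds to execution starting earlier than the root operator intended.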

…ColumnarToRowExec SQL operators

### What changes were proposed in this pull request?

SQL operators `RowToColumnarExec` & `ColumnarToRowExec` are updated to use the `PartitionEvaluator` API to do execution.

### Why are the changes needed?

To avoid the use of lambda during distributed execution.
Ref: SPARK-43061 for more details.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Updated 2 test cases. Once all the SQL operators are migrated, the flag `spark.sql.execution.useTaskEvaluator` will be enabled by default, so the tests will no longer need to run both with and without the task evaluator.

Closes apache#41839 from vinodkc/br_refactorToEvaluatorFactory1.

Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

github-actions bot added the SQL label Jul 4, 2023

vinodkc changed the title ~~[SPARK-44287][SQL] Define PartitionEvaluator API for RowToColumnarExec & ColumnarToRowExec SQL operators.~~ [SPARK-44287][SQL] Use PartitionEvaluator API for RowToColumnarExec & ColumnarToRowExec SQL operators. Jul 4, 2023

beliefer reviewed Jul 4, 2023

vinodkc force-pushed the br_refactorToEvaluatorFactory1 branch from a267a1a to d17712f Compare July 4, 2023 18:25

viirya reviewed Jul 5, 2023

sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarEvaluatorFactory.scala (outdated, resolved)

viirya reviewed Jul 5, 2023

sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarEvaluatorFactory.scala (outdated, resolved)

beliefer reviewed Jul 5, 2023

sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarEvaluatorFactory.scala (outdated, resolved)

sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarEvaluatorFactory.scala (resolved)

vinodkc force-pushed the br_refactorToEvaluatorFactory1 branch from d17712f to 2fcb311 Compare July 5, 2023 18:35

Add ColumnarEvaluatorFactory

8908f9c

vinodkc force-pushed the br_refactorToEvaluatorFactory1 branch from 2fcb311 to 8908f9c Compare July 5, 2023 18:38

vinodkc changed the title ~~[SPARK-44287][SQL] Use PartitionEvaluator API for RowToColumnarExec & ColumnarToRowExec SQL operators.~~ [SPARK-44287][SQL] Use PartitionEvaluator API in RowToColumnarExec & ColumnarToRowExec SQL operators. Jul 7, 2023

cloud-fan approved these changes Jul 7, 2023

cloud-fan closed this in 56b9f6c Jul 7, 2023

cloud-fan mentioned this pull request Jul 27, 2023

[SPARK-44287][SQL][FOLLOWUP] Set partition index correctly #42185

Closed

cloud-fan mentioned this pull request Jul 30, 2023

[SPARK-44287][SQL][FOLLOWUP] Do not trigger execution too early #42226

Closed
