Skip to content

[SPARK-44287][SQL] Use PartitionEvaluator API in RowToColumnarExec & ColumnarToRowExec SQL operators. #41839

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

vinodkc
Copy link
Contributor

@vinodkc vinodkc commented Jul 4, 2023

What changes were proposed in this pull request?

SQL operators RowToColumnarExec & ColumnarToRowExec are updated to use the PartitionEvaluator API to do execution.

Why are the changes needed?

To avoid the use of lambda during distributed execution.
Ref: SPARK-43061 for more details.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Updated 2 test cases, once all the SQL operators are migrated, the flag spark.sql.execution.useTaskEvaluator will be enabled by default to avoid running the tests with and without this TaskEvaluator

@github-actions github-actions bot added the SQL label Jul 4, 2023
@vinodkc
Copy link
Contributor Author

vinodkc commented Jul 4, 2023

@vinodkc vinodkc changed the title [SPARK-44287][SQL] Define PartitionEvaluator API for RowToColumnarExec & ColumnarToRowExec SQL operators. [SPARK-44287][SQL] Use PartitionEvaluator API for RowToColumnarExec & ColumnarToRowExec SQL operators. Jul 4, 2023
@vinodkc vinodkc force-pushed the br_refactorToEvaluatorFactory1 branch from a267a1a to d17712f Compare July 4, 2023 18:25
} else {
child.executeColumnar().mapPartitionsInternal { batches =>
val evaluator = evaluatorFactory.createEvaluator()
evaluator.eval(0, batches)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need pass real partition index?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the original code, the index was not used as mapPartitionsInternal is called

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not right. Even if it's not used for now, we should still set it correctly to be future-proof.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fixing it at #42185

Copy link
Contributor Author

@vinodkc vinodkc Jul 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan , Thanks for the fix,
I can fix similar issue in other merged PR
https://github.com/apache/spark/pull/42024/files

@vinodkc vinodkc force-pushed the br_refactorToEvaluatorFactory1 branch from d17712f to 2fcb311 Compare July 5, 2023 18:35
@vinodkc vinodkc force-pushed the br_refactorToEvaluatorFactory1 branch from 2fcb311 to 8908f9c Compare July 5, 2023 18:38
@vinodkc vinodkc changed the title [SPARK-44287][SQL] Use PartitionEvaluator API for RowToColumnarExec & ColumnarToRowExec SQL operators. [SPARK-44287][SQL] Use PartitionEvaluator API in RowToColumnarExec & ColumnarToRowExec SQL operators. Jul 7, 2023
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 56b9f6c Jul 7, 2023
cloud-fan added a commit that referenced this pull request Jul 28, 2023
### What changes were proposed in this pull request?

This is a followup of #41839, to set the partition index correctly even if it's not used for now. It also contains a few code cleanup.

### Why are the changes needed?

future-proof

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #42185 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Jul 28, 2023
### What changes were proposed in this pull request?

This is a followup of #41839, to set the partition index correctly even if it's not used for now. It also contains a few code cleanup.

### Why are the changes needed?

future-proof

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #42185 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit bf1bbc5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon pushed a commit that referenced this pull request Jul 31, 2023
### What changes were proposed in this pull request?

This is a followup of #41839 , to fix an unintentional change. That PR added an optimization to return an empty iterator directly if the input iterator is empty. However, checking `inputIterator.hasNext` may trigger query execution, which is different than before. It should be completely lazy and wait for the root operator's iterator to trigger the execution.

### Why are the changes needed?

fix unintentional change

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #42226 from cloud-fan/fo.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Jul 31, 2023
### What changes were proposed in this pull request?

This is a followup of #41839 , to fix an unintentional change. That PR added an optimization to return an empty iterator directly if the input iterator is empty. However, checking `inputIterator.hasNext` may trigger query execution, which is different than before. It should be completely lazy and wait for the root operator's iterator to trigger the execution.

### Why are the changes needed?

fix unintentional change

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #42226 from cloud-fan/fo.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 0f9cca5)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
…ColumnarToRowExec SQL operators

### What changes were proposed in this pull request?

SQL operators `RowToColumnarExec` & `ColumnarToRowExec`  are updated to use the `PartitionEvaluator` API to do execution.

### Why are the changes needed?

To avoid the use of lambda during distributed execution.
Ref: SPARK-43061 for more details.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Updated 2 test cases, once all the SQL operators are migrated, the flag `spark.sql.execution.useTaskEvaluator` will be enabled by default to avoid running the tests with and without this TaskEvaluator

Closes apache#41839 from vinodkc/br_refactorToEvaluatorFactory1.

Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
### What changes were proposed in this pull request?

This is a followup of apache#41839, to set the partition index correctly even if it's not used for now. It also contains a few code cleanup.

### Why are the changes needed?

future-proof

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes apache#42185 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
### What changes were proposed in this pull request?

This is a followup of apache#41839 , to fix an unintentional change. That PR added an optimization to return an empty iterator directly if the input iterator is empty. However, checking `inputIterator.hasNext` may trigger query execution, which is different than before. It should be completely lazy and wait for the root operator's iterator to trigger the execution.

### Why are the changes needed?

fix unintentional change

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes apache#42226 from cloud-fan/fo.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants