[SPARK-35194][SQL] Refactor nested column aliasing for readability #32301

karenfeng · 2021-04-22T17:20:07Z

What changes were proposed in this pull request?

Refactors NestedColumnAliasing and GeneratorNestedColumnAliasing for readability.

Why are the changes needed?

Improves readability for future maintenance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Signed-off-by: Karen Feng <karen.feng@databricks.com>

karenfeng · 2021-04-22T17:22:48Z

@viirya, this will either block/be blocked by your ongoing PR. Let me know what you think - I think this will improve the readability.

viirya · 2021-04-22T17:39:05Z

Thanks @karenfeng. I will take a look at this.

SparkQA · 2021-04-22T18:16:36Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42348/

SparkQA · 2021-04-22T18:21:35Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42348/

SparkQA · 2021-04-22T21:47:53Z

Test build #137818 has finished for PR 32301 at commit 9656899.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

allisonwang-db

Add a few comments. Please also add [SQL] to the PR title :)

allisonwang-db · 2021-04-23T00:23:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

-        getNewProjectList(projectList, nestedFieldToAlias),
-        replaceWithAliases(child, nestedFieldToAlias, attrToAliases))
+  def replacePlanWithAliases(
+        plan: LogicalPlan,


nit: 4 space indentation

Signed-off-by: Karen Feng <karen.feng@databricks.com>

SparkQA · 2021-04-23T22:57:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42401/

SparkQA · 2021-04-23T22:57:22Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42401/

SparkQA · 2021-04-24T02:45:51Z

Test build #137871 has finished for PR 32301 at commit a95360a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-04-26T09:17:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

-        case _ => false
-      }
+  private def collectExtractValue(e: Expression): Seq[ExtractValue] = e match {
+    case g if isSelectedField(g) => Seq(g.asInstanceOf[ExtractValue])


nit: case e: ExtractValue if isSelectedField(e) => Seq(e)
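
To illustrate the nit above, here is a minimal sketch contrasting the cast-based match with a typed pattern plus guard. The types `Attr` and `GetField` and the simplified `isSelectedField` are hypothetical stand-ins, not Catalyst's real classes, and the real method presumably also recurses into children.

```scala
// Hypothetical stand-ins for Catalyst's expression hierarchy (assumption).
object TypedPatternSketch {
  sealed trait Expression
  sealed trait ExtractValue extends Expression
  final case class Attr(name: String) extends Expression
  final case class GetField(child: Expression, field: String) extends ExtractValue

  // Simplified placeholder for the real predicate.
  def isSelectedField(e: Expression): Boolean = e match {
    case GetField(_: Attr | _: ExtractValue, _) => true
    case _ => false
  }

  // Before: the guard tells the compiler nothing about the type, so a cast is needed.
  def collectWithCast(e: Expression): Seq[ExtractValue] = e match {
    case g if isSelectedField(g) => Seq(g.asInstanceOf[ExtractValue])
    case _ => Seq.empty
  }

  // After: the typed pattern binds `ev` as an ExtractValue, so no cast is needed.
  def collectTyped(e: Expression): Seq[ExtractValue] = e match {
    case ev: ExtractValue if isSelectedField(ev) => Seq(ev)
    case _ => Seq.empty
  }
}
```

Both variants return the same result for the same input; the typed pattern simply moves the type check into the pattern, where the compiler can see it.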

cloud-fan · 2021-04-26T09:23:41Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

-    // Note that when we group by extractors with their references, we should remove
-    // cosmetic variations.
+    val nestedFieldReferences = exprList.flatMap(collectExtractValue)
+    val otherRootReferences = exprList.flatMap(collectAttributeReference)


Previously we collected both the nested fields extraction and other root references at the same time, and split them later. Now we collect them separately. I think the current code is clearer but is less performant.

How about we use mutable collections to implement this logic with one tree traversal?

val nestedFieldReferences = mutable.ArrayBuffer.empty[ExtractValue]
val otherRootReferences = mutable.ArrayBuffer.empty[AttributeReference]
exprList.foreach(e => collectRootReferenceAndExtractValue(e, nestedFieldReferences, otherRootReferences))
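
A self-contained sketch of the single-traversal idea suggested above, under stated assumptions: `Attr`, `GetField`, `Add`, `collect`, and `split` are illustrative names only, and the matching rules are much simpler than the real `collectRootReferenceAndExtractValue`.

```scala
import scala.collection.mutable

// Toy expression tree (assumption); not Catalyst's real classes.
object SingleTraversalSketch {
  sealed trait Expression { def children: Seq[Expression] }
  final case class Attr(name: String) extends Expression {
    def children: Seq[Expression] = Nil
  }
  final case class GetField(child: Expression, field: String) extends Expression {
    def children: Seq[Expression] = Seq(child)
  }
  final case class Add(left: Expression, right: Expression) extends Expression {
    def children: Seq[Expression] = Seq(left, right)
  }

  // One walk per expression: a field access is recorded whole and not descended into,
  // a bare attribute is recorded as a root reference, anything else is recursed into.
  def collect(
      e: Expression,
      nested: mutable.ArrayBuffer[GetField],
      roots: mutable.ArrayBuffer[Attr]): Unit = e match {
    case g: GetField => nested += g
    case a: Attr => roots += a
    case other => other.children.foreach(collect(_, nested, roots))
  }

  // Fills both buffers in a single pass over exprList.
  def split(exprList: Seq[Expression]): (Seq[GetField], Seq[Attr]) = {
    val nested = mutable.ArrayBuffer.empty[GetField]
    val roots = mutable.ArrayBuffer.empty[Attr]
    exprList.foreach(collect(_, nested, roots))
    (nested.toSeq, roots.toSeq)
  }
}
```

The trade-off is the one raised in the comment: two separate `flatMap` passes are easier to read, while the shared buffers avoid walking each expression tree twice.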

cloud-fan · 2021-04-26T09:42:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

-    case _: AttributeReference => Seq(e)
-    case GetStructField(_: ExtractValue | _: AttributeReference, _, _) => Seq(e)
+  private def isSelectedField(e: Expression): Boolean = e match {
+    case GetStructField(_: ExtractValue | _: AttributeReference, _, _) => true


Not related to this PR: I don't get the reason to match ExtractValue. For GetStructField(GetArrayItem(Attribute, index), fieldName), how can the data source support it?

cloud-fan · 2021-04-26T09:44:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

+      val nestedFieldToAlias = attributeToExtractValues.flatMap { case (_, nestedFields) =>
+        nestedFields.map { f =>
+          val exprId = NamedExpression.newExprId
+          f -> Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None)


does the alias name matter?

Not particularly. I discussed this with @allisonwang-db offline, and we think it may be more useful for this name to reflect the struct and field. What do you think?

Signed-off-by: Karen Feng <karen.feng@databricks.com>

cloud-fan · 2021-05-26T17:15:50Z

@viirya any more comments?

viirya · 2021-05-26T17:46:22Z

Thanks @cloud-fan @karenfeng. I will check this again today.

viirya · 2021-05-26T23:05:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

+    exprList.foreach { e =>
+      collectRootReferenceAndExtractValue(e).foreach {
+        case ev: ExtractValue =>
+          assert(ev.references.size == 1, s"$ev should have one reference")


`s"$ev should have one reference, but got: ${ev.references}"`?

viirya

lgtm

Signed-off-by: Karen Feng <karen.feng@databricks.com>

SparkQA · 2021-05-27T02:23:52Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43516/

SparkQA · 2021-05-27T05:17:26Z

Test build #138997 has finished for PR 32301 at commit 83e2611.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-05-27T07:24:02Z

seems a legit error in org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.udf_struct

viirya · 2021-05-27T07:33:36Z

java.lang.AssertionError: assertion failed: struct(col1, 1, col2, struct(col1, a, col2, 1.5)).col2.col1 should have one reference, but found {}

Oh, it doesn't have more than one reference; it has no reference at all...

karenfeng · 2021-05-27T18:23:38Z

@viirya - in the case that the number of references is !=1, should we exclude the ExtractValue from nestedFieldReferences?

viirya · 2021-05-27T20:06:19Z

> @viirya - in the case that the number of references is !=1, should we exclude the ExtractValue from nestedFieldReferences?

I think so. It should be a safer approach to exclude them from the pruning candidates.
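
A minimal sketch of the approach agreed on above: filter out field accesses whose reference count is not exactly one, rather than asserting. All types and the `pruningCandidates` helper are hypothetical; `references` is modeled as a set of attribute names, which is simpler than Catalyst's `AttributeSet`.

```scala
// Toy model (assumption); not Catalyst's real classes.
object ReferenceCountSketch {
  sealed trait Expression { def references: Set[String] }
  final case class Attr(name: String) extends Expression {
    def references: Set[String] = Set(name)
  }
  // A struct built purely from literals references no attribute,
  // as in the failing udf_struct case reported above.
  final case class LiteralStruct(fieldNames: Seq[String]) extends Expression {
    def references: Set[String] = Set.empty
  }
  final case class GetField(child: Expression, field: String) extends Expression {
    def references: Set[String] = child.references
  }

  // Keep only field accesses rooted in exactly one attribute.
  def pruningCandidates(fields: Seq[GetField]): Seq[GetField] =
    fields.filter(_.references.size == 1)
}
```

With this shape, a field access on a literal struct is skipped as a pruning candidate instead of failing the assertion.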

Signed-off-by: Karen Feng <karen.feng@databricks.com>

SparkQA · 2021-05-27T22:21:16Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43553/

SparkQA · 2021-05-27T22:53:27Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43553/

SparkQA · 2021-05-28T01:48:58Z

Test build #139035 has finished for PR 32301 at commit 8a29e94.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-05-28T13:18:43Z

thanks, merging to master!

sarutak · 2021-05-28T14:45:17Z

This change seems to break the build with Scala 2.13 on GA.
https://github.com/apache/spark/runs/2694564384
I'll open a PR to fix it.

### What changes were proposed in this pull request?
This PR fixes a build error with Scala 2.13 on GA. #32301 seems to bring this error.

### Why are the changes needed?
To recover CI.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

Closes #32696 from sarutak/followup-SPARK-35194.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>

### What changes were proposed in this pull request?
Refactors `NestedColumnAliasing` and `GeneratorNestedColumnAliasing` for readability.

### Why are the changes needed?
Improves readability for future maintenance.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes apache#32301 from karenfeng/refactor-nested-column-aliasing.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

dongjoon-hyun

Hi, All. While investigating SPARK-39854, it unfortunately turned out that this PR introduced the regression reported in SPARK-39854, which has been present since Apache Spark 3.2.0.

I tested both this commit and its parent commit and verified that SPARK-39854 reproduces with this commit but not with its parent.

apache#35170 (SPARK-37855) and apache#32301 (SPARK-35194) introduced conditions for ExtractValues that currently cannot be handled. The condition is applied after `collectRootReferenceAndExtractValue` and just removes these candidates. This is problematic since these expressions might have contained an `AttributeReference` that is needed to avoid an incorrect rewrite. This fixes this family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`.

### What changes were proposed in this pull request?
#35170 (SPARK-37855) and #32301 (SPARK-35194) introduced conditions for ExtractValues that currently cannot be handled. The condition is applied after `collectRootReferenceAndExtractValue` and just removes these candidates. This is problematic since these expressions might have contained an `AttributeReference` that is needed to avoid an incorrect aliasing. This fixes this family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`.

### Why are the changes needed?
The current code leads to `IllegalStateException` runtime failures.

### Does this PR introduce _any_ user-facing change?
Yes, fixes a bug.

### How was this patch tested?
Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46756 from eejbyfeldt/SPARK-48428.

Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
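
As a rough illustration of the problem described in this follow-up (a sketch of one reading of the description, not the actual Spark code): if unsupported field accesses are dropped after collection, the attribute inside them is never recorded as a root reference. `canHandle`, the `"unsupported"` field name, and all types below are hypothetical.

```scala
// Toy model (assumption); not Catalyst's real classes.
object CollectConditionSketch {
  sealed trait Expression { def children: Seq[Expression] }
  final case class Attr(name: String) extends Expression {
    def children: Seq[Expression] = Nil
  }
  final case class GetField(child: Expression, field: String) extends Expression {
    def children: Seq[Expression] = Seq(child)
  }

  // Hypothetical stand-in for "conditions that currently cannot be handled".
  def canHandle(g: GetField): Boolean = g.field != "unsupported"

  // Old shape: match every field access first...
  def collectAll(e: Expression): Seq[Expression] = e match {
    case g: GetField => Seq(g)
    case a: Attr => Seq(a)
  }

  // ...then drop unsupported ones afterwards. The attribute inside a dropped
  // field access is lost along with it.
  def collectThenFilter(e: Expression): Seq[Expression] =
    collectAll(e).filter {
      case g: GetField => canHandle(g)
      case _ => true
    }

  // New shape: the condition is part of the match, so an unsupported field
  // access falls through and its attribute is collected as a root reference.
  def collectWithCondition(e: Expression): Seq[Expression] = e match {
    case g: GetField if canHandle(g) => Seq(g)
    case a: Attr => Seq(a)
    case other => other.children.flatMap(collectWithCondition)
  }
}
```

For the inputs `a.ok` and `a.unsupported`, the old shape reports only `a.ok`, which makes it look as if nothing else in `a` is needed. The new shape also reports `a` itself, so `a` is no longer treated as safe to prune.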

### What changes were proposed in this pull request?
#35170 (SPARK-37855) and #32301 (SPARK-35194) introduced conditions for ExtractValues that currently cannot be handled. The condition is applied after `collectRootReferenceAndExtractValue` and just removes these candidates. This is problematic since these expressions might have contained an `AttributeReference` that is needed to avoid an incorrect aliasing. This fixes this family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`.

### Why are the changes needed?
The current code leads to `IllegalStateException` runtime failures.

### Does this PR introduce _any_ user-facing change?
Yes, fixes a bug.

### How was this patch tested?
Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46756 from eejbyfeldt/SPARK-48428.

Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b11608c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

karenfeng added 2 commits April 21, 2021 16:52

Refactor NestedColumnAliasing

37ac8a9

Signed-off-by: Karen Feng <karen.feng@databricks.com>

Continue cleanup

9656899

Signed-off-by: Karen Feng <karen.feng@databricks.com>

github-actions bot added the SQL label Apr 22, 2021

allisonwang-db reviewed Apr 23, 2021

viirya changed the title ~~[SPARK-35194] Refactor nested column aliasing for readability~~ [SPARK-35194][SQL] Refactor nested column aliasing for readability Apr 23, 2021

cloud-fan reviewed Apr 23, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

karenfeng added 2 commits April 23, 2021 13:58

Add comments

e44c683

Signed-off-by: Karen Feng <karen.feng@databricks.com>

Split out Option[Map]

a95360a

Signed-off-by: Karen Feng <karen.feng@databricks.com>

cloud-fan reviewed Apr 26, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

cloud-fan reviewed Apr 26, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

cloud-fan reviewed Apr 26, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

cloud-fan reviewed Apr 26, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

cloud-fan reviewed Apr 26, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

cloud-fan reviewed Apr 26, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

karenfeng added 2 commits April 27, 2021 13:58

Merge from master

6cfd826

Signed-off-by: Karen Feng <karen.feng@databricks.com>

Merge from master

cfa0273

Signed-off-by: Karen Feng <karen.feng@databricks.com>

github-actions bot added AVRO BUILD labels Apr 27, 2021

viirya reviewed May 26, 2021

viirya approved these changes May 26, 2021

Distinct-ify Attributes by exprId

83e2611

Signed-off-by: Karen Feng <karen.feng@databricks.com>

Only add ExtractValue if 1 reference

8a29e94

Signed-off-by: Karen Feng <karen.feng@databricks.com>

cloud-fan closed this in e863166 May 28, 2021

sarutak mentioned this pull request May 28, 2021

[SPARK-35194][SQL][FOLLOWUP] Recover build error with Scala 2.13 on GA #32696

Closed

dongjoon-hyun reviewed Sep 17, 2022

eejbyfeldt mentioned this pull request May 27, 2024

[SPARK-48428][SQL]: Fix IllegalStateException in NestedColumnAliasing #46756

Closed
