-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-35194][SQL] Refactor nested column aliasing for readability #32301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-35194][SQL] Refactor nested column aliasing for readability #32301
Conversation
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
@viirya, this will either block/be blocked by your ongoing PR. Let me know what you think - I think this will improve the readability. |
Thanks @karenfeng. I will take a look on this. |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #137818 has finished for PR 32301 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a few comments. Please also add [SQL] to the PR title :)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
Outdated
Show resolved
Hide resolved
getNewProjectList(projectList, nestedFieldToAlias), | ||
replaceWithAliases(child, nestedFieldToAlias, attrToAliases)) | ||
def replacePlanWithAliases( | ||
plan: LogicalPlan, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 4 space indentation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
Show resolved
Hide resolved
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #137871 has finished for PR 32301 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Outdated
Show resolved
Hide resolved
case _ => false | ||
} | ||
private def collectExtractValue(e: Expression): Seq[ExtractValue] = e match { | ||
case g if isSelectedField(g) => Seq(g.asInstanceOf[ExtractValue]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: case e: ExtractValue if isSelectedField(e) => Seq(e)
// Note that when we group by extractors with their references, we should remove | ||
// cosmetic variations. | ||
val nestedFieldReferences = exprList.flatMap(collectExtractValue) | ||
val otherRootReferences = exprList.flatMap(collectAttributeReference) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously we collected both the nested fields extraction and other root references at the same time, and split them later. Now we collect them separately. I think the current code is clearer but is less performant.
How about we use mutable collections to implement this logic with one tree traversal?
val nestedFieldReferences = mutable.ArrayBuffer[ExtractValue]
val otherRootReferences = mutable.ArrayBuffer[AttributeReference]
exprList.foreach(collectRootReferenceAndExtractValue(e, nestedFieldReferences, otherRootReferences))
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
Show resolved
Hide resolved
case _: AttributeReference => Seq(e) | ||
case GetStructField(_: ExtractValue | _: AttributeReference, _, _) => Seq(e) | ||
private def isSelectedField(e: Expression): Boolean = e match { | ||
case GetStructField(_: ExtractValue | _: AttributeReference, _, _) => true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this PR: I don't get the reason to match ExtractValue
. For GetStructField(GetArrayItem(Attribute, index), fieldName)
, how can the data source support it?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
Show resolved
Hide resolved
val nestedFieldToAlias = attributeToExtractValues.flatMap { case (_, nestedFields) => | ||
nestedFields.map { f => | ||
val exprId = NamedExpression.newExprId | ||
f -> Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the alias name matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not particularly. I discussed this with @allisonwang-db offline, and we think it may be more useful for this name to reflect the struct and field. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
@viirya any more comments? |
Thanks @cloud-fan @karenfeng. I will check this again today. |
exprList.foreach { e => | ||
collectRootReferenceAndExtractValue(e).foreach { | ||
case ev: ExtractValue => | ||
assert(ev.references.size == 1, s"$ev should have one reference") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`s"$ev should have one reference, but got: ${ev.references}"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Kubernetes integration test unable to build dist. exiting with code: 1 |
Test build #138997 has finished for PR 32301 at commit
|
seems a legit error in |
Oh, it doesn't have more than one references, but has no reference... |
@viirya - in the case that the number of references is |
I think so, it should be safer approach to exclude them from pruning candidates. |
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #139035 has finished for PR 32301 at commit
|
thanks, merging to master! |
This change seems to break the build with Scala 2.13 on GA. |
### What changes were proposed in this pull request? This PR fixes a build error with Scala 2.13 on GA. #32301 seems to bring this error. ### Why are the changes needed? To recover CI. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA Closes #32696 from sarutak/followup-SPARK-35194. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request? Refactors `NestedColumnAliasing` and `GeneratorNestedColumnAliasing` for readability. ### Why are the changes needed? Improves readability for future maintenance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes apache#32301 from karenfeng/refactor-nested-column-aliasing. Authored-by: Karen Feng <karen.feng@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? This PR fixes a build error with Scala 2.13 on GA. apache#32301 seems to bring this error. ### Why are the changes needed? To recover CI. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA Closes apache#32696 from sarutak/followup-SPARK-35194. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, All. While investing SPARK-39854, unfortunately, it turns out this makes a regression reported by SPARK-39854 since Apache Spark 3.2.0.
I tested at this commit and the parent commit and verified that SPARK-39854 happens after this commit.
In apache#35170 SPARK-37855 and apache#32301 SPARK-35194 introduced conditions for ExtractValues that can currently not be handled. The considtion is introduced after `collectRootReferenceAndExtractValue` and just removes these candidates. This is problematic since these expressions might have contained `AttributeReference` that needed to not do an incorrect rewrite. This fixes these family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`.
In apache#35170 SPARK-37855 and apache#32301 SPARK-35194 introduced conditions for ExtractValues that can currently not be handled. The considtion is introduced after `collectRootReferenceAndExtractValue` and just removes these candidates. This is problematic since these expressions might have contained `AttributeReference` that needed to not do an incorrect rewrite. This fixes these family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`.
### What changes were proposed in this pull request? In #35170 SPARK-37855 and #32301 SPARK-35194 introduced conditions for ExtractValues that can currently not be handled. The considtion is introduced after `collectRootReferenceAndExtractValue` and just removes these candidates. This is problematic since these expressions might have contained `AttributeReference` that needed to not do an incorrect aliasing. This fixes this family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`. ### Why are the changes needed? The current code leads to `IllegalStateException` runtime failures. ### Does this PR introduce _any_ user-facing change? Yes, fixes a bug. ### How was this patch tested? Existing and new unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46756 from eejbyfeldt/SPARK-48428. Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? In #35170 SPARK-37855 and #32301 SPARK-35194 introduced conditions for ExtractValues that can currently not be handled. The considtion is introduced after `collectRootReferenceAndExtractValue` and just removes these candidates. This is problematic since these expressions might have contained `AttributeReference` that needed to not do an incorrect aliasing. This fixes this family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`. ### Why are the changes needed? The current code leads to `IllegalStateException` runtime failures. ### Does this PR introduce _any_ user-facing change? Yes, fixes a bug. ### How was this patch tested? Existing and new unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46756 from eejbyfeldt/SPARK-48428. Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b11608c) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? In apache#35170 SPARK-37855 and apache#32301 SPARK-35194 introduced conditions for ExtractValues that can currently not be handled. The considtion is introduced after `collectRootReferenceAndExtractValue` and just removes these candidates. This is problematic since these expressions might have contained `AttributeReference` that needed to not do an incorrect aliasing. This fixes this family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`. ### Why are the changes needed? The current code leads to `IllegalStateException` runtime failures. ### Does this PR introduce _any_ user-facing change? Yes, fixes a bug. ### How was this patch tested? Existing and new unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#46756 from eejbyfeldt/SPARK-48428. Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Refactors
NestedColumnAliasing
andGeneratorNestedColumnAliasing
for readability.Why are the changes needed?
Improves readability for future maintenance.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests.