Skip to content

[SPARK-35194][SQL] Refactor nested column aliasing for readability #32301

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

karenfeng
Copy link
Contributor

What changes were proposed in this pull request?

Refactors NestedColumnAliasing and GeneratorNestedColumnAliasing for readability.

Why are the changes needed?

Improves readability for future maintenance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
@github-actions github-actions bot added the SQL label Apr 22, 2021
@karenfeng
Copy link
Contributor Author

karenfeng commented Apr 22, 2021

@viirya, this will either block/be blocked by your ongoing PR. Let me know what you think - I think this will improve the readability.

@viirya
Copy link
Member

viirya commented Apr 22, 2021

Thanks @karenfeng. I will take a look on this.

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42348/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42348/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Test build #137818 has finished for PR 32301 at commit 9656899.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@allisonwang-db allisonwang-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a few comments. Please also add [SQL] to the PR title :)

getNewProjectList(projectList, nestedFieldToAlias),
replaceWithAliases(child, nestedFieldToAlias, attrToAliases))
def replacePlanWithAliases(
plan: LogicalPlan,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 4 space indentation

@viirya viirya changed the title [SPARK-35194] Refactor nested column aliasing for readability [SPARK-35194][SQL] Refactor nested column aliasing for readability Apr 23, 2021
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42401/

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42401/

@SparkQA
Copy link

SparkQA commented Apr 24, 2021

Test build #137871 has finished for PR 32301 at commit a95360a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case _ => false
}
private def collectExtractValue(e: Expression): Seq[ExtractValue] = e match {
case g if isSelectedField(g) => Seq(g.asInstanceOf[ExtractValue])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: case e: ExtractValue if isSelectedField(e) => Seq(e)

// Note that when we group by extractors with their references, we should remove
// cosmetic variations.
val nestedFieldReferences = exprList.flatMap(collectExtractValue)
val otherRootReferences = exprList.flatMap(collectAttributeReference)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we collected both the nested fields extraction and other root references at the same time, and split them later. Now we collect them separately. I think the current code is clearer but is less performant.

How about we use mutable collections to implement this logic with one tree traversal?

val nestedFieldReferences = mutable.ArrayBuffer[ExtractValue]
val otherRootReferences = mutable.ArrayBuffer[AttributeReference]
exprList.foreach(collectRootReferenceAndExtractValue(e, nestedFieldReferences, otherRootReferences))

case _: AttributeReference => Seq(e)
case GetStructField(_: ExtractValue | _: AttributeReference, _, _) => Seq(e)
private def isSelectedField(e: Expression): Boolean = e match {
case GetStructField(_: ExtractValue | _: AttributeReference, _, _) => true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR: I don't get the reason to match ExtractValue. For GetStructField(GetArrayItem(Attribute, index), fieldName), how can the data source support it?

val nestedFieldToAlias = attributeToExtractValues.flatMap { case (_, nestedFields) =>
nestedFields.map { f =>
val exprId = NamedExpression.newExprId
f -> Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the alias name matter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not particularly. I discussed this with @allisonwang-db offline, and we think it may be more useful for this name to reflect the struct and field. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
@cloud-fan
Copy link
Contributor

@viirya any more comments?

@viirya
Copy link
Member

viirya commented May 26, 2021

Thanks @cloud-fan @karenfeng. I will check this again today.

exprList.foreach { e =>
collectRootReferenceAndExtractValue(e).foreach {
case ev: ExtractValue =>
assert(ev.references.size == 1, s"$ev should have one reference")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`s"$ev should have one reference, but got: ${ev.references}"?

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Signed-off-by: Karen Feng <karen.feng@databricks.com>
@SparkQA
Copy link

SparkQA commented May 27, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43516/

@SparkQA
Copy link

SparkQA commented May 27, 2021

Test build #138997 has finished for PR 32301 at commit 83e2611.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

seems a legit error in org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.udf_struct

@viirya
Copy link
Member

viirya commented May 27, 2021

java.lang.AssertionError: assertion failed: struct(col1, 1, col2, struct(col1, a, col2, 1.5)).col2.col1 should have one reference, but found {}

Oh, it doesn't have more than one references, but has no reference...

@karenfeng
Copy link
Contributor Author

@viirya - in the case that the number of references is !=1, should we exclude the ExtractValue from nestedFieldReferences?

@viirya
Copy link
Member

viirya commented May 27, 2021

@viirya - in the case that the number of references is !=1, should we exclude the ExtractValue from nestedFieldReferences?

I think so, it should be safer approach to exclude them from pruning candidates.

Signed-off-by: Karen Feng <karen.feng@databricks.com>
@SparkQA
Copy link

SparkQA commented May 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43553/

@SparkQA
Copy link

SparkQA commented May 27, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43553/

@SparkQA
Copy link

SparkQA commented May 28, 2021

Test build #139035 has finished for PR 32301 at commit 8a29e94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e863166 May 28, 2021
@sarutak
Copy link
Member

sarutak commented May 28, 2021

This change seems to break the build with Scala 2.13 on GA.
https://github.com/apache/spark/runs/2694564384
I'll open a PR to fix it.

sarutak added a commit that referenced this pull request May 28, 2021
### What changes were proposed in this pull request?

This PR fixes a build error with Scala 2.13 on GA.
#32301 seems to bring this error.

### Why are the changes needed?

To recover CI.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA

Closes #32696 from sarutak/followup-SPARK-35194.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
huaxingao pushed a commit to huaxingao/spark that referenced this pull request Jun 9, 2021
### What changes were proposed in this pull request?

Refactors `NestedColumnAliasing` and `GeneratorNestedColumnAliasing` for readability.

### Why are the changes needed?

Improves readability for future maintenance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes apache#32301 from karenfeng/refactor-nested-column-aliasing.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
huaxingao pushed a commit to huaxingao/spark that referenced this pull request Jun 9, 2021
### What changes were proposed in this pull request?

This PR fixes a build error with Scala 2.13 on GA.
apache#32301 seems to bring this error.

### Why are the changes needed?

To recover CI.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA

Closes apache#32696 from sarutak/followup-SPARK-35194.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, All. While investing SPARK-39854, unfortunately, it turns out this makes a regression reported by SPARK-39854 since Apache Spark 3.2.0.

I tested at this commit and the parent commit and verified that SPARK-39854 happens after this commit.

eejbyfeldt pushed a commit to eejbyfeldt/spark that referenced this pull request May 27, 2024
In apache#35170 SPARK-37855 and apache#32301 SPARK-35194 introduced conditions for
ExtractValues that can currently not be handled. The considtion is
introduced after `collectRootReferenceAndExtractValue` and just removes
these candidates. This is problematic since these expressions might have
contained `AttributeReference` that needed to not do an incorrect
rewrite. This fixes these family of bugs by moving the conditions into
the function `collectRootReferenceAndExtractValue`.
eejbyfeldt pushed a commit to eejbyfeldt/spark that referenced this pull request Jun 24, 2024
In apache#35170 SPARK-37855 and apache#32301 SPARK-35194 introduced conditions for
ExtractValues that can currently not be handled. The considtion is
introduced after `collectRootReferenceAndExtractValue` and just removes
these candidates. This is problematic since these expressions might have
contained `AttributeReference` that needed to not do an incorrect
rewrite. This fixes these family of bugs by moving the conditions into
the function `collectRootReferenceAndExtractValue`.
cloud-fan pushed a commit that referenced this pull request Jun 27, 2024
### What changes were proposed in this pull request?

In #35170 SPARK-37855 and #32301 SPARK-35194 introduced conditions for ExtractValues that can currently not be handled. The considtion is introduced after `collectRootReferenceAndExtractValue` and just removes these candidates. This is problematic since these expressions might have contained `AttributeReference` that needed to not do an incorrect aliasing. This fixes this family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`.

### Why are the changes needed?

The current code leads to `IllegalStateException` runtime failures.

### Does this PR introduce _any_ user-facing change?

Yes, fixes a bug.

### How was this patch tested?

Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46756 from eejbyfeldt/SPARK-48428.

Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jun 27, 2024
### What changes were proposed in this pull request?

In #35170 SPARK-37855 and #32301 SPARK-35194 introduced conditions for ExtractValues that can currently not be handled. The considtion is introduced after `collectRootReferenceAndExtractValue` and just removes these candidates. This is problematic since these expressions might have contained `AttributeReference` that needed to not do an incorrect aliasing. This fixes this family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`.

### Why are the changes needed?

The current code leads to `IllegalStateException` runtime failures.

### Does this PR introduce _any_ user-facing change?

Yes, fixes a bug.

### How was this patch tested?

Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46756 from eejbyfeldt/SPARK-48428.

Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b11608c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?

In apache#35170 SPARK-37855 and apache#32301 SPARK-35194 introduced conditions for ExtractValues that can currently not be handled. The considtion is introduced after `collectRootReferenceAndExtractValue` and just removes these candidates. This is problematic since these expressions might have contained `AttributeReference` that needed to not do an incorrect aliasing. This fixes this family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`.

### Why are the changes needed?

The current code leads to `IllegalStateException` runtime failures.

### Does this PR introduce _any_ user-facing change?

Yes, fixes a bug.

### How was this patch tested?

Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46756 from eejbyfeldt/SPARK-48428.

Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants