[SPARK-37855][SQL] IllegalStateException when transforming an array inside a nested struct #35170

Closed
wants to merge 1 commit into from

Conversation

ulysses-you
Contributor

@ulysses-you ulysses-you commented Jan 11, 2022

What changes were proposed in this pull request?

Skip aliasing any ExtractValue whose children contain a NamedLambdaVariable.

Why are the changes needed?

Since #32773, NamedLambdaVariable produces references. As a result, the NestedColumnAliasing rule tries to alias an ExtractValue that contains a NamedLambdaVariable, which fails because a NamedLambdaVariable cannot be matched to an actual attribute.

In more detail:
During NestedColumnAliasing#replaceWithAliases, the references of each nested field are matched against the output attributes of the grandchildren. However, a NamedLambdaVariable is created by the analyzer as a virtual attribute and is not resolved from the output of the children. So matching the references of a NamedLambdaVariable against the grandchildren's output yields no attribute.
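The mismatch described above can be sketched with a small toy model. This is not Spark's code: the classes `Attribute`, `LambdaVariable`, and `GetField` and the helper functions are hypothetical stand-ins, written in Python only to illustrate why a reference to a lambda variable never matches a child's output and why skipping such expressions avoids the failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """Toy stand-in for an attribute that appears in a plan's output."""
    name: str
    expr_id: int


@dataclass(frozen=True)
class LambdaVariable:
    """Toy stand-in for a lambda variable: it has an id but belongs to no plan output."""
    name: str
    expr_id: int


@dataclass(frozen=True)
class GetField:
    """Toy stand-in for a nested field access such as child.field."""
    child: object
    field: str


def references(expr):
    """Ids referenced by an expression; lambda variables are included (assumed model of the post-#32773 behaviour)."""
    if isinstance(expr, (Attribute, LambdaVariable)):
        return {expr.expr_id}
    return references(expr.child)


def contains_lambda_variable(expr):
    """True if a lambda variable occurs anywhere inside the expression."""
    if isinstance(expr, LambdaVariable):
        return True
    if isinstance(expr, GetField):
        return contains_lambda_variable(expr.child)
    return False


def matching_output(expr, child_output):
    """Output attributes of the child that the expression's references resolve to."""
    refs = references(expr)
    return [a for a in child_output if a.expr_id in refs]


def aliasing_candidates(exprs):
    """Sketch of the fix: never consider expressions that contain a lambda variable."""
    return [e for e in exprs if not contains_lambda_variable(e)]


struct_col = Attribute("struct_col", 1)
x = LambdaVariable("x", 2)
child_output = [struct_col]

outer = GetField(struct_col, "arr")  # struct_col.arr: resolves to a real attribute
inner = GetField(x, "b")             # x.b inside a lambda body: resolves to nothing

resolved_outer = matching_output(outer, child_output)  # [struct_col]
resolved_inner = matching_output(inner, child_output)  # []
candidates = aliasing_candidates([outer, inner])       # [outer]
```

In this model, `inner` has a non-empty reference set but no counterpart in `child_output`, which is the situation the description attributes to the failure. Filtering with `contains_lambda_variable` leaves only `outer` as a candidate.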

Does this PR introduce any user-facing change?

Yes, it is a bug fix.

How was this patch tested?

Added a new test.

@github-actions github-actions bot added the SQL label Jan 11, 2022
@HyukjinKwon
Member

cc @viirya FYI

Member

@viirya viirya left a comment

Looks reasonable. Could you also mention when it fails to match the attribute in the description? Thanks.

@ulysses-you
Contributor Author

@viirya I have updated the description; I hope it is clear now.

@viirya
Member

viirya commented Jan 12, 2022

Thanks. As #32773 was also merged to 3.1, is this an issue on branch-3.1 too? @ulysses-you

@ulysses-you
Contributor Author

I think landing this in branch-3.2 is enough, since the backport to branch-3.1 was reverted.
See https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

@viirya
Member

viirya commented Jan 12, 2022

Okay, thanks! Merging to master.

@viirya viirya closed this in 189b205 Jan 12, 2022
@viirya
Member

viirya commented Jan 12, 2022

Oh, there is a conflict. @ulysses-you Can you submit a backport PR to branch-3.2? Thanks.

@ulysses-you
Contributor Author

Thank you @viirya, I created #35175.

@ulysses-you ulysses-you deleted the SPARK-37855 branch January 12, 2022 05:30
viirya pushed a commit that referenced this pull request Jan 12, 2022
…ray inside a nested struct

This is a backport of #35170 for branch-3.2.

### What changes were proposed in this pull request?

Skip aliasing any `ExtractValue` whose children contain a `NamedLambdaVariable`.

### Why are the changes needed?

Since #32773, `NamedLambdaVariable` produces references. As a result, the `NestedColumnAliasing` rule tries to alias an `ExtractValue` that contains a `NamedLambdaVariable`, which fails because a `NamedLambdaVariable` cannot be matched to an actual attribute.

In more detail:
During `NestedColumnAliasing#replaceWithAliases`, the references of each nested field are matched against the output attributes of the grandchildren. However, a `NamedLambdaVariable` is created by the analyzer as a virtual attribute and is not resolved from the output of the children. So matching the references of a `NamedLambdaVariable` against the grandchildren's output yields no attribute.

### Does this PR introduce _any_ user-facing change?

Yes, it is a bug fix.

### How was this patch tested?

Added a new test.

Closes #35175 from ulysses-you/SPARK-37855-branch-3.2.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
dchvn pushed a commit to dchvn/spark that referenced this pull request Jan 19, 2022
…nside a nested struct

Closes apache#35170 from ulysses-you/SPARK-37855.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
catalinii pushed a commit to lyft/spark that referenced this pull request Feb 22, 2022
…ray inside a nested struct

catalinii pushed a commit to lyft/spark that referenced this pull request Mar 4, 2022
…ray inside a nested struct

kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
…ray inside a nested struct

(cherry picked from commit a58b8a8)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
eejbyfeldt pushed a commit to eejbyfeldt/spark that referenced this pull request May 27, 2024
apache#35170 (SPARK-37855) and apache#32301 (SPARK-35194) introduced
conditions for ExtractValues that currently cannot be handled. The
conditions are applied after `collectRootReferenceAndExtractValue` and
simply remove these candidates. This is problematic because the removed
expressions might have contained an `AttributeReference` that must be
taken into account to avoid an incorrect rewrite. This fixes this family
of bugs by moving the conditions into the function
`collectRootReferenceAndExtractValue`.
eejbyfeldt pushed a commit to eejbyfeldt/spark that referenced this pull request Jun 24, 2024
cloud-fan pushed a commit that referenced this pull request Jun 27, 2024
### What changes were proposed in this pull request?

#35170 (SPARK-37855) and #32301 (SPARK-35194) introduced conditions for ExtractValues that currently cannot be handled. The conditions are applied after `collectRootReferenceAndExtractValue` and simply remove these candidates. This is problematic because the removed expressions might have contained an `AttributeReference` that must be taken into account to avoid an incorrect aliasing. This fixes this family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`.
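The difference between filtering after collection and filtering inside the collector can be sketched with a toy model. This is a deliberately simplified illustration in Python, not the Spark implementation: `Attr`, `LambdaVar`, `GetField`, `GetItem`, and both collector functions are hypothetical names, and the real `collectRootReferenceAndExtractValue` matches more specific expression shapes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    """Toy attribute reference."""
    name: str


@dataclass(frozen=True)
class LambdaVar:
    """Toy lambda variable."""
    name: str


@dataclass(frozen=True)
class GetField:
    """Toy struct field access: child.field."""
    child: object
    field: str


@dataclass(frozen=True)
class GetItem:
    """Toy array element access: child[index]."""
    child: object
    index: object


def children(e):
    if isinstance(e, GetField):
        return [e.child]
    if isinstance(e, GetItem):
        return [e.child, e.index]
    return []


def has_lambda(e):
    return isinstance(e, LambdaVar) or any(has_lambda(c) for c in children(e))


def is_extract(e):
    return isinstance(e, (GetField, GetItem))


def collect_then_filter(e):
    """Assumed old shape: collect candidates first, then drop unsupported ones."""
    def collect(x):
        if isinstance(x, Attr) or is_extract(x):
            return [x]
        return [c for ch in children(x) for c in collect(ch)]
    return [c for c in collect(e) if not has_lambda(c)]


def collect_with_condition(e):
    """Assumed new shape: an unsupported expression is not a candidate, so recurse into it."""
    if isinstance(e, Attr) or (is_extract(e) and not has_lambda(e)):
        return [e]
    return [c for ch in children(e) for c in collect_with_condition(ch)]


# s.arr[i].b, where i is a lambda variable and s is a real attribute.
expr = GetField(GetItem(GetField(Attr("s"), "arr"), LambdaVar("i")), "b")

old_result = collect_then_filter(expr)     # []: the use of s is lost entirely
new_result = collect_with_condition(expr)  # [s.arr]: the use of s is still visible
```

In the first variant the whole expression is collected as one candidate and then discarded, so nothing records that `s` is still needed. In the second variant the collector descends into the unsupported expression and still reports `s.arr`.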

### Why are the changes needed?

The current code leads to `IllegalStateException` runtime failures.

### Does this PR introduce _any_ user-facing change?

Yes, fixes a bug.

### How was this patch tested?

Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46756 from eejbyfeldt/SPARK-48428.

Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jun 27, 2024
(cherry picked from commit b11608c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024