Skip to content

[SPARK-34963][SQL] Fix nested column pruning for extracting case-insensitive struct field from array of struct #32059

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Apr 6, 2021

What changes were proposed in this pull request?

This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct.

Why are the changes needed?

Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g.,

val query = spark.table("contacts").select("friends.First", "friends.MiDDle")

Error stack:

[info]   java.lang.IllegalArgumentException: Field "First" does not exist.                                                                                        
[info] Available fields:                                                                                                                                          
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)                                                                    
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)                            
[info]   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)                                                                                           
[info]   at scala.collection.AbstractMap.getOrElse(Map.scala:59)                                                                                                  
[info]   at org.apache.spark.sql.types.StructType.apply(StructType.scala:273)                                                                                     
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44)                                   
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41) 

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41512/

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41512/

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Test build #136935 has finished for PR 32059 at commit f4db7e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -48,7 +49,8 @@ object SchemaPruning {
* right, recursively. That is, left is a "subschema" of right, ignoring order of
* fields.
*/
private def sortLeftFieldsByRight(left: DataType, right: DataType): DataType =
private def sortLeftFieldsByRight(left: DataType, right: DataType): DataType = {
Copy link
Member

@dongjoon-hyun dongjoon-hyun Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we construct mergedDataSchema in line 39, it seems also case-sensitive, doesn't it?

StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))

Copy link
Member Author

@viirya viirya Apr 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. As like #32059 (comment), at selectField we treat GetStructField and GetArrayStructFields differently. So it causes different behavior in case-sensitive aware resolution here.

It looks like we should better correct them together..

@@ -41,9 +43,10 @@ case class ProjectionOverSchema(schema: StructType) {
case a: GetArrayStructFields =>
getProjection(a.child).map(p => (p, p.dataType)).map {
case (projection, ArrayType(projSchema @ StructType(_), _)) =>
val selectedField = projSchema.find(f => resolver(f.name, a.field.name)).get
Copy link
Member

@dongjoon-hyun dongjoon-hyun Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we are not doing this for struct type. To allow this for array of struct, maybe it seems that we need this for struct first at line 66.

GetStructField(projection, projSchema.fieldIndex(field.name))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue can occur only in an array of structs? The code at line 66 (@dongjoon-hyun pointed out above) has the same pattern projSchema.fieldIndex(field.name), so I'm worried that is can occur in other cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check it on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh it is fine. ExtractValue actually does column name resolving correctly. The difference is how ProjectionOverSchema treats GetArrayStructFields and GetStructField there.

That's also said we may not need to do resolving again in ProjectionOverSchema, as this PR currently do. We can just use GetArrayStructFields.ordinal which already points to correct field in child expression.

@@ -774,4 +774,17 @@ abstract class SchemaPruningSuite
assert(scanSchema === expectedScanSchema)
}
}

testSchemaPruning("extract case-insensitive struct field from array") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to have a test coverage for extract case-insensitive struct field too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine (#32059 (comment)), but it is better to add a test for better coverage too. Let me add one.

@viirya viirya force-pushed the fix-array-nested-pruning branch from 15bdcd6 to c335304 Compare April 7, 2021 03:20
@SparkQA
Copy link

SparkQA commented Apr 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41560/

@SparkQA
Copy link

SparkQA commented Apr 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41560/

@SparkQA
Copy link

SparkQA commented Apr 7, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41562/

@SparkQA
Copy link

SparkQA commented Apr 7, 2021

Test build #136983 has finished for PR 32059 at commit 15bdcd6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 7, 2021

Test build #136984 has finished for PR 32059 at commit c335304.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 8, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41671/

@SparkQA
Copy link

SparkQA commented Apr 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41673/

@SparkQA
Copy link

SparkQA commented Apr 8, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41673/

@SparkQA
Copy link

SparkQA commented Apr 8, 2021

Test build #137093 has finished for PR 32059 at commit ea17366.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 8, 2021

Test build #137095 has finished for PR 32059 at commit 9005055.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 8, 2021

@dongjoon-hyun @maropu Now this is with more appropriate fix. Added a few more tests. Please take another look. Thank you.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine. cc: @dongjoon-hyun

@@ -41,9 +41,14 @@ case class ProjectionOverSchema(schema: StructType) {
case a: GetArrayStructFields =>
getProjection(a.child).map(p => (p, p.dataType)).map {
case (projection, ArrayType(projSchema @ StructType(_), _)) =>
// For case-sensitivity aware field resolution, we should take `ordinal` which
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about leaving your comment ExtractValue actually does column name resolving correctly here, too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I missed this comment. As it is minor, I will add the comment in #31966 for master only.

@dongjoon-hyun
Copy link
Member

Sure, thanks, @viirya and @maropu .

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM for Apache Spark 3.2.0.
For me, I believe this can be considered an improvement to give additional support cases.

@viirya
Copy link
Member Author

viirya commented Apr 9, 2021

+1, LGTM for Apache Spark 3.2.0.
For me, I believe this can be considered an improvement to give additional support cases.

Thanks @dongjoon-hyun and @maropu.

For the case, if it doesn't throw exception but silently read all nested column, it is okay to treat it as an improvement. But it throws an exception so that is why I marked it as a bug in JIRA.

@dongjoon-hyun
Copy link
Member

Feel free to proceed as you want, @viirya . I respect your decision here.

@viirya viirya closed this in 364d1ea Apr 9, 2021
viirya added a commit that referenced this pull request Apr 9, 2021
…nsitive struct field from array of struct

### What changes were proposed in this pull request?

This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct.

### Why are the changes needed?

Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g.,

```scala
val query = spark.table("contacts").select("friends.First", "friends.MiDDle")
```

Error stack:
```
[info]   java.lang.IllegalArgumentException: Field "First" does not exist.
[info] Available fields:
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info]   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
[info]   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
[info]   at org.apache.spark.sql.types.StructType.apply(StructType.scala:273)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #32059 from viirya/fix-array-nested-pruning.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 364d1ea)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya added a commit that referenced this pull request Apr 9, 2021
…nsitive struct field from array of struct

### What changes were proposed in this pull request?

This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct.

### Why are the changes needed?

Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g.,

```scala
val query = spark.table("contacts").select("friends.First", "friends.MiDDle")
```

Error stack:
```
[info]   java.lang.IllegalArgumentException: Field "First" does not exist.
[info] Available fields:
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info]   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
[info]   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
[info]   at org.apache.spark.sql.types.StructType.apply(StructType.scala:273)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #32059 from viirya/fix-array-nested-pruning.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 364d1ea)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
@viirya
Copy link
Member Author

viirya commented Apr 9, 2021

Thanks @dongjoon-hyun @maropu. Merged to master/3.1/3.0. For 2.4, it has conflict, so I will backport it manually.

viirya added a commit that referenced this pull request Apr 10, 2021
…-insensitive struct field from array of struct

### What changes were proposed in this pull request?

This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct.

This is the backport of #32059 to branch-2.4.

### Why are the changes needed?

Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g.,

```scala
val query = spark.table("contacts").select("friends.First", "friends.MiDDle")
```

Error stack:
```
[info]   java.lang.IllegalArgumentException: Field "First" does not exist.
[info] Available fields:
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info]   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
[info]   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
[info]   at org.apache.spark.sql.types.StructType.apply(StructType.scala:273)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #32112 from viirya/fix-array-nested-pruning-2.4.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…nsitive struct field from array of struct

### What changes were proposed in this pull request?

This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct.

### Why are the changes needed?

Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g.,

```scala
val query = spark.table("contacts").select("friends.First", "friends.MiDDle")
```

Error stack:
```
[info]   java.lang.IllegalArgumentException: Field "First" does not exist.
[info] Available fields:
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info]   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
[info]   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
[info]   at org.apache.spark.sql.types.StructType.apply(StructType.scala:273)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes apache#32059 from viirya/fix-array-nested-pruning.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 364d1ea)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…nsitive struct field from array of struct

### What changes were proposed in this pull request?

This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct.

### Why are the changes needed?

Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g.,

```scala
val query = spark.table("contacts").select("friends.First", "friends.MiDDle")
```

Error stack:
```
[info]   java.lang.IllegalArgumentException: Field "First" does not exist.
[info] Available fields:
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info]   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
[info]   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
[info]   at org.apache.spark.sql.types.StructType.apply(StructType.scala:273)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes apache#32059 from viirya/fix-array-nested-pruning.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 364d1ea)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
@viirya viirya deleted the fix-array-nested-pruning branch December 27, 2023 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants