[SPARK-34963][SQL] Fix nested column pruning for extracting case-insensitive struct field from array of struct #32059

viirya · 2021-04-06T06:44:56Z

What changes were proposed in this pull request?

This patch fixes nested column pruning when a struct field is extracted from an array of structs using a name whose case differs from the schema (case-insensitive mode).

Why are the changes needed?

In case-insensitive mode, the nested column pruning rule cannot correctly push down an extractor of a struct field from an array of structs, e.g.,

val query = spark.table("contacts").select("friends.First", "friends.MiDDle")

Error stack:

[info]   java.lang.IllegalArgumentException: Field "First" does not exist.
[info] Available fields:
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info]   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info]   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
[info]   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
[info]   at org.apache.spark.sql.types.StructType.apply(StructType.scala:273)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41)
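To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use Spark: `Field` and `Struct` are toy stand-ins for `StructField` and `StructType`, and the field names `first`, `middle`, `last` are assumed for illustration. It only shows why an exact-match name lookup (the `Map.getOrElse` frame in the stack trace) fails for a name that a case-insensitive resolver would accept.

```scala
object CaseSensitiveLookupSketch {
  // Toy stand-ins for Spark's StructField/StructType; not the real classes.
  final case class Field(name: String)

  final case class Struct(fields: Seq[Field]) {
    private val byName: Map[String, Field] = fields.map(f => f.name -> f).toMap

    // Exact-match lookup by name, analogous to the failing lookup in the stack trace.
    def apply(name: String): Field =
      byName.getOrElse(
        name,
        throw new IllegalArgumentException(s"""Field "$name" does not exist."""))

    // Lookup through a resolver function, as case-insensitive analysis would do.
    def find(name: String, resolver: (String, String) => Boolean): Option[Field] =
      fields.find(f => resolver(f.name, name))
  }

  // Hypothetical case-insensitive resolver.
  val caseInsensitive: (String, String) => Boolean = (a, b) => a.equalsIgnoreCase(b)

  // Assumed element struct of the `friends` array.
  val friend: Struct = Struct(Seq(Field("first"), Field("middle"), Field("last")))

  def main(args: Array[String]): Unit = {
    // The resolver finds the field even though the case differs.
    println(friend.find("First", caseInsensitive))
    // The exact-match lookup throws for the same name.
    println(scala.util.Try(friend("First")))
  }
}
```

In this toy model the name as written in the query resolves during analysis, but any later code that looks the same name up by exact match throws.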

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

SparkQA · 2021-04-06T07:54:39Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41512/

SparkQA · 2021-04-06T07:54:40Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41512/

SparkQA · 2021-04-06T11:36:14Z

Test build #136935 has finished for PR 32059 at commit f4db7e9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2021-04-06T23:10:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala

@@ -48,7 +49,8 @@ object SchemaPruning {
   * right, recursively. That is, left is a "subschema" of right, ignoring order of
   * fields.
   */
-  private def sortLeftFieldsByRight(left: DataType, right: DataType): DataType =
+  private def sortLeftFieldsByRight(left: DataType, right: DataType): DataType = {


When we construct mergedDataSchema at line 39, that also seems to be case-sensitive, doesn't it?

StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))

It is. As noted in #32059 (comment), selectField treats GetStructField and GetArrayStructFields differently, which causes different behavior in case-sensitivity-aware resolution here.

It looks like we had better fix them together.
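The concern about the quoted filter can be illustrated with a small sketch. It is not Spark code and not the change made in this PR: `Field`, the field names, and the `resolver` function are hypothetical. It contrasts exact string membership, as in the quoted line, with a resolver-aware check.

```scala
object MergedSchemaFilterSketch {
  // Toy stand-in for a schema field; not Spark's StructField.
  final case class Field(name: String)

  // Names as they appear in the data schema (hypothetical).
  val dataSchemaFieldNames: Set[String] = Set("id", "friends")

  // Requested root fields, one of them spelled with different case (hypothetical).
  val mergedSchema: Seq[Field] = Seq(Field("ID"), Field("friends"))

  // Same shape as the quoted line: exact, case-sensitive membership test.
  val exact: Seq[Field] =
    mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))

  // Hypothetical case-insensitive resolver.
  val resolver: (String, String) => Boolean = (a, b) => a.equalsIgnoreCase(b)

  // Resolver-aware alternative: keep a field if any data schema name resolves to it.
  val resolved: Seq[Field] =
    mergedSchema.filter(f => dataSchemaFieldNames.exists(resolver(_, f.name)))

  def main(args: Array[String]): Unit = {
    println(exact)
    println(resolved)
  }
}
```

With exact membership the differently cased field is dropped; with the resolver both fields are kept. Whether differently cased names can actually reach this filter in Spark is not established by this sketch.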

dongjoon-hyun · 2021-04-06T23:33:37Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ProjectionOverSchema.scala

@@ -41,9 +43,10 @@ case class ProjectionOverSchema(schema: StructType) {
      case a: GetArrayStructFields =>
        getProjection(a.child).map(p => (p, p.dataType)).map {
          case (projection, ArrayType(projSchema @ StructType(_), _)) =>
+            val selectedField = projSchema.find(f => resolver(f.name, a.field.name)).get


It seems that we are not doing this for the struct type. To allow this for an array of structs, we may first need to do the same for structs at line 66.

GetStructField(projection, projSchema.fieldIndex(field.name))

Can this issue occur only in an array of structs? The code at line 66 (which @dongjoon-hyun pointed out above) has the same pattern, projSchema.fieldIndex(field.name), so I'm worried that it can occur in other cases.

Let me check.

Oh, it is fine. ExtractValue actually resolves column names correctly. The difference is how ProjectionOverSchema treats GetArrayStructFields and GetStructField there.

That also means we may not need to resolve again in ProjectionOverSchema, as this PR currently does. We can just use GetArrayStructFields.ordinal, which already points to the correct field in the child expression.
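One way to read the ordinal-based suggestion, as a toy sketch rather than the code in this PR: the extractor already carries an ordinal that analysis resolved against the unpruned child schema, so the schema-cased field name can be recovered from that ordinal and then looked up by exact match in the pruned schema. `Field`, `Struct`, `Extractor`, and all field names below are hypothetical stand-ins, not Spark classes.

```scala
object OrdinalLookupSketch {
  // Toy stand-ins; not Spark's StructField/StructType.
  final case class Field(name: String)

  final case class Struct(fields: Seq[Field]) {
    // Exact-match index lookup, analogous in spirit to a case-sensitive fieldIndex.
    def fieldIndex(name: String): Int = {
      val i = fields.indexWhere(_.name == name)
      if (i < 0) throw new IllegalArgumentException(s"""Field "$name" does not exist.""")
      i
    }
  }

  // Toy extractor: the name as written in the query, plus the ordinal that
  // analysis already resolved against the unpruned child schema.
  final case class Extractor(writtenName: String, ordinal: Int)

  // Unpruned element struct and a pruned version that keeps only `middle`.
  val original: Struct = Struct(Seq(Field("first"), Field("middle"), Field("last")))
  val pruned: Struct = Struct(Seq(Field("middle")))

  // Map the extractor onto the pruned schema without re-resolving the written name:
  // take the schema-cased name from the original schema via the ordinal,
  // then find its position in the pruned schema.
  def ordinalInPruned(e: Extractor): Int = {
    val schemaName = original.fields(e.ordinal).name
    pruned.fieldIndex(schemaName)
  }

  val middle: Extractor = Extractor("MiDDle", 1)

  def main(args: Array[String]): Unit = {
    println(ordinalInPruned(middle))
    println(scala.util.Try(pruned.fieldIndex(middle.writtenName)))
  }
}
```

Looking up the written name `MiDDle` directly in the pruned schema fails in this model, while going through the ordinal succeeds, because the ordinal was resolved once with the correct case-sensitivity setting.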

dongjoon-hyun · 2021-04-06T23:35:27Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

@@ -774,4 +774,17 @@ abstract class SchemaPruningSuite
        assert(scanSchema === expectedScanSchema)
    }
  }
+
+  testSchemaPruning("extract case-insensitive struct field from array") {


Do we need test coverage for extracting a case-insensitive struct field too?

That case is fine (#32059 (comment)), but a test would improve coverage. Let me add one.

SparkQA · 2021-04-07T04:22:52Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41560/

SparkQA · 2021-04-07T04:22:53Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41560/

SparkQA · 2021-04-07T04:44:06Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41562/

SparkQA · 2021-04-07T06:54:29Z

Test build #136983 has finished for PR 32059 at commit 15bdcd6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-07T08:23:23Z

Test build #136984 has finished for PR 32059 at commit c335304.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-08T18:29:03Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41671/

SparkQA · 2021-04-08T19:21:24Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41673/

SparkQA · 2021-04-08T19:21:25Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41673/

SparkQA · 2021-04-08T22:29:46Z

Test build #137093 has finished for PR 32059 at commit ea17366.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-08T23:21:01Z

Test build #137095 has finished for PR 32059 at commit 9005055.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-04-08T23:28:41Z

@dongjoon-hyun @maropu This now has a more appropriate fix, and I added a few more tests. Please take another look. Thank you.

maropu

Looks fine. cc: @dongjoon-hyun

maropu · 2021-04-09T14:01:53Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ProjectionOverSchema.scala

@@ -41,9 +41,14 @@ case class ProjectionOverSchema(schema: StructType) {
      case a: GetArrayStructFields =>
        getProjection(a.child).map(p => (p, p.dataType)).map {
          case (projection, ArrayType(projSchema @ StructType(_), _)) =>
+            // For case-sensitivity aware field resolution, we should take `ordinal` which


How about also leaving your comment "ExtractValue actually does column name resolving correctly" here?

Ah, I missed this comment. As it is minor, I will add the comment in #31966 for master only.

dongjoon-hyun · 2021-04-09T15:14:28Z

Sure, thanks, @viirya and @maropu .

dongjoon-hyun

+1, LGTM for Apache Spark 3.2.0.
For me, I believe this can be considered an improvement to give additional support cases.

viirya · 2021-04-09T16:30:15Z

+1, LGTM for Apache Spark 3.2.0.
For me, I believe this can be considered an improvement to give additional support cases.

Thanks @dongjoon-hyun and @maropu.

If this case did not throw an exception but silently read all nested columns, it would be fine to treat it as an improvement. But it throws an exception, which is why I marked it as a bug in JIRA.

dongjoon-hyun · 2021-04-09T18:41:23Z

Feel free to proceed as you want, @viirya . I respect your decision here.

…nsitive struct field from array of struct

### What changes were proposed in this pull request?

This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct.

### Why are the changes needed?

Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g.,

```scala
val query = spark.table("contacts").select("friends.First", "friends.MiDDle")
```

Error stack:

```
[info] java.lang.IllegalArgumentException: Field "First" does not exist.
[info] Available fields:
[info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info] at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
[info] at scala.collection.AbstractMap.getOrElse(Map.scala:59)
[info] at org.apache.spark.sql.types.StructType.apply(StructType.scala:273)
[info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44)
[info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #32059 from viirya/fix-array-nested-pruning.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 364d1ea)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

viirya · 2021-04-09T19:08:47Z

Thanks @dongjoon-hyun @maropu. Merged to master/3.1/3.0. For 2.4, it has conflict, so I will backport it manually.

…-insensitive struct field from array of struct

### What changes were proposed in this pull request?

This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct. This is the backport of #32059 to branch-2.4.

### Why are the changes needed?

Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g.,

```scala
val query = spark.table("contacts").select("friends.First", "friends.MiDDle")
```

Error stack:

```
[info] java.lang.IllegalArgumentException: Field "First" does not exist.
[info] Available fields:
[info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
[info] at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
[info] at scala.collection.AbstractMap.getOrElse(Map.scala:59)
[info] at org.apache.spark.sql.types.StructType.apply(StructType.scala:273)
[info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44)
[info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #32112 from viirya/fix-array-nested-pruning-2.4.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

Fix nested pruning on array of struct.

f4db7e9

github-actions bot added the SQL label Apr 6, 2021

viirya mentioned this pull request Apr 6, 2021

[SPARK-34638][SQL] Single field nested column prune on generator output #31966

Closed

dongjoon-hyun reviewed Apr 6, 2021

Use correctly resolved ordinal from GetArrayStructFields.

c335304

viirya force-pushed the fix-array-nested-pruning branch from 15bdcd6 to c335304 Compare April 7, 2021 03:20

viirya added 3 commits April 8, 2021 10:15

Fix.

ea17366

Add more test cases.

97a4784

Remove unnecessary change.

9005055

maropu approved these changes Apr 9, 2021

dongjoon-hyun approved these changes Apr 9, 2021

viirya closed this in 364d1ea Apr 9, 2021

viirya mentioned this pull request Apr 9, 2021

[SPARK-34963][SQL][2.4] Fix nested column pruning for extracting case-insensitive struct field from array of struct #32112

Closed

viirya deleted the fix-array-nested-pruning branch December 27, 2023 18:27
