-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-34963][SQL] Fix nested column pruning for extracting case-insensitive struct field from array of struct #32059
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #136935 has finished for PR 32059 at commit
|
@@ -48,7 +49,8 @@ object SchemaPruning { | |||
* right, recursively. That is, left is a "subschema" of right, ignoring order of | |||
* fields. | |||
*/ | |||
private def sortLeftFieldsByRight(left: DataType, right: DataType): DataType = | |||
private def sortLeftFieldsByRight(left: DataType, right: DataType): DataType = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we construct mergedDataSchema
in line 39, it seems also case-sensitive, doesn't it?
StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is. As like #32059 (comment), at selectField
we treat GetStructField
and GetArrayStructFields
differently. So it causes different behavior in case-sensitive aware resolution here.
It looks like we should better correct them together..
@@ -41,9 +43,10 @@ case class ProjectionOverSchema(schema: StructType) { | |||
case a: GetArrayStructFields => | |||
getProjection(a.child).map(p => (p, p.dataType)).map { | |||
case (projection, ArrayType(projSchema @ StructType(_), _)) => | |||
val selectedField = projSchema.find(f => resolver(f.name, a.field.name)).get |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that we are not doing this for struct
type. To allow this for array of struct
, maybe it seems that we need this for struct
first at line 66.
GetStructField(projection, projSchema.fieldIndex(field.name))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This issue can occur only in an array of structs? The code at line 66 (@dongjoon-hyun pointed out above) has the same pattern projSchema.fieldIndex(field.name)
, so I'm worried that is can occur in other cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me check it on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh it is fine. ExtractValue
actually does column name resolving correctly. The difference is how ProjectionOverSchema
treats GetArrayStructFields
and GetStructField
there.
That's also said we may not need to do resolving again in ProjectionOverSchema
, as this PR currently do. We can just use GetArrayStructFields.ordinal
which already points to correct field in child expression.
@@ -774,4 +774,17 @@ abstract class SchemaPruningSuite | |||
assert(scanSchema === expectedScanSchema) | |||
} | |||
} | |||
|
|||
testSchemaPruning("extract case-insensitive struct field from array") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to have a test coverage for extract case-insensitive struct field
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is fine (#32059 (comment)), but it is better to add a test for better coverage too. Let me add one.
15bdcd6
to
c335304
Compare
Kubernetes integration test starting |
Kubernetes integration test status failure |
Kubernetes integration test unable to build dist. exiting with code: 1 |
Test build #136983 has finished for PR 32059 at commit
|
Test build #136984 has finished for PR 32059 at commit
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #137093 has finished for PR 32059 at commit
|
Test build #137095 has finished for PR 32059 at commit
|
@dongjoon-hyun @maropu Now this is with more appropriate fix. Added a few more tests. Please take another look. Thank you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine. cc: @dongjoon-hyun
@@ -41,9 +41,14 @@ case class ProjectionOverSchema(schema: StructType) { | |||
case a: GetArrayStructFields => | |||
getProjection(a.child).map(p => (p, p.dataType)).map { | |||
case (projection, ArrayType(projSchema @ StructType(_), _)) => | |||
// For case-sensitivity aware field resolution, we should take `ordinal` which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about leaving your comment ExtractValue actually does column name resolving correctly
here, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I missed this comment. As it is minor, I will add the comment in #31966 for master only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM for Apache Spark 3.2.0.
For me, I believe this can be considered an improvement to give additional support cases.
Thanks @dongjoon-hyun and @maropu. For the case, if it doesn't throw exception but silently read all nested column, it is okay to treat it as an improvement. But it throws an exception so that is why I marked it as a bug in JIRA. |
Feel free to proceed as you want, @viirya . I respect your decision here. |
…nsitive struct field from array of struct ### What changes were proposed in this pull request? This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct. ### Why are the changes needed? Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g., ```scala val query = spark.table("contacts").select("friends.First", "friends.MiDDle") ``` Error stack: ``` [info] java.lang.IllegalArgumentException: Field "First" does not exist. [info] Available fields: [info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274) [info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274) [info] at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) [info] at scala.collection.AbstractMap.getOrElse(Map.scala:59) [info] at org.apache.spark.sql.types.StructType.apply(StructType.scala:273) [info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44) [info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #32059 from viirya/fix-array-nested-pruning. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 364d1ea) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
…nsitive struct field from array of struct ### What changes were proposed in this pull request? This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct. ### Why are the changes needed? Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g., ```scala val query = spark.table("contacts").select("friends.First", "friends.MiDDle") ``` Error stack: ``` [info] java.lang.IllegalArgumentException: Field "First" does not exist. [info] Available fields: [info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274) [info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274) [info] at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) [info] at scala.collection.AbstractMap.getOrElse(Map.scala:59) [info] at org.apache.spark.sql.types.StructType.apply(StructType.scala:273) [info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44) [info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #32059 from viirya/fix-array-nested-pruning. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 364d1ea) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
Thanks @dongjoon-hyun @maropu. Merged to master/3.1/3.0. For 2.4, it has conflict, so I will backport it manually. |
…-insensitive struct field from array of struct ### What changes were proposed in this pull request? This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct. This is the backport of #32059 to branch-2.4. ### Why are the changes needed? Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g., ```scala val query = spark.table("contacts").select("friends.First", "friends.MiDDle") ``` Error stack: ``` [info] java.lang.IllegalArgumentException: Field "First" does not exist. [info] Available fields: [info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274) [info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274) [info] at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) [info] at scala.collection.AbstractMap.getOrElse(Map.scala:59) [info] at org.apache.spark.sql.types.StructType.apply(StructType.scala:273) [info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44) [info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #32112 from viirya/fix-array-nested-pruning-2.4. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
…nsitive struct field from array of struct ### What changes were proposed in this pull request? This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct. ### Why are the changes needed? Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g., ```scala val query = spark.table("contacts").select("friends.First", "friends.MiDDle") ``` Error stack: ``` [info] java.lang.IllegalArgumentException: Field "First" does not exist. [info] Available fields: [info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274) [info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274) [info] at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) [info] at scala.collection.AbstractMap.getOrElse(Map.scala:59) [info] at org.apache.spark.sql.types.StructType.apply(StructType.scala:273) [info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44) [info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes apache#32059 from viirya/fix-array-nested-pruning. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 364d1ea) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
…nsitive struct field from array of struct ### What changes were proposed in this pull request? This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct. ### Why are the changes needed? Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g., ```scala val query = spark.table("contacts").select("friends.First", "friends.MiDDle") ``` Error stack: ``` [info] java.lang.IllegalArgumentException: Field "First" does not exist. [info] Available fields: [info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274) [info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274) [info] at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) [info] at scala.collection.AbstractMap.getOrElse(Map.scala:59) [info] at org.apache.spark.sql.types.StructType.apply(StructType.scala:273) [info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44) [info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes apache#32059 from viirya/fix-array-nested-pruning. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 364d1ea) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
What changes were proposed in this pull request?
This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct.
Why are the changes needed?
Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g.,
Error stack:
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit test