[SPARK-31116][SQL] Fix nested schema case-sensitivity in ParquetRowConverter #27888
Conversation
* As StructType only accepts a case-sensitive field-name mapping, use an explicit field-name-to-field-index mapping based on case sensitivity.
* To check case sensitivity, get the case-sensitivity flag from ParquetReadSupport.
* Also, add test cases that check column selection for each case.
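A minimal sketch of that name-to-index mapping, assuming Spark's internal CaseInsensitiveMap utility (illustrative only, not the exact PR code):

```scala
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
import org.apache.spark.sql.types.StructType

// Sketch: resolve a Parquet field name to its index in the requested
// Catalyst StructType, honoring the case-sensitivity flag.
def fieldIndexByName(catalystType: StructType, caseSensitive: Boolean): Map[String, Int] = {
  val byExactName = catalystType.fieldNames.zipWithIndex.toMap
  if (caseSensitive) byExactName else CaseInsensitiveMap(byExactName)
}
```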
ok to test |
Thank you for making your first contribution, @kimtkyeom . |
@@ -804,6 +804,162 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
    }
  }

  test("SPARK-31116: Select simple parquet columns correctly in case insensitive manner") {
Could you move the new test cases into FileBasedDataSourceSuite and run them with ORC/Parquet/JSON at least?
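Something along these lines could serve as a starting point (a sketch only; it assumes the helpers that FileBasedDataSourceSuite already mixes in — withSQLConf, withTempPath, checkAnswer, testImplicits, and the usual org.apache.spark.sql.types._ and Row imports — and the ORC/JSON entries may need to be commented out per the discussion below):

```scala
// Sketch: the same case-insensitive column selection check, parameterized by format.
Seq("parquet", "orc", "json").foreach { format =>
  test(s"SPARK-31116: select columns in case insensitive manner - $format") {
    withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
      withTempPath { dir =>
        val path = dir.getCanonicalPath
        // Write with a camelCase column, then read back with a lower-cased schema.
        Seq("A").toDF("camelCase").write.format(format).save(path)
        val readSchema = new StructType().add("camelcase", StringType)
        checkAnswer(
          spark.read.format(format).schema(readSchema).load(path),
          Row("A"))
      }
    }
  }
}
```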
I'll test soon. However, could these new test cases apply to ORC and JSON as well?
Yes. Please start with all three and comment out the ones where ORC/JSON fails.
I tested the ORC and JSON file formats and there are some failures.
JSON test failure
JSON passed the case-sensitive cases, but failed the case-insensitive cases:
[info] - SPARK-31116: Select simple columns correctly in case insensitive manner *** FAILED *** (4 seconds, 277 milliseconds)
[info] Results do not match for query:
[info] Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]]
[info] Timezone Env:
[info]
[info] == Parsed Logical Plan ==
[info] Relation[camelcase#56] json
[info]
[info] == Analyzed Logical Plan ==
[info] camelcase: string
[info] Relation[camelcase#56] json
[info]
[info] == Optimized Logical Plan ==
[info] Relation[camelcase#56] json
[info]
[info] == Physical Plan ==
[info] FileScan json [camelcase#56] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex[file:/Users/kimtkyeom/Dev/spark_devel/target/tmp/spark-95f1357a-85c9-444f-bdcc-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<camelcase:string>
[info]
[info] == Results ==
[info]
[info] == Results ==
[info] !== Correct Answer - 1 == == Spark Answer - 1 ==
[info] !struct<> struct<camelcase:string>
[info] ![A] [null] (QueryTest.scala:248)
[info] - SPARK-31116: Select nested columns correctly in case insensitive manner *** FAILED *** (2 seconds, 117 milliseconds)
[info] Results do not match for query:
[info] Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]]
[info] Timezone Env:
[info]
[info] == Parsed Logical Plan ==
[info] Relation[StructColumn#147] json
[info]
[info] == Analyzed Logical Plan ==
[info] StructColumn: struct<LowerCase:bigint,camelcase:bigint>
[info] Relation[StructColumn#147] json
[info]
[info] == Optimized Logical Plan ==
[info] Relation[StructColumn#147] json
[info]
[info] == Physical Plan ==
[info] FileScan json [StructColumn#147] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex[file:/Users/kimtkyeom/Dev/spark_devel/target/tmp/spark-f9ecd1a4-e5aa-4dd7-bdfd-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<StructColumn:struct<LowerCase:bigint,camelcase:bigint>>
[info]
[info] == Results ==
[info]
[info] == Results ==
[info] !== Correct Answer - 1 == == Spark Answer - 1 ==
[info] !struct<> struct<StructColumn:struct<LowerCase:bigint,camelcase:bigint>>
[info] ![[0,1]] [[null,null]] (QueryTest.scala:248)
ORC test failure
ORC passed the case-insensitive test cases, but failed in the case-sensitive manner:
[info] - SPARK-31116: Select nested columns correctly in case sensitive manner *** FAILED *** (871 milliseconds)
[info] Results do not match for query:
[info] Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]]
[info] Timezone Env:
[info]
[info] == Parsed Logical Plan ==
[info] Relation[StructColumn#329] json
[info]
[info] == Analyzed Logical Plan ==
[info] StructColumn: struct<LowerCase:bigint,camelcase:bigint>
[info] Relation[StructColumn#329] json
[info]
[info] == Optimized Logical Plan ==
[info] Relation[StructColumn#329] json
[info]
[info] == Physical Plan ==
[info] FileScan json [StructColumn#329] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex[file:/Users/kimtkyeom/Dev/spark_devel/target/tmp/spark-612baf76-a9d0-41e5-89f4-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<StructColumn:struct<LowerCase:bigint,camelcase:bigint>>
[info]
[info] == Results ==
[info]
[info] == Results ==
[info] !== Correct Answer - 1 == == Spark Answer - 1 ==
[info] !struct<> struct<StructColumn:struct<LowerCase:bigint,camelcase:bigint>>
[info] ![null] [[null,null]] (QueryTest.scala:248)
But I think the ORC failure is due to differences in how rows are materialized. Is there a clean way to test this properly?
In addition, I noticed that JSON does not respect case sensitivity even in Spark 2.4.4. Below is a test on my local machine using spark-shell:
20/03/12 19:20:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/12 19:20:24 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at http://61.75.36.130:4041
Spark context available as 'sc' (master = local[*], app id = local-1584008425035).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.4
/_/
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val df = Seq("A").toDF("camelCase")
df: org.apache.spark.sql.DataFrame = [camelCase: string]
scala> df.write.format("json").save("./json_simple")
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val sch2 = new StructType().add("camelcase", StringType)
sch2: org.apache.spark.sql.types.StructType = StructType(StructField(camelcase,StringType,true))
scala> spark.read.format("json").schema(sch2).load("./json_simple").show()
+---------+
|camelcase|
+---------+
|     null|
+---------+
Thank you for checking. Could you file a JIRA for regressions only?
Could you update your PR?
OK, I updated my PR and created a JIRA issue. As the other file formats (ORC and JSON) also fail these test cases, I skipped checking those formats for now and just moved the current test cases into FileBasedDataSourceSuite. I think they can be added once the regressions are fixed.
Was the ORC case only when spark.sql.optimizer.nestedSchemaPruning.enabled is enabled?
I checked the test, but it produces the same result as above regardless of the nestedSchemaPruning option.
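For reference, a sketch of how that toggle can be pinned explicitly in a test (assuming the usual withSQLConf helper from SQLTestUtils):

```scala
// Sketch: run the same nested-column selection assertion with pruning on and off.
Seq("true", "false").foreach { pruningEnabled =>
  withSQLConf(SQLConf.NESTED_SCHEMA_PRUNING_ENABLED.key -> pruningEnabled) {
    // ... nested column selection check goes here ...
  }
}
```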
cc @cloud-fan , @gengliangwang and @yhuai |
cc @rxin as a release manager for 3.0.0. (Also, cc @gatorsmile ) |
Is this a regression in 3.0? If it is, do you know which commit/PR caused it? |
@cloud-fan After #22880, an IllegalArgumentException is thrown.
Is it possible to normalize column names before entering these low-level Parquet classes?
It may be possible in ParquetReadSupport, similar to clipping the Parquet requested schema, to then generate a normalized catalystRequestedSchema, I think. But I could not find a clean way to normalize it; in particular, normalization through Array types is quite complicated.
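To illustrate where the complexity comes from, a rough sketch of the recursion such a normalization would need (a hypothetical helper, not code proposed in this PR):

```scala
import java.util.Locale

import org.apache.spark.sql.types._

// Hypothetical: lower-case every struct field name, recursing through nested types.
def normalizeFieldNames(dt: DataType): DataType = dt match {
  case st: StructType =>
    StructType(st.fields.map { f =>
      f.copy(
        name = f.name.toLowerCase(Locale.ROOT),
        dataType = normalizeFieldNames(f.dataType))
    })
  case ArrayType(elementType, containsNull) =>
    ArrayType(normalizeFieldNames(elementType), containsNull)
  case MapType(keyType, valueType, valueContainsNull) =>
    MapType(normalizeFieldNames(keyType), normalizeFieldNames(valueType), valueContainsNull)
  case other => other
}
```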
This is related to spark.sql.optimizer.nestedSchemaPruning.enabled, which is enabled by default as of SPARK-29805, although some of the new configurations and fixes landed in 3.0, such as SPARK-26837, SPARK-27707, and SPARK-25407. I have seen a couple of such issues, such as SPARK-29721 and SPARK-31116, during this code freeze. Should we really enable it by default, @dbtsai?
@HyukjinKwon I haven't investigated 2.4.x deeply, but I have not seen this issue on 2.4.x.
* As there are regressions when schema pruning is enabled, keep the previous logic.
Thanks for confirmation, @kimtkyeom |
NESTED_SCHEMA_PRUNING option
Retest this please. |
val caseSensitive = conf.getBoolean(SQLConf.CASE_SENSITIVE.key,
  SQLConf.CASE_SENSITIVE.defaultValue.get)
I'm not sure why you need to pass caseSensitive across ParquetRecordMaterializer and ParquetRowConverter. Can't we just get it at ParquetRowConverter?
Can I get the runtime config at ParquetRowConverter? I don't fully understand its behavior.
SQLConf.get works, even on the executor side; see dd37529.
Thanks! I'll update it to use SQLConf instead of passing the argument across classes.
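For example, something along these lines inside the converter, instead of threading the flag through ParquetRecordMaterializer (a sketch only; the val name is illustrative):

```scala
import org.apache.spark.sql.internal.SQLConf

// Sketch: read the session conf where it is needed; SQLConf.get also works
// on the executor side, as noted above.
private val caseSensitive: Boolean = SQLConf.get.caseSensitiveAnalysis
```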
It should be good to merge after addressing @dongjoon-hyun's comment about the test case.
* MISC: Change the exception in `ParquetRowConverter.fieldConverters` to RuntimeException
BTW, @kimtkyeom, while reviewing this PR again thoroughly, the original failure report on ORC looks wrong. You wrote it like the following, but it was
|
@dongjoon-hyun Ah, sorry, I mis-pasted the test result. ORC also shows the same result as the following, whatever the value of
|
That's a correct behavior in |
@dongjoon-hyun Ah, got it. There is no failure except the above case. (BTW, I think the materialization of rows with nested columns, where a requested column does not match, should be consistent regardless of file format, but that is out of this PR's scope.)
SQLConf.get instead. * Also, a non-existing column among the Parquet requested columns is filled with null, so do not call getOrElse; apply directly instead.
Test build #119846 has finished for PR 27888 at commit
|
LGTM, good catch!
Test build #119845 has finished for PR 27888 at commit
|
Test build #119852 has finished for PR 27888 at commit
|
+1, LGTM. Thank you, @kimtkyeom and all.
Test build #119863 has finished for PR 27888 at commit
|
Test build #119865 has finished for PR 27888 at commit
|
…nverter

### What changes were proposed in this pull request?

This PR (SPARK-31116) add caseSensitive parameter to ParquetRowConverter so that it handle materialize parquet properly with respect to case sensitivity

### Why are the changes needed?

From spark 3.0.0, below statement throws IllegalArgumentException in caseInsensitive mode because of explicit field index searching in ParquetRowConverter. As we already constructed parquet requested schema and catalyst requested schema during schema clipping in ParquetReadSupport, just follow these behavior.

```scala
val path = "/some/temp/path"
spark
  .range(1L)
  .selectExpr("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS StructColumn")
  .write.parquet(path)

val caseInsensitiveSchema = new StructType()
  .add(
    "StructColumn",
    new StructType()
      .add("LowerCase", LongType)
      .add("camelcase", LongType))

spark.read.schema(caseInsensitiveSchema).parquet(path).show()
```

### Does this PR introduce any user-facing change?

No. The changes are only in unreleased branches (`master` and `branch-3.0`).

### How was this patch tested?

Passed new test cases that check parquet column selection with respect to schemas and case sensitivities

Closes #27888 from kimtkyeom/parquet_row_converter_case_sensitivity.

Authored-by: Tae-kyeom, Kim <kimtkyeom@devsisters.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit e736c62)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Merged to master/3.0. |
Thank you so much for your first contribution, @kimtkyeom . I added you |
Thanks for all reviewers with all generous review & comments! :) |
+1, LGTM. |
What changes were proposed in this pull request?
This PR (SPARK-31116) adds a caseSensitive parameter to ParquetRowConverter so that it materializes Parquet data properly with respect to case sensitivity.
Why are the changes needed?
From Spark 3.0.0, the statement below throws an IllegalArgumentException in case-insensitive mode because of the explicit field index search in ParquetRowConverter. As the Parquet requested schema and the Catalyst requested schema are already constructed during schema clipping in ParquetReadSupport, we just follow that behavior.
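The statement referred to above, reproduced from the merged commit message:

```scala
val path = "/some/temp/path"
spark
  .range(1L)
  .selectExpr("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS StructColumn")
  .write.parquet(path)

val caseInsensitiveSchema = new StructType()
  .add(
    "StructColumn",
    new StructType()
      .add("LowerCase", LongType)
      .add("camelcase", LongType))

spark.read.schema(caseInsensitiveSchema).parquet(path).show()
```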
Does this PR introduce any user-facing change?
No. The changes are only in unreleased branches (`master` and `branch-3.0`).
How was this patch tested?
Passed new test cases that check Parquet column selection with respect to schemas and case sensitivity.