Conversation

@cloud-fan (Contributor)

What changes were proposed in this pull request?

In Spark 2.1, we hit a correctness bug: when reading a Hive serde Parquet table with the native Parquet data source, if the actual file schema doesn't match the table schema in the Hive metastore (only an upper/lower case difference), the query returns 0 results.

The reason is that the Parquet reader is case-sensitive: if we push down filters with column names that don't match the file's physical schema case-sensitively, no data is returned.
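
A minimal sketch of the failure scenario (the table name, path, and data are illustrative, and a Hive-enabled SparkSession in Spark 2.1 is assumed):

```scala
import spark.implicits._

// Write a Parquet file whose physical schema uses upper-case column names.
Seq((1, "a"), (2, "b")).toDF("ID", "NAME").write.parquet("/tmp/case_test")

// Register a Hive serde table whose metastore schema is all lower case.
spark.sql("""
  CREATE EXTERNAL TABLE t (id INT, name STRING)
  STORED AS PARQUET LOCATION '/tmp/case_test'
""")

// In Spark 2.1, the pushed-down filter referenced 'id', which did not match
// the physical column 'ID' case-sensitively, so this returned 0 rows.
spark.sql("SELECT * FROM t WHERE id > 0").show()
```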

To fix this bug, two solutions were proposed at the time:

  1. Add a config to optionally disable Parquet filter pushdown, and make Parquet column pruning case-insensitive.
    [SPARK-19455][SQL] Add option for case-insensitive Parquet field resolution #16797

  2. Infer the actual schema from the data files when reading a Hive serde table with the native data source, with a config to disable it.
    [SPARK-19611][SQL] Introduce configurable table schema inference #17229

Solution 2 was accepted and merged into Spark 2.1.1.

In Spark 2.4, we refactored the Parquet data source a little:

  1. Do Parquet filter pushdown with the actual file schema.
    [SPARK-24716][SQL] Refactor ParquetFilters #21696

  2. Make Parquet filter pushdown case-insensitive (see the sketch after this list).
    [SPARK-25207][SQL] Case-insensitve field resolution for filter pushdown when reading Parquet #22197

  3. Make Parquet column pruning case-insensitive.
    [SPARK-25132][SQL] Case-insensitive field resolution when reading from Parquet #22148
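
A hedged sketch of what case-insensitive field resolution looks like (illustrative only, not the actual ParquetFilters code; the function name and shape are hypothetical): match the requested column against the physical field names ignoring case, and skip pushdown when the match is missing or ambiguous, so correctness is preserved.

```scala
// Resolve a requested column name against the file's physical schema,
// ignoring case.
def resolveFieldCaseInsensitively(
    physicalFields: Seq[String],
    wanted: String): Option[String] = {
  physicalFields.filter(_.equalsIgnoreCase(wanted)) match {
    case Seq(single) => Some(single) // unique match: safe to push down
    case _           => None         // missing or ambiguous: skip pushdown
  }
}

// resolveFieldCaseInsensitively(Seq("ID", "NAME"), "id") == Some("ID")
// resolveFieldCaseInsensitively(Seq("id", "ID"), "id")   == None
```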

With these patches, the correctness bug from Spark 2.1 no longer exists, and schema inference becomes unnecessary.

To be safe, this PR only changes the default value to NEVER_INFER, so that users can still set it back to INFER_AND_SAVE if needed. If we don't receive any bug reports for it, we can remove the related code in the next release.
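
For users who do need the old behavior, the config can be set back when building the session; a minimal sketch (the config key and its values INFER_AND_SAVE, INFER_ONLY, and NEVER_INFER come from HiveUtils):

```scala
import org.apache.spark.sql.SparkSession

// Opt back into schema inference after this change.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
  .getOrCreate()
```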

How was this patch tested?

Existing tests.

@cloud-fan (Author) commented Mar 9, 2019


SparkQA commented Mar 9, 2019

Test build #103264 has finished for PR 24041 at commit 3146102.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking (Member) left a comment:

+1 for me. Thanks for the nice PR description, very clear.

```diff
 .transform(_.toUpperCase(Locale.ROOT))
 .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
-.createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
+.createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
```
(Member) commented on the diff:

https://github.com/apache/spark/pull/24041/files#diff-9a6b543db706f1a90f790783d6930a13R592 Nit for the doc: INFER_AND_SAVE is no longer the default. Should we mention this in the migration guide?

(Member) commented:

+1 for the comments: 1) update the config description at lines 592~593, and 2) sql-migration-guide-upgrade.md.

```diff
 - Since Spark 3.0, Proleptic Gregorian calendar is used in parsing, formatting, and converting dates and timestamps as well as in extracting sub-components like years, days and etc. Spark 3.0 uses Java 8 API classes from the java.time packages that based on ISO chronology (https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html). In Spark version 2.4 and earlier, those operations are performed by using the hybrid calendar (Julian + Gregorian, see https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html). The changes impact on the results for dates before October 15, 1582 (Gregorian) and affect on the following Spark 3.0 API:
-  - CSV/JSON datasources use java.time API for parsing and generating CSV/JSON content. In Spark version 2.4 and earlier, java.text.SimpleDateFormat is used for the same purpose with fallbacks to the parsing mechanisms of Spark 2.0 and 1.x. For example, `2018-12-08 10:39:21.123` with the pattern `yyyy-MM-dd'T'HH:mm:ss.SSS` cannot be parsed since Spark 3.0 because the timestamp does not match to the pattern but it can be parsed by earlier Spark versions due to a fallback to `Timestamp.valueOf`. To parse the same timestamp since Spark 3.0, the pattern should be `yyyy-MM-dd HH:mm:ss.SSS`.
+- CSV/JSON datasources use java.time API for parsing and generating CSV/JSON content. In Spark version 2.4 and earlier, java.text.SimpleDateFormat is used for the same purpose with fallbacks to the parsing mechanisms of Spark 2.0 and 1.x. For example, `2018-12-08 10:39:21.123` with the pattern `yyyy-MM-dd'T'HH:mm:ss.SSS` cannot be parsed since Spark 3.0 because the timestamp does not match to the pattern but it can be parsed by earlier Spark versions due to a fallback to `Timestamp.valueOf`. To parse the same timestamp since Spark 3.0, the pattern should be `yyyy-MM-dd HH:mm:ss.SSS`.
```
@cloud-fan (Author) commented:

Just fixing the indentation.

(Member) commented:

@cloud-fan, I think this indentation was intentional, to keep those notes grouped under the Proleptic Gregorian calendar changes.

@cloud-fan (Author) commented:

Damn, my editor didn't detect this... let me revert it.


SparkQA commented Mar 11, 2019

Test build #103296 has finished for PR 24041 at commit 3ee775d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

wangyum (Member) commented Mar 11, 2019

retest this please


SparkQA commented Mar 11, 2019

Test build #103302 has finished for PR 24041 at commit 3ee775d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member) commented:

Yea, very nice PR description.


SparkQA commented Mar 11, 2019

Test build #103321 has finished for PR 24041 at commit 3dd7553.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) left a comment:

LGTM

Thanks! Merged to master.
