Conversation

@cloud-fan (Contributor)

What changes were proposed in this pull request?

In Spark 2.1, we hit a correctness bug: when reading a Hive serde Parquet table with the native Parquet data source, if the actual file schema doesn't match the table schema in the Hive metastore (only an upper/lower case difference), the query returns 0 results.

The reason is that the Parquet reader is case-sensitive: if we push down filters with column names that don't match the file's physical schema case-sensitively, no data is returned.
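
A minimal sketch of the failure scenario (the table name, path, and data are illustrative, and a Hive-enabled SparkSession in Spark 2.1 is assumed):

```scala
import spark.implicits._

// Write a Parquet file whose physical schema uses upper-case column names.
Seq((1, "a"), (2, "b")).toDF("ID", "NAME").write.parquet("/tmp/case_test")

// Register a Hive serde table whose metastore schema is all lower case.
spark.sql("""
  CREATE EXTERNAL TABLE t (id INT, name STRING)
  STORED AS PARQUET LOCATION '/tmp/case_test'
""")

// In Spark 2.1, the pushed-down filter referenced 'id', which did not match
// the physical column 'ID' case-sensitively, so this returned 0 rows.
spark.sql("SELECT * FROM t WHERE id > 0").show()
```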

To fix this bug, two solutions were proposed at the time:

  1. Add a config to optionally disable Parquet filter pushdown, and make Parquet column pruning case-insensitive.
    [SPARK-19455][SQL] Add option for case-insensitive Parquet field resolution #16797

  2. Infer the actual schema from the data files when reading a Hive serde table with the native data source, with a config to disable it.
    [SPARK-19611][SQL] Introduce configurable table schema inference #17229

Solution 2 was accepted and merged into Spark 2.1.1.

In Spark 2.4, we refactored the Parquet data source a little:

  1. Do Parquet filter pushdown with the actual file schema.
    [SPARK-24716][SQL] Refactor ParquetFilters #21696

  2. Make Parquet filter pushdown case-insensitive (see the sketch after this list).
    [SPARK-25207][SQL] Case-insensitve field resolution for filter pushdown when reading Parquet #22197

  3. Make Parquet column pruning case-insensitive.
    [SPARK-25132][SQL] Case-insensitive field resolution when reading from Parquet #22148
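
A hedged sketch of what case-insensitive field resolution looks like (illustrative only, not the actual ParquetFilters code; the function name and shape are hypothetical): match the requested column against the physical field names ignoring case, and skip pushdown when the match is missing or ambiguous, so correctness is preserved.

```scala
// Resolve a requested column name against the file's physical schema,
// ignoring case.
def resolveFieldCaseInsensitively(
    physicalFields: Seq[String],
    wanted: String): Option[String] = {
  physicalFields.filter(_.equalsIgnoreCase(wanted)) match {
    case Seq(single) => Some(single) // unique match: safe to push down
    case _           => None         // missing or ambiguous: skip pushdown
  }
}

// resolveFieldCaseInsensitively(Seq("ID", "NAME"), "id") == Some("ID")
// resolveFieldCaseInsensitively(Seq("id", "ID"), "id")   == None
```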

With these patches, the correctness bug from Spark 2.1 no longer exists, and schema inference becomes unnecessary.

To be safe, this PR only changes the default value to NEVER_INFER, so that users can still set it back to INFER_AND_SAVE if needed. If we don't receive any bug reports for it, we can remove the related code in the next release.
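
For users who do need the old behavior, the config can be set back when building the session; a minimal sketch (the config key and its values INFER_AND_SAVE, INFER_ONLY, and NEVER_INFER come from HiveUtils):

```scala
import org.apache.spark.sql.SparkSession

// Opt back into schema inference after this change.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
  .getOrCreate()
```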

How was this patch tested?

Existing tests.

@cloud-fan (Author) commented Mar 9, 2019


SparkQA commented Mar 9, 2019

Test build #103264 has finished for PR 24041 at commit 3146102.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking (Member) left a comment:

+1 for me. Thanks for the nice PR description, very clear.

```diff
 .transform(_.toUpperCase(Locale.ROOT))
 .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
-.createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
+.createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
```
(Member) commented on the diff:

https://github.com/apache/spark/pull/24041/files#diff-9a6b543db706f1a90f790783d6930a13R592 Nit for the doc: INFER_AND_SAVE is no longer the default. Should we mention this in the migration guide?

(Member) commented:

+1 for the comments: 1) update the config description at lines 592~593, and 2) sql-migration-guide-upgrade.md.

```diff
 - Since Spark 3.0, Proleptic Gregorian calendar is used in parsing, formatting, and converting dates and timestamps as well as in extracting sub-components like years, days and etc. Spark 3.0 uses Java 8 API classes from the java.time packages that based on ISO chronology (https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html). In Spark version 2.4 and earlier, those operations are performed by using the hybrid calendar (Julian + Gregorian, see https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html). The changes impact on the results for dates before October 15, 1582 (Gregorian) and affect on the following Spark 3.0 API:
-  - CSV/JSON datasources use java.time API for parsing and generating CSV/JSON content. In Spark version 2.4 and earlier, java.text.SimpleDateFormat is used for the same purpose with fallbacks to the parsing mechanisms of Spark 2.0 and 1.x. For example, `2018-12-08 10:39:21.123` with the pattern `yyyy-MM-dd'T'HH:mm:ss.SSS` cannot be parsed since Spark 3.0 because the timestamp does not match to the pattern but it can be parsed by earlier Spark versions due to a fallback to `Timestamp.valueOf`. To parse the same timestamp since Spark 3.0, the pattern should be `yyyy-MM-dd HH:mm:ss.SSS`.
+- CSV/JSON datasources use java.time API for parsing and generating CSV/JSON content. In Spark version 2.4 and earlier, java.text.SimpleDateFormat is used for the same purpose with fallbacks to the parsing mechanisms of Spark 2.0 and 1.x. For example, `2018-12-08 10:39:21.123` with the pattern `yyyy-MM-dd'T'HH:mm:ss.SSS` cannot be parsed since Spark 3.0 because the timestamp does not match to the pattern but it can be parsed by earlier Spark versions due to a fallback to `Timestamp.valueOf`. To parse the same timestamp since Spark 3.0, the pattern should be `yyyy-MM-dd HH:mm:ss.SSS`.
```
@cloud-fan (Author) commented:

Just fixing the indentation.

(Member) commented:

@cloud-fan, I think this indentation was intentional, to keep those notes grouped under the Proleptic Gregorian calendar changes.

@cloud-fan (Author) commented:

Damn, my editor didn't detect this... let me revert it.


SparkQA commented Mar 11, 2019

Test build #103296 has finished for PR 24041 at commit 3ee775d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

wangyum (Member) commented Mar 11, 2019

retest this please


SparkQA commented Mar 11, 2019

Test build #103302 has finished for PR 24041 at commit 3ee775d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member) commented:

Yea, very nice PR description.


SparkQA commented Mar 11, 2019

Test build #103321 has finished for PR 24041 at commit 3dd7553.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) left a comment:

LGTM

Thanks! Merged to master.
